Compare commits

...

42 Commits

Author SHA1 Message Date
902703d145 Fix: skip tag query if tag kbs are invalid (#10168)
### What problem does this PR solve?

Skip `tag_query` step if `tag_kbs` are empty. 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-19 19:12:18 +08:00
7ccca2143c perf: add get_all_kb_doc_count func to simplify kb.doc_num updating (#10169)
### What problem does this PR solve?

Add get_all_kb_doc_count func to simplify kb.doc_num updating.

### Type of change

- [x] Performance Improvement
2025-09-19 19:11:50 +08:00
70ce02faf4 Feat: add support for Anthropic third-party API (#10173)
### What problem does this PR solve?
issue:
[Bug]: anthropic model have not baseurl selecting,need add #8546
change:
This PR adds support for using Anthropic models through a third-party
API by allowing a custom base_url.
It ensures compatibility with both the official Anthropic endpoint and
external providers.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
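For context, a minimal sketch of how a custom base URL can be passed to the official `anthropic` Python SDK; the endpoint URL and key are placeholders, and RAGFlow's own configuration field name may differ:

```python
# Minimal sketch: pointing the Anthropic SDK at a third-party, Anthropic-compatible endpoint.
# The base_url below is a placeholder; it defaults to https://api.anthropic.com when omitted.
import anthropic

client = anthropic.Anthropic(
    api_key="sk-...",                        # key issued by the third-party provider
    base_url="https://example-provider.com"  # placeholder third-party endpoint
)

message = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello"}],
)
print(message.content[0].text)
```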
2025-09-19 19:06:14 +08:00
3f1741c8c6 Docs: How to accelerate question answering (#10179)
### What problem does this PR solve?


### Type of change

- [x] Documentation Update
2025-09-19 18:18:46 +08:00
6c24ad7966 fix: correct rerank_model condition logic (#10174)
### What problem does this PR solve?

fix the rerank_model condition logic by correcting the np.isclose check.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-19 16:02:10 +08:00
4846589599 Docs: Input and output variables defined in the Input and Output sections must also be implemented in your code. (#10162)
### What problem does this PR solve?
 
#10089 

### Type of change

- [x] Documentation Update
2025-09-19 11:35:58 +08:00
a24547aa66 Support server health check by http://localhost:<port>/v1/system/healthz (#10150)
### What problem does this PR solve?

Support server health check. Solves issue #10106.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
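As a quick check, something like the following should work against a running server; the host and port are assumptions (9380 is assumed here as the API port), while the status keys and the 200/500 behavior come from the health-check diff later in this compare view:

```python
# Minimal sketch: probe the new health endpoint; adjust host/port to your deployment.
import requests

resp = requests.get("http://localhost:9380/v1/system/healthz", timeout=5)
print(resp.status_code)           # 200 if db/redis/doc_engine/storage are all ok, 500 otherwise
print(resp.json().get("status"))  # "ok" or "nok"
```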
2025-09-19 11:11:07 +08:00
a04c5247ab Feat: Add file convert to document API just like file2document_app.py (#10158)
### What problem does this PR solve?

Add a file-to-document conversion API, similar to file2document_app.py.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
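A hedged usage sketch of the new endpoint; the path prefix, port, and Bearer auth header are assumptions based on RAGFlow's other token-protected APIs, while the request body fields (`kb_ids`, `file_ids`) come from the diff later in this compare view:

```python
# Minimal sketch: convert previously uploaded files into documents of one or more knowledge bases.
# URL prefix, port, and Bearer auth are assumptions; kb_ids/file_ids are the fields read by the endpoint.
import requests

resp = requests.post(
    "http://localhost:9380/v1/file/convert",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "kb_ids": ["<knowledge_base_id>"],
        "file_ids": ["<file_id>"],
    },
    timeout=30,
)
print(resp.json())  # the created file-to-document mappings on success
```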
2025-09-19 09:59:54 +08:00
ed6a76dcc0 Add Firecrawl integration for RAGFlow (#10152)
## 🚀 Firecrawl Integration for RAGFlow

This PR implements the Firecrawl integration for RAGFlow as requested in
issue https://github.com/firecrawl/firecrawl/issues/2167

### Features Implemented

- **Data Source Integration**: Firecrawl appears as a selectable data
source in RAGFlow
- **Configuration Management**: Users can input Firecrawl API keys
through RAGFlow's interface
- **Web Scraping**: Supports single URL scraping, website crawling, and
batch processing
- **Content Processing**: Converts scraped content to RAGFlow's document
format with chunking
- **Error Handling**: Comprehensive error handling for rate limits,
failed requests, and malformed content
- **UI Components**: Complete UI schema and workflow components for
RAGFlow integration

### 📁 Files Added

- `intergrations/firecrawl/` - Complete integration package
- `intergrations/firecrawl/integration.py` - RAGFlow integration entry
point
- `intergrations/firecrawl/firecrawl_connector.py` - API communication
- `intergrations/firecrawl/firecrawl_config.py` - Configuration
management
- `intergrations/firecrawl/firecrawl_processor.py` - Content processing
- `intergrations/firecrawl/firecrawl_ui.py` - UI components
- `intergrations/firecrawl/ragflow_integration.py` - Main integration
class
- `intergrations/firecrawl/README.md` - Complete documentation
- `intergrations/firecrawl/example_usage.py` - Usage examples

### 🧪 Testing

The integration has been thoroughly tested with:
- Configuration validation
- Connection testing
- Content processing and chunking
- UI component rendering
- Error handling scenarios

### 📋 Acceptance Criteria Met

- Integration appears as a selectable data source in RAGFlow's data source options
- Users can input Firecrawl API keys through RAGFlow's configuration interface
- Successfully scrapes content from provided URLs and imports it into RAGFlow's document store
- Handles common edge cases (rate limits, failed requests, malformed content)
- Includes basic documentation and README updates
- Code follows RAGFlow's existing patterns and coding standards

### Related Issue

https://github.com/firecrawl/firecrawl/issues/2167

---------

Co-authored-by: AB <aj@Ajays-MacBook-Air.local>
2025-09-19 09:58:17 +08:00
a0ccbec8bd Fix: knowledge base's embedded model form layout and dependency imports in the main branch. #9869 (#10160)
### What problem does this PR solve?

Fixed the knowledge base's embedding model form layout and dependency imports in the main branch.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-19 09:57:21 +08:00
4693c5382a Feat: migrate OpenAI-compatible chats to LiteLLM (#10148)
### What problem does this PR solve?

Migrate OpenAI-compatible chats to LiteLLM.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
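For reference, a minimal sketch of how an OpenAI-compatible endpoint is typically called through LiteLLM; the model name, base URL, and key are placeholders, not RAGFlow's exact wrapper code:

```python
# Minimal sketch: calling an OpenAI-compatible endpoint via LiteLLM.
# The "openai/" prefix tells LiteLLM to use the OpenAI-compatible driver against api_base.
import litellm

response = litellm.completion(
    model="openai/my-deployed-model",    # placeholder model name
    api_base="https://example.com/v1",   # placeholder OpenAI-compatible base URL
    api_key="sk-...",                    # placeholder key
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```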
2025-09-18 17:16:59 +08:00
ff3b4d0dcd Fix: Merge different types of models from the same manufacturer #10146 (#10157)
### What problem does this PR solve?

Fix: Merge different types of models from the same manufacturer #10146

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-18 17:15:54 +08:00
62d35b1b73 Fix: handle zero (#10149)
### What problem does this PR solve?

Handle zero and NaN values in calculations.
#10125

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-18 16:28:03 +08:00
91b609447d Fix: embedding model failure in CometAPI (#10137)
### What problem does this PR solve?

Related PR:
Feat: add CometAPI to LLMFactory and update related mappings #10119 

Change:
Fixes the issue where the embedding model in CometAPI was not being
called correctly

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: TensorNull <tensor.null@gmail.com>
2025-09-18 14:49:47 +08:00
c353840244 Feat: add support for KB document basic info (#10134)
### What problem does this PR solve?

Add support for KB document basic info

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
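The diff in this compare view adds a `GET /basic_info` route that returns per-status document counts for a knowledge base. A hedged usage sketch — host, port, full path prefix, and session auth are assumptions (the route is `@login_required`, so it expects a logged-in web session rather than an API token), while the response keys come from the diff:

```python
# Minimal sketch: fetch per-status document counts for a knowledge base.
# Path prefix and session cookie are assumptions; the count keys come from the diff.
import requests

resp = requests.get(
    "http://localhost:9380/v1/document/basic_info",
    params={"kb_id": "<knowledge_base_id>"},
    cookies={"session": "<web-session-cookie>"},
    timeout=10,
)
print(resp.json())  # wraps counts like {"processing": ..., "finished": ..., "failed": ..., "cancelled": ...}
```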
2025-09-18 09:52:33 +08:00
f12b9fdcd4 Feat: add CometAPI to LLMFactory and update related mappings (#10119)
### Related issues
#10078

### What problem does this PR solve?
Integrate CometAPI provider.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
2025-09-18 09:51:29 +08:00
80ede65bbe Docs: Updated database types supported by the Execute SQL tool (#10113)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2025-09-18 09:47:35 +08:00
52cf186028 Correct the text of vectorSimilarityWeight in zh.ts (#10128)
### What problem does this PR solve?

The original Chinese text for vectorSimilarityWeight was "相似度相似度权重" ("similarity similarity weight"), which is obviously a malformed phrase. It has now been changed to "向量相似度权重" ("vector similarity weight"), which also aligns it with the English version 'Vector similarity weight'.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-18 09:46:54 +08:00
ea0f1d47a5 Support image recognition for url links in Markdown file, fix log error in code_exec (#10139)
### What problem does this PR solve?

Support image recognition for image URL links in Markdown files (solves issue #8755).
Fix the log message error in code_exec (solves issue #10064).

### Type of change (8755)

- [x] New Feature (non-breaking change which adds functionality)

### Type of change (10064)

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-18 09:44:17 +08:00
9fe7c92217 Build(deps): Bump axios from 1.9.0 to 1.12.0 in /sandbox/sandbox_base_image/nodejs (#10091)
Bumps [axios](https://github.com/axios/axios) from 1.9.0 to 1.12.0.
<details>
<summary>Release notes</summary>

Sourced from [axios's releases](https://github.com/axios/axios/releases).

**v1.12.0**

Bug fixes:
- adding build artifacts (9ec86de)
- dont add dist on release (a2edc36)
- **fetch-adapter:** set correct Content-Type for Node FormData ([#6998](https://redirect.github.com/axios/axios/issues/6998)) (a9f47af)
- **node:** enforce maxContentLength for data: URLs ([#7011](https://redirect.github.com/axios/axios/issues/7011)) (945435f)
- package exports ([#5627](https://redirect.github.com/axios/axios/issues/5627)) (aa78ac2)
- **params:** removing '[' and ']' from URL encode exclude characters ([#3316](https://redirect.github.com/axios/axios/issues/3316)) ([#5715](https://redirect.github.com/axios/axios/issues/5715)) (6d84189)
- release pr run (fd7f404)
- **types:** change the type guard on isCancel ([#5595](https://redirect.github.com/axios/axios/issues/5595)) (0dbb7fd)

Features:
- **adapter:** surface low-level network error details; attach original error via cause ([#6982](https://redirect.github.com/axios/axios/issues/6982)) (78b290c)
- **fetch:** add fetch, Request, Response env config variables for the adapter ([#7003](https://redirect.github.com/axios/axios/issues/7003)) (c959ff2)
- support reviver on JSON.parse ([#5926](https://redirect.github.com/axios/axios/issues/5926)) (2a97634), closes [#5924](https://redirect.github.com/axios/axios/issues/5924)
- **types:** extend AxiosResponse interface to include custom headers type ([#6782](https://redirect.github.com/axios/axios/issues/6782)) (7960d34)

Contributors to this release: Willian Agostini, Dmitriy Mozgovoy, khani, Ameer Assadi, Emiedonmokumo Dick-Boro, Zeroday BYTE, Jason Saayman, 최예찬, Gligor Kotushevski, Aleksandar Dimitrov.

**v1.11.0**

Bug fixes:
- form-data npm package ([#6970](https://redirect.github.com/axios/axios/issues/6970)) (e72c193)
- prevent RangeError when using large Buffers ([#6961](https://redirect.github.com/axios/axios/issues/6961)) (a2214ca)
- **types:** resolve type discrepancies between ESM and CJS TypeScript declaration files ([#6956](https://redirect.github.com/axios/axios/issues/6956)) (8517aa1)

Contributors to this release: izzy goldman, Manish Sahani, Noritaka Kobayashi, James Nail, Tejaswi1305.

... (truncated)
</details>
<details>
<summary>Commits</summary>

- `0d8ad6e` chore(release): v1.12.0 ([#7013](https://redirect.github.com/axios/axios/issues/7013))
- `fd7f404` fix: release pr run
- `a2edc36` fix: dont add dist on release
- `9ec86de` fix: adding build artifacts
- `945435f` fix(node): enforce maxContentLength for data: URLs ([#7011](https://redirect.github.com/axios/axios/issues/7011))
- `28e5e30` chore(sponsor): update sponsor block ([#7005](https://redirect.github.com/axios/axios/issues/7005))
- `d03f245` chore(CI): fixed release info script to use npm registry instead of git as fi...
- `a0bc911` chore: removing dist files from src ([#7002](https://redirect.github.com/axios/axios/issues/7002))
- `c959ff2` feat(fetch): add fetch, Request, Response env config variables for the adapte...
- `a9f47af` fix(fetch-adapter): set correct Content-Type for Node FormData ([#6998](https://redirect.github.com/axios/axios/issues/6998))
- Additional commits viewable in the [compare view](https://github.com/axios/axios/compare/v1.9.0...v1.12.0)

</details>


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=axios&package-manager=npm_and_yarn&previous-version=1.9.0&new-version=1.12.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-18 09:41:24 +08:00
d353f7f7f8 Feat/parse audio (#10133)
### What problem does this PR solve?

Dataflow supports audio. Also fixes GiteeAI's sequence2text model.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-09-18 09:31:32 +08:00
f3738b06f1 Fixes session_id passing in agent_openai completion. (#10124)
### What problem does this PR solve?

An exception occurs if you pass session_id to the agent OpenAI-compatible completion endpoint, because session_id is passed explicitly as well as via **req, so it is sent twice. The logic for picking one of session_id, id, and metadata.id also seemed odd, so it has been cleaned up a little.

See #10111 

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)
2025-09-17 17:54:06 +08:00
5a8bc88147 Docs: Removed /v1 from Ollama base URLs (#10067)
### What problem does this PR solve?


### Type of change

- [x] Documentation Update
2025-09-17 13:48:29 +08:00
04ef5b2783 Fix: usage of postgresql -> postgres for db_type (#10120)
### What problem does this PR solve?

This PR fixes incorrect naming for PostgreSQL usage by replacing all
instances of `postgresql` with the correct `postgres` in the `db_type`
field. This resolves potential configuration errors and ensures
consistency when specifying the database type.

Also fixed handling of None for `get_queue_length`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: cucusenok <BP-116: updated readme.md>
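In practice this means an Execute SQL component configuration should now use `postgres`. A hedged example; the values are placeholders and the field names mirror the ExeSQL parameters shown in the diff later in this compare view:

```python
# Minimal sketch: ExeSQL-style connection parameters after the rename.
# "postgresql" is no longer accepted by the db_type check; use "postgres" instead.
exesql_params = {
    "db_type": "postgres",   # was "postgresql" before this fix
    "database": "ragflow",
    "username": "ragflow_user",
    "password": "secret",
    "host": "127.0.0.1",
    "port": 5432,
}
```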
2025-09-17 10:30:45 +08:00
c9ea22ef69 Fix: set default chunk_token_num in html_parser (#10118)
### What problem does this PR solve?

issue:
[Bug]: Agent component (HTTP Request) "'>' not supported between
instances of 'int' and 'NoneType'"
[#10096](https://github.com/infiniflow/ragflow/issues/10096)

Change:
When the Invoke class instantiates HtmlParser without providing the
chunk_token_num parameter, the value defaults to None, leading to a
comparison error with block_token_count.

This change sets the default chunk_token_num to 512 to prevent such
errors.
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: BadwomanCraZY <511528396@qq.com>
2025-09-17 09:36:31 +08:00
152111fd9d Feat/parse img (#10112)
### What problem does this PR solve?

Support parsing images via OCR or a VLM.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-16 17:53:37 +08:00
86f6da2f74 Feat: add support for the Ascend table structure recognizer (#10110)
### What problem does this PR solve?

Add support for the Ascend table structure recognizer.

Use the environment variable `TABLE_STRUCTURE_RECOGNIZER_TYPE=ascend` to
enable the Ascend table structure recognizer.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-16 13:57:06 +08:00
8c00cbc87a Fix(agent template): wrap template variables in curly braces (#10109)
### What problem does this PR solve?

Updated SQL assistant template to wrap variables like 'sys.query' and
'Agent:WickedGoatsDivide@content' in curly braces for better template
variable syntax consistency.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-16 13:56:56 +08:00
41e808f4e6 Docs: Added an Execute SQL tool reference (#10108)
### What problem does this PR solve?


### Type of change


- [x] Documentation Update
2025-09-16 11:39:56 +08:00
bc0281040b Feat: add support for the Ascend layout recognizer (#10105)
### What problem does this PR solve?

Supports Ascend layout recognizer.

Use the environment variable `LAYOUT_RECOGNIZER_TYPE=ascend` to enable
the Ascend layout recognizer, and `ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID=n`
(for example, n=0) to specify the Ascend device ID.

Ensure that you have installed the [ais
tools](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench)
properly.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
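A minimal sketch of the selection logic, mirroring the RAGFlowPdfParser diff in this compare view; the environment variables are the ones named above, and the printed messages are illustrative only:

```python
# Minimal sketch: how the layout recognizer backend is selected from the environment.
# Set LAYOUT_RECOGNIZER_TYPE=ascend (and optionally ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID=0)
# before starting the server to use the Ascend backend; anything else falls back to ONNX.
import os

layout_recognizer_type = os.getenv("LAYOUT_RECOGNIZER_TYPE", "onnx").lower()
if layout_recognizer_type not in ("onnx", "ascend"):
    raise RuntimeError("Unsupported layout recognizer type.")

if layout_recognizer_type == "ascend":
    device_id = int(os.getenv("ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID", "0"))
    print(f"Using Ascend LayoutRecognizer on device {device_id}")
else:
    print("Using ONNX LayoutRecognizer")
```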
2025-09-16 09:51:15 +08:00
341a7b1473 Fix: judge not empty before delete (#10099)
### What problem does this PR solve?

Check that the session is not empty before deleting it.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-15 17:49:52 +08:00
c29c395390 Fix: The same model appears twice in the drop-down box. #10102 (#10103)
### What problem does this PR solve?

Fix: The same model appears twice in the drop-down box. #10102

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-15 16:38:08 +08:00
a23a0f230c feat: add multiple docker tags (latest, latest_full, latest_slim) to … (#10040)
…release workflow (#10039)  
This change updates the GitHub Actions workflow to push additional
stable tags alongside version tags, enabling automated update tools like
Watchtower to detect and pull the latest images correctly.
Refs:
[https://github.com/infiniflow/ragflow/issues/10039](https://github.com/infiniflow/ragflow/issues/10039)

### What problem does this PR solve?  
Automated container update tools such as Watchtower rely on stable tags
like `latest` to identify the newest images. Previously, only
version-specific tags were pushed, which prevented these tools from
detecting new releases automatically. This PR adds multiple stable tags
(`latest-full`, `latest-slim`) alongside version tags to the Docker
image publishing workflow, ensuring smooth and reliable automated
updates without manual tag management.

### Type of change  
- [ ] Bug Fix (non-breaking change which fixes an issue)  
- [x] New Feature (non-breaking change which adds functionality)  
- [ ] Documentation Update  
- [ ] Refactoring  
- [ ] Performance Improvement  
- [ ] Other (please describe):

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-13 21:44:53 +08:00
2a88ce6be1 Fix: terminate onnx inference session manually (#10076)
### What problem does this PR solve?

Terminate the ONNX inference session and release memory manually.

Issue #5050 
Issue #9992 
Issue #8805

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-12 17:18:26 +08:00
664b781d62 Feat: Translate the fields of the embedded dialog box on the agent page #3221 (#10072)
### What problem does this PR solve?

Feat: Translate the fields of the embedded dialog box on the agent page
#3221
### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-09-12 16:01:12 +08:00
65571e5254 Feat: dataflow supports text (#10058)
### What problem does this PR solve?

dataflow supports text.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-11 19:03:51 +08:00
aa30f20730 Feat: Agent component support inserting variables(#10048) (#10055)
### What problem does this PR solve?

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-11 19:03:19 +08:00
b9b278d441 Docs: How to connect to an MCP server as a client (#10043)
### What problem does this PR solve?

#9769 

### Type of change


- [x] Documentation Update
2025-09-11 19:02:50 +08:00
e1d86cfee3 Feat: add TokenPony model provider (#9932)
### What problem does this PR solve?

Add TokenPony as an LLM provider.

Co-authored-by: huangzl <huangzl@shinemo.com>
2025-09-11 17:25:31 +08:00
8ebd07337f The chat dialog box cannot be fully displayed on a small screen #10034 (#10049)
### What problem does this PR solve?

The chat dialog box cannot be fully displayed on a small screen #10034

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-11 13:32:23 +08:00
dd584d57b0 Fix: Hide dataflow related functions #9869 (#10045)
### What problem does this PR solve?

Fix: Hide dataflow related functions #9869

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-11 12:02:26 +08:00
3d39b96c6f Fix: token num exceed (#10046)
### What problem does this PR solve?

Fix text input exceeding the token limit when using SiliconFlow's embedding models BAAI/bge-large-zh-v1.5 and BAAI/bge-large-en-v1.5 by truncating the input before sending it.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
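A hedged sketch of the idea; the tokenizer, the 512-token limit, and the helper name are assumptions for illustration, and the actual fix lives in RAGFlow's SiliconFlow embedding wrapper:

```python
# Minimal sketch: truncate text to an embedding model's token budget before sending it.
# 512 tokens is the typical context size of BAAI/bge-large-* models; the real limit may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")

def truncate_for_embedding(text: str, max_tokens: int = 512) -> str:
    ids = tokenizer.encode(text, add_special_tokens=False)[:max_tokens]
    return tokenizer.decode(ids)

safe_text = truncate_for_embedding("some very long passage ..." * 1000)
```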
2025-09-11 12:02:12 +08:00
78 changed files with 4088 additions and 715 deletions

View File

@ -88,7 +88,9 @@ jobs:
with:
context: .
push: true
tags: infiniflow/ragflow:${{ env.RELEASE_TAG }}
tags: |
infiniflow/ragflow:${{ env.RELEASE_TAG }}
infiniflow/ragflow:latest-full
file: Dockerfile
platforms: linux/amd64
@ -98,7 +100,9 @@ jobs:
with:
context: .
push: true
tags: infiniflow/ragflow:${{ env.RELEASE_TAG }}-slim
tags: |
infiniflow/ragflow:${{ env.RELEASE_TAG }}-slim
infiniflow/ragflow:latest-slim
file: Dockerfile
build-args: LIGHTEN=1
platforms: linux/amd64

View File

@ -83,7 +83,7 @@
},
"password": "20010812Yy!",
"port": 3306,
"sql": "Agent:WickedGoatsDivide@content",
"sql": "{Agent:WickedGoatsDivide@content}",
"username": "13637682833@163.com"
}
},
@ -114,9 +114,7 @@
"params": {
"cross_languages": [],
"empty_response": "",
"kb_ids": [
"ed31364c727211f0bdb2bafe6e7908e6"
],
"kb_ids": [],
"keywords_similarity_weight": 0.7,
"outputs": {
"formalized_content": {
@ -124,7 +122,7 @@
"value": ""
}
},
"query": "sys.query",
"query": "{sys.query}",
"rerank_id": "",
"similarity_threshold": 0.2,
"top_k": 1024,
@ -145,9 +143,7 @@
"params": {
"cross_languages": [],
"empty_response": "",
"kb_ids": [
"0f968106727311f08357bafe6e7908e6"
],
"kb_ids": [],
"keywords_similarity_weight": 0.7,
"outputs": {
"formalized_content": {
@ -155,7 +151,7 @@
"value": ""
}
},
"query": "sys.query",
"query": "{sys.query}",
"rerank_id": "",
"similarity_threshold": 0.2,
"top_k": 1024,
@ -176,9 +172,7 @@
"params": {
"cross_languages": [],
"empty_response": "",
"kb_ids": [
"4ad1f9d0727311f0827dbafe6e7908e6"
],
"kb_ids": [],
"keywords_similarity_weight": 0.7,
"outputs": {
"formalized_content": {
@ -186,7 +180,7 @@
"value": ""
}
},
"query": "sys.query",
"query": "{sys.query}",
"rerank_id": "",
"similarity_threshold": 0.2,
"top_k": 1024,
@ -347,9 +341,7 @@
"form": {
"cross_languages": [],
"empty_response": "",
"kb_ids": [
"ed31364c727211f0bdb2bafe6e7908e6"
],
"kb_ids": [],
"keywords_similarity_weight": 0.7,
"outputs": {
"formalized_content": {
@ -357,7 +349,7 @@
"value": ""
}
},
"query": "sys.query",
"query": "{sys.query}",
"rerank_id": "",
"similarity_threshold": 0.2,
"top_k": 1024,
@ -387,9 +379,7 @@
"form": {
"cross_languages": [],
"empty_response": "",
"kb_ids": [
"0f968106727311f08357bafe6e7908e6"
],
"kb_ids": [],
"keywords_similarity_weight": 0.7,
"outputs": {
"formalized_content": {
@ -397,7 +387,7 @@
"value": ""
}
},
"query": "sys.query",
"query": "{sys.query}",
"rerank_id": "",
"similarity_threshold": 0.2,
"top_k": 1024,
@ -427,9 +417,7 @@
"form": {
"cross_languages": [],
"empty_response": "",
"kb_ids": [
"4ad1f9d0727311f0827dbafe6e7908e6"
],
"kb_ids": [],
"keywords_similarity_weight": 0.7,
"outputs": {
"formalized_content": {
@ -437,7 +425,7 @@
"value": ""
}
},
"query": "sys.query",
"query": "{sys.query}",
"rerank_id": "",
"similarity_threshold": 0.2,
"top_k": 1024,
@ -539,7 +527,7 @@
},
"password": "20010812Yy!",
"port": 3306,
"sql": "Agent:WickedGoatsDivide@content",
"sql": "{Agent:WickedGoatsDivide@content}",
"username": "13637682833@163.com"
},
"label": "ExeSQL",

View File

@ -157,7 +157,7 @@ class CodeExec(ToolBase, ABC):
try:
resp = requests.post(url=f"http://{settings.SANDBOX_HOST}:9385/run", json=code_req, timeout=os.environ.get("COMPONENT_EXEC_TIMEOUT", 10*60))
logging.info(f"http://{settings.SANDBOX_HOST}:9385/run", code_req, resp.status_code)
logging.info(f"http://{settings.SANDBOX_HOST}:9385/run, code_req: {code_req}, resp.status_code {resp.status_code}:")
if resp.status_code != 200:
resp.raise_for_status()
body = resp.json()

View File

@ -53,7 +53,7 @@ class ExeSQLParam(ToolParamBase):
self.max_records = 1024
def check(self):
self.check_valid_value(self.db_type, "Choose DB type", ['mysql', 'postgresql', 'mariadb', 'mssql'])
self.check_valid_value(self.db_type, "Choose DB type", ['mysql', 'postgres', 'mariadb', 'mssql'])
self.check_empty(self.database, "Database name")
self.check_empty(self.username, "database username")
self.check_empty(self.host, "IP Address")
@ -111,7 +111,7 @@ class ExeSQL(ToolBase, ABC):
if self._param.db_type in ["mysql", "mariadb"]:
db = pymysql.connect(db=self._param.database, user=self._param.username, host=self._param.host,
port=self._param.port, password=self._param.password)
elif self._param.db_type == 'postgresql':
elif self._param.db_type == 'postgres':
db = psycopg2.connect(dbname=self._param.database, user=self._param.username, host=self._param.host,
port=self._param.port, password=self._param.password)
elif self._param.db_type == 'mssql':

View File

@ -332,7 +332,7 @@ def test_db_connect():
if req["db_type"] in ["mysql", "mariadb"]:
db = MySQLDatabase(req["database"], user=req["username"], host=req["host"], port=req["port"],
password=req["password"])
elif req["db_type"] == 'postgresql':
elif req["db_type"] == 'postgres':
db = PostgresqlDatabase(req["database"], user=req["username"], host=req["host"], port=req["port"],
password=req["password"])
elif req["db_type"] == 'mssql':

View File

@ -379,3 +379,19 @@ def get_meta():
code=settings.RetCode.AUTHENTICATION_ERROR
)
return get_json_result(data=DocumentService.get_meta_by_kbs(kb_ids))
@manager.route("/basic_info", methods=["GET"]) # noqa: F821
@login_required
def get_basic_info():
kb_id = request.args.get("kb_id", "")
if not KnowledgebaseService.accessible(kb_id, current_user.id):
return get_json_result(
data=False,
message='No authorization.',
code=settings.RetCode.AUTHENTICATION_ERROR
)
basic_info = DocumentService.knowledgebase_basic_info(kb_id)
return get_json_result(data=basic_info)

View File

@ -3,9 +3,11 @@ import re
import flask
from flask import request
from pathlib import Path
from api.db.services.document_service import DocumentService
from api.db.services.file2document_service import File2DocumentService
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.utils.api_utils import server_error_response, token_required
from api.utils import get_uuid
from api.db import FileType
@ -666,3 +668,71 @@ def move(tenant_id):
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
@manager.route('/file/convert', methods=['POST']) # noqa: F821
@token_required
def convert(tenant_id):
req = request.json
kb_ids = req["kb_ids"]
file_ids = req["file_ids"]
file2documents = []
try:
files = FileService.get_by_ids(file_ids)
files_set = dict({file.id: file for file in files})
for file_id in file_ids:
file = files_set[file_id]
if not file:
return get_json_result(message="File not found!", code=404)
file_ids_list = [file_id]
if file.type == FileType.FOLDER.value:
file_ids_list = FileService.get_all_innermost_file_ids(file_id, [])
for id in file_ids_list:
informs = File2DocumentService.get_by_file_id(id)
# delete
for inform in informs:
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
return get_json_result(message="Document not found!", code=404)
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_json_result(message="Tenant not found!", code=404)
if not DocumentService.remove_document(doc, tenant_id):
return get_json_result(
message="Database error (Document removal)!", code=404)
File2DocumentService.delete_by_file_id(id)
# insert
for kb_id in kb_ids:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
return get_json_result(
message="Can't find this knowledgebase!", code=404)
e, file = FileService.get_by_id(id)
if not e:
return get_json_result(
message="Can't find this file!", code=404)
doc = DocumentService.insert({
"id": get_uuid(),
"kb_id": kb.id,
"parser_id": FileService.get_parser(file.type, file.name, kb.parser_id),
"parser_config": kb.parser_config,
"created_by": tenant_id,
"type": file.type,
"name": file.name,
"suffix": Path(file.name).suffix.lstrip("."),
"location": file.location,
"size": file.size
})
file2document = File2DocumentService.insert({
"id": get_uuid(),
"file_id": id,
"document_id": doc.id,
})
file2documents.append(file2document.to_json())
return get_json_result(data=file2documents)
except Exception as e:
return server_error_response(e)

View File

@ -414,7 +414,7 @@ def agents_completion_openai_compatibility(tenant_id, agent_id):
tenant_id,
agent_id,
question,
session_id=req.get("session_id", req.get("id", "") or req.get("metadata", {}).get("id", "")),
session_id=req.pop("session_id", req.get("id", "")) or req.get("metadata", {}).get("id", ""),
stream=True,
**req,
),
@ -432,7 +432,7 @@ def agents_completion_openai_compatibility(tenant_id, agent_id):
tenant_id,
agent_id,
question,
session_id=req.get("session_id", req.get("id", "") or req.get("metadata", {}).get("id", "")),
session_id=req.pop("session_id", req.get("id", "")) or req.get("metadata", {}).get("id", ""),
stream=False,
**req,
)

View File

@ -36,6 +36,8 @@ from rag.utils.storage_factory import STORAGE_IMPL, STORAGE_IMPL_TYPE
from timeit import default_timer as timer
from rag.utils.redis_conn import REDIS_CONN
from flask import jsonify
from api.utils.health_utils import run_health_checks
@manager.route("/version", methods=["GET"]) # noqa: F821
@login_required
@ -169,6 +171,12 @@ def status():
return get_json_result(data=res)
@manager.route("/healthz", methods=["GET"]) # noqa: F821
def healthz():
result, all_ok = run_health_checks()
return jsonify(result), (200 if all_ok else 500)
@manager.route("/new_token", methods=["POST"]) # noqa: F821
@login_required
def new_token():

View File

@ -144,8 +144,9 @@ def init_llm_factory():
except Exception:
pass
break
doc_count = DocumentService.get_all_kb_doc_count()
for kb_id in KnowledgebaseService.get_all_ids():
KnowledgebaseService.update_document_number_in_init(kb_id=kb_id, doc_num=DocumentService.get_kb_doc_count(kb_id))
KnowledgebaseService.update_document_number_in_init(kb_id=kb_id, doc_num=doc_count.get(kb_id, 0))

View File

@ -24,7 +24,7 @@ from io import BytesIO
import trio
import xxhash
from peewee import fn
from peewee import fn, Case
from api import settings
from api.constants import IMG_BASE64_PREFIX, FILE_NAME_LEN_LIMIT
@ -660,8 +660,16 @@ class DocumentService(CommonService):
@classmethod
@DB.connection_context()
def get_kb_doc_count(cls, kb_id):
return len(cls.model.select(cls.model.id).where(
cls.model.kb_id == kb_id).dicts())
return cls.model.select().where(cls.model.kb_id == kb_id).count()
@classmethod
@DB.connection_context()
def get_all_kb_doc_count(cls):
result = {}
rows = cls.model.select(cls.model.kb_id, fn.COUNT(cls.model.id).alias('count')).group_by(cls.model.kb_id)
for row in rows:
result[row.kb_id] = row.count
return result
@classmethod
@DB.connection_context()
@ -674,6 +682,53 @@ class DocumentService(CommonService):
return False
@classmethod
@DB.connection_context()
def knowledgebase_basic_info(cls, kb_id: str) -> dict[str, int]:
# cancelled: run == "2" but progress can vary
cancelled = (
cls.model.select(fn.COUNT(1))
.where((cls.model.kb_id == kb_id) & (cls.model.run == TaskStatus.CANCEL))
.scalar()
)
row = (
cls.model.select(
# finished: progress == 1
fn.COALESCE(fn.SUM(Case(None, [(cls.model.progress == 1, 1)], 0)), 0).alias("finished"),
# failed: progress == -1
fn.COALESCE(fn.SUM(Case(None, [(cls.model.progress == -1, 1)], 0)), 0).alias("failed"),
# processing: 0 <= progress < 1
fn.COALESCE(
fn.SUM(
Case(
None,
[
(((cls.model.progress == 0) | ((cls.model.progress > 0) & (cls.model.progress < 1))), 1),
],
0,
)
),
0,
).alias("processing"),
)
.where(
(cls.model.kb_id == kb_id)
& ((cls.model.run.is_null(True)) | (cls.model.run != TaskStatus.CANCEL))
)
.dicts()
.get()
)
return {
"processing": int(row["processing"]),
"finished": int(row["finished"]),
"failed": int(row["failed"]),
"cancelled": int(cancelled),
}
def queue_raptor_o_graphrag_tasks(doc, ty, priority):
chunking_config = DocumentService.get_chunking_config(doc["id"])
hasher = xxhash.xxh64()
@ -702,6 +757,8 @@ def queue_raptor_o_graphrag_tasks(doc, ty, priority):
def get_queue_length(priority):
group_info = REDIS_CONN.queue_info(get_svr_queue_name(priority), SVR_CONSUMER_GROUP_NAME)
if not group_info:
return 0
return int(group_info.get("lag", 0) or 0)
@ -847,3 +904,4 @@ def doc_upload_and_parse(conversation_id, file_objs, user_id):
doc_id, kb.id, token_counts[doc_id], chunk_counts[doc_id], 0)
return [d["id"] for d, _ in files]

api/utils/health_utils.py (new file, 90 lines added)
View File

@ -0,0 +1,90 @@
from timeit import default_timer as timer
from api import settings
from api.db.db_models import DB
from rag.utils.redis_conn import REDIS_CONN
from rag.utils.storage_factory import STORAGE_IMPL
def _ok_nok(ok: bool) -> str:
return "ok" if ok else "nok"
def check_db() -> tuple[bool, dict]:
st = timer()
try:
# lightweight probe; works for MySQL/Postgres
DB.execute_sql("SELECT 1")
return True, {"elapsed": f"{(timer() - st) * 1000.0:.1f}"}
except Exception as e:
return False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)}
def check_redis() -> tuple[bool, dict]:
st = timer()
try:
ok = bool(REDIS_CONN.health())
return ok, {"elapsed": f"{(timer() - st) * 1000.0:.1f}"}
except Exception as e:
return False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)}
def check_doc_engine() -> tuple[bool, dict]:
st = timer()
try:
meta = settings.docStoreConn.health()
# treat any successful call as ok
return True, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", **(meta or {})}
except Exception as e:
return False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)}
def check_storage() -> tuple[bool, dict]:
st = timer()
try:
STORAGE_IMPL.health()
return True, {"elapsed": f"{(timer() - st) * 1000.0:.1f}"}
except Exception as e:
return False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)}
def run_health_checks() -> tuple[dict, bool]:
result: dict[str, str | dict] = {}
db_ok, db_meta = check_db()
result["db"] = _ok_nok(db_ok)
if not db_ok:
result.setdefault("_meta", {})["db"] = db_meta
try:
redis_ok, redis_meta = check_redis()
result["redis"] = _ok_nok(redis_ok)
if not redis_ok:
result.setdefault("_meta", {})["redis"] = redis_meta
except Exception:
result["redis"] = "nok"
try:
doc_ok, doc_meta = check_doc_engine()
result["doc_engine"] = _ok_nok(doc_ok)
if not doc_ok:
result.setdefault("_meta", {})["doc_engine"] = doc_meta
except Exception:
result["doc_engine"] = "nok"
try:
sto_ok, sto_meta = check_storage()
result["storage"] = _ok_nok(sto_ok)
if not sto_ok:
result.setdefault("_meta", {})["storage"] = sto_meta
except Exception:
result["storage"] = "nok"
all_ok = (result.get("db") == "ok") and (result.get("redis") == "ok") and (result.get("doc_engine") == "ok") and (result.get("storage") == "ok")
result["status"] = "ok" if all_ok else "nok"
return result, all_ok

View File

@ -219,6 +219,70 @@
}
]
},
{
"name": "TokenPony",
"logo": "",
"tags": "LLM",
"status": "1",
"llm": [
{
"llm_name": "qwen3-8b",
"tags": "LLM,CHAT,131k",
"max_tokens": 131000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "deepseek-v3-0324",
"tags": "LLM,CHAT,128k",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "qwen3-32b",
"tags": "LLM,CHAT,131k",
"max_tokens": 131000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "kimi-k2-instruct",
"tags": "LLM,CHAT,128K",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "deepseek-r1-0528",
"tags": "LLM,CHAT,164k",
"max_tokens": 164000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "qwen3-coder-480b",
"tags": "LLM,CHAT,1024k",
"max_tokens": 1024000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "glm-4.5",
"tags": "LLM,CHAT,131K",
"max_tokens": 131000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "deepseek-v3.1",
"tags": "LLM,CHAT,128k",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
}
]
},
{
"name": "Tongyi-Qianwen",
"logo": "",
@ -625,7 +689,7 @@
},
{
"llm_name": "glm-4",
"tags":"LLM,CHAT,128K",
"tags": "LLM,CHAT,128K",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
@ -4477,6 +4541,273 @@
}
]
},
{
"name": "CometAPI",
"logo": "",
"tags": "LLM,TEXT EMBEDDING,IMAGE2TEXT",
"status": "1",
"llm": [
{
"llm_name": "gpt-5-chat-latest",
"tags": "LLM,CHAT,400k",
"max_tokens": 400000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "chatgpt-4o-latest",
"tags": "LLM,CHAT,128k",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gpt-5-mini",
"tags": "LLM,CHAT,400k",
"max_tokens": 400000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gpt-5-nano",
"tags": "LLM,CHAT,400k",
"max_tokens": 400000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gpt-5",
"tags": "LLM,CHAT,400k",
"max_tokens": 400000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gpt-4.1-mini",
"tags": "LLM,CHAT,1M",
"max_tokens": 1047576,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gpt-4.1-nano",
"tags": "LLM,CHAT,1M",
"max_tokens": 1047576,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gpt-4.1",
"tags": "LLM,CHAT,1M",
"max_tokens": 1047576,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gpt-4o-mini",
"tags": "LLM,CHAT,128k",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "o4-mini-2025-04-16",
"tags": "LLM,CHAT,200k",
"max_tokens": 200000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "o3-pro-2025-06-10",
"tags": "LLM,CHAT,200k",
"max_tokens": 200000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "claude-opus-4-1-20250805",
"tags": "LLM,CHAT,200k,IMAGE2TEXT",
"max_tokens": 200000,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "claude-opus-4-1-20250805-thinking",
"tags": "LLM,CHAT,200k,IMAGE2TEXT",
"max_tokens": 200000,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "claude-sonnet-4-20250514",
"tags": "LLM,CHAT,200k,IMAGE2TEXT",
"max_tokens": 200000,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "claude-sonnet-4-20250514-thinking",
"tags": "LLM,CHAT,200k,IMAGE2TEXT",
"max_tokens": 200000,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "claude-3-7-sonnet-latest",
"tags": "LLM,CHAT,200k",
"max_tokens": 200000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "claude-3-5-haiku-latest",
"tags": "LLM,CHAT,200k",
"max_tokens": 200000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gemini-2.5-pro",
"tags": "LLM,CHAT,1M,IMAGE2TEXT",
"max_tokens": 1000000,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "gemini-2.5-flash",
"tags": "LLM,CHAT,1M,IMAGE2TEXT",
"max_tokens": 1000000,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "gemini-2.5-flash-lite",
"tags": "LLM,CHAT,1M,IMAGE2TEXT",
"max_tokens": 1000000,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "gemini-2.0-flash",
"tags": "LLM,CHAT,1M,IMAGE2TEXT",
"max_tokens": 1000000,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "grok-4-0709",
"tags": "LLM,CHAT,131k",
"max_tokens": 131072,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "grok-3",
"tags": "LLM,CHAT,131k",
"max_tokens": 131072,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "grok-3-mini",
"tags": "LLM,CHAT,131k",
"max_tokens": 131072,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "grok-2-image-1212",
"tags": "LLM,CHAT,32k,IMAGE2TEXT",
"max_tokens": 32768,
"model_type": "image2text",
"is_tools": true
},
{
"llm_name": "deepseek-v3.1",
"tags": "LLM,CHAT,64k",
"max_tokens": 64000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "deepseek-v3",
"tags": "LLM,CHAT,64k",
"max_tokens": 64000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "deepseek-r1-0528",
"tags": "LLM,CHAT,164k",
"max_tokens": 164000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "deepseek-chat",
"tags": "LLM,CHAT,32k",
"max_tokens": 32000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "deepseek-reasoner",
"tags": "LLM,CHAT,64k",
"max_tokens": 64000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "qwen3-30b-a3b",
"tags": "LLM,CHAT,128k",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "qwen3-coder-plus-2025-07-22",
"tags": "LLM,CHAT,128k",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "text-embedding-ada-002",
"tags": "TEXT EMBEDDING,8K",
"max_tokens": 8191,
"model_type": "embedding",
"is_tools": false
},
{
"llm_name": "text-embedding-3-small",
"tags": "TEXT EMBEDDING,8K",
"max_tokens": 8191,
"model_type": "embedding",
"is_tools": false
},
{
"llm_name": "text-embedding-3-large",
"tags": "TEXT EMBEDDING,8K",
"max_tokens": 8191,
"model_type": "embedding",
"is_tools": false
},
{
"llm_name": "whisper-1",
"tags": "SPEECH2TEXT",
"max_tokens": 26214400,
"model_type": "speech2text",
"is_tools": false
},
{
"llm_name": "tts-1",
"tags": "TTS",
"max_tokens": 2048,
"model_type": "tts",
"is_tools": false
}
]
},
{
"name": "Meituan",
"logo": "",
@ -4493,4 +4824,4 @@
]
}
]
}
}

View File

@ -37,7 +37,7 @@ TITLE_TAGS = {"h1": "#", "h2": "##", "h3": "###", "h4": "#####", "h5": "#####",
class RAGFlowHtmlParser:
def __call__(self, fnm, binary=None, chunk_token_num=None):
def __call__(self, fnm, binary=None, chunk_token_num=512):
if binary:
encoding = find_codec(binary)
txt = binary.decode(encoding, errors="ignore")

View File

@ -34,7 +34,7 @@ from pypdf import PdfReader as pdf2_read
from api import settings
from api.utils.file_utils import get_project_base_directory
from deepdoc.vision import OCR, LayoutRecognizer, Recognizer, TableStructureRecognizer
from deepdoc.vision import OCR, AscendLayoutRecognizer, LayoutRecognizer, Recognizer, TableStructureRecognizer
from rag.app.picture import vision_llm_chunk as picture_vision_llm_chunk
from rag.nlp import rag_tokenizer
from rag.prompts import vision_llm_describe_prompt
@ -64,33 +64,38 @@ class RAGFlowPdfParser:
if PARALLEL_DEVICES > 1:
self.parallel_limiter = [trio.CapacityLimiter(1) for _ in range(PARALLEL_DEVICES)]
layout_recognizer_type = os.getenv("LAYOUT_RECOGNIZER_TYPE", "onnx").lower()
if layout_recognizer_type not in ["onnx", "ascend"]:
raise RuntimeError("Unsupported layout recognizer type.")
if hasattr(self, "model_speciess"):
self.layouter = LayoutRecognizer("layout." + self.model_speciess)
recognizer_domain = "layout." + self.model_speciess
else:
self.layouter = LayoutRecognizer("layout")
recognizer_domain = "layout"
if layout_recognizer_type == "ascend":
logging.debug("Using Ascend LayoutRecognizer")
self.layouter = AscendLayoutRecognizer(recognizer_domain)
else: # onnx
logging.debug("Using Onnx LayoutRecognizer")
self.layouter = LayoutRecognizer(recognizer_domain)
self.tbl_det = TableStructureRecognizer()
self.updown_cnt_mdl = xgb.Booster()
if not settings.LIGHTEN:
try:
import torch.cuda
if torch.cuda.is_available():
self.updown_cnt_mdl.set_param({"device": "cuda"})
except Exception:
logging.exception("RAGFlowPdfParser __init__")
try:
model_dir = os.path.join(
get_project_base_directory(),
"rag/res/deepdoc")
self.updown_cnt_mdl.load_model(os.path.join(
model_dir, "updown_concat_xgb.model"))
model_dir = os.path.join(get_project_base_directory(), "rag/res/deepdoc")
self.updown_cnt_mdl.load_model(os.path.join(model_dir, "updown_concat_xgb.model"))
except Exception:
model_dir = snapshot_download(
repo_id="InfiniFlow/text_concat_xgb_v1.0",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False)
self.updown_cnt_mdl.load_model(os.path.join(
model_dir, "updown_concat_xgb.model"))
model_dir = snapshot_download(repo_id="InfiniFlow/text_concat_xgb_v1.0", local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"), local_dir_use_symlinks=False)
self.updown_cnt_mdl.load_model(os.path.join(model_dir, "updown_concat_xgb.model"))
self.page_from = 0
self.column_num = 1
@ -102,13 +107,10 @@ class RAGFlowPdfParser:
return c["bottom"] - c["top"]
def _x_dis(self, a, b):
return min(abs(a["x1"] - b["x0"]), abs(a["x0"] - b["x1"]),
abs(a["x0"] + a["x1"] - b["x0"] - b["x1"]) / 2)
return min(abs(a["x1"] - b["x0"]), abs(a["x0"] - b["x1"]), abs(a["x0"] + a["x1"] - b["x0"] - b["x1"]) / 2)
def _y_dis(
self, a, b):
return (
b["top"] + b["bottom"] - a["top"] - a["bottom"]) / 2
def _y_dis(self, a, b):
return (b["top"] + b["bottom"] - a["top"] - a["bottom"]) / 2
def _match_proj(self, b):
proj_patt = [
@ -130,10 +132,7 @@ class RAGFlowPdfParser:
LEN = 6
tks_down = rag_tokenizer.tokenize(down["text"][:LEN]).split()
tks_up = rag_tokenizer.tokenize(up["text"][-LEN:]).split()
tks_all = up["text"][-LEN:].strip() \
+ (" " if re.match(r"[a-zA-Z0-9]+",
up["text"][-1] + down["text"][0]) else "") \
+ down["text"][:LEN].strip()
tks_all = up["text"][-LEN:].strip() + (" " if re.match(r"[a-zA-Z0-9]+", up["text"][-1] + down["text"][0]) else "") + down["text"][:LEN].strip()
tks_all = rag_tokenizer.tokenize(tks_all).split()
fea = [
up.get("R", -1) == down.get("R", -1),
@ -144,39 +143,30 @@ class RAGFlowPdfParser:
down["layout_type"] == "text",
up["layout_type"] == "table",
down["layout_type"] == "table",
True if re.search(
r"([。?!;!?;+)]|[a-z]\.)$",
up["text"]) else False,
True if re.search(r"([。?!;!?;+)]|[a-z]\.)$", up["text"]) else False,
True if re.search(r"[“、0-9+-]$", up["text"]) else False,
True if re.search(
r"(^.?[/,?;:\],。;:’”?!》】)-])",
down["text"]) else False,
True if re.search(r"(^.?[/,?;:\],。;:’”?!》】)-])", down["text"]) else False,
True if re.match(r"[\(][^\(\)]+[\)]$", up["text"]) else False,
True if re.search(r"[,][^。.]+$", up["text"]) else False,
True if re.search(r"[,][^。.]+$", up["text"]) else False,
True if re.search(r"[\(][^\)]+$", up["text"])
and re.search(r"[\)]", down["text"]) else False,
True if re.search(r"[\(][^\)]+$", up["text"]) and re.search(r"[\)]", down["text"]) else False,
self._match_proj(down),
True if re.match(r"[A-Z]", down["text"]) else False,
True if re.match(r"[A-Z]", up["text"][-1]) else False,
True if re.match(r"[a-z0-9]", up["text"][-1]) else False,
True if re.match(r"[0-9.%,-]+$", down["text"]) else False,
up["text"].strip()[-2:] == down["text"].strip()[-2:] if len(up["text"].strip()
) > 1 and len(
down["text"].strip()) > 1 else False,
up["text"].strip()[-2:] == down["text"].strip()[-2:] if len(up["text"].strip()) > 1 and len(down["text"].strip()) > 1 else False,
up["x0"] > down["x1"],
abs(self.__height(up) - self.__height(down)) / min(self.__height(up),
self.__height(down)),
abs(self.__height(up) - self.__height(down)) / min(self.__height(up), self.__height(down)),
self._x_dis(up, down) / max(w, 0.000001),
(len(up["text"]) - len(down["text"])) /
max(len(up["text"]), len(down["text"])),
(len(up["text"]) - len(down["text"])) / max(len(up["text"]), len(down["text"])),
len(tks_all) - len(tks_up) - len(tks_down),
len(tks_down) - len(tks_up),
tks_down[-1] == tks_up[-1] if tks_down and tks_up else False,
max(down["in_row"], up["in_row"]),
abs(down["in_row"] - up["in_row"]),
len(tks_down) == 1 and rag_tokenizer.tag(tks_down[0]).find("n") >= 0,
len(tks_up) == 1 and rag_tokenizer.tag(tks_up[0]).find("n") >= 0
len(tks_up) == 1 and rag_tokenizer.tag(tks_up[0]).find("n") >= 0,
]
return fea
@ -187,9 +177,7 @@ class RAGFlowPdfParser:
for i in range(len(arr) - 1):
for j in range(i, -1, -1):
# restore the reading order using the x0 threshold
if abs(arr[j + 1]["x0"] - arr[j]["x0"]) < threshold \
and arr[j + 1]["top"] < arr[j]["top"] \
and arr[j + 1]["page_number"] == arr[j]["page_number"]:
if abs(arr[j + 1]["x0"] - arr[j]["x0"]) < threshold and arr[j + 1]["top"] < arr[j]["top"] and arr[j + 1]["page_number"] == arr[j]["page_number"]:
tmp = arr[j]
arr[j] = arr[j + 1]
arr[j + 1] = tmp
@ -197,8 +185,7 @@ class RAGFlowPdfParser:
def _has_color(self, o):
if o.get("ncs", "") == "DeviceGray":
if o["stroking_color"] and o["stroking_color"][0] == 1 and o["non_stroking_color"] and \
o["non_stroking_color"][0] == 1:
if o["stroking_color"] and o["stroking_color"][0] == 1 and o["non_stroking_color"] and o["non_stroking_color"][0] == 1:
if re.match(r"[a-zT_\[\]\(\)-]+", o.get("text", "")):
return False
return True
@ -216,8 +203,7 @@ class RAGFlowPdfParser:
if not tbls:
continue
for tb in tbls: # for table
left, top, right, bott = tb["x0"] - MARGIN, tb["top"] - MARGIN, \
tb["x1"] + MARGIN, tb["bottom"] + MARGIN
left, top, right, bott = tb["x0"] - MARGIN, tb["top"] - MARGIN, tb["x1"] + MARGIN, tb["bottom"] + MARGIN
left *= ZM
top *= ZM
right *= ZM
@ -232,14 +218,13 @@ class RAGFlowPdfParser:
tbcnt = np.cumsum(tbcnt)
for i in range(len(tbcnt) - 1): # for page
pg = []
for j, tb_items in enumerate(
recos[tbcnt[i]: tbcnt[i + 1]]): # for table
poss = pos[tbcnt[i]: tbcnt[i + 1]]
for j, tb_items in enumerate(recos[tbcnt[i] : tbcnt[i + 1]]): # for table
poss = pos[tbcnt[i] : tbcnt[i + 1]]
for it in tb_items: # for table components
it["x0"] = (it["x0"] + poss[j][0])
it["x1"] = (it["x1"] + poss[j][0])
it["top"] = (it["top"] + poss[j][1])
it["bottom"] = (it["bottom"] + poss[j][1])
it["x0"] = it["x0"] + poss[j][0]
it["x1"] = it["x1"] + poss[j][0]
it["top"] = it["top"] + poss[j][1]
it["bottom"] = it["bottom"] + poss[j][1]
for n in ["x0", "x1", "top", "bottom"]:
it[n] /= ZM
it["top"] += self.page_cum_height[i]
@ -250,8 +235,7 @@ class RAGFlowPdfParser:
self.tb_cpns.extend(pg)
def gather(kwd, fzy=10, ption=0.6):
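# Gather table components whose label matches kwd, sorted top-to-bottom and cleaned up against the OCR boxes.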
eles = Recognizer.sort_Y_firstly(
[r for r in self.tb_cpns if re.match(kwd, r["label"])], fzy)
eles = Recognizer.sort_Y_firstly([r for r in self.tb_cpns if re.match(kwd, r["label"])], fzy)
eles = Recognizer.layouts_cleanup(self.boxes, eles, 5, ption)
return Recognizer.sort_Y_firstly(eles, 0)
@ -259,8 +243,7 @@ class RAGFlowPdfParser:
headers = gather(r".*header$")
rows = gather(r".* (row|header)")
spans = gather(r".*spanning")
clmns = sorted([r for r in self.tb_cpns if re.match(
r"table column$", r["label"])], key=lambda x: (x["pn"], x["layoutno"], x["x0"]))
clmns = sorted([r for r in self.tb_cpns if re.match(r"table column$", r["label"])], key=lambda x: (x["pn"], x["layoutno"], x["x0"]))
clmns = Recognizer.layouts_cleanup(self.boxes, clmns, 5, 0.5)
for b in self.boxes:
if b.get("layout_type", "") != "table":
@ -271,8 +254,7 @@ class RAGFlowPdfParser:
b["R_top"] = rows[ii]["top"]
b["R_bott"] = rows[ii]["bottom"]
ii = Recognizer.find_overlapped_with_threshold(
b, headers, thr=0.3)
ii = Recognizer.find_overlapped_with_threshold(b, headers, thr=0.3)
if ii is not None:
b["H_top"] = headers[ii]["top"]
b["H_bott"] = headers[ii]["bottom"]
@ -305,12 +287,12 @@ class RAGFlowPdfParser:
return
bxs = [(line[0], line[1][0]) for line in bxs]
bxs = Recognizer.sort_Y_firstly(
[{"x0": b[0][0] / ZM, "x1": b[1][0] / ZM,
"top": b[0][1] / ZM, "text": "", "txt": t,
"bottom": b[-1][1] / ZM,
"chars": [],
"page_number": pagenum} for b, t in bxs if b[0][0] <= b[1][0] and b[0][1] <= b[-1][1]],
self.mean_height[pagenum-1] / 3
[
{"x0": b[0][0] / ZM, "x1": b[1][0] / ZM, "top": b[0][1] / ZM, "text": "", "txt": t, "bottom": b[-1][1] / ZM, "chars": [], "page_number": pagenum}
for b, t in bxs
if b[0][0] <= b[1][0] and b[0][1] <= b[-1][1]
],
self.mean_height[pagenum - 1] / 3,
)
# merge chars in the same rect
@ -321,7 +303,7 @@ class RAGFlowPdfParser:
continue
ch = c["bottom"] - c["top"]
bh = bxs[ii]["bottom"] - bxs[ii]["top"]
if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != ' ':
if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != " ":
self.lefted_chars.append(c)
continue
bxs[ii]["chars"].append(c)
@ -345,8 +327,7 @@ class RAGFlowPdfParser:
img_np = np.array(img)
for b in bxs:
if not b["text"]:
left, right, top, bott = b["x0"] * ZM, b["x1"] * \
ZM, b["top"] * ZM, b["bottom"] * ZM
left, right, top, bott = b["x0"] * ZM, b["x1"] * ZM, b["top"] * ZM, b["bottom"] * ZM
b["box_image"] = self.ocr.get_rotate_crop_image(img_np, np.array([[left, top], [right, top], [right, bott], [left, bott]], dtype=np.float32))
boxes_to_reg.append(b)
del b["txt"]
@ -356,21 +337,17 @@ class RAGFlowPdfParser:
del boxes_to_reg[i]["box_image"]
logging.info(f"__ocr recognize {len(bxs)} boxes cost {timer() - start}s")
bxs = [b for b in bxs if b["text"]]
if self.mean_height[pagenum-1] == 0:
self.mean_height[pagenum-1] = np.median([b["bottom"] - b["top"]
for b in bxs])
if self.mean_height[pagenum - 1] == 0:
self.mean_height[pagenum - 1] = np.median([b["bottom"] - b["top"] for b in bxs])
self.boxes.append(bxs)
def _layouts_rec(self, ZM, drop=True):
assert len(self.page_images) == len(self.boxes)
self.boxes, self.page_layout = self.layouter(
self.page_images, self.boxes, ZM, drop=drop)
self.boxes, self.page_layout = self.layouter(self.page_images, self.boxes, ZM, drop=drop)
# cumulative Y
for i in range(len(self.boxes)):
self.boxes[i]["top"] += \
self.page_cum_height[self.boxes[i]["page_number"] - 1]
self.boxes[i]["bottom"] += \
self.page_cum_height[self.boxes[i]["page_number"] - 1]
self.boxes[i]["top"] += self.page_cum_height[self.boxes[i]["page_number"] - 1]
self.boxes[i]["bottom"] += self.page_cum_height[self.boxes[i]["page_number"] - 1]
def _text_merge(self):
# merge adjacent boxes
@ -390,12 +367,10 @@ class RAGFlowPdfParser:
while i < len(bxs) - 1:
b = bxs[i]
b_ = bxs[i + 1]
if b.get("layoutno", "0") != b_.get("layoutno", "1") or b.get("layout_type", "") in ["table", "figure",
"equation"]:
if b.get("layoutno", "0") != b_.get("layoutno", "1") or b.get("layout_type", "") in ["table", "figure", "equation"]:
i += 1
continue
if abs(self._y_dis(b, b_)
) < self.mean_height[bxs[i]["page_number"] - 1] / 3:
if abs(self._y_dis(b, b_)) < self.mean_height[bxs[i]["page_number"] - 1] / 3:
# merge
bxs[i]["x1"] = b_["x1"]
bxs[i]["top"] = (b["top"] + b_["top"]) / 2
@ -408,16 +383,14 @@ class RAGFlowPdfParser:
dis_thr = 1
dis = b["x1"] - b_["x0"]
if b.get("layout_type", "") != "text" or b_.get(
"layout_type", "") != "text":
if b.get("layout_type", "") != "text" or b_.get("layout_type", "") != "text":
if end_with(b, "") or start_with(b_, ""):
dis_thr = -8
else:
i += 1
continue
if abs(self._y_dis(b, b_)) < self.mean_height[bxs[i]["page_number"] - 1] / 5 \
and dis >= dis_thr and b["x1"] < b_["x1"]:
if abs(self._y_dis(b, b_)) < self.mean_height[bxs[i]["page_number"] - 1] / 5 and dis >= dis_thr and b["x1"] < b_["x1"]:
# merge
bxs[i]["x1"] = b_["x1"]
bxs[i]["top"] = (b["top"] + b_["top"]) / 2
@ -429,23 +402,22 @@ class RAGFlowPdfParser:
self.boxes = bxs
def _naive_vertical_merge(self, zoomin=3):
bxs = Recognizer.sort_Y_firstly(
self.boxes, np.median(
self.mean_height) / 3)
import math
bxs = Recognizer.sort_Y_firstly(self.boxes, np.median(self.mean_height) / 3)
column_width = np.median([b["x1"] - b["x0"] for b in self.boxes])
if not column_width or math.isnan(column_width):
column_width = self.mean_width[0]
self.column_num = int(self.page_images[0].size[0] / zoomin / column_width)
if column_width < self.page_images[0].size[0] / zoomin / self.column_num:
logging.info("Multi-column................... {} {}".format(column_width,
self.page_images[0].size[0] / zoomin / self.column_num))
logging.info("Multi-column................... {} {}".format(column_width, self.page_images[0].size[0] / zoomin / self.column_num))
self.boxes = self.sort_X_by_page(self.boxes, column_width / self.column_num)
i = 0
while i + 1 < len(bxs):
b = bxs[i]
b_ = bxs[i + 1]
if b["page_number"] < b_["page_number"] and re.match(
r"[0-9 •一—-]+$", b["text"]):
if b["page_number"] < b_["page_number"] and re.match(r"[0-9 •一—-]+$", b["text"]):
bxs.pop(i)
continue
if not b["text"].strip():
@ -453,8 +425,7 @@ class RAGFlowPdfParser:
continue
concatting_feats = [
b["text"].strip()[-1] in ",;:'\",、‘“;:-",
len(b["text"].strip()) > 1 and b["text"].strip(
)[-2] in ",;:'\",‘“、;:",
len(b["text"].strip()) > 1 and b["text"].strip()[-2] in ",;:'\",‘“、;:",
b_["text"].strip() and b_["text"].strip()[0] in "。;?!?”)),,、:",
]
# features for not concatenating
@ -462,21 +433,20 @@ class RAGFlowPdfParser:
b.get("layoutno", 0) != b_.get("layoutno", 0),
b["text"].strip()[-1] in "。?!?",
self.is_english and b["text"].strip()[-1] in ".!?",
b["page_number"] == b_["page_number"] and b_["top"] -
b["bottom"] > self.mean_height[b["page_number"] - 1] * 1.5,
b["page_number"] < b_["page_number"] and abs(
b["x0"] - b_["x0"]) > self.mean_width[b["page_number"] - 1] * 4,
b["page_number"] == b_["page_number"] and b_["top"] - b["bottom"] > self.mean_height[b["page_number"] - 1] * 1.5,
b["page_number"] < b_["page_number"] and abs(b["x0"] - b_["x0"]) > self.mean_width[b["page_number"] - 1] * 4,
]
# split features
detach_feats = [b["x1"] < b_["x0"],
b["x0"] > b_["x1"]]
detach_feats = [b["x1"] < b_["x0"], b["x0"] > b_["x1"]]
if (any(feats) and not any(concatting_feats)) or any(detach_feats):
logging.debug("{} {} {} {}".format(
b["text"],
b_["text"],
any(feats),
any(concatting_feats),
))
logging.debug(
"{} {} {} {}".format(
b["text"],
b_["text"],
any(feats),
any(concatting_feats),
)
)
i += 1
continue
# merge up and down
@ -529,14 +499,11 @@ class RAGFlowPdfParser:
if not concat_between_pages and down["page_number"] > up["page_number"]:
break
if up.get("R", "") != down.get(
"R", "") and up["text"][-1] != "":
if up.get("R", "") != down.get("R", "") and up["text"][-1] != "":
i += 1
continue
if re.match(r"[0-9]{2,3}/[0-9]{3}$", up["text"]) \
or re.match(r"[0-9]{2,3}/[0-9]{3}$", down["text"]) \
or not down["text"].strip():
if re.match(r"[0-9]{2,3}/[0-9]{3}$", up["text"]) or re.match(r"[0-9]{2,3}/[0-9]{3}$", down["text"]) or not down["text"].strip():
i += 1
continue
@ -544,14 +511,12 @@ class RAGFlowPdfParser:
i += 1
continue
if up["x1"] < down["x0"] - 10 * \
mw or up["x0"] > down["x1"] + 10 * mw:
if up["x1"] < down["x0"] - 10 * mw or up["x0"] > down["x1"] + 10 * mw:
i += 1
continue
if i - dp < 5 and up.get("layout_type") == "text":
if up.get("layoutno", "1") == down.get(
"layoutno", "2"):
if up.get("layoutno", "1") == down.get("layoutno", "2"):
dfs(down, i + 1)
boxes.pop(i)
return
@ -559,8 +524,7 @@ class RAGFlowPdfParser:
continue
fea = self._updown_concat_features(up, down)
if self.updown_cnt_mdl.predict(
xgb.DMatrix([fea]))[0] <= 0.5:
if self.updown_cnt_mdl.predict(xgb.DMatrix([fea]))[0] <= 0.5:
i += 1
continue
dfs(down, i + 1)
@ -584,16 +548,14 @@ class RAGFlowPdfParser:
c["text"] = c["text"].strip()
if not c["text"]:
continue
if t["text"] and re.match(
r"[0-9\.a-zA-Z]+$", t["text"][-1] + c["text"][-1]):
if t["text"] and re.match(r"[0-9\.a-zA-Z]+$", t["text"][-1] + c["text"][-1]):
t["text"] += " "
t["text"] += c["text"]
t["x0"] = min(t["x0"], c["x0"])
t["x1"] = max(t["x1"], c["x1"])
t["page_number"] = min(t["page_number"], c["page_number"])
t["bottom"] = c["bottom"]
if not t["layout_type"] \
and c["layout_type"]:
if not t["layout_type"] and c["layout_type"]:
t["layout_type"] = c["layout_type"]
boxes.append(t)
@ -605,25 +567,20 @@ class RAGFlowPdfParser:
findit = False
i = 0
while i < len(self.boxes):
if not re.match(r"(contents|目录|目次|table of contents|致谢|acknowledge)$",
re.sub(r"( | |\u3000)+", "", self.boxes[i]["text"].lower())):
if not re.match(r"(contents|目录|目次|table of contents|致谢|acknowledge)$", re.sub(r"( | |\u3000)+", "", self.boxes[i]["text"].lower())):
i += 1
continue
findit = True
eng = re.match(
r"[0-9a-zA-Z :'.-]{5,}",
self.boxes[i]["text"].strip())
eng = re.match(r"[0-9a-zA-Z :'.-]{5,}", self.boxes[i]["text"].strip())
self.boxes.pop(i)
if i >= len(self.boxes):
break
prefix = self.boxes[i]["text"].strip()[:3] if not eng else " ".join(
self.boxes[i]["text"].strip().split()[:2])
prefix = self.boxes[i]["text"].strip()[:3] if not eng else " ".join(self.boxes[i]["text"].strip().split()[:2])
while not prefix:
self.boxes.pop(i)
if i >= len(self.boxes):
break
prefix = self.boxes[i]["text"].strip()[:3] if not eng else " ".join(
self.boxes[i]["text"].strip().split()[:2])
prefix = self.boxes[i]["text"].strip()[:3] if not eng else " ".join(self.boxes[i]["text"].strip().split()[:2])
self.boxes.pop(i)
if i >= len(self.boxes) or not prefix:
break
@ -662,10 +619,12 @@ class RAGFlowPdfParser:
self.boxes.pop(i + 1)
continue
if b["text"].strip()[0] != b_["text"].strip()[0] \
or b["text"].strip()[0].lower() in set("qwertyuopasdfghjklzxcvbnm") \
or rag_tokenizer.is_chinese(b["text"].strip()[0]) \
or b["top"] > b_["bottom"]:
if (
b["text"].strip()[0] != b_["text"].strip()[0]
or b["text"].strip()[0].lower() in set("qwertyuopasdfghjklzxcvbnm")
or rag_tokenizer.is_chinese(b["text"].strip()[0])
or b["top"] > b_["bottom"]
):
i += 1
continue
b_["text"] = b["text"] + "\n" + b_["text"]
@ -685,12 +644,8 @@ class RAGFlowPdfParser:
if "layoutno" not in self.boxes[i]:
i += 1
continue
lout_no = str(self.boxes[i]["page_number"]) + \
"-" + str(self.boxes[i]["layoutno"])
if TableStructureRecognizer.is_caption(self.boxes[i]) or self.boxes[i]["layout_type"] in ["table caption",
"title",
"figure caption",
"reference"]:
lout_no = str(self.boxes[i]["page_number"]) + "-" + str(self.boxes[i]["layoutno"])
if TableStructureRecognizer.is_caption(self.boxes[i]) or self.boxes[i]["layout_type"] in ["table caption", "title", "figure caption", "reference"]:
nomerge_lout_no.append(lst_lout_no)
if self.boxes[i]["layout_type"] == "table":
if re.match(r"(数据|资料|图表)*来源[: ]", self.boxes[i]["text"]):
@ -716,8 +671,7 @@ class RAGFlowPdfParser:
# merge table on different pages
nomerge_lout_no = set(nomerge_lout_no)
tbls = sorted([(k, bxs) for k, bxs in tables.items()],
key=lambda x: (x[1][0]["top"], x[1][0]["x0"]))
tbls = sorted([(k, bxs) for k, bxs in tables.items()], key=lambda x: (x[1][0]["top"], x[1][0]["x0"]))
i = len(tbls) - 1
while i - 1 >= 0:
@ -758,9 +712,7 @@ class RAGFlowPdfParser:
if b.get("layout_type", "").find("caption") >= 0:
continue
y_dis = self._y_dis(c, b)
x_dis = self._x_dis(
c, b) if not x_overlapped(
c, b) else 0
x_dis = self._x_dis(c, b) if not x_overlapped(c, b) else 0
dis = y_dis * y_dis + x_dis * x_dis
if dis < minv:
mink = k
@ -774,18 +726,10 @@ class RAGFlowPdfParser:
# continue
if tv < fv and tk:
tables[tk].insert(0, c)
logging.debug(
"TABLE:" +
self.boxes[i]["text"] +
"; Cap: " +
tk)
logging.debug("TABLE:" + self.boxes[i]["text"] + "; Cap: " + tk)
elif fk:
figures[fk].insert(0, c)
logging.debug(
"FIGURE:" +
self.boxes[i]["text"] +
"; Cap: " +
tk)
logging.debug("FIGURE:" + self.boxes[i]["text"] + "; Cap: " + tk)
self.boxes.pop(i)
def cropout(bxs, ltype, poss):
@ -794,29 +738,19 @@ class RAGFlowPdfParser:
if len(pn) < 2:
pn = list(pn)[0]
ht = self.page_cum_height[pn]
b = {
"x0": np.min([b["x0"] for b in bxs]),
"top": np.min([b["top"] for b in bxs]) - ht,
"x1": np.max([b["x1"] for b in bxs]),
"bottom": np.max([b["bottom"] for b in bxs]) - ht
}
b = {"x0": np.min([b["x0"] for b in bxs]), "top": np.min([b["top"] for b in bxs]) - ht, "x1": np.max([b["x1"] for b in bxs]), "bottom": np.max([b["bottom"] for b in bxs]) - ht}
louts = [layout for layout in self.page_layout[pn] if layout["type"] == ltype]
ii = Recognizer.find_overlapped(b, louts, naive=True)
if ii is not None:
b = louts[ii]
else:
logging.warning(
f"Missing layout match: {pn + 1},%s" %
(bxs[0].get(
"layoutno", "")))
logging.warning(f"Missing layout match: {pn + 1},%s" % (bxs[0].get("layoutno", "")))
left, top, right, bott = b["x0"], b["top"], b["x1"], b["bottom"]
if right < left:
right = left + 1
poss.append((pn + self.page_from, left, right, top, bott))
return self.page_images[pn] \
.crop((left * ZM, top * ZM,
right * ZM, bott * ZM))
return self.page_images[pn].crop((left * ZM, top * ZM, right * ZM, bott * ZM))
pn = {}
for b in bxs:
p = b["page_number"] - 1
@ -825,10 +759,7 @@ class RAGFlowPdfParser:
pn[p].append(b)
pn = sorted(pn.items(), key=lambda x: x[0])
imgs = [cropout(arr, ltype, poss) for p, arr in pn]
pic = Image.new("RGB",
(int(np.max([i.size[0] for i in imgs])),
int(np.sum([m.size[1] for m in imgs]))),
(245, 245, 245))
pic = Image.new("RGB", (int(np.max([i.size[0] for i in imgs])), int(np.sum([m.size[1] for m in imgs]))), (245, 245, 245))
height = 0
for img in imgs:
pic.paste(img, (0, int(height)))
@ -848,30 +779,20 @@ class RAGFlowPdfParser:
poss = []
if separate_tables_figures:
figure_results.append(
(cropout(
bxs,
"figure", poss),
[txt]))
figure_results.append((cropout(bxs, "figure", poss), [txt]))
figure_positions.append(poss)
else:
res.append(
(cropout(
bxs,
"figure", poss),
[txt]))
res.append((cropout(bxs, "figure", poss), [txt]))
positions.append(poss)
for k, bxs in tables.items():
if not bxs:
continue
bxs = Recognizer.sort_Y_firstly(bxs, np.mean(
[(b["bottom"] - b["top"]) / 2 for b in bxs]))
bxs = Recognizer.sort_Y_firstly(bxs, np.mean([(b["bottom"] - b["top"]) / 2 for b in bxs]))
poss = []
res.append((cropout(bxs, "table", poss),
self.tbl_det.construct_table(bxs, html=return_html, is_english=self.is_english)))
res.append((cropout(bxs, "table", poss), self.tbl_det.construct_table(bxs, html=return_html, is_english=self.is_english)))
positions.append(poss)
if separate_tables_figures:
@ -905,7 +826,7 @@ class RAGFlowPdfParser:
(r"[0-9]+", 10),
(r"[\(][0-9]+[\)]", 11),
(r"[零一二三四五六七八九十百]+是", 12),
(r"[⚫•➢✓]", 12)
(r"[⚫•➢✓]", 12),
]:
if re.match(p, line):
return j
@ -924,12 +845,9 @@ class RAGFlowPdfParser:
if pn[-1] - 1 >= page_images_cnt:
return ""
return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##" \
.format("-".join([str(p) for p in pn]),
bx["x0"], bx["x1"], top, bott)
return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##".format("-".join([str(p) for p in pn]), bx["x0"], bx["x1"], top, bott)
def __filterout_scraps(self, boxes, ZM):
def width(b):
return b["x1"] - b["x0"]
@ -939,8 +857,7 @@ class RAGFlowPdfParser:
def usefull(b):
if b.get("layout_type"):
return True
if width(
b) > self.page_images[b["page_number"] - 1].size[0] / ZM / 3:
if width(b) > self.page_images[b["page_number"] - 1].size[0] / ZM / 3:
return True
if b["bottom"] - b["top"] > self.mean_height[b["page_number"] - 1]:
return True
@ -952,31 +869,23 @@ class RAGFlowPdfParser:
widths = []
pw = self.page_images[boxes[0]["page_number"] - 1].size[0] / ZM
mh = self.mean_height[boxes[0]["page_number"] - 1]
mj = self.proj_match(
boxes[0]["text"]) or boxes[0].get(
"layout_type",
"") == "title"
mj = self.proj_match(boxes[0]["text"]) or boxes[0].get("layout_type", "") == "title"
def dfs(line, st):
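# Depth-first grouping: absorb subsequent nearby boxes on the same page into the current text run.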
nonlocal mh, pw, lines, widths
lines.append(line)
widths.append(width(line))
mmj = self.proj_match(
line["text"]) or line.get(
"layout_type",
"") == "title"
mmj = self.proj_match(line["text"]) or line.get("layout_type", "") == "title"
for i in range(st + 1, min(st + 20, len(boxes))):
if (boxes[i]["page_number"] - line["page_number"]) > 0:
break
if not mmj and self._y_dis(
line, boxes[i]) >= 3 * mh and height(line) < 1.5 * mh:
if not mmj and self._y_dis(line, boxes[i]) >= 3 * mh and height(line) < 1.5 * mh:
break
if not usefull(boxes[i]):
continue
if mmj or \
(self._x_dis(boxes[i], line) < pw / 10): \
# and abs(width(boxes[i])-width_mean)/max(width(boxes[i]),width_mean)<0.5):
if mmj or (self._x_dis(boxes[i], line) < pw / 10):
# and abs(width(boxes[i])-width_mean)/max(width(boxes[i]),width_mean)<0.5):
# concat following
dfs(boxes[i], i)
boxes.pop(i)
@ -992,11 +901,9 @@ class RAGFlowPdfParser:
boxes.pop(0)
mw = np.mean(widths)
if mj or mw / pw >= 0.35 or mw > 200:
res.append(
"\n".join([c["text"] + self._line_tag(c, ZM) for c in lines]))
res.append("\n".join([c["text"] + self._line_tag(c, ZM) for c in lines]))
else:
logging.debug("REMOVED: " +
"<<".join([c["text"] for c in lines]))
logging.debug("REMOVED: " + "<<".join([c["text"] for c in lines]))
return "\n\n".join(res)
@ -1004,16 +911,14 @@ class RAGFlowPdfParser:
def total_page_number(fnm, binary=None):
try:
with sys.modules[LOCK_KEY_pdfplumber]:
pdf = pdfplumber.open(
fnm) if not binary else pdfplumber.open(BytesIO(binary))
pdf = pdfplumber.open(fnm) if not binary else pdfplumber.open(BytesIO(binary))
total_page = len(pdf.pages)
pdf.close()
return total_page
except Exception:
logging.exception("total_page_number")
def __images__(self, fnm, zoomin=3, page_from=0,
page_to=299, callback=None):
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
self.lefted_chars = []
self.mean_height = []
self.mean_width = []
@ -1025,10 +930,9 @@ class RAGFlowPdfParser:
start = timer()
try:
with sys.modules[LOCK_KEY_pdfplumber]:
with (pdfplumber.open(fnm) if isinstance(fnm, str) else pdfplumber.open(BytesIO(fnm))) as pdf:
with pdfplumber.open(fnm) if isinstance(fnm, str) else pdfplumber.open(BytesIO(fnm)) as pdf:
self.pdf = pdf
self.page_images = [p.to_image(resolution=72 * zoomin, antialias=True).annotated for i, p in
enumerate(self.pdf.pages[page_from:page_to])]
self.page_images = [p.to_image(resolution=72 * zoomin, antialias=True).annotated for i, p in enumerate(self.pdf.pages[page_from:page_to])]
try:
self.page_chars = [[c for c in page.dedupe_chars().chars if self._has_color(c)] for page in self.pdf.pages[page_from:page_to]]
@ -1044,11 +948,11 @@ class RAGFlowPdfParser:
self.outlines = []
try:
with (pdf2_read(fnm if isinstance(fnm, str)
else BytesIO(fnm))) as pdf:
with pdf2_read(fnm if isinstance(fnm, str) else BytesIO(fnm)) as pdf:
self.pdf = pdf
outlines = self.pdf.outline
def dfs(arr, depth):
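# Recursively walk the PDF outline tree, collecting entries with their nesting depth.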
for a in arr:
if isinstance(a, dict):
@ -1065,11 +969,11 @@ class RAGFlowPdfParser:
logging.warning("Miss outlines")
logging.debug("Images converted.")
self.is_english = [re.search(r"[a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join(
random.choices([c["text"] for c in self.page_chars[i]], k=min(100, len(self.page_chars[i]))))) for i in
range(len(self.page_chars))]
if sum([1 if e else 0 for e in self.is_english]) > len(
self.page_images) / 2:
self.is_english = [
re.search(r"[a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join(random.choices([c["text"] for c in self.page_chars[i]], k=min(100, len(self.page_chars[i])))))
for i in range(len(self.page_chars))
]
if sum([1 if e else 0 for e in self.is_english]) > len(self.page_images) / 2:
self.is_english = True
else:
self.is_english = False
@ -1077,10 +981,12 @@ class RAGFlowPdfParser:
async def __img_ocr(i, id, img, chars, limiter):
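# Pre-OCR cleanup: pad a space between adjacent alphanumeric chars that sit visibly apart, so merged words stay separable.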
j = 0
while j + 1 < len(chars):
if chars[j]["text"] and chars[j + 1]["text"] \
and re.match(r"[0-9a-zA-Z,.:;!%]+", chars[j]["text"] + chars[j + 1]["text"]) \
and chars[j + 1]["x0"] - chars[j]["x1"] >= min(chars[j + 1]["width"],
chars[j]["width"]) / 2:
if (
chars[j]["text"]
and chars[j + 1]["text"]
and re.match(r"[0-9a-zA-Z,.:;!%]+", chars[j]["text"] + chars[j + 1]["text"])
and chars[j + 1]["x0"] - chars[j]["x1"] >= min(chars[j + 1]["width"], chars[j]["width"]) / 2
):
chars[j]["text"] += " "
j += 1
@ -1096,12 +1002,8 @@ class RAGFlowPdfParser:
async def __img_ocr_launcher():
def __ocr_preprocess():
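# Seed per-page statistics (median char height/width, cumulative page height) before OCR runs.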
chars = self.page_chars[i] if not self.is_english else []
self.mean_height.append(
np.median(sorted([c["height"] for c in chars])) if chars else 0
)
self.mean_width.append(
np.median(sorted([c["width"] for c in chars])) if chars else 8
)
self.mean_height.append(np.median(sorted([c["height"] for c in chars])) if chars else 0)
self.mean_width.append(np.median(sorted([c["width"] for c in chars])) if chars else 8)
self.page_cum_height.append(img.size[1] / zoomin)
return chars
@ -1110,8 +1012,7 @@ class RAGFlowPdfParser:
for i, img in enumerate(self.page_images):
chars = __ocr_preprocess()
nursery.start_soon(__img_ocr, i, i % PARALLEL_DEVICES, img, chars,
self.parallel_limiter[i % PARALLEL_DEVICES])
nursery.start_soon(__img_ocr, i, i % PARALLEL_DEVICES, img, chars, self.parallel_limiter[i % PARALLEL_DEVICES])
await trio.sleep(0.1)
else:
for i, img in enumerate(self.page_images):
@ -1124,11 +1025,9 @@ class RAGFlowPdfParser:
logging.info(f"__images__ {len(self.page_images)} pages cost {timer() - start}s")
if not self.is_english and not any(
[c for c in self.page_chars]) and self.boxes:
if not self.is_english and not any([c for c in self.page_chars]) and self.boxes:
bxes = [b for bxs in self.boxes for b in bxs]
self.is_english = re.search(r"[\na-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}",
"".join([b["text"] for b in random.choices(bxes, k=min(30, len(bxes)))]))
self.is_english = re.search(r"[\na-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join([b["text"] for b in random.choices(bxes, k=min(30, len(bxes)))]))
logging.debug("Is it English:", self.is_english)
@ -1144,8 +1043,7 @@ class RAGFlowPdfParser:
self._text_merge()
self._concat_downward()
self._filter_forpages()
tbls = self._extract_table_figure(
need_image, zoomin, return_html, False)
tbls = self._extract_table_figure(need_image, zoomin, return_html, False)
return self.__filterout_scraps(deepcopy(self.boxes), zoomin), tbls
def parse_into_bboxes(self, fnm, callback=None, zoomin=3):
@ -1177,11 +1075,11 @@ class RAGFlowPdfParser:
def insert_table_figures(tbls_or_figs, layout_type):
def min_rectangle_distance(rect1, rect2):
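# Planar distance between two page-tagged rectangles (0 when they overlap), offset by 10000 times the page-number difference.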
import math
pn1, left1, right1, top1, bottom1 = rect1
pn2, left2, right2, top2, bottom2 = rect2
if (right1 >= left2 and right2 >= left1 and
bottom1 >= top2 and bottom2 >= top1):
return 0 + (pn1-pn2)*10000
if right1 >= left2 and right2 >= left1 and bottom1 >= top2 and bottom2 >= top1:
return 0 + (pn1 - pn2) * 10000
if right1 < left2:
dx = left2 - right1
elif right2 < left1:
@ -1194,18 +1092,16 @@ class RAGFlowPdfParser:
dy = top1 - bottom2
else:
dy = 0
return math.sqrt(dx*dx + dy*dy) + (pn1-pn2)*10000
return math.sqrt(dx * dx + dy * dy) + (pn1 - pn2) * 10000
for (img, txt), poss in tbls_or_figs:
bboxes = [(i, (b["page_number"], b["x0"], b["x1"], b["top"], b["bottom"])) for i, b in enumerate(self.boxes)]
dists = [(min_rectangle_distance((pn, left, right, top, bott), rect),i) for i, rect in bboxes for pn, left, right, top, bott in poss]
dists = [(min_rectangle_distance((pn, left, right, top, bott), rect), i) for i, rect in bboxes for pn, left, right, top, bott in poss]
min_i = np.argmin(dists, axis=0)[0]
min_i, rect = bboxes[dists[min_i][-1]]
if isinstance(txt, list):
txt = "\n".join(txt)
self.boxes.insert(min_i, {
"page_number": rect[0], "x0": rect[1], "x1": rect[2], "top": rect[3], "bottom": rect[4], "layout_type": layout_type, "text": txt, "image": img
})
self.boxes.insert(min_i, {"page_number": rect[0], "x0": rect[1], "x1": rect[2], "top": rect[3], "bottom": rect[4], "layout_type": layout_type, "text": txt, "image": img})
for b in self.boxes:
b["position_tag"] = self._line_tag(b, zoomin)
@ -1225,12 +1121,9 @@ class RAGFlowPdfParser:
def extract_positions(txt):
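# Parse the @@page\tx0\tx1\ttop\tbottom## tags produced by _line_tag back into numeric position tuples.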
poss = []
for tag in re.findall(r"@@[0-9-]+\t[0-9.\t]+##", txt):
pn, left, right, top, bottom = tag.strip(
"#").strip("@").split("\t")
left, right, top, bottom = float(left), float(
right), float(top), float(bottom)
poss.append(([int(p) - 1 for p in pn.split("-")],
left, right, top, bottom))
pn, left, right, top, bottom = tag.strip("#").strip("@").split("\t")
left, right, top, bottom = float(left), float(right), float(top), float(bottom)
poss.append(([int(p) - 1 for p in pn.split("-")], left, right, top, bottom))
return poss
def crop(self, text, ZM=3, need_position=False):
@ -1241,15 +1134,12 @@ class RAGFlowPdfParser:
return None, None
return
max_width = max(
np.max([right - left for (_, left, right, _, _) in poss]), 6)
max_width = max(np.max([right - left for (_, left, right, _, _) in poss]), 6)
GAP = 6
pos = poss[0]
poss.insert(0, ([pos[0][0]], pos[1], pos[2], max(
0, pos[3] - 120), max(pos[3] - GAP, 0)))
poss.insert(0, ([pos[0][0]], pos[1], pos[2], max(0, pos[3] - 120), max(pos[3] - GAP, 0)))
pos = poss[-1]
poss.append(([pos[0][-1]], pos[1], pos[2], min(self.page_images[pos[0][-1]].size[1] / ZM, pos[4] + GAP),
min(self.page_images[pos[0][-1]].size[1] / ZM, pos[4] + 120)))
poss.append(([pos[0][-1]], pos[1], pos[2], min(self.page_images[pos[0][-1]].size[1] / ZM, pos[4] + GAP), min(self.page_images[pos[0][-1]].size[1] / ZM, pos[4] + 120)))
positions = []
for ii, (pns, left, right, top, bottom) in enumerate(poss):
@ -1257,28 +1147,14 @@ class RAGFlowPdfParser:
bottom *= ZM
for pn in pns[1:]:
bottom += self.page_images[pn - 1].size[1]
imgs.append(
self.page_images[pns[0]].crop((left * ZM, top * ZM,
right *
ZM, min(
bottom, self.page_images[pns[0]].size[1])
))
)
imgs.append(self.page_images[pns[0]].crop((left * ZM, top * ZM, right * ZM, min(bottom, self.page_images[pns[0]].size[1]))))
if 0 < ii < len(poss) - 1:
positions.append((pns[0] + self.page_from, left, right, top, min(
bottom, self.page_images[pns[0]].size[1]) / ZM))
positions.append((pns[0] + self.page_from, left, right, top, min(bottom, self.page_images[pns[0]].size[1]) / ZM))
bottom -= self.page_images[pns[0]].size[1]
for pn in pns[1:]:
imgs.append(
self.page_images[pn].crop((left * ZM, 0,
right * ZM,
min(bottom,
self.page_images[pn].size[1])
))
)
imgs.append(self.page_images[pn].crop((left * ZM, 0, right * ZM, min(bottom, self.page_images[pn].size[1]))))
if 0 < ii < len(poss) - 1:
positions.append((pn + self.page_from, left, right, 0, min(
bottom, self.page_images[pn].size[1]) / ZM))
positions.append((pn + self.page_from, left, right, 0, min(bottom, self.page_images[pn].size[1]) / ZM))
bottom -= self.page_images[pn].size[1]
if not imgs:
@ -1290,14 +1166,12 @@ class RAGFlowPdfParser:
height += img.size[1] + GAP
height = int(height)
width = int(np.max([i.size[0] for i in imgs]))
pic = Image.new("RGB",
(width, height),
(245, 245, 245))
pic = Image.new("RGB", (width, height), (245, 245, 245))
height = 0
for ii, img in enumerate(imgs):
if ii == 0 or ii + 1 == len(imgs):
img = img.convert('RGBA')
overlay = Image.new('RGBA', img.size, (0, 0, 0, 0))
img = img.convert("RGBA")
overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
overlay.putalpha(128)
img = Image.alpha_composite(img, overlay).convert("RGB")
pic.paste(img, (0, int(height)))
@ -1312,14 +1186,12 @@ class RAGFlowPdfParser:
pn = bx["page_number"]
top = bx["top"] - self.page_cum_height[pn - 1]
bott = bx["bottom"] - self.page_cum_height[pn - 1]
poss.append((pn, bx["x0"], bx["x1"], top, min(
bott, self.page_images[pn - 1].size[1] / ZM)))
poss.append((pn, bx["x0"], bx["x1"], top, min(bott, self.page_images[pn - 1].size[1] / ZM)))
while bott * ZM > self.page_images[pn - 1].size[1]:
bott -= self.page_images[pn - 1].size[1] / ZM
top = 0
pn += 1
poss.append((pn, bx["x0"], bx["x1"], top, min(
bott, self.page_images[pn - 1].size[1] / ZM)))
poss.append((pn, bx["x0"], bx["x1"], top, min(bott, self.page_images[pn - 1].size[1] / ZM)))
return poss
@ -1328,9 +1200,7 @@ class PlainParser:
self.outlines = []
lines = []
try:
self.pdf = pdf2_read(
filename if isinstance(
filename, str) else BytesIO(filename))
self.pdf = pdf2_read(filename if isinstance(filename, str) else BytesIO(filename))
for page in self.pdf.pages[from_page:to_page]:
lines.extend([t for t in page.extract_text().split("\n")])
@ -1367,10 +1237,8 @@ class VisionParser(RAGFlowPdfParser):
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
try:
with sys.modules[LOCK_KEY_pdfplumber]:
self.pdf = pdfplumber.open(fnm) if isinstance(
fnm, str) else pdfplumber.open(BytesIO(fnm))
self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
enumerate(self.pdf.pages[page_from:page_to])]
self.pdf = pdfplumber.open(fnm) if isinstance(fnm, str) else pdfplumber.open(BytesIO(fnm))
self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in enumerate(self.pdf.pages[page_from:page_to])]
self.total_page = len(self.pdf.pages)
except Exception:
self.page_images = None
@ -1397,15 +1265,15 @@ class VisionParser(RAGFlowPdfParser):
text = picture_vision_llm_chunk(
binary=img_binary,
vision_model=self.vision_model,
prompt=vision_llm_describe_prompt(page=pdf_page_num+1),
prompt=vision_llm_describe_prompt(page=pdf_page_num + 1),
callback=callback,
)
if kwargs.get("callback"):
kwargs["callback"](idx*1./len(self.page_images), f"Processed: {idx+1}/{len(self.page_images)}")
kwargs["callback"](idx * 1.0 / len(self.page_images), f"Processed: {idx + 1}/{len(self.page_images)}")
if text:
width, height = self.page_images[idx].size
all_docs.append((text, f"{pdf_page_num+1} 0 {width/zoomin} 0 {height/zoomin}"))
all_docs.append((text, f"{pdf_page_num + 1} 0 {width / zoomin} 0 {height / zoomin}"))
return all_docs, []

View File

@ -16,24 +16,28 @@
import io
import sys
import threading
import pdfplumber
from .ocr import OCR
from .recognizer import Recognizer
from .layout_recognizer import AscendLayoutRecognizer
from .layout_recognizer import LayoutRecognizer4YOLOv10 as LayoutRecognizer
from .table_structure_recognizer import TableStructureRecognizer
LOCK_KEY_pdfplumber = "global_shared_lock_pdfplumber"
if LOCK_KEY_pdfplumber not in sys.modules:
sys.modules[LOCK_KEY_pdfplumber] = threading.Lock()
def init_in_out(args):
from PIL import Image
import os
import traceback
from PIL import Image
from api.utils.file_utils import traversal_files
images = []
outputs = []
@ -44,8 +48,7 @@ def init_in_out(args):
nonlocal outputs, images
with sys.modules[LOCK_KEY_pdfplumber]:
pdf = pdfplumber.open(fnm)
images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
enumerate(pdf.pages)]
images = [p.to_image(resolution=72 * zoomin).annotated for i, p in enumerate(pdf.pages)]
for i, page in enumerate(images):
outputs.append(os.path.split(fnm)[-1] + f"_{i}.jpg")
@ -57,10 +60,10 @@ def init_in_out(args):
pdf_pages(fnm)
return
try:
fp = open(fnm, 'rb')
fp = open(fnm, "rb")
binary = fp.read()
fp.close()
images.append(Image.open(io.BytesIO(binary)).convert('RGB'))
images.append(Image.open(io.BytesIO(binary)).convert("RGB"))
outputs.append(os.path.split(fnm)[-1])
except Exception:
traceback.print_exc()
@ -81,6 +84,7 @@ __all__ = [
"OCR",
"Recognizer",
"LayoutRecognizer",
"AscendLayoutRecognizer",
"TableStructureRecognizer",
"init_in_out",
]

View File

@ -14,6 +14,8 @@
# limitations under the License.
#
import logging
import math
import os
import re
from collections import Counter
@ -45,28 +47,22 @@ class LayoutRecognizer(Recognizer):
def __init__(self, domain):
try:
model_dir = os.path.join(
get_project_base_directory(),
"rag/res/deepdoc")
model_dir = os.path.join(get_project_base_directory(), "rag/res/deepdoc")
super().__init__(self.labels, domain, model_dir)
except Exception:
model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False)
model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc", local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"), local_dir_use_symlinks=False)
super().__init__(self.labels, domain, model_dir)
self.garbage_layouts = ["footer", "header", "reference"]
self.client = None
if os.environ.get("TENSORRT_DLA_SVR"):
from deepdoc.vision.dla_cli import DLAClient
self.client = DLAClient(os.environ["TENSORRT_DLA_SVR"])
def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True):
def __is_garbage(b):
patt = [r"^•+$", "^[0-9]{1,2} / ?[0-9]{1,2}$",
r"^[0-9]{1,2} of [0-9]{1,2}$", "^http://[^ ]{12,}",
"\\(cid *: *[0-9]+ *\\)"
]
patt = [r"^•+$", "^[0-9]{1,2} / ?[0-9]{1,2}$", r"^[0-9]{1,2} of [0-9]{1,2}$", "^http://[^ ]{12,}", "\\(cid *: *[0-9]+ *\\)"]
return any([re.search(p, b["text"]) for p in patt])
if self.client:
@ -82,18 +78,23 @@ class LayoutRecognizer(Recognizer):
page_layout = []
for pn, lts in enumerate(layouts):
bxs = ocr_res[pn]
lts = [{"type": b["type"],
lts = [
{
"type": b["type"],
"score": float(b["score"]),
"x0": b["bbox"][0] / scale_factor, "x1": b["bbox"][2] / scale_factor,
"top": b["bbox"][1] / scale_factor, "bottom": b["bbox"][-1] / scale_factor,
"x0": b["bbox"][0] / scale_factor,
"x1": b["bbox"][2] / scale_factor,
"top": b["bbox"][1] / scale_factor,
"bottom": b["bbox"][-1] / scale_factor,
"page_number": pn,
} for b in lts if float(b["score"]) >= 0.4 or b["type"] not in self.garbage_layouts]
lts = self.sort_Y_firstly(lts, np.mean(
[lt["bottom"] - lt["top"] for lt in lts]) / 2)
}
for b in lts
if float(b["score"]) >= 0.4 or b["type"] not in self.garbage_layouts
]
lts = self.sort_Y_firstly(lts, np.mean([lt["bottom"] - lt["top"] for lt in lts]) / 2)
lts = self.layouts_cleanup(bxs, lts)
page_layout.append(lts)
# Tag layout type, layouts are ready
def findLayout(ty):
nonlocal bxs, lts, self
lts_ = [lt for lt in lts if lt["type"] == ty]
@ -106,21 +107,17 @@ class LayoutRecognizer(Recognizer):
bxs.pop(i)
continue
ii = self.find_overlapped_with_threshold(bxs[i], lts_,
thr=0.4)
if ii is None: # belong to nothing
ii = self.find_overlapped_with_threshold(bxs[i], lts_, thr=0.4)
if ii is None:
bxs[i]["layout_type"] = ""
i += 1
continue
lts_[ii]["visited"] = True
keep_feats = [
lts_[
ii]["type"] == "footer" and bxs[i]["bottom"] < image_list[pn].size[1] * 0.9 / scale_factor,
lts_[
ii]["type"] == "header" and bxs[i]["top"] > image_list[pn].size[1] * 0.1 / scale_factor,
lts_[ii]["type"] == "footer" and bxs[i]["bottom"] < image_list[pn].size[1] * 0.9 / scale_factor,
lts_[ii]["type"] == "header" and bxs[i]["top"] > image_list[pn].size[1] * 0.1 / scale_factor,
]
if drop and lts_[
ii]["type"] in self.garbage_layouts and not any(keep_feats):
if drop and lts_[ii]["type"] in self.garbage_layouts and not any(keep_feats):
if lts_[ii]["type"] not in garbages:
garbages[lts_[ii]["type"]] = []
garbages[lts_[ii]["type"]].append(bxs[i]["text"])
@ -128,17 +125,14 @@ class LayoutRecognizer(Recognizer):
continue
bxs[i]["layoutno"] = f"{ty}-{ii}"
bxs[i]["layout_type"] = lts_[ii]["type"] if lts_[
ii]["type"] != "equation" else "figure"
bxs[i]["layout_type"] = lts_[ii]["type"] if lts_[ii]["type"] != "equation" else "figure"
i += 1
for lt in ["footer", "header", "reference", "figure caption",
"table caption", "title", "table", "text", "figure", "equation"]:
for lt in ["footer", "header", "reference", "figure caption", "table caption", "title", "table", "text", "figure", "equation"]:
findLayout(lt)
# add a box for figure layouts that have no text box
for i, lt in enumerate(
[lt for lt in lts if lt["type"] in ["figure", "equation"]]):
for i, lt in enumerate([lt for lt in lts if lt["type"] in ["figure", "equation"]]):
if lt.get("visited"):
continue
lt = deepcopy(lt)
@ -206,13 +200,11 @@ class LayoutRecognizer4YOLOv10(LayoutRecognizer):
img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
top, bottom = int(round(dh - 0.1)) if self.center else 0, int(round(dh + 0.1))
left, right = int(round(dw - 0.1)) if self.center else 0, int(round(dw + 0.1))
img = cv2.copyMakeBorder(
img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)
) # add border
img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)) # add border
img /= 255.0
img = img.transpose(2, 0, 1)
img = img[np.newaxis, :, :, :].astype(np.float32)
inputs.append({self.input_names[0]: img, "scale_factor": [shape[1]/ww, shape[0]/hh, dw, dh]})
inputs.append({self.input_names[0]: img, "scale_factor": [shape[1] / ww, shape[0] / hh, dw, dh]})
return inputs
@ -230,8 +222,7 @@ class LayoutRecognizer4YOLOv10(LayoutRecognizer):
boxes[:, 2] -= inputs["scale_factor"][2]
boxes[:, 1] -= inputs["scale_factor"][3]
boxes[:, 3] -= inputs["scale_factor"][3]
input_shape = np.array([inputs["scale_factor"][0], inputs["scale_factor"][1], inputs["scale_factor"][0],
inputs["scale_factor"][1]])
input_shape = np.array([inputs["scale_factor"][0], inputs["scale_factor"][1], inputs["scale_factor"][0], inputs["scale_factor"][1]])
boxes = np.multiply(boxes, input_shape, dtype=np.float32)
unique_class_ids = np.unique(class_ids)
@ -243,8 +234,223 @@ class LayoutRecognizer4YOLOv10(LayoutRecognizer):
class_keep_boxes = nms(class_boxes, class_scores, 0.45)
indices.extend(class_indices[class_keep_boxes])
return [{
"type": self.label_list[class_ids[i]].lower(),
"bbox": [float(t) for t in boxes[i].tolist()],
"score": float(scores[i])
} for i in indices]
return [{"type": self.label_list[class_ids[i]].lower(), "bbox": [float(t) for t in boxes[i].tolist()], "score": float(scores[i])} for i in indices]
class AscendLayoutRecognizer(Recognizer):
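# Layout recognizer backed by a Huawei Ascend .om model via ais_bench's InferSession; mirrors the ONNX LayoutRecognizer interface.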
labels = [
"title",
"Text",
"Reference",
"Figure",
"Figure caption",
"Table",
"Table caption",
"Table caption",
"Equation",
"Figure caption",
]
def __init__(self, domain):
from ais_bench.infer.interface import InferSession
model_dir = os.path.join(get_project_base_directory(), "rag/res/deepdoc")
model_file_path = os.path.join(model_dir, domain + ".om")
if not os.path.exists(model_file_path):
raise ValueError(f"Model file not found: {model_file_path}")
device_id = int(os.getenv("ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID", 0))
self.session = InferSession(device_id=device_id, model_path=model_file_path)
self.input_shape = self.session.get_inputs()[0].shape[2:4] # H,W
self.garbage_layouts = ["footer", "header", "reference"]
def preprocess(self, image_list):
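# Letterbox each page image to the model's input H x W: resize keeping aspect ratio, pad with (114, 114, 114), scale to [0, 1], NCHW.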
inputs = []
H, W = self.input_shape
for img in image_list:
h, w = img.shape[:2]
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)
r = min(H / h, W / w)
new_unpad = (int(round(w * r)), int(round(h * r)))
dw, dh = (W - new_unpad[0]) / 2.0, (H - new_unpad[1]) / 2.0
img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114))
img /= 255.0
img = img.transpose(2, 0, 1)[np.newaxis, :, :, :].astype(np.float32)
inputs.append(
{
"image": img,
"scale_factor": [w / new_unpad[0], h / new_unpad[1]],
"pad": [dw, dh],
"orig_shape": [h, w],
}
)
return inputs
def postprocess(self, boxes, inputs, thr=0.25):
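# Decode raw detections shaped [x1, y1, x2, y2, score, cls], undo the letterbox padding/scale, and apply per-class NMS.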
arr = np.squeeze(boxes)
if arr.ndim == 1:
arr = arr.reshape(1, -1)
results = []
if arr.shape[1] == 6:
# [x1,y1,x2,y2,score,cls]
m = arr[:, 4] >= thr
arr = arr[m]
if arr.size == 0:
return []
xyxy = arr[:, :4].astype(np.float32)
scores = arr[:, 4].astype(np.float32)
cls_ids = arr[:, 5].astype(np.int32)
if "pad" in inputs:
dw, dh = inputs["pad"]
sx, sy = inputs["scale_factor"]
xyxy[:, [0, 2]] -= dw
xyxy[:, [1, 3]] -= dh
xyxy *= np.array([sx, sy, sx, sy], dtype=np.float32)
else:
# fallback: no padding info in the inputs
sx, sy = inputs["scale_factor"]
xyxy *= np.array([sx, sy, sx, sy], dtype=np.float32)
keep_indices = []
for c in np.unique(cls_ids):
idx = np.where(cls_ids == c)[0]
k = nms(xyxy[idx], scores[idx], 0.45)
keep_indices.extend(idx[k])
for i in keep_indices:
cid = int(cls_ids[i])
if 0 <= cid < len(self.labels):
results.append({"type": self.labels[cid].lower(), "bbox": [float(t) for t in xyxy[i].tolist()], "score": float(scores[i])})
return results
raise ValueError(f"Unexpected output shape: {arr.shape}")
def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True):
import re
from collections import Counter
assert len(image_list) == len(ocr_res)
images = [np.array(im) if not isinstance(im, np.ndarray) else im for im in image_list]
layouts_all_pages = [] # list of list[{"type","score","bbox":[x1,y1,x2,y2]}]
conf_thr = max(thr, 0.08)
batch_loop_cnt = math.ceil(float(len(images)) / batch_size)
for bi in range(batch_loop_cnt):
s = bi * batch_size
e = min((bi + 1) * batch_size, len(images))
batch_images = images[s:e]
inputs_list = self.preprocess(batch_images)
logging.debug("preprocess done")
for ins in inputs_list:
feeds = [ins["image"]]
out_list = self.session.infer(feeds=feeds, mode="static")
for out in out_list:
lts = self.postprocess(out, ins, conf_thr)
page_lts = []
for b in lts:
if float(b["score"]) >= 0.4 or b["type"] not in self.garbage_layouts:
x0, y0, x1, y1 = b["bbox"]
page_lts.append(
{
"type": b["type"],
"score": float(b["score"]),
"x0": float(x0) / scale_factor,
"x1": float(x1) / scale_factor,
"top": float(y0) / scale_factor,
"bottom": float(y1) / scale_factor,
"page_number": len(layouts_all_pages),
}
)
layouts_all_pages.append(page_lts)
def _is_garbage_text(box):
patt = [r"^•+$", r"^[0-9]{1,2} / ?[0-9]{1,2}$", r"^[0-9]{1,2} of [0-9]{1,2}$", r"^http://[^ ]{12,}", r"\(cid *: *[0-9]+ *\)"]
return any(re.search(p, box.get("text", "")) for p in patt)
boxes_out = []
page_layout = []
garbages = {}
for pn, lts in enumerate(layouts_all_pages):
if lts:
avg_h = np.mean([lt["bottom"] - lt["top"] for lt in lts])
lts = self.sort_Y_firstly(lts, avg_h / 2 if avg_h > 0 else 0)
bxs = ocr_res[pn]
lts = self.layouts_cleanup(bxs, lts)
page_layout.append(lts)
def _tag_layout(ty):
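# Assign layout type `ty` to OCR boxes overlapping a detected region of that type; garbage layouts (header/footer/reference) are dropped unless keep_feats says otherwise.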
nonlocal bxs, lts
lts_of_ty = [lt for lt in lts if lt["type"] == ty]
i = 0
while i < len(bxs):
if bxs[i].get("layout_type"):
i += 1
continue
if _is_garbage_text(bxs[i]):
bxs.pop(i)
continue
ii = self.find_overlapped_with_threshold(bxs[i], lts_of_ty, thr=0.4)
if ii is None:
bxs[i]["layout_type"] = ""
i += 1
continue
lts_of_ty[ii]["visited"] = True
keep_feats = [
lts_of_ty[ii]["type"] == "footer" and bxs[i]["bottom"] < image_list[pn].shape[0] * 0.9 / scale_factor,
lts_of_ty[ii]["type"] == "header" and bxs[i]["top"] > image_list[pn].shape[0] * 0.1 / scale_factor,
]
if drop and lts_of_ty[ii]["type"] in self.garbage_layouts and not any(keep_feats):
garbages.setdefault(lts_of_ty[ii]["type"], []).append(bxs[i].get("text", ""))
bxs.pop(i)
continue
bxs[i]["layoutno"] = f"{ty}-{ii}"
bxs[i]["layout_type"] = lts_of_ty[ii]["type"] if lts_of_ty[ii]["type"] != "equation" else "figure"
i += 1
for ty in ["footer", "header", "reference", "figure caption", "table caption", "title", "table", "text", "figure", "equation"]:
_tag_layout(ty)
figs = [lt for lt in lts if lt["type"] in ["figure", "equation"]]
for i, lt in enumerate(figs):
if lt.get("visited"):
continue
lt = deepcopy(lt)
lt.pop("type", None)
lt["text"] = ""
lt["layout_type"] = "figure"
lt["layoutno"] = f"figure-{i}"
bxs.append(lt)
boxes_out.extend(bxs)
garbag_set = set()
for k, lst in garbages.items():
cnt = Counter(lst)
for g, c in cnt.items():
if c > 1:
garbag_set.add(g)
ocr_res_new = [b for b in boxes_out if b["text"].strip() not in garbag_set]
return ocr_res_new, page_layout

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
import gc
import logging
import copy
import time
@ -348,6 +348,13 @@ class TextRecognizer:
return img
def close(self):
# close session and release manually
logging.info('Close TextRecognizer.')
if hasattr(self, "predictor"):
del self.predictor
gc.collect()
def __call__(self, img_list):
img_num = len(img_list)
# Calculate the aspect ratio of all text bars
@ -395,6 +402,9 @@ class TextRecognizer:
return rec_res, time.time() - st
def __del__(self):
self.close()
class TextDetector:
def __init__(self, model_dir, device_id: int | None = None):
@ -479,6 +489,12 @@ class TextDetector:
dt_boxes = np.array(dt_boxes_new)
return dt_boxes
def close(self):
logging.info("Close TextDetector.")
if hasattr(self, "predictor"):
del self.predictor
gc.collect()
def __call__(self, img):
ori_im = img.copy()
data = {'image': img}
@ -508,6 +524,9 @@ class TextDetector:
return dt_boxes, time.time() - st
def __del__(self):
self.close()
class OCR:
def __init__(self, model_dir=None):

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
import gc
import logging
import os
import math
@ -406,6 +406,12 @@ class Recognizer:
"score": float(scores[i])
} for i in indices]
def close(self):
logging.info("Close recognizer.")
if hasattr(self, "ort_sess"):
del self.ort_sess
gc.collect()
def __call__(self, image_list, thr=0.7, batch_size=16):
res = []
images = []
@ -430,5 +436,7 @@ class Recognizer:
return res
def __del__(self):
self.close()

View File

@ -23,6 +23,7 @@ from huggingface_hub import snapshot_download
from api.utils.file_utils import get_project_base_directory
from rag.nlp import rag_tokenizer
from .recognizer import Recognizer
@ -38,31 +39,49 @@ class TableStructureRecognizer(Recognizer):
def __init__(self):
try:
super().__init__(self.labels, "tsr", os.path.join(
get_project_base_directory(),
"rag/res/deepdoc"))
super().__init__(self.labels, "tsr", os.path.join(get_project_base_directory(), "rag/res/deepdoc"))
except Exception:
super().__init__(self.labels, "tsr", snapshot_download(repo_id="InfiniFlow/deepdoc",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False))
super().__init__(
self.labels,
"tsr",
snapshot_download(
repo_id="InfiniFlow/deepdoc",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False,
),
)
def __call__(self, images, thr=0.2):
tbls = super().__call__(images, thr)
table_structure_recognizer_type = os.getenv("TABLE_STRUCTURE_RECOGNIZER_TYPE", "onnx").lower()
if table_structure_recognizer_type not in ["onnx", "ascend"]:
raise RuntimeError("Unsupported table structure recognizer type.")
if table_structure_recognizer_type == "onnx":
logging.debug("Using Onnx table structure recognizer", flush=True)
tbls = super().__call__(images, thr)
else: # ascend
logging.debug("Using Ascend table structure recognizer", flush=True)
tbls = self._run_ascend_tsr(images, thr)
res = []
# align left&right for rows, align top&bottom for columns
for tbl in tbls:
lts = [{"label": b["type"],
lts = [
{
"label": b["type"],
"score": b["score"],
"x0": b["bbox"][0], "x1": b["bbox"][2],
"top": b["bbox"][1], "bottom": b["bbox"][-1]
} for b in tbl]
"x0": b["bbox"][0],
"x1": b["bbox"][2],
"top": b["bbox"][1],
"bottom": b["bbox"][-1],
}
for b in tbl
]
if not lts:
continue
left = [b["x0"] for b in lts if b["label"].find(
"row") > 0 or b["label"].find("header") > 0]
right = [b["x1"] for b in lts if b["label"].find(
"row") > 0 or b["label"].find("header") > 0]
left = [b["x0"] for b in lts if b["label"].find("row") > 0 or b["label"].find("header") > 0]
right = [b["x1"] for b in lts if b["label"].find("row") > 0 or b["label"].find("header") > 0]
if not left:
continue
left = np.mean(left) if len(left) > 4 else np.min(left)
@ -93,11 +112,8 @@ class TableStructureRecognizer(Recognizer):
@staticmethod
def is_caption(bx):
patt = [
r"[图表]+[ 0-9:]{2,}"
]
if any([re.match(p, bx["text"].strip()) for p in patt]) \
or bx.get("layout_type", "").find("caption") >= 0:
patt = [r"[图表]+[ 0-9:]{2,}"]
if any([re.match(p, bx["text"].strip()) for p in patt]) or bx.get("layout_type", "").find("caption") >= 0:
return True
return False
@ -115,7 +131,7 @@ class TableStructureRecognizer(Recognizer):
(r"^[0-9A-Z/\._~-]+$", "Ca"),
(r"^[A-Z]*[a-z' -]+$", "En"),
(r"^[0-9.,+-]+[0-9A-Za-z/$¥%<>()' -]+$", "NE"),
(r"^.{1}$", "Sg")
(r"^.{1}$", "Sg"),
]
for p, n in patt:
if re.search(p, b["text"].strip()):
@ -156,21 +172,19 @@ class TableStructureRecognizer(Recognizer):
rowh = [b["R_bott"] - b["R_top"] for b in boxes if "R" in b]
rowh = np.min(rowh) if rowh else 0
boxes = Recognizer.sort_R_firstly(boxes, rowh / 2)
#for b in boxes:print(b)
# for b in boxes:print(b)
boxes[0]["rn"] = 0
rows = [[boxes[0]]]
btm = boxes[0]["bottom"]
for b in boxes[1:]:
b["rn"] = len(rows) - 1
lst_r = rows[-1]
if lst_r[-1].get("R", "") != b.get("R", "") \
or (b["top"] >= btm - 3 and lst_r[-1].get("R", "-1") != b.get("R", "-2")
): # new row
if lst_r[-1].get("R", "") != b.get("R", "") or (b["top"] >= btm - 3 and lst_r[-1].get("R", "-1") != b.get("R", "-2")): # new row
btm = b["bottom"]
b["rn"] += 1
rows.append([b])
continue
btm = (btm + b["bottom"]) / 2.
btm = (btm + b["bottom"]) / 2.0
rows[-1].append(b)
colwm = [b["C_right"] - b["C_left"] for b in boxes if "C" in b]
@ -186,14 +200,14 @@ class TableStructureRecognizer(Recognizer):
for b in boxes[1:]:
b["cn"] = len(cols) - 1
lst_c = cols[-1]
if (int(b.get("C", "1")) - int(lst_c[-1].get("C", "1")) == 1 and b["page_number"] == lst_c[-1][
"page_number"]) \
or (b["x0"] >= right and lst_c[-1].get("C", "-1") != b.get("C", "-2")): # new col
if (int(b.get("C", "1")) - int(lst_c[-1].get("C", "1")) == 1 and b["page_number"] == lst_c[-1]["page_number"]) or (
b["x0"] >= right and lst_c[-1].get("C", "-1") != b.get("C", "-2")
): # new col
right = b["x1"]
b["cn"] += 1
cols.append([b])
continue
right = (right + b["x1"]) / 2.
right = (right + b["x1"]) / 2.0
cols[-1].append(b)
tbl = [[[] for _ in range(len(cols))] for _ in range(len(rows))]
@ -214,10 +228,8 @@ class TableStructureRecognizer(Recognizer):
if e > 1:
j += 1
continue
f = (j > 0 and tbl[ii][j - 1] and tbl[ii]
[j - 1][0].get("text")) or j == 0
ff = (j + 1 < len(tbl[ii]) and tbl[ii][j + 1] and tbl[ii]
[j + 1][0].get("text")) or j + 1 >= len(tbl[ii])
f = (j > 0 and tbl[ii][j - 1] and tbl[ii][j - 1][0].get("text")) or j == 0
ff = (j + 1 < len(tbl[ii]) and tbl[ii][j + 1] and tbl[ii][j + 1][0].get("text")) or j + 1 >= len(tbl[ii])
if f and ff:
j += 1
continue
@ -228,13 +240,11 @@ class TableStructureRecognizer(Recognizer):
if j > 0 and not f:
for i in range(len(tbl)):
if tbl[i][j - 1]:
left = min(left, np.min(
[bx["x0"] - a["x1"] for a in tbl[i][j - 1]]))
left = min(left, np.min([bx["x0"] - a["x1"] for a in tbl[i][j - 1]]))
if j + 1 < len(tbl[0]) and not ff:
for i in range(len(tbl)):
if tbl[i][j + 1]:
right = min(right, np.min(
[a["x0"] - bx["x1"] for a in tbl[i][j + 1]]))
right = min(right, np.min([a["x0"] - bx["x1"] for a in tbl[i][j + 1]]))
assert left < 100000 or right < 100000
if left < right:
for jj in range(j, len(tbl[0])):
@ -260,8 +270,7 @@ class TableStructureRecognizer(Recognizer):
for i in range(len(tbl)):
tbl[i].pop(j)
cols.pop(j)
assert len(cols) == len(tbl[0]), "Column NO. miss matched: %d vs %d" % (
len(cols), len(tbl[0]))
assert len(cols) == len(tbl[0]), "Column NO. miss matched: %d vs %d" % (len(cols), len(tbl[0]))
if len(cols) >= 4:
# remove single in row
@ -277,10 +286,8 @@ class TableStructureRecognizer(Recognizer):
if e > 1:
i += 1
continue
f = (i > 0 and tbl[i - 1][jj] and tbl[i - 1]
[jj][0].get("text")) or i == 0
ff = (i + 1 < len(tbl) and tbl[i + 1][jj] and tbl[i + 1]
[jj][0].get("text")) or i + 1 >= len(tbl)
f = (i > 0 and tbl[i - 1][jj] and tbl[i - 1][jj][0].get("text")) or i == 0
ff = (i + 1 < len(tbl) and tbl[i + 1][jj] and tbl[i + 1][jj][0].get("text")) or i + 1 >= len(tbl)
if f and ff:
i += 1
continue
@ -292,13 +299,11 @@ class TableStructureRecognizer(Recognizer):
if i > 0 and not f:
for j in range(len(tbl[i - 1])):
if tbl[i - 1][j]:
up = min(up, np.min(
[bx["top"] - a["bottom"] for a in tbl[i - 1][j]]))
up = min(up, np.min([bx["top"] - a["bottom"] for a in tbl[i - 1][j]]))
if i + 1 < len(tbl) and not ff:
for j in range(len(tbl[i + 1])):
if tbl[i + 1][j]:
down = min(down, np.min(
[a["top"] - bx["bottom"] for a in tbl[i + 1][j]]))
down = min(down, np.min([a["top"] - bx["bottom"] for a in tbl[i + 1][j]]))
assert up < 100000 or down < 100000
if up < down:
for ii in range(i, len(tbl)):
@ -333,22 +338,15 @@ class TableStructureRecognizer(Recognizer):
cnt += 1
if max_type == "Nu" and arr[0]["btype"] == "Nu":
continue
if any([a.get("H") for a in arr]) \
or (max_type == "Nu" and arr[0]["btype"] != "Nu"):
if any([a.get("H") for a in arr]) or (max_type == "Nu" and arr[0]["btype"] != "Nu"):
h += 1
if h / cnt > 0.5:
hdset.add(i)
if html:
return TableStructureRecognizer.__html_table(cap, hdset,
TableStructureRecognizer.__cal_spans(boxes, rows,
cols, tbl, True)
)
return TableStructureRecognizer.__html_table(cap, hdset, TableStructureRecognizer.__cal_spans(boxes, rows, cols, tbl, True))
return TableStructureRecognizer.__desc_table(cap, hdset,
TableStructureRecognizer.__cal_spans(boxes, rows, cols, tbl,
False),
is_english)
return TableStructureRecognizer.__desc_table(cap, hdset, TableStructureRecognizer.__cal_spans(boxes, rows, cols, tbl, False), is_english)
@staticmethod
def __html_table(cap, hdset, tbl):
@ -367,10 +365,8 @@ class TableStructureRecognizer(Recognizer):
continue
txt = ""
if arr:
h = min(np.min([c["bottom"] - c["top"]
for c in arr]) / 2, 10)
txt = " ".join([c["text"]
for c in Recognizer.sort_Y_firstly(arr, h)])
h = min(np.min([c["bottom"] - c["top"] for c in arr]) / 2, 10)
txt = " ".join([c["text"] for c in Recognizer.sort_Y_firstly(arr, h)])
txts.append(txt)
sp = ""
if arr[0].get("colspan"):
@ -436,15 +432,11 @@ class TableStructureRecognizer(Recognizer):
if headers[j][k].find(headers[j - 1][k]) >= 0:
continue
if len(headers[j][k]) > len(headers[j - 1][k]):
headers[j][k] += (de if headers[j][k]
else "") + headers[j - 1][k]
headers[j][k] += (de if headers[j][k] else "") + headers[j - 1][k]
else:
headers[j][k] = headers[j - 1][k] \
+ (de if headers[j - 1][k] else "") \
+ headers[j][k]
headers[j][k] = headers[j - 1][k] + (de if headers[j - 1][k] else "") + headers[j][k]
logging.debug(
f">>>>>>>>>>>>>>>>>{cap}SIZE:{rowno}X{clmno} Header: {hdr_rowno}")
logging.debug(f">>>>>>>>>>>>>>>>>{cap}SIZE:{rowno}X{clmno} Header: {hdr_rowno}")
row_txt = []
for i in range(rowno):
if i in hdr_rowno:
@ -503,14 +495,10 @@ class TableStructureRecognizer(Recognizer):
@staticmethod
def __cal_spans(boxes, rows, cols, tbl, html=True):
# caculate span
clft = [np.mean([c.get("C_left", c["x0"]) for c in cln])
for cln in cols]
crgt = [np.mean([c.get("C_right", c["x1"]) for c in cln])
for cln in cols]
rtop = [np.mean([c.get("R_top", c["top"]) for c in row])
for row in rows]
rbtm = [np.mean([c.get("R_btm", c["bottom"])
for c in row]) for row in rows]
clft = [np.mean([c.get("C_left", c["x0"]) for c in cln]) for cln in cols]
crgt = [np.mean([c.get("C_right", c["x1"]) for c in cln]) for cln in cols]
rtop = [np.mean([c.get("R_top", c["top"]) for c in row]) for row in rows]
rbtm = [np.mean([c.get("R_btm", c["bottom"]) for c in row]) for row in rows]
for b in boxes:
if "SP" not in b:
continue
@ -585,3 +573,40 @@ class TableStructureRecognizer(Recognizer):
tbl[rowspan[0]][colspan[0]] = arr
return tbl
def _run_ascend_tsr(self, image_list, thr=0.2, batch_size=16):
import math
from ais_bench.infer.interface import InferSession
model_dir = os.path.join(get_project_base_directory(), "rag/res/deepdoc")
model_file_path = os.path.join(model_dir, "tsr.om")
if not os.path.exists(model_file_path):
raise ValueError(f"Model file not found: {model_file_path}")
device_id = int(os.getenv("ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID", 0))
session = InferSession(device_id=device_id, model_path=model_file_path)
images = [np.array(im) if not isinstance(im, np.ndarray) else im for im in image_list]
results = []
conf_thr = max(thr, 0.08)
batch_loop_cnt = math.ceil(float(len(images)) / batch_size)
for bi in range(batch_loop_cnt):
s = bi * batch_size
e = min((bi + 1) * batch_size, len(images))
batch_images = images[s:e]
inputs_list = self.preprocess(batch_images)
for ins in inputs_list:
feeds = []
if "image" in ins:
feeds.append(ins["image"])
else:
feeds.append(ins[self.input_names[0]])
output_list = session.infer(feeds=feeds, mode="static")
bb = self.postprocess(output_list, ins, conf_thr)
results.append(bb)
return results

View File

@ -507,3 +507,16 @@ All uploaded files are stored in Minio, RAGFlow's object storage solution. For i
You can control the batch size for document parsing and embedding by setting the environment variables `DOC_BULK_SIZE` and `EMBEDDING_BATCH_SIZE`. Increasing these values may improve throughput for large-scale data processing, but will also increase memory usage. Adjust them according to your hardware resources.
---
### How to accelerate the question-answering speed of my chat assistant?
See [here](./guides/chat/best_practices/accelerate_question_answering.mdx).
---
### How to accelerate the question-answering speed of my Agent?
See [here](./guides/agent/best_practices/accelerate_agent_question_answering.md).
---

View File

@ -26,6 +26,84 @@ An **Agent** component is essential when you need the LLM to assist with summari
2. If your Agent involves dataset retrieval, ensure you [have properly configured your target knowledge base(s)](../../dataset/configure_knowledge_base.md).
## Quickstart
### 1. Click on an **Agent** component to show its configuration panel
The corresponding configuration panel appears to the right of the canvas. Use this panel to define and fine-tune the **Agent** component's behavior.
### 2. Select your model
Click **Model**, and select a chat model from the dropdown menu.
:::tip NOTE
If no model appears, check whether you have added a chat model on the **Model providers** page.
:::
### 3. Update system prompt (Optional)
The system prompt typically defines your model's role. You can either keep the system prompt as is or customize it to override the default.
### 4. Update user prompt
The user prompt typically defines your model's task. You will find the `sys.query` variable auto-populated. Type `/` or click **(x)** to view or add variables.
In this quickstart, we assume your **Agent** component is used standalone (without tools or sub-Agents beneath it); in that case, you may also need to specify retrieved chunks using the `formalized_content` variable:
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/standalone_user_prompt_variable.jpg)
### 5. Skip Tools and Agent
The **+ Add tools** and **+ Add agent** sections are used *only* when you need to configure your **Agent** component as a planner (with tools or sub-Agents beneath). In this quickstart, we assume your **Agent** component is used standalone (without tools or sub-Agents beneath).
### 6. Choose the next component
When necessary, click the **+** button on the **Agent** component to choose the next component in the workflow from the dropdown list.
## Connect to an MCP server as a client
:::danger IMPORTANT
In this section, we assume your **Agent** will be configured as a planner, with a Tavily tool beneath it.
:::
### 1. Navigate to the MCP configuration page
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/mcp_page.jpg)
### 2. Configure your Tavily MCP server
Update your MCP server's name, URL (including the API key), server type, and other necessary settings. When configured correctly, the available tools will be displayed.
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/edit_mcp_server.jpg)
### 3. Navigate to your Agent's editing page
### 4. Connect to your MCP server
1. Click **+ Add tools**:
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/add_tools.jpg)
2. Click **MCP** to show the available MCP servers.
3. Select your MCP server:
*The target MCP server appears below your Agent component, and your Agent will autonomously decide when to invoke the available tools it offers.*
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/choose_tavily_mcp_server.jpg)
### 5. Update system prompt to specify trigger conditions (Optional)
To ensure reliable tool calls, you may specify within the system prompt which tasks should trigger each tool call.
### 6. View the available tools of your MCP server
On the canvas, click the newly-populated Tavily server to view and select its available tools:
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/tavily_mcp_server.jpg)
## Configurations
### Model
@ -69,7 +147,7 @@ An **Agent** component relies on keys (variables) to specify its data inputs. It
#### Advanced usage
From v0.20.5 onwards, four framework-level prompt blocks are available in the **System prompt** field. Type `/` or click **(x)** to view them; they appear under the **Framework** entry in the dropdown menu.
From v0.20.5 onwards, four framework-level prompt blocks are available in the **System prompt** field, enabling you to customize and *override* prompts at the framework level. Type `/` or click **(x)** to view them; they appear under the **Framework** entry in the dropdown menu.
- `task_analysis` prompt block
- This block is responsible for analyzing tasks — either a user task or a task assigned by the lead Agent when the **Agent** component is acting as a Sub-Agent.
@ -100,6 +178,12 @@ From v0.20.5 onwards, four framework-level prompt blocks are available in the **
- `citation_guidelines` prompt block
- Reference design: [citation_prompt.md](https://github.com/infiniflow/ragflow/blob/main/rag/prompts/citation_prompt.md)
*The screenshots below show the framework prompt blocks available to an **Agent** component, both as a standalone and as a planner (with a Tavily tool below):*
![standalone](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/standalone_agent_framework_block.jpg)
![planner](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/planner_agent_framework_blocks.jpg)
### User prompt
The user-defined prompt. Defaults to `sys.query`, the user query. As a general rule, when using the **Agent** component as a standalone module (not as a planner), you usually need to specify the corresponding **Retrieval** component's output variable (`formalized_content`) here as part of the input to the LLM.
@ -129,7 +213,7 @@ Defines the maximum number of attempts the agent will make to retry a failed tas
The waiting period in seconds that the agent observes before retrying a failed task, helping to prevent immediate repeated attempts and allowing system conditions to improve. Defaults to 1 second.
### Max rounds
### Max reflection rounds
Defines the maximum number of reflection rounds for the selected chat model. Defaults to 1 round.
@ -145,18 +229,4 @@ The global variable name for the output of the **Agent** component, which can be
### Why does it take so long for my Agent to respond?
An Agent's response time generally depends on two key factors: the LLM's capabilities and the prompt, the latter reflecting task complexity. When using an Agent, you should always balance task demands with the LLM's ability. See [How to balance task complexity with an Agent's performance and speed?](#how-to-balance-task-complexity-with-an-agents-performance-and-speed) for details.
## Best practices
### How to balance task complexity with an Agent's performance and speed?
- For simple tasks, such as retrieval, rewriting, formatting, or structured data extraction, use concise prompts, remove planning or reasoning instructions, enforce output length limits, and select smaller or Turbo-class models. This significantly reduces latency and cost with minimal impact on quality.
- For complex tasks, like multi-step reasoning, cross-document synthesis, or tool-based workflows, maintain or enhance prompts that include planning, reflection, and verification steps.
- In multi-Agent orchestration systems, delegate simple subtasks to sub-Agents using smaller, faster models, and reserve more powerful models for the lead Agent to handle complexity and uncertainty.
:::tip KEY INSIGHT
Focus on minimizing output tokens — through summarization, bullet points, or explicit length limits — as this has far greater impact on reducing latency than optimizing input size.
:::
See [here](../best_practices/accelerate_agent_question_answering.md) for details.

View File

@ -49,6 +49,10 @@ You can specify multiple input sources for the **Code** component. Click **+ Add
This field allows you to enter and edit your source code.
:::danger IMPORTANT
If your code implementation includes defined variables, whether input or output variables, ensure they are also specified in the corresponding **Input** or **Output** sections.
:::
#### A Python code example
```Python
@ -77,6 +81,15 @@ This field allows you to enter and edit your source code.
You define the output variable(s) of the **Code** component here.
:::danger IMPORTANT
If you define output variables here, ensure they are also defined in your code implementation; otherwise, their values will be `null`. The following are two examples:
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/set_object_output.jpg)
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/set_nested_object_output.png)
:::
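In text form, a minimal sketch of what the code side might look like, assuming a `main` entry function whose returned dictionary provides the declared output variables (the function signature and variable names here are illustrative, not prescribed):
```Python
# Illustrative sketch: "cleaned_text" and "word_count" stand in for whatever
# output variables you declare in the Output section; assign each of them in
# the returned dictionary, or their values will be `null`.
def main(input_text: str) -> dict:
    cleaned = input_text.strip().lower()
    return {
        "cleaned_text": cleaned,
        "word_count": len(cleaned.split()),
    }
```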
### Output
The defined output variable(s) will be auto-populated here.

View File

@ -0,0 +1,79 @@
---
sidebar_position: 25
slug: /execute_sql
---
# Execute SQL tool
A tool that executes SQL queries on a specified relational database.
---
The **Execute SQL** tool enables you to connect to a relational database and run SQL queries, whether entered directly or generated by the system's Text2SQL capability via an **Agent** component.
## Prerequisites
- A database instance properly configured and running.
- The database must be one of the following types:
- MySQL
- PostgreSQL
- MariaDB
- Microsoft SQL Server
## Examples
You can pair an **Agent** component with the **Execute SQL** tool, with the **Agent** generating SQL statements and the **Execute SQL** tool handling database connection and query execution. An example of this setup can be found in the **SQL Assistant** Agent template shown below:
![](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/exeSQL.jpg)
## Configurations
### SQL statement
This text input field allows you to write static SQL queries, such as `SELECT * FROM my_table`, and dynamic SQL queries using variables.
:::tip NOTE
Click **(x)** or type `/` to insert variables.
:::
For dynamic queries, you can include variables in your SQL, such as `SELECT * FROM /sys.query`; if an **Agent** component is paired with the **Execute SQL** tool to generate SQL statements (see the [Examples](#examples) section), you can insert that **Agent**'s output, `content`, directly into this field.
### Database type
The supported database type. Currently the following database types are available:
- MySQL
- PostgreSQL
- MariaDB
- Microsoft SQL Server (MSSQL)
### Database
The name of the database to run your queries against.
### Username
The username with access privileges to the database.
### Host
The IP address of the database server.
### Port
The port number on which the database server is listening.
### Password
The password for the database user.
### Max records
The maximum number of records returned by the SQL query to control response size and improve efficiency. Defaults to `1024`.
### Output
The **Execute SQL** tool provides two output variables:
- `formalized_content`: A string. If you reference this variable in a **Message** component, the returned records are displayed as a table.
- `json`: An object array. If you reference this variable in a **Message** component, the returned records will be presented as key-value pairs.

View File

@ -0,0 +1,8 @@
{
"label": "Best practices",
"position": 30,
"link": {
"type": "generated-index",
"description": "Best practices on Agent configuration."
}
}

View File

@ -0,0 +1,58 @@
---
sidebar_position: 1
slug: /accelerate_agent_question_answering
---
# Accelerate answering
A checklist to speed up question answering.
---
Please note that some of your settings may consume a significant amount of time. If you often find that your question answering is time-consuming, here is a checklist to consider:
## Balance task complexity with an Agent's performance and speed
An Agent's response time generally depends on many factors, e.g., the LLM's capabilities and the prompt, the latter reflecting task complexity. When using an Agent, you should always balance task demands with the LLM's ability.
- For simple tasks, such as retrieval, rewriting, formatting, or structured data extraction, use concise prompts, remove planning or reasoning instructions, enforce output length limits, and select smaller or Turbo-class models. This significantly reduces latency and cost with minimal impact on quality.
- For complex tasks, like multi-step reasoning, cross-document synthesis, or tool-based workflows, maintain or enhance prompts that include planning, reflection, and verification steps.
- In multi-Agent orchestration systems, delegate simple subtasks to sub-Agents using smaller, faster models, and reserve more powerful models for the lead Agent to handle complexity and uncertainty.
:::tip KEY INSIGHT
Focus on minimizing output tokens — through summarization, bullet points, or explicit length limits — as this has far greater impact on reducing latency than optimizing input size.
:::
## Disable Reasoning
Disabling the **Reasoning** toggle will reduce the LLM's thinking time. For a model like Qwen3, you also need to add `/no_think` to the system prompt to disable reasoning.
## Disable Rerank model
- Leaving the **Rerank model** field empty (in the corresponding **Retrieval** component) will significantly decrease retrieval time.
- When using a rerank model, ensure you have a GPU for acceleration; otherwise, the reranking process will be *prohibitively* slow.
:::tip NOTE
Please note that rerank models are essential in certain scenarios. There is always a trade-off between speed and performance; you must weigh the pros against cons for your specific case.
:::
## Check the time taken for each task
Click the light bulb icon above the *current* dialogue and scroll down the popup window to view the time taken for each task:
| Item name | Description |
| ----------------- | --------------------------------------------------------------------------------------------- |
| Total | Total time spent on this conversation round, including chunk retrieval and answer generation. |
| Check LLM | Time to validate the specified LLM. |
| Create retriever | Time to create a chunk retriever. |
| Bind embedding | Time to initialize an embedding model instance. |
| Bind LLM | Time to initialize an LLM instance. |
| Tune question | Time to optimize the user query using the context of the multi-turn conversation. |
| Bind reranker | Time to initialize a reranker model instance for chunk retrieval. |
| Generate keywords | Time to extract keywords from the user query. |
| Retrieval | Time to retrieve the chunks. |
| Generate answer | Time to generate the answer. |

View File

@ -6,21 +6,22 @@ slug: /accelerate_question_answering
# Accelerate answering
import APITable from '@site/src/components/APITable';
A checklist to speed up question answering.
A checklist to speed up question answering for your chat assistant.
---
Please note that some of your settings may consume a significant amount of time. If you often find that your question answering is time-consuming, here is a checklist to consider:
- In the **Prompt engine** tab of your **Chat Configuration** dialogue, disabling **Multi-turn optimization** will reduce the time required to get an answer from the LLM.
- In the **Prompt engine** tab of your **Chat Configuration** dialogue, leaving the **Rerank model** field empty will significantly decrease retrieval time.
- Disabling **Multi-turn optimization** will reduce the time required to get an answer from the LLM.
- Leaving the **Rerank model** field empty will significantly decrease retrieval time.
- Disabling the **Reasoning** toggle will reduce the LLM's thinking time. For a model like Qwen3, you also need to add `/no_think` to the system prompt to disable reasoning.
- When using a rerank model, ensure you have a GPU for acceleration; otherwise, the reranking process will be *prohibitively* slow.
:::tip NOTE
Please note that rerank models are essential in certain scenarios. There is always a trade-off between speed and performance; you must weigh the pros against cons for your specific case.
:::
- In the **Assistant settings** tab of your **Chat Configuration** dialogue, disabling **Keyword analysis** will reduce the time to receive an answer from the LLM.
- Disabling **Keyword analysis** will reduce the time to receive an answer from the LLM.
- When chatting with your chat assistant, click the light bulb icon above the *current* dialogue and scroll down the popup window to view the time taken for each task:
![enlighten](https://github.com/user-attachments/assets/fedfa2ee-21a7-451b-be66-20125619923c)

View File

@ -106,7 +106,7 @@ RAGFlow offers HTTP and Python APIs for you to integrate RAGFlow's capabilities
You can use iframe to embed the created chat assistant into a third-party webpage:
1. Before proceeding, you must [acquire an API key](../models/llm_api_key_setup.md); otherwise, an error message would appear.
1. Before proceeding, you must [acquire an API key](../../develop/acquire_ragflow_api_key.md); otherwise, an error message would appear.
2. Hover over an intended chat assistant **>** **Edit** to show the **iframe** window:
![chat-embed](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/embed_chat_into_webpage.jpg)

View File

@ -91,7 +91,7 @@ In RAGFlow, click on your logo on the top right of the page **>** **Model provid
In the popup window, complete basic settings for Ollama:
1. Ensure that your model name and type match those pulled at step 1 (Deploy Ollama using Docker). For example, (`llama3.2` and `chat`) or (`bge-m3` and `embedding`).
2. In Ollama base URL, put the URL you found in step 2 followed by `/v1`, i.e. `http://host.docker.internal:11434/v1`, `http://localhost:11434/v1` or `http://${IP_OF_OLLAMA_MACHINE}:11434/v1`.
2. Enter the Ollama base URL, e.g. `http://host.docker.internal:11434`, `http://localhost:11434`, or `http://${IP_OF_OLLAMA_MACHINE}:11434`.
3. OPTIONAL: Switch on the toggle under **Does it support Vision?** if your model includes an image-to-text model.

View File

@ -31,3 +31,79 @@ You can click on a specific 30-second time interval to view the details of compl
![done_tasks](https://github.com/user-attachments/assets/49b25ec4-03af-48cf-b2e5-c892f6eaa261)
![done_vs_failed](https://github.com/user-attachments/assets/eaa928d0-a31c-4072-adea-046091e04599)
## API Health Check
In addition to checking the system dependencies from the **avatar > System** page in the UI, you can directly query the backend health check endpoint:
```bash
http://IP_OF_YOUR_MACHINE:<port>/v1/system/healthz
```
Here `<port>` refers to the actual port of your backend service (e.g., `7897`, `9222`, etc.).
Key points:
- **No login required** (no `@login_required` decorator)
- Returns results in JSON format
- If all dependencies are healthy → HTTP **200 OK**
- If any dependency fails → HTTP **500 Internal Server Error**
### Example 1: All services healthy (HTTP 200)
```bash
http://127.0.0.1/v1/system/healthz
```
Response:
```http
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 120

{
  "db": "ok",
  "redis": "ok",
  "doc_engine": "ok",
  "storage": "ok",
  "status": "ok"
}
```
Explanation:
- Database (MySQL/Postgres), Redis, document engine (Elasticsearch/Infinity), and object storage (MinIO) are all healthy.
- The `status` field returns `"ok"`.
### Example 2: One service unhealthy (HTTP 500)
For example, if Redis is down:
Response:
```http
HTTP/1.1 500 INTERNAL SERVER ERROR
Content-Type: application/json
Content-Length: 300

{
  "db": "ok",
  "redis": "nok",
  "doc_engine": "ok",
  "storage": "ok",
  "status": "nok",
  "_meta": {
    "redis": {
      "elapsed": "5.2",
      "error": "Lost connection!"
    }
  }
}
```
Explanation:
- `redis` is marked as `"nok"`, with detailed error info under `_meta.redis.error`.
- The overall `status` is `"nok"`, so the endpoint returns 500.
---
This endpoint allows you to monitor RAGFlow's core dependencies programmatically in scripts or external monitoring systems, without relying on the frontend UI.
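As a sketch of such a script, the snippet below queries the endpoint with the `requests` library and exits non-zero when any dependency reports `"nok"`; the address is a placeholder you would replace with your own backend host and port.
```python
# Hypothetical health probe; replace the address with your backend host and port.
import sys
import requests

HEALTHZ_URL = "http://localhost:7897/v1/system/healthz"

try:
    resp = requests.get(HEALTHZ_URL, timeout=5)
    body = resp.json()
except Exception as exc:
    print(f"Health check failed: {exc}")
    sys.exit(1)

if resp.status_code == 200 and body.get("status") == "ok":
    print("All dependencies healthy")
    sys.exit(0)

# On HTTP 500, per-service states are in the body and error details sit under "_meta".
for service, state in body.items():
    if service in ("status", "_meta"):
        continue
    if state == "nok":
        error = body.get("_meta", {}).get(service, {}).get("error", "")
        print(f"{service} is unhealthy: {error}")
sys.exit(1)
```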

View File

@ -1856,7 +1856,7 @@ curl --request POST \
- `false`: Disable highlighting of matched terms (default).
- `"cross_languages"`: (*Body parameter*) `list[string]`
The languages that should be translated into, in order to achieve keywords retrievals in different languages.
- `"metadata_condition"`: (*Body parameter*), `object`
- `"metadata_condition"`: (*Body parameter*), `object`
The metadata condition for filtering chunks.
#### Response
@ -4102,3 +4102,77 @@ Failure:
```
---
### System
---
### Check system health
**GET** `/v1/system/healthz`
Check the health status of RAGFlow's dependencies (database, Redis, document engine, object storage).
#### Request
- Method: GET
- URL: `/v1/system/healthz`
- Headers:
- 'Content-Type: application/json'
(no Authorization required)
##### Request example
```bash
curl --request GET \
     --url http://{address}/v1/system/healthz \
     --header 'Content-Type: application/json'
```
##### Request parameters
- `address`: (*Path parameter*), string
The host and port of the backend service (e.g., `localhost:7897`).
---
#### Responses
- **200 OK** All services healthy
```http
HTTP/1.1 200 OK
Content-Type: application/json
{
"db": "ok",
"redis": "ok",
"doc_engine": "ok",
"storage": "ok",
"status": "ok"
}
```
- **500 Internal Server Error** At least one service unhealthy
```http
HTTP/1.1 500 INTERNAL SERVER ERROR
Content-Type: application/json
{
"db": "ok",
"redis": "nok",
"doc_engine": "ok",
"storage": "ok",
"status": "nok",
"_meta": {
"redis": {
"elapsed": "5.2",
"error": "Lost connection!"
}
}
}
```
Explanation:
- Each service is reported as "ok" or "nok".
- The top-level `status` reflects overall health.
- If any service is "nok", detailed error info appears in `_meta`.

View File

@ -977,7 +977,7 @@ The languages that should be translated into, in order to achieve keywords retri
##### metadata_condition: `dict`
filter condition for meta_fields
filter condition for `meta_fields`.
#### Returns

View File

@ -65,6 +65,7 @@ A complete list of models supported by RAGFlow, which will continue to expand.
| 01.AI | :heavy_check_mark: | | | | | |
| DeepInfra | :heavy_check_mark: | :heavy_check_mark: | | | :heavy_check_mark: | :heavy_check_mark: |
| 302.AI | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | |
| CometAPI | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | |
```mdx-code-block
</APITable>

View File

@ -28,11 +28,11 @@ Released on September 10, 2025.
### Improvements
- Agent Performance Optimized: Improved planning and reflection speed for simple tasks; optimized concurrent tool calls for parallelizable scenarios, significantly reducing overall response time.
- Agent Prompt Framework exposed: Developers can now customize and override framework-level prompts in the system prompt section, enhancing flexibility and control.
- Execute SQL Component Enhanced: Replaced the original variable reference component with a text input field, allowing free-form SQL writing with variable support.
- Chat: Re-enabled Reasoning and Cross-language search.
- Retrieval API Enhanced: Added metadata filtering support to the [Retrieve chunks](https://ragflow.io/docs/dev/http_api_reference#retrieve-chunks) method.
- Agent:
- Agent Performance Optimized: Improves planning and reflection speed for simple tasks; optimizes concurrent tool calls for parallelizable scenarios, significantly reducing overall response time.
- Four framework-level prompt blocks are available in the **System prompt** section, enabling customization and overriding of prompts at the framework level, thereby enhancing flexibility and control. See [here](./guides/agent/agent_component_reference/agent.mdx#system-prompt).
- **Execute SQL** component enhanced: Replaces the original variable reference component with a text input field, allowing users to write free-form SQL queries and reference variables. See [here](./guides/agent/agent_component_reference/execute_sql.md).
- Chat: Re-enables **Reasoning** and **Cross-language search**.
### Added models
@ -44,8 +44,22 @@ Released on September 10, 2025.
### Fixed issues
- Dataset: Deleted files remained searchable.
- Chat: Unable to chat with an Ollama model.
- Agent: Resolved issues including cite toggle failure, task mode requiring dialogue triggers, repeated answers in multi-turn dialogues, and duplicate summarization of parallel execution results.
- Chat: Unable to chat with an Ollama model.
- Agent:
- A **Cite** toggle failure.
- An Agent in task mode still required a dialogue to trigger.
- Repeated answers in multi-turn dialogues.
- Duplicate summarization of parallel execution results.
### API changes
#### HTTP APIs
- Adds a body parameter `"metadata_condition"` to the [Retrieve chunks](./references/http_api_reference.md#retrieve-chunks) method, enabling metadata-based chunk filtering during retrieval. [#9877](https://github.com/infiniflow/ragflow/pull/9877)
#### Python APIs
- Adds a parameter `metadata_condition` to the [Retrieve chunks](./references/python_api_reference.md#retrieve-chunks) method, enabling metadata-based chunk filtering during retrieval. [#9877](https://github.com/infiniflow/ragflow/pull/9877)
## v0.20.4

View File

@ -0,0 +1,222 @@
# Installation Guide for Firecrawl RAGFlow Integration
This guide will help you install and configure the Firecrawl integration plugin for RAGFlow.
## Prerequisites
- RAGFlow instance running (version 0.20.5 or later)
- Python 3.8 or higher
- Firecrawl API key (get one at [firecrawl.dev](https://firecrawl.dev))
## Installation Methods
### Method 1: Manual Installation
1. **Download the plugin**:
```bash
git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl/ragflow-firecrawl-integration
```
2. **Install dependencies**:
```bash
pip install -r plugin/firecrawl/requirements.txt
```
3. **Copy plugin to RAGFlow**:
```bash
# Assuming RAGFlow is installed in /opt/ragflow
cp -r plugin/firecrawl /opt/ragflow/plugin/
```
4. **Restart RAGFlow**:
```bash
# Restart RAGFlow services
docker compose -f /opt/ragflow/docker/docker-compose.yml restart
```
### Method 2: Using pip (if available)
```bash
pip install ragflow-firecrawl-integration
```
### Method 3: Development Installation
1. **Clone the repository**:
```bash
git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl/ragflow-firecrawl-integration
```
2. **Install in development mode**:
```bash
pip install -e .
```
## Configuration
### 1. Get Firecrawl API Key
1. Visit [firecrawl.dev](https://firecrawl.dev)
2. Sign up for a free account
3. Navigate to your dashboard
4. Copy your API key (starts with `fc-`)
### 2. Configure in RAGFlow
1. **Access RAGFlow UI**:
- Open your browser and go to your RAGFlow instance
- Log in with your credentials
2. **Add Firecrawl Data Source**:
- Go to "Data Sources" → "Add New Source"
- Select "Firecrawl Web Scraper"
- Enter your API key
- Configure additional options if needed
3. **Test Connection**:
- Click "Test Connection" to verify your setup
- You should see a success message
## Configuration Options
| Option | Description | Default | Required |
|--------|-------------|---------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |
| `timeout` | Request timeout (seconds) | 30 | No |
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
## Environment Variables
You can also configure the plugin using environment variables:
```bash
export FIRECRAWL_API_KEY="fc-your-api-key-here"
export FIRECRAWL_API_URL="https://api.firecrawl.dev"
export FIRECRAWL_MAX_RETRIES="3"
export FIRECRAWL_TIMEOUT="30"
export FIRECRAWL_RATE_LIMIT_DELAY="1.0"
```
## Verification
### 1. Check Plugin Installation
```bash
# Check if the plugin directory exists
ls -la /opt/ragflow/plugin/firecrawl/
# Should show:
# __init__.py
# firecrawl_connector.py
# firecrawl_config.py
# firecrawl_processor.py
# firecrawl_ui.py
# ragflow_integration.py
# requirements.txt
```
### 2. Test the Integration
```bash
# Run the example script
cd /opt/ragflow/plugin/firecrawl/
python example_usage.py
```
### 3. Check RAGFlow Logs
```bash
# Check RAGFlow server logs
docker logs ragflow-server
# Look for messages like:
# "Firecrawl plugin loaded successfully"
# "Firecrawl data source registered"
```
## Troubleshooting
### Common Issues
1. **Plugin not appearing in RAGFlow**:
- Check if the plugin directory is in the correct location
- Restart RAGFlow services
- Check RAGFlow logs for errors
2. **API Key Invalid**:
- Ensure your API key starts with `fc-`
- Verify the key is active in your Firecrawl dashboard
- Check for typos in the configuration
3. **Connection Timeout**:
- Increase the timeout value in configuration
- Check your network connection
- Verify the API URL is correct
4. **Rate Limiting**:
- Increase the `rate_limit_delay` value
- Reduce the number of concurrent requests
- Check your Firecrawl usage limits
### Debug Mode
Enable debug logging to see detailed information:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
### Check Dependencies
```bash
# Verify all dependencies are installed
pip list | grep -E "(aiohttp|pydantic|requests)"
# Should show:
# aiohttp>=3.8.0
# pydantic>=2.0.0
# requests>=2.28.0
```
## Uninstallation
To remove the plugin:
1. **Remove plugin directory**:
```bash
rm -rf /opt/ragflow/plugin/firecrawl/
```
2. **Restart RAGFlow**:
```bash
docker compose -f /opt/ragflow/docker/docker-compose.yml restart
```
3. **Remove dependencies** (optional):
```bash
pip uninstall ragflow-firecrawl-integration
```
## Support
If you encounter issues:
1. Check the [troubleshooting section](#troubleshooting)
2. Review RAGFlow logs for error messages
3. Verify your Firecrawl API key and configuration
4. Check the [Firecrawl documentation](https://docs.firecrawl.dev)
5. Open an issue in the [Firecrawl repository](https://github.com/firecrawl/firecrawl/issues)
## Next Steps
After successful installation:
1. Read the [README.md](README.md) for usage examples
2. Try scraping a simple URL to test the integration
3. Explore the different scraping options (single URL, crawl, batch)
4. Configure your RAGFlow workflows to use the scraped content

View File

@ -0,0 +1,216 @@
# Firecrawl Integration for RAGFlow
This integration adds [Firecrawl](https://firecrawl.dev)'s powerful web scraping capabilities to [RAGFlow](https://github.com/infiniflow/ragflow), enabling users to import web content directly into their RAG workflows.
## 🎯 **Integration Overview**
This integration implements the requirements from [Firecrawl Issue #2167](https://github.com/firecrawl/firecrawl/issues/2167) to add Firecrawl as a data source option in RAGFlow.
### ✅ **Acceptance Criteria Met**
- **Integration appears as a selectable data source** in RAGFlow's UI
- **Users can input Firecrawl API keys** through RAGFlow's configuration interface
- **Successfully scrapes content** and imports it into RAGFlow's document processing pipeline
- **Handles edge cases** (rate limits, failed requests, malformed content)
- **Includes documentation** and README updates
- **Follows RAGFlow patterns** and coding standards
- **Ready for engineering review**
## 🚀 **Features**
### Core Functionality
- **Single URL Scraping** - Scrape individual web pages
- **Website Crawling** - Crawl entire websites with job management
- **Batch Processing** - Process multiple URLs simultaneously
- **Multiple Output Formats** - Support for markdown, HTML, links, and screenshots
### Integration Features
- **RAGFlow Data Source** - Appears as selectable data source in RAGFlow UI
- **API Configuration** - Secure API key management with validation
- **Content Processing** - Converts Firecrawl output to RAGFlow document format
- **Error Handling** - Comprehensive error handling and retry logic
- **Rate Limiting** - Built-in rate limiting and request throttling
### Quality Assurance
- **Content Cleaning** - Intelligent content cleaning and normalization
- **Metadata Extraction** - Rich metadata extraction and enrichment
- **Document Chunking** - Automatic document chunking for RAG processing
- **Language Detection** - Automatic language detection
- **Validation** - Input validation and error checking
## 📁 **File Structure**
```
intergrations/firecrawl/
├── __init__.py # Package initialization
├── firecrawl_connector.py # API communication with Firecrawl
├── firecrawl_config.py # Configuration management
├── firecrawl_processor.py # Content processing for RAGFlow
├── firecrawl_ui.py # UI components for RAGFlow
├── ragflow_integration.py # Main integration class
├── example_usage.py # Usage examples
├── requirements.txt # Python dependencies
├── README.md # This file
└── INSTALLATION.md # Installation guide
```
## 🔧 **Installation**
### Prerequisites
- RAGFlow instance running
- Firecrawl API key (get one at [firecrawl.dev](https://firecrawl.dev))
### Setup
1. **Get Firecrawl API Key**:
- Visit [firecrawl.dev](https://firecrawl.dev)
- Sign up for a free account
- Copy your API key (starts with `fc-`)
2. **Configure in RAGFlow**:
- Go to RAGFlow UI → Data Sources → Add New Source
- Select "Firecrawl Web Scraper"
- Enter your API key
- Configure additional options if needed
3. **Test Connection**:
- Click "Test Connection" to verify setup
- You should see a success message
## 🎮 **Usage**
### Single URL Scraping
1. Select "Single URL" as scrape type
2. Enter the URL to scrape
3. Choose output formats (markdown recommended for RAG)
4. Start scraping
### Website Crawling
1. Select "Crawl Website" as scrape type
2. Enter the starting URL
3. Set crawl limit (maximum number of pages)
4. Configure extraction options
5. Start crawling
### Batch Processing
1. Select "Batch URLs" as scrape type
2. Enter multiple URLs (one per line)
3. Choose output formats
4. Start batch processing
## 🔧 **Configuration Options**
| Option | Description | Default | Required |
|--------|-------------|---------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |
| `timeout` | Request timeout (seconds) | 30 | No |
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
## 📊 **API Reference**
### RAGFlowFirecrawlIntegration
Main integration class for Firecrawl with RAGFlow.
#### Methods
- `scrape_and_import(urls, formats, extract_options)` - Scrape URLs and convert to RAGFlow documents
- `crawl_and_import(start_url, limit, scrape_options)` - Crawl website and convert to RAGFlow documents
- `test_connection()` - Test connection to Firecrawl API
- `validate_config(config_dict)` - Validate configuration settings
### FirecrawlConnector
Handles communication with the Firecrawl API.
#### Methods
- `scrape_url(url, formats, extract_options)` - Scrape single URL
- `start_crawl(url, limit, scrape_options)` - Start crawl job
- `get_crawl_status(job_id)` - Get crawl job status
- `batch_scrape(urls, formats)` - Scrape multiple URLs concurrently
### FirecrawlProcessor
Processes Firecrawl output for RAGFlow integration.
#### Methods
- `process_content(content)` - Process scraped content into RAGFlow document format
- `process_batch(contents)` - Process multiple scraped contents
- `chunk_content(document, chunk_size, chunk_overlap)` - Chunk document content for RAG processing
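Putting these methods together, the following is a hedged usage sketch mirroring the bundled `example_usage.py`; the API key and target URL are placeholders, and the import assumes you run it from the integration directory.
```python
# Hedged sketch based on the methods listed above; API key and URL are placeholders.
import asyncio
from ragflow_integration import create_firecrawl_integration  # run from the integration directory

async def demo():
    integration = create_firecrawl_integration({"api_key": "fc-your-api-key-here"})
    check = await integration.test_connection()
    if not check["success"]:
        raise SystemExit("Connection failed; check your Firecrawl API key")

    docs = await integration.scrape_and_import(["https://example.com"], formats=["markdown"])
    for doc in docs:
        chunks = integration.processor.chunk_content(doc, chunk_size=1000, chunk_overlap=200)
        print(doc.title, f"({len(chunks)} chunks)")

asyncio.run(demo())
```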
## 🧪 **Testing**
The integration includes comprehensive testing:
```bash
# Run the test suite
cd intergrations/firecrawl
python3 -c "
import sys
sys.path.append('.')
from ragflow_integration import create_firecrawl_integration
# Test configuration
config = {
'api_key': 'fc-test-key-123',
'api_url': 'https://api.firecrawl.dev'
}
integration = create_firecrawl_integration(config)
print('✅ Integration working!')
"
```
## 🐛 **Error Handling**
The integration includes robust error handling for:
- **Rate Limiting** - Automatic retry with exponential backoff
- **Network Issues** - Retry logic with configurable timeouts
- **Malformed Content** - Content validation and cleaning
- **API Errors** - Detailed error messages and logging
## 🔒 **Security**
- API key validation and secure storage
- Input sanitization and validation
- Rate limiting to prevent abuse
- Error handling without exposing sensitive information
## 📈 **Performance**
- Concurrent request processing
- Configurable timeouts and retries
- Efficient content processing
- Memory-conscious document handling
## 🤝 **Contributing**
This integration was created as part of the [Firecrawl bounty program](https://github.com/firecrawl/firecrawl/issues/2167).
### Development
1. Fork the RAGFlow repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 📄 **License**
This integration is licensed under the same license as RAGFlow (Apache 2.0).
## 🆘 **Support**
- **Firecrawl Documentation**: [docs.firecrawl.dev](https://docs.firecrawl.dev)
- **RAGFlow Documentation**: [RAGFlow GitHub](https://github.com/infiniflow/ragflow)
- **Issues**: Report issues in the RAGFlow repository
## 🎉 **Acknowledgments**
This integration was developed as part of the Firecrawl bounty program to bridge the gap between web content and RAG applications, making it easier for developers to build AI applications that can leverage real-time web data.
---
**Ready for RAGFlow Integration!** 🚀
This integration enables RAGFlow users to easily import web content into their knowledge retrieval systems, expanding the ecosystem for both Firecrawl and RAGFlow.

View File

@ -0,0 +1,15 @@
"""
Firecrawl Plugin for RAGFlow
This plugin integrates Firecrawl's web scraping capabilities into RAGFlow,
allowing users to import web content directly into their RAG workflows.
"""
__version__ = "1.0.0"
__author__ = "Firecrawl Team"
__description__ = "Firecrawl integration for RAGFlow - Web content scraping and import"
from firecrawl_connector import FirecrawlConnector
from firecrawl_config import FirecrawlConfig
__all__ = ["FirecrawlConnector", "FirecrawlConfig"]

View File

@ -0,0 +1,261 @@
"""
Example usage of the Firecrawl integration with RAGFlow.
"""
import asyncio
import logging
from .ragflow_integration import RAGFlowFirecrawlIntegration, create_firecrawl_integration
from .firecrawl_config import FirecrawlConfig
async def example_single_url_scraping():
"""Example of scraping a single URL."""
print("=== Single URL Scraping Example ===")
# Configuration
config = {
"api_key": "fc-your-api-key-here", # Replace with your actual API key
"api_url": "https://api.firecrawl.dev",
"max_retries": 3,
"timeout": 30,
"rate_limit_delay": 1.0
}
# Create integration
integration = create_firecrawl_integration(config)
# Test connection
connection_test = await integration.test_connection()
print(f"Connection test: {connection_test}")
if not connection_test["success"]:
print("Connection failed, please check your API key")
return
# Scrape a single URL
urls = ["https://httpbin.org/json"]
documents = await integration.scrape_and_import(urls)
for doc in documents:
print(f"Title: {doc.title}")
print(f"URL: {doc.source_url}")
print(f"Content length: {len(doc.content)}")
print(f"Language: {doc.language}")
print(f"Metadata: {doc.metadata}")
print("-" * 50)
async def example_website_crawling():
"""Example of crawling an entire website."""
print("=== Website Crawling Example ===")
# Configuration
config = {
"api_key": "fc-your-api-key-here", # Replace with your actual API key
"api_url": "https://api.firecrawl.dev",
"max_retries": 3,
"timeout": 30,
"rate_limit_delay": 1.0
}
# Create integration
integration = create_firecrawl_integration(config)
# Crawl a website
start_url = "https://httpbin.org"
documents = await integration.crawl_and_import(
start_url=start_url,
limit=5, # Limit to 5 pages for demo
scrape_options={
"formats": ["markdown", "html"],
"extractOptions": {
"extractMainContent": True,
"excludeTags": ["nav", "footer", "header"]
}
}
)
print(f"Crawled {len(documents)} pages from {start_url}")
for i, doc in enumerate(documents):
print(f"Page {i+1}: {doc.title}")
print(f"URL: {doc.source_url}")
print(f"Content length: {len(doc.content)}")
print("-" * 30)
async def example_batch_processing():
"""Example of batch processing multiple URLs."""
print("=== Batch Processing Example ===")
# Configuration
config = {
"api_key": "fc-your-api-key-here", # Replace with your actual API key
"api_url": "https://api.firecrawl.dev",
"max_retries": 3,
"timeout": 30,
"rate_limit_delay": 1.0
}
# Create integration
integration = create_firecrawl_integration(config)
# Batch scrape multiple URLs
urls = [
"https://httpbin.org/json",
"https://httpbin.org/html",
"https://httpbin.org/xml"
]
documents = await integration.scrape_and_import(
urls=urls,
formats=["markdown", "html"],
extract_options={
"extractMainContent": True,
"excludeTags": ["nav", "footer", "header"]
}
)
print(f"Processed {len(documents)} URLs")
for doc in documents:
print(f"Title: {doc.title}")
print(f"URL: {doc.source_url}")
print(f"Content length: {len(doc.content)}")
# Example of chunking for RAG processing
chunks = integration.processor.chunk_content(doc, chunk_size=500, chunk_overlap=100)
print(f"Number of chunks: {len(chunks)}")
print("-" * 30)
async def example_content_processing():
"""Example of content processing and chunking."""
print("=== Content Processing Example ===")
# Configuration
config = {
"api_key": "fc-your-api-key-here", # Replace with your actual API key
"api_url": "https://api.firecrawl.dev",
"max_retries": 3,
"timeout": 30,
"rate_limit_delay": 1.0
}
# Create integration
integration = create_firecrawl_integration(config)
# Scrape content
urls = ["https://httpbin.org/html"]
documents = await integration.scrape_and_import(urls)
for doc in documents:
print(f"Original document: {doc.title}")
print(f"Content length: {len(doc.content)}")
# Chunk the content
chunks = integration.processor.chunk_content(
doc,
chunk_size=1000,
chunk_overlap=200
)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}:")
print(f" ID: {chunk['id']}")
print(f" Content length: {len(chunk['content'])}")
print(f" Metadata: {chunk['metadata']}")
print()
async def example_error_handling():
"""Example of error handling."""
print("=== Error Handling Example ===")
# Configuration with invalid API key
config = {
"api_key": "invalid-key",
"api_url": "https://api.firecrawl.dev",
"max_retries": 3,
"timeout": 30,
"rate_limit_delay": 1.0
}
# Create integration
integration = create_firecrawl_integration(config)
# Test connection (should fail)
connection_test = await integration.test_connection()
print(f"Connection test with invalid key: {connection_test}")
# Try to scrape (should fail gracefully)
try:
urls = ["https://httpbin.org/json"]
documents = await integration.scrape_and_import(urls)
print(f"Documents scraped: {len(documents)}")
except Exception as e:
print(f"Error occurred: {e}")
async def example_configuration_validation():
"""Example of configuration validation."""
print("=== Configuration Validation Example ===")
# Test various configurations
test_configs = [
{
"api_key": "fc-valid-key",
"api_url": "https://api.firecrawl.dev",
"max_retries": 3,
"timeout": 30,
"rate_limit_delay": 1.0
},
{
"api_key": "invalid-key", # Invalid format
"api_url": "https://api.firecrawl.dev"
},
{
"api_key": "fc-valid-key",
"api_url": "invalid-url", # Invalid URL
"max_retries": 15, # Too high
"timeout": 500, # Too high
"rate_limit_delay": 15.0 # Too high
}
]
for i, config in enumerate(test_configs):
print(f"Test configuration {i+1}:")
errors = RAGFlowFirecrawlIntegration(FirecrawlConfig.from_dict(config)).validate_config(config)
if errors:
print(" Errors found:")
for field, error in errors.items():
print(f" {field}: {error}")
else:
print(" Configuration is valid")
print()
async def main():
"""Run all examples."""
# Set up logging
logging.basicConfig(level=logging.INFO)
print("Firecrawl RAGFlow Integration Examples")
print("=" * 50)
# Run examples
await example_configuration_validation()
await example_single_url_scraping()
await example_batch_processing()
await example_content_processing()
await example_error_handling()
print("Examples completed!")
if __name__ == "__main__":
asyncio.run(main())

View File

@ -0,0 +1,79 @@
"""
Configuration management for Firecrawl integration with RAGFlow.
"""
import os
from typing import Dict, Any
from dataclasses import dataclass
import json
@dataclass
class FirecrawlConfig:
"""Configuration class for Firecrawl integration."""
api_key: str
api_url: str = "https://api.firecrawl.dev"
max_retries: int = 3
timeout: int = 30
rate_limit_delay: float = 1.0
max_concurrent_requests: int = 5
def __post_init__(self):
"""Validate configuration after initialization."""
if not self.api_key:
raise ValueError("Firecrawl API key is required")
if not self.api_key.startswith("fc-"):
raise ValueError("Invalid Firecrawl API key format. Must start with 'fc-'")
if self.max_retries < 1 or self.max_retries > 10:
raise ValueError("Max retries must be between 1 and 10")
if self.timeout < 5 or self.timeout > 300:
raise ValueError("Timeout must be between 5 and 300 seconds")
if self.rate_limit_delay < 0.1 or self.rate_limit_delay > 10.0:
raise ValueError("Rate limit delay must be between 0.1 and 10.0 seconds")
@classmethod
def from_env(cls) -> "FirecrawlConfig":
"""Create configuration from environment variables."""
api_key = os.getenv("FIRECRAWL_API_KEY")
if not api_key:
raise ValueError("FIRECRAWL_API_KEY environment variable not set")
return cls(
api_key=api_key,
api_url=os.getenv("FIRECRAWL_API_URL", "https://api.firecrawl.dev"),
max_retries=int(os.getenv("FIRECRAWL_MAX_RETRIES", "3")),
timeout=int(os.getenv("FIRECRAWL_TIMEOUT", "30")),
rate_limit_delay=float(os.getenv("FIRECRAWL_RATE_LIMIT_DELAY", "1.0")),
max_concurrent_requests=int(os.getenv("FIRECRAWL_MAX_CONCURRENT", "5"))
)
@classmethod
def from_dict(cls, config_dict: Dict[str, Any]) -> "FirecrawlConfig":
"""Create configuration from dictionary."""
return cls(**config_dict)
def to_dict(self) -> Dict[str, Any]:
"""Convert configuration to dictionary."""
return {
"api_key": self.api_key,
"api_url": self.api_url,
"max_retries": self.max_retries,
"timeout": self.timeout,
"rate_limit_delay": self.rate_limit_delay,
"max_concurrent_requests": self.max_concurrent_requests
}
def to_json(self) -> str:
"""Convert configuration to JSON string."""
return json.dumps(self.to_dict(), indent=2)
@classmethod
def from_json(cls, json_str: str) -> "FirecrawlConfig":
"""Create configuration from JSON string."""
config_dict = json.loads(json_str)
return cls.from_dict(config_dict)

View File

@ -0,0 +1,262 @@
"""
Main connector class for integrating Firecrawl with RAGFlow.
"""
import asyncio
import aiohttp
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
import logging
from urllib.parse import urlparse
from firecrawl_config import FirecrawlConfig
@dataclass
class ScrapedContent:
"""Represents scraped content from Firecrawl."""
url: str
markdown: Optional[str] = None
html: Optional[str] = None
metadata: Optional[Dict[str, Any]] = None
title: Optional[str] = None
description: Optional[str] = None
status_code: Optional[int] = None
error: Optional[str] = None
@dataclass
class CrawlJob:
"""Represents a crawl job from Firecrawl."""
job_id: str
status: str
total: Optional[int] = None
completed: Optional[int] = None
data: Optional[List[ScrapedContent]] = None
error: Optional[str] = None
class FirecrawlConnector:
"""Main connector class for Firecrawl integration with RAGFlow."""
def __init__(self, config: FirecrawlConfig):
"""Initialize the Firecrawl connector."""
self.config = config
self.logger = logging.getLogger(__name__)
self.session: Optional[aiohttp.ClientSession] = None
self._rate_limit_semaphore = asyncio.Semaphore(config.max_concurrent_requests)
async def __aenter__(self):
"""Async context manager entry."""
await self._create_session()
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Async context manager exit."""
await self._close_session()
async def _create_session(self):
"""Create aiohttp session with proper headers."""
headers = {
"Authorization": f"Bearer {self.config.api_key}",
"Content-Type": "application/json",
"User-Agent": "RAGFlow-Firecrawl-Plugin/1.0.0"
}
timeout = aiohttp.ClientTimeout(total=self.config.timeout)
self.session = aiohttp.ClientSession(
headers=headers,
timeout=timeout
)
async def _close_session(self):
"""Close aiohttp session."""
if self.session:
await self.session.close()
async def _make_request(self, method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
"""Make HTTP request with rate limiting and retry logic."""
async with self._rate_limit_semaphore:
# Rate limiting
await asyncio.sleep(self.config.rate_limit_delay)
url = f"{self.config.api_url}{endpoint}"
for attempt in range(self.config.max_retries):
try:
async with self.session.request(method, url, **kwargs) as response:
if response.status == 429: # Rate limited
wait_time = 2 ** attempt
self.logger.warning(f"Rate limited, waiting {wait_time}s")
await asyncio.sleep(wait_time)
continue
response.raise_for_status()
return await response.json()
except aiohttp.ClientError as e:
self.logger.error(f"Request failed (attempt {attempt + 1}): {e}")
if attempt == self.config.max_retries - 1:
raise
await asyncio.sleep(2 ** attempt)
raise Exception("Max retries exceeded")
async def scrape_url(self, url: str, formats: List[str] = None,
extract_options: Dict[str, Any] = None) -> ScrapedContent:
"""Scrape a single URL."""
if formats is None:
formats = ["markdown", "html"]
payload = {
"url": url,
"formats": formats
}
if extract_options:
payload["extractOptions"] = extract_options
try:
response = await self._make_request("POST", "/v2/scrape", json=payload)
if not response.get("success"):
return ScrapedContent(url=url, error=response.get("error", "Unknown error"))
data = response.get("data", {})
metadata = data.get("metadata", {})
return ScrapedContent(
url=url,
markdown=data.get("markdown"),
html=data.get("html"),
metadata=metadata,
title=metadata.get("title"),
description=metadata.get("description"),
status_code=metadata.get("statusCode")
)
except Exception as e:
self.logger.error(f"Failed to scrape {url}: {e}")
return ScrapedContent(url=url, error=str(e))
async def start_crawl(self, url: str, limit: int = 100,
scrape_options: Dict[str, Any] = None) -> CrawlJob:
"""Start a crawl job."""
if scrape_options is None:
scrape_options = {"formats": ["markdown", "html"]}
payload = {
"url": url,
"limit": limit,
"scrapeOptions": scrape_options
}
try:
response = await self._make_request("POST", "/v2/crawl", json=payload)
if not response.get("success"):
return CrawlJob(
job_id="",
status="failed",
error=response.get("error", "Unknown error")
)
job_id = response.get("id")
return CrawlJob(job_id=job_id, status="started")
except Exception as e:
self.logger.error(f"Failed to start crawl for {url}: {e}")
return CrawlJob(job_id="", status="failed", error=str(e))
async def get_crawl_status(self, job_id: str) -> CrawlJob:
"""Get the status of a crawl job."""
try:
response = await self._make_request("GET", f"/v2/crawl/{job_id}")
if not response.get("success"):
return CrawlJob(
job_id=job_id,
status="failed",
error=response.get("error", "Unknown error")
)
status = response.get("status", "unknown")
total = response.get("total")
data = response.get("data", [])
# Convert data to ScrapedContent objects
scraped_content = []
for item in data:
metadata = item.get("metadata", {})
scraped_content.append(ScrapedContent(
url=metadata.get("sourceURL", ""),
markdown=item.get("markdown"),
html=item.get("html"),
metadata=metadata,
title=metadata.get("title"),
description=metadata.get("description"),
status_code=metadata.get("statusCode")
))
return CrawlJob(
job_id=job_id,
status=status,
total=total,
completed=len(scraped_content),
data=scraped_content
)
except Exception as e:
self.logger.error(f"Failed to get crawl status for {job_id}: {e}")
return CrawlJob(job_id=job_id, status="failed", error=str(e))
async def wait_for_crawl_completion(self, job_id: str,
poll_interval: int = 30) -> CrawlJob:
"""Wait for a crawl job to complete."""
while True:
job = await self.get_crawl_status(job_id)
if job.status in ["completed", "failed", "cancelled"]:
return job
self.logger.info(f"Crawl {job_id} status: {job.status}")
await asyncio.sleep(poll_interval)
async def batch_scrape(self, urls: List[str],
formats: List[str] = None) -> List[ScrapedContent]:
"""Scrape multiple URLs concurrently."""
if formats is None:
formats = ["markdown", "html"]
tasks = [self.scrape_url(url, formats) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Handle exceptions
processed_results = []
for i, result in enumerate(results):
if isinstance(result, Exception):
processed_results.append(ScrapedContent(
url=urls[i],
error=str(result)
))
else:
processed_results.append(result)
return processed_results
def validate_url(self, url: str) -> bool:
"""Validate if URL is properly formatted."""
try:
result = urlparse(url)
return all([result.scheme, result.netloc])
except Exception:
return False
def extract_domain(self, url: str) -> str:
"""Extract domain from URL."""
try:
return urlparse(url).netloc
except Exception:
return ""

View File

@ -0,0 +1,275 @@
"""
Content processor for converting Firecrawl output to RAGFlow document format.
"""
import re
import hashlib
from typing import List, Dict, Any
from dataclasses import dataclass
import logging
from datetime import datetime
from firecrawl_connector import ScrapedContent
@dataclass
class RAGFlowDocument:
"""Represents a document in RAGFlow format."""
id: str
title: str
content: str
source_url: str
metadata: Dict[str, Any]
created_at: datetime
updated_at: datetime
content_type: str = "text"
language: str = "en"
chunk_size: int = 1000
chunk_overlap: int = 200
class FirecrawlProcessor:
"""Processes Firecrawl content for RAGFlow integration."""
def __init__(self):
"""Initialize the processor."""
self.logger = logging.getLogger(__name__)
def generate_document_id(self, url: str, content: str) -> str:
"""Generate a unique document ID."""
# Create a hash based on URL and content
content_hash = hashlib.md5(f"{url}:{content[:100]}".encode()).hexdigest()
return f"firecrawl_{content_hash}"
def clean_content(self, content: str) -> str:
"""Clean and normalize content."""
if not content:
return ""
# Remove excessive whitespace
content = re.sub(r'\s+', ' ', content)
# Remove HTML tags if present
content = re.sub(r'<[^>]+>', '', content)
# Remove special characters that might cause issues
content = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)\[\]\"\']', '', content)
return content.strip()
def extract_title(self, content: ScrapedContent) -> str:
"""Extract title from scraped content."""
if content.title:
return content.title
if content.metadata and content.metadata.get("title"):
return content.metadata["title"]
# Extract title from markdown if available
if content.markdown:
title_match = re.search(r'^#\s+(.+)$', content.markdown, re.MULTILINE)
if title_match:
return title_match.group(1).strip()
# Fallback to URL
return content.url.split('/')[-1] or content.url
def extract_description(self, content: ScrapedContent) -> str:
"""Extract description from scraped content."""
if content.description:
return content.description
if content.metadata and content.metadata.get("description"):
return content.metadata["description"]
# Extract first paragraph from markdown
if content.markdown:
# Remove headers and get first paragraph
text = re.sub(r'^#+\s+.*$', '', content.markdown, flags=re.MULTILINE)
paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
if paragraphs:
return paragraphs[0][:200] + "..." if len(paragraphs[0]) > 200 else paragraphs[0]
return ""
def extract_language(self, content: ScrapedContent) -> str:
"""Extract language from content metadata."""
if content.metadata and content.metadata.get("language"):
return content.metadata["language"]
# Simple language detection based on common words
if content.markdown:
text = content.markdown.lower()
if any(word in text for word in ["the", "and", "or", "but", "in", "on", "at"]):
return "en"
elif any(word in text for word in ["le", "la", "les", "de", "du", "des"]):
return "fr"
elif any(word in text for word in ["der", "die", "das", "und", "oder"]):
return "de"
elif any(word in text for word in ["el", "la", "los", "las", "de", "del"]):
return "es"
return "en" # Default to English
def create_metadata(self, content: ScrapedContent) -> Dict[str, Any]:
"""Create comprehensive metadata for RAGFlow document."""
metadata = {
"source": "firecrawl",
"url": content.url,
"domain": self.extract_domain(content.url),
"scraped_at": datetime.utcnow().isoformat(),
"status_code": content.status_code,
"content_length": len(content.markdown or ""),
"has_html": bool(content.html),
"has_markdown": bool(content.markdown)
}
# Add original metadata if available
if content.metadata:
metadata.update({
"original_title": content.metadata.get("title"),
"original_description": content.metadata.get("description"),
"original_language": content.metadata.get("language"),
"original_keywords": content.metadata.get("keywords"),
"original_robots": content.metadata.get("robots"),
"og_title": content.metadata.get("ogTitle"),
"og_description": content.metadata.get("ogDescription"),
"og_image": content.metadata.get("ogImage"),
"og_url": content.metadata.get("ogUrl")
})
return metadata
def extract_domain(self, url: str) -> str:
"""Extract domain from URL."""
try:
from urllib.parse import urlparse
return urlparse(url).netloc
except Exception:
return ""
def process_content(self, content: ScrapedContent) -> RAGFlowDocument:
"""Process scraped content into RAGFlow document format."""
if content.error:
raise ValueError(f"Content has error: {content.error}")
# Determine primary content
primary_content = content.markdown or content.html or ""
if not primary_content:
raise ValueError("No content available to process")
# Clean content
cleaned_content = self.clean_content(primary_content)
# Extract metadata
title = self.extract_title(content)
language = self.extract_language(content)
metadata = self.create_metadata(content)
# Generate document ID
doc_id = self.generate_document_id(content.url, cleaned_content)
# Create RAGFlow document
document = RAGFlowDocument(
id=doc_id,
title=title,
content=cleaned_content,
source_url=content.url,
metadata=metadata,
created_at=datetime.utcnow(),
updated_at=datetime.utcnow(),
content_type="text",
language=language
)
return document
def process_batch(self, contents: List[ScrapedContent]) -> List[RAGFlowDocument]:
"""Process multiple scraped contents into RAGFlow documents."""
documents = []
for content in contents:
try:
document = self.process_content(content)
documents.append(document)
except Exception as e:
self.logger.error(f"Failed to process content from {content.url}: {e}")
continue
return documents
def chunk_content(self, document: RAGFlowDocument,
chunk_size: int = 1000,
chunk_overlap: int = 200) -> List[Dict[str, Any]]:
"""Chunk document content for RAG processing."""
content = document.content
chunks = []
if len(content) <= chunk_size:
return [{
"id": f"{document.id}_chunk_0",
"content": content,
"metadata": {
**document.metadata,
"chunk_index": 0,
"total_chunks": 1
}
}]
# Split content into chunks
start = 0
chunk_index = 0
while start < len(content):
end = start + chunk_size
# Try to break at sentence boundary
if end < len(content):
# Look for sentence endings
sentence_end = content.rfind('.', start, end)
if sentence_end > start + chunk_size // 2:
end = sentence_end + 1
chunk_content = content[start:end].strip()
if chunk_content:
chunks.append({
"id": f"{document.id}_chunk_{chunk_index}",
"content": chunk_content,
"metadata": {
**document.metadata,
"chunk_index": chunk_index,
"total_chunks": len(chunks) + 1, # Will be updated
"chunk_start": start,
"chunk_end": end
}
})
chunk_index += 1
# Move start position with overlap
start = end - chunk_overlap
if start >= len(content):
break
# Update total chunks count
for chunk in chunks:
chunk["metadata"]["total_chunks"] = len(chunks)
return chunks
def validate_document(self, document: RAGFlowDocument) -> bool:
"""Validate RAGFlow document."""
if not document.id:
return False
if not document.title:
return False
if not document.content:
return False
if not document.source_url:
return False
return True
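To see how the processor and chunker above fit together, here is a short sketch (not part of the PR) that feeds a hand-built ScrapedContent through process_content and chunk_content; the URL and text are placeholders, and in practice the ScrapedContent would come from FirecrawlConnector.scrape_url.

from firecrawl_connector import ScrapedContent
from firecrawl_processor import FirecrawlProcessor

processor = FirecrawlProcessor()

# Hand-built sample content standing in for a real scrape result
sample = ScrapedContent(
    url="https://example.com/guide",
    markdown="# Example Guide\n\n" + "This is one sentence of body text. " * 80,
    metadata={"title": "Example Guide", "language": "en"},
)

document = processor.process_content(sample)   # raises ValueError on errored/empty content
assert processor.validate_document(document)

chunks = processor.chunk_content(document, chunk_size=500, chunk_overlap=100)
print(document.id, document.title, document.language)
print(len(chunks), chunks[0]["metadata"]["total_chunks"])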

View File

@ -0,0 +1,259 @@
"""
UI components for Firecrawl integration in RAGFlow.
"""
from typing import Dict, Any, List, Optional
from dataclasses import dataclass
@dataclass
class FirecrawlUIComponent:
"""Represents a UI component for Firecrawl integration."""
component_type: str
props: Dict[str, Any]
children: Optional[List['FirecrawlUIComponent']] = None
class FirecrawlUIBuilder:
"""Builder for Firecrawl UI components in RAGFlow."""
@staticmethod
def create_data_source_config() -> Dict[str, Any]:
"""Create configuration for Firecrawl data source."""
return {
"name": "firecrawl",
"display_name": "Firecrawl Web Scraper",
"description": "Import web content using Firecrawl's powerful scraping capabilities",
"icon": "🌐",
"category": "web",
"version": "1.0.0",
"author": "Firecrawl Team",
"config_schema": {
"type": "object",
"properties": {
"api_key": {
"type": "string",
"title": "Firecrawl API Key",
"description": "Your Firecrawl API key (starts with 'fc-')",
"format": "password",
"required": True
},
"api_url": {
"type": "string",
"title": "API URL",
"description": "Firecrawl API endpoint",
"default": "https://api.firecrawl.dev",
"required": False
},
"max_retries": {
"type": "integer",
"title": "Max Retries",
"description": "Maximum number of retry attempts",
"default": 3,
"minimum": 1,
"maximum": 10
},
"timeout": {
"type": "integer",
"title": "Timeout (seconds)",
"description": "Request timeout in seconds",
"default": 30,
"minimum": 5,
"maximum": 300
},
"rate_limit_delay": {
"type": "number",
"title": "Rate Limit Delay",
"description": "Delay between requests in seconds",
"default": 1.0,
"minimum": 0.1,
"maximum": 10.0
}
},
"required": ["api_key"]
}
}
@staticmethod
def create_scraping_form() -> Dict[str, Any]:
"""Create form for scraping configuration."""
return {
"type": "form",
"title": "Firecrawl Web Scraping",
"description": "Configure web scraping parameters",
"fields": [
{
"name": "urls",
"type": "array",
"title": "URLs to Scrape",
"description": "Enter URLs to scrape (one per line)",
"items": {
"type": "string",
"format": "uri"
},
"required": True,
"minItems": 1
},
{
"name": "scrape_type",
"type": "string",
"title": "Scrape Type",
"description": "Choose scraping method",
"enum": ["single", "crawl", "batch"],
"enumNames": ["Single URL", "Crawl Website", "Batch URLs"],
"default": "single",
"required": True
},
{
"name": "formats",
"type": "array",
"title": "Output Formats",
"description": "Select output formats",
"items": {
"type": "string",
"enum": ["markdown", "html", "links", "screenshot"]
},
"default": ["markdown", "html"],
"required": True
},
{
"name": "crawl_limit",
"type": "integer",
"title": "Crawl Limit",
"description": "Maximum number of pages to crawl (for crawl type)",
"default": 100,
"minimum": 1,
"maximum": 1000,
"condition": {
"field": "scrape_type",
"equals": "crawl"
}
},
{
"name": "extract_options",
"type": "object",
"title": "Extraction Options",
"description": "Advanced extraction settings",
"properties": {
"extractMainContent": {
"type": "boolean",
"title": "Extract Main Content Only",
"default": True
},
"excludeTags": {
"type": "array",
"title": "Exclude Tags",
"description": "HTML tags to exclude",
"items": {"type": "string"},
"default": ["nav", "footer", "header", "aside"]
},
"includeTags": {
"type": "array",
"title": "Include Tags",
"description": "HTML tags to include",
"items": {"type": "string"},
"default": ["main", "article", "section", "div", "p"]
}
}
}
]
}
@staticmethod
def create_progress_component() -> Dict[str, Any]:
"""Create progress tracking component."""
return {
"type": "progress",
"title": "Scraping Progress",
"description": "Track the progress of your web scraping job",
"properties": {
"show_percentage": True,
"show_eta": True,
"show_details": True
}
}
@staticmethod
def create_results_view() -> Dict[str, Any]:
"""Create results display component."""
return {
"type": "results",
"title": "Scraping Results",
"description": "View and manage scraped content",
"properties": {
"show_preview": True,
"show_metadata": True,
"allow_editing": True,
"show_chunks": True
}
}
@staticmethod
def create_error_handler() -> Dict[str, Any]:
"""Create error handling component."""
return {
"type": "error_handler",
"title": "Error Handling",
"description": "Handle scraping errors and retries",
"properties": {
"show_retry_button": True,
"show_error_details": True,
"auto_retry": False,
"max_retries": 3
}
}
@staticmethod
def create_validation_rules() -> Dict[str, Any]:
"""Create validation rules for Firecrawl integration."""
return {
"url_validation": {
"pattern": r"^https?://.+",
"message": "URL must start with http:// or https://"
},
"api_key_validation": {
"pattern": r"^fc-[a-zA-Z0-9]+$",
"message": "API key must start with 'fc-' followed by alphanumeric characters"
},
"rate_limit_validation": {
"min": 0.1,
"max": 10.0,
"message": "Rate limit delay must be between 0.1 and 10.0 seconds"
}
}
@staticmethod
def create_help_text() -> Dict[str, str]:
"""Create help text for users."""
return {
"api_key_help": "Get your API key from https://firecrawl.dev. Sign up for a free account to get started.",
"url_help": "Enter the URLs you want to scrape. You can add multiple URLs for batch processing.",
"crawl_help": "Crawling will follow links from the starting URL and scrape all accessible pages within the limit.",
"formats_help": "Choose the output formats you need. Markdown is recommended for RAG processing.",
"extract_help": "Extraction options help filter content to get only the main content without navigation and ads."
}
@staticmethod
def create_ui_schema() -> Dict[str, Any]:
"""Create complete UI schema for Firecrawl integration."""
return {
"version": "1.0.0",
"components": {
"data_source_config": FirecrawlUIBuilder.create_data_source_config(),
"scraping_form": FirecrawlUIBuilder.create_scraping_form(),
"progress_component": FirecrawlUIBuilder.create_progress_component(),
"results_view": FirecrawlUIBuilder.create_results_view(),
"error_handler": FirecrawlUIBuilder.create_error_handler()
},
"validation_rules": FirecrawlUIBuilder.create_validation_rules(),
"help_text": FirecrawlUIBuilder.create_help_text(),
"workflow": [
"configure_data_source",
"setup_scraping_parameters",
"start_scraping_job",
"monitor_progress",
"review_results",
"import_to_ragflow"
]
}
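The builder only returns plain dictionaries, so it can be exercised without RAGFlow at all. A small, illustrative sketch of what a host application might read out of it:

import re

from firecrawl_ui import FirecrawlUIBuilder

schema = FirecrawlUIBuilder.create_ui_schema()
print(schema["workflow"])  # ordered steps, configure_data_source -> import_to_ragflow
print(schema["components"]["data_source_config"]["config_schema"]["required"])  # ['api_key']

# Validation rules are plain regex/range descriptions
rules = FirecrawlUIBuilder.create_validation_rules()
assert re.match(rules["api_key_validation"]["pattern"], "fc-abc123")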

View File

@ -0,0 +1,149 @@
"""
RAGFlow Integration Entry Point for Firecrawl
This file provides the main entry point for the Firecrawl integration with RAGFlow.
It follows RAGFlow's integration patterns and provides the necessary interfaces.
"""
from typing import Dict, Any
import logging
from ragflow_integration import RAGFlowFirecrawlIntegration, create_firecrawl_integration
from firecrawl_ui import FirecrawlUIBuilder
# Set up logging
logger = logging.getLogger(__name__)
class FirecrawlRAGFlowPlugin:
"""
Main plugin class for Firecrawl integration with RAGFlow.
This class provides the interface that RAGFlow expects from integrations.
"""
def __init__(self):
"""Initialize the Firecrawl plugin."""
self.name = "firecrawl"
self.display_name = "Firecrawl Web Scraper"
self.description = "Import web content using Firecrawl's powerful scraping capabilities"
self.version = "1.0.0"
self.author = "Firecrawl Team"
self.category = "web"
self.icon = "🌐"
logger.info(f"Initialized {self.display_name} plugin v{self.version}")
def get_plugin_info(self) -> Dict[str, Any]:
"""Get plugin information for RAGFlow."""
return {
"name": self.name,
"display_name": self.display_name,
"description": self.description,
"version": self.version,
"author": self.author,
"category": self.category,
"icon": self.icon,
"supported_formats": ["markdown", "html", "links", "screenshot"],
"supported_scrape_types": ["single", "crawl", "batch"]
}
def get_config_schema(self) -> Dict[str, Any]:
"""Get configuration schema for RAGFlow."""
return FirecrawlUIBuilder.create_data_source_config()["config_schema"]
def get_ui_schema(self) -> Dict[str, Any]:
"""Get UI schema for RAGFlow."""
return FirecrawlUIBuilder.create_ui_schema()
def validate_config(self, config: Dict[str, Any]) -> Dict[str, Any]:
"""Validate configuration and return any errors."""
try:
integration = create_firecrawl_integration(config)
return integration.validate_config(config)
except Exception as e:
logger.error(f"Configuration validation error: {e}")
return {"general": str(e)}
def test_connection(self, config: Dict[str, Any]) -> Dict[str, Any]:
"""Test connection to Firecrawl API."""
try:
integration = create_firecrawl_integration(config)
# Run the async test_connection method
import asyncio
return asyncio.run(integration.test_connection())
except Exception as e:
logger.error(f"Connection test error: {e}")
return {
"success": False,
"error": str(e),
"message": "Connection test failed"
}
def create_integration(self, config: Dict[str, Any]) -> RAGFlowFirecrawlIntegration:
"""Create and return a Firecrawl integration instance."""
return create_firecrawl_integration(config)
def get_help_text(self) -> Dict[str, str]:
"""Get help text for users."""
return FirecrawlUIBuilder.create_help_text()
def get_validation_rules(self) -> Dict[str, Any]:
"""Get validation rules for configuration."""
return FirecrawlUIBuilder.create_validation_rules()
# RAGFlow integration entry points
def get_plugin() -> FirecrawlRAGFlowPlugin:
"""Get the plugin instance for RAGFlow."""
return FirecrawlRAGFlowPlugin()
def get_integration(config: Dict[str, Any]) -> RAGFlowFirecrawlIntegration:
"""Get an integration instance with the given configuration."""
return create_firecrawl_integration(config)
def get_config_schema() -> Dict[str, Any]:
"""Get the configuration schema."""
return FirecrawlUIBuilder.create_data_source_config()["config_schema"]
def get_ui_schema() -> Dict[str, Any]:
"""Get the UI schema."""
return FirecrawlUIBuilder.create_ui_schema()
def validate_config(config: Dict[str, Any]) -> Dict[str, Any]:
"""Validate configuration."""
try:
integration = create_firecrawl_integration(config)
return integration.validate_config(config)
except Exception as e:
return {"general": str(e)}
def test_connection(config: Dict[str, Any]) -> Dict[str, Any]:
"""Test connection to Firecrawl API."""
try:
        import asyncio
        integration = create_firecrawl_integration(config)
        # test_connection() is a coroutine, so run it here to keep this module-level helper synchronous
        return asyncio.run(integration.test_connection())
except Exception as e:
return {
"success": False,
"error": str(e),
"message": "Connection test failed"
}
# Export main functions and classes
__all__ = [
"FirecrawlRAGFlowPlugin",
"get_plugin",
"get_integration",
"get_config_schema",
"get_ui_schema",
"validate_config",
"test_connection",
"RAGFlowFirecrawlIntegration",
"create_firecrawl_integration"
]
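Assuming the package is importable as integration (it lives under intergrations/firecrawl/ in this PR), a host application could probe these entry points roughly as below; the API key is a placeholder.

from integration import get_plugin, get_config_schema, validate_config

plugin = get_plugin()
info = plugin.get_plugin_info()
print(info["name"], info["version"], info["supported_scrape_types"])

print(sorted(get_config_schema()["properties"]))   # api_key, api_url, max_retries, ...

# Returns an empty dict when the config passes validation, otherwise field -> error message
print(validate_config({"api_key": "fc-demo-key"}))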

View File

@ -0,0 +1,175 @@
"""
Main integration file for Firecrawl with RAGFlow.
This file provides the interface between RAGFlow and the Firecrawl plugin.
"""
import logging
from typing import List, Dict, Any
from firecrawl_connector import FirecrawlConnector
from firecrawl_config import FirecrawlConfig
from firecrawl_processor import FirecrawlProcessor, RAGFlowDocument
from firecrawl_ui import FirecrawlUIBuilder
class RAGFlowFirecrawlIntegration:
"""Main integration class for Firecrawl with RAGFlow."""
def __init__(self, config: FirecrawlConfig):
"""Initialize the integration."""
self.config = config
self.connector = FirecrawlConnector(config)
self.processor = FirecrawlProcessor()
self.logger = logging.getLogger(__name__)
async def scrape_and_import(self, urls: List[str],
formats: List[str] = None,
extract_options: Dict[str, Any] = None) -> List[RAGFlowDocument]:
"""Scrape URLs and convert to RAGFlow documents."""
if formats is None:
formats = ["markdown", "html"]
async with self.connector:
# Scrape URLs
scraped_contents = await self.connector.batch_scrape(urls, formats)
# Process into RAGFlow documents
documents = self.processor.process_batch(scraped_contents)
return documents
async def crawl_and_import(self, start_url: str,
limit: int = 100,
scrape_options: Dict[str, Any] = None) -> List[RAGFlowDocument]:
"""Crawl a website and convert to RAGFlow documents."""
if scrape_options is None:
scrape_options = {"formats": ["markdown", "html"]}
async with self.connector:
# Start crawl job
crawl_job = await self.connector.start_crawl(start_url, limit, scrape_options)
if crawl_job.error:
raise Exception(f"Failed to start crawl: {crawl_job.error}")
# Wait for completion
completed_job = await self.connector.wait_for_crawl_completion(crawl_job.job_id)
if completed_job.error:
raise Exception(f"Crawl failed: {completed_job.error}")
# Process into RAGFlow documents
documents = self.processor.process_batch(completed_job.data or [])
return documents
def get_ui_schema(self) -> Dict[str, Any]:
"""Get UI schema for RAGFlow integration."""
return FirecrawlUIBuilder.create_ui_schema()
def validate_config(self, config_dict: Dict[str, Any]) -> Dict[str, Any]:
"""Validate configuration and return any errors."""
errors = {}
# Validate API key
api_key = config_dict.get("api_key", "")
if not api_key:
errors["api_key"] = "API key is required"
elif not api_key.startswith("fc-"):
errors["api_key"] = "API key must start with 'fc-'"
# Validate API URL
api_url = config_dict.get("api_url", "https://api.firecrawl.dev")
if not api_url.startswith("http"):
errors["api_url"] = "API URL must start with http:// or https://"
# Validate numeric fields
try:
max_retries = int(config_dict.get("max_retries", 3))
if max_retries < 1 or max_retries > 10:
errors["max_retries"] = "Max retries must be between 1 and 10"
except (ValueError, TypeError):
errors["max_retries"] = "Max retries must be a valid integer"
try:
timeout = int(config_dict.get("timeout", 30))
if timeout < 5 or timeout > 300:
errors["timeout"] = "Timeout must be between 5 and 300 seconds"
except (ValueError, TypeError):
errors["timeout"] = "Timeout must be a valid integer"
try:
rate_limit_delay = float(config_dict.get("rate_limit_delay", 1.0))
if rate_limit_delay < 0.1 or rate_limit_delay > 10.0:
errors["rate_limit_delay"] = "Rate limit delay must be between 0.1 and 10.0 seconds"
except (ValueError, TypeError):
errors["rate_limit_delay"] = "Rate limit delay must be a valid number"
return errors
def create_config(self, config_dict: Dict[str, Any]) -> FirecrawlConfig:
"""Create FirecrawlConfig from dictionary."""
return FirecrawlConfig.from_dict(config_dict)
async def test_connection(self) -> Dict[str, Any]:
"""Test the connection to Firecrawl API."""
try:
async with self.connector:
# Try to scrape a simple URL to test connection
test_url = "https://httpbin.org/json"
result = await self.connector.scrape_url(test_url, ["markdown"])
if result.error:
return {
"success": False,
"error": result.error,
"message": "Failed to connect to Firecrawl API"
}
return {
"success": True,
"message": "Successfully connected to Firecrawl API",
"test_url": test_url,
"response_time": "N/A" # Could be enhanced to measure actual response time
}
except Exception as e:
return {
"success": False,
"error": str(e),
"message": "Connection test failed"
}
def get_supported_formats(self) -> List[str]:
"""Get list of supported output formats."""
return ["markdown", "html", "links", "screenshot"]
def get_supported_scrape_types(self) -> List[str]:
"""Get list of supported scrape types."""
return ["single", "crawl", "batch"]
def get_help_text(self) -> Dict[str, str]:
"""Get help text for users."""
return FirecrawlUIBuilder.create_help_text()
def get_validation_rules(self) -> Dict[str, Any]:
"""Get validation rules for configuration."""
return FirecrawlUIBuilder.create_validation_rules()
# Factory function for creating integration instance
def create_firecrawl_integration(config_dict: Dict[str, Any]) -> RAGFlowFirecrawlIntegration:
"""Create a Firecrawl integration instance from configuration."""
config = FirecrawlConfig.from_dict(config_dict)
return RAGFlowFirecrawlIntegration(config)
# Export main classes and functions
__all__ = [
"RAGFlowFirecrawlIntegration",
"create_firecrawl_integration",
"FirecrawlConfig",
"FirecrawlConnector",
"FirecrawlProcessor",
"RAGFlowDocument"
]
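Tying the pieces together, a hedged end-to-end sketch (not part of the PR): build the integration from a plain config dict, validate it, then run scrape_and_import with asyncio.run. The key and URLs are placeholders, and a real run needs a valid Firecrawl account.

import asyncio

from ragflow_integration import create_firecrawl_integration

config = {
    "api_key": "fc-your-key-here",          # placeholder
    "api_url": "https://api.firecrawl.dev",
    "max_retries": 3,
    "timeout": 30,
    "rate_limit_delay": 1.0,
}

integration = create_firecrawl_integration(config)
errors = integration.validate_config(config)
if errors:
    raise SystemExit(f"Invalid Firecrawl config: {errors}")

# Scrape two pages and convert them into RAGFlowDocument objects
documents = asyncio.run(
    integration.scrape_and_import(
        ["https://example.com", "https://example.com/about"],
        formats=["markdown"],
    )
)
for doc in documents:
    print(doc.id, doc.title, len(doc.content))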

View File

@ -0,0 +1,31 @@
# Firecrawl Plugin for RAGFlow - Dependencies
# Core dependencies
aiohttp>=3.8.0
asyncio-throttle>=1.0.0
# Data processing
pydantic>=2.0.0
python-dateutil>=2.8.0
# HTTP and networking
urllib3>=1.26.0
requests>=2.28.0
# Logging and monitoring
structlog>=22.0.0
# Optional: For advanced content processing
beautifulsoup4>=4.11.0
lxml>=4.9.0
html2text>=2020.1.16
# Optional: For enhanced error handling
tenacity>=8.0.0
# Development dependencies (optional)
pytest>=7.0.0
pytest-asyncio>=0.21.0
black>=22.0.0
flake8>=5.0.0
mypy>=1.0.0

View File

@ -507,16 +507,29 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
markdown_parser = Markdown(int(parser_config.get("chunk_token_num", 128)))
sections, tables = markdown_parser(filename, binary, separate_tables=False)
# Process images for each section
section_images = []
for section_text, _ in sections:
images = markdown_parser.get_pictures(section_text) if section_text else None
if images:
# If multiple images found, combine them using concat_img
combined_image = reduce(concat_img, images) if len(images) > 1 else images[0]
section_images.append(combined_image)
else:
section_images.append(None)
try:
vision_model = LLMBundle(kwargs["tenant_id"], LLMType.IMAGE2TEXT)
callback(0.2, "Visual model detected. Attempting to enhance figure extraction...")
except Exception:
vision_model = None
if vision_model:
# Process images for each section
section_images = []
for idx, (section_text, _) in enumerate(sections):
images = markdown_parser.get_pictures(section_text) if section_text else None
if images:
# If multiple images found, combine them using concat_img
combined_image = reduce(concat_img, images) if len(images) > 1 else images[0]
section_images.append(combined_image)
markdown_vision_parser = VisionFigureParser(vision_model=vision_model, figures_data= [((combined_image, ["markdown image"]), [(0, 0, 0, 0, 0)])], **kwargs)
boosted_figures = markdown_vision_parser(callback=callback)
sections[idx] = (section_text + "\n\n" + "\n\n".join([fig[0][1] for fig in boosted_figures]), sections[idx][1])
else:
section_images.append(None)
else:
logging.warning("No visual model detected. Skipping figure parsing enhancement.")
res = tokenize_table(tables, doc, is_english)
callback(0.8, "Finish parsing.")

View File

@ -138,6 +138,8 @@ def label_question(question, kbs):
else:
all_tags = json.loads(all_tags)
tag_kbs = KnowledgebaseService.get_by_ids(tag_kb_ids)
if not tag_kbs:
return tags
tags = settings.retrievaler.tag_query(question,
list(set([kb.tenant_id for kb in tag_kbs])),
tag_kb_ids,

View File

@ -12,10 +12,13 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import io
import logging
import random
import trio
import numpy as np
from PIL import Image
from api.db import LLMType
from api.db.services.llm_service import LLMBundle
@ -43,17 +46,24 @@ class ParserParam(ProcessParamBase):
"json",
],
"ppt": [],
"image": [],
"image": [
"text"
],
"email": [],
"text": [],
"audio": [],
"text": [
"text",
"json"
],
"audio": [
"json"
],
"video": [],
}
self.setups = {
"pdf": {
"parse_method": "deepdoc", # deepdoc/plain_text/vlm
"vlm_name": "",
"llm_id": "",
"lang": "Chinese",
"suffix": [
"pdf",
@ -81,11 +91,39 @@ class ParserParam(ProcessParamBase):
},
"ppt": {},
"image": {
"parse_method": "ocr",
"parse_method": ["ocr", "vlm"],
"llm_id": "",
"lang": "Chinese",
"suffix": ["jpg", "jpeg", "png", "gif"],
"output_format": "json",
},
"email": {},
"text": {},
"audio": {},
"text": {
"suffix": [
"txt"
],
"output_format": "json",
},
"audio": {
"suffix":[
"da",
"wave",
"wav",
"mp3",
"aac",
"flac",
"ogg",
"aiff",
"au",
"midi",
"wma",
"realaudio",
"vqf",
"oggvorbis",
"ape"
],
"output_format": "json",
},
"video": {},
}
@ -96,7 +134,7 @@ class ParserParam(ProcessParamBase):
self.check_valid_value(pdf_parse_method.lower(), "Parse method abnormal.", ["deepdoc", "plain_text", "vlm"])
if pdf_parse_method not in ["deepdoc", "plain_text"]:
self.check_empty(pdf_config.get("vlm_name"), "VLM")
self.check_empty(pdf_config.get("llm_id"), "VLM")
pdf_language = pdf_config.get("lang", "")
self.check_empty(pdf_language, "Language")
@ -117,7 +155,23 @@ class ParserParam(ProcessParamBase):
image_config = self.setups.get("image", "")
if image_config:
image_parse_method = image_config.get("parse_method", "")
self.check_valid_value(image_parse_method.lower(), "Parse method abnormal.", ["ocr"])
self.check_valid_value(image_parse_method.lower(), "Parse method abnormal.", ["ocr", "vlm"])
if image_parse_method not in ["ocr"]:
self.check_empty(image_config.get("llm_id"), "VLM")
image_language = image_config.get("lang", "")
self.check_empty(image_language, "Language")
text_config = self.setups.get("text", "")
if text_config:
text_output_format = text_config.get("output_format", "")
self.check_valid_value(text_output_format, "Text output format abnormal.", self.allowed_output_format["text"])
audio_config = self.setups.get("audio", "")
if audio_config:
self.check_empty(audio_config.get("llm_id"), "VLM")
audio_language = audio_config.get("lang", "")
self.check_empty(audio_language, "Language")
def get_input_form(self) -> dict[str, dict]:
return {}
@ -139,8 +193,8 @@ class Parser(ProcessBase):
lines, _ = PlainParser()(blob)
bboxes = [{"text": t} for t, _ in lines]
else:
assert conf.get("vlm_name")
vision_model = LLMBundle(self._canvas._tenant_id, LLMType.IMAGE2TEXT, llm_name=conf.get("vlm_name"), lang=self._param.setups["pdf"].get("lang"))
assert conf.get("llm_id")
vision_model = LLMBundle(self._canvas._tenant_id, LLMType.IMAGE2TEXT, llm_name=conf.get("llm_id"), lang=self._param.setups["pdf"].get("lang"))
lines, _ = VisionParser(vision_model=vision_model)(blob, callback=self.callback)
bboxes = []
for t, poss in lines:
@ -208,15 +262,13 @@ class Parser(ProcessBase):
from rag.app.naive import Markdown as naive_markdown_parser
from rag.nlp import concat_img
self.callback(random.randint(1, 5) / 100.0, "Start to work on a Word Processor Document")
self.callback(random.randint(1, 5) / 100.0, "Start to work on a markdown.")
blob = from_upstream.blob
name = from_upstream.name
conf = self._param.setups["markdown"]
self.set_output("output_format", conf["output_format"])
print("markdown {conf=}", flush=True)
markdown_parser = naive_markdown_parser()
sections, tables = markdown_parser(name, blob, separate_tables=False)
@ -240,13 +292,87 @@ class Parser(ProcessBase):
self.set_output("json", json_results)
def _text(self, from_upstream: ParserFromUpstream):
from deepdoc.parser.utils import get_text
self.callback(random.randint(1, 5) / 100.0, "Start to work on a text.")
blob = from_upstream.blob
name = from_upstream.name
conf = self._param.setups["text"]
self.set_output("output_format", conf["output_format"])
# parse binary to text
text_content = get_text(name, binary=blob)
if conf.get("output_format") == "json":
result = [{"text": text_content}]
self.set_output("json", result)
else:
result = text_content
self.set_output("text", result)
def _image(self, from_upstream: ParserFromUpstream):
from deepdoc.vision import OCR
self.callback(random.randint(1, 5) / 100.0, "Start to work on an image.")
blob = from_upstream.blob
conf = self._param.setups["image"]
self.set_output("output_format", conf["output_format"])
img = Image.open(io.BytesIO(blob)).convert("RGB")
lang = conf["lang"]
if conf["parse_method"] == "ocr":
# use ocr, recognize chars only
ocr = OCR()
bxs = ocr(np.array(img)) # return boxes and recognize result
txt = "\n".join([t[0] for _, t in bxs if t[0]])
else:
# use VLM to describe the picture
cv_model = LLMBundle(self._canvas.get_tenant_id(), LLMType.IMAGE2TEXT, llm_name=conf["llm_id"],lang=lang)
img_binary = io.BytesIO()
img.save(img_binary, format="JPEG")
img_binary.seek(0)
txt = cv_model.describe(img_binary.read())
self.set_output("text", txt)
def _audio(self, from_upstream: ParserFromUpstream):
import os
import tempfile
self.callback(random.randint(1, 5) / 100.0, "Start to work on an audio.")
blob = from_upstream.blob
name = from_upstream.name
conf = self._param.setups["audio"]
self.set_output("output_format", conf["output_format"])
lang = conf["lang"]
_, ext = os.path.splitext(name)
tmp_path = ""
with tempfile.NamedTemporaryFile(suffix=ext) as tmpf:
tmpf.write(blob)
tmpf.flush()
tmp_path = os.path.abspath(tmpf.name)
seq2txt_mdl = LLMBundle(self._canvas.get_tenant_id(), LLMType.SPEECH2TEXT, lang=lang)
txt = seq2txt_mdl.transcription(tmp_path)
self.set_output("text", txt)
async def _invoke(self, **kwargs):
function_map = {
"pdf": self._pdf,
"markdown": self._markdown,
"spreadsheet": self._spreadsheet,
"word": self._word
"word": self._word,
"text": self._text,
"image": self._image,
"audio": self._audio,
}
try:
from_upstream = ParserFromUpstream.model_validate(kwargs)

View File

@ -44,9 +44,46 @@
"markdown"
],
"output_format": "json"
},
"text": {
"suffix": ["txt"],
"output_format": "json"
},
"image": {
"parse_method": "vlm",
"llm_id":"glm-4.5v",
"lang": "Chinese",
"suffix": [
"jpg",
"jpeg",
"png",
"gif"
],
"output_format": "text"
},
"audio": {
"suffix": [
"da",
"wave",
"wav",
"mp3",
"aac",
"flac",
"ogg",
"aiff",
"au",
"midi",
"wma",
"realaudio",
"vqf",
"oggvorbis",
"ape"
],
"lang": "Chinese",
"llm_id": "SenseVoiceSmall",
"output_format": "json"
}
}
}
}
},
"downstream": ["Chunker:0"],

View File

@ -37,6 +37,18 @@ class SupportedLiteLLMProvider(StrEnum):
TogetherAI = "TogetherAI"
Anthropic = "Anthropic"
Ollama = "Ollama"
Meituan = "Meituan"
CometAPI = "CometAPI"
SILICONFLOW = "SILICONFLOW"
OpenRouter = "OpenRouter"
StepFun = "StepFun"
PPIO = "PPIO"
PerfXCloud = "PerfXCloud"
Upstage = "Upstage"
NovitaAI = "NovitaAI"
Lingyi_AI = "01.AI"
GiteeAI = "GiteeAI"
AI_302 = "302.AI"
FACTORY_DEFAULT_BASE_URL = {
@ -44,6 +56,19 @@ FACTORY_DEFAULT_BASE_URL = {
SupportedLiteLLMProvider.Dashscope: "https://dashscope.aliyuncs.com/compatible-mode/v1",
SupportedLiteLLMProvider.Moonshot: "https://api.moonshot.cn/v1",
SupportedLiteLLMProvider.Ollama: "",
SupportedLiteLLMProvider.Meituan: "https://api.longcat.chat/openai",
SupportedLiteLLMProvider.CometAPI: "https://api.cometapi.com/v1",
SupportedLiteLLMProvider.SILICONFLOW: "https://api.siliconflow.cn/v1",
SupportedLiteLLMProvider.OpenRouter: "https://openrouter.ai/api/v1",
SupportedLiteLLMProvider.StepFun: "https://api.stepfun.com/v1",
SupportedLiteLLMProvider.PPIO: "https://api.ppinfra.com/v3/openai",
SupportedLiteLLMProvider.PerfXCloud: "https://cloud.perfxlab.cn/v1",
SupportedLiteLLMProvider.Upstage: "https://api.upstage.ai/v1/solar",
SupportedLiteLLMProvider.NovitaAI: "https://api.novita.ai/v3/openai",
SupportedLiteLLMProvider.Lingyi_AI: "https://api.lingyiwanwu.com/v1",
SupportedLiteLLMProvider.GiteeAI: "https://ai.gitee.com/v1/",
SupportedLiteLLMProvider.AI_302: "https://api.302.ai/v1",
SupportedLiteLLMProvider.Anthropic: "https://api.anthropic.com/",
}
@ -62,6 +87,18 @@ LITELLM_PROVIDER_PREFIX = {
SupportedLiteLLMProvider.TogetherAI: "together_ai/",
SupportedLiteLLMProvider.Anthropic: "", # don't need a prefix
SupportedLiteLLMProvider.Ollama: "ollama_chat/",
SupportedLiteLLMProvider.Meituan: "openai/",
SupportedLiteLLMProvider.CometAPI: "openai/",
SupportedLiteLLMProvider.SILICONFLOW: "openai/",
SupportedLiteLLMProvider.OpenRouter: "openai/",
SupportedLiteLLMProvider.StepFun: "openai/",
SupportedLiteLLMProvider.PPIO: "openai/",
SupportedLiteLLMProvider.PerfXCloud: "openai/",
SupportedLiteLLMProvider.Upstage: "openai/",
SupportedLiteLLMProvider.NovitaAI: "openai/",
SupportedLiteLLMProvider.Lingyi_AI: "openai/",
SupportedLiteLLMProvider.GiteeAI: "openai/",
SupportedLiteLLMProvider.AI_302: "openai/",
}
ChatModel = globals().get("ChatModel", {})

View File

@ -895,25 +895,6 @@ class MistralChat(Base):
yield total_tokens
## openrouter
class OpenRouterChat(Base):
_FACTORY_NAME = "OpenRouter"
def __init__(self, key, model_name, base_url="https://openrouter.ai/api/v1", **kwargs):
if not base_url:
base_url = "https://openrouter.ai/api/v1"
super().__init__(key, model_name, base_url, **kwargs)
class StepFunChat(Base):
_FACTORY_NAME = "StepFun"
def __init__(self, key, model_name, base_url="https://api.stepfun.com/v1", **kwargs):
if not base_url:
base_url = "https://api.stepfun.com/v1"
super().__init__(key, model_name, base_url, **kwargs)
class LmStudioChat(Base):
_FACTORY_NAME = "LM-Studio"
@ -936,15 +917,6 @@ class OpenAI_APIChat(Base):
super().__init__(key, model_name, base_url, **kwargs)
class PPIOChat(Base):
_FACTORY_NAME = "PPIO"
def __init__(self, key, model_name, base_url="https://api.ppinfra.com/v3/openai", **kwargs):
if not base_url:
base_url = "https://api.ppinfra.com/v3/openai"
super().__init__(key, model_name, base_url, **kwargs)
class LeptonAIChat(Base):
_FACTORY_NAME = "LeptonAI"
@ -954,60 +926,6 @@ class LeptonAIChat(Base):
super().__init__(key, model_name, base_url, **kwargs)
class PerfXCloudChat(Base):
_FACTORY_NAME = "PerfXCloud"
def __init__(self, key, model_name, base_url="https://cloud.perfxlab.cn/v1", **kwargs):
if not base_url:
base_url = "https://cloud.perfxlab.cn/v1"
super().__init__(key, model_name, base_url, **kwargs)
class UpstageChat(Base):
_FACTORY_NAME = "Upstage"
def __init__(self, key, model_name, base_url="https://api.upstage.ai/v1/solar", **kwargs):
if not base_url:
base_url = "https://api.upstage.ai/v1/solar"
super().__init__(key, model_name, base_url, **kwargs)
class NovitaAIChat(Base):
_FACTORY_NAME = "NovitaAI"
def __init__(self, key, model_name, base_url="https://api.novita.ai/v3/openai", **kwargs):
if not base_url:
base_url = "https://api.novita.ai/v3/openai"
super().__init__(key, model_name, base_url, **kwargs)
class SILICONFLOWChat(Base):
_FACTORY_NAME = "SILICONFLOW"
def __init__(self, key, model_name, base_url="https://api.siliconflow.cn/v1", **kwargs):
if not base_url:
base_url = "https://api.siliconflow.cn/v1"
super().__init__(key, model_name, base_url, **kwargs)
class YiChat(Base):
_FACTORY_NAME = "01.AI"
def __init__(self, key, model_name, base_url="https://api.lingyiwanwu.com/v1", **kwargs):
if not base_url:
base_url = "https://api.lingyiwanwu.com/v1"
super().__init__(key, model_name, base_url, **kwargs)
class GiteeChat(Base):
_FACTORY_NAME = "GiteeAI"
def __init__(self, key, model_name, base_url="https://ai.gitee.com/v1/", **kwargs):
if not base_url:
base_url = "https://ai.gitee.com/v1/"
super().__init__(key, model_name, base_url, **kwargs)
class ReplicateChat(Base):
_FACTORY_NAME = "Replicate"
@ -1347,26 +1265,46 @@ class GPUStackChat(Base):
super().__init__(key, model_name, base_url, **kwargs)
class Ai302Chat(Base):
_FACTORY_NAME = "302.AI"
class TokenPonyChat(Base):
_FACTORY_NAME = "TokenPony"
def __init__(self, key, model_name, base_url="https://api.302.ai/v1", **kwargs):
def __init__(self, key, model_name, base_url="https://ragflow.vip-api.tokenpony.cn/v1", **kwargs):
if not base_url:
base_url = "https://api.302.ai/v1"
super().__init__(key, model_name, base_url, **kwargs)
class MeituanChat(Base):
_FACTORY_NAME = "Meituan"
def __init__(self, key, model_name, base_url="https://api.longcat.chat/openai", **kwargs):
if not base_url:
base_url = "https://api.longcat.chat/openai"
super().__init__(key, model_name, base_url, **kwargs)
base_url = "https://ragflow.vip-api.tokenpony.cn/v1"
class LiteLLMBase(ABC):
_FACTORY_NAME = ["Tongyi-Qianwen", "Bedrock", "Moonshot", "xAI", "DeepInfra", "Groq", "Cohere", "Gemini", "DeepSeek", "NVIDIA", "TogetherAI", "Anthropic", "Ollama"]
_FACTORY_NAME = [
"Tongyi-Qianwen",
"Bedrock",
"Moonshot",
"xAI",
"DeepInfra",
"Groq",
"Cohere",
"Gemini",
"DeepSeek",
"NVIDIA",
"TogetherAI",
"Anthropic",
"Ollama",
"Meituan",
"CometAPI",
"SILICONFLOW",
"OpenRouter",
"StepFun",
"PPIO",
"PerfXCloud",
"Upstage",
"NovitaAI",
"01.AI",
"GiteeAI",
"302.AI",
]
import litellm
litellm._turn_on_debug()
def __init__(self, key, model_name, base_url=None, **kwargs):
self.timeout = int(os.environ.get("LM_TIMEOUT_SECONDS", 600))
@ -1374,7 +1312,7 @@ class LiteLLMBase(ABC):
self.prefix = LITELLM_PROVIDER_PREFIX.get(self.provider, "")
self.model_name = f"{self.prefix}{model_name}"
self.api_key = key
self.base_url = (base_url or FACTORY_DEFAULT_BASE_URL.get(self.provider, "")).rstrip('/')
self.base_url = (base_url or FACTORY_DEFAULT_BASE_URL.get(self.provider, "")).rstrip("/")
# Configure retry parameters
self.max_retries = kwargs.get("max_retries", int(os.environ.get("LLM_MAX_RETRIES", 5)))
self.base_delay = kwargs.get("retry_interval", float(os.environ.get("LLM_BASE_DELAY", 2.0)))

View File

@ -86,9 +86,10 @@ class DefaultEmbedding(Base):
with DefaultEmbedding._model_lock:
import torch
from FlagEmbedding import FlagModel
if "CUDA_VISIBLE_DEVICES" in os.environ:
input_cuda_visible_devices = os.environ["CUDA_VISIBLE_DEVICES"]
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # handle some issues with multiple GPUs when initializing the model
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # handle some issues with multiple GPUs when initializing the model
if not DefaultEmbedding._model or model_name != DefaultEmbedding._model_name:
try:
@ -472,6 +473,7 @@ class MistralEmbed(Base):
def encode(self, texts: list):
import time
import random
texts = [truncate(t, 8196) for t in texts]
batch_size = 16
ress = []
@ -495,6 +497,7 @@ class MistralEmbed(Base):
def encode_queries(self, text):
import time
import random
retry_max = 5
while retry_max > 0:
try:
@ -751,7 +754,11 @@ class SILICONFLOWEmbed(Base):
token_count = 0
for i in range(0, len(texts), batch_size):
texts_batch = texts[i : i + batch_size]
texts_batch = [" " if not text.strip() else text for text in texts_batch]
if self.model_name in ["BAAI/bge-large-zh-v1.5", "BAAI/bge-large-en-v1.5"]:
# limit 512, 340 is almost safe
texts_batch = [" " if not text.strip() else truncate(text, 340) for text in texts_batch]
else:
texts_batch = [" " if not text.strip() else text for text in texts_batch]
payload = {
"model": self.model_name,
@ -938,6 +945,7 @@ class GiteeEmbed(SILICONFLOWEmbed):
base_url = "https://ai.gitee.com/v1/embeddings"
super().__init__(key, model_name, base_url)
class DeepInfraEmbed(OpenAIEmbed):
_FACTORY_NAME = "DeepInfra"
@ -954,3 +962,12 @@ class Ai302Embed(Base):
if not base_url:
base_url = "https://api.302.ai/v1/embeddings"
super().__init__(key, model_name, base_url)
class CometEmbed(OpenAIEmbed):
_FACTORY_NAME = "CometAPI"
def __init__(self, key, model_name, base_url="https://api.cometapi.com/v1"):
if not base_url:
base_url = "https://api.cometapi.com/v1"
super().__init__(key, model_name, base_url)

View File

@ -365,7 +365,7 @@ class OpenAI_APIRerank(Base):
max_rank = np.max(rank)
# Avoid division by zero if all ranks are identical
if np.isclose(min_rank, max_rank, atol=1e-3):
if not np.isclose(min_rank, max_rank, atol=1e-3):
rank = (rank - min_rank) / (max_rank - min_rank)
else:
rank = np.zeros_like(rank)

View File

@ -218,7 +218,7 @@ class GPUStackSeq2txt(Base):
class GiteeSeq2txt(Base):
_FACTORY_NAME = "GiteeAI"
def __init__(self, key, model_name="whisper-1", base_url="https://ai.gitee.com/v1/"):
def __init__(self, key, model_name="whisper-1", base_url="https://ai.gitee.com/v1/", **kwargs):
if not base_url:
base_url = "https://ai.gitee.com/v1/"
self.client = OpenAI(api_key=key, base_url=base_url)
@ -234,3 +234,13 @@ class DeepInfraSeq2txt(Base):
self.client = OpenAI(api_key=key, base_url=base_url)
self.model_name = model_name
class CometSeq2txt(Base):
_FACTORY_NAME = "CometAPI"
def __init__(self, key, model_name="whisper-1", base_url="https://api.cometapi.com/v1", **kwargs):
if not base_url:
base_url = "https://api.cometapi.com/v1"
self.client = OpenAI(api_key=key, base_url=base_url)
self.model_name = model_name

View File

@ -394,3 +394,11 @@ class DeepInfraTTS(OpenAITTS):
if not base_url:
base_url = "https://api.deepinfra.com/v1/openai"
super().__init__(key, model_name, base_url, **kwargs)
class CometAPITTS(OpenAITTS):
_FACTORY_NAME = "CometAPI"
def __init__(self, key, model_name, base_url="https://api.cometapi.com/v1", **kwargs):
if not base_url:
base_url = "https://api.cometapi.com/v1"
super().__init__(key, model_name, base_url, **kwargs)

View File

@ -14,24 +14,24 @@
},
"node_modules/asynckit": {
"version": "0.4.0",
"resolved": "https://registry.npmmirror.com/asynckit/-/asynckit-0.4.0.tgz",
"resolved": "https://registry.npmjs.org/asynckit/-/asynckit-0.4.0.tgz",
"integrity": "sha512-Oei9OH4tRh0YqU3GxhX79dM/mwVgvbZJaSNaRk+bshkj0S5cfHcgYakreBjrHwatXKbz+IoIdYLxrKim2MjW0Q==",
"license": "MIT"
},
"node_modules/axios": {
"version": "1.9.0",
"resolved": "https://registry.npmmirror.com/axios/-/axios-1.9.0.tgz",
"integrity": "sha512-re4CqKTJaURpzbLHtIi6XpDv20/CnpXOtjRY5/CU32L8gU8ek9UIivcfvSWvmKEngmVbrUtPpdDwWDWL7DNHvg==",
"version": "1.12.0",
"resolved": "https://registry.npmjs.org/axios/-/axios-1.12.0.tgz",
"integrity": "sha512-oXTDccv8PcfjZmPGlWsPSwtOJCZ/b6W5jAMCNcfwJbCzDckwG0jrYJFaWH1yvivfCXjVzV/SPDEhMB3Q+DSurg==",
"license": "MIT",
"dependencies": {
"follow-redirects": "^1.15.6",
"form-data": "^4.0.0",
"form-data": "^4.0.4",
"proxy-from-env": "^1.1.0"
}
},
"node_modules/call-bind-apply-helpers": {
"version": "1.0.2",
"resolved": "https://registry.npmmirror.com/call-bind-apply-helpers/-/call-bind-apply-helpers-1.0.2.tgz",
"resolved": "https://registry.npmjs.org/call-bind-apply-helpers/-/call-bind-apply-helpers-1.0.2.tgz",
"integrity": "sha512-Sp1ablJ0ivDkSzjcaJdxEunN5/XvksFJ2sMBFfq6x0ryhQV/2b/KwFe21cMpmHtPOSij8K99/wSfoEuTObmuMQ==",
"license": "MIT",
"dependencies": {
@ -44,7 +44,7 @@
},
"node_modules/combined-stream": {
"version": "1.0.8",
"resolved": "https://registry.npmmirror.com/combined-stream/-/combined-stream-1.0.8.tgz",
"resolved": "https://registry.npmjs.org/combined-stream/-/combined-stream-1.0.8.tgz",
"integrity": "sha512-FQN4MRfuJeHf7cBbBMJFXhKSDq+2kAArBlmRBvcvFE5BB1HZKXtSFASDhdlz9zOYwxh8lDdnvmMOe/+5cdoEdg==",
"license": "MIT",
"dependencies": {
@ -56,7 +56,7 @@
},
"node_modules/delayed-stream": {
"version": "1.0.0",
"resolved": "https://registry.npmmirror.com/delayed-stream/-/delayed-stream-1.0.0.tgz",
"resolved": "https://registry.npmjs.org/delayed-stream/-/delayed-stream-1.0.0.tgz",
"integrity": "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ==",
"license": "MIT",
"engines": {
@ -65,7 +65,7 @@
},
"node_modules/dunder-proto": {
"version": "1.0.1",
"resolved": "https://registry.npmmirror.com/dunder-proto/-/dunder-proto-1.0.1.tgz",
"resolved": "https://registry.npmjs.org/dunder-proto/-/dunder-proto-1.0.1.tgz",
"integrity": "sha512-KIN/nDJBQRcXw0MLVhZE9iQHmG68qAVIBg9CqmUYjmQIhgij9U5MFvrqkUL5FbtyyzZuOeOt0zdeRe4UY7ct+A==",
"license": "MIT",
"dependencies": {
@ -79,7 +79,7 @@
},
"node_modules/es-define-property": {
"version": "1.0.1",
"resolved": "https://registry.npmmirror.com/es-define-property/-/es-define-property-1.0.1.tgz",
"resolved": "https://registry.npmjs.org/es-define-property/-/es-define-property-1.0.1.tgz",
"integrity": "sha512-e3nRfgfUZ4rNGL232gUgX06QNyyez04KdjFrF+LTRoOXmrOgFKDg4BCdsjW8EnT69eqdYGmRpJwiPVYNrCaW3g==",
"license": "MIT",
"engines": {
@ -88,7 +88,7 @@
},
"node_modules/es-errors": {
"version": "1.3.0",
"resolved": "https://registry.npmmirror.com/es-errors/-/es-errors-1.3.0.tgz",
"resolved": "https://registry.npmjs.org/es-errors/-/es-errors-1.3.0.tgz",
"integrity": "sha512-Zf5H2Kxt2xjTvbJvP2ZWLEICxA6j+hAmMzIlypy4xcBg1vKVnx89Wy0GbS+kf5cwCVFFzdCFh2XSCFNULS6csw==",
"license": "MIT",
"engines": {
@ -97,7 +97,7 @@
},
"node_modules/es-object-atoms": {
"version": "1.1.1",
"resolved": "https://registry.npmmirror.com/es-object-atoms/-/es-object-atoms-1.1.1.tgz",
"resolved": "https://registry.npmjs.org/es-object-atoms/-/es-object-atoms-1.1.1.tgz",
"integrity": "sha512-FGgH2h8zKNim9ljj7dankFPcICIK9Cp5bm+c2gQSYePhpaG5+esrLODihIorn+Pe6FGJzWhXQotPv73jTaldXA==",
"license": "MIT",
"dependencies": {
@ -109,7 +109,7 @@
},
"node_modules/es-set-tostringtag": {
"version": "2.1.0",
"resolved": "https://registry.npmmirror.com/es-set-tostringtag/-/es-set-tostringtag-2.1.0.tgz",
"resolved": "https://registry.npmjs.org/es-set-tostringtag/-/es-set-tostringtag-2.1.0.tgz",
"integrity": "sha512-j6vWzfrGVfyXxge+O0x5sh6cvxAog0a/4Rdd2K36zCMV5eJ+/+tOAngRO8cODMNWbVRdVlmGZQL2YS3yR8bIUA==",
"license": "MIT",
"dependencies": {
@ -143,14 +143,15 @@
}
},
"node_modules/form-data": {
"version": "4.0.2",
"resolved": "https://registry.npmmirror.com/form-data/-/form-data-4.0.2.tgz",
"integrity": "sha512-hGfm/slu0ZabnNt4oaRZ6uREyfCj6P4fT/n6A1rGV+Z0VdGXjfOhVUpkn6qVQONHGIFwmveGXyDs75+nr6FM8w==",
"version": "4.0.4",
"resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.4.tgz",
"integrity": "sha512-KrGhL9Q4zjj0kiUt5OO4Mr/A/jlI2jDYs5eHBpYHPcBEVSiipAvn2Ko2HnPe20rmcuuvMHNdZFp+4IlGTMF0Ow==",
"license": "MIT",
"dependencies": {
"asynckit": "^0.4.0",
"combined-stream": "^1.0.8",
"es-set-tostringtag": "^2.1.0",
"hasown": "^2.0.2",
"mime-types": "^2.1.12"
},
"engines": {
@ -159,7 +160,7 @@
},
"node_modules/function-bind": {
"version": "1.1.2",
"resolved": "https://registry.npmmirror.com/function-bind/-/function-bind-1.1.2.tgz",
"resolved": "https://registry.npmjs.org/function-bind/-/function-bind-1.1.2.tgz",
"integrity": "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA==",
"license": "MIT",
"funding": {
@ -168,7 +169,7 @@
},
"node_modules/get-intrinsic": {
"version": "1.3.0",
"resolved": "https://registry.npmmirror.com/get-intrinsic/-/get-intrinsic-1.3.0.tgz",
"resolved": "https://registry.npmjs.org/get-intrinsic/-/get-intrinsic-1.3.0.tgz",
"integrity": "sha512-9fSjSaos/fRIVIp+xSJlE6lfwhES7LNtKaCBIamHsjr2na1BiABJPo0mOjjz8GJDURarmCPGqaiVg5mfjb98CQ==",
"license": "MIT",
"dependencies": {
@ -192,7 +193,7 @@
},
"node_modules/get-proto": {
"version": "1.0.1",
"resolved": "https://registry.npmmirror.com/get-proto/-/get-proto-1.0.1.tgz",
"resolved": "https://registry.npmjs.org/get-proto/-/get-proto-1.0.1.tgz",
"integrity": "sha512-sTSfBjoXBp89JvIKIefqw7U2CCebsc74kiY6awiGogKtoSGbgjYE/G/+l9sF3MWFPNc9IcoOC4ODfKHfxFmp0g==",
"license": "MIT",
"dependencies": {
@ -205,7 +206,7 @@
},
"node_modules/gopd": {
"version": "1.2.0",
"resolved": "https://registry.npmmirror.com/gopd/-/gopd-1.2.0.tgz",
"resolved": "https://registry.npmjs.org/gopd/-/gopd-1.2.0.tgz",
"integrity": "sha512-ZUKRh6/kUFoAiTAtTYPZJ3hw9wNxx+BIBOijnlG9PnrJsCcSjs1wyyD6vJpaYtgnzDrKYRSqf3OO6Rfa93xsRg==",
"license": "MIT",
"engines": {
@ -217,7 +218,7 @@
},
"node_modules/has-symbols": {
"version": "1.1.0",
"resolved": "https://registry.npmmirror.com/has-symbols/-/has-symbols-1.1.0.tgz",
"resolved": "https://registry.npmjs.org/has-symbols/-/has-symbols-1.1.0.tgz",
"integrity": "sha512-1cDNdwJ2Jaohmb3sg4OmKaMBwuC48sYni5HUw2DvsC8LjGTLK9h+eb1X6RyuOHe4hT0ULCW68iomhjUoKUqlPQ==",
"license": "MIT",
"engines": {
@ -229,7 +230,7 @@
},
"node_modules/has-tostringtag": {
"version": "1.0.2",
"resolved": "https://registry.npmmirror.com/has-tostringtag/-/has-tostringtag-1.0.2.tgz",
"resolved": "https://registry.npmjs.org/has-tostringtag/-/has-tostringtag-1.0.2.tgz",
"integrity": "sha512-NqADB8VjPFLM2V0VvHUewwwsw0ZWBaIdgo+ieHtK3hasLz4qeCRjYcqfB6AQrBggRKppKF8L52/VqdVsO47Dlw==",
"license": "MIT",
"dependencies": {
@ -244,7 +245,7 @@
},
"node_modules/hasown": {
"version": "2.0.2",
"resolved": "https://registry.npmmirror.com/hasown/-/hasown-2.0.2.tgz",
"resolved": "https://registry.npmjs.org/hasown/-/hasown-2.0.2.tgz",
"integrity": "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ==",
"license": "MIT",
"dependencies": {
@ -256,7 +257,7 @@
},
"node_modules/math-intrinsics": {
"version": "1.1.0",
"resolved": "https://registry.npmmirror.com/math-intrinsics/-/math-intrinsics-1.1.0.tgz",
"resolved": "https://registry.npmjs.org/math-intrinsics/-/math-intrinsics-1.1.0.tgz",
"integrity": "sha512-/IXtbwEk5HTPyEwyKX6hGkYXxM9nbj64B+ilVJnC/R6B0pH5G4V3b0pVbL7DBj4tkhBAppbQUlf6F6Xl9LHu1g==",
"license": "MIT",
"engines": {
@ -265,7 +266,7 @@
},
"node_modules/mime-db": {
"version": "1.52.0",
"resolved": "https://registry.npmmirror.com/mime-db/-/mime-db-1.52.0.tgz",
"resolved": "https://registry.npmjs.org/mime-db/-/mime-db-1.52.0.tgz",
"integrity": "sha512-sPU4uV7dYlvtWJxwwxHD0PuihVNiE7TyAbQ5SWxDCB9mUYvOgroQOwYQQOKPJ8CIbE+1ETVlOoK1UC2nU3gYvg==",
"license": "MIT",
"engines": {
@ -274,7 +275,7 @@
},
"node_modules/mime-types": {
"version": "2.1.35",
"resolved": "https://registry.npmmirror.com/mime-types/-/mime-types-2.1.35.tgz",
"resolved": "https://registry.npmjs.org/mime-types/-/mime-types-2.1.35.tgz",
"integrity": "sha512-ZDY+bPm5zTTF+YpCrAU9nK0UgICYPT0QtT1NZWFv4s++TNkcgVaT0g6+4R2uI4MjQjzysHB1zxuWL50hzaeXiw==",
"license": "MIT",
"dependencies": {

Two image assets added (diffs suppressed because the files are too long): 96 KiB and 16 KiB.

View File

@ -139,7 +139,7 @@ function EmbedDialog({
</form>
</Form>
<div>
<span>Embed code</span>
<span>{t('embedCode', { keyPrefix: 'search' })}</span>
<HightLightMarkdown>{text}</HightLightMarkdown>
</div>
<div className=" font-medium mt-4 mb-1">

View File

@ -17,7 +17,7 @@ export function MaxTokenNumberFormField({ max = 2048, initialValue }: IProps) {
tooltip={t('chunkTokenNumberTip')}
max={max}
defaultValue={initialValue ?? 0}
layout={FormLayout.Horizontal}
layout={FormLayout.Vertical}
></SliderInputFormField>
);
}

View File

@ -1,6 +1,6 @@
import { FormLayout } from '@/constants/form';
import { cn } from '@/lib/utils';
import { ReactNode } from 'react';
import { ReactNode, useMemo } from 'react';
import { useFormContext } from 'react-hook-form';
import { SingleFormSlider } from './ui/dual-range-slider';
import {
@ -40,7 +40,7 @@ export function SliderInputFormField({
}: SliderInputFormFieldProps) {
const form = useFormContext();
const isHorizontal = layout === FormLayout.Horizontal;
const isHorizontal = useMemo(() => layout === FormLayout.Vertical, [layout]);
return (
<FormField

View File

@ -54,7 +54,9 @@ export enum LLMFactory {
DeepInfra = 'DeepInfra',
Grok = 'Grok',
XAI = 'xAI',
TokenPony = 'TokenPony',
Meituan = 'Meituan',
CometAPI = 'CometAPI',
}
// Please lowercase the file name
@ -114,5 +116,7 @@ export const IconMap = {
[LLMFactory.DeepInfra]: 'deepinfra',
[LLMFactory.Grok]: 'grok',
[LLMFactory.XAI]: 'xai',
[LLMFactory.TokenPony]: 'token-pony',
[LLMFactory.Meituan]: 'longcat',
[LLMFactory.CometAPI]: 'cometapi',
};

View File

@ -136,6 +136,7 @@ export const useSelectLlmOptionsByModelType = () => {
};
};
// Merge different types of models from the same manufacturer under one manufacturer
export const useComposeLlmOptionsByModelTypes = (
modelTypes: LlmModelType[],
) => {
@ -155,7 +156,12 @@ export const useComposeLlmOptionsByModelTypes = (
options.forEach((x) => {
const item = pre.find((y) => y.label === x.label);
if (item) {
item.options.push(...x.options);
x.options.forEach((y) => {
// A model that is both an image2text and speech2text model
if (!item.options.some((z) => z.value === y.value)) {
item.options.push(y);
}
});
} else {
pre.push(x);
}

View File

@ -155,7 +155,7 @@ export default {
similarityThreshold: '相似度阈值',
similarityThresholdTip:
'我们使用混合相似度得分来评估两行文本之间的距离。 它是加权关键词相似度和向量余弦相似度。 如果查询和块之间的相似度小于此阈值,则该块将被过滤掉。默认设置为 0.2,也就是说文本块的混合相似度得分至少 20 才会被召回。',
vectorSimilarityWeight: '相似度相似度权重',
vectorSimilarityWeight: '向量相似度权重',
vectorSimilarityWeightTip:
'我们使用混合相似性评分来评估两行文本之间的距离。它是加权关键字相似性和矢量余弦相似性或rerank得分0〜1。两个权重的总和为1.0。',
keywordSimilarityWeight: '关键词相似度权重',
@ -601,7 +601,7 @@ General实体和关系提取提示来自 GitHub - microsoft/graphrag基于
answerTitle: '智能回答',
multiTurn: '多轮对话优化',
multiTurnTip:
'在多轮对话的中,对去知识库查询问题进行优化。会调用大模型额外消耗token。',
'在多轮对话时,对查询问题根据上下文进行优化。会调用大模型额外消耗 token。',
howUseId: '如何使用聊天ID',
description: '助理描述',
descriptionPlaceholder:
@ -632,6 +632,8 @@ General实体和关系提取提示来自 GitHub - microsoft/graphrag基于
},
cancel: '取消',
chatSetting: '聊天设置',
avatarHidden: '隐藏头像',
locale: '地区',
},
setting: {
profile: '概要',

View File

@ -62,7 +62,7 @@ function AgentChatBox() {
return (
<>
<section className="flex flex-1 flex-col px-5 h-[90vh]">
<section className="flex flex-1 flex-col px-5 min-h-0 pb-4">
<div className="flex-1 overflow-auto" ref={messageContainerRef}>
<div>
{/* <Spin spinning={sendLoading}> */}

View File

@ -9,7 +9,7 @@ export function ChatSheet({ hideModal }: IModalProps<any>) {
return (
<Sheet open modal={false} onOpenChange={hideModal}>
<SheetContent
className={cn('top-20 p-0')}
className={cn('top-20 bottom-0 p-0 flex flex-col h-auto')}
onInteractOutside={(e) => e.preventDefault()}
>
<SheetTitle className="hidden"></SheetTitle>

View File

@ -145,7 +145,7 @@ function AgentForm({ node }: INextOperatorForm) {
<PromptEditor
{...field}
placeholder={t('flow.messagePlaceholder')}
showToolbar={false}
showToolbar={true}
extraOptions={extraOptions}
></PromptEditor>
</FormControl>
@ -166,7 +166,7 @@ function AgentForm({ node }: INextOperatorForm) {
<section>
<PromptEditor
{...field}
showToolbar={false}
showToolbar={true}
></PromptEditor>
</section>
</FormControl>

View File

@ -2133,7 +2133,7 @@ export const QWeatherTimePeriodOptions = [
'30d',
];
export const ExeSQLOptions = ['mysql', 'postgresql', 'mariadb', 'mssql'].map(
export const ExeSQLOptions = ['mysql', 'postgres', 'mariadb', 'mssql'].map(
(x) => ({
label: upperFirst(x),
value: x,

View File

@ -70,16 +70,15 @@ export function EmbeddingModelItem() {
name={'embd_id'}
render={({ field }) => (
<FormItem className=" items-center space-y-0 ">
<div className="">
<div className="flex items-center">
<FormLabel
required
tooltip={t('embeddingModelTip')}
className="text-sm whitespace-wrap "
className="text-sm whitespace-wrap w-1/4"
>
<span className="text-destructive mr-1"> *</span>
{t('embeddingModel')}
</FormLabel>
<div className="text-muted-foreground">
<div className="text-muted-foreground w-3/4">
<FormControl>
<RAGFlowSelect
{...field}

View File

@ -9,13 +9,7 @@ import { cn, formatBytes } from '@/lib/utils';
import { Routes } from '@/routes';
import { formatPureDate } from '@/utils/date';
import { isEmpty } from 'lodash';
import {
Banknote,
Database,
DatabaseZap,
FileSearch2,
GitGraph,
} from 'lucide-react';
import { Banknote, Database, FileSearch2, GitGraph } from 'lucide-react';
import { useMemo } from 'react';
import { useTranslation } from 'react-i18next';
import { useHandleMenuClick } from './hooks';
@ -34,11 +28,11 @@ export function SideBar({ refreshCount }: PropType) {
const items = useMemo(() => {
const list = [
{
icon: DatabaseZap,
label: t(`knowledgeDetails.overview`),
key: Routes.DataSetOverview,
},
// {
// icon: DatabaseZap,
// label: t(`knowledgeDetails.overview`),
// key: Routes.DataSetOverview,
// },
{
icon: Database,
label: t(`knowledgeDetails.dataset`),

View File

@ -17,16 +17,9 @@ import {
import { Input } from '@/components/ui/input';
import { IModalProps } from '@/interfaces/common';
import { zodResolver } from '@hookform/resolvers/zod';
- import { useForm, useWatch } from 'react-hook-form';
+ import { useForm } from 'react-hook-form';
import { useTranslation } from 'react-i18next';
import { z } from 'zod';
import {
DataExtractKnowledgeItem,
DataFlowItem,
EmbeddingModelItem,
ParseTypeItem,
TeamItem,
} from '../dataset/dataset-setting/configuration/common-item';
const FormId = 'dataset-creating-form';
@ -54,10 +47,6 @@ export function InputForm({ onOk }: IModalProps<any>) {
function onSubmit(data: z.infer<typeof FormSchema>) {
onOk?.(data.name);
}
const parseType = useWatch({
control: form.control,
name: 'parseType',
});
return (
<Form {...form}>
<form
@ -84,15 +73,6 @@ export function InputForm({ onOk }: IModalProps<any>) {
</FormItem>
)}
/>
<EmbeddingModelItem line={2} />
<ParseTypeItem />
{parseType === 2 && (
<>
<DataFlowItem />
<DataExtractKnowledgeItem />
<TeamItem />
</>
)}
</form>
</Form>
);

View File

@ -0,0 +1,123 @@
import { ButtonLoading } from '@/components/ui/button';
import {
Dialog,
DialogContent,
DialogFooter,
DialogHeader,
DialogTitle,
} from '@/components/ui/dialog';
import {
Form,
FormControl,
FormField,
FormItem,
FormLabel,
FormMessage,
} from '@/components/ui/form';
import { Input } from '@/components/ui/input';
import { IModalProps } from '@/interfaces/common';
import { zodResolver } from '@hookform/resolvers/zod';
import { useForm, useWatch } from 'react-hook-form';
import { useTranslation } from 'react-i18next';
import { z } from 'zod';
import {
DataExtractKnowledgeItem,
DataFlowItem,
EmbeddingModelItem,
ParseTypeItem,
TeamItem,
} from '../dataset/dataset-setting/configuration/common-item';
const FormId = 'dataset-creating-form';
export function InputForm({ onOk }: IModalProps<any>) {
const { t } = useTranslation();
const FormSchema = z.object({
name: z
.string()
.min(1, {
message: t('knowledgeList.namePlaceholder'),
})
.trim(),
parseType: z.number().optional(),
});
const form = useForm<z.infer<typeof FormSchema>>({
resolver: zodResolver(FormSchema),
defaultValues: {
name: '',
parseType: 1,
},
});
function onSubmit(data: z.infer<typeof FormSchema>) {
onOk?.(data.name);
}
const parseType = useWatch({
control: form.control,
name: 'parseType',
});
return (
<Form {...form}>
<form
onSubmit={form.handleSubmit(onSubmit)}
className="space-y-6"
id={FormId}
>
<FormField
control={form.control}
name="name"
render={({ field }) => (
<FormItem>
<FormLabel>
<span className="text-destructive mr-1"> *</span>
{t('knowledgeList.name')}
</FormLabel>
<FormControl>
<Input
placeholder={t('knowledgeList.namePlaceholder')}
{...field}
/>
</FormControl>
<FormMessage />
</FormItem>
)}
/>
<EmbeddingModelItem line={2} />
<ParseTypeItem />
{parseType === 2 && (
<>
<DataFlowItem />
<DataExtractKnowledgeItem />
<TeamItem />
</>
)}
</form>
</Form>
);
}
export function DatasetCreatingDialog({
hideModal,
onOk,
loading,
}: IModalProps<any>) {
const { t } = useTranslation();
return (
<Dialog open onOpenChange={hideModal}>
<DialogContent className="sm:max-w-[425px]">
<DialogHeader>
<DialogTitle>{t('knowledgeList.createKnowledgeBase')}</DialogTitle>
</DialogHeader>
<InputForm onOk={onOk}></InputForm>
<DialogFooter>
<ButtonLoading type="submit" form={FormId} loading={loading}>
{t('common.save')}
</ButtonLoading>
</DialogFooter>
</DialogContent>
</Dialog>
);
}
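This new file reintroduces the parseType-driven form that the previous hunk removed from the other create-dialog file, wrapping it in a dedicated `DatasetCreatingDialog`. Note that `onSubmit` still forwards only `data.name`, so the extra fields shown when `parseType === 2` are collected but not yet passed to `onOk`. A hypothetical usage sketch, in which everything except `DatasetCreatingDialog` itself is an illustrative assumption:

```tsx
import { useState } from 'react';
import { DatasetCreatingDialog } from './dataset-creating-dialog'; // assumed path

// Hypothetical host component; flag and handler names are illustrative only.
export function DatasetListToolbar({
  createDataset,
}: {
  createDataset: (name: string) => void;
}) {
  const [visible, setVisible] = useState(false);
  return (
    <>
      <button onClick={() => setVisible(true)}>New dataset</button>
      {visible && (
        <DatasetCreatingDialog
          hideModal={() => setVisible(false)}
          loading={false}
          onOk={(name: string) => createDataset(name)} // receives data.name from InputForm
        />
      )}
    </>
  );
}
```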

View File

@ -2911,7 +2911,7 @@ export const QWeatherTimePeriodOptions = [
'30d',
];
- export const ExeSQLOptions = ['mysql', 'postgresql', 'mariadb', 'mssql'].map(
+ export const ExeSQLOptions = ['mysql', 'postgres', 'mariadb', 'mssql'].map(
(x) => ({
label: upperFirst(x),
value: x,

View File

@ -89,6 +89,18 @@ const ApiKeyModal = ({
/>
</Form.Item>
)}
{llmFactory?.toLowerCase() === 'Anthropic'.toLowerCase() && (
<Form.Item<FieldType>
label={t('baseUrl')}
name="base_url"
tooltip={t('baseUrlTip')}
>
<Input
placeholder="https://api.anthropic.com/v1"
onKeyDown={handleKeyDown}
/>
</Form.Item>
)}
{llmFactory?.toLowerCase() === 'Minimax'.toLowerCase() && (
<Form.Item<FieldType> label={'Group ID'} name="group_id">
<Input />
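The new block renders a `base_url` field only for the Anthropic factory, mirroring the existing Minimax-specific fields below it. How an empty value is treated is not shown in this front-end diff; a plausible sketch, stated purely as an assumption, is a fallback to the official endpoint shown in the placeholder:

```typescript
// Assumed fallback behaviour (not part of this diff): an empty base_url keeps
// the official Anthropic endpoint, a custom value routes requests to a
// third-party provider.
const DEFAULT_ANTHROPIC_BASE_URL = 'https://api.anthropic.com/v1';

export function resolveAnthropicBaseUrl(baseUrl?: string): string {
  const trimmed = baseUrl?.trim();
  return trimmed ? trimmed : DEFAULT_ANTHROPIC_BASE_URL;
}
```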

View File

@ -37,6 +37,7 @@ const llmFactoryToUrlMap = {
'https://huggingface.co/docs/text-embeddings-inference/quick_tour',
[LLMFactory.GPUStack]: 'https://docs.gpustack.ai/latest/quickstart',
[LLMFactory.VLLM]: 'https://docs.vllm.ai/en/latest/',
[LLMFactory.TokenPony]: 'https://docs.tokenpony.cn/#/',
};
type LlmFactory = keyof typeof llmFactoryToUrlMap;
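Adding TokenPony to `llmFactoryToUrlMap` registers a documentation URL for it; a factory without an entry simply resolves to `undefined`. A minimal lookup sketch against the map above (how the UI handles the missing case is not shown in this diff):

```typescript
// Lookup sketch; returns undefined for factories with no registered docs URL.
function getFactoryDocUrl(factory: string): string | undefined {
  return (llmFactoryToUrlMap as Record<string, string>)[factory];
}
```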

View File

@ -1,7 +1,10 @@
import { IModalManagerChildrenProps } from '@/components/modal-manager';
import { LlmModelType } from '@/constants/knowledge';
import { useTranslate } from '@/hooks/common-hooks';
import { ISystemModelSettingSavingParams } from '@/hooks/llm-hooks';
import {
ISystemModelSettingSavingParams,
useComposeLlmOptionsByModelTypes,
} from '@/hooks/llm-hooks';
import { Form, Modal, Select } from 'antd';
import { useEffect } from 'react';
import { useFetchSystemModelSettingOnMount } from '../hooks';
@ -43,6 +46,11 @@ const SystemModelSettingModal = ({
const onFormLayoutChange = () => {};
const modelOptions = useComposeLlmOptionsByModelTypes([
LlmModelType.Chat,
LlmModelType.Image2text,
]);
return (
<Modal
title={t('systemModelSettings')}
@ -58,14 +66,7 @@ const SystemModelSettingModal = ({
name="llm_id"
tooltip={t('chatModelTip')}
>
<Select
options={[
...allOptions[LlmModelType.Chat],
...allOptions[LlmModelType.Image2text],
]}
allowClear
showSearch
/>
<Select options={modelOptions} allowClear showSearch />
</Form.Item>
<Form.Item
label={t('embeddingModel')}
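This pairs with the `useComposeLlmOptionsByModelTypes` change earlier in the compare: instead of concatenating the Chat and Image2text option arrays, which could list a model twice when it is registered under both types, the modal now consumes the merged, de-duplicated groups returned by the hook. For contrast, the removed concatenation (names as in the hunk above):

```typescript
// Previous behaviour: plain concatenation could yield duplicate entries for a
// model registered as both Chat and Image2text.
const duplicated = [
  ...allOptions[LlmModelType.Chat],
  ...allOptions[LlmModelType.Image2text],
];
// After the change, useComposeLlmOptionsByModelTypes([Chat, Image2text])
// merges the manufacturer groups and drops repeated model values.
```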

View File

@ -44,6 +44,7 @@ const orderFactoryList = [
LLMFactory.Ollama,
LLMFactory.Xinference,
LLMFactory.Ai302,
LLMFactory.CometAPI,
];
export const sortLLmFactoryListBySpecifiedOrder = (list: IFactory[]) => {