Compare commits


109 Commits

Author SHA1 Message Date
cd0216cce3 Revert "Refa: make RAGFlow more asynchronous 2 (#11664)"
This reverts commit 627c11c429.
2025-12-02 19:34:56 +08:00
962bd5f5df feat: improve Moodle connector functionality (#11665)
### What problem does this PR solve?

Add metadata from the Moodle data source.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 19:12:43 +08:00
627c11c429 Refa: make RAGFlow more asynchronous 2 (#11664)
### What problem does this PR solve?

Make RAGFlow more asynchronous 2. #11551, #11579, #11619.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
- [x] Performance Improvement
2025-12-02 18:57:07 +08:00
4ba17361e9 feat: improve presentation PdfParser (#11639)
The old presentation PdfParser lost table formatting after parsing.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:35:14 +08:00
c946858328 Feat: add mineru auto installer (#11649)
### What problem does this PR solve?

Feat: add mineru auto installer

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:29:26 +08:00
ba6e2af5fd Feat: Delete useless request hooks. #10427 (#11659)
### What problem does this PR solve?

Feat: Delete useless request hooks. #10427

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:24:29 +08:00
2ffe6f7439 Import rag_tokenizer from Infinity (#11647)
### What problem does this PR solve?

- The original rag/nlp/rag_tokenizer.py was moved to Infinity and infinity-sdk
via https://github.com/infiniflow/infinity/pull/3117.
The new rag/nlp/rag_tokenizer.py imports rag_tokenizer from infinity and
inherits from rag_tokenizer.RagTokenizer.

- Bump infinity to 0.6.8

### Type of change
- [x] Refactoring
2025-12-02 14:59:37 +08:00
e3987e21b9 Update upgrade guide: add stop server step and rename section (#11654)
### What problem does this PR solve?

Update upgrade guide: add stop server step and rename section

### Type of change

- [x] Documentation Update
2025-12-02 14:51:03 +08:00
a713f54732 Refa: add MiniMax-M2 and remove deprecated MiniMax models (#11642)
### What problem does this PR solve?

Add MiniMax-M2 and remove deprecated models.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2025-12-02 14:43:44 +08:00
519f03097e Feat: Remove unnecessary dialogue-related code. #10427 (#11652)
### What problem does this PR solve?

Feat: Remove unnecessary dialogue-related code. #10427

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 14:42:28 +08:00
299c655e39 Fix: file manager KB link issue. (#11648)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-02 12:14:27 +08:00
b8c0fb4572 Feat:new api /sequence2txt and update QWenSeq2txt (#11643)
### What problem does this PR solve?
Changes:
- new API /sequence2txt
- update QWenSeq2txt and ZhipuSeq2txt

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 11:17:31 +08:00
d1e172171f Refactor: better describe how to get prefix for sync data source (#11636)
### What problem does this PR solve?

Better describe how to get the prefix for a sync data source.

### Type of change

- [x] Refactoring
2025-12-01 17:46:44 +08:00
81ae6cf78d Feat: support uploading in dialog. (#11634)
### What problem does this PR solve?

#9590

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 16:54:57 +08:00
1120575021 Feat: Files uploaded via the dialog box can be uploaded without binding to a dataset. #9590 (#11630)
### What problem does this PR solve?

Feat: Files uploaded via the dialog box can be uploaded without binding
to a dataset. #9590

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 16:29:02 +08:00
221947acc4 Fix workflows 2025-12-01 15:36:43 +08:00
21d8ffca56 Fix workflows 2025-12-01 14:58:33 +08:00
41cff3e09e Fix: jina embedding issue (#11628)
### What problem does this PR solve?

Fix: jina embedding issue #11614 
Feat: Add jina embedding v4

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 14:24:35 +08:00
b6c4722687 Refa: make RAGFlow more asynchronous (#11601)
### What problem does this PR solve?

Try to make this more asynchronous. Verified in chat and agent
scenarios, reducing blocking behavior. #11551, #11579.

However, the impact of these changes still requires further
investigation to ensure everything works as expected.

### Type of change

- [x] Refactoring
2025-12-01 14:24:06 +08:00
6ea4248bdc Feat: support parent-child in search procedure. (#11629)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 14:03:09 +08:00
88a28212b3 Fix: Table parse method issue. (#11627)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 12:42:35 +08:00
9d0309aedc Fix: [MinerU] Missing output file (#11623)
### What problem does this PR solve?

Add fallbacks for MinerU output path. #11613, #11620.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 12:17:43 +08:00
9a8ce9d3e2 fix: increase Quart RESPONSE_TIMEOUT and BODY_TIMEOUT for slow LLM responses (#11612)
### What problem does this PR solve?

The Quart framework has default RESPONSE_TIMEOUT and BODY_TIMEOUT values of
60 seconds. This causes the frontend chat to hang after exactly 60 seconds
when using slow LLM backends (e.g., Ollama on CPU, or remote APIs with high
latency).

This fix adds configurable timeout settings via environment variables, with
sensible defaults (600 seconds = 10 minutes) to match other timeout
configurations in RAGFlow.

Fixes chat timeout issues when:
- Using local Ollama on CPU (response time ~2 minutes)
- Using remote LLM APIs with high latency
- Processing complex RAG queries with many chunks
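
A minimal sketch of how such env-configurable timeouts can be wired into a Quart app; the environment variable names here are illustrative assumptions, not necessarily the ones used in the PR:

```python
import os

from quart import Quart

app = Quart(__name__)

# Quart reads these config keys when sending responses / receiving request bodies.
# The 600 s default mirrors the 10-minute value described above; override via env vars.
app.config["RESPONSE_TIMEOUT"] = int(os.environ.get("QUART_RESPONSE_TIMEOUT", "600"))
app.config["BODY_TIMEOUT"] = int(os.environ.get("QUART_BODY_TIMEOUT", "600"))
```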

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Grzegorz Sterniczuk <grzegorz@sternicz.uk>
2025-12-01 11:26:34 +08:00
7499608a8b feat: add Redis username support (#11608)
### What problem does this PR solve?

Support for Redis 6+ ACL authentication (username)

close #11606 
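
For illustration, Redis 6+ ACL authentication with a username looks like this from a Python client; the connection values are placeholders, not RAGFlow's actual configuration:

```python
import redis

# ACL-style auth: both username and password are sent with the AUTH command.
r = redis.Redis(host="localhost", port=6379, username="ragflow", password="secret")
r.ping()
```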

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
2025-12-01 11:26:20 +08:00
0ebbb60102 Docs: deploying a local model using Jina not supported (#11624)
### What problem does this PR solve?


### Type of change

- [x] Documentation Update
2025-12-01 11:24:29 +08:00
80f6d22d2a Fix typos (#11607)
### What problem does this PR solve?

Fix typos

### Type of change

- [x] Fix typos
2025-12-01 09:49:46 +08:00
088b049b4c Feature: embedded chat theme (#11581)
### What problem does this PR solve?

This PR closes feature request #11286.
It adds the ability to choose the background theme of the _Full screen
chat_ that is embedded into a webpage.
It looks like this:
<img width="501" height="349" alt="image"
src="https://github.com/user-attachments/assets/e5fdfb14-9ed9-43bb-a40d-4b580985b9d4"
/>

It works similarly to `Locale`, using a URL parameter to set the theme.
If the parameter is invalid, the default theme is used.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Your Name <you@example.com>
2025-12-01 09:49:28 +08:00
fa9b7b259c Feat: create datasets from http api supports ingestion pipeline (#11597)
### What problem does this PR solve?

Feat: create datasets from http api supports ingestion pipeline

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 19:55:24 +08:00
14616cf845 Feat: add child parent chunking method in backend. (#11598)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 19:25:32 +08:00
d2915f6984 Fix: Error 102 "Can't find dialog by ID" when embedding agent with from=agent** #11552 (#11594)
### What problem does this PR solve?

Fix: Error 102 "Can't find dialog by ID" when embedding agent with
from=agent** #11552

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-28 19:05:43 +08:00
ccce8beeeb Feat: Replace antd in the chat message with shadcn. #10427 (#11590)
### What problem does this PR solve?

Feat: Replace antd in the chat message with shadcn. #10427

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 17:15:01 +08:00
3d2e0f1a1b fix: tolerate null mergeable status in tests workflow 2025-11-28 17:09:58 +08:00
918d5a9ff8 [issue-11572]fix:metadata_condition filtering failed (#11573)
### What problem does this PR solve?

When using 'metadata_condition' for metadata filtering, if no documents
matched the filtering criteria, the system returned the search results of
all documents instead of an empty result.

Now, when the metadata_condition has conditions but no matching documents,
an empty result is returned.
#11572
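
A minimal sketch of the intended behavior; the function and field names are illustrative, not the actual RAGFlow code:

```python
def matches(doc_meta: dict, condition: dict) -> bool:
    # Hypothetical predicate: every key/value in the condition must match the doc's metadata.
    return all(doc_meta.get(k) == v for k, v in condition.items())


def filter_docs(docs: list[dict], metadata_condition: dict | None) -> list[dict]:
    if not metadata_condition:
        # No filter configured: search all documents, as before.
        return docs
    matched = [d for d in docs if matches(d, metadata_condition)]
    # If the condition matches nothing, return an empty list instead of
    # silently falling back to all documents.
    return matched
```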

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Chenguang Wang <chenguangwang@deepglint.com>
2025-11-28 14:04:14 +08:00
7d05d4ced7 Fix: Added styles for empty states on the page. #10703 (#11588)
### What problem does this PR solve?

Fix: Added styles for empty states on the page.
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-28 14:03:20 +08:00
dbdda0fbab Feat: optimize meta filter generation for better structure handling (#11586)
### What problem does this PR solve?

optimize meta filter generation for better structure handling

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 13:30:53 +08:00
cf7fdd274b Feat: add gmail connector (#11549)
### What problem does this PR solve?


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 13:09:40 +08:00
982ed233a2 Fix: doc_aggs not correctly returned when no chunks retrieved. (#11578)
### What problem does this PR solve?

Fix: doc_aggs not correctly returned when no chunks retrieved.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-28 13:09:05 +08:00
1f96c95b42 update new models for tokenpony (#11571)
update new models for TokenPony

Co-authored-by: huangzl <huangzl@shinemo.com>
2025-11-28 12:10:04 +08:00
8604c4f57c Feat: add GPT-5.1, GPT‑5.1 Instant and Claude-Opus-4.5 (#11559)
### What problem does this PR solve?

Add GPT-5.1, GPT‑5.1 Instant and Claude-Opus-4.5. #11548

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 17:59:17 +08:00
a674338c21 Fix: remove garbage filtering rules (#11567)
### What problem does this PR solve?
change:

remove garbage filtering rules

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 17:54:49 +08:00
89d82ff031 Feat: Delete useless knowledge base, chat, and search files. #10427 (#11568)
### What problem does this PR solve?

Feat: Delete useless knowledge base, chat, and search files.  #10427
### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 17:54:27 +08:00
c71d25f744 Fix: enable structured output for agent with tool (#11558)
### What problem does this PR solve?

issue:
[#11541](https://github.com/infiniflow/ragflow/issues/11541)
change:
enable structured output for agent with tool

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 16:00:56 +08:00
f57f32cf3a Feat: Add loop operator node. #10427 (#11449)
### What problem does this PR solve?

Feat: Add loop operator node. #10427

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 15:55:46 +08:00
b6314164c5 Feat:new component Loop (#11447)
### What problem does this PR solve?
issue:
#10427
change: 
new component Loop

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 15:55:32 +08:00
856201c0f2 Fix ft_title_rag_fine (#11555)
### What problem does this PR solve?

Fix ft_title_rag_fine

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 10:26:08 +08:00
9d8b96c1d0 Feat: add context for figure and table (#11547)
### What problem does this PR solve?

Add context for figures and tables.



![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e)


`==================()` is for demonstration purposes.
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 10:21:44 +08:00
7c3c185038 Minor style changes (#11554)
### What problem does this PR solve?

### Type of change


- [ ] Documentation Update
2025-11-27 09:42:06 +08:00
a9259917c6 fix(files): replace hard coded status codes with constants (#11544)
### What problem does this PR solve?

Solves the error reporting problem caused by type errors when various kinds
of exception responses are triggered, by replacing the hard-coded status
codes with constants.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 09:41:24 +08:00
8c28587821 Fix issue where HTML file parsing may lose content. (#11536)
### What problem does this PR solve?

##### Problem Description
When parsing HTML files, some page content may be lost.  
For example, text inside nested `<font>` tags within multiple `<div>`
elements (e.g.,
`<div><font>Text_1</font></div><div><font>Text_2</font></div>`) fails to
be preserved correctly.

###### Root Cause #1: Block ID propagation is interrupted
1. **Block ID generation**: When the parser encounters a `<div>`, it
generates a new `block_id` because `<div>` belongs to `BLOCK_TAGS`.
2. **Recursive processing**: This `block_id` is passed down recursively
to process the `<div>`’s child nodes.
3. **Interruption occurs**: When processing a child `<font>` tag, the
code enters the `else` branch of `read_text_recursively` (since `<font>`
is a Tag).
4. **Bug location**: The first line in this `else` branch explicitly
sets **`block_id = None`**.
- This discards the valid `block_id` inherited from the parent `<div>`.
- Since `<font>` is not in `BLOCK_TAGS`, it does not generate a new
`block_id`, so it passes `None` to its child text nodes.
5. **Consequence**: The extracted text nodes have an empty `block_id` in
their `metadata`. During the subsequent `merge_block_text` step, these
texts cannot be correctly associated with their original `<div>` block
due to the missing ID. As a result, all text from `<font>` tags gets
merged together, which then triggers a second issue during
concatenation.
6. **Solution:** Remove the forced reset of `block_id` to `None`. When
the current tag (e.g., `<font>`) is not a block-level element, it should
inherit the `block_id` passed down from its parent. This ensures
consistent ownership across the hierarchy: `div` → `font` → `text`.

###### Root Cause #2: Data loss during text concatenation
1. The line `current_content += (" " if current_content else "" +
content)` has a misplaced parenthesis. When `current_content` is
non-empty (`True`):
    - The ternary expression evaluates to `" "` (a single space).
    - The code executes `current_content += " "`.
- **Result**: Only a space is appended—**the new `content` string is
completely discarded**.
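
The parenthesis bug in Root Cause #2 can be reproduced in isolation (simplified, not the actual parser code):

```python
current_content = "Text_1"
content = "Text_2"

# Buggy: "+" binds tighter than the conditional, so the expression parses as
# (" " if current_content else ("" + content)) and only a space is appended.
buggy = current_content
buggy += " " if buggy else "" + content
assert buggy == "Text_1 "

# Fixed: parenthesize the conditional, then append the content.
fixed = current_content
fixed += (" " if fixed else "") + content
assert fixed == "Text_1 Text_2"
```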

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 09:40:10 +08:00
12979a3f21 feat: improve metadata handling in connector service (#11421)
### What problem does this PR solve?

- Update sync data source to handle metadata properly

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-26 19:55:48 +08:00
376eb15c63 Fix: Refactoring and enhancing the functionality of the delete confirmation dialog component #10703 (#11542)
### What problem does this PR solve?

Fix: Refactoring and enhancing the functionality of the delete
confirmation dialog component

- Refactoring and enhancing the functionality of the delete confirmation
dialog component
- Modifying the style of the user center

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-26 19:49:21 +08:00
89ba7abe30 Check if PR is mergeable at first step 2025-11-26 19:26:33 +08:00
2fd5ac1031 Feat: Add Webdav storage as data source (#11422)
### What problem does this PR solve?

This PR adds webdav storage as data source for data sync service.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-26 14:14:42 +08:00
40e84ca41a Use Infinity single-field-multi-index (#11444)
### What problem does this PR solve?

Use Infinity single-field-multi-index

### Type of change

- [x] Refactoring
- [x] Performance Improvement
2025-11-26 11:06:37 +08:00
a28c672695 Bump infinity to 0.6.7 (#11528)
### What problem does this PR solve?

Bump infinity to 0.6.7
### Type of change

- [x] Refactoring
2025-11-26 10:28:31 +08:00
74e0b58d89 Fix: excel default optimization. (#11519)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:20 +08:00
7c20c964b4 Fix: incorrect image merging for naive markdown parser (#11520)
### What problem does this PR solve?

Fix incorrect image merging for naive markdown parser. #9349 


[ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:06 +08:00
5d0981d046 Refactoring: Integrating the file preview component (#11523)
### What problem does this PR solve?

Refactoring: Integrating the file preview component

### Type of change

- [x] Refactoring
2025-11-25 19:13:00 +08:00
a793dd2ea8 Feat: add addressing style config for S3-compatible storage (#11510)
### Type of change
* [x]  New Feature (non-breaking change which adds functionality)


Add support for Virtual Hosted Style and Path Style URL addressing in
S3_COMPATIBLE storage connector. Default to Virtual Hosted Style for
better compatibility with COS and other S3-compatible services.

- Add addressing_style field to credentials (virtual/path)
- Update frontend form with selection dropdown
- Add validation and tooltips for S3 Compatible endpoint URL
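
As an illustration, this is roughly how an addressing style is applied with boto3; the endpoint URL and credentials below are placeholders, and the connector's actual wiring may differ:

```python
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://cos.example-region.myqcloud.com",  # placeholder endpoint
    aws_access_key_id="AKIA...",                              # placeholder credentials
    aws_secret_access_key="...",
    config=Config(s3={"addressing_style": "virtual"}),        # or "path"
)
```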

<img width="703" height="875" alt="image"
src="https://github.com/user-attachments/assets/af5ba7ca-f160-47fa-8ba1-32eace8f5fdf"
/>

<img width="1620" height="788" alt="image"
src="https://github.com/user-attachments/assets/6012b5ce-8bcb-478e-a9cb-425f886d5046"
/>

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-25 16:24:14 +08:00
915e385244 Fix: uv lock updates (#11511)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 16:01:12 +08:00
7a344a32f9 Fix: code exec component vulnerability and add support for nested list and dict object (#11504)
### What problem does this PR solve?

Fix code exec component vulnerability and add support for nested list
and dict object.

<img width="1491" height="952" alt="image"
src="https://github.com/user-attachments/assets/ec2de4e3-0919-413d-abe6-d19431292f14"
/>

Return a single value:

<img width="1156" height="719" alt="image"
src="https://github.com/user-attachments/assets/baa35caa-e27c-4064-a9f9-4c0af9a3d5b8"
/>


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 14:35:41 +08:00
8c1ee3845a Chore(deps): Bump pypdf from 6.0.0 to 6.4.0 (#11505)
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.0.0 to 6.4.0.
Release notes (sourced from [pypdf's releases](https://github.com/py-pdf/pypdf/releases)):

**Version 6.4.0, 2025-11-23**
- Security (SEC): Reduce default limit for LZW decoding by @stefan6419846
- New Features (ENH): Parse and format comb fields in text widget annotations (#3519) by @PJBrs
- Robustness (ROB): Silently ignore Adobe Ascii85 whitespace for suffix detection (#3528) by @mbierma
- [Full Changelog](https://github.com/py-pdf/pypdf/compare/6.3.0...6.4.0)

**Version 6.3.0, 2025-11-16**
- New Features (ENH): Wrap and align text in flattened PDF forms (#3465) by @PJBrs
- Bug Fixes (BUG): Fix missing "PreventGC" when cloning (#3520) by @patrick91
- Bug Fixes (BUG): Preserve JPEG image quality by default (#3516) by @Lucas-C
- [Full Changelog](https://github.com/py-pdf/pypdf/compare/6.2.0...6.3.0)

**Version 6.2.0, 2025-11-09**
- New Features (ENH): Add 'strict' parameter to PDFWriter (#3503) by @Arya-A-Nair
- Bug Fixes (BUG): PdfWriter.append fails when there are articles being None (#3509) by @Noah-Houghton
- Documentation (DOC): Execute docs examples in CI (#3507) by @ievgen-kapinos
- [Full Changelog](https://github.com/py-pdf/pypdf/compare/6.1.3...6.2.0)

**Version 6.1.3, 2025-10-22**
- Security (SEC): Allow limiting size of LZWDecode streams (#3502) by @stefan6419846
- Security (SEC): Avoid infinite loop when reading broken DCT-based inline images (#3501) by @stefan6419846
- Bug Fixes (BUG): PageObject.scale() scales media box incorrectly (#3489) by @Nid01
- Robustness (ROB): Fail with explicit exception when image mode is an empty array (#3500)
- [Full Changelog](https://github.com/py-pdf/pypdf/compare/6.1.2...6.1.3)

... (truncated)

Additional commits are viewable in the [compare view](https://github.com/py-pdf/pypdf/compare/6.0.0...6.4.0).



---


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-25 14:26:43 +08:00
8c751d5afc Feat: support operator in/not in for metadata filter. #11376 #11378 (#11506)
### What problem does this PR solve?

Feat: support operator in/not in for metadata filter.  #11376 #11378
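
A sketch of how in/not in operators might be evaluated against document metadata; the operator spelling and field layout are assumptions, not the documented API schema:

```python
def match(doc_meta: dict, field: str, op: str, values: list) -> bool:
    value = doc_meta.get(field)
    if op == "in":
        return value in values
    if op == "not in":
        return value not in values
    raise ValueError(f"unsupported operator: {op}")


# Example: keep only docs whose "department" is one of the allowed values.
docs = [{"department": "legal"}, {"department": "sales"}]
allowed = [d for d in docs if match(d, "department", "in", ["legal", "hr"])]
```
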
### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 14:25:32 +08:00
f5faf0c94f Feat: support operator in/not in for metadata filter. (#11503)
### What problem does this PR solve?

#11376 #11378

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 12:44:26 +08:00
af72e8dc33 Fix: Modify the style of your personal center #10703 (#11487)
### What problem does this PR solve?

Modify the style of your personal center
Add resizable component

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 11:17:39 +08:00
bcd70affb5 Fix: unexpected parameter. (#11497)
### What problem does this PR solve?

#11489

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 11:17:27 +08:00
6987e9f23b Fix: After saving the model parameters of the chat page, the parameter disappears. #11500 (#11501)
### What problem does this PR solve?

Fix: After saving the model parameters of the chat page, the parameter
disappears. #11500

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 11:17:13 +08:00
41665b0865 Refactor: Email parser use with to handle buffer (#11496)
### What problem does this PR solve?
The email parser now uses a `with` statement to handle the buffer.

### Type of change

- [x] Refactoring
2025-11-25 10:03:37 +08:00
d1744aaaf3 Feat: add datasource Dropbox (#11488)
### What problem does this PR solve?

Add datasource Dropbox.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 09:40:03 +08:00
d5f8548200 Allow create super user when start rag server. (#10634)
### What problem does this PR solve?

New options for the RAG server scripts to create the super admin user when
starting the server.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2025-11-24 19:02:08 +08:00
4d8698624c Docs: Updated use_kg and toc_enhance switch descriptions (#11485)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2025-11-24 17:38:04 +08:00
1009819801 Fix: coroutine object has no attribute get (#11472)
### What problem does this PR solve?

Fix: coroutine object has no attribute get
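
This class of error comes from using the result of an async call without awaiting it; a minimal illustration (not the actual RAGFlow code):

```python
import asyncio


async def fetch_config() -> dict:
    return {"top_k": 6}


async def main():
    # Bug pattern: calling an async function without `await` returns a coroutine
    # object, and coroutine objects have no .get():
    #   cfg = fetch_config()
    #   cfg.get("top_k")  # AttributeError: 'coroutine' object has no attribute 'get'
    cfg = await fetch_config()  # fix: await before treating the result as a dict
    print(cfg.get("top_k"))


asyncio.run(main())
```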

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-24 12:21:33 +08:00
8fe782f4ea Fix:Modify the personal center style #10703 (#11470)
### What problem does this PR solve?

Fix:Modify the personal center style #10703

- All form-label font styles are no longer bold
- Menus are not highlighted on first visit to the personal center

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-24 12:20:48 +08:00
7140950e93 Feat: Implement temporary conversation removal logic in ConversationD… (#11454)
### What problem does this PR solve?

Implement temporary conversation removal logic in ConversationDropDown

Before modification:

<img width="2120" height="1034" alt="图片"
src="https://github.com/user-attachments/assets/21cf0a92-5660-401c-8b4c-31d85ec800f0"
/>

After modification:

<img width="2120" height="1034" alt="图片"
src="https://github.com/user-attachments/assets/0a3fffa5-dc9a-4af9-a3c6-c2e976e4bd6b"
/>
<img width="2120" height="1034" alt="图片"
src="https://github.com/user-attachments/assets/45473971-ba83-43e0-8941-64a5c6f552a2"
/>


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-24 10:27:22 +08:00
0181747881 Fix nginx startup failure in HTTPS mode (host not found) (#11455)
### Description
This PR fixes a bug where Nginx fails to start when using the
`ragflow.https.conf` configuration. The upstream host `ragflow` was not
resolving correctly inside the container context, causing an `[emerg]
host not found` error.

### Changes
- Updated `docker/nginx/ragflow.https.conf`: Changed upstream host from
`ragflow` to `localhost` for both the admin API and the main API.

### Related Issue
Fixes #11453

### Testing
- [x] Enabled HTTPS config in Docker.
- [x] Verified Nginx starts successfully without "host not found"
errors.
- [x] Verified API accessibility.
2025-11-24 10:21:27 +08:00
3c41159d26 Update logging for auto-generated SECRET_KEY (#11458)
Remove the code that exposes the generated key in the log, as it poses a
security risk.
 
<img width="1170" height="269" alt="image"
src="https://github.com/user-attachments/assets/03c42516-af1a-49a4-ade2-4ef3ee4b3cdd"
/>

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-24 10:21:06 +08:00
e0e1d04da5 Bump beartype to 0.22.6 (#11463)
### What problem does this PR solve?

Bump beartype to 0.22.6

### Type of change

- [x] Refactoring
2025-11-22 11:56:43 +08:00
f0a14f5fce Add Moodle data source integration (#11325)
### What problem does this PR solve?

This PR adds a native Moodle connector to sync content (courses,
resources, forums, assignments, pages, books) into RAGFlow.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-21 19:58:49 +08:00
174a2578e8 Feat: add auth header for Ollama chat model (#11452)
### What problem does this PR solve?

Add auth header for Ollama chat model. #11350

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-21 19:47:06 +08:00
a0959b9d38 Fix:Resolves the issue of sessions not being saved when the variable is array<object>. (#11446)
### What problem does this PR solve?

Fix:Resolves the issue of sessions not being saved when the variable is
array<object>.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-21 17:20:26 +08:00
13299197b8 Feat: Enable logical operators in metadata. #11387 #11376 (#11442)
### What problem does this PR solve?

Feat: Enable logical operators in metadata. #11387  #11376
### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-11-21 16:21:27 +08:00
249296e417 Feat: API supports toc_enhance. (#11437)
### What problem does this PR solve?

Close #11433

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-21 14:51:58 +08:00
db0f6840d9 Feat: ignore chunk size when using custom delimiters (#11434)
### What problem does this PR solve?

Ignore chunk size when using custom delimiter.
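
A rough sketch of the described behavior (illustrative only, not the RAGFlow chunker):

```python
def chunk(text: str, chunk_size: int, delimiter: str | None = None) -> list[str]:
    if delimiter:
        # Custom delimiter set: split purely on it and ignore the size budget.
        return [part for part in text.split(delimiter) if part.strip()]
    # Otherwise fall back to size-based chunking.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```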

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-21 14:36:26 +08:00
1033a3ae26 Fix: improve PDF text type detection by expanding regex content (#11432)
- Add whitespace validation to the PDF English text checking regex
- Reduce false negatives in English PDF content recognition

### What problem does this PR solve?

The core idea is to **expand the regex content used for English text
detection** so it can accommodate more valid characters commonly found
in English PDFs. The modifications include:

- Adding support for **space** in the regex.
- Ensuring the update does not reduce existing detection accuracy.
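
For illustration, the idea is to include whitespace in the character class used to decide whether extracted PDF text "looks English"; the pattern below is a hypothetical example, not the regex actually used in RAGFlow:

```python
import re

# Letters, digits, common punctuation, and (newly) spaces/tabs count as English-like.
english_like = re.compile(r"^[A-Za-z0-9 \t,.;:'\"!?()\-]+$")

print(bool(english_like.match("Figure1shows")))                 # True even without spaces
print(bool(english_like.match("Figure 1 shows the results")))   # True once spaces are allowed
```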

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-21 14:33:29 +08:00
1845daf41f Fix: UI adjustments, replacing private components with public components (#11438)
### What problem does this PR solve?

Fix: UI adjustments, replacing private components with public components

- UI adjustments for public components (input, multiselect,
SliderInputFormField)

- Replacing the private LlmSettingFieldItems component in search with
the public LlmSettingFieldItems component


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-21 14:32:50 +08:00
4c8f9f0d77 Feat: Add a loop variable to the loop operator. #10427 (#11423)
### What problem does this PR solve?

Feat: Add a loop variable to the loop operator. #10427

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-11-21 10:11:38 +08:00
cc00c3ec93 <Input> component horizontal padding adjustment (#11418)
### What problem does this PR solve?

- Give the <Input> component suitable horizontal padding when it has a
prefix or suffix icon
- Slightly change the visual effect of <ThemeSwitch> in the admin UI

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-21 09:58:55 +08:00
653b785958 Fix: Modify the style of the user center #10703 (#11419)
### What problem does this PR solve?

Fix: Modify the style of the user center

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-21 09:33:50 +08:00
971c1bcba7 Fix: missing parameters in by_plaintext method for PDF naive mode (#11408)
### What problem does this PR solve?

Fix: missing parameters in the by_plaintext method for PDF naive mode

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: lih <dev_lih@139.com>
2025-11-21 09:33:36 +08:00
065917bf1c Feat: enriches Notion connector (#11414)
### What problem does this PR solve?

Enriches rich text (links, mentions, equations), flags to-do blocks with
[x]/[ ], captures block-level equations, builds table HTML, downloads
attachments.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 19:51:37 +08:00
820934fc77 Fix: no result if metadata returns none. (#11412)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 19:51:25 +08:00
d3d2ccc76c Feat: add more chunking method (#11413)
### What problem does this PR solve?

Feat: add more chunking methods #11311

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 19:07:17 +08:00
c8ab9079b3 Fix:improve multi-column document detection (#11415)
### What problem does this PR solve?

change:
improve multi-column document detection

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 19:00:38 +08:00
0d5589bfda Feat: Outputs data is directly synchronized to the canvas without going through the form. #10427 (#11406)
### What problem does this PR solve?

Feat: Outputs data is directly synchronized to the canvas without going
through the form. #10427

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 15:35:28 +08:00
b846a0f547 Fix: incorrect retrieval total count with pagination enabled (#11400)
### What problem does this PR solve?

Incorrect retrieval total count with pagination enabled.
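
The pitfall this fix addresses is counting after slicing; a simplified sketch (not the actual retrieval code):

```python
def paginate(matches: list, page: int, page_size: int) -> tuple[list, int]:
    total = len(matches)  # count the full match set before slicing the page
    start = (page - 1) * page_size
    return matches[start:start + page_size], total
```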

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 15:35:09 +08:00
69578ebfce Fix: Change package-lock.json (#11407)
### What problem does this PR solve?

Fix: Change package-lock.json

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 15:32:41 +08:00
06cef71ba6 Feat: add or logic operations for meta data filters. (#11404)
### What problem does this PR solve?

#11376 #11387

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 14:31:12 +08:00
d2b1da0e26 Fix: Optimize edge check & incorrect parameter usage (#11396)
### What problem does this PR solve?

Fix: incorrect parameter usage #8084
Fix: Optimize edge check #10851

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 12:49:47 +08:00
7c6d30f4c8 Fix:RagFlow not starting with Postgres DB (#11398)
### What problem does this PR solve?
issue:
#11293 
change:
RagFlow not starting with Postgres DB
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 12:49:13 +08:00
ea0352ee4a Fix: Introducing a new JSON editor (#11401)
### What problem does this PR solve?

Fix: Introducing a new JSON editor

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 12:44:32 +08:00
fa5cf10f56 Bump infinity to 0.6.6 (#11399)
Bump infinity to 0.6.6

- [x] Refactoring
2025-11-20 11:23:54 +08:00
3fe71ab7dd Use array syntax for commands in docker-compose-base.yml (#11391)
Use array syntax in order to prevent parameter quoting issues. This also
runs the command directly without a bash process, which means signals
(like SIGTERM) will be delivered directly to the server process.

Fixes issue #11390

### What problem does this PR solve?

`${REDIS_PASSWORD}` was not passed correctly, meaning if it was unset or
contains spaces (or shell code!) it was interpreted wrongly.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 10:14:56 +08:00
9f715d6bc2 Feature (canvas): Add mind tagging support (#11359)
### What problem does this PR solve?
Resolve the issue of missing thinking labels when viewing pre-existing
conversations
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 10:11:28 +08:00
48de3b26ba locale en add russian language option (#11392)
### What problem does this PR solve?
Add Russian language option.

### Type of change


- [x] Other (please describe):
2025-11-20 10:10:51 +08:00
273c4bc4d3 Locale: update russian language (#11393)
### What problem does this PR solve?


### Type of change


- [x] Documentation Update
- [x] Other (please describe):
2025-11-20 10:10:39 +08:00
420c97199a Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041)
### What problem does this PR solve?

- Added TCADP Parser configuration fields to PDF, PPT, and spreadsheet
parsing forms
- Implemented support for setting table result type (Markdown/HTML) and
Markdown image response type (URL/Text)
- Updated TCADP Parser to handle return format settings from
configuration or parameters
- Enhanced frontend to dynamically show TCADP options based on selected
parsing method
- Modified backend to pass format parameters when calling TCADP API
- Optimized form default value logic for TCADP configuration items
- Updated multilingual resource files for new configuration options

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 10:08:42 +08:00
ecf0322165 fix(llm): handle None response in total_token_count_from_response (#10941)
### What problem does this PR solve?

Fixes #10933

This PR fixes a `TypeError` in the Gemini model provider where the
`total_token_count_from_response()` function could receive a `None`
response object, causing the error:

TypeError: argument of type 'NoneType' is not iterable

**Root Cause:**
The function attempted to use the `in` operator to check dictionary keys
(lines 48, 54, 60) without first validating that `resp` was not `None`.
When Gemini's `chat_streamly()` method returns `None`, this triggers the
error.

**Solution:**
1. Added a null check at the beginning of the function to return `0` if
`resp is None`
2. Added `isinstance(resp, dict)` checks before all `in` operations to
ensure type safety
3. This defensive programming approach prevents the TypeError while
maintaining backward compatibility

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Changes Made

**File:** `rag/utils/__init__.py`

- Line 36-38: Added `if resp is None: return 0` check
- Line 52: Added `isinstance(resp, dict)` before `'usage' in resp`
- Line 58: Added `isinstance(resp, dict)` before `'usage' in resp`  
- Line 64: Added `isinstance(resp, dict)` before `'meta' in resp`
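
A simplified sketch of the defensive checks described above; the dictionary layouts are illustrative, and the real function lives in `rag/utils/__init__.py`:

```python
def total_token_count_from_response(resp) -> int:
    if resp is None:
        return 0  # a None response carries no usable usage info
    if isinstance(resp, dict) and "usage" in resp:
        return int(resp["usage"].get("total_tokens", 0))
    if isinstance(resp, dict) and "meta" in resp:
        # Illustrative fallback for providers that report usage under "meta".
        return int(resp["meta"].get("tokens", 0))
    return 0
```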

### Testing

- [x] Code compiles without errors
- [x] Follows existing code style and conventions
- [x] Change is minimal and focused on the specific issue

### Additional Notes

This fix ensures robust handling of various response types from LLM
providers, particularly Gemini, w

---------

Signed-off-by: Zhang Zhefang <zhangzhefang@example.com>
2025-11-20 10:04:03 +08:00
38234aca53 feat: add OceanBase doc engine (#11228)
### What problem does this PR solve?

Add OceanBase doc engine. Close #5350

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 10:00:14 +08:00
1c06ec39ca fix cohere rerank base_url default (#11353)
### What problem does this PR solve?

**Cohere rerank base_url default handling**

- Background: When no rerank base URL is configured, the settings
pipeline was passing an empty string through RERANK_CFG →
TenantLLMService → CoHereRerank, so the Cohere client received
base_url="" and produced “missing protocol” errors during rerank calls.

- What changed: The CoHereRerank constructor now only forwards base_url
to the Cohere client when it isn’t empty/whitespace, causing the client
to fall back to its default API endpoint otherwise.

- Why it matters: This prevents invalid URL construction in the rerank
workflow and keeps tests/sanity checks that rely on the default Cohere
endpoint from failing when no custom base URL is specified.
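
A sketch of the guard described above (illustrative; the real change lives in the CoHereRerank constructor):

```python
import cohere


def make_cohere_client(api_key: str, base_url: str | None = None) -> cohere.Client:
    kwargs = {"api_key": api_key}
    if base_url and base_url.strip():
        kwargs["base_url"] = base_url  # only forward a non-empty URL
    # With no base_url, the client falls back to Cohere's default API endpoint.
    return cohere.Client(**kwargs)
```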

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>
2025-11-20 09:46:39 +08:00
615 changed files with 20202 additions and 25630 deletions

@@ -12,7 +12,7 @@ on:
  # The only difference between pull_request and pull_request_target is the context in which the workflow runs:
  # — pull_request_target workflows use the workflow files from the default branch, and secrets are available.
  # — pull_request workflows use the workflow files from the pull request branch, and secrets are unavailable.
- pull_request_target:
+ pull_request:
  types: [ synchronize, ready_for_review ]
  paths-ignore:
  - 'docs/**'
@@ -31,7 +31,7 @@ jobs:
  name: ragflow_tests
  # https://docs.github.com/en/actions/using-jobs/using-conditions-to-control-job-execution
  # https://github.com/orgs/community/discussions/26261
- if: ${{ github.event_name != 'pull_request_target' || contains(github.event.pull_request.labels.*.name, 'ci') }}
+ if: ${{ github.event_name != 'pull_request' || (github.event.pull_request.draft == false && contains(github.event.pull_request.labels.*.name, 'ci')) }}
  runs-on: [ "self-hosted", "ragflow-test" ]
  steps:
  # https://github.com/hmarr/debug-action
@@ -53,7 +53,7 @@ jobs:
  - name: Check workflow duplication
  if: ${{ !cancelled() && !failure() }}
  run: |
- if [[ ${GITHUB_EVENT_NAME} != "pull_request_target" && ${GITHUB_EVENT_NAME} != "schedule" ]]; then
+ if [[ ${GITHUB_EVENT_NAME} != "pull_request" && ${GITHUB_EVENT_NAME} != "schedule" ]]; then
  HEAD=$(git rev-parse HEAD)
  # Find a PR that introduced a given commit
  gh auth login --with-token <<< "${{ secrets.GITHUB_TOKEN }}"
@@ -78,7 +78,7 @@ jobs:
  fi
  fi
  fi
- elif [[ ${GITHUB_EVENT_NAME} == "pull_request_target" ]]; then
+ elif [[ ${GITHUB_EVENT_NAME} == "pull_request" ]]; then
  PR_NUMBER=${{ github.event.pull_request.number }}
  PR_SHA_FP=${RUNNER_WORKSPACE_PREFIX}/artifacts/${GITHUB_REPOSITORY}/PR_${PR_NUMBER}
  # Calculate the hash of the current workspace content
@@ -98,7 +98,7 @@ jobs:
  - name: Check comments of changed Python files
  if: ${{ false }}
  run: |
- if [[ ${{ github.event_name }} == 'pull_request_target' ]]; then
+ if [[ ${{ github.event_name }} == 'pull_request' || ${{ github.event_name }} == 'pull_request_target' ]]; then
  CHANGED_FILES=$(git diff --name-only ${{ github.event.pull_request.base.sha }}...${{ github.event.pull_request.head.sha }} \
  | grep -E '\.(py)$' || true)
@@ -193,7 +193,7 @@ jobs:
  echo "HOST_ADDRESS=http://host.docker.internal:${SVR_HTTP_PORT}" >> ${GITHUB_ENV}
  sudo docker compose -f docker/docker-compose.yml -p ${GITHUB_RUN_ID} up -d
- uv sync --python 3.10 --only-group test --no-default-groups --frozen && uv pip install sdk/python
+ uv sync --python 3.10 --only-group test --no-default-groups --frozen && uv pip install sdk/python --group test
  - name: Run sdk tests against Elasticsearch
  run: |

@@ -86,7 +86,7 @@ Try our demo at [https://demo.ragflow.io](https://demo.ragflow.io).
  ## 🔥 Latest Updates
  - 2025-11-19 Supports Gemini 3 Pro.
- - 2025-11-12 Supports data synchronization from Confluence, AWS S3, Discord, Google Drive.
+ - 2025-11-12 Supports data synchronization from Confluence, S3, Notion, Discord, Google Drive.
  - 2025-10-23 Supports MinerU & Docling as document parsing methods.
  - 2025-10-15 Supports orchestrable ingestion pipeline.
  - 2025-08-08 Supports OpenAI's latest GPT-5 series models.
@@ -194,7 +194,7 @@ releases! 🌟
  # git checkout v0.22.1
  # Optional: use a stable tag (see releases: https://github.com/infiniflow/ragflow/releases)
- # This steps ensures the **entrypoint.sh** file in the code matches the Docker image version.
+ # This step ensures the **entrypoint.sh** file in the code matches the Docker image version.
  # Use CPU for DeepDoc tasks:
  $ docker compose -f docker-compose.yml up -d

@@ -86,7 +86,7 @@ Coba demo kami di [https://demo.ragflow.io](https://demo.ragflow.io).
  ## 🔥 Pembaruan Terbaru
  - 2025-11-19 Mendukung Gemini 3 Pro.
- - 2025-11-12 Mendukung sinkronisasi data dari Confluence, AWS S3, Discord, Google Drive.
+ - 2025-11-12 Mendukung sinkronisasi data dari Confluence, S3, Notion, Discord, Google Drive.
  - 2025-10-23 Mendukung MinerU & Docling sebagai metode penguraian dokumen.
  - 2025-10-15 Dukungan untuk jalur data yang terorkestrasi.
  - 2025-08-08 Mendukung model seri GPT-5 terbaru dari OpenAI.

@@ -67,7 +67,7 @@
  ## 🔥 最新情報
  - 2025-11-19 Gemini 3 Proをサポートしています
- - 2025-11-12 Confluence、AWS S3、Discord、Google Drive からのデータ同期をサポートします。
+ - 2025-11-12 Confluence、S3、Notion、Discord、Google Drive からのデータ同期をサポートします。
  - 2025-10-23 ドキュメント解析方法として MinerU と Docling をサポートします。
  - 2025-10-15 オーケストレーションされたデータパイプラインのサポート。
  - 2025-08-08 OpenAI の最新 GPT-5 シリーズモデルをサポートします。

@@ -68,7 +68,7 @@
  ## 🔥 업데이트
  - 2025-11-19 Gemini 3 Pro를 지원합니다.
- - 2025-11-12 Confluence, AWS S3, Discord, Google Drive에서 데이터 동기화를 지원합니다.
+ - 2025-11-12 Confluence, S3, Notion, Discord, Google Drive에서 데이터 동기화를 지원합니다.
  - 2025-10-23 문서 파싱 방법으로 MinerU 및 Docling을 지원합니다.
  - 2025-10-15 조정된 데이터 파이프라인 지원.
  - 2025-08-08 OpenAI의 최신 GPT-5 시리즈 모델을 지원합니다.

@@ -87,7 +87,7 @@ Experimente nossa demo em [https://demo.ragflow.io](https://demo.ragflow.io).
  ## 🔥 Últimas Atualizações
  - 19-11-2025 Suporta Gemini 3 Pro.
- - 12-11-2025 Suporta a sincronização de dados do Confluence, AWS S3, Discord e Google Drive.
+ - 12-11-2025 Suporta a sincronização de dados do Confluence, S3, Notion, Discord e Google Drive.
  - 23-10-2025 Suporta MinerU e Docling como métodos de análise de documentos.
  - 15-10-2025 Suporte para pipelines de dados orquestrados.
  - 08-08-2025 Suporta a mais recente série GPT-5 da OpenAI.

@@ -86,7 +86,7 @@
  ## 🔥 近期更新
  - 2025-11-19 支援 Gemini 3 Pro.
- - 2025-11-12 支援從 Confluence、AWS S3、Discord、Google Drive 進行資料同步。
+ - 2025-11-12 支援從 Confluence、S3、Notion、Discord、Google Drive 進行資料同步。
  - 2025-10-23 支援 MinerU 和 Docling 作為文件解析方法。
  - 2025-10-15 支援可編排的資料管道。
  - 2025-08-08 支援 OpenAI 最新的 GPT-5 系列模型。

@@ -86,7 +86,7 @@
  ## 🔥 近期更新
  - 2025-11-19 支持 Gemini 3 Pro.
- - 2025-11-12 支持从 Confluence、AWS S3、Discord、Google Drive 进行数据同步。
+ - 2025-11-12 支持从 Confluence、S3、Notion、Discord、Google Drive 进行数据同步。
  - 2025-10-23 支持 MinerU 和 Docling 作为文档解析方法。
  - 2025-10-15 支持可编排的数据管道。
  - 2025-08-08 支持 OpenAI 最新的 GPT-5 系列模型。

@@ -8,7 +8,7 @@ readme = "README.md"
  requires-python = ">=3.10,<3.13"
  dependencies = [
  "requests>=2.30.0,<3.0.0",
- "beartype>=0.18.5,<0.19.0",
+ "beartype>=0.20.0,<1.0.0",
  "pycryptodomex>=3.10.0",
  "lark>=1.1.0",
  ]

admin/client/uv.lock (generated, 298 lines added)
View File

@ -0,0 +1,298 @@
version = 1
revision = 3
requires-python = ">=3.10, <3.13"
[[package]]
name = "beartype"
version = "0.22.6"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/88/e2/105ceb1704cb80fe4ab3872529ab7b6f365cf7c74f725e6132d0efcf1560/beartype-0.22.6.tar.gz", hash = "sha256:97fbda69c20b48c5780ac2ca60ce3c1bb9af29b3a1a0216898ffabdd523e48f4", size = 1588975, upload-time = "2025-11-20T04:47:14.736Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/98/c9/ceecc71fe2c9495a1d8e08d44f5f31f5bca1350d5b2e27a4b6265424f59e/beartype-0.22.6-py3-none-any.whl", hash = "sha256:0584bc46a2ea2a871509679278cda992eadde676c01356ab0ac77421f3c9a093", size = 1324807, upload-time = "2025-11-20T04:47:11.837Z" },
]
[[package]]
name = "certifi"
version = "2025.11.12"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/a2/8c/58f469717fa48465e4a50c014a0400602d3c437d7c0c468e17ada824da3a/certifi-2025.11.12.tar.gz", hash = "sha256:d8ab5478f2ecd78af242878415affce761ca6bc54a22a27e026d7c25357c3316", size = 160538, upload-time = "2025-11-12T02:54:51.517Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/70/7d/9bc192684cea499815ff478dfcdc13835ddf401365057044fb721ec6bddb/certifi-2025.11.12-py3-none-any.whl", hash = "sha256:97de8790030bbd5c2d96b7ec782fc2f7820ef8dba6db909ccf95449f2d062d4b", size = 159438, upload-time = "2025-11-12T02:54:49.735Z" },
]
[[package]]
name = "charset-normalizer"
version = "3.4.4"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/13/69/33ddede1939fdd074bce5434295f38fae7136463422fe4fd3e0e89b98062/charset_normalizer-3.4.4.tar.gz", hash = "sha256:94537985111c35f28720e43603b8e7b43a6ecfb2ce1d3058bbe955b73404e21a", size = 129418, upload-time = "2025-10-14T04:42:32.879Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/1f/b8/6d51fc1d52cbd52cd4ccedd5b5b2f0f6a11bbf6765c782298b0f3e808541/charset_normalizer-3.4.4-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:e824f1492727fa856dd6eda4f7cee25f8518a12f3c4a56a74e8095695089cf6d", size = 209709, upload-time = "2025-10-14T04:40:11.385Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/5c/af/1f9d7f7faafe2ddfb6f72a2e07a548a629c61ad510fe60f9630309908fef/charset_normalizer-3.4.4-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4bd5d4137d500351a30687c2d3971758aac9a19208fc110ccb9d7188fbe709e8", size = 148814, upload-time = "2025-10-14T04:40:13.135Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/79/3d/f2e3ac2bbc056ca0c204298ea4e3d9db9b4afe437812638759db2c976b5f/charset_normalizer-3.4.4-cp310-cp310-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:027f6de494925c0ab2a55eab46ae5129951638a49a34d87f4c3eda90f696b4ad", size = 144467, upload-time = "2025-10-14T04:40:14.728Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/ec/85/1bf997003815e60d57de7bd972c57dc6950446a3e4ccac43bc3070721856/charset_normalizer-3.4.4-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f820802628d2694cb7e56db99213f930856014862f3fd943d290ea8438d07ca8", size = 162280, upload-time = "2025-10-14T04:40:16.14Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/3e/8e/6aa1952f56b192f54921c436b87f2aaf7c7a7c3d0d1a765547d64fd83c13/charset_normalizer-3.4.4-cp310-cp310-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:798d75d81754988d2565bff1b97ba5a44411867c0cf32b77a7e8f8d84796b10d", size = 159454, upload-time = "2025-10-14T04:40:17.567Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/36/3b/60cbd1f8e93aa25d1c669c649b7a655b0b5fb4c571858910ea9332678558/charset_normalizer-3.4.4-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9d1bb833febdff5c8927f922386db610b49db6e0d4f4ee29601d71e7c2694313", size = 153609, upload-time = "2025-10-14T04:40:19.08Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/64/91/6a13396948b8fd3c4b4fd5bc74d045f5637d78c9675585e8e9fbe5636554/charset_normalizer-3.4.4-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:9cd98cdc06614a2f768d2b7286d66805f94c48cde050acdbbb7db2600ab3197e", size = 151849, upload-time = "2025-10-14T04:40:20.607Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/b7/7a/59482e28b9981d105691e968c544cc0df3b7d6133152fb3dcdc8f135da7a/charset_normalizer-3.4.4-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:077fbb858e903c73f6c9db43374fd213b0b6a778106bc7032446a8e8b5b38b93", size = 151586, upload-time = "2025-10-14T04:40:21.719Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/92/59/f64ef6a1c4bdd2baf892b04cd78792ed8684fbc48d4c2afe467d96b4df57/charset_normalizer-3.4.4-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:244bfb999c71b35de57821b8ea746b24e863398194a4014e4c76adc2bbdfeff0", size = 145290, upload-time = "2025-10-14T04:40:23.069Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/6b/63/3bf9f279ddfa641ffa1962b0db6a57a9c294361cc2f5fcac997049a00e9c/charset_normalizer-3.4.4-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:64b55f9dce520635f018f907ff1b0df1fdc31f2795a922fb49dd14fbcdf48c84", size = 163663, upload-time = "2025-10-14T04:40:24.17Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/ed/09/c9e38fc8fa9e0849b172b581fd9803bdf6e694041127933934184e19f8c3/charset_normalizer-3.4.4-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:faa3a41b2b66b6e50f84ae4a68c64fcd0c44355741c6374813a800cd6695db9e", size = 151964, upload-time = "2025-10-14T04:40:25.368Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/d2/d1/d28b747e512d0da79d8b6a1ac18b7ab2ecfd81b2944c4c710e166d8dd09c/charset_normalizer-3.4.4-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:6515f3182dbe4ea06ced2d9e8666d97b46ef4c75e326b79bb624110f122551db", size = 161064, upload-time = "2025-10-14T04:40:26.806Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/bb/9a/31d62b611d901c3b9e5500c36aab0ff5eb442043fb3a1c254200d3d397d9/charset_normalizer-3.4.4-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:cc00f04ed596e9dc0da42ed17ac5e596c6ccba999ba6bd92b0e0aef2f170f2d6", size = 155015, upload-time = "2025-10-14T04:40:28.284Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/1f/f3/107e008fa2bff0c8b9319584174418e5e5285fef32f79d8ee6a430d0039c/charset_normalizer-3.4.4-cp310-cp310-win32.whl", hash = "sha256:f34be2938726fc13801220747472850852fe6b1ea75869a048d6f896838c896f", size = 99792, upload-time = "2025-10-14T04:40:29.613Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/eb/66/e396e8a408843337d7315bab30dbf106c38966f1819f123257f5520f8a96/charset_normalizer-3.4.4-cp310-cp310-win_amd64.whl", hash = "sha256:a61900df84c667873b292c3de315a786dd8dac506704dea57bc957bd31e22c7d", size = 107198, upload-time = "2025-10-14T04:40:30.644Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/b5/58/01b4f815bf0312704c267f2ccb6e5d42bcc7752340cd487bc9f8c3710597/charset_normalizer-3.4.4-cp310-cp310-win_arm64.whl", hash = "sha256:cead0978fc57397645f12578bfd2d5ea9138ea0fac82b2f63f7f7c6877986a69", size = 100262, upload-time = "2025-10-14T04:40:32.108Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/ed/27/c6491ff4954e58a10f69ad90aca8a1b6fe9c5d3c6f380907af3c37435b59/charset_normalizer-3.4.4-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:6e1fcf0720908f200cd21aa4e6750a48ff6ce4afe7ff5a79a90d5ed8a08296f8", size = 206988, upload-time = "2025-10-14T04:40:33.79Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/94/59/2e87300fe67ab820b5428580a53cad894272dbb97f38a7a814a2a1ac1011/charset_normalizer-3.4.4-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5f819d5fe9234f9f82d75bdfa9aef3a3d72c4d24a6e57aeaebba32a704553aa0", size = 147324, upload-time = "2025-10-14T04:40:34.961Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/07/fb/0cf61dc84b2b088391830f6274cb57c82e4da8bbc2efeac8c025edb88772/charset_normalizer-3.4.4-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:a59cb51917aa591b1c4e6a43c132f0cdc3c76dbad6155df4e28ee626cc77a0a3", size = 142742, upload-time = "2025-10-14T04:40:36.105Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/62/8b/171935adf2312cd745d290ed93cf16cf0dfe320863ab7cbeeae1dcd6535f/charset_normalizer-3.4.4-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:8ef3c867360f88ac904fd3f5e1f902f13307af9052646963ee08ff4f131adafc", size = 160863, upload-time = "2025-10-14T04:40:37.188Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/09/73/ad875b192bda14f2173bfc1bc9a55e009808484a4b256748d931b6948442/charset_normalizer-3.4.4-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:d9e45d7faa48ee908174d8fe84854479ef838fc6a705c9315372eacbc2f02897", size = 157837, upload-time = "2025-10-14T04:40:38.435Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/6d/fc/de9cce525b2c5b94b47c70a4b4fb19f871b24995c728e957ee68ab1671ea/charset_normalizer-3.4.4-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:840c25fb618a231545cbab0564a799f101b63b9901f2569faecd6b222ac72381", size = 151550, upload-time = "2025-10-14T04:40:40.053Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/55/c2/43edd615fdfba8c6f2dfbd459b25a6b3b551f24ea21981e23fb768503ce1/charset_normalizer-3.4.4-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:ca5862d5b3928c4940729dacc329aa9102900382fea192fc5e52eb69d6093815", size = 149162, upload-time = "2025-10-14T04:40:41.163Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/03/86/bde4ad8b4d0e9429a4e82c1e8f5c659993a9a863ad62c7df05cf7b678d75/charset_normalizer-3.4.4-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:d9c7f57c3d666a53421049053eaacdd14bbd0a528e2186fcb2e672effd053bb0", size = 150019, upload-time = "2025-10-14T04:40:42.276Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/1f/86/a151eb2af293a7e7bac3a739b81072585ce36ccfb4493039f49f1d3cae8c/charset_normalizer-3.4.4-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:277e970e750505ed74c832b4bf75dac7476262ee2a013f5574dd49075879e161", size = 143310, upload-time = "2025-10-14T04:40:43.439Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/b5/fe/43dae6144a7e07b87478fdfc4dbe9efd5defb0e7ec29f5f58a55aeef7bf7/charset_normalizer-3.4.4-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:31fd66405eaf47bb62e8cd575dc621c56c668f27d46a61d975a249930dd5e2a4", size = 162022, upload-time = "2025-10-14T04:40:44.547Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/80/e6/7aab83774f5d2bca81f42ac58d04caf44f0cc2b65fc6db2b3b2e8a05f3b3/charset_normalizer-3.4.4-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:0d3d8f15c07f86e9ff82319b3d9ef6f4bf907608f53fe9d92b28ea9ae3d1fd89", size = 149383, upload-time = "2025-10-14T04:40:46.018Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/4f/e8/b289173b4edae05c0dde07f69f8db476a0b511eac556dfe0d6bda3c43384/charset_normalizer-3.4.4-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:9f7fcd74d410a36883701fafa2482a6af2ff5ba96b9a620e9e0721e28ead5569", size = 159098, upload-time = "2025-10-14T04:40:47.081Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/d8/df/fe699727754cae3f8478493c7f45f777b17c3ef0600e28abfec8619eb49c/charset_normalizer-3.4.4-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:ebf3e58c7ec8a8bed6d66a75d7fb37b55e5015b03ceae72a8e7c74495551e224", size = 152991, upload-time = "2025-10-14T04:40:48.246Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/1a/86/584869fe4ddb6ffa3bd9f491b87a01568797fb9bd8933f557dba9771beaf/charset_normalizer-3.4.4-cp311-cp311-win32.whl", hash = "sha256:eecbc200c7fd5ddb9a7f16c7decb07b566c29fa2161a16cf67b8d068bd21690a", size = 99456, upload-time = "2025-10-14T04:40:49.376Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/65/f6/62fdd5feb60530f50f7e38b4f6a1d5203f4d16ff4f9f0952962c044e919a/charset_normalizer-3.4.4-cp311-cp311-win_amd64.whl", hash = "sha256:5ae497466c7901d54b639cf42d5b8c1b6a4fead55215500d2f486d34db48d016", size = 106978, upload-time = "2025-10-14T04:40:50.844Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/7a/9d/0710916e6c82948b3be62d9d398cb4fcf4e97b56d6a6aeccd66c4b2f2bd5/charset_normalizer-3.4.4-cp311-cp311-win_arm64.whl", hash = "sha256:65e2befcd84bc6f37095f5961e68a6f077bf44946771354a28ad434c2cce0ae1", size = 99969, upload-time = "2025-10-14T04:40:52.272Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/f3/85/1637cd4af66fa687396e757dec650f28025f2a2f5a5531a3208dc0ec43f2/charset_normalizer-3.4.4-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:0a98e6759f854bd25a58a73fa88833fba3b7c491169f86ce1180c948ab3fd394", size = 208425, upload-time = "2025-10-14T04:40:53.353Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/9d/6a/04130023fef2a0d9c62d0bae2649b69f7b7d8d24ea5536feef50551029df/charset_normalizer-3.4.4-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:b5b290ccc2a263e8d185130284f8501e3e36c5e02750fc6b6bdeb2e9e96f1e25", size = 148162, upload-time = "2025-10-14T04:40:54.558Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/78/29/62328d79aa60da22c9e0b9a66539feae06ca0f5a4171ac4f7dc285b83688/charset_normalizer-3.4.4-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:74bb723680f9f7a6234dcf67aea57e708ec1fbdf5699fb91dfd6f511b0a320ef", size = 144558, upload-time = "2025-10-14T04:40:55.677Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/86/bb/b32194a4bf15b88403537c2e120b817c61cd4ecffa9b6876e941c3ee38fe/charset_normalizer-3.4.4-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f1e34719c6ed0b92f418c7c780480b26b5d9c50349e9a9af7d76bf757530350d", size = 161497, upload-time = "2025-10-14T04:40:57.217Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/19/89/a54c82b253d5b9b111dc74aca196ba5ccfcca8242d0fb64146d4d3183ff1/charset_normalizer-3.4.4-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:2437418e20515acec67d86e12bf70056a33abdacb5cb1655042f6538d6b085a8", size = 159240, upload-time = "2025-10-14T04:40:58.358Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/c0/10/d20b513afe03acc89ec33948320a5544d31f21b05368436d580dec4e234d/charset_normalizer-3.4.4-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:11d694519d7f29d6cd09f6ac70028dba10f92f6cdd059096db198c283794ac86", size = 153471, upload-time = "2025-10-14T04:40:59.468Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/61/fa/fbf177b55bdd727010f9c0a3c49eefa1d10f960e5f09d1d887bf93c2e698/charset_normalizer-3.4.4-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:ac1c4a689edcc530fc9d9aa11f5774b9e2f33f9a0c6a57864e90908f5208d30a", size = 150864, upload-time = "2025-10-14T04:41:00.623Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/05/12/9fbc6a4d39c0198adeebbde20b619790e9236557ca59fc40e0e3cebe6f40/charset_normalizer-3.4.4-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:21d142cc6c0ec30d2efee5068ca36c128a30b0f2c53c1c07bd78cb6bc1d3be5f", size = 150647, upload-time = "2025-10-14T04:41:01.754Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/ad/1f/6a9a593d52e3e8c5d2b167daf8c6b968808efb57ef4c210acb907c365bc4/charset_normalizer-3.4.4-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:5dbe56a36425d26d6cfb40ce79c314a2e4dd6211d51d6d2191c00bed34f354cc", size = 145110, upload-time = "2025-10-14T04:41:03.231Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/30/42/9a52c609e72471b0fc54386dc63c3781a387bb4fe61c20231a4ebcd58bdd/charset_normalizer-3.4.4-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:5bfbb1b9acf3334612667b61bd3002196fe2a1eb4dd74d247e0f2a4d50ec9bbf", size = 162839, upload-time = "2025-10-14T04:41:04.715Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/c4/5b/c0682bbf9f11597073052628ddd38344a3d673fda35a36773f7d19344b23/charset_normalizer-3.4.4-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:d055ec1e26e441f6187acf818b73564e6e6282709e9bcb5b63f5b23068356a15", size = 150667, upload-time = "2025-10-14T04:41:05.827Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/e4/24/a41afeab6f990cf2daf6cb8c67419b63b48cf518e4f56022230840c9bfb2/charset_normalizer-3.4.4-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:af2d8c67d8e573d6de5bc30cdb27e9b95e49115cd9baad5ddbd1a6207aaa82a9", size = 160535, upload-time = "2025-10-14T04:41:06.938Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/2a/e5/6a4ce77ed243c4a50a1fecca6aaaab419628c818a49434be428fe24c9957/charset_normalizer-3.4.4-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:780236ac706e66881f3b7f2f32dfe90507a09e67d1d454c762cf642e6e1586e0", size = 154816, upload-time = "2025-10-14T04:41:08.101Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/a8/ef/89297262b8092b312d29cdb2517cb1237e51db8ecef2e9af5edbe7b683b1/charset_normalizer-3.4.4-cp312-cp312-win32.whl", hash = "sha256:5833d2c39d8896e4e19b689ffc198f08ea58116bee26dea51e362ecc7cd3ed26", size = 99694, upload-time = "2025-10-14T04:41:09.23Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/3d/2d/1e5ed9dd3b3803994c155cd9aacb60c82c331bad84daf75bcb9c91b3295e/charset_normalizer-3.4.4-cp312-cp312-win_amd64.whl", hash = "sha256:a79cfe37875f822425b89a82333404539ae63dbdddf97f84dcbc3d339aae9525", size = 107131, upload-time = "2025-10-14T04:41:10.467Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/d0/d9/0ed4c7098a861482a7b6a95603edce4c0d9db2311af23da1fb2b75ec26fc/charset_normalizer-3.4.4-cp312-cp312-win_arm64.whl", hash = "sha256:376bec83a63b8021bb5c8ea75e21c4ccb86e7e45ca4eb81146091b56599b80c3", size = 100390, upload-time = "2025-10-14T04:41:11.915Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/0a/4c/925909008ed5a988ccbb72dcc897407e5d6d3bd72410d69e051fc0c14647/charset_normalizer-3.4.4-py3-none-any.whl", hash = "sha256:7a32c560861a02ff789ad905a2fe94e3f840803362c84fecf1851cb4cf3dc37f", size = 53402, upload-time = "2025-10-14T04:42:31.76Z" },
]
[[package]]
name = "colorama"
version = "0.4.6"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/d8/53/6f443c9a4a8358a93a6792e2acffb9d9d5cb0a5cfd8802644b7b1c9a02e4/colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44", size = 27697, upload-time = "2022-10-25T02:36:22.414Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/d1/d6/3965ed04c63042e047cb6a3e6ed1a63a35087b6a609aa3a15ed8ac56c221/colorama-0.4.6-py2.py3-none-any.whl", hash = "sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6", size = 25335, upload-time = "2022-10-25T02:36:20.889Z" },
]
[[package]]
name = "exceptiongroup"
version = "1.3.1"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
dependencies = [
{ name = "typing-extensions" },
]
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/50/79/66800aadf48771f6b62f7eb014e352e5d06856655206165d775e675a02c9/exceptiongroup-1.3.1.tar.gz", hash = "sha256:8b412432c6055b0b7d14c310000ae93352ed6754f70fa8f7c34141f91c4e3219", size = 30371, upload-time = "2025-11-21T23:01:54.787Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/8a/0e/97c33bf5009bdbac74fd2beace167cab3f978feb69cc36f1ef79360d6c4e/exceptiongroup-1.3.1-py3-none-any.whl", hash = "sha256:a7a39a3bd276781e98394987d3a5701d0c4edffb633bb7a5144577f82c773598", size = 16740, upload-time = "2025-11-21T23:01:53.443Z" },
]
[[package]]
name = "idna"
version = "3.11"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/6f/6d/0703ccc57f3a7233505399edb88de3cbd678da106337b9fcde432b65ed60/idna-3.11.tar.gz", hash = "sha256:795dafcc9c04ed0c1fb032c2aa73654d8e8c5023a7df64a53f39190ada629902", size = 194582, upload-time = "2025-10-12T14:55:20.501Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/0e/61/66938bbb5fc52dbdf84594873d5b51fb1f7c7794e9c0f5bd885f30bc507b/idna-3.11-py3-none-any.whl", hash = "sha256:771a87f49d9defaf64091e6e6fe9c18d4833f140bd19464795bc32d966ca37ea", size = 71008, upload-time = "2025-10-12T14:55:18.883Z" },
]
[[package]]
name = "iniconfig"
version = "2.3.0"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/72/34/14ca021ce8e5dfedc35312d08ba8bf51fdd999c576889fc2c24cb97f4f10/iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730", size = 20503, upload-time = "2025-10-18T21:55:43.219Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" },
]
[[package]]
name = "lark"
version = "1.3.1"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/da/34/28fff3ab31ccff1fd4f6c7c7b0ceb2b6968d8ea4950663eadcb5720591a0/lark-1.3.1.tar.gz", hash = "sha256:b426a7a6d6d53189d318f2b6236ab5d6429eaf09259f1ca33eb716eed10d2905", size = 382732, upload-time = "2025-10-27T18:25:56.653Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/82/3d/14ce75ef66813643812f3093ab17e46d3a206942ce7376d31ec2d36229e7/lark-1.3.1-py3-none-any.whl", hash = "sha256:c629b661023a014c37da873b4ff58a817398d12635d3bbb2c5a03be7fe5d1e12", size = 113151, upload-time = "2025-10-27T18:25:54.882Z" },
]
[[package]]
name = "packaging"
version = "25.0"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/a1/d4/1fc4078c65507b51b96ca8f8c3ba19e6a61c8253c72794544580a7b6c24d/packaging-25.0.tar.gz", hash = "sha256:d443872c98d677bf60f6a1f2f8c1cb748e8fe762d2bf9d3148b5599295b0fc4f", size = 165727, upload-time = "2025-04-19T11:48:59.673Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/20/12/38679034af332785aac8774540895e234f4d07f7545804097de4b666afd8/packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484", size = 66469, upload-time = "2025-04-19T11:48:57.875Z" },
]
[[package]]
name = "pluggy"
version = "1.6.0"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
]
[[package]]
name = "pycryptodomex"
version = "3.23.0"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/c9/85/e24bf90972a30b0fcd16c73009add1d7d7cd9140c2498a68252028899e41/pycryptodomex-3.23.0.tar.gz", hash = "sha256:71909758f010c82bc99b0abf4ea12012c98962fbf0583c2164f8b84533c2e4da", size = 4922157, upload-time = "2025-05-17T17:23:41.434Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/dd/9c/1a8f35daa39784ed8adf93a694e7e5dc15c23c741bbda06e1d45f8979e9e/pycryptodomex-3.23.0-cp37-abi3-macosx_10_9_universal2.whl", hash = "sha256:06698f957fe1ab229a99ba2defeeae1c09af185baa909a31a5d1f9d42b1aaed6", size = 2499240, upload-time = "2025-05-17T17:22:46.953Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/7a/62/f5221a191a97157d240cf6643747558759126c76ee92f29a3f4aee3197a5/pycryptodomex-3.23.0-cp37-abi3-macosx_10_9_x86_64.whl", hash = "sha256:b2c2537863eccef2d41061e82a881dcabb04944c5c06c5aa7110b577cc487545", size = 1644042, upload-time = "2025-05-17T17:22:49.098Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/8c/fd/5a054543c8988d4ed7b612721d7e78a4b9bf36bc3c5ad45ef45c22d0060e/pycryptodomex-3.23.0-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:43c446e2ba8df8889e0e16f02211c25b4934898384c1ec1ec04d7889c0333587", size = 2186227, upload-time = "2025-05-17T17:22:51.139Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/c8/a9/8862616a85cf450d2822dbd4fff1fcaba90877907a6ff5bc2672cafe42f8/pycryptodomex-3.23.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f489c4765093fb60e2edafdf223397bc716491b2b69fe74367b70d6999257a5c", size = 2272578, upload-time = "2025-05-17T17:22:53.676Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/46/9f/bda9c49a7c1842820de674ab36c79f4fbeeee03f8ff0e4f3546c3889076b/pycryptodomex-3.23.0-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:bdc69d0d3d989a1029df0eed67cc5e8e5d968f3724f4519bd03e0ec68df7543c", size = 2312166, upload-time = "2025-05-17T17:22:56.585Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/03/cc/870b9bf8ca92866ca0186534801cf8d20554ad2a76ca959538041b7a7cf4/pycryptodomex-3.23.0-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:6bbcb1dd0f646484939e142462d9e532482bc74475cecf9c4903d4e1cd21f003", size = 2185467, upload-time = "2025-05-17T17:22:59.237Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/96/e3/ce9348236d8e669fea5dd82a90e86be48b9c341210f44e25443162aba187/pycryptodomex-3.23.0-cp37-abi3-musllinux_1_2_i686.whl", hash = "sha256:8a4fcd42ccb04c31268d1efeecfccfd1249612b4de6374205376b8f280321744", size = 2346104, upload-time = "2025-05-17T17:23:02.112Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/a5/e9/e869bcee87beb89040263c416a8a50204f7f7a83ac11897646c9e71e0daf/pycryptodomex-3.23.0-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:55ccbe27f049743a4caf4f4221b166560d3438d0b1e5ab929e07ae1702a4d6fd", size = 2271038, upload-time = "2025-05-17T17:23:04.872Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/8d/67/09ee8500dd22614af5fbaa51a4aee6e342b5fa8aecf0a6cb9cbf52fa6d45/pycryptodomex-3.23.0-cp37-abi3-win32.whl", hash = "sha256:189afbc87f0b9f158386bf051f720e20fa6145975f1e76369303d0f31d1a8d7c", size = 1771969, upload-time = "2025-05-17T17:23:07.115Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/69/96/11f36f71a865dd6df03716d33bd07a67e9d20f6b8d39820470b766af323c/pycryptodomex-3.23.0-cp37-abi3-win_amd64.whl", hash = "sha256:52e5ca58c3a0b0bd5e100a9fbc8015059b05cffc6c66ce9d98b4b45e023443b9", size = 1803124, upload-time = "2025-05-17T17:23:09.267Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/f9/93/45c1cdcbeb182ccd2e144c693eaa097763b08b38cded279f0053ed53c553/pycryptodomex-3.23.0-cp37-abi3-win_arm64.whl", hash = "sha256:02d87b80778c171445d67e23d1caef279bf4b25c3597050ccd2e13970b57fd51", size = 1707161, upload-time = "2025-05-17T17:23:11.414Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/f3/b8/3e76d948c3c4ac71335bbe75dac53e154b40b0f8f1f022dfa295257a0c96/pycryptodomex-3.23.0-pp310-pypy310_pp73-macosx_10_15_x86_64.whl", hash = "sha256:ebfff755c360d674306e5891c564a274a47953562b42fb74a5c25b8fc1fb1cb5", size = 1627695, upload-time = "2025-05-17T17:23:17.38Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/6a/cf/80f4297a4820dfdfd1c88cf6c4666a200f204b3488103d027b5edd9176ec/pycryptodomex-3.23.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:eca54f4bb349d45afc17e3011ed4264ef1cc9e266699874cdd1349c504e64798", size = 1675772, upload-time = "2025-05-17T17:23:19.202Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/d1/42/1e969ee0ad19fe3134b0e1b856c39bd0b70d47a4d0e81c2a8b05727394c9/pycryptodomex-3.23.0-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4f2596e643d4365e14d0879dc5aafe6355616c61c2176009270f3048f6d9a61f", size = 1668083, upload-time = "2025-05-17T17:23:21.867Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/6e/c3/1de4f7631fea8a992a44ba632aa40e0008764c0fb9bf2854b0acf78c2cf2/pycryptodomex-3.23.0-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:fdfac7cda115bca3a5abb2f9e43bc2fb66c2b65ab074913643803ca7083a79ea", size = 1706056, upload-time = "2025-05-17T17:23:24.031Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/f2/5f/af7da8e6f1e42b52f44a24d08b8e4c726207434e2593732d39e7af5e7256/pycryptodomex-3.23.0-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:14c37aaece158d0ace436f76a7bb19093db3b4deade9797abfc39ec6cd6cc2fe", size = 1806478, upload-time = "2025-05-17T17:23:26.066Z" },
]
[[package]]
name = "pygments"
version = "2.19.2"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/b0/77/a5b8c569bf593b0140bde72ea885a803b82086995367bf2037de0159d924/pygments-2.19.2.tar.gz", hash = "sha256:636cb2477cec7f8952536970bc533bc43743542f70392ae026374600add5b887", size = 4968631, upload-time = "2025-06-21T13:39:12.283Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/c7/21/705964c7812476f378728bdf590ca4b771ec72385c533964653c68e86bdc/pygments-2.19.2-py3-none-any.whl", hash = "sha256:86540386c03d588bb81d44bc3928634ff26449851e99741617ecb9037ee5ec0b", size = 1225217, upload-time = "2025-06-21T13:39:07.939Z" },
]
[[package]]
name = "pytest"
version = "9.0.1"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
dependencies = [
{ name = "colorama", marker = "sys_platform == 'win32'" },
{ name = "exceptiongroup", marker = "python_full_version < '3.11'" },
{ name = "iniconfig" },
{ name = "packaging" },
{ name = "pluggy" },
{ name = "pygments" },
{ name = "tomli", marker = "python_full_version < '3.11'" },
]
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/07/56/f013048ac4bc4c1d9be45afd4ab209ea62822fb1598f40687e6bf45dcea4/pytest-9.0.1.tar.gz", hash = "sha256:3e9c069ea73583e255c3b21cf46b8d3c56f6e3a1a8f6da94ccb0fcf57b9d73c8", size = 1564125, upload-time = "2025-11-12T13:05:09.333Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/0b/8b/6300fb80f858cda1c51ffa17075df5d846757081d11ab4aa35cef9e6258b/pytest-9.0.1-py3-none-any.whl", hash = "sha256:67be0030d194df2dfa7b556f2e56fb3c3315bd5c8822c6951162b92b32ce7dad", size = 373668, upload-time = "2025-11-12T13:05:07.379Z" },
]
[[package]]
name = "ragflow-cli"
version = "0.22.1"
source = { virtual = "." }
dependencies = [
{ name = "beartype" },
{ name = "lark" },
{ name = "pycryptodomex" },
{ name = "requests" },
]
[package.dev-dependencies]
test = [
{ name = "pytest" },
{ name = "requests" },
{ name = "requests-toolbelt" },
]
[package.metadata]
requires-dist = [
{ name = "beartype", specifier = ">=0.20.0,<1.0.0" },
{ name = "lark", specifier = ">=1.1.0" },
{ name = "pycryptodomex", specifier = ">=3.10.0" },
{ name = "requests", specifier = ">=2.30.0,<3.0.0" },
]
[package.metadata.requires-dev]
test = [
{ name = "pytest", specifier = ">=8.3.5" },
{ name = "requests", specifier = ">=2.32.3" },
{ name = "requests-toolbelt", specifier = ">=1.0.0" },
]
[[package]]
name = "requests"
version = "2.32.5"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
dependencies = [
{ name = "certifi" },
{ name = "charset-normalizer" },
{ name = "idna" },
{ name = "urllib3" },
]
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/c9/74/b3ff8e6c8446842c3f5c837e9c3dfcfe2018ea6ecef224c710c85ef728f4/requests-2.32.5.tar.gz", hash = "sha256:dbba0bac56e100853db0ea71b82b4dfd5fe2bf6d3754a8893c3af500cec7d7cf", size = 134517, upload-time = "2025-08-18T20:46:02.573Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/1e/db/4254e3eabe8020b458f1a747140d32277ec7a271daf1d235b70dc0b4e6e3/requests-2.32.5-py3-none-any.whl", hash = "sha256:2462f94637a34fd532264295e186976db0f5d453d1cdd31473c85a6a161affb6", size = 64738, upload-time = "2025-08-18T20:46:00.542Z" },
]
[[package]]
name = "requests-toolbelt"
version = "1.0.0"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
dependencies = [
{ name = "requests" },
]
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/f3/61/d7545dafb7ac2230c70d38d31cbfe4cc64f7144dc41f6e4e4b78ecd9f5bb/requests-toolbelt-1.0.0.tar.gz", hash = "sha256:7681a0a3d047012b5bdc0ee37d7f8f07ebe76ab08caeccfc3921ce23c88d5bc6", size = 206888, upload-time = "2023-05-01T04:11:33.229Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/3f/51/d4db610ef29373b879047326cbf6fa98b6c1969d6f6dc423279de2b1be2c/requests_toolbelt-1.0.0-py2.py3-none-any.whl", hash = "sha256:cccfdd665f0a24fcf4726e690f65639d272bb0637b9b92dfd91a5568ccf6bd06", size = 54481, upload-time = "2023-05-01T04:11:28.427Z" },
]
[[package]]
name = "tomli"
version = "2.3.0"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/52/ed/3f73f72945444548f33eba9a87fc7a6e969915e7b1acc8260b30e1f76a2f/tomli-2.3.0.tar.gz", hash = "sha256:64be704a875d2a59753d80ee8a533c3fe183e3f06807ff7dc2232938ccb01549", size = 17392, upload-time = "2025-10-08T22:01:47.119Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/b3/2e/299f62b401438d5fe1624119c723f5d877acc86a4c2492da405626665f12/tomli-2.3.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:88bd15eb972f3664f5ed4b57c1634a97153b4bac4479dcb6a495f41921eb7f45", size = 153236, upload-time = "2025-10-08T22:01:00.137Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/86/7f/d8fffe6a7aefdb61bced88fcb5e280cfd71e08939da5894161bd71bea022/tomli-2.3.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:883b1c0d6398a6a9d29b508c331fa56adbcdff647f6ace4dfca0f50e90dfd0ba", size = 148084, upload-time = "2025-10-08T22:01:01.63Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/47/5c/24935fb6a2ee63e86d80e4d3b58b222dafaf438c416752c8b58537c8b89a/tomli-2.3.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d1381caf13ab9f300e30dd8feadb3de072aeb86f1d34a8569453ff32a7dea4bf", size = 234832, upload-time = "2025-10-08T22:01:02.543Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/89/da/75dfd804fc11e6612846758a23f13271b76d577e299592b4371a4ca4cd09/tomli-2.3.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a0e285d2649b78c0d9027570d4da3425bdb49830a6156121360b3f8511ea3441", size = 242052, upload-time = "2025-10-08T22:01:03.836Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/70/8c/f48ac899f7b3ca7eb13af73bacbc93aec37f9c954df3c08ad96991c8c373/tomli-2.3.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:0a154a9ae14bfcf5d8917a59b51ffd5a3ac1fd149b71b47a3a104ca4edcfa845", size = 239555, upload-time = "2025-10-08T22:01:04.834Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/ba/28/72f8afd73f1d0e7829bfc093f4cb98ce0a40ffc0cc997009ee1ed94ba705/tomli-2.3.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:74bf8464ff93e413514fefd2be591c3b0b23231a77f901db1eb30d6f712fc42c", size = 245128, upload-time = "2025-10-08T22:01:05.84Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/b6/eb/a7679c8ac85208706d27436e8d421dfa39d4c914dcf5fa8083a9305f58d9/tomli-2.3.0-cp311-cp311-win32.whl", hash = "sha256:00b5f5d95bbfc7d12f91ad8c593a1659b6387b43f054104cda404be6bda62456", size = 96445, upload-time = "2025-10-08T22:01:06.896Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/0a/fe/3d3420c4cb1ad9cb462fb52967080575f15898da97e21cb6f1361d505383/tomli-2.3.0-cp311-cp311-win_amd64.whl", hash = "sha256:4dc4ce8483a5d429ab602f111a93a6ab1ed425eae3122032db7e9acf449451be", size = 107165, upload-time = "2025-10-08T22:01:08.107Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/ff/b7/40f36368fcabc518bb11c8f06379a0fd631985046c038aca08c6d6a43c6e/tomli-2.3.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:d7d86942e56ded512a594786a5ba0a5e521d02529b3826e7761a05138341a2ac", size = 154891, upload-time = "2025-10-08T22:01:09.082Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/f9/3f/d9dd692199e3b3aab2e4e4dd948abd0f790d9ded8cd10cbaae276a898434/tomli-2.3.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:73ee0b47d4dad1c5e996e3cd33b8a76a50167ae5f96a2607cbe8cc773506ab22", size = 148796, upload-time = "2025-10-08T22:01:10.266Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/60/83/59bff4996c2cf9f9387a0f5a3394629c7efa5ef16142076a23a90f1955fa/tomli-2.3.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:792262b94d5d0a466afb5bc63c7daa9d75520110971ee269152083270998316f", size = 242121, upload-time = "2025-10-08T22:01:11.332Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/45/e5/7c5119ff39de8693d6baab6c0b6dcb556d192c165596e9fc231ea1052041/tomli-2.3.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4f195fe57ecceac95a66a75ac24d9d5fbc98ef0962e09b2eddec5d39375aae52", size = 250070, upload-time = "2025-10-08T22:01:12.498Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/45/12/ad5126d3a278f27e6701abde51d342aa78d06e27ce2bb596a01f7709a5a2/tomli-2.3.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:e31d432427dcbf4d86958c184b9bfd1e96b5b71f8eb17e6d02531f434fd335b8", size = 245859, upload-time = "2025-10-08T22:01:13.551Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/fb/a1/4d6865da6a71c603cfe6ad0e6556c73c76548557a8d658f9e3b142df245f/tomli-2.3.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:7b0882799624980785240ab732537fcfc372601015c00f7fc367c55308c186f6", size = 250296, upload-time = "2025-10-08T22:01:14.614Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/a0/b7/a7a7042715d55c9ba6e8b196d65d2cb662578b4d8cd17d882d45322b0d78/tomli-2.3.0-cp312-cp312-win32.whl", hash = "sha256:ff72b71b5d10d22ecb084d345fc26f42b5143c5533db5e2eaba7d2d335358876", size = 97124, upload-time = "2025-10-08T22:01:15.629Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/06/1e/f22f100db15a68b520664eb3328fb0ae4e90530887928558112c8d1f4515/tomli-2.3.0-cp312-cp312-win_amd64.whl", hash = "sha256:1cb4ed918939151a03f33d4242ccd0aa5f11b3547d0cf30f7c74a408a5b99878", size = 107698, upload-time = "2025-10-08T22:01:16.51Z" },
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/77/b8/0135fadc89e73be292b473cb820b4f5a08197779206b33191e801feeae40/tomli-2.3.0-py3-none-any.whl", hash = "sha256:e95b1af3c5b07d9e643909b5abbec77cd9f1217e6d0bca72b0234736b9fb1f1b", size = 14408, upload-time = "2025-10-08T22:01:46.04Z" },
]
[[package]]
name = "typing-extensions"
version = "4.15.0"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/72/94/1a15dd82efb362ac84269196e94cf00f187f7ed21c242792a923cdb1c61f/typing_extensions-4.15.0.tar.gz", hash = "sha256:0cea48d173cc12fa28ecabc3b837ea3cf6f38c6d1136f85cbaaf598984861466", size = 109391, upload-time = "2025-08-25T13:49:26.313Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/18/67/36e9267722cc04a6b9f15c7f3441c2363321a3ea07da7ae0c0707beb2a9c/typing_extensions-4.15.0-py3-none-any.whl", hash = "sha256:f0fa19c6845758ab08074a0cfa8b7aecb71c999ca73d62883bc25cc018c4e548", size = 44614, upload-time = "2025-08-25T13:49:24.86Z" },
]
[[package]]
name = "urllib3"
version = "2.5.0"
source = { registry = "https://pypi.tuna.tsinghua.edu.cn/simple" }
sdist = { url = "https://pypi.tuna.tsinghua.edu.cn/packages/15/22/9ee70a2574a4f4599c47dd506532914ce044817c7752a79b6a51286319bc/urllib3-2.5.0.tar.gz", hash = "sha256:3fc47733c7e419d4bc3f6b3dc2b4f890bb743906a30d56ba4a5bfa4bbff92760", size = 393185, upload-time = "2025-06-18T14:07:41.644Z" }
wheels = [
{ url = "https://pypi.tuna.tsinghua.edu.cn/packages/a7/c2/fe1e52489ae3122415c51f387e221dd0773709bad6c6cdaa599e8a2c5185/urllib3-2.5.0-py3-none-any.whl", hash = "sha256:e6b01673c0fa6a13e374b50871808eb3bf7046c4b125b216f6bf1cc604cff0dc", size = 129795, upload-time = "2025-06-18T14:07:40.39Z" },
]

View File

@ -20,6 +20,7 @@ import logging
import time import time
import threading import threading
import traceback import traceback
import faulthandler
from flask import Flask from flask import Flask
from flask_login import LoginManager from flask_login import LoginManager
@ -37,6 +38,7 @@ from common.versions import get_ragflow_version
stop_event = threading.Event() stop_event = threading.Event()
if __name__ == '__main__': if __name__ == '__main__':
faulthandler.enable()
init_root_logger("admin_service") init_root_logger("admin_service")
logging.info(r""" logging.info(r"""
____ ___ ______________ ___ __ _ ____ ___ ______________ ___ __ _
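The hunk above wires `faulthandler` into the admin service entry point. A minimal standalone sketch of the same pattern (the SIGUSR1 registration is an illustrative extra, not part of this change):

```python
import faulthandler
import signal
import sys

# Dump Python tracebacks for every thread if the process hits a fatal signal
# (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL); this is what enable() does.
faulthandler.enable(file=sys.stderr, all_threads=True)

# Illustrative extra: also dump tracebacks on demand via `kill -USR1 <pid>`.
if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

if __name__ == "__main__":
    print("faulthandler active; send SIGUSR1 to dump thread tracebacks")
```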

View File

@ -13,7 +13,9 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# #
import asyncio
import base64 import base64
import inspect
import json import json
import logging import logging
import re import re
@ -25,6 +27,7 @@ from typing import Any, Union, Tuple
from agent.component import component_class from agent.component import component_class
from agent.component.base import ComponentBase from agent.component.base import ComponentBase
from api.db.services.file_service import FileService
from api.db.services.task_service import has_canceled from api.db.services.task_service import has_canceled
from common.misc_utils import get_uuid, hash_str2int from common.misc_utils import get_uuid, hash_str2int
from common.exceptions import TaskCanceledException from common.exceptions import TaskCanceledException
@ -79,6 +82,7 @@ class Graph:
self.dsl = json.loads(dsl) self.dsl = json.loads(dsl)
self._tenant_id = tenant_id self._tenant_id = tenant_id
self.task_id = task_id if task_id else get_uuid() self.task_id = task_id if task_id else get_uuid()
self._thread_pool = ThreadPoolExecutor(max_workers=5)
self.load() self.load()
def load(self): def load(self):
@ -206,17 +210,28 @@ class Graph:
for key in path.split('.'): for key in path.split('.'):
if cur is None: if cur is None:
return None return None
if isinstance(cur, str): if isinstance(cur, str):
try: try:
cur = json.loads(cur) cur = json.loads(cur)
except Exception: except Exception:
return None return None
if isinstance(cur, dict): if isinstance(cur, dict):
cur = cur.get(key) cur = cur.get(key)
else: continue
cur = getattr(cur, key, None)
if isinstance(cur, (list, tuple)):
try:
idx = int(key)
cur = cur[idx]
except Exception:
return None
continue
cur = getattr(cur, key, None)
return cur return cur
def set_variable_value(self, exp: str,value): def set_variable_value(self, exp: str,value):
exp = exp.strip("{").strip("}").strip(" ").strip("{").strip("}") exp = exp.strip("{").strip("}").strip(" ").strip("{").strip("}")
if exp.find("@") < 0: if exp.find("@") < 0:
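The reworked `get_variable_value` traversal above now descends through JSON strings, dicts, list/tuple indices and object attributes. A self-contained sketch of that walk, using illustrative names that are not the project's API:

```python
import json
from types import SimpleNamespace

def get_by_path(root, path: str):
    """Walk a dotted path across dicts, JSON strings, lists/tuples and attributes."""
    cur = root
    for key in path.split("."):
        if cur is None:
            return None
        if isinstance(cur, str):
            try:
                cur = json.loads(cur)       # a JSON string is parsed before descending
            except Exception:
                return None
        if isinstance(cur, dict):
            cur = cur.get(key)
            continue
        if isinstance(cur, (list, tuple)):
            try:
                cur = cur[int(key)]         # numeric segments index into sequences
            except Exception:
                return None
            continue
        cur = getattr(cur, key, None)       # fall back to attribute access
    return cur

if __name__ == "__main__":
    data = {"outputs": '{"items": [{"name": "a"}, {"name": "b"}]}',
            "obj": SimpleNamespace(score=0.9)}
    print(get_by_path(data, "outputs.items.1.name"))  # -> b
    print(get_by_path(data, "obj.score"))             # -> 0.9
```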
@ -270,6 +285,7 @@ class Canvas(Graph):
"sys.conversation_turns": 0, "sys.conversation_turns": 0,
"sys.files": [] "sys.files": []
} }
self.variables = {}
super().__init__(dsl, tenant_id, task_id) super().__init__(dsl, tenant_id, task_id)
def load(self): def load(self):
@ -284,6 +300,10 @@ class Canvas(Graph):
"sys.conversation_turns": 0, "sys.conversation_turns": 0,
"sys.files": [] "sys.files": []
} }
if "variables" in self.dsl:
self.variables = self.dsl["variables"]
else:
self.variables = {}
self.retrieval = self.dsl["retrieval"] self.retrieval = self.dsl["retrieval"]
self.memory = self.dsl.get("memory", []) self.memory = self.dsl.get("memory", [])
@ -300,8 +320,9 @@ class Canvas(Graph):
self.history = [] self.history = []
self.retrieval = [] self.retrieval = []
self.memory = [] self.memory = []
print(self.variables)
for k in self.globals.keys(): for k in self.globals.keys():
if k.startswith("sys.") or k.startswith("env."): if k.startswith("sys."):
if isinstance(self.globals[k], str): if isinstance(self.globals[k], str):
self.globals[k] = "" self.globals[k] = ""
elif isinstance(self.globals[k], int): elif isinstance(self.globals[k], int):
@ -314,9 +335,33 @@ class Canvas(Graph):
self.globals[k] = {} self.globals[k] = {}
else: else:
self.globals[k] = None self.globals[k] = None
if k.startswith("env."):
key = k[4:]
if key in self.variables:
variable = self.variables[key]
if variable["value"]:
self.globals[k] = variable["value"]
else:
if variable["type"] == "string":
self.globals[k] = ""
elif variable["type"] == "number":
self.globals[k] = 0
elif variable["type"] == "boolean":
self.globals[k] = False
elif variable["type"] == "object":
self.globals[k] = {}
elif variable["type"].startswith("array"):
self.globals[k] = []
else:
self.globals[k] = ""
else:
self.globals[k] = ""
print(self.globals)
async def run(self, **kwargs): async def run(self, **kwargs):
st = time.perf_counter() st = time.perf_counter()
self._loop = asyncio.get_running_loop()
self.message_id = get_uuid() self.message_id = get_uuid()
created_at = int(time.time()) created_at = int(time.time())
self.add_user_input(kwargs.get("query")) self.add_user_input(kwargs.get("query"))
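The `env.*` reset logic above keeps a configured value and otherwise falls back to a type-appropriate zero value. A small sketch of that fallback, with an illustrative variable table:

```python
def env_default(variable: dict):
    # Keep an explicitly configured value; otherwise pick a default by declared type.
    if variable.get("value"):
        return variable["value"]
    vtype = variable.get("type", "string")
    if vtype == "number":
        return 0
    if vtype == "boolean":
        return False
    if vtype == "object":
        return {}
    if vtype.startswith("array"):
        return []
    return ""

if __name__ == "__main__":
    variables = {"api_base": {"type": "string", "value": "https://example.com"},
                 "top_k": {"type": "number", "value": None},
                 "tags": {"type": "array<string>", "value": None}}
    print({f"env.{k}": env_default(v) for k, v in variables.items()})
    # {'env.api_base': 'https://example.com', 'env.top_k': 0, 'env.tags': []}
```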
@ -332,7 +377,7 @@ class Canvas(Graph):
for k in kwargs.keys(): for k in kwargs.keys():
if k in ["query", "user_id", "files"] and kwargs[k]: if k in ["query", "user_id", "files"] and kwargs[k]:
if k == "files": if k == "files":
self.globals[f"sys.{k}"] = self.get_files(kwargs[k]) self.globals[f"sys.{k}"] = await self.get_files_async(kwargs[k])
else: else:
self.globals[f"sys.{k}"] = kwargs[k] self.globals[f"sys.{k}"] = kwargs[k]
if not self.globals["sys.conversation_turns"] : if not self.globals["sys.conversation_turns"] :
@ -362,31 +407,39 @@ class Canvas(Graph):
yield decorate("workflow_started", {"inputs": kwargs.get("inputs")}) yield decorate("workflow_started", {"inputs": kwargs.get("inputs")})
self.retrieval.append({"chunks": {}, "doc_aggs": {}}) self.retrieval.append({"chunks": {}, "doc_aggs": {}})
def _run_batch(f, t): async def _run_batch(f, t):
if self.is_canceled(): if self.is_canceled():
msg = f"Task {self.task_id} has been canceled during batch execution." msg = f"Task {self.task_id} has been canceled during batch execution."
logging.info(msg) logging.info(msg)
raise TaskCanceledException(msg) raise TaskCanceledException(msg)
with ThreadPoolExecutor(max_workers=5) as executor: loop = asyncio.get_running_loop()
thr = [] tasks = []
i = f i = f
while i < t: while i < t:
cpn = self.get_component_obj(self.path[i]) cpn = self.get_component_obj(self.path[i])
if cpn.component_name.lower() in ["begin", "userfillup"]: task_fn = None
thr.append(executor.submit(cpn.invoke, inputs=kwargs.get("inputs", {})))
i += 1 if cpn.component_name.lower() in ["begin", "userfillup"]:
task_fn = partial(cpn.invoke, inputs=kwargs.get("inputs", {}))
i += 1
else:
for _, ele in cpn.get_input_elements().items():
if isinstance(ele, dict) and ele.get("_cpn_id") and ele.get("_cpn_id") not in self.path[:i] and self.path[0].lower().find("userfillup") < 0:
self.path.pop(i)
t -= 1
break
else: else:
for _, ele in cpn.get_input_elements().items(): task_fn = partial(cpn.invoke, **cpn.get_input())
if isinstance(ele, dict) and ele.get("_cpn_id") and ele.get("_cpn_id") not in self.path[:i] and self.path[0].lower().find("userfillup") < 0: i += 1
self.path.pop(i)
t -= 1 if task_fn is None:
break continue
else:
thr.append(executor.submit(cpn.invoke, **cpn.get_input())) tasks.append(loop.run_in_executor(self._thread_pool, task_fn))
i += 1
for t in thr: if tasks:
t.result() await asyncio.gather(*tasks)
def _node_finished(cpn_obj): def _node_finished(cpn_obj):
return decorate("node_finished",{ return decorate("node_finished",{
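`_run_batch` above switches from blocking on a per-batch `ThreadPoolExecutor` to scheduling each component's `invoke` on a shared pool via `loop.run_in_executor` and awaiting the results with `asyncio.gather`. A minimal sketch of that pattern, with illustrative component names rather than RAGFlow's API:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial

_pool = ThreadPoolExecutor(max_workers=5)

def invoke_component(name: str, delay: float) -> str:
    time.sleep(delay)                 # stands in for a blocking cpn.invoke(...)
    return f"{name} done"

async def run_batch(specs):
    loop = asyncio.get_running_loop()
    # Each blocking call runs on the shared pool; the event loop stays free.
    tasks = [loop.run_in_executor(_pool, partial(invoke_component, n, d)) for n, d in specs]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    print(asyncio.run(run_batch([("retrieval", 0.2), ("rerank", 0.1)])))
```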
@ -413,7 +466,7 @@ class Canvas(Graph):
"component_type": self.get_component_type(self.path[i]), "component_type": self.get_component_type(self.path[i]),
"thoughts": self.get_component_thoughts(self.path[i]) "thoughts": self.get_component_thoughts(self.path[i])
}) })
_run_batch(idx, to) await _run_batch(idx, to)
to = len(self.path) to = len(self.path)
# post processing of components invocation # post processing of components invocation
for i in range(idx, to): for i in range(idx, to):
@ -422,16 +475,29 @@ class Canvas(Graph):
if cpn_obj.component_name.lower() == "message": if cpn_obj.component_name.lower() == "message":
if isinstance(cpn_obj.output("content"), partial): if isinstance(cpn_obj.output("content"), partial):
_m = "" _m = ""
for m in cpn_obj.output("content")(): stream = cpn_obj.output("content")()
if not m: if inspect.isasyncgen(stream):
continue async for m in stream:
if m == "<think>": if not m:
yield decorate("message", {"content": "", "start_to_think": True}) continue
elif m == "</think>": if m == "<think>":
yield decorate("message", {"content": "", "end_to_think": True}) yield decorate("message", {"content": "", "start_to_think": True})
else: elif m == "</think>":
yield decorate("message", {"content": m}) yield decorate("message", {"content": "", "end_to_think": True})
_m += m else:
yield decorate("message", {"content": m})
_m += m
else:
for m in stream:
if not m:
continue
if m == "<think>":
yield decorate("message", {"content": "", "start_to_think": True})
elif m == "</think>":
yield decorate("message", {"content": "", "end_to_think": True})
else:
yield decorate("message", {"content": m})
_m += m
cpn_obj.set_output("content", _m) cpn_obj.set_output("content", _m)
cite = re.search(r"\[ID:[ 0-9]+\]", _m) cite = re.search(r"\[ID:[ 0-9]+\]", _m)
else: else:
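Because streamed message content may now come from either a plain generator or an async generator, the consumer above branches on `inspect.isasyncgen`. A runnable sketch of that branch, with illustrative stream contents:

```python
import asyncio
import inspect

def sync_stream():
    yield from ["<think>", "reasoning...", "</think>", "final answer"]

async def async_stream():
    for chunk in ["<think>", "reasoning...", "</think>", "final answer"]:
        yield chunk

async def drain(stream):
    # Collect the visible content, skipping the think-markers, for either stream kind.
    collected = ""
    if inspect.isasyncgen(stream):
        async for m in stream:
            if m not in ("<think>", "</think>"):
                collected += m
    else:
        for m in stream:
            if m not in ("<think>", "</think>"):
                collected += m
    return collected

if __name__ == "__main__":
    print(asyncio.run(drain(sync_stream())))
    print(asyncio.run(drain(async_stream())))
```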
@ -440,7 +506,7 @@ class Canvas(Graph):
if isinstance(cpn_obj.output("attachment"), tuple): if isinstance(cpn_obj.output("attachment"), tuple):
yield decorate("message", {"attachment": cpn_obj.output("attachment")}) yield decorate("message", {"attachment": cpn_obj.output("attachment")})
yield decorate("message_end", {"reference": self.get_reference() if cite else None}) yield decorate("message_end", {"reference": self.get_reference() if cite else None})
while partials: while partials:
@ -462,7 +528,7 @@ class Canvas(Graph):
else: else:
self.error = cpn_obj.error() self.error = cpn_obj.error()
if cpn_obj.component_name.lower() != "iteration": if cpn_obj.component_name.lower() not in ("iteration","loop"):
if isinstance(cpn_obj.output("content"), partial): if isinstance(cpn_obj.output("content"), partial):
if self.error: if self.error:
cpn_obj.set_output("content", None) cpn_obj.set_output("content", None)
@ -487,14 +553,16 @@ class Canvas(Graph):
for cpn_id in cpn_ids: for cpn_id in cpn_ids:
_append_path(cpn_id) _append_path(cpn_id)
if cpn_obj.component_name.lower() == "iterationitem" and cpn_obj.end(): if cpn_obj.component_name.lower() in ("iterationitem","loopitem") and cpn_obj.end():
iter = cpn_obj.get_parent() iter = cpn_obj.get_parent()
yield _node_finished(iter) yield _node_finished(iter)
_extend_path(self.get_component(cpn["parent_id"])["downstream"]) _extend_path(self.get_component(cpn["parent_id"])["downstream"])
elif cpn_obj.component_name.lower() in ["categorize", "switch"]: elif cpn_obj.component_name.lower() in ["categorize", "switch"]:
_extend_path(cpn_obj.output("_next")) _extend_path(cpn_obj.output("_next"))
elif cpn_obj.component_name.lower() == "iteration": elif cpn_obj.component_name.lower() in ("iteration", "loop"):
_append_path(cpn_obj.get_start()) _append_path(cpn_obj.get_start())
elif cpn_obj.component_name.lower() == "exitloop" and cpn_obj.get_parent().component_name.lower() == "loop":
_extend_path(self.get_component(cpn["parent_id"])["downstream"])
elif not cpn["downstream"] and cpn_obj.get_parent(): elif not cpn["downstream"] and cpn_obj.get_parent():
_append_path(cpn_obj.get_parent().get_start()) _append_path(cpn_obj.get_parent().get_start())
else: else:
@ -579,21 +647,30 @@ class Canvas(Graph):
def get_component_input_elements(self, cpnnm): def get_component_input_elements(self, cpnnm):
return self.components[cpnnm]["obj"].get_input_elements() return self.components[cpnnm]["obj"].get_input_elements()
def get_files(self, files: Union[None, list[dict]]) -> list[str]: async def get_files_async(self, files: Union[None, list[dict]]) -> list[str]:
from api.db.services.file_service import FileService
if not files: if not files:
return [] return []
def image_to_base64(file): def image_to_base64(file):
return "data:{};base64,{}".format(file["mime_type"], return "data:{};base64,{}".format(file["mime_type"],
base64.b64encode(FileService.get_blob(file["created_by"], file["id"])).decode("utf-8")) base64.b64encode(FileService.get_blob(file["created_by"], file["id"])).decode("utf-8"))
exe = ThreadPoolExecutor(max_workers=5) loop = asyncio.get_running_loop()
threads = [] tasks = []
for file in files: for file in files:
if file["mime_type"].find("image") >=0: if file["mime_type"].find("image") >=0:
threads.append(exe.submit(image_to_base64, file)) tasks.append(loop.run_in_executor(self._thread_pool, image_to_base64, file))
continue continue
threads.append(exe.submit(FileService.parse, file["name"], FileService.get_blob(file["created_by"], file["id"]), True, file["created_by"])) tasks.append(loop.run_in_executor(self._thread_pool, FileService.parse, file["name"], FileService.get_blob(file["created_by"], file["id"]), True, file["created_by"]))
return [th.result() for th in threads] return await asyncio.gather(*tasks)
def get_files(self, files: Union[None, list[dict]]) -> list[str]:
"""
Synchronous wrapper for get_files_async, used by sync component invoke paths.
"""
loop = getattr(self, "_loop", None)
if loop and loop.is_running():
return asyncio.run_coroutine_threadsafe(self.get_files_async(files), loop).result()
return asyncio.run(self.get_files_async(files))
def tool_use_callback(self, agent_id: str, func_name: str, params: dict, result: Any, elapsed_time=None): def tool_use_callback(self, agent_id: str, func_name: str, params: dict, result: Any, elapsed_time=None):
agent_ids = agent_id.split("-->") agent_ids = agent_id.split("-->")
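The new `get_files` wrapper above bridges synchronous callers to `get_files_async`: if the loop captured in `run()` is still running, it schedules the coroutine with `asyncio.run_coroutine_threadsafe` and blocks on the result; otherwise it falls back to `asyncio.run`. A minimal sketch of that bridge, with illustrative names:

```python
import asyncio

async def fetch(value: str) -> str:
    await asyncio.sleep(0.05)
    return value.upper()

class Bridge:
    def __init__(self):
        self._loop = None

    async def run(self):
        self._loop = asyncio.get_running_loop()
        # Simulate a sync callback executing on a worker thread while the loop runs.
        await self._loop.run_in_executor(None, self.fetch_sync, "hello")

    def fetch_sync(self, value: str) -> str:
        loop = self._loop
        if loop and loop.is_running():
            # Worker thread: hand the coroutine to the running loop and wait.
            result = asyncio.run_coroutine_threadsafe(fetch(value), loop).result()
        else:
            # No loop available: run it to completion here.
            result = asyncio.run(fetch(value))
        print(result)
        return result

if __name__ == "__main__":
    asyncio.run(Bridge().run())
```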
@ -647,4 +724,3 @@ class Canvas(Graph):
def get_component_thoughts(self, cpn_id) -> str: def get_component_thoughts(self, cpn_id) -> str:
return self.components.get(cpn_id)["obj"].thoughts() return self.components.get(cpn_id)["obj"].thoughts()

View File

@ -13,6 +13,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# #
import json
import logging import logging
import os import os
import re import re
@ -29,7 +30,7 @@ from api.db.services.tenant_llm_service import TenantLLMService
from api.db.services.mcp_server_service import MCPServerService from api.db.services.mcp_server_service import MCPServerService
from common.connection_utils import timeout from common.connection_utils import timeout
from rag.prompts.generator import next_step, COMPLETE_TASK, analyze_task, \ from rag.prompts.generator import next_step, COMPLETE_TASK, analyze_task, \
citation_prompt, reflect, rank_memories, kb_prompt, citation_plus, full_question, message_fit_in citation_prompt, reflect, rank_memories, kb_prompt, citation_plus, full_question, message_fit_in, structured_output_prompt
from common.mcp_tool_call_conn import MCPToolCallSession, mcp_tool_metadata_to_openai_tool from common.mcp_tool_call_conn import MCPToolCallSession, mcp_tool_metadata_to_openai_tool
from agent.component.llm import LLMParam, LLM from agent.component.llm import LLMParam, LLM
@ -137,6 +138,29 @@ class Agent(LLM, ToolBase):
res.update(cpn.get_input_form()) res.update(cpn.get_input_form())
return res return res
def _get_output_schema(self):
try:
cand = self._param.outputs.get("structured")
except Exception:
return None
if isinstance(cand, dict):
if isinstance(cand.get("properties"), dict) and len(cand["properties"]) > 0:
return cand
for k in ("schema", "structured"):
if isinstance(cand.get(k), dict) and isinstance(cand[k].get("properties"), dict) and len(cand[k]["properties"]) > 0:
return cand[k]
return None
def _force_format_to_schema(self, text: str, schema_prompt: str) -> str:
fmt_msgs = [
{"role": "system", "content": schema_prompt + "\nIMPORTANT: Output ONLY valid JSON. No markdown, no extra text."},
{"role": "user", "content": text},
]
_, fmt_msgs = message_fit_in(fmt_msgs, int(self.chat_mdl.max_length * 0.97))
return self._generate(fmt_msgs)
@timeout(int(os.environ.get("COMPONENT_EXEC_TIMEOUT", 20*60))) @timeout(int(os.environ.get("COMPONENT_EXEC_TIMEOUT", 20*60)))
def _invoke(self, **kwargs): def _invoke(self, **kwargs):
if self.check_if_canceled("Agent processing"): if self.check_if_canceled("Agent processing"):
@ -160,17 +184,22 @@ class Agent(LLM, ToolBase):
return LLM._invoke(self, **kwargs) return LLM._invoke(self, **kwargs)
prompt, msg, user_defined_prompt = self._prepare_prompt_variables() prompt, msg, user_defined_prompt = self._prepare_prompt_variables()
output_schema = self._get_output_schema()
schema_prompt = ""
if output_schema:
schema = json.dumps(output_schema, ensure_ascii=False, indent=2)
schema_prompt = structured_output_prompt(schema)
downstreams = self._canvas.get_component(self._id)["downstream"] if self._canvas.get_component(self._id) else [] downstreams = self._canvas.get_component(self._id)["downstream"] if self._canvas.get_component(self._id) else []
ex = self.exception_handler() ex = self.exception_handler()
if any([self._canvas.get_component_obj(cid).component_name.lower()=="message" for cid in downstreams]) and not (ex and ex["goto"]): if any([self._canvas.get_component_obj(cid).component_name.lower()=="message" for cid in downstreams]) and not (ex and ex["goto"]) and not output_schema:
self.set_output("content", partial(self.stream_output_with_tools, prompt, msg, user_defined_prompt)) self.set_output("content", partial(self.stream_output_with_tools, prompt, msg, user_defined_prompt))
return return
_, msg = message_fit_in([{"role": "system", "content": prompt}, *msg], int(self.chat_mdl.max_length * 0.97)) _, msg = message_fit_in([{"role": "system", "content": prompt}, *msg], int(self.chat_mdl.max_length * 0.97))
use_tools = [] use_tools = []
ans = "" ans = ""
for delta_ans, tk in self._react_with_tools_streamly(prompt, msg, use_tools, user_defined_prompt): for delta_ans, tk in self._react_with_tools_streamly(prompt, msg, use_tools, user_defined_prompt,schema_prompt=schema_prompt):
if self.check_if_canceled("Agent processing"): if self.check_if_canceled("Agent processing"):
return return
ans += delta_ans ans += delta_ans
@ -183,6 +212,28 @@ class Agent(LLM, ToolBase):
self.set_output("_ERROR", ans) self.set_output("_ERROR", ans)
return return
if output_schema:
error = ""
for _ in range(self._param.max_retries + 1):
try:
def clean_formated_answer(ans: str) -> str:
ans = re.sub(r"^.*</think>", "", ans, flags=re.DOTALL)
ans = re.sub(r"^.*```json", "", ans, flags=re.DOTALL)
return re.sub(r"```\n*$", "", ans, flags=re.DOTALL)
obj = json_repair.loads(clean_formated_answer(ans))
self.set_output("structured", obj)
if use_tools:
self.set_output("use_tools", use_tools)
return obj
except Exception:
error = "The answer cannot be parsed as JSON"
ans = self._force_format_to_schema(ans, schema_prompt)
if ans.find("**ERROR**") >= 0:
continue
self.set_output("_ERROR", error)
return
self.set_output("content", ans) self.set_output("content", ans)
if use_tools: if use_tools:
self.set_output("use_tools", use_tools) self.set_output("use_tools", use_tools)
@ -219,7 +270,7 @@ class Agent(LLM, ToolBase):
]): ]):
yield delta_ans yield delta_ans
def _react_with_tools_streamly(self, prompt, history: list[dict], use_tools, user_defined_prompt={}): def _react_with_tools_streamly(self, prompt, history: list[dict], use_tools, user_defined_prompt={}, schema_prompt: str = ""):
token_count = 0 token_count = 0
tool_metas = self.tool_meta tool_metas = self.tool_meta
hist = deepcopy(history) hist = deepcopy(history)
@ -256,9 +307,13 @@ class Agent(LLM, ToolBase):
def complete(): def complete():
nonlocal hist nonlocal hist
need2cite = self._param.cite and self._canvas.get_reference()["chunks"] and self._id.find("-->") < 0 need2cite = self._param.cite and self._canvas.get_reference()["chunks"] and self._id.find("-->") < 0
if schema_prompt:
need2cite = False
cited = False cited = False
if hist[0]["role"] == "system" and need2cite: if hist and hist[0]["role"] == "system":
if len(hist) < 7: if schema_prompt:
hist[0]["content"] += "\n" + schema_prompt
if need2cite and len(hist) < 7:
hist[0]["content"] += citation_prompt() hist[0]["content"] += citation_prompt()
cited = True cited = True
yield "", token_count yield "", token_count
@ -369,7 +424,7 @@ Respond immediately with your final comprehensive answer.
""" """
for k in self._param.outputs.keys(): for k in self._param.outputs.keys():
self._param.outputs[k]["value"] = None self._param.outputs[k]["value"] = None
for k, cpn in self.tools.items(): for k, cpn in self.tools.items():
if hasattr(cpn, "reset") and callable(cpn.reset): if hasattr(cpn, "reset") and callable(cpn.reset):
cpn.reset() cpn.reset()

View File

@ -14,6 +14,7 @@
# limitations under the License. # limitations under the License.
# #
from agent.component.fillup import UserFillUpParam, UserFillUp from agent.component.fillup import UserFillUpParam, UserFillUp
from api.db.services.file_service import FileService
class BeginParam(UserFillUpParam): class BeginParam(UserFillUpParam):
@ -48,7 +49,7 @@ class Begin(UserFillUp):
if v.get("optional") and v.get("value", None) is None: if v.get("optional") and v.get("value", None) is None:
v = None v = None
else: else:
v = self._canvas.get_files([v["value"]]) v = FileService.get_files([v["value"]])
else: else:
v = v.get("value") v = v.get("value")
self.set_output(k, v) self.set_output(k, v)

View File

@ -0,0 +1,32 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from abc import ABC
from agent.component.base import ComponentBase, ComponentParamBase
class ExitLoopParam(ComponentParamBase, ABC):
def check(self):
return True
class ExitLoop(ComponentBase, ABC):
component_name = "ExitLoop"
def _invoke(self, **kwargs):
pass
def thoughts(self) -> str:
return ""

View File

@ -32,7 +32,7 @@ class IterationParam(ComponentParamBase):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
self.items_ref = "" self.items_ref = ""
self.veriable={} self.variable={}
def get_input_form(self) -> dict[str, dict]: def get_input_form(self) -> dict[str, dict]:
return { return {

View File

@ -205,6 +205,55 @@ class LLM(ComponentBase):
for txt in self.chat_mdl.chat_streamly(msg[0]["content"], msg[1:], self._param.gen_conf(), images=self.imgs, **kwargs): for txt in self.chat_mdl.chat_streamly(msg[0]["content"], msg[1:], self._param.gen_conf(), images=self.imgs, **kwargs):
yield delta(txt) yield delta(txt)
async def _stream_output_async(self, prompt, msg):
_, msg = message_fit_in([{"role": "system", "content": prompt}, *msg], int(self.chat_mdl.max_length * 0.97))
answer = ""
last_idx = 0
endswith_think = False
def delta(txt):
nonlocal answer, last_idx, endswith_think
delta_ans = txt[last_idx:]
answer = txt
if delta_ans.find("<think>") == 0:
last_idx += len("<think>")
return "<think>"
elif delta_ans.find("<think>") > 0:
delta_ans = txt[last_idx:last_idx + delta_ans.find("<think>")]
last_idx += delta_ans.find("<think>")
return delta_ans
elif delta_ans.endswith("</think>"):
endswith_think = True
elif endswith_think:
endswith_think = False
return "</think>"
last_idx = len(answer)
if answer.endswith("</think>"):
last_idx -= len("</think>")
return re.sub(r"(<think>|</think>)", "", delta_ans)
stream_kwargs = {"images": self.imgs} if self.imgs else {}
async for ans in self.chat_mdl.async_chat_streamly(msg[0]["content"], msg[1:], self._param.gen_conf(), **stream_kwargs):
if self.check_if_canceled("LLM streaming"):
return
if isinstance(ans, int):
continue
if ans.find("**ERROR**") >= 0:
if self.get_exception_default_value():
self.set_output("content", self.get_exception_default_value())
yield self.get_exception_default_value()
else:
self.set_output("_ERROR", ans)
return
yield delta(ans)
self.set_output("content", answer)
@timeout(int(os.environ.get("COMPONENT_EXEC_TIMEOUT", 10*60))) @timeout(int(os.environ.get("COMPONENT_EXEC_TIMEOUT", 10*60)))
def _invoke(self, **kwargs): def _invoke(self, **kwargs):
if self.check_if_canceled("LLM processing"): if self.check_if_canceled("LLM processing"):
@ -222,7 +271,7 @@ class LLM(ComponentBase):
output_structure = self._param.outputs['structured'] output_structure = self._param.outputs['structured']
except Exception: except Exception:
pass pass
if output_structure and isinstance(output_structure, dict) and output_structure.get("properties"): if output_structure and isinstance(output_structure, dict) and output_structure.get("properties") and len(output_structure["properties"]) > 0:
schema=json.dumps(output_structure, ensure_ascii=False, indent=2) schema=json.dumps(output_structure, ensure_ascii=False, indent=2)
prompt += structured_output_prompt(schema) prompt += structured_output_prompt(schema)
for _ in range(self._param.max_retries+1): for _ in range(self._param.max_retries+1):
@ -250,7 +299,7 @@ class LLM(ComponentBase):
downstreams = self._canvas.get_component(self._id)["downstream"] if self._canvas.get_component(self._id) else [] downstreams = self._canvas.get_component(self._id)["downstream"] if self._canvas.get_component(self._id) else []
ex = self.exception_handler() ex = self.exception_handler()
if any([self._canvas.get_component_obj(cid).component_name.lower()=="message" for cid in downstreams]) and not (ex and ex["goto"]): if any([self._canvas.get_component_obj(cid).component_name.lower()=="message" for cid in downstreams]) and not (ex and ex["goto"]):
self.set_output("content", partial(self._stream_output, prompt, msg)) self.set_output("content", partial(self._stream_output_async, prompt, msg))
return return
for _ in range(self._param.max_retries+1): for _ in range(self._param.max_retries+1):

80
agent/component/loop.py Normal file
View File

@ -0,0 +1,80 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from abc import ABC
from agent.component.base import ComponentBase, ComponentParamBase
class LoopParam(ComponentParamBase):
"""
Define the Loop component parameters.
"""
def __init__(self):
super().__init__()
self.loop_variables = []
self.loop_termination_condition=[]
self.maximum_loop_count = 0
def get_input_form(self) -> dict[str, dict]:
return {
"items": {
"type": "json",
"name": "Items"
}
}
def check(self):
return True
class Loop(ComponentBase, ABC):
component_name = "Loop"
def get_start(self):
for cid in self._canvas.components.keys():
if self._canvas.get_component(cid)["obj"].component_name.lower() != "loopitem":
continue
if self._canvas.get_component(cid)["parent_id"] == self._id:
return cid
def _invoke(self, **kwargs):
if self.check_if_canceled("Loop processing"):
return
for item in self._param.loop_variables:
if any([not item.get("variable"), not item.get("input_mode"), not item.get("value"), not item.get("type")]):
raise ValueError("Loop Variable is not complete.")
if item["input_mode"]=="variable":
self.set_output(item["variable"],self._canvas.get_variable_value(item["value"]))
elif item["input_mode"]=="constant":
self.set_output(item["variable"],item["value"])
else:
if item["type"] == "number":
self.set_output(item["variable"], 0)
elif item["type"] == "string":
self.set_output(item["variable"], "")
elif item["type"] == "boolean":
self.set_output(item["variable"], False)
elif item["type"].startswith("object"):
self.set_output(item["variable"], {})
elif item["type"].startswith("array"):
self.set_output(item["variable"], [])
else:
self.set_output(item["variable"], "")
def thoughts(self) -> str:
return "Loop from canvas."

163
agent/component/loopitem.py Normal file
View File

@ -0,0 +1,163 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from abc import ABC
from agent.component.base import ComponentBase, ComponentParamBase
class LoopItemParam(ComponentParamBase):
"""
Define the LoopItem component parameters.
"""
def check(self):
return True
class LoopItem(ComponentBase, ABC):
component_name = "LoopItem"
def __init__(self, canvas, id, param: ComponentParamBase):
super().__init__(canvas, id, param)
self._idx = 0
def _invoke(self, **kwargs):
if self.check_if_canceled("LoopItem processing"):
return
parent = self.get_parent()
maximum_loop_count = parent._param.maximum_loop_count
if self._idx >= maximum_loop_count:
self._idx = -1
return
if self._idx > 0:
if self.check_if_canceled("LoopItem processing"):
return
self._idx += 1
def evaluate_condition(self,var, operator, value):
if isinstance(var, str):
if operator == "contains":
return value in var
elif operator == "not contains":
return value not in var
elif operator == "start with":
return var.startswith(value)
elif operator == "end with":
return var.endswith(value)
elif operator == "is":
return var == value
elif operator == "is not":
return var != value
elif operator == "empty":
return var == ""
elif operator == "not empty":
return var != ""
elif isinstance(var, (int, float)):
if operator == "=":
return var == value
elif operator == "":
return var != value
elif operator == ">":
return var > value
elif operator == "<":
return var < value
elif operator == "":
return var >= value
elif operator == "":
return var <= value
elif operator == "empty":
return var is None
elif operator == "not empty":
return var is not None
elif isinstance(var, bool):
if operator == "is":
return var is value
elif operator == "is not":
return var is not value
elif operator == "empty":
return var is None
elif operator == "not empty":
return var is not None
elif isinstance(var, dict):
if operator == "empty":
return len(var) == 0
elif operator == "not empty":
return len(var) > 0
elif isinstance(var, list):
if operator == "contains":
return value in var
elif operator == "not contains":
return value not in var
elif operator == "is":
return var == value
elif operator == "is not":
return var != value
elif operator == "empty":
return len(var) == 0
elif operator == "not empty":
return len(var) > 0
raise Exception(f"Invalid operator: {operator}")
def end(self):
if self._idx == -1:
return True
parent = self.get_parent()
logical_operator = parent._param.logical_operator if hasattr(parent._param, "logical_operator") else "and"
conditions = []
for item in parent._param.loop_termination_condition:
if not item.get("variable") or not item.get("operator"):
raise ValueError("Loop condition is incomplete.")
var = self._canvas.get_variable_value(item["variable"])
operator = item["operator"]
input_mode = item.get("input_mode", "constant")
if input_mode == "variable":
value = self._canvas.get_variable_value(item.get("value", ""))
elif input_mode == "constant":
value = item.get("value", "")
else:
raise ValueError("Invalid input mode.")
conditions.append(self.evaluate_condition(var, operator, value))
should_end = (
all(conditions) if logical_operator == "and"
else any(conditions) if logical_operator == "or"
else None
)
if should_end is None:
raise ValueError("Invalid logical operator,should be 'and' or 'or'.")
if should_end:
self._idx = -1
return True
return False
def next(self):
if self._idx == -1:
self._idx = 0
else:
self._idx += 1
if self._idx >= len(self._items):
self._idx = -1
return False
def thoughts(self) -> str:
return "Next turn..."

View File

@ -13,6 +13,8 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# #
import asyncio
import inspect
import json import json
import os import os
import random import random
@ -39,6 +41,7 @@ class MessageParam(ComponentParamBase):
self.content = [] self.content = []
self.stream = True self.stream = True
self.output_format = None # default output format self.output_format = None # default output format
self.auto_play = False
self.outputs = { self.outputs = {
"content": { "content": {
"type": "str" "type": "str"
@ -66,8 +69,12 @@ class Message(ComponentBase):
v = "" v = ""
ans = "" ans = ""
if isinstance(v, partial): if isinstance(v, partial):
for t in v(): iter_obj = v()
ans += t if inspect.isasyncgen(iter_obj):
ans = asyncio.run(self._consume_async_gen(iter_obj))
else:
for t in iter_obj:
ans += t
elif isinstance(v, list) and delimiter: elif isinstance(v, list) and delimiter:
ans = delimiter.join([str(vv) for vv in v]) ans = delimiter.join([str(vv) for vv in v])
elif not isinstance(v, str): elif not isinstance(v, str):
@ -89,7 +96,13 @@ class Message(ComponentBase):
_kwargs[_n] = v _kwargs[_n] = v
return script, _kwargs return script, _kwargs
def _stream(self, rand_cnt:str): async def _consume_async_gen(self, agen):
buf = ""
async for t in agen:
buf += t
return buf
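Because LLM outputs may now arrive as async generators while older components still hand over plain generators, the Message component branches on inspect.isasyncgen before draining. A self-contained sketch of that dispatch (the asyncio.run call assumes no event loop is already running in the calling thread):

import asyncio
import inspect

def collect(producer) -> str:
    gen = producer()
    if inspect.isasyncgen(gen):
        async def drain():
            buf = ""
            async for piece in gen:
                buf += piece
            return buf
        return asyncio.run(drain())
    return "".join(gen)

async def async_pieces():
    for piece in ("Hello, ", "world"):
        yield piece

print(collect(async_pieces))                 # async-generator path -> "Hello, world"
print(collect(lambda: iter(["a", "b"])))     # plain-iterator path  -> "ab"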
async def _stream(self, rand_cnt:str):
s = 0 s = 0
all_content = "" all_content = ""
cache = {} cache = {}
@ -111,15 +124,27 @@ class Message(ComponentBase):
v = "" v = ""
if isinstance(v, partial): if isinstance(v, partial):
cnt = "" cnt = ""
for t in v(): iter_obj = v()
if self.check_if_canceled("Message streaming"): if inspect.isasyncgen(iter_obj):
return async for t in iter_obj:
if self.check_if_canceled("Message streaming"):
return
all_content += t all_content += t
cnt += t cnt += t
yield t yield t
else:
for t in iter_obj:
if self.check_if_canceled("Message streaming"):
return
all_content += t
cnt += t
yield t
self.set_input_value(exp, cnt) self.set_input_value(exp, cnt)
continue continue
elif inspect.isawaitable(v):
v = await v
elif not isinstance(v, str): elif not isinstance(v, str):
try: try:
v = json.dumps(v, ensure_ascii=False) v = json.dumps(v, ensure_ascii=False)
@ -181,7 +206,7 @@ class Message(ComponentBase):
import pypandoc import pypandoc
doc_id = get_uuid() doc_id = get_uuid()
if self._param.output_format.lower() not in {"markdown", "html", "pdf", "docx"}: if self._param.output_format.lower() not in {"markdown", "html", "pdf", "docx"}:
self._param.output_format = "markdown" self._param.output_format = "markdown"
@ -231,11 +256,11 @@ class Message(ComponentBase):
settings.STORAGE_IMPL.put(self._canvas._tenant_id, doc_id, binary_content) settings.STORAGE_IMPL.put(self._canvas._tenant_id, doc_id, binary_content)
self.set_output("attachment", { self.set_output("attachment", {
"doc_id":doc_id, "doc_id":doc_id,
"format":self._param.output_format, "format":self._param.output_format,
"file_name":f"{doc_id[:8]}.{self._param.output_format}"}) "file_name":f"{doc_id[:8]}.{self._param.output_format}"})
logging.info(f"Converted content uploaded as {doc_id} (format={self._param.output_format})") logging.info(f"Converted content uploaded as {doc_id} (format={self._param.output_format})")
except Exception as e: except Exception as e:
logging.error(f"Error converting content to {self._param.output_format}: {e}") logging.error(f"Error converting content to {self._param.output_format}: {e}")

View File

@ -13,16 +13,20 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# #
import ast
import base64 import base64
import json
import logging import logging
import os import os
from abc import ABC from abc import ABC
from strenum import StrEnum
from typing import Optional from typing import Optional
from pydantic import BaseModel, Field, field_validator from pydantic import BaseModel, Field, field_validator
from agent.tools.base import ToolParamBase, ToolBase, ToolMeta from strenum import StrEnum
from common.connection_utils import timeout
from agent.tools.base import ToolBase, ToolMeta, ToolParamBase
from common import settings from common import settings
from common.connection_utils import timeout
class Language(StrEnum): class Language(StrEnum):
@ -62,10 +66,10 @@ class CodeExecParam(ToolParamBase):
""" """
def __init__(self): def __init__(self):
self.meta:ToolMeta = { self.meta: ToolMeta = {
"name": "execute_code", "name": "execute_code",
"description": """ "description": """
This tool has a sandbox that can execute code written in 'Python'/'Javascript'. It recieves a piece of code and return a Json string. This tool has a sandbox that can execute code written in 'Python'/'Javascript'. It receives a piece of code and return a Json string.
Here's a code example for Python(`main` function MUST be included): Here's a code example for Python(`main` function MUST be included):
def main() -> dict: def main() -> dict:
\"\"\" \"\"\"
@ -99,16 +103,12 @@ module.exports = { main };
"enum": ["python", "javascript"], "enum": ["python", "javascript"],
"required": True, "required": True,
}, },
"script": { "script": {"type": "string", "description": "A piece of code in right format. There MUST be main function.", "required": True},
"type": "string", },
"description": "A piece of code in right format. There MUST be main function.",
"required": True
}
}
} }
super().__init__() super().__init__()
self.lang = Language.PYTHON.value self.lang = Language.PYTHON.value
self.script = "def main(arg1: str, arg2: str) -> dict: return {\"result\": arg1 + arg2}" self.script = 'def main(arg1: str, arg2: str) -> dict: return {"result": arg1 + arg2}'
self.arguments = {} self.arguments = {}
self.outputs = {"result": {"value": "", "type": "string"}} self.outputs = {"result": {"value": "", "type": "string"}}
@ -119,17 +119,14 @@ module.exports = { main };
def get_input_form(self) -> dict[str, dict]: def get_input_form(self) -> dict[str, dict]:
res = {} res = {}
for k, v in self.arguments.items(): for k, v in self.arguments.items():
res[k] = { res[k] = {"type": "line", "name": k}
"type": "line",
"name": k
}
return res return res
class CodeExec(ToolBase, ABC): class CodeExec(ToolBase, ABC):
component_name = "CodeExec" component_name = "CodeExec"
@timeout(int(os.environ.get("COMPONENT_EXEC_TIMEOUT", 10*60))) @timeout(int(os.environ.get("COMPONENT_EXEC_TIMEOUT", 10 * 60)))
def _invoke(self, **kwargs): def _invoke(self, **kwargs):
if self.check_if_canceled("CodeExec processing"): if self.check_if_canceled("CodeExec processing"):
return return
@ -138,17 +135,12 @@ class CodeExec(ToolBase, ABC):
script = kwargs.get("script", self._param.script) script = kwargs.get("script", self._param.script)
arguments = {} arguments = {}
for k, v in self._param.arguments.items(): for k, v in self._param.arguments.items():
if kwargs.get(k): if kwargs.get(k):
arguments[k] = kwargs[k] arguments[k] = kwargs[k]
continue continue
arguments[k] = self._canvas.get_variable_value(v) if v else None arguments[k] = self._canvas.get_variable_value(v) if v else None
self._execute_code( self._execute_code(language=lang, code=script, arguments=arguments)
language=lang,
code=script,
arguments=arguments
)
def _execute_code(self, language: str, code: str, arguments: dict): def _execute_code(self, language: str, code: str, arguments: dict):
import requests import requests
@ -169,7 +161,7 @@ class CodeExec(ToolBase, ABC):
if self.check_if_canceled("CodeExec execution"): if self.check_if_canceled("CodeExec execution"):
return "Task has been canceled" return "Task has been canceled"
resp = requests.post(url=f"http://{settings.SANDBOX_HOST}:9385/run", json=code_req, timeout=int(os.environ.get("COMPONENT_EXEC_TIMEOUT", 10*60))) resp = requests.post(url=f"http://{settings.SANDBOX_HOST}:9385/run", json=code_req, timeout=int(os.environ.get("COMPONENT_EXEC_TIMEOUT", 10 * 60)))
logging.info(f"http://{settings.SANDBOX_HOST}:9385/run, code_req: {code_req}, resp.status_code {resp.status_code}:") logging.info(f"http://{settings.SANDBOX_HOST}:9385/run, code_req: {code_req}, resp.status_code {resp.status_code}:")
if self.check_if_canceled("CodeExec execution"): if self.check_if_canceled("CodeExec execution"):
@ -183,35 +175,10 @@ class CodeExec(ToolBase, ABC):
if stderr: if stderr:
self.set_output("_ERROR", stderr) self.set_output("_ERROR", stderr)
return return
try: raw_stdout = body.get("stdout", "")
rt = eval(body.get("stdout", "")) parsed_stdout = self._deserialize_stdout(raw_stdout)
except Exception: logging.info(f"[CodeExec]: http://{settings.SANDBOX_HOST}:9385/run -> {parsed_stdout}")
rt = body.get("stdout", "") self._populate_outputs(parsed_stdout, raw_stdout)
logging.info(f"http://{settings.SANDBOX_HOST}:9385/run -> {rt}")
if isinstance(rt, tuple):
for i, (k, o) in enumerate(self._param.outputs.items()):
if self.check_if_canceled("CodeExec execution"):
return
if k.find("_") == 0:
continue
o["value"] = rt[i]
elif isinstance(rt, dict):
for i, (k, o) in enumerate(self._param.outputs.items()):
if self.check_if_canceled("CodeExec execution"):
return
if k not in rt or k.find("_") == 0:
continue
o["value"] = rt[k]
else:
for i, (k, o) in enumerate(self._param.outputs.items()):
if self.check_if_canceled("CodeExec execution"):
return
if k.find("_") == 0:
continue
o["value"] = rt
else: else:
self.set_output("_ERROR", "There is no response from sandbox") self.set_output("_ERROR", "There is no response from sandbox")
@ -228,3 +195,149 @@ class CodeExec(ToolBase, ABC):
def thoughts(self) -> str: def thoughts(self) -> str:
return "Running a short script to process data." return "Running a short script to process data."
def _deserialize_stdout(self, stdout: str):
text = str(stdout).strip()
if not text:
return ""
for loader in (json.loads, ast.literal_eval):
try:
return loader(text)
except Exception:
continue
return text
def _coerce_output_value(self, value, expected_type: Optional[str]):
if expected_type is None:
return value
etype = expected_type.strip().lower()
inner_type = None
if etype.startswith("array<") and etype.endswith(">"):
inner_type = etype[6:-1].strip()
etype = "array"
try:
if etype == "string":
return "" if value is None else str(value)
if etype == "number":
if value is None or value == "":
return None
if isinstance(value, (int, float)):
return value
if isinstance(value, str):
try:
return float(value)
except Exception:
return value
return float(value)
if etype == "boolean":
if isinstance(value, bool):
return value
if isinstance(value, str):
lv = value.lower()
if lv in ("true", "1", "yes", "y", "on"):
return True
if lv in ("false", "0", "no", "n", "off"):
return False
return bool(value)
if etype == "array":
candidate = value
if isinstance(candidate, str):
parsed = self._deserialize_stdout(candidate)
candidate = parsed
if isinstance(candidate, tuple):
candidate = list(candidate)
if not isinstance(candidate, list):
candidate = [] if candidate is None else [candidate]
if inner_type == "string":
return ["" if v is None else str(v) for v in candidate]
if inner_type == "number":
coerced = []
for v in candidate:
try:
if v is None or v == "":
coerced.append(None)
elif isinstance(v, (int, float)):
coerced.append(v)
else:
coerced.append(float(v))
except Exception:
coerced.append(v)
return coerced
return candidate
if etype == "object":
if isinstance(value, dict):
return value
if isinstance(value, str):
parsed = self._deserialize_stdout(value)
if isinstance(parsed, dict):
return parsed
return value
except Exception:
return value
return value
def _populate_outputs(self, parsed_stdout, raw_stdout: str):
outputs_items = list(self._param.outputs.items())
logging.info(f"[CodeExec]: outputs schema keys: {[k for k, _ in outputs_items]}")
if not outputs_items:
return
if isinstance(parsed_stdout, dict):
for key, meta in outputs_items:
if key.startswith("_"):
continue
val = self._get_by_path(parsed_stdout, key)
coerced = self._coerce_output_value(val, meta.get("type"))
logging.info(f"[CodeExec]: populate dict key='{key}' raw='{val}' coerced='{coerced}'")
self.set_output(key, coerced)
return
if isinstance(parsed_stdout, (list, tuple)):
for idx, (key, meta) in enumerate(outputs_items):
if key.startswith("_"):
continue
val = parsed_stdout[idx] if idx < len(parsed_stdout) else None
coerced = self._coerce_output_value(val, meta.get("type"))
logging.info(f"[CodeExec]: populate list key='{key}' raw='{val}' coerced='{coerced}'")
self.set_output(key, coerced)
return
default_val = parsed_stdout if parsed_stdout is not None else raw_stdout
for idx, (key, meta) in enumerate(outputs_items):
if key.startswith("_"):
continue
val = default_val if idx == 0 else None
coerced = self._coerce_output_value(val, meta.get("type"))
logging.info(f"[CodeExec]: populate scalar key='{key}' raw='{val}' coerced='{coerced}'")
self.set_output(key, coerced)
def _get_by_path(self, data, path: str):
if not path:
return None
cur = data
for part in path.split("."):
part = part.strip()
if not part:
return None
if isinstance(cur, dict):
cur = cur.get(part)
elif isinstance(cur, list):
try:
idx = int(part)
cur = cur[idx]
except Exception:
return None
else:
return None
if cur is None:
return None
logging.info(f"[CodeExec]: resolve path '{path}' -> {cur}")
return cur
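The refactor above replaces the old eval() of sandbox stdout with a safer two-step fallback (_deserialize_stdout), then maps the parsed value onto the declared outputs with per-type coercion. The fallback order in isolation, with hypothetical sandbox outputs:

import ast
import json

def deserialize_stdout(text: str):
    # Same order as _deserialize_stdout: JSON first, then a Python literal, else the raw string.
    text = str(text).strip()
    if not text:
        return ""
    for loader in (json.loads, ast.literal_eval):
        try:
            return loader(text)
        except Exception:
            continue
    return text

assert deserialize_stdout('{"total": 42}') == {"total": 42}   # JSON object
assert deserialize_stdout("('ok', 3)") == ("ok", 3)           # Python tuple literal
assert deserialize_stdout("plain text") == "plain text"       # neither: keep the raw string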

View File

@ -132,12 +132,12 @@ class Retrieval(ToolBase, ABC):
metas = DocumentService.get_meta_by_kbs(kb_ids) metas = DocumentService.get_meta_by_kbs(kb_ids)
if self._param.meta_data_filter.get("method") == "auto": if self._param.meta_data_filter.get("method") == "auto":
chat_mdl = LLMBundle(self._canvas.get_tenant_id(), LLMType.CHAT) chat_mdl = LLMBundle(self._canvas.get_tenant_id(), LLMType.CHAT)
filters = gen_meta_filter(chat_mdl, metas, query) filters: dict = gen_meta_filter(chat_mdl, metas, query)
doc_ids.extend(meta_filter(metas, filters)) doc_ids.extend(meta_filter(metas, filters["conditions"], filters.get("logic", "and")))
if not doc_ids: if not doc_ids:
doc_ids = None doc_ids = None
elif self._param.meta_data_filter.get("method") == "manual": elif self._param.meta_data_filter.get("method") == "manual":
filters=self._param.meta_data_filter["manual"] filters = self._param.meta_data_filter["manual"]
for flt in filters: for flt in filters:
pat = re.compile(self.variable_ref_patt) pat = re.compile(self.variable_ref_patt)
s = flt["value"] s = flt["value"]
@ -165,9 +165,9 @@ class Retrieval(ToolBase, ABC):
out_parts.append(s[last:]) out_parts.append(s[last:])
flt["value"] = "".join(out_parts) flt["value"] = "".join(out_parts)
doc_ids.extend(meta_filter(metas, filters)) doc_ids.extend(meta_filter(metas, filters, self._param.meta_data_filter.get("logic", "and")))
if not doc_ids: if filters and not doc_ids:
doc_ids = None doc_ids = ["-999"]
if self._param.cross_languages: if self._param.cross_languages:
query = cross_languages(kbs[0].tenant_id, None, query, self._param.cross_languages) query = cross_languages(kbs[0].tenant_id, None, query, self._param.cross_languages)
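Two behavioural points in this hunk: gen_meta_filter now returns a dict carrying both the conditions and the combining logic, and a non-empty manual filter that matches nothing sets doc_ids to ["-999"], a sentinel that makes retrieval return no chunks instead of silently ignoring the filter. The call shape implied by these lines (only the "conditions"/"logic" keys come from the diff; the per-condition fields in the comment are hypothetical):

filters: dict = gen_meta_filter(chat_mdl, metas, query)
# e.g. filters might look like (hypothetical fields):
# {"logic": "and", "conditions": [{"name": "author", "comparison_operator": "is", "value": "Alice"}]}
doc_ids.extend(meta_filter(metas, filters["conditions"], filters.get("logic", "and")))
if not doc_ids:
    doc_ids = None   # auto mode: fall back to no document filter at all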

View File

@ -13,18 +13,17 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# #
import logging
import os import os
import sys import sys
import logging
from importlib.util import module_from_spec, spec_from_file_location from importlib.util import module_from_spec, spec_from_file_location
from pathlib import Path from pathlib import Path
from quart import Blueprint, Quart, request, g, current_app, session from quart import Blueprint, Quart, request, g, current_app, session
from werkzeug.wrappers.request import Request
from flasgger import Swagger from flasgger import Swagger
from itsdangerous.url_safe import URLSafeTimedSerializer as Serializer from itsdangerous.url_safe import URLSafeTimedSerializer as Serializer
from quart_cors import cors from quart_cors import cors
from common.constants import StatusEnum from common.constants import StatusEnum
from api.db.db_models import close_connection from api.db.db_models import close_connection, APIToken
from api.db.services import UserService from api.db.services import UserService
from api.utils.json_encode import CustomJSONEncoder from api.utils.json_encode import CustomJSONEncoder
from api.utils import commands from api.utils import commands
@ -40,7 +39,6 @@ settings.init_settings()
__all__ = ["app"] __all__ = ["app"]
Request.json = property(lambda self: self.get_json(force=True, silent=True))
app = Quart(__name__) app = Quart(__name__)
app = cors(app, allow_origin="*") app = cors(app, allow_origin="*")
@ -82,6 +80,11 @@ app.url_map.strict_slashes = False
app.json_encoder = CustomJSONEncoder app.json_encoder = CustomJSONEncoder
app.errorhandler(Exception)(server_error_response) app.errorhandler(Exception)(server_error_response)
# Configure Quart timeouts for slow LLM responses (e.g., local Ollama on CPU)
# Default Quart timeouts are 60 seconds which is too short for many LLM backends
app.config["RESPONSE_TIMEOUT"] = int(os.environ.get("QUART_RESPONSE_TIMEOUT", 600))
app.config["BODY_TIMEOUT"] = int(os.environ.get("QUART_BODY_TIMEOUT", 600))
## convenience for dev and debug ## convenience for dev and debug
# app.config["LOGIN_DISABLED"] = True # app.config["LOGIN_DISABLED"] = True
app.config["SESSION_PERMANENT"] = False app.config["SESSION_PERMANENT"] = False
@ -124,6 +127,10 @@ def _load_user():
user = UserService.query( user = UserService.query(
access_token=access_token, status=StatusEnum.VALID.value access_token=access_token, status=StatusEnum.VALID.value
) )
if not user and len(authorization.split()) == 2:
objs = APIToken.query(token=authorization.split()[1])
if objs:
user = UserService.query(id=objs[0].tenant_id, status=StatusEnum.VALID.value)
if user: if user:
if not user[0].access_token or not user[0].access_token.strip(): if not user[0].access_token or not user[0].access_token.strip():
logging.warning(f"User {user[0].email} has empty access_token in database") logging.warning(f"User {user[0].email} has empty access_token in database")

View File

@ -18,8 +18,7 @@ from quart import request
from api.db.db_models import APIToken from api.db.db_models import APIToken
from api.db.services.api_service import APITokenService, API4ConversationService from api.db.services.api_service import APITokenService, API4ConversationService
from api.db.services.user_service import UserTenantService from api.db.services.user_service import UserTenantService
from api.utils.api_utils import server_error_response, get_data_error_result, get_json_result, validate_request, \ from api.utils.api_utils import generate_confirmation_token, get_data_error_result, get_json_result, get_request_json, server_error_response, validate_request
generate_confirmation_token
from common.time_utils import current_timestamp, datetime_format from common.time_utils import current_timestamp, datetime_format
from api.apps import login_required, current_user from api.apps import login_required, current_user
@ -27,7 +26,7 @@ from api.apps import login_required, current_user
@manager.route('/new_token', methods=['POST']) # noqa: F821 @manager.route('/new_token', methods=['POST']) # noqa: F821
@login_required @login_required
async def new_token(): async def new_token():
req = await request.json req = await get_request_json()
try: try:
tenants = UserTenantService.query(user_id=current_user.id) tenants = UserTenantService.query(user_id=current_user.id)
if not tenants: if not tenants:
@ -73,7 +72,7 @@ def token_list():
@validate_request("tokens", "tenant_id") @validate_request("tokens", "tenant_id")
@login_required @login_required
async def rm(): async def rm():
req = await request.json req = await get_request_json()
try: try:
for token in req["tokens"]: for token in req["tokens"]:
APITokenService.filter_delete( APITokenService.filter_delete(
@ -116,4 +115,3 @@ def stats():
return get_json_result(data=res) return get_json_result(data=res)
except Exception as e: except Exception as e:
return server_error_response(e) return server_error_response(e)

View File

@ -14,7 +14,7 @@
# limitations under the License. # limitations under the License.
# #
import requests from common.http_client import async_request, sync_request
from .oauth import OAuthClient, UserInfo from .oauth import OAuthClient, UserInfo
@ -34,24 +34,49 @@ class GithubOAuthClient(OAuthClient):
def fetch_user_info(self, access_token, **kwargs): def fetch_user_info(self, access_token, **kwargs):
""" """
Fetch GitHub user info. Fetch GitHub user info (synchronous).
""" """
user_info = {} user_info = {}
try: try:
headers = {"Authorization": f"Bearer {access_token}"} headers = {"Authorization": f"Bearer {access_token}"}
# user info response = sync_request("GET", self.userinfo_url, headers=headers, timeout=self.http_request_timeout)
response = requests.get(self.userinfo_url, headers=headers, timeout=self.http_request_timeout)
response.raise_for_status() response.raise_for_status()
user_info.update(response.json()) user_info.update(response.json())
# email info email_response = sync_request(
response = requests.get(self.userinfo_url+"/emails", headers=headers, timeout=self.http_request_timeout) "GET", self.userinfo_url + "/emails", headers=headers, timeout=self.http_request_timeout
response.raise_for_status() )
email_info = response.json() email_response.raise_for_status()
user_info["email"] = next( email_info = email_response.json()
(email for email in email_info if email["primary"]), None user_info["email"] = next((email for email in email_info if email["primary"]), None)["email"]
)["email"]
return self.normalize_user_info(user_info) return self.normalize_user_info(user_info)
except requests.exceptions.RequestException as e: except Exception as e:
raise ValueError(f"Failed to fetch github user info: {e}")
async def async_fetch_user_info(self, access_token, **kwargs):
"""Async variant of fetch_user_info using httpx."""
user_info = {}
headers = {"Authorization": f"Bearer {access_token}"}
try:
response = await async_request(
"GET",
self.userinfo_url,
headers=headers,
timeout=self.http_request_timeout,
)
response.raise_for_status()
user_info.update(response.json())
email_response = await async_request(
"GET",
self.userinfo_url + "/emails",
headers=headers,
timeout=self.http_request_timeout,
)
email_response.raise_for_status()
email_info = email_response.json()
user_info["email"] = next((email for email in email_info if email["primary"]), None)["email"]
return self.normalize_user_info(user_info)
except Exception as e:
raise ValueError(f"Failed to fetch github user info: {e}") raise ValueError(f"Failed to fetch github user info: {e}")

View File

@ -14,8 +14,8 @@
# limitations under the License. # limitations under the License.
# #
import requests
import urllib.parse import urllib.parse
from common.http_client import async_request, sync_request
class UserInfo: class UserInfo:
@ -74,15 +74,40 @@ class OAuthClient:
"redirect_uri": self.redirect_uri, "redirect_uri": self.redirect_uri,
"grant_type": "authorization_code" "grant_type": "authorization_code"
} }
response = requests.post( response = sync_request(
"POST",
self.token_url, self.token_url,
data=payload, data=payload,
headers={"Accept": "application/json"}, headers={"Accept": "application/json"},
timeout=self.http_request_timeout timeout=self.http_request_timeout,
) )
response.raise_for_status() response.raise_for_status()
return response.json() return response.json()
except requests.exceptions.RequestException as e: except Exception as e:
raise ValueError(f"Failed to exchange authorization code for token: {e}")
async def async_exchange_code_for_token(self, code):
"""
Async variant of exchange_code_for_token using httpx.
"""
payload = {
"client_id": self.client_id,
"client_secret": self.client_secret,
"code": code,
"redirect_uri": self.redirect_uri,
"grant_type": "authorization_code",
}
try:
response = await async_request(
"POST",
self.token_url,
data=payload,
headers={"Accept": "application/json"},
timeout=self.http_request_timeout,
)
response.raise_for_status()
return response.json()
except Exception as e:
raise ValueError(f"Failed to exchange authorization code for token: {e}") raise ValueError(f"Failed to exchange authorization code for token: {e}")
@ -92,11 +117,27 @@ class OAuthClient:
""" """
try: try:
headers = {"Authorization": f"Bearer {access_token}"} headers = {"Authorization": f"Bearer {access_token}"}
response = requests.get(self.userinfo_url, headers=headers, timeout=self.http_request_timeout) response = sync_request("GET", self.userinfo_url, headers=headers, timeout=self.http_request_timeout)
response.raise_for_status() response.raise_for_status()
user_info = response.json() user_info = response.json()
return self.normalize_user_info(user_info) return self.normalize_user_info(user_info)
except requests.exceptions.RequestException as e: except Exception as e:
raise ValueError(f"Failed to fetch user info: {e}")
async def async_fetch_user_info(self, access_token, **kwargs):
"""Async variant of fetch_user_info using httpx."""
headers = {"Authorization": f"Bearer {access_token}"}
try:
response = await async_request(
"GET",
self.userinfo_url,
headers=headers,
timeout=self.http_request_timeout,
)
response.raise_for_status()
user_info = response.json()
return self.normalize_user_info(user_info)
except Exception as e:
raise ValueError(f"Failed to fetch user info: {e}") raise ValueError(f"Failed to fetch user info: {e}")

View File

@ -15,7 +15,7 @@
# #
import jwt import jwt
import requests from common.http_client import sync_request
from .oauth import OAuthClient from .oauth import OAuthClient
@ -50,10 +50,10 @@ class OIDCClient(OAuthClient):
""" """
try: try:
metadata_url = f"{issuer}/.well-known/openid-configuration" metadata_url = f"{issuer}/.well-known/openid-configuration"
response = requests.get(metadata_url, timeout=7) response = sync_request("GET", metadata_url, timeout=7)
response.raise_for_status() response.raise_for_status()
return response.json() return response.json()
except requests.exceptions.RequestException as e: except Exception as e:
raise ValueError(f"Failed to fetch OIDC metadata: {e}") raise ValueError(f"Failed to fetch OIDC metadata: {e}")
@ -95,6 +95,13 @@ class OIDCClient(OAuthClient):
user_info.update(super().fetch_user_info(access_token).to_dict()) user_info.update(super().fetch_user_info(access_token).to_dict())
return self.normalize_user_info(user_info) return self.normalize_user_info(user_info)
async def async_fetch_user_info(self, access_token, id_token=None, **kwargs):
user_info = {}
if id_token:
user_info = self.parse_id_token(id_token)
user_info.update((await super().async_fetch_user_info(access_token)).to_dict())
return self.normalize_user_info(user_info)
def normalize_user_info(self, user_info): def normalize_user_info(self, user_info):
return super().normalize_user_info(user_info) return super().normalize_user_info(user_info)

View File

@ -13,15 +13,13 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# #
import asyncio
import json import json
import logging import logging
import re
import sys
from functools import partial from functools import partial
import trio
from quart import request, Response, make_response from quart import request, Response, make_response
from agent.component import LLM from agent.component import LLM
from api.db import CanvasCategory, FileType from api.db import CanvasCategory
from api.db.services.canvas_service import CanvasTemplateService, UserCanvasService, API4ConversationService from api.db.services.canvas_service import CanvasTemplateService, UserCanvasService, API4ConversationService
from api.db.services.document_service import DocumentService from api.db.services.document_service import DocumentService
from api.db.services.file_service import FileService from api.db.services.file_service import FileService
@ -32,13 +30,12 @@ from api.db.services.user_canvas_version import UserCanvasVersionService
from common.constants import RetCode from common.constants import RetCode
from common.misc_utils import get_uuid from common.misc_utils import get_uuid
from api.utils.api_utils import get_json_result, server_error_response, validate_request, get_data_error_result, \ from api.utils.api_utils import get_json_result, server_error_response, validate_request, get_data_error_result, \
request_json get_request_json
from agent.canvas import Canvas from agent.canvas import Canvas
from peewee import MySQLDatabase, PostgresqlDatabase from peewee import MySQLDatabase, PostgresqlDatabase
from api.db.db_models import APIToken, Task from api.db.db_models import APIToken, Task
import time import time
from api.utils.file_utils import filename_type, read_potential_broken_pdf
from rag.flow.pipeline import Pipeline from rag.flow.pipeline import Pipeline
from rag.nlp import search from rag.nlp import search
from rag.utils.redis_conn import REDIS_CONN from rag.utils.redis_conn import REDIS_CONN
@ -56,7 +53,7 @@ def templates():
@validate_request("canvas_ids") @validate_request("canvas_ids")
@login_required @login_required
async def rm(): async def rm():
req = await request_json() req = await get_request_json()
for i in req["canvas_ids"]: for i in req["canvas_ids"]:
if not UserCanvasService.accessible(i, current_user.id): if not UserCanvasService.accessible(i, current_user.id):
return get_json_result( return get_json_result(
@ -70,7 +67,7 @@ async def rm():
@validate_request("dsl", "title") @validate_request("dsl", "title")
@login_required @login_required
async def save(): async def save():
req = await request_json() req = await get_request_json()
if not isinstance(req["dsl"], str): if not isinstance(req["dsl"], str):
req["dsl"] = json.dumps(req["dsl"], ensure_ascii=False) req["dsl"] = json.dumps(req["dsl"], ensure_ascii=False)
req["dsl"] = json.loads(req["dsl"]) req["dsl"] = json.loads(req["dsl"])
@ -129,17 +126,17 @@ def getsse(canvas_id):
@validate_request("id") @validate_request("id")
@login_required @login_required
async def run(): async def run():
req = await request_json() req = await get_request_json()
query = req.get("query", "") query = req.get("query", "")
files = req.get("files", []) files = req.get("files", [])
inputs = req.get("inputs", {}) inputs = req.get("inputs", {})
user_id = req.get("user_id", current_user.id) user_id = req.get("user_id", current_user.id)
if not UserCanvasService.accessible(req["id"], current_user.id): if not await asyncio.to_thread(UserCanvasService.accessible, req["id"], current_user.id):
return get_json_result( return get_json_result(
data=False, message='Only owner of canvas authorized for this operation.', data=False, message='Only owner of canvas authorized for this operation.',
code=RetCode.OPERATING_ERROR) code=RetCode.OPERATING_ERROR)
e, cvs = UserCanvasService.get_by_id(req["id"]) e, cvs = await asyncio.to_thread(UserCanvasService.get_by_id, req["id"])
if not e: if not e:
return get_data_error_result(message="canvas not found.") return get_data_error_result(message="canvas not found.")
@ -149,7 +146,7 @@ async def run():
if cvs.canvas_category == CanvasCategory.DataFlow: if cvs.canvas_category == CanvasCategory.DataFlow:
task_id = get_uuid() task_id = get_uuid()
Pipeline(cvs.dsl, tenant_id=current_user.id, doc_id=CANVAS_DEBUG_DOC_ID, task_id=task_id, flow_id=req["id"]) Pipeline(cvs.dsl, tenant_id=current_user.id, doc_id=CANVAS_DEBUG_DOC_ID, task_id=task_id, flow_id=req["id"])
ok, error_message = queue_dataflow(tenant_id=user_id, flow_id=req["id"], task_id=task_id, file=files[0], priority=0) ok, error_message = await asyncio.to_thread(queue_dataflow, user_id, req["id"], task_id, files[0], 0)
if not ok: if not ok:
return get_data_error_result(message=error_message) return get_data_error_result(message=error_message)
return get_json_result(data={"message_id": task_id}) return get_json_result(data={"message_id": task_id})
@ -186,7 +183,7 @@ async def run():
@validate_request("id", "dsl", "component_id") @validate_request("id", "dsl", "component_id")
@login_required @login_required
async def rerun(): async def rerun():
req = await request_json() req = await get_request_json()
doc = PipelineOperationLogService.get_documents_info(req["id"]) doc = PipelineOperationLogService.get_documents_info(req["id"])
if not doc: if not doc:
return get_data_error_result(message="Document not found.") return get_data_error_result(message="Document not found.")
@ -224,7 +221,7 @@ def cancel(task_id):
@validate_request("id") @validate_request("id")
@login_required @login_required
async def reset(): async def reset():
req = await request_json() req = await get_request_json()
if not UserCanvasService.accessible(req["id"], current_user.id): if not UserCanvasService.accessible(req["id"], current_user.id):
return get_json_result( return get_json_result(
data=False, message='Only owner of canvas authorized for this operation.', data=False, message='Only owner of canvas authorized for this operation.',
@ -250,71 +247,10 @@ async def upload(canvas_id):
return get_data_error_result(message="canvas not found.") return get_data_error_result(message="canvas not found.")
user_id = cvs["user_id"] user_id = cvs["user_id"]
def structured(filename, filetype, blob, content_type):
nonlocal user_id
if filetype == FileType.PDF.value:
blob = read_potential_broken_pdf(blob)
location = get_uuid()
FileService.put_blob(user_id, location, blob)
return {
"id": location,
"name": filename,
"size": sys.getsizeof(blob),
"extension": filename.split(".")[-1].lower(),
"mime_type": content_type,
"created_by": user_id,
"created_at": time.time(),
"preview_url": None
}
if request.args.get("url"):
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
DefaultMarkdownGenerator,
PruningContentFilter,
CrawlResult
)
try:
url = request.args.get("url")
filename = re.sub(r"\?.*", "", url.split("/")[-1])
async def adownload():
browser_config = BrowserConfig(
headless=True,
verbose=False,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter()
),
pdf=True,
screenshot=False
)
result: CrawlResult = await crawler.arun(
url=url,
config=crawler_config
)
return result
page = trio.run(adownload())
if page.pdf:
if filename.split(".")[-1].lower() != "pdf":
filename += ".pdf"
return get_json_result(data=structured(filename, "pdf", page.pdf, page.response_headers["content-type"]))
return get_json_result(data=structured(filename, "html", str(page.markdown).encode("utf-8"), page.response_headers["content-type"], user_id))
except Exception as e:
return server_error_response(e)
files = await request.files files = await request.files
file = files['file'] file = files['file'] if files and files.get("file") else None
try: try:
DocumentService.check_doc_health(user_id, file.filename) return get_json_result(data=FileService.upload_info(user_id, file, request.args.get("url")))
return get_json_result(data=structured(file.filename, filename_type(file.filename), file.read(), file.content_type))
except Exception as e: except Exception as e:
return server_error_response(e) return server_error_response(e)
@ -343,7 +279,7 @@ def input_form():
@validate_request("id", "component_id", "params") @validate_request("id", "component_id", "params")
@login_required @login_required
async def debug(): async def debug():
req = await request_json() req = await get_request_json()
if not UserCanvasService.accessible(req["id"], current_user.id): if not UserCanvasService.accessible(req["id"], current_user.id):
return get_json_result( return get_json_result(
data=False, message='Only owner of canvas authorized for this operation.', data=False, message='Only owner of canvas authorized for this operation.',
@ -375,7 +311,7 @@ async def debug():
@validate_request("db_type", "database", "username", "host", "port", "password") @validate_request("db_type", "database", "username", "host", "port", "password")
@login_required @login_required
async def test_db_connect(): async def test_db_connect():
req = await request_json() req = await get_request_json()
try: try:
if req["db_type"] in ["mysql", "mariadb"]: if req["db_type"] in ["mysql", "mariadb"]:
db = MySQLDatabase(req["database"], user=req["username"], host=req["host"], port=req["port"], db = MySQLDatabase(req["database"], user=req["username"], host=req["host"], port=req["port"],
@ -520,7 +456,7 @@ def list_canvas():
@validate_request("id", "title", "permission") @validate_request("id", "title", "permission")
@login_required @login_required
async def setting(): async def setting():
req = await request_json() req = await get_request_json()
req["user_id"] = current_user.id req["user_id"] = current_user.id
if not UserCanvasService.accessible(req["id"], current_user.id): if not UserCanvasService.accessible(req["id"], current_user.id):

View File

@ -27,7 +27,7 @@ from api.db.services.llm_service import LLMBundle
from api.db.services.search_service import SearchService from api.db.services.search_service import SearchService
from api.db.services.user_service import UserTenantService from api.db.services.user_service import UserTenantService
from api.utils.api_utils import get_data_error_result, get_json_result, server_error_response, validate_request, \ from api.utils.api_utils import get_data_error_result, get_json_result, server_error_response, validate_request, \
request_json get_request_json
from rag.app.qa import beAdoc, rmPrefix from rag.app.qa import beAdoc, rmPrefix
from rag.app.tag import label_question from rag.app.tag import label_question
from rag.nlp import rag_tokenizer, search from rag.nlp import rag_tokenizer, search
@ -42,7 +42,7 @@ from api.apps import login_required, current_user
@login_required @login_required
@validate_request("doc_id") @validate_request("doc_id")
async def list_chunk(): async def list_chunk():
req = await request_json() req = await get_request_json()
doc_id = req["doc_id"] doc_id = req["doc_id"]
page = int(req.get("page", 1)) page = int(req.get("page", 1))
size = int(req.get("size", 30)) size = int(req.get("size", 30))
@ -123,7 +123,7 @@ def get():
@login_required @login_required
@validate_request("doc_id", "chunk_id", "content_with_weight") @validate_request("doc_id", "chunk_id", "content_with_weight")
async def set(): async def set():
req = await request_json() req = await get_request_json()
d = { d = {
"id": req["chunk_id"], "id": req["chunk_id"],
"content_with_weight": req["content_with_weight"]} "content_with_weight": req["content_with_weight"]}
@ -180,7 +180,7 @@ async def set():
@login_required @login_required
@validate_request("chunk_ids", "available_int", "doc_id") @validate_request("chunk_ids", "available_int", "doc_id")
async def switch(): async def switch():
req = await request_json() req = await get_request_json()
try: try:
e, doc = DocumentService.get_by_id(req["doc_id"]) e, doc = DocumentService.get_by_id(req["doc_id"])
if not e: if not e:
@ -200,7 +200,7 @@ async def switch():
@login_required @login_required
@validate_request("chunk_ids", "doc_id") @validate_request("chunk_ids", "doc_id")
async def rm(): async def rm():
req = await request_json() req = await get_request_json()
try: try:
e, doc = DocumentService.get_by_id(req["doc_id"]) e, doc = DocumentService.get_by_id(req["doc_id"])
if not e: if not e:
@ -224,7 +224,7 @@ async def rm():
@login_required @login_required
@validate_request("doc_id", "content_with_weight") @validate_request("doc_id", "content_with_weight")
async def create(): async def create():
req = await request_json() req = await get_request_json()
chunck_id = xxhash.xxh64((req["content_with_weight"] + req["doc_id"]).encode("utf-8")).hexdigest() chunck_id = xxhash.xxh64((req["content_with_weight"] + req["doc_id"]).encode("utf-8")).hexdigest()
d = {"id": chunck_id, "content_ltks": rag_tokenizer.tokenize(req["content_with_weight"]), d = {"id": chunck_id, "content_ltks": rag_tokenizer.tokenize(req["content_with_weight"]),
"content_with_weight": req["content_with_weight"]} "content_with_weight": req["content_with_weight"]}
@ -282,7 +282,7 @@ async def create():
@login_required @login_required
@validate_request("kb_id", "question") @validate_request("kb_id", "question")
async def retrieval_test(): async def retrieval_test():
req = await request_json() req = await get_request_json()
page = int(req.get("page", 1)) page = int(req.get("page", 1))
size = int(req.get("size", 30)) size = int(req.get("size", 30))
question = req["question"] question = req["question"]
@ -305,14 +305,14 @@ async def retrieval_test():
        metas = DocumentService.get_meta_by_kbs(kb_ids)
        if meta_data_filter.get("method") == "auto":
            chat_mdl = LLMBundle(current_user.id, LLMType.CHAT, llm_name=search_config.get("chat_id", ""))
-            filters = gen_meta_filter(chat_mdl, metas, question)
-            doc_ids.extend(meta_filter(metas, filters))
+            filters: dict = gen_meta_filter(chat_mdl, metas, question)
+            doc_ids.extend(meta_filter(metas, filters["conditions"], filters.get("logic", "and")))
            if not doc_ids:
                doc_ids = None
        elif meta_data_filter.get("method") == "manual":
-            doc_ids.extend(meta_filter(metas, meta_data_filter["manual"]))
-            if not doc_ids:
-                doc_ids = None
+            doc_ids.extend(meta_filter(metas, meta_data_filter["manual"], meta_data_filter.get("logic", "and")))
+            if meta_data_filter["manual"] and not doc_ids:
+                doc_ids = ["-999"]
    try:
        tenants = UserTenantService.query(user_id=current_user.id)
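The retrieval test now feeds `meta_filter` the generated `conditions` plus a `logic` flag ("and"/"or"), and uses the sentinel doc id list `["-999"]` when a manual filter matches nothing, so the search returns no chunks instead of silently ignoring the filter. The actual `meta_filter` signature and condition schema live elsewhere in the codebase; a rough sketch of a filter that would satisfy this call shape, assuming `metas` maps doc ids to metadata dicts and each condition carries a metadata `key` plus a list of accepted `value`s:

```python
# Hypothetical sketch matching the meta_filter(metas, conditions, logic) call above;
# the metas layout and the condition schema ({"key": ..., "value": [...]}) are assumptions.
def meta_filter(metas: dict, conditions: list[dict], logic: str = "and") -> list[str]:
    matched = []
    for doc_id, meta in metas.items():
        hits = []
        for cond in conditions:
            accepted = {str(v) for v in cond.get("value", [])}
            hits.append(str(meta.get(cond.get("key"))) in accepted)
        keep = all(hits) if logic == "and" else any(hits)
        if hits and keep:
            matched.append(doc_id)
    return matched
```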

View File

@ -26,10 +26,10 @@ from google_auth_oauthlib.flow import Flow
from api.db import InputType
from api.db.services.connector_service import ConnectorService, SyncLogsService
-from api.utils.api_utils import get_data_error_result, get_json_result, validate_request
+from api.utils.api_utils import get_data_error_result, get_json_result, get_request_json, validate_request
from common.constants import RetCode, TaskStatus
-from common.data_source.config import GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI, DocumentSource
+from common.data_source.config import GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI, GMAIL_WEB_OAUTH_REDIRECT_URI, DocumentSource
-from common.data_source.google_util.constant import GOOGLE_DRIVE_WEB_OAUTH_POPUP_TEMPLATE, GOOGLE_SCOPES
+from common.data_source.google_util.constant import GOOGLE_WEB_OAUTH_POPUP_TEMPLATE, GOOGLE_SCOPES
from common.misc_utils import get_uuid
from rag.utils.redis_conn import REDIS_CONN
from api.apps import login_required, current_user
@ -38,7 +38,7 @@ from api.apps import login_required, current_user
@manager.route("/set", methods=["POST"]) # noqa: F821 @manager.route("/set", methods=["POST"]) # noqa: F821
@login_required @login_required
async def set_connector(): async def set_connector():
req = await request.json req = await get_request_json()
if req.get("id"): if req.get("id"):
conn = {fld: req[fld] for fld in ["prune_freq", "refresh_freq", "config", "timeout_secs"] if fld in req} conn = {fld: req[fld] for fld in ["prune_freq", "refresh_freq", "config", "timeout_secs"] if fld in req}
ConnectorService.update_by_id(req["id"], conn) ConnectorService.update_by_id(req["id"], conn)
@ -90,7 +90,7 @@ def list_logs(connector_id):
@manager.route("/<connector_id>/resume", methods=["PUT"]) # noqa: F821 @manager.route("/<connector_id>/resume", methods=["PUT"]) # noqa: F821
@login_required @login_required
async def resume(connector_id): async def resume(connector_id):
req = await request.json req = await get_request_json()
if req.get("resume"): if req.get("resume"):
ConnectorService.resume(connector_id, TaskStatus.SCHEDULE) ConnectorService.resume(connector_id, TaskStatus.SCHEDULE)
else: else:
@ -102,7 +102,7 @@ async def resume(connector_id):
@login_required @login_required
@validate_request("kb_id") @validate_request("kb_id")
async def rebuild(connector_id): async def rebuild(connector_id):
req = await request.json req = await get_request_json()
err = ConnectorService.rebuild(req["kb_id"], connector_id, current_user.id) err = ConnectorService.rebuild(req["kb_id"], connector_id, current_user.id)
if err: if err:
return get_json_result(data=False, message=err, code=RetCode.SERVER_ERROR) return get_json_result(data=False, message=err, code=RetCode.SERVER_ERROR)
@ -122,12 +122,30 @@ GOOGLE_WEB_FLOW_RESULT_PREFIX = "google_drive_web_flow_result"
WEB_FLOW_TTL_SECS = 15 * 60


-def _web_state_cache_key(flow_id: str) -> str:
-    return f"{GOOGLE_WEB_FLOW_STATE_PREFIX}:{flow_id}"
+def _web_state_cache_key(flow_id: str, source_type: str | None = None) -> str:
+    """Return Redis key for web OAuth state.
+
+    The default prefix keeps backward compatibility for Google Drive.
+    When source_type == "gmail", a different prefix is used so that
+    Drive/Gmail flows don't clash in Redis.
+    """
+    if source_type == "gmail":
+        prefix = "gmail_web_flow_state"
+    else:
+        prefix = GOOGLE_WEB_FLOW_STATE_PREFIX
+    return f"{prefix}:{flow_id}"


-def _web_result_cache_key(flow_id: str) -> str:
-    return f"{GOOGLE_WEB_FLOW_RESULT_PREFIX}:{flow_id}"
+def _web_result_cache_key(flow_id: str, source_type: str | None = None) -> str:
+    """Return Redis key for web OAuth result.
+
+    Mirrors _web_state_cache_key logic for result storage.
+    """
+    if source_type == "gmail":
+        prefix = "gmail_web_flow_result"
+    else:
+        prefix = GOOGLE_WEB_FLOW_RESULT_PREFIX
+    return f"{prefix}:{flow_id}"
def _load_credentials(payload: str | dict[str, Any]) -> dict[str, Any]: def _load_credentials(payload: str | dict[str, Any]) -> dict[str, Any]:
@ -146,19 +164,22 @@ def _get_web_client_config(credentials: dict[str, Any]) -> dict[str, Any]:
    return {"web": web_section}


-async def _render_web_oauth_popup(flow_id: str, success: bool, message: str):
+async def _render_web_oauth_popup(flow_id: str, success: bool, message: str, source="drive"):
    status = "success" if success else "error"
    auto_close = "window.close();" if success else ""
    escaped_message = escape(message)
    payload_json = json.dumps(
        {
-            "type": "ragflow-google-drive-oauth",
+            # TODO(google-oauth): include connector type (drive/gmail) in payload type if needed
+            "type": f"ragflow-google-{source}-oauth",
            "status": status,
            "flowId": flow_id or "",
            "message": message,
        }
    )
-    html = GOOGLE_DRIVE_WEB_OAUTH_POPUP_TEMPLATE.format(
+    # TODO(google-oauth): title/heading/message may need to reflect drive/gmail based on cached type
+    html = GOOGLE_WEB_OAUTH_POPUP_TEMPLATE.format(
+        title=f"Google {source.capitalize()} Authorization",
        heading="Authorization complete" if success else "Authorization failed",
        message=escaped_message,
        payload_json=payload_json,
@ -169,20 +190,33 @@ async def _render_web_oauth_popup(flow_id: str, success: bool, message: str):
    return response
@manager.route("/google-drive/oauth/web/start", methods=["POST"]) # noqa: F821 @manager.route("/google/oauth/web/start", methods=["POST"]) # noqa: F821
@login_required @login_required
@validate_request("credentials") @validate_request("credentials")
async def start_google_drive_web_oauth(): async def start_google_web_oauth():
if not GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI: source = request.args.get("type", "google-drive")
if source not in ("google-drive", "gmail"):
return get_json_result(code=RetCode.ARGUMENT_ERROR, message="Invalid Google OAuth type.")
if source == "gmail":
redirect_uri = GMAIL_WEB_OAUTH_REDIRECT_URI
scopes = GOOGLE_SCOPES[DocumentSource.GMAIL]
else:
redirect_uri = GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI if source == "google-drive" else GMAIL_WEB_OAUTH_REDIRECT_URI
scopes = GOOGLE_SCOPES[DocumentSource.GOOGLE_DRIVE if source == "google-drive" else DocumentSource.GMAIL]
if not redirect_uri:
return get_json_result( return get_json_result(
code=RetCode.SERVER_ERROR, code=RetCode.SERVER_ERROR,
message="Google Drive OAuth redirect URI is not configured on the server.", message="Google OAuth redirect URI is not configured on the server.",
) )
req = await request.json or {} req = await get_request_json()
raw_credentials = req.get("credentials", "") raw_credentials = req.get("credentials", "")
try: try:
credentials = _load_credentials(raw_credentials) credentials = _load_credentials(raw_credentials)
print(credentials)
except ValueError as exc: except ValueError as exc:
return get_json_result(code=RetCode.ARGUMENT_ERROR, message=str(exc)) return get_json_result(code=RetCode.ARGUMENT_ERROR, message=str(exc))
@ -199,8 +233,8 @@ async def start_google_drive_web_oauth():
flow_id = str(uuid.uuid4()) flow_id = str(uuid.uuid4())
try: try:
flow = Flow.from_client_config(client_config, scopes=GOOGLE_SCOPES[DocumentSource.GOOGLE_DRIVE]) flow = Flow.from_client_config(client_config, scopes=scopes)
flow.redirect_uri = GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI flow.redirect_uri = redirect_uri
authorization_url, _ = flow.authorization_url( authorization_url, _ = flow.authorization_url(
access_type="offline", access_type="offline",
include_granted_scopes="true", include_granted_scopes="true",
@ -219,7 +253,7 @@ async def start_google_drive_web_oauth():
"client_config": client_config, "client_config": client_config,
"created_at": int(time.time()), "created_at": int(time.time()),
} }
REDIS_CONN.set_obj(_web_state_cache_key(flow_id), cache_payload, WEB_FLOW_TTL_SECS) REDIS_CONN.set_obj(_web_state_cache_key(flow_id, source), cache_payload, WEB_FLOW_TTL_SECS)
return get_json_result( return get_json_result(
data={ data={
@ -230,60 +264,122 @@ async def start_google_drive_web_oauth():
) )
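The start endpoint is now shared by Drive and Gmail and picks redirect URI and scopes from a `type` query parameter. A hedged sketch of how a client might start the Gmail flow against it; the host, the `/v1/connector` path prefix, the auth header, and the exact shape of the returned data are assumptions (the response dict itself is elided by the hunk above):

```python
# Hypothetical client call; host, path prefix and auth header are assumptions.
import httpx

resp = httpx.post(
    "http://localhost:9380/v1/connector/google/oauth/web/start",
    params={"type": "gmail"},  # "google-drive" keeps the original Drive behaviour
    json={"credentials": "<Google Cloud OAuth web client JSON, as pasted by the user>"},
    headers={"Authorization": "Bearer <token>"},
)
data = resp.json().get("data") or {}
# Presumably contains an authorization URL to open in a popup plus the flow_id to poll with.
print(data)
```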
@manager.route("/google-drive/oauth/web/callback", methods=["GET"]) # noqa: F821 @manager.route("/gmail/oauth/web/callback", methods=["GET"]) # noqa: F821
async def google_drive_web_oauth_callback(): async def google_gmail_web_oauth_callback():
state_id = request.args.get("state") state_id = request.args.get("state")
error = request.args.get("error") error = request.args.get("error")
source = "gmail"
if source != 'gmail':
return await _render_web_oauth_popup("", False, "Invalid Google OAuth type.", source)
error_description = request.args.get("error_description") or error error_description = request.args.get("error_description") or error
if not state_id: if not state_id:
return await _render_web_oauth_popup("", False, "Missing OAuth state parameter.") return await _render_web_oauth_popup("", False, "Missing OAuth state parameter.", source)
state_cache = REDIS_CONN.get(_web_state_cache_key(state_id)) state_cache = REDIS_CONN.get(_web_state_cache_key(state_id, source))
if not state_cache: if not state_cache:
return await _render_web_oauth_popup(state_id, False, "Authorization session expired. Please restart from the main window.") return await _render_web_oauth_popup(state_id, False, "Authorization session expired. Please restart from the main window.", source)
state_obj = json.loads(state_cache) state_obj = json.loads(state_cache)
client_config = state_obj.get("client_config") client_config = state_obj.get("client_config")
if not client_config: if not client_config:
REDIS_CONN.delete(_web_state_cache_key(state_id)) REDIS_CONN.delete(_web_state_cache_key(state_id, source))
return await _render_web_oauth_popup(state_id, False, "Authorization session was invalid. Please retry.") return await _render_web_oauth_popup(state_id, False, "Authorization session was invalid. Please retry.", source)
if error: if error:
REDIS_CONN.delete(_web_state_cache_key(state_id)) REDIS_CONN.delete(_web_state_cache_key(state_id, source))
return await _render_web_oauth_popup(state_id, False, error_description or "Authorization was cancelled.") return await _render_web_oauth_popup(state_id, False, error_description or "Authorization was cancelled.", source)
code = request.args.get("code") code = request.args.get("code")
if not code: if not code:
return await _render_web_oauth_popup(state_id, False, "Missing authorization code from Google.") return await _render_web_oauth_popup(state_id, False, "Missing authorization code from Google.", source)
try: try:
flow = Flow.from_client_config(client_config, scopes=GOOGLE_SCOPES[DocumentSource.GOOGLE_DRIVE]) # TODO(google-oauth): branch scopes/redirect_uri based on source_type (drive vs gmail)
flow.redirect_uri = GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI flow = Flow.from_client_config(client_config, scopes=GOOGLE_SCOPES[DocumentSource.GMAIL])
flow.redirect_uri = GMAIL_WEB_OAUTH_REDIRECT_URI
flow.fetch_token(code=code) flow.fetch_token(code=code)
except Exception as exc: # pragma: no cover - defensive except Exception as exc: # pragma: no cover - defensive
logging.exception("Failed to exchange Google OAuth code: %s", exc) logging.exception("Failed to exchange Google OAuth code: %s", exc)
REDIS_CONN.delete(_web_state_cache_key(state_id)) REDIS_CONN.delete(_web_state_cache_key(state_id, source))
return await _render_web_oauth_popup(state_id, False, "Failed to exchange tokens with Google. Please retry.") return await _render_web_oauth_popup(state_id, False, "Failed to exchange tokens with Google. Please retry.", source)
creds_json = flow.credentials.to_json() creds_json = flow.credentials.to_json()
result_payload = { result_payload = {
"user_id": state_obj.get("user_id"), "user_id": state_obj.get("user_id"),
"credentials": creds_json, "credentials": creds_json,
} }
REDIS_CONN.set_obj(_web_result_cache_key(state_id), result_payload, WEB_FLOW_TTL_SECS) REDIS_CONN.set_obj(_web_result_cache_key(state_id, source), result_payload, WEB_FLOW_TTL_SECS)
REDIS_CONN.delete(_web_state_cache_key(state_id))
return await _render_web_oauth_popup(state_id, True, "Authorization completed successfully.") print("\n\n", _web_result_cache_key(state_id, source), "\n\n")
REDIS_CONN.delete(_web_state_cache_key(state_id, source))
return await _render_web_oauth_popup(state_id, True, "Authorization completed successfully.", source)
@manager.route("/google-drive/oauth/web/result", methods=["POST"]) # noqa: F821 @manager.route("/google-drive/oauth/web/callback", methods=["GET"]) # noqa: F821
async def google_drive_web_oauth_callback():
state_id = request.args.get("state")
error = request.args.get("error")
source = "google-drive"
if source not in ("google-drive", "gmail"):
return await _render_web_oauth_popup("", False, "Invalid Google OAuth type.", source)
error_description = request.args.get("error_description") or error
if not state_id:
return await _render_web_oauth_popup("", False, "Missing OAuth state parameter.", source)
state_cache = REDIS_CONN.get(_web_state_cache_key(state_id, source))
if not state_cache:
return await _render_web_oauth_popup(state_id, False, "Authorization session expired. Please restart from the main window.", source)
state_obj = json.loads(state_cache)
client_config = state_obj.get("client_config")
if not client_config:
REDIS_CONN.delete(_web_state_cache_key(state_id, source))
return await _render_web_oauth_popup(state_id, False, "Authorization session was invalid. Please retry.", source)
if error:
REDIS_CONN.delete(_web_state_cache_key(state_id, source))
return await _render_web_oauth_popup(state_id, False, error_description or "Authorization was cancelled.", source)
code = request.args.get("code")
if not code:
return await _render_web_oauth_popup(state_id, False, "Missing authorization code from Google.", source)
try:
# TODO(google-oauth): branch scopes/redirect_uri based on source_type (drive vs gmail)
flow = Flow.from_client_config(client_config, scopes=GOOGLE_SCOPES[DocumentSource.GOOGLE_DRIVE])
flow.redirect_uri = GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI
flow.fetch_token(code=code)
except Exception as exc: # pragma: no cover - defensive
logging.exception("Failed to exchange Google OAuth code: %s", exc)
REDIS_CONN.delete(_web_state_cache_key(state_id, source))
return await _render_web_oauth_popup(state_id, False, "Failed to exchange tokens with Google. Please retry.", source)
creds_json = flow.credentials.to_json()
result_payload = {
"user_id": state_obj.get("user_id"),
"credentials": creds_json,
}
REDIS_CONN.set_obj(_web_result_cache_key(state_id, source), result_payload, WEB_FLOW_TTL_SECS)
REDIS_CONN.delete(_web_state_cache_key(state_id, source))
return await _render_web_oauth_popup(state_id, True, "Authorization completed successfully.", source)
@manager.route("/google/oauth/web/result", methods=["POST"]) # noqa: F821
@login_required @login_required
@validate_request("flow_id") @validate_request("flow_id")
-async def poll_google_drive_web_result():
+async def poll_google_web_result():
    req = await request.json or {}
+    source = request.args.get("type")
+    if source not in ("google-drive", "gmail"):
+        return get_json_result(code=RetCode.ARGUMENT_ERROR, message="Invalid Google OAuth type.")
    flow_id = req.get("flow_id")
-    cache_raw = REDIS_CONN.get(_web_result_cache_key(flow_id))
+    cache_raw = REDIS_CONN.get(_web_result_cache_key(flow_id, source))
    if not cache_raw:
        return get_json_result(code=RetCode.RUNNING, message="Authorization is still pending.")
@ -291,5 +387,5 @@ async def poll_google_drive_web_result():
if result.get("user_id") != current_user.id: if result.get("user_id") != current_user.id:
return get_json_result(code=RetCode.PERMISSION_ERROR, message="You are not allowed to access this authorization result.") return get_json_result(code=RetCode.PERMISSION_ERROR, message="You are not allowed to access this authorization result.")
REDIS_CONN.delete(_web_result_cache_key(flow_id)) REDIS_CONN.delete(_web_result_cache_key(flow_id, source))
return get_json_result(data={"credentials": result.get("credentials")}) return get_json_result(data={"credentials": result.get("credentials")})
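Once the popup finishes, the opener polls the result endpoint with the `flow_id` and the same `type` until the handler stops answering with `RetCode.RUNNING` and hands back the serialized Google credentials. A hedged polling sketch under the same host/prefix/auth assumptions as above:

```python
# Hypothetical polling loop for /google/oauth/web/result; host, prefix and auth are assumptions.
import time
import httpx


def poll_google_credentials(flow_id: str, source: str = "gmail", attempts: int = 30):
    for _ in range(attempts):
        resp = httpx.post(
            "http://localhost:9380/v1/connector/google/oauth/web/result",
            params={"type": source},
            json={"flow_id": flow_id},
            headers={"Authorization": "Bearer <token>"},
        )
        data = resp.json().get("data") or {}
        if isinstance(data, dict) and data.get("credentials"):
            return data["credentials"]  # serialized Google credentials JSON
        time.sleep(2)  # still pending: the handler answered with RetCode.RUNNING
    return None
```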

View File

@ -14,9 +14,11 @@
# limitations under the License. # limitations under the License.
# #
import json import json
import os
import re import re
import logging import logging
from copy import deepcopy from copy import deepcopy
import tempfile
from quart import Response, request from quart import Response, request
from api.apps import current_user, login_required from api.apps import current_user, login_required
from api.db.db_models import APIToken from api.db.db_models import APIToken
@ -26,7 +28,7 @@ from api.db.services.llm_service import LLMBundle
from api.db.services.search_service import SearchService from api.db.services.search_service import SearchService
from api.db.services.tenant_llm_service import TenantLLMService from api.db.services.tenant_llm_service import TenantLLMService
from api.db.services.user_service import TenantService, UserTenantService from api.db.services.user_service import TenantService, UserTenantService
from api.utils.api_utils import get_data_error_result, get_json_result, server_error_response, validate_request from api.utils.api_utils import get_data_error_result, get_json_result, get_request_json, server_error_response, validate_request
from rag.prompts.template import load_prompt from rag.prompts.template import load_prompt
from rag.prompts.generator import chunks_format from rag.prompts.generator import chunks_format
from common.constants import RetCode, LLMType from common.constants import RetCode, LLMType
@ -35,7 +37,7 @@ from common.constants import RetCode, LLMType
@manager.route("/set", methods=["POST"]) # noqa: F821 @manager.route("/set", methods=["POST"]) # noqa: F821
@login_required @login_required
async def set_conversation(): async def set_conversation():
req = await request.json req = await get_request_json()
conv_id = req.get("conversation_id") conv_id = req.get("conversation_id")
is_new = req.get("is_new") is_new = req.get("is_new")
name = req.get("name", "New conversation") name = req.get("name", "New conversation")
@ -78,7 +80,7 @@ async def set_conversation():
@manager.route("/get", methods=["GET"]) # noqa: F821 @manager.route("/get", methods=["GET"]) # noqa: F821
@login_required @login_required
def get(): async def get():
conv_id = request.args["conversation_id"] conv_id = request.args["conversation_id"]
try: try:
e, conv = ConversationService.get_by_id(conv_id) e, conv = ConversationService.get_by_id(conv_id)
@ -129,7 +131,7 @@ def getsse(dialog_id):
@manager.route("/rm", methods=["POST"]) # noqa: F821 @manager.route("/rm", methods=["POST"]) # noqa: F821
@login_required @login_required
async def rm(): async def rm():
req = await request.json req = await get_request_json()
conv_ids = req["conversation_ids"] conv_ids = req["conversation_ids"]
try: try:
for cid in conv_ids: for cid in conv_ids:
@ -150,7 +152,7 @@ async def rm():
@manager.route("/list", methods=["GET"]) # noqa: F821 @manager.route("/list", methods=["GET"]) # noqa: F821
@login_required @login_required
def list_conversation(): async def list_conversation():
dialog_id = request.args["dialog_id"] dialog_id = request.args["dialog_id"]
try: try:
if not DialogService.query(tenant_id=current_user.id, id=dialog_id): if not DialogService.query(tenant_id=current_user.id, id=dialog_id):
@ -167,7 +169,7 @@ def list_conversation():
@login_required @login_required
@validate_request("conversation_id", "messages") @validate_request("conversation_id", "messages")
async def completion(): async def completion():
req = await request.json req = await get_request_json()
msg = [] msg = []
for m in req["messages"]: for m in req["messages"]:
if m["role"] == "system": if m["role"] == "system":
@ -248,11 +250,69 @@ async def completion():
except Exception as e: except Exception as e:
return server_error_response(e) return server_error_response(e)
@manager.route("/sequence2txt", methods=["POST"]) # noqa: F821
@login_required
async def sequence2txt():
req = await request.form
stream_mode = req.get("stream", "false").lower() == "true"
files = await request.files
if "file" not in files:
return get_data_error_result(message="Missing 'file' in multipart form-data")
uploaded = files["file"]
ALLOWED_EXTS = {
".wav", ".mp3", ".m4a", ".aac",
".flac", ".ogg", ".webm",
".opus", ".wma"
}
filename = uploaded.filename or ""
suffix = os.path.splitext(filename)[-1].lower()
if suffix not in ALLOWED_EXTS:
return get_data_error_result(message=
f"Unsupported audio format: {suffix}. "
f"Allowed: {', '.join(sorted(ALLOWED_EXTS))}"
)
fd, temp_audio_path = tempfile.mkstemp(suffix=suffix)
os.close(fd)
await uploaded.save(temp_audio_path)
tenants = TenantService.get_info_by(current_user.id)
if not tenants:
return get_data_error_result(message="Tenant not found!")
asr_id = tenants[0]["asr_id"]
if not asr_id:
return get_data_error_result(message="No default ASR model is set")
asr_mdl=LLMBundle(tenants[0]["tenant_id"], LLMType.SPEECH2TEXT, asr_id)
if not stream_mode:
text = asr_mdl.transcription(temp_audio_path)
try:
os.remove(temp_audio_path)
except Exception as e:
logging.error(f"Failed to remove temp audio file: {str(e)}")
return get_json_result(data={"text": text})
async def event_stream():
try:
for evt in asr_mdl.stream_transcription(temp_audio_path):
yield f"data: {json.dumps(evt, ensure_ascii=False)}\n\n"
except Exception as e:
err = {"event": "error", "text": str(e)}
yield f"data: {json.dumps(err, ensure_ascii=False)}\n\n"
finally:
try:
os.remove(temp_audio_path)
except Exception as e:
logging.error(f"Failed to remove temp audio file: {str(e)}")
return Response(event_stream(), content_type="text/event-stream")
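The new `/sequence2txt` route takes a multipart audio upload and either returns the whole transcript in one JSON payload or, with `stream=true`, relays `stream_transcription` events as server-sent events. A hedged client sketch of both modes; the host, the `/v1/conversation` prefix, and the auth header are assumptions:

```python
# Hypothetical client for the sequence2txt endpoint; host, prefix and auth are assumptions.
import json
import httpx

URL = "http://localhost:9380/v1/conversation/sequence2txt"
HEADERS = {"Authorization": "Bearer <token>"}

# Non-streaming: a single JSON response carrying the full transcript.
with open("meeting.wav", "rb") as f:
    resp = httpx.post(URL, headers=HEADERS,
                      files={"file": ("meeting.wav", f, "audio/wav")},
                      data={"stream": "false"})
    print(resp.json()["data"]["text"])

# Streaming: consume the text/event-stream emitted by stream_transcription.
with open("meeting.wav", "rb") as f:
    with httpx.stream("POST", URL, headers=HEADERS,
                      files={"file": ("meeting.wav", f, "audio/wav")},
                      data={"stream": "true"}) as resp:
        for line in resp.iter_lines():
            if line.startswith("data: "):
                print(json.loads(line[len("data: "):]))
```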
@manager.route("/tts", methods=["POST"]) # noqa: F821 @manager.route("/tts", methods=["POST"]) # noqa: F821
@login_required @login_required
async def tts(): async def tts():
req = await request.json req = await get_request_json()
text = req["text"] text = req["text"]
tenants = TenantService.get_info_by(current_user.id) tenants = TenantService.get_info_by(current_user.id)
@ -285,7 +345,7 @@ async def tts():
@login_required @login_required
@validate_request("conversation_id", "message_id") @validate_request("conversation_id", "message_id")
async def delete_msg(): async def delete_msg():
req = await request.json req = await get_request_json()
e, conv = ConversationService.get_by_id(req["conversation_id"]) e, conv = ConversationService.get_by_id(req["conversation_id"])
if not e: if not e:
return get_data_error_result(message="Conversation not found!") return get_data_error_result(message="Conversation not found!")
@ -308,7 +368,7 @@ async def delete_msg():
@login_required @login_required
@validate_request("conversation_id", "message_id") @validate_request("conversation_id", "message_id")
async def thumbup(): async def thumbup():
req = await request.json req = await get_request_json()
e, conv = ConversationService.get_by_id(req["conversation_id"]) e, conv = ConversationService.get_by_id(req["conversation_id"])
if not e: if not e:
return get_data_error_result(message="Conversation not found!") return get_data_error_result(message="Conversation not found!")
@ -335,7 +395,7 @@ async def thumbup():
@login_required @login_required
@validate_request("question", "kb_ids") @validate_request("question", "kb_ids")
async def ask_about(): async def ask_about():
req = await request.json req = await get_request_json()
uid = current_user.id uid = current_user.id
search_id = req.get("search_id", "") search_id = req.get("search_id", "")
@ -367,7 +427,7 @@ async def ask_about():
@login_required @login_required
@validate_request("question", "kb_ids") @validate_request("question", "kb_ids")
async def mindmap(): async def mindmap():
req = await request.json req = await get_request_json()
search_id = req.get("search_id", "") search_id = req.get("search_id", "")
search_app = SearchService.get_detail(search_id) if search_id else {} search_app = SearchService.get_detail(search_id) if search_id else {}
search_config = search_app.get("search_config", {}) if search_app else {} search_config = search_app.get("search_config", {}) if search_app else {}
@ -385,7 +445,7 @@ async def mindmap():
@login_required @login_required
@validate_request("question") @validate_request("question")
async def related_questions(): async def related_questions():
req = await request.json req = await get_request_json()
search_id = req.get("search_id", "") search_id = req.get("search_id", "")
search_config = {} search_config = {}

View File

@ -21,10 +21,9 @@ from common.constants import StatusEnum
from api.db.services.tenant_llm_service import TenantLLMService from api.db.services.tenant_llm_service import TenantLLMService
from api.db.services.knowledgebase_service import KnowledgebaseService from api.db.services.knowledgebase_service import KnowledgebaseService
from api.db.services.user_service import TenantService, UserTenantService from api.db.services.user_service import TenantService, UserTenantService
-from api.utils.api_utils import server_error_response, get_data_error_result, validate_request
+from api.utils.api_utils import get_data_error_result, get_json_result, get_request_json, server_error_response, validate_request
from common.misc_utils import get_uuid
from common.constants import RetCode
-from api.utils.api_utils import get_json_result
from api.apps import login_required, current_user from api.apps import login_required, current_user
@ -32,7 +31,7 @@ from api.apps import login_required, current_user
@validate_request("prompt_config") @validate_request("prompt_config")
@login_required @login_required
async def set_dialog(): async def set_dialog():
req = await request.json req = await get_request_json()
dialog_id = req.get("dialog_id", "") dialog_id = req.get("dialog_id", "")
is_create = not dialog_id is_create = not dialog_id
name = req.get("name", "New Dialog") name = req.get("name", "New Dialog")
@ -181,7 +180,7 @@ async def list_dialogs_next():
else: else:
desc = True desc = True
req = await request.get_json() req = await get_request_json()
owner_ids = req.get("owner_ids", []) owner_ids = req.get("owner_ids", [])
try: try:
if not owner_ids: if not owner_ids:
@ -209,7 +208,7 @@ async def list_dialogs_next():
@login_required @login_required
@validate_request("dialog_ids") @validate_request("dialog_ids")
async def rm(): async def rm():
req = await request.json req = await get_request_json()
dialog_list=[] dialog_list=[]
tenants = UserTenantService.query(user_id=current_user.id) tenants = UserTenantService.query(user_id=current_user.id)
try: try:

View File

@ -36,7 +36,7 @@ from api.utils.api_utils import (
get_data_error_result, get_data_error_result,
get_json_result, get_json_result,
server_error_response, server_error_response,
-    validate_request, request_json,
+    validate_request, get_request_json,
) )
from api.utils.file_utils import filename_type, thumbnail from api.utils.file_utils import filename_type, thumbnail
from common.file_utils import get_project_base_directory from common.file_utils import get_project_base_directory
@ -153,7 +153,7 @@ async def web_crawl():
@login_required @login_required
@validate_request("name", "kb_id") @validate_request("name", "kb_id")
async def create(): async def create():
req = await request_json() req = await get_request_json()
kb_id = req["kb_id"] kb_id = req["kb_id"]
if not kb_id: if not kb_id:
return get_json_result(data=False, message='Lack of "KB ID"', code=RetCode.ARGUMENT_ERROR) return get_json_result(data=False, message='Lack of "KB ID"', code=RetCode.ARGUMENT_ERROR)
@ -230,7 +230,7 @@ async def list_docs():
create_time_from = int(request.args.get("create_time_from", 0)) create_time_from = int(request.args.get("create_time_from", 0))
create_time_to = int(request.args.get("create_time_to", 0)) create_time_to = int(request.args.get("create_time_to", 0))
req = await request.get_json() req = await get_request_json()
run_status = req.get("run_status", []) run_status = req.get("run_status", [])
if run_status: if run_status:
@ -271,7 +271,7 @@ async def list_docs():
@manager.route("/filter", methods=["POST"]) # noqa: F821 @manager.route("/filter", methods=["POST"]) # noqa: F821
@login_required @login_required
async def get_filter(): async def get_filter():
req = await request.get_json() req = await get_request_json()
kb_id = req.get("kb_id") kb_id = req.get("kb_id")
if not kb_id: if not kb_id:
@ -309,7 +309,7 @@ async def get_filter():
@manager.route("/infos", methods=["POST"]) # noqa: F821 @manager.route("/infos", methods=["POST"]) # noqa: F821
@login_required @login_required
async def doc_infos(): async def doc_infos():
req = await request_json() req = await get_request_json()
doc_ids = req["doc_ids"] doc_ids = req["doc_ids"]
for doc_id in doc_ids: for doc_id in doc_ids:
if not DocumentService.accessible(doc_id, current_user.id): if not DocumentService.accessible(doc_id, current_user.id):
@ -341,7 +341,7 @@ def thumbnails():
@login_required @login_required
@validate_request("doc_ids", "status") @validate_request("doc_ids", "status")
async def change_status(): async def change_status():
req = await request.get_json() req = await get_request_json()
doc_ids = req.get("doc_ids", []) doc_ids = req.get("doc_ids", [])
status = str(req.get("status", "")) status = str(req.get("status", ""))
@ -381,7 +381,7 @@ async def change_status():
@login_required @login_required
@validate_request("doc_id") @validate_request("doc_id")
async def rm(): async def rm():
req = await request_json() req = await get_request_json()
doc_ids = req["doc_id"] doc_ids = req["doc_id"]
if isinstance(doc_ids, str): if isinstance(doc_ids, str):
doc_ids = [doc_ids] doc_ids = [doc_ids]
@ -402,7 +402,7 @@ async def rm():
@login_required @login_required
@validate_request("doc_ids", "run") @validate_request("doc_ids", "run")
async def run(): async def run():
req = await request_json() req = await get_request_json()
for doc_id in req["doc_ids"]: for doc_id in req["doc_ids"]:
if not DocumentService.accessible(doc_id, current_user.id): if not DocumentService.accessible(doc_id, current_user.id):
return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR) return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR)
@ -449,7 +449,7 @@ async def run():
@login_required @login_required
@validate_request("doc_id", "name") @validate_request("doc_id", "name")
async def rename(): async def rename():
req = await request_json() req = await get_request_json()
if not DocumentService.accessible(req["doc_id"], current_user.id): if not DocumentService.accessible(req["doc_id"], current_user.id):
return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR) return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR)
try: try:
@ -539,7 +539,7 @@ async def download_attachment(attachment_id):
@validate_request("doc_id") @validate_request("doc_id")
async def change_parser(): async def change_parser():
req = await request_json() req = await get_request_json()
if not DocumentService.accessible(req["doc_id"], current_user.id): if not DocumentService.accessible(req["doc_id"], current_user.id):
return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR) return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR)
@ -607,7 +607,7 @@ async def get_image(image_id):
@login_required @login_required
@validate_request("conversation_id") @validate_request("conversation_id")
async def upload_and_parse(): async def upload_and_parse():
files = await request.file files = await request.files
if "file" not in files: if "file" not in files:
return get_json_result(data=False, message="No file part!", code=RetCode.ARGUMENT_ERROR) return get_json_result(data=False, message="No file part!", code=RetCode.ARGUMENT_ERROR)
@ -624,7 +624,8 @@ async def upload_and_parse():
@manager.route("/parse", methods=["POST"]) # noqa: F821 @manager.route("/parse", methods=["POST"]) # noqa: F821
@login_required @login_required
async def parse(): async def parse():
-    url = await request.json.get("url") if await request.json else ""
+    req = await get_request_json()
+    url = req.get("url", "")
if url: if url:
if not is_valid_url(url): if not is_valid_url(url):
return get_json_result(data=False, message="The URL format is invalid", code=RetCode.ARGUMENT_ERROR) return get_json_result(data=False, message="The URL format is invalid", code=RetCode.ARGUMENT_ERROR)
@ -679,7 +680,7 @@ async def parse():
@login_required @login_required
@validate_request("doc_id", "meta") @validate_request("doc_id", "meta")
async def set_meta(): async def set_meta():
req = await request_json() req = await get_request_json()
if not DocumentService.accessible(req["doc_id"], current_user.id): if not DocumentService.accessible(req["doc_id"], current_user.id):
return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR) return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR)
try: try:
@ -705,3 +706,13 @@ async def set_meta():
return get_json_result(data=True) return get_json_result(data=True)
except Exception as e: except Exception as e:
return server_error_response(e) return server_error_response(e)
@manager.route("/upload_info", methods=["POST"]) # noqa: F821
async def upload_info():
files = await request.files
file = files['file'] if files and files.get("file") else None
try:
return get_json_result(data=FileService.upload_info(current_user.id, file, request.args.get("url")))
except Exception as e:
return server_error_response(e)
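The new `/upload_info` route accepts either a multipart file or a `url` query parameter and returns whatever `FileService.upload_info` derives from it. A minimal hedged call; the host, the document-app prefix, the auth header, and the response shape are assumptions:

```python
# Hypothetical call to the upload_info route; host, prefix, auth and response shape are assumptions.
import httpx

with open("report.pdf", "rb") as f:
    resp = httpx.post(
        "http://localhost:9380/v1/document/upload_info",
        files={"file": ("report.pdf", f, "application/pdf")},
        headers={"Authorization": "Bearer <token>"},
    )
print(resp.json().get("data"))
```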

View File

@ -19,22 +19,20 @@ from pathlib import Path
from api.db.services.file2document_service import File2DocumentService from api.db.services.file2document_service import File2DocumentService
from api.db.services.file_service import FileService from api.db.services.file_service import FileService
from quart import request
from api.apps import login_required, current_user from api.apps import login_required, current_user
from api.db.services.knowledgebase_service import KnowledgebaseService from api.db.services.knowledgebase_service import KnowledgebaseService
from api.utils.api_utils import server_error_response, get_data_error_result, validate_request from api.utils.api_utils import get_data_error_result, get_json_result, get_request_json, server_error_response, validate_request
from common.misc_utils import get_uuid from common.misc_utils import get_uuid
from common.constants import RetCode from common.constants import RetCode
from api.db import FileType from api.db import FileType
from api.db.services.document_service import DocumentService from api.db.services.document_service import DocumentService
from api.utils.api_utils import get_json_result
@manager.route('/convert', methods=['POST']) # noqa: F821 @manager.route('/convert', methods=['POST']) # noqa: F821
@login_required @login_required
@validate_request("file_ids", "kb_ids") @validate_request("file_ids", "kb_ids")
async def convert(): async def convert():
req = await request.json req = await get_request_json()
kb_ids = req["kb_ids"] kb_ids = req["kb_ids"]
file_ids = req["file_ids"] file_ids = req["file_ids"]
file2documents = [] file2documents = []
@ -79,7 +77,8 @@ async def convert():
doc = DocumentService.insert({ doc = DocumentService.insert({
"id": get_uuid(), "id": get_uuid(),
"kb_id": kb.id, "kb_id": kb.id,
"parser_id": FileService.get_parser(file.type, file.name, kb.parser_id), "parser_id": kb.parser_id,
"pipeline_id": kb.pipeline_id,
"parser_config": kb.parser_config, "parser_config": kb.parser_config,
"created_by": current_user.id, "created_by": current_user.id,
"type": file.type, "type": file.type,
@ -104,7 +103,7 @@ async def convert():
@login_required @login_required
@validate_request("file_ids") @validate_request("file_ids")
async def rm(): async def rm():
req = await request.json req = await get_request_json()
file_ids = req["file_ids"] file_ids = req["file_ids"]
if not file_ids: if not file_ids:
return get_json_result( return get_json_result(

View File

@ -29,7 +29,7 @@ from common.constants import RetCode, FileSource
from api.db import FileType from api.db import FileType
from api.db.services import duplicate_name from api.db.services import duplicate_name
from api.db.services.file_service import FileService from api.db.services.file_service import FileService
from api.utils.api_utils import get_json_result from api.utils.api_utils import get_json_result, get_request_json
from api.utils.file_utils import filename_type from api.utils.file_utils import filename_type
from api.utils.web_utils import CONTENT_TYPE_MAP from api.utils.web_utils import CONTENT_TYPE_MAP
from common import settings from common import settings
@ -124,9 +124,9 @@ async def upload():
@login_required @login_required
@validate_request("name") @validate_request("name")
async def create(): async def create():
-    req = await request.json
-    pf_id = await request.json.get("parent_id")
-    input_file_type = await request.json.get("type")
+    req = await get_request_json()
+    pf_id = req.get("parent_id")
+    input_file_type = req.get("type")
if not pf_id: if not pf_id:
root_folder = FileService.get_root_folder(current_user.id) root_folder = FileService.get_root_folder(current_user.id)
pf_id = root_folder["id"] pf_id = root_folder["id"]
@ -239,7 +239,7 @@ def get_all_parent_folders():
@login_required @login_required
@validate_request("file_ids") @validate_request("file_ids")
async def rm(): async def rm():
req = await request.json req = await get_request_json()
file_ids = req["file_ids"] file_ids = req["file_ids"]
def _delete_single_file(file): def _delete_single_file(file):
@ -300,7 +300,7 @@ async def rm():
@login_required @login_required
@validate_request("file_id", "name") @validate_request("file_id", "name")
async def rename(): async def rename():
req = await request.json req = await get_request_json()
try: try:
e, file = FileService.get_by_id(req["file_id"]) e, file = FileService.get_by_id(req["file_id"])
if not e: if not e:
@ -369,7 +369,7 @@ async def get(file_id):
@login_required @login_required
@validate_request("src_file_ids", "dest_file_id") @validate_request("src_file_ids", "dest_file_id")
async def move(): async def move():
req = await request.json req = await get_request_json()
try: try:
file_ids = req["src_file_ids"] file_ids = req["src_file_ids"]
dest_parent_id = req["dest_file_id"] dest_parent_id = req["dest_file_id"]

View File

@ -30,7 +30,7 @@ from api.db.services.pipeline_operation_log_service import PipelineOperationLogS
from api.db.services.task_service import TaskService, GRAPH_RAPTOR_FAKE_DOC_ID from api.db.services.task_service import TaskService, GRAPH_RAPTOR_FAKE_DOC_ID
from api.db.services.user_service import TenantService, UserTenantService from api.db.services.user_service import TenantService, UserTenantService
from api.utils.api_utils import get_error_data_result, server_error_response, get_data_error_result, validate_request, not_allowed_parameters, \
-    request_json
+    get_request_json
from api.db import VALID_FILE_TYPES from api.db import VALID_FILE_TYPES
from api.db.services.knowledgebase_service import KnowledgebaseService from api.db.services.knowledgebase_service import KnowledgebaseService
from api.db.db_models import File from api.db.db_models import File
@ -48,7 +48,7 @@ from api.apps import login_required, current_user
@login_required @login_required
@validate_request("name") @validate_request("name")
async def create(): async def create():
req = await request_json() req = await get_request_json()
e, res = KnowledgebaseService.create_with_name( e, res = KnowledgebaseService.create_with_name(
name = req.pop("name", None), name = req.pop("name", None),
tenant_id = current_user.id, tenant_id = current_user.id,
@ -72,7 +72,7 @@ async def create():
@validate_request("kb_id", "name", "description", "parser_id") @validate_request("kb_id", "name", "description", "parser_id")
@not_allowed_parameters("id", "tenant_id", "created_by", "create_time", "update_time", "create_date", "update_date", "created_by") @not_allowed_parameters("id", "tenant_id", "created_by", "create_time", "update_time", "create_date", "update_date", "created_by")
async def update(): async def update():
req = await request_json() req = await get_request_json()
if not isinstance(req["name"], str): if not isinstance(req["name"], str):
return get_data_error_result(message="Dataset name must be string.") return get_data_error_result(message="Dataset name must be string.")
if req["name"].strip() == "": if req["name"].strip() == "":
@ -182,7 +182,7 @@ async def list_kbs():
else: else:
desc = True desc = True
req = await request_json() req = await get_request_json()
owner_ids = req.get("owner_ids", []) owner_ids = req.get("owner_ids", [])
try: try:
if not owner_ids: if not owner_ids:
@ -209,7 +209,7 @@ async def list_kbs():
@login_required @login_required
@validate_request("kb_id") @validate_request("kb_id")
async def rm(): async def rm():
req = await request_json() req = await get_request_json()
if not KnowledgebaseService.accessible4deletion(req["kb_id"], current_user.id): if not KnowledgebaseService.accessible4deletion(req["kb_id"], current_user.id):
return get_json_result( return get_json_result(
data=False, data=False,
@ -286,7 +286,7 @@ def list_tags_from_kbs():
@manager.route('/<kb_id>/rm_tags', methods=['POST']) # noqa: F821 @manager.route('/<kb_id>/rm_tags', methods=['POST']) # noqa: F821
@login_required @login_required
async def rm_tags(kb_id): async def rm_tags(kb_id):
req = await request_json() req = await get_request_json()
if not KnowledgebaseService.accessible(kb_id, current_user.id): if not KnowledgebaseService.accessible(kb_id, current_user.id):
return get_json_result( return get_json_result(
data=False, data=False,
@ -306,7 +306,7 @@ async def rm_tags(kb_id):
@manager.route('/<kb_id>/rename_tag', methods=['POST']) # noqa: F821 @manager.route('/<kb_id>/rename_tag', methods=['POST']) # noqa: F821
@login_required @login_required
async def rename_tags(kb_id): async def rename_tags(kb_id):
req = await request_json() req = await get_request_json()
if not KnowledgebaseService.accessible(kb_id, current_user.id): if not KnowledgebaseService.accessible(kb_id, current_user.id):
return get_json_result( return get_json_result(
data=False, data=False,
@ -428,7 +428,7 @@ async def list_pipeline_logs():
if create_date_to > create_date_from: if create_date_to > create_date_from:
return get_data_error_result(message="Create data filter is abnormal.") return get_data_error_result(message="Create data filter is abnormal.")
req = await request_json() req = await get_request_json()
operation_status = req.get("operation_status", []) operation_status = req.get("operation_status", [])
if operation_status: if operation_status:
@ -470,7 +470,7 @@ async def list_pipeline_dataset_logs():
if create_date_to > create_date_from: if create_date_to > create_date_from:
return get_data_error_result(message="Create data filter is abnormal.") return get_data_error_result(message="Create data filter is abnormal.")
req = await request_json() req = await get_request_json()
operation_status = req.get("operation_status", []) operation_status = req.get("operation_status", [])
if operation_status: if operation_status:
@ -492,7 +492,7 @@ async def delete_pipeline_logs():
if not kb_id: if not kb_id:
return get_json_result(data=False, message='Lack of "KB ID"', code=RetCode.ARGUMENT_ERROR) return get_json_result(data=False, message='Lack of "KB ID"', code=RetCode.ARGUMENT_ERROR)
req = await request_json() req = await get_request_json()
log_ids = req.get("log_ids", []) log_ids = req.get("log_ids", [])
PipelineOperationLogService.delete_by_ids(log_ids) PipelineOperationLogService.delete_by_ids(log_ids)
@ -517,7 +517,7 @@ def pipeline_log_detail():
@manager.route("/run_graphrag", methods=["POST"]) # noqa: F821 @manager.route("/run_graphrag", methods=["POST"]) # noqa: F821
@login_required @login_required
async def run_graphrag(): async def run_graphrag():
req = await request_json() req = await get_request_json()
kb_id = req.get("kb_id", "") kb_id = req.get("kb_id", "")
if not kb_id: if not kb_id:
@ -586,7 +586,7 @@ def trace_graphrag():
@manager.route("/run_raptor", methods=["POST"]) # noqa: F821 @manager.route("/run_raptor", methods=["POST"]) # noqa: F821
@login_required @login_required
async def run_raptor(): async def run_raptor():
req = await request_json() req = await get_request_json()
kb_id = req.get("kb_id", "") kb_id = req.get("kb_id", "")
if not kb_id: if not kb_id:
@ -655,7 +655,7 @@ def trace_raptor():
@manager.route("/run_mindmap", methods=["POST"]) # noqa: F821 @manager.route("/run_mindmap", methods=["POST"]) # noqa: F821
@login_required @login_required
async def run_mindmap(): async def run_mindmap():
req = await request_json() req = await get_request_json()
kb_id = req.get("kb_id", "") kb_id = req.get("kb_id", "")
if not kb_id: if not kb_id:
@ -857,11 +857,11 @@ async def check_embedding():
"question_kwd": full_doc.get("question_kwd") or [] "question_kwd": full_doc.get("question_kwd") or []
}) })
return out return out
def _clean(s: str) -> str: def _clean(s: str) -> str:
s = re.sub(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>", " ", s or "") s = re.sub(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>", " ", s or "")
return s if s else "None" return s if s else "None"
req = await request_json() req = await get_request_json()
kb_id = req.get("kb_id", "") kb_id = req.get("kb_id", "")
embd_id = req.get("embd_id", "") embd_id = req.get("embd_id", "")
n = int(req.get("check_num", 5)) n = int(req.get("check_num", 5))

View File

@ -15,20 +15,19 @@
# #
from quart import request
from api.apps import current_user, login_required from api.apps import current_user, login_required
from langfuse import Langfuse from langfuse import Langfuse
from api.db.db_models import DB from api.db.db_models import DB
from api.db.services.langfuse_service import TenantLangfuseService from api.db.services.langfuse_service import TenantLangfuseService
from api.utils.api_utils import get_error_data_result, get_json_result, server_error_response, validate_request from api.utils.api_utils import get_error_data_result, get_json_result, get_request_json, server_error_response, validate_request
@manager.route("/api_key", methods=["POST", "PUT"]) # noqa: F821 @manager.route("/api_key", methods=["POST", "PUT"]) # noqa: F821
@login_required @login_required
@validate_request("secret_key", "public_key", "host") @validate_request("secret_key", "public_key", "host")
async def set_api_key(): async def set_api_key():
req = await request.get_json() req = await get_request_json()
secret_key = req.get("secret_key", "") secret_key = req.get("secret_key", "")
public_key = req.get("public_key", "") public_key = req.get("public_key", "")
host = req.get("host", "") host = req.get("host", "")

View File

@ -21,10 +21,9 @@ from quart import request
from api.apps import login_required, current_user from api.apps import login_required, current_user
from api.db.services.tenant_llm_service import LLMFactoriesService, TenantLLMService from api.db.services.tenant_llm_service import LLMFactoriesService, TenantLLMService
from api.db.services.llm_service import LLMService from api.db.services.llm_service import LLMService
from api.utils.api_utils import server_error_response, get_data_error_result, validate_request from api.utils.api_utils import get_allowed_llm_factories, get_data_error_result, get_json_result, get_request_json, server_error_response, validate_request
from common.constants import StatusEnum, LLMType from common.constants import StatusEnum, LLMType
from api.db.db_models import TenantLLM from api.db.db_models import TenantLLM
from api.utils.api_utils import get_json_result, get_allowed_llm_factories
from rag.utils.base64_image import test_image from rag.utils.base64_image import test_image
from rag.llm import EmbeddingModel, ChatModel, RerankModel, CvModel, TTSModel from rag.llm import EmbeddingModel, ChatModel, RerankModel, CvModel, TTSModel
@ -54,7 +53,7 @@ def factories():
@login_required @login_required
@validate_request("llm_factory", "api_key") @validate_request("llm_factory", "api_key")
async def set_api_key(): async def set_api_key():
req = await request.json req = await get_request_json()
# test if api key works # test if api key works
chat_passed, embd_passed, rerank_passed = False, False, False chat_passed, embd_passed, rerank_passed = False, False, False
factory = req["llm_factory"] factory = req["llm_factory"]
@ -124,7 +123,7 @@ async def set_api_key():
@login_required @login_required
@validate_request("llm_factory") @validate_request("llm_factory")
async def add_llm(): async def add_llm():
req = await request.json req = await get_request_json()
factory = req["llm_factory"] factory = req["llm_factory"]
api_key = req.get("api_key", "x") api_key = req.get("api_key", "x")
llm_name = req.get("llm_name") llm_name = req.get("llm_name")
@ -269,7 +268,7 @@ async def add_llm():
@login_required @login_required
@validate_request("llm_factory", "llm_name") @validate_request("llm_factory", "llm_name")
async def delete_llm(): async def delete_llm():
req = await request.json req = await get_request_json()
TenantLLMService.filter_delete([TenantLLM.tenant_id == current_user.id, TenantLLM.llm_factory == req["llm_factory"], TenantLLM.llm_name == req["llm_name"]]) TenantLLMService.filter_delete([TenantLLM.tenant_id == current_user.id, TenantLLM.llm_factory == req["llm_factory"], TenantLLM.llm_name == req["llm_name"]])
return get_json_result(data=True) return get_json_result(data=True)
@ -278,7 +277,7 @@ async def delete_llm():
@login_required @login_required
@validate_request("llm_factory", "llm_name") @validate_request("llm_factory", "llm_name")
async def enable_llm(): async def enable_llm():
req = await request.json req = await get_request_json()
TenantLLMService.filter_update( TenantLLMService.filter_update(
[TenantLLM.tenant_id == current_user.id, TenantLLM.llm_factory == req["llm_factory"], TenantLLM.llm_name == req["llm_name"]], {"status": str(req.get("status", "1"))} [TenantLLM.tenant_id == current_user.id, TenantLLM.llm_factory == req["llm_factory"], TenantLLM.llm_name == req["llm_name"]], {"status": str(req.get("status", "1"))}
) )
@ -289,7 +288,7 @@ async def enable_llm():
@login_required @login_required
@validate_request("llm_factory") @validate_request("llm_factory")
async def delete_factory(): async def delete_factory():
req = await request.json req = await get_request_json()
TenantLLMService.filter_delete([TenantLLM.tenant_id == current_user.id, TenantLLM.llm_factory == req["llm_factory"]]) TenantLLMService.filter_delete([TenantLLM.tenant_id == current_user.id, TenantLLM.llm_factory == req["llm_factory"]])
return get_json_result(data=True) return get_json_result(data=True)

View File

@ -22,8 +22,7 @@ from api.db.services.user_service import TenantService
from common.constants import RetCode, VALID_MCP_SERVER_TYPES from common.constants import RetCode, VALID_MCP_SERVER_TYPES
from common.misc_utils import get_uuid from common.misc_utils import get_uuid
-from api.utils.api_utils import get_data_error_result, get_json_result, server_error_response, validate_request, \
-    get_mcp_tools
+from api.utils.api_utils import get_data_error_result, get_json_result, get_mcp_tools, get_request_json, server_error_response, validate_request
from api.utils.web_utils import get_float, safe_json_parse from api.utils.web_utils import get_float, safe_json_parse
from common.mcp_tool_call_conn import MCPToolCallSession, close_multiple_mcp_toolcall_sessions from common.mcp_tool_call_conn import MCPToolCallSession, close_multiple_mcp_toolcall_sessions
@ -40,7 +39,7 @@ async def list_mcp() -> Response:
else: else:
desc = True desc = True
req = await request.get_json() req = await get_request_json()
mcp_ids = req.get("mcp_ids", []) mcp_ids = req.get("mcp_ids", [])
try: try:
servers = MCPServerService.get_servers(current_user.id, mcp_ids, 0, 0, orderby, desc, keywords) or [] servers = MCPServerService.get_servers(current_user.id, mcp_ids, 0, 0, orderby, desc, keywords) or []
@ -73,7 +72,7 @@ def detail() -> Response:
@login_required @login_required
@validate_request("name", "url", "server_type") @validate_request("name", "url", "server_type")
async def create() -> Response: async def create() -> Response:
req = await request.get_json() req = await get_request_json()
server_type = req.get("server_type", "") server_type = req.get("server_type", "")
if server_type not in VALID_MCP_SERVER_TYPES: if server_type not in VALID_MCP_SERVER_TYPES:
@ -128,7 +127,7 @@ async def create() -> Response:
@login_required @login_required
@validate_request("mcp_id") @validate_request("mcp_id")
async def update() -> Response: async def update() -> Response:
req = await request.get_json() req = await get_request_json()
mcp_id = req.get("mcp_id", "") mcp_id = req.get("mcp_id", "")
e, mcp_server = MCPServerService.get_by_id(mcp_id) e, mcp_server = MCPServerService.get_by_id(mcp_id)
@ -184,7 +183,7 @@ async def update() -> Response:
@login_required @login_required
@validate_request("mcp_ids") @validate_request("mcp_ids")
async def rm() -> Response: async def rm() -> Response:
req = await request.get_json() req = await get_request_json()
mcp_ids = req.get("mcp_ids", []) mcp_ids = req.get("mcp_ids", [])
try: try:
@ -202,7 +201,7 @@ async def rm() -> Response:
@login_required @login_required
@validate_request("mcpServers") @validate_request("mcpServers")
async def import_multiple() -> Response: async def import_multiple() -> Response:
req = await request.get_json() req = await get_request_json()
servers = req.get("mcpServers", {}) servers = req.get("mcpServers", {})
if not servers: if not servers:
return get_data_error_result(message="No MCP servers provided.") return get_data_error_result(message="No MCP servers provided.")
@ -269,7 +268,7 @@ async def import_multiple() -> Response:
@login_required @login_required
@validate_request("mcp_ids") @validate_request("mcp_ids")
async def export_multiple() -> Response: async def export_multiple() -> Response:
req = await request.get_json() req = await get_request_json()
mcp_ids = req.get("mcp_ids", []) mcp_ids = req.get("mcp_ids", [])
if not mcp_ids: if not mcp_ids:
@ -301,7 +300,7 @@ async def export_multiple() -> Response:
@login_required @login_required
@validate_request("mcp_ids") @validate_request("mcp_ids")
async def list_tools() -> Response: async def list_tools() -> Response:
req = await request.get_json() req = await get_request_json()
mcp_ids = req.get("mcp_ids", []) mcp_ids = req.get("mcp_ids", [])
if not mcp_ids: if not mcp_ids:
return get_data_error_result(message="No MCP server IDs provided.") return get_data_error_result(message="No MCP server IDs provided.")
@ -348,7 +347,7 @@ async def list_tools() -> Response:
@login_required @login_required
@validate_request("mcp_id", "tool_name", "arguments") @validate_request("mcp_id", "tool_name", "arguments")
async def test_tool() -> Response: async def test_tool() -> Response:
req = await request.get_json() req = await get_request_json()
mcp_id = req.get("mcp_id", "") mcp_id = req.get("mcp_id", "")
if not mcp_id: if not mcp_id:
return get_data_error_result(message="No MCP server ID provided.") return get_data_error_result(message="No MCP server ID provided.")
@ -381,7 +380,7 @@ async def test_tool() -> Response:
@login_required @login_required
@validate_request("mcp_id", "tools") @validate_request("mcp_id", "tools")
async def cache_tool() -> Response: async def cache_tool() -> Response:
req = await request.get_json() req = await get_request_json()
mcp_id = req.get("mcp_id", "") mcp_id = req.get("mcp_id", "")
if not mcp_id: if not mcp_id:
return get_data_error_result(message="No MCP server ID provided.") return get_data_error_result(message="No MCP server ID provided.")
@ -404,7 +403,7 @@ async def cache_tool() -> Response:
@manager.route("/test_mcp", methods=["POST"]) # noqa: F821 @manager.route("/test_mcp", methods=["POST"]) # noqa: F821
@validate_request("url", "server_type") @validate_request("url", "server_type")
async def test_mcp() -> Response: async def test_mcp() -> Response:
req = await request.get_json() req = await get_request_json()
url = req.get("url", "") url = req.get("url", "")
if not url: if not url:

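The recurring change in this file (and in the files below) swaps direct `await request.get_json()` calls for a shared `get_request_json()` helper imported from `api.utils.api_utils`. The helper itself is not shown in this compare view; a minimal sketch of what such a wrapper might look like under Quart, offered purely as an assumption:

```python
# Hypothetical sketch only -- the real api.utils.api_utils.get_request_json may differ.
from quart import request


async def get_request_json() -> dict:
    """Return the JSON body of the current request, or {} if it is missing or malformed."""
    payload = await request.get_json(silent=True)  # silent=True yields None instead of raising
    return payload if isinstance(payload, dict) else {}
```

Centralizing body parsing this way keeps the endpoints from each re-implementing error handling for absent or invalid JSON.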
View File

@@ -25,7 +25,7 @@ from api.db.services.canvas_service import UserCanvasService
from api.db.services.user_canvas_version import UserCanvasVersionService
from common.constants import RetCode
from common.misc_utils import get_uuid
- from api.utils.api_utils import get_data_error_result, get_error_data_result, get_json_result, token_required
+ from api.utils.api_utils import get_data_error_result, get_error_data_result, get_json_result, get_request_json, token_required
from api.utils.api_utils import get_result
from quart import request, Response

@@ -53,7 +53,7 @@ def list_agents(tenant_id):
@manager.route("/agents", methods=["POST"])  # noqa: F821
@token_required
async def create_agent(tenant_id: str):
- req: dict[str, Any] = cast(dict[str, Any], await request.json)
+ req: dict[str, Any] = cast(dict[str, Any], await get_request_json())
req["user_id"] = tenant_id
if req.get("dsl") is not None:

@@ -90,7 +90,7 @@ async def create_agent(tenant_id: str):
@manager.route("/agents/<agent_id>", methods=["PUT"])  # noqa: F821
@token_required
async def update_agent(tenant_id: str, agent_id: str):
- req: dict[str, Any] = {k: v for k, v in cast(dict[str, Any], (await request.json)).items() if v is not None}
+ req: dict[str, Any] = {k: v for k, v in cast(dict[str, Any], (await get_request_json())).items() if v is not None}
req["user_id"] = tenant_id
if req.get("dsl") is not None:

@@ -136,7 +136,7 @@ def delete_agent(tenant_id: str, agent_id: str):
@manager.route('/webhook/<agent_id>', methods=['POST'])  # noqa: F821
@token_required
async def webhook(tenant_id: str, agent_id: str):
- req = await request.json
+ req = await get_request_json()
if not UserCanvasService.accessible(req["id"], tenant_id):
return get_json_result(
data=False, message='Only owner of canvas authorized for this operation.',

@@ -159,10 +159,10 @@ async def webhook(tenant_id: str, agent_id: str):
data=False, message=str(e),
code=RetCode.EXCEPTION_ERROR)
- def sse():
+ async def sse():
nonlocal canvas
try:
- for ans in canvas.run(query=req.get("query", ""), files=req.get("files", []), user_id=req.get("user_id", tenant_id), webhook_payload=req):
+ async for ans in canvas.run(query=req.get("query", ""), files=req.get("files", []), user_id=req.get("user_id", tenant_id), webhook_payload=req):
yield "data:" + json.dumps(ans, ensure_ascii=False) + "\n\n"
cvs.dsl = json.loads(str(canvas))

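The webhook handler above now runs the canvas as an async generator (`async def sse()` with `async for ans in canvas.run(...)`). A stripped-down sketch of that streaming pattern, with the canvas object and route wiring assumed rather than taken from the repository:

```python
# Sketch of streaming an async generator as server-sent events with Quart.
# `canvas.run` follows the diff above; everything else here is illustrative.
import json
from quart import Response


def stream_canvas_events(canvas, req: dict, tenant_id: str) -> Response:
    async def sse():
        async for ans in canvas.run(query=req.get("query", ""),
                                    files=req.get("files", []),
                                    user_id=req.get("user_id", tenant_id),
                                    webhook_payload=req):
            # Frame each answer as one SSE event.
            yield "data:" + json.dumps(ans, ensure_ascii=False) + "\n\n"

    return Response(sse(), mimetype="text/event-stream")
```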
View File

@@ -21,13 +21,13 @@ from api.db.services.tenant_llm_service import TenantLLMService
from api.db.services.user_service import TenantService
from common.misc_utils import get_uuid
from common.constants import RetCode, StatusEnum
- from api.utils.api_utils import check_duplicate_ids, get_error_data_result, get_result, token_required, request_json
+ from api.utils.api_utils import check_duplicate_ids, get_error_data_result, get_result, token_required, get_request_json
@manager.route("/chats", methods=["POST"])  # noqa: F821
@token_required
async def create(tenant_id):
- req = await request_json()
+ req = await get_request_json()
ids = [i for i in req.get("dataset_ids", []) if i]
for kb_id in ids:
kbs = KnowledgebaseService.accessible(kb_id=kb_id, user_id=tenant_id)

@@ -146,7 +146,7 @@ async def create(tenant_id):
async def update(tenant_id, chat_id):
if not DialogService.query(tenant_id=tenant_id, id=chat_id, status=StatusEnum.VALID.value):
return get_error_data_result(message="You do not own the chat")
- req = await request_json()
+ req = await get_request_json()
ids = req.get("dataset_ids", [])
if "show_quotation" in req:
req["do_refer"] = req.pop("show_quotation")

@@ -229,7 +229,7 @@ async def update(tenant_id, chat_id):
async def delete_chats(tenant_id):
errors = []
success_count = 0
- req = await request_json()
+ req = await get_request_json()
if not req:
ids = None
else:

View File

@@ -15,12 +15,12 @@
#
import logging
- from quart import request, jsonify
+ from quart import jsonify
from api.db.services.document_service import DocumentService
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.db.services.llm_service import LLMBundle
- from api.utils.api_utils import validate_request, build_error_result, apikey_required
+ from api.utils.api_utils import apikey_required, build_error_result, get_request_json, validate_request
from rag.app.tag import label_question
from api.db.services.dialog_service import meta_filter, convert_conditions
from common.constants import RetCode, LLMType

@@ -113,14 +113,14 @@ async def retrieval(tenant_id):
404:
description: Knowledge base or document not found
"""
- req = await request.json
+ req = await get_request_json()
question = req["query"]
kb_id = req["knowledge_id"]
use_kg = req.get("use_kg", False)
retrieval_setting = req.get("retrieval_setting", {})
similarity_threshold = float(retrieval_setting.get("score_threshold", 0.0))
top = int(retrieval_setting.get("top_k", 1024))
- metadata_condition = req.get("metadata_condition", {})
+ metadata_condition = req.get("metadata_condition", {}) or {}
metas = DocumentService.get_meta_by_kbs([kb_id])
doc_ids = []

@@ -132,7 +132,7 @@ async def retrieval(tenant_id):
embd_mdl = LLMBundle(kb.tenant_id, LLMType.EMBEDDING.value, llm_name=kb.embd_id)
if metadata_condition:
- doc_ids.extend(meta_filter(metas, convert_conditions(metadata_condition)))
+ doc_ids.extend(meta_filter(metas, convert_conditions(metadata_condition), metadata_condition.get("logic", "and")))
if not doc_ids and metadata_condition:
doc_ids = ["-999"]
ranks = settings.retriever.retrieval(

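The retrieval endpoint above also tightens metadata-filter handling: an explicit `null` body no longer breaks `metadata_condition`, the filter logic (`and`/`or`) is forwarded to `meta_filter`, and a sentinel document ID is used when the filter matches nothing. The guard pattern restated outside the diff, with the helpers imported exactly as in the diff and assumed to behave as their names suggest:

```python
# Restatement of the guard pattern from the diff above (sketch, not the handler itself).
from api.db.services.dialog_service import meta_filter, convert_conditions


def resolve_doc_ids(req: dict, metas) -> list[str]:
    metadata_condition = req.get("metadata_condition", {}) or {}  # tolerate an explicit null
    doc_ids: list[str] = []
    if metadata_condition:
        doc_ids.extend(
            meta_filter(metas, convert_conditions(metadata_condition),
                        metadata_condition.get("logic", "and"))
        )
    if metadata_condition and not doc_ids:
        # A filter was supplied but matched nothing: use an impossible ID so the
        # retriever returns an empty result instead of silently ignoring the filter.
        doc_ids = ["-999"]
    return doc_ids
```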
View File

@@ -36,7 +36,7 @@ from api.db.services.tenant_llm_service import TenantLLMService
from api.db.services.task_service import TaskService, queue_tasks
from api.db.services.dialog_service import meta_filter, convert_conditions
from api.utils.api_utils import check_duplicate_ids, construct_json_result, get_error_data_result, get_parser_config, get_result, server_error_response, token_required, \
-     request_json
+     get_request_json
from rag.app.qa import beAdoc, rmPrefix
from rag.app.tag import label_question
from rag.nlp import rag_tokenizer, search

@@ -231,7 +231,7 @@ async def update_doc(tenant_id, dataset_id, document_id):
schema:
type: object
"""
- req = await request_json()
+ req = await get_request_json()
if not KnowledgebaseService.query(id=dataset_id, tenant_id=tenant_id):
return get_error_data_result(message="You don't own the dataset.")
e, kb = KnowledgebaseService.get_by_id(dataset_id)

@@ -536,7 +536,7 @@ def list_docs(dataset_id, tenant_id):
return get_error_data_result(message=f"You don't own the dataset {dataset_id}. ")
q = request.args
document_id = q.get("id")
name = q.get("name")
if document_id and not DocumentService.query(id=document_id, kb_id=dataset_id):

@@ -545,16 +545,16 @@ def list_docs(dataset_id, tenant_id):
return get_error_data_result(message=f"You don't own the document {name}.")
page = int(q.get("page", 1))
page_size = int(q.get("page_size", 30))
orderby = q.get("orderby", "create_time")
desc = str(q.get("desc", "true")).strip().lower() != "false"
keywords = q.get("keywords", "")
# filters - align with OpenAPI parameter names
suffix = q.getlist("suffix")
run_status = q.getlist("run")
create_time_from = int(q.get("create_time_from", 0))
create_time_to = int(q.get("create_time_to", 0))
# map run status (accept text or numeric) - align with API parameter
run_status_text_to_numeric = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}

@@ -575,7 +575,7 @@ def list_docs(dataset_id, tenant_id):
# rename keys + map run status back to text for output
key_mapping = {
"chunk_num": "chunk_count",
"kb_id": "dataset_id",
"token_num": "token_count",
"parser_id": "chunk_method",
}

@@ -631,7 +631,7 @@ async def delete(tenant_id, dataset_id):
"""
if not KnowledgebaseService.accessible(kb_id=dataset_id, user_id=tenant_id):
return get_error_data_result(message=f"You don't own the dataset {dataset_id}. ")
- req = await request_json()
+ req = await get_request_json()
if not req:
doc_ids = None
else:

@@ -741,7 +741,7 @@ async def parse(tenant_id, dataset_id):
"""
if not KnowledgebaseService.accessible(kb_id=dataset_id, user_id=tenant_id):
return get_error_data_result(message=f"You don't own the dataset {dataset_id}.")
- req = await request_json()
+ req = await get_request_json()
if not req.get("document_ids"):
return get_error_data_result("`document_ids` is required")
doc_list = req.get("document_ids")

@@ -824,7 +824,7 @@ async def stop_parsing(tenant_id, dataset_id):
"""
if not KnowledgebaseService.accessible(kb_id=dataset_id, user_id=tenant_id):
return get_error_data_result(message=f"You don't own the dataset {dataset_id}.")
- req = await request_json()
+ req = await get_request_json()
if not req.get("document_ids"):
return get_error_data_result("`document_ids` is required")

@@ -1096,7 +1096,7 @@ async def add_chunk(tenant_id, dataset_id, document_id):
if not doc:
return get_error_data_result(message=f"You don't own the document {document_id}.")
doc = doc[0]
- req = await request_json()
+ req = await get_request_json()
if not str(req.get("content", "")).strip():
return get_error_data_result(message="`content` is required")
if "important_keywords" in req:

@@ -1202,7 +1202,7 @@ async def rm_chunk(tenant_id, dataset_id, document_id):
docs = DocumentService.get_by_ids([document_id])
if not docs:
raise LookupError(f"Can't find the document with ID {document_id}!")
- req = await request_json()
+ req = await get_request_json()
condition = {"doc_id": document_id}
if "chunk_ids" in req:
unique_chunk_ids, duplicate_messages = check_duplicate_ids(req["chunk_ids"], "chunk")

@@ -1288,8 +1288,8 @@ async def update_chunk(tenant_id, dataset_id, document_id, chunk_id):
if not doc:
return get_error_data_result(message=f"You don't own the document {document_id}.")
doc = doc[0]
- req = await request_json()
+ req = await get_request_json()
- if "content" in req:
+ if "content" in req and req["content"] is not None:
content = req["content"]
else:
content = chunk.get("content_with_weight", "")

@@ -1411,7 +1411,7 @@ async def retrieval_test(tenant_id):
format: float
description: Similarity score.
"""
- req = await request_json()
+ req = await get_request_json()
if not req.get("dataset_ids"):
return get_error_data_result("`dataset_ids` is required.")
kb_ids = req["dataset_ids"]

@@ -1434,6 +1434,7 @@ async def retrieval_test(tenant_id):
question = req["question"]
doc_ids = req.get("document_ids", [])
use_kg = req.get("use_kg", False)
+ toc_enhance = req.get("toc_enhance", False)
langs = req.get("cross_languages", [])
if not isinstance(doc_ids, list):
return get_error_data_result("`documents` should be a list")

@@ -1442,9 +1443,14 @@ async def retrieval_test(tenant_id):
if doc_id not in doc_ids_list:
return get_error_data_result(f"The datasets don't own the document {doc_id}")
if not doc_ids:
- metadata_condition = req.get("metadata_condition", {})
+ metadata_condition = req.get("metadata_condition", {}) or {}
metas = DocumentService.get_meta_by_kbs(kb_ids)
- doc_ids = meta_filter(metas, convert_conditions(metadata_condition))
+ doc_ids = meta_filter(metas, convert_conditions(metadata_condition), metadata_condition.get("logic", "and"))
+ # If metadata_condition has conditions but no docs match, return empty result
+ if not doc_ids and metadata_condition.get("conditions"):
+     return get_result(data={"total": 0, "chunks": [], "doc_aggs": {}})
+ if metadata_condition and not doc_ids:
+     doc_ids = ["-999"]
similarity_threshold = float(req.get("similarity_threshold", 0.2))
vector_similarity_weight = float(req.get("vector_similarity_weight", 0.3))
top = int(req.get("top_k", 1024))

@@ -1485,6 +1491,11 @@ async def retrieval_test(tenant_id):
highlight=highlight,
rank_feature=label_question(question, kbs),
)
+ if toc_enhance:
+     chat_mdl = LLMBundle(kb.tenant_id, LLMType.CHAT)
+     cks = settings.retriever.retrieval_by_toc(question, ranks["chunks"], tenant_ids, chat_mdl, size)
+     if cks:
+         ranks["chunks"] = cks
if use_kg:
ck = settings.kg_retriever.retrieval(question, [k.tenant_id for k in kbs], kb_ids, embd_mdl, LLMBundle(kb.tenant_id, LLMType.CHAT))
if ck["content_with_weight"]:

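`retrieval_test` gains a `toc_enhance` flag that, when set, reranks the retrieved chunks against the document table of contents via `retrieval_by_toc`. A hypothetical client call exercising the flag; the host, port, endpoint path, and API key placeholder are assumptions rather than values taken from this diff:

```python
# Hypothetical request against the HTTP retrieval API; adjust host, path, and key to your deployment.
import requests

resp = requests.post(
    "http://localhost:9380/api/v1/retrieval",
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
    json={
        "question": "What does chapter 3 cover?",
        "dataset_ids": ["<dataset_id>"],
        "toc_enhance": True,          # new flag introduced by this diff
        "similarity_threshold": 0.2,  # defaults mirrored from the handler
        "top_k": 1024,
    },
    timeout=60,
)
print(resp.json())
```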
View File

@@ -23,15 +23,14 @@ from pathlib import Path
from api.db.services.document_service import DocumentService
from api.db.services.file2document_service import File2DocumentService
from api.db.services.knowledgebase_service import KnowledgebaseService
- from api.utils.api_utils import server_error_response, token_required
+ from api.utils.api_utils import get_json_result, get_request_json, server_error_response, token_required
from common.misc_utils import get_uuid
from api.db import FileType
from api.db.services import duplicate_name
from api.db.services.file_service import FileService
- from api.utils.api_utils import get_json_result
from api.utils.file_utils import filename_type
from common import settings
+ from common.constants import RetCode
@manager.route('/file/upload', methods=['POST'])  # noqa: F821
@token_required

@@ -86,19 +85,19 @@ async def upload(tenant_id):
pf_id = root_folder["id"]
if 'file' not in files:
- return get_json_result(data=False, message='No file part!', code=400)
+ return get_json_result(data=False, message='No file part!', code=RetCode.BAD_REQUEST)
file_objs = files.getlist('file')
for file_obj in file_objs:
if file_obj.filename == '':
- return get_json_result(data=False, message='No selected file!', code=400)
+ return get_json_result(data=False, message='No selected file!', code=RetCode.BAD_REQUEST)
file_res = []
try:
e, pf_folder = FileService.get_by_id(pf_id)
if not e:
- return get_json_result(data=False, message="Can't find this folder!", code=404)
+ return get_json_result(data=False, message="Can't find this folder!", code=RetCode.NOT_FOUND)
for file_obj in file_objs:
# Handle file path

@@ -114,13 +113,13 @@ async def upload(tenant_id):
if file_len != len_id_list:
e, file = FileService.get_by_id(file_id_list[len_id_list - 1])
if not e:
- return get_json_result(data=False, message="Folder not found!", code=404)
+ return get_json_result(data=False, message="Folder not found!", code=RetCode.NOT_FOUND)
last_folder = FileService.create_folder(file, file_id_list[len_id_list - 1], file_obj_names,
len_id_list)
else:
e, file = FileService.get_by_id(file_id_list[len_id_list - 2])
if not e:
- return get_json_result(data=False, message="Folder not found!", code=404)
+ return get_json_result(data=False, message="Folder not found!", code=RetCode.NOT_FOUND)
last_folder = FileService.create_folder(file, file_id_list[len_id_list - 2], file_obj_names,
len_id_list)

@@ -193,16 +192,16 @@ async def create(tenant_id):
type:
type: string
"""
- req = await request.json
- pf_id = await request.json.get("parent_id")
- input_file_type = await request.json.get("type")
+ req = await get_request_json()
+ pf_id = req.get("parent_id")
+ input_file_type = req.get("type")
if not pf_id:
root_folder = FileService.get_root_folder(tenant_id)
pf_id = root_folder["id"]
try:
if not FileService.is_parent_folder_exist(pf_id):
- return get_json_result(data=False, message="Parent Folder Doesn't Exist!", code=400)
+ return get_json_result(data=False, message="Parent Folder Doesn't Exist!", code=RetCode.BAD_REQUEST)
if FileService.query(name=req["name"], parent_id=pf_id):
return get_json_result(data=False, message="Duplicated folder name in the same folder.", code=409)

@@ -229,7 +228,7 @@ async def create(tenant_id):
@manager.route('/file/list', methods=['GET'])  # noqa: F821
@token_required
- def list_files(tenant_id):
+ async def list_files(tenant_id):
"""
List files under a specific folder.
---

@@ -306,13 +305,13 @@ def list_files(tenant_id):
try:
e, file = FileService.get_by_id(pf_id)
if not e:
- return get_json_result(message="Folder not found!", code=404)
+ return get_json_result(message="Folder not found!", code=RetCode.NOT_FOUND)
files, total = FileService.get_by_pf_id(tenant_id, pf_id, page_number, items_per_page, orderby, desc, keywords)
parent_folder = FileService.get_parent_folder(pf_id)
if not parent_folder:
- return get_json_result(message="File not found!", code=404)
+ return get_json_result(message="File not found!", code=RetCode.NOT_FOUND)
return get_json_result(data={"total": total, "files": files, "parent_folder": parent_folder.to_json()})
except Exception as e:

@@ -321,7 +320,7 @@ def list_files(tenant_id):
@manager.route('/file/root_folder', methods=['GET'])  # noqa: F821
@token_required
- def get_root_folder(tenant_id):
+ async def get_root_folder(tenant_id):
"""
Get user's root folder.
---

@@ -357,7 +356,7 @@ def get_root_folder(tenant_id):
@manager.route('/file/parent_folder', methods=['GET'])  # noqa: F821
@token_required
- def get_parent_folder():
+ async def get_parent_folder():
"""
Get parent folder info of a file.
---

@@ -392,7 +391,7 @@ def get_parent_folder():
try:
e, file = FileService.get_by_id(file_id)
if not e:
- return get_json_result(message="Folder not found!", code=404)
+ return get_json_result(message="Folder not found!", code=RetCode.NOT_FOUND)
parent_folder = FileService.get_parent_folder(file_id)
return get_json_result(data={"parent_folder": parent_folder.to_json()})

@@ -402,7 +401,7 @@ def get_parent_folder():
@manager.route('/file/all_parent_folder', methods=['GET'])  # noqa: F821
@token_required
- def get_all_parent_folders(tenant_id):
+ async def get_all_parent_folders(tenant_id):
"""
Get all parent folders of a file.
---

@@ -439,7 +438,7 @@ def get_all_parent_folders(tenant_id):
try:
e, file = FileService.get_by_id(file_id)
if not e:
- return get_json_result(message="Folder not found!", code=404)
+ return get_json_result(message="Folder not found!", code=RetCode.NOT_FOUND)
parent_folders = FileService.get_all_parent_folders(file_id)
parent_folders_res = [folder.to_json() for folder in parent_folders]

@@ -481,40 +480,40 @@ async def rm(tenant_id):
type: boolean
example: true
"""
- req = await request.json
+ req = await get_request_json()
file_ids = req["file_ids"]
try:
for file_id in file_ids:
e, file = FileService.get_by_id(file_id)
if not e:
- return get_json_result(message="File or Folder not found!", code=404)
+ return get_json_result(message="File or Folder not found!", code=RetCode.NOT_FOUND)
if not file.tenant_id:
- return get_json_result(message="Tenant not found!", code=404)
+ return get_json_result(message="Tenant not found!", code=RetCode.NOT_FOUND)
if file.type == FileType.FOLDER.value:
file_id_list = FileService.get_all_innermost_file_ids(file_id, [])
for inner_file_id in file_id_list:
e, file = FileService.get_by_id(inner_file_id)
if not e:
- return get_json_result(message="File not found!", code=404)
+ return get_json_result(message="File not found!", code=RetCode.NOT_FOUND)
settings.STORAGE_IMPL.rm(file.parent_id, file.location)
FileService.delete_folder_by_pf_id(tenant_id, file_id)
else:
settings.STORAGE_IMPL.rm(file.parent_id, file.location)
if not FileService.delete(file):
- return get_json_result(message="Database error (File removal)!", code=500)
+ return get_json_result(message="Database error (File removal)!", code=RetCode.SERVER_ERROR)
informs = File2DocumentService.get_by_file_id(file_id)
for inform in informs:
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
- return get_json_result(message="Document not found!", code=404)
+ return get_json_result(message="Document not found!", code=RetCode.NOT_FOUND)
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
- return get_json_result(message="Tenant not found!", code=404)
+ return get_json_result(message="Tenant not found!", code=RetCode.NOT_FOUND)
if not DocumentService.remove_document(doc, tenant_id):
- return get_json_result(message="Database error (Document removal)!", code=500)
+ return get_json_result(message="Database error (Document removal)!", code=RetCode.SERVER_ERROR)
File2DocumentService.delete_by_file_id(file_id)
return get_json_result(data=True)

@@ -556,27 +555,27 @@ async def rename(tenant_id):
type: boolean
example: true
"""
- req = await request.json
+ req = await get_request_json()
try:
e, file = FileService.get_by_id(req["file_id"])
if not e:
- return get_json_result(message="File not found!", code=404)
+ return get_json_result(message="File not found!", code=RetCode.NOT_FOUND)
if file.type != FileType.FOLDER.value and pathlib.Path(req["name"].lower()).suffix != pathlib.Path(
file.name.lower()).suffix:
- return get_json_result(data=False, message="The extension of file can't be changed", code=400)
+ return get_json_result(data=False, message="The extension of file can't be changed", code=RetCode.BAD_REQUEST)
for existing_file in FileService.query(name=req["name"], pf_id=file.parent_id):
if existing_file.name == req["name"]:
return get_json_result(data=False, message="Duplicated file name in the same folder.", code=409)
if not FileService.update_by_id(req["file_id"], {"name": req["name"]}):
- return get_json_result(message="Database error (File rename)!", code=500)
+ return get_json_result(message="Database error (File rename)!", code=RetCode.SERVER_ERROR)
informs = File2DocumentService.get_by_file_id(req["file_id"])
if informs:
if not DocumentService.update_by_id(informs[0].document_id, {"name": req["name"]}):
- return get_json_result(message="Database error (Document rename)!", code=500)
+ return get_json_result(message="Database error (Document rename)!", code=RetCode.SERVER_ERROR)
return get_json_result(data=True)
except Exception as e:

@@ -606,13 +605,13 @@ async def get(tenant_id, file_id):
description: File stream
schema:
type: file
- 404:
+ RetCode.NOT_FOUND:
description: File not found
"""
try:
e, file = FileService.get_by_id(file_id)
if not e:
- return get_json_result(message="Document not found!", code=404)
+ return get_json_result(message="Document not found!", code=RetCode.NOT_FOUND)
blob = settings.STORAGE_IMPL.get(file.parent_id, file.location)
if not blob:

@@ -667,7 +666,7 @@ async def move(tenant_id):
type: boolean
example: true
"""
- req = await request.json
+ req = await get_request_json()
try:
file_ids = req["src_file_ids"]
parent_id = req["dest_file_id"]

@@ -677,13 +676,13 @@ async def move(tenant_id):
for file_id in file_ids:
file = files_dict[file_id]
if not file:
- return get_json_result(message="File or Folder not found!", code=404)
+ return get_json_result(message="File or Folder not found!", code=RetCode.NOT_FOUND)
if not file.tenant_id:
- return get_json_result(message="Tenant not found!", code=404)
+ return get_json_result(message="Tenant not found!", code=RetCode.NOT_FOUND)
fe, _ = FileService.get_by_id(parent_id)
if not fe:
- return get_json_result(message="Parent Folder not found!", code=404)
+ return get_json_result(message="Parent Folder not found!", code=RetCode.NOT_FOUND)
FileService.move_file(file_ids, parent_id)
return get_json_result(data=True)

@@ -694,7 +693,7 @@ async def move(tenant_id):
@manager.route('/file/convert', methods=['POST'])  # noqa: F821
@token_required
async def convert(tenant_id):
- req = await request.json
+ req = await get_request_json()
kb_ids = req["kb_ids"]
file_ids = req["file_ids"]
file2documents = []

@@ -705,7 +704,7 @@ async def convert(tenant_id):
for file_id in file_ids:
file = files_set[file_id]
if not file:
- return get_json_result(message="File not found!", code=404)
+ return get_json_result(message="File not found!", code=RetCode.NOT_FOUND)
file_ids_list = [file_id]
if file.type == FileType.FOLDER.value:
file_ids_list = FileService.get_all_innermost_file_ids(file_id, [])

@@ -716,13 +715,13 @@ async def convert(tenant_id):
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
- return get_json_result(message="Document not found!", code=404)
+ return get_json_result(message="Document not found!", code=RetCode.NOT_FOUND)
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
- return get_json_result(message="Tenant not found!", code=404)
+ return get_json_result(message="Tenant not found!", code=RetCode.NOT_FOUND)
if not DocumentService.remove_document(doc, tenant_id):
return get_json_result(
- message="Database error (Document removal)!", code=404)
+ message="Database error (Document removal)!", code=RetCode.NOT_FOUND)
File2DocumentService.delete_by_file_id(id)
# insert

@@ -730,11 +729,11 @@ async def convert(tenant_id):
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
return get_json_result(
- message="Can't find this knowledgebase!", code=404)
+ message="Can't find this knowledgebase!", code=RetCode.NOT_FOUND)
e, file = FileService.get_by_id(id)
if not e:
return get_json_result(
- message="Can't find this file!", code=404)
+ message="Can't find this file!", code=RetCode.NOT_FOUND)
doc = DocumentService.insert({
"id": get_uuid(),

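Throughout the file-manager endpoints the bare HTTP numbers (400/404/500) are replaced with named `RetCode` constants from `common.constants`. A minimal sketch of the members these handlers rely on, with values assumed to mirror the codes they replace; the real enum contains more members and may be defined differently:

```python
# Assumed shape of the relevant RetCode members (sketch only).
from enum import IntEnum


class RetCode(IntEnum):
    BAD_REQUEST = 400
    NOT_FOUND = 404
    SERVER_ERROR = 500
```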
View File

@@ -35,7 +35,7 @@ from api.db.services.search_service import SearchService
from api.db.services.user_service import UserTenantService
from common.misc_utils import get_uuid
from api.utils.api_utils import check_duplicate_ids, get_data_openai, get_error_data_result, get_json_result, \
-     get_result, server_error_response, token_required, validate_request
+     get_result, get_request_json, server_error_response, token_required, validate_request
from rag.app.tag import label_question
from rag.prompts.template import load_prompt
from rag.prompts.generator import cross_languages, gen_meta_filter, keyword_extraction, chunks_format

@@ -45,7 +45,7 @@ from common import settings
@manager.route("/chats/<chat_id>/sessions", methods=["POST"])  # noqa: F821
@token_required
async def create(tenant_id, chat_id):
- req = await request.json
+ req = await get_request_json()
req["dialog_id"] = chat_id
dia = DialogService.query(tenant_id=tenant_id, id=req["dialog_id"], status=StatusEnum.VALID.value)
if not dia:

@@ -73,7 +73,7 @@ async def create(tenant_id, chat_id):
@manager.route("/agents/<agent_id>/sessions", methods=["POST"])  # noqa: F821
@token_required
- def create_agent_session(tenant_id, agent_id):
+ async def create_agent_session(tenant_id, agent_id):
user_id = request.args.get("user_id", tenant_id)
e, cvs = UserCanvasService.get_by_id(agent_id)
if not e:

@@ -98,7 +98,7 @@ def create_agent_session(tenant_id, agent_id):
@manager.route("/chats/<chat_id>/sessions/<session_id>", methods=["PUT"])  # noqa: F821
@token_required
async def update(tenant_id, chat_id, session_id):
- req = await request.json
+ req = await get_request_json()
req["dialog_id"] = chat_id
conv_id = session_id
conv = ConversationService.query(id=conv_id, dialog_id=chat_id)

@@ -120,7 +120,7 @@ async def update(tenant_id, chat_id, session_id):
@manager.route("/chats/<chat_id>/completions", methods=["POST"])  # noqa: F821
@token_required
async def chat_completion(tenant_id, chat_id):
- req = await request.json
+ req = await get_request_json()
if not req:
req = {"question": ""}
if not req.get("session_id"):

@@ -206,7 +206,7 @@ async def chat_completion_openai_like(tenant_id, chat_id):
if reference:
print(completion.choices[0].message.reference)
"""
- req = await request.get_json()
+ req = await get_request_json()
need_reference = bool(req.get("reference", False))

@@ -384,7 +384,7 @@ async def chat_completion_openai_like(tenant_id, chat_id):
@validate_request("model", "messages")  # noqa: F821
@token_required
async def agents_completion_openai_compatibility(tenant_id, agent_id):
- req = await request.json
+ req = await get_request_json()
tiktokenenc = tiktoken.get_encoding("cl100k_base")
messages = req.get("messages", [])
if not messages:

@@ -428,28 +428,26 @@ async def agents_completion_openai_compatibility(tenant_id, agent_id):
return resp
else:
# For non-streaming, just return the response directly
- response = next(
-     completion_openai(
+ async for response in completion_openai(
tenant_id,
agent_id,
question,
session_id=req.pop("session_id", req.get("id", "")) or req.get("metadata", {}).get("id", ""),
stream=False,
**req,
- )
- )
- return jsonify(response)
+ ):
+     return jsonify(response)
@manager.route("/agents/<agent_id>/completions", methods=["POST"])  # noqa: F821
@token_required
async def agent_completions(tenant_id, agent_id):
- req = await request.json
+ req = await get_request_json()
if req.get("stream", True):
- def generate():
-     for answer in agent_completion(tenant_id=tenant_id, agent_id=agent_id, **req):
+ async def generate():
+     async for answer in agent_completion(tenant_id=tenant_id, agent_id=agent_id, **req):
if isinstance(answer, str):
try:
ans = json.loads(answer[5:])  # remove "data:"

@@ -473,7 +471,7 @@ async def agent_completions(tenant_id, agent_id):
full_content = ""
reference = {}
final_ans = ""
- for answer in agent_completion(tenant_id=tenant_id, agent_id=agent_id, **req):
+ async for answer in agent_completion(tenant_id=tenant_id, agent_id=agent_id, **req):
try:
ans = json.loads(answer[5:])

@@ -493,7 +491,7 @@ async def agent_completions(tenant_id, agent_id):
@manager.route("/chats/<chat_id>/sessions", methods=["GET"])  # noqa: F821
@token_required
- def list_session(tenant_id, chat_id):
+ async def list_session(tenant_id, chat_id):
if not DialogService.query(tenant_id=tenant_id, id=chat_id, status=StatusEnum.VALID.value):
return get_error_data_result(message=f"You don't own the assistant {chat_id}.")
id = request.args.get("id")

@@ -547,7 +545,7 @@ def list_session(tenant_id, chat_id):
@manager.route("/agents/<agent_id>/sessions", methods=["GET"])  # noqa: F821
@token_required
- def list_agent_session(tenant_id, agent_id):
+ async def list_agent_session(tenant_id, agent_id):
if not UserCanvasService.query(user_id=tenant_id, id=agent_id):
return get_error_data_result(message=f"You don't own the agent {agent_id}.")
id = request.args.get("id")

@@ -616,7 +614,7 @@ async def delete(tenant_id, chat_id):
errors = []
success_count = 0
- req = await request.json
+ req = await get_request_json()
convs = ConversationService.query(dialog_id=chat_id)
if not req:
ids = None

@@ -664,7 +662,7 @@ async def delete(tenant_id, chat_id):
async def delete_agent_session(tenant_id, agent_id):
errors = []
success_count = 0
- req = await request.json
+ req = await get_request_json()
cvs = UserCanvasService.query(user_id=tenant_id, id=agent_id)
if not cvs:
return get_error_data_result(f"You don't own the agent {agent_id}")

@@ -717,7 +715,7 @@ async def delete_agent_session(tenant_id, agent_id):
@manager.route("/sessions/ask", methods=["POST"])  # noqa: F821
@token_required
async def ask_about(tenant_id):
- req = await request.json
+ req = await get_request_json()
if not req.get("question"):
return get_error_data_result("`question` is required.")
if not req.get("dataset_ids"):

@@ -756,7 +754,7 @@ async def ask_about(tenant_id):
@manager.route("/sessions/related_questions", methods=["POST"])  # noqa: F821
@token_required
async def related_questions(tenant_id):
- req = await request.json
+ req = await get_request_json()
if not req.get("question"):
return get_error_data_result("`question` is required.")
question = req["question"]

@@ -807,7 +805,7 @@ Related search terms:
@manager.route("/chatbots/<dialog_id>/completions", methods=["POST"])  # noqa: F821
async def chatbot_completions(dialog_id):
- req = await request.json
+ req = await get_request_json()
token = request.headers.get("Authorization").split()
if len(token) != 2:

@@ -833,7 +831,7 @@ async def chatbot_completions(dialog_id):
@manager.route("/chatbots/<dialog_id>/info", methods=["GET"])  # noqa: F821
- def chatbots_inputs(dialog_id):
+ async def chatbots_inputs(dialog_id):
token = request.headers.get("Authorization").split()
if len(token) != 2:
return get_error_data_result(message='Authorization is not valid!"')

@@ -857,7 +855,7 @@ def chatbots_inputs(dialog_id):
@manager.route("/agentbots/<agent_id>/completions", methods=["POST"])  # noqa: F821
async def agent_bot_completions(agent_id):
- req = await request.json
+ req = await get_request_json()
token = request.headers.get("Authorization").split()
if len(token) != 2:

@@ -875,12 +873,12 @@ async def agent_bot_completions(agent_id):
resp.headers.add_header("Content-Type", "text/event-stream; charset=utf-8")
return resp
- for answer in agent_completion(objs[0].tenant_id, agent_id, **req):
+ async for answer in agent_completion(objs[0].tenant_id, agent_id, **req):
return get_result(data=answer)
@manager.route("/agentbots/<agent_id>/inputs", methods=["GET"])  # noqa: F821
- def begin_inputs(agent_id):
+ async def begin_inputs(agent_id):
token = request.headers.get("Authorization").split()
if len(token) != 2:
return get_error_data_result(message='Authorization is not valid!"')

@@ -910,7 +908,7 @@ async def ask_about_embedded():
if not objs:
return get_error_data_result(message='Authentication error: API key is invalid!"')
- req = await request.json
+ req = await get_request_json()
uid = objs[0].tenant_id
search_id = req.get("search_id", "")

@@ -949,7 +947,7 @@ async def retrieval_test_embedded():
if not objs:
return get_error_data_result(message='Authentication error: API key is invalid!"')
- req = await request.json
+ req = await get_request_json()
page = int(req.get("page", 1))
size = int(req.get("size", 30))
question = req["question"]

@@ -977,14 +975,14 @@ async def retrieval_test_embedded():
metas = DocumentService.get_meta_by_kbs(kb_ids)
if meta_data_filter.get("method") == "auto":
chat_mdl = LLMBundle(tenant_id, LLMType.CHAT, llm_name=search_config.get("chat_id", ""))
- filters = gen_meta_filter(chat_mdl, metas, question)
- doc_ids.extend(meta_filter(metas, filters))
+ filters: dict = gen_meta_filter(chat_mdl, metas, question)
+ doc_ids.extend(meta_filter(metas, filters["conditions"], filters.get("logic", "and")))
if not doc_ids:
doc_ids = None
elif meta_data_filter.get("method") == "manual":
- doc_ids.extend(meta_filter(metas, meta_data_filter["manual"]))
- if not doc_ids:
-     doc_ids = None
+ doc_ids.extend(meta_filter(metas, meta_data_filter["manual"], meta_data_filter.get("logic", "and")))
+ if meta_data_filter["manual"] and not doc_ids:
+     doc_ids = ["-999"]
try:
tenants = UserTenantService.query(user_id=tenant_id)

@@ -1048,7 +1046,7 @@ async def related_questions_embedded():
if not objs:
return get_error_data_result(message='Authentication error: API key is invalid!"')
- req = await request.json
+ req = await get_request_json()
tenant_id = objs[0].tenant_id
if not tenant_id:
return get_error_data_result(message="permission denined.")

@@ -1083,7 +1081,7 @@ Related search terms:
@manager.route("/searchbots/detail", methods=["GET"])  # noqa: F821
- def detail_share_embedded():
+ async def detail_share_embedded():
token = request.headers.get("Authorization").split()
if len(token) != 2:
return get_error_data_result(message='Authorization is not valid!"')

@@ -1125,7 +1123,7 @@ async def mindmap():
return get_error_data_result(message='Authentication error: API key is invalid!"')
tenant_id = objs[0].tenant_id
- req = await request.json
+ req = await get_request_json()
search_id = req.get("search_id", "")
search_app = SearchService.get_detail(search_id) if search_id else {}

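The OpenAI-compatible non-streaming branch above now drives `completion_openai(...)` as an async generator and returns on its first yielded item, instead of wrapping it in `next(...)`. A minimal, self-contained illustration of that pattern; the generator here is a stand-in, not the project's implementation:

```python
# Stand-in async generator to illustrate "return the first yielded item".
import asyncio


async def completion_openai(question: str, stream: bool = False):
    yield {"answer": f"echo: {question}"}  # placeholder payload


async def non_streaming_answer(question: str) -> dict:
    async for response in completion_openai(question, stream=False):
        return response  # first (and only) item is the complete response
    return {}  # generator yielded nothing


print(asyncio.run(non_streaming_answer("hello")))
```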
View File

@@ -24,14 +24,14 @@ from api.db.services.search_service import SearchService
from api.db.services.user_service import TenantService, UserTenantService
from common.misc_utils import get_uuid
from common.constants import RetCode, StatusEnum
- from api.utils.api_utils import get_data_error_result, get_json_result, not_allowed_parameters, server_error_response, validate_request
+ from api.utils.api_utils import get_data_error_result, get_json_result, not_allowed_parameters, get_request_json, server_error_response, validate_request
@manager.route("/create", methods=["post"])  # noqa: F821
@login_required
@validate_request("name")
async def create():
- req = await request.get_json()
+ req = await get_request_json()
search_name = req["name"]
description = req.get("description", "")
if not isinstance(search_name, str):

@@ -66,7 +66,7 @@ async def create():
@validate_request("search_id", "name", "search_config", "tenant_id")
@not_allowed_parameters("id", "created_by", "create_time", "update_time", "create_date", "update_date", "created_by")
async def update():
- req = await request.get_json()
+ req = await get_request_json()
if not isinstance(req["name"], str):
return get_data_error_result(message="Search name must be string.")
if req["name"].strip() == "":

@@ -150,7 +150,7 @@ async def list_search_app():
else:
desc = True
- req = await request.get_json()
+ req = await get_request_json()
owner_ids = req.get("owner_ids", [])
try:
if not owner_ids:

@@ -174,7 +174,7 @@ async def list_search_app():
@login_required
@validate_request("search_id")
async def rm():
- req = await request.get_json()
+ req = await get_request_json()
search_id = req["search_id"]
if not SearchService.accessible4deletion(search_id, current_user.id):
return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR)

View File

@ -14,7 +14,6 @@
# limitations under the License. # limitations under the License.
# #
from quart import request
from api.db import UserTenantRole from api.db import UserTenantRole
from api.db.db_models import UserTenant from api.db.db_models import UserTenant
from api.db.services.user_service import UserTenantService, UserService from api.db.services.user_service import UserTenantService, UserService
@ -22,7 +21,7 @@ from api.db.services.user_service import UserTenantService, UserService
from common.constants import RetCode, StatusEnum from common.constants import RetCode, StatusEnum
from common.misc_utils import get_uuid from common.misc_utils import get_uuid
from common.time_utils import delta_seconds from common.time_utils import delta_seconds
from api.utils.api_utils import get_json_result, validate_request, server_error_response, get_data_error_result from api.utils.api_utils import get_data_error_result, get_json_result, get_request_json, server_error_response, validate_request
from api.utils.web_utils import send_invite_email from api.utils.web_utils import send_invite_email
from common import settings from common import settings
from api.apps import smtp_mail_server, login_required, current_user from api.apps import smtp_mail_server, login_required, current_user
@ -56,7 +55,7 @@ async def create(tenant_id):
message='No authorization.', message='No authorization.',
code=RetCode.AUTHENTICATION_ERROR) code=RetCode.AUTHENTICATION_ERROR)
req = await request.json req = await get_request_json()
invite_user_email = req["email"] invite_user_email = req["email"]
invite_users = UserService.query(email=invite_user_email) invite_users = UserService.query(email=invite_user_email)
if not invite_users: if not invite_users:

View File

@ -39,6 +39,7 @@ from common.connection_utils import construct_response
from api.utils.api_utils import ( from api.utils.api_utils import (
get_data_error_result, get_data_error_result,
get_json_result, get_json_result,
get_request_json,
server_error_response, server_error_response,
validate_request, validate_request,
) )
@ -57,6 +58,7 @@ from api.utils.web_utils import (
captcha_key, captcha_key,
) )
from common import settings from common import settings
from common.http_client import async_request
@manager.route("/login", methods=["POST", "GET"]) # noqa: F821 @manager.route("/login", methods=["POST", "GET"]) # noqa: F821
@ -90,7 +92,7 @@ async def login():
schema: schema:
type: object type: object
""" """
json_body = await request.json json_body = await get_request_json()
if not json_body: if not json_body:
return get_json_result(data=False, code=RetCode.AUTHENTICATION_ERROR, message="Unauthorized!") return get_json_result(data=False, code=RetCode.AUTHENTICATION_ERROR, message="Unauthorized!")
@ -121,8 +123,8 @@ async def login():
response_data = user.to_json() response_data = user.to_json()
user.access_token = get_uuid() user.access_token = get_uuid()
login_user(user) login_user(user)
user.update_time = (current_timestamp(),) user.update_time = current_timestamp()
user.update_date = (datetime_format(datetime.now()),) user.update_date = datetime_format(datetime.now())
user.save() user.save()
msg = "Welcome back!" msg = "Welcome back!"
@ -136,7 +138,7 @@ async def login():
@manager.route("/login/channels", methods=["GET"]) # noqa: F821 @manager.route("/login/channels", methods=["GET"]) # noqa: F821
def get_login_channels(): async def get_login_channels():
""" """
Get all supported authentication channels. Get all supported authentication channels.
""" """
@ -157,7 +159,7 @@ def get_login_channels():
@manager.route("/login/<channel>", methods=["GET"]) # noqa: F821 @manager.route("/login/<channel>", methods=["GET"]) # noqa: F821
def oauth_login(channel): async def oauth_login(channel):
channel_config = settings.OAUTH_CONFIG.get(channel) channel_config = settings.OAUTH_CONFIG.get(channel)
if not channel_config: if not channel_config:
raise ValueError(f"Invalid channel name: {channel}") raise ValueError(f"Invalid channel name: {channel}")
@ -170,7 +172,7 @@ def oauth_login(channel):
@manager.route("/oauth/callback/<channel>", methods=["GET"]) # noqa: F821 @manager.route("/oauth/callback/<channel>", methods=["GET"]) # noqa: F821
def oauth_callback(channel): async def oauth_callback(channel):
""" """
Handle the OAuth/OIDC callback for various channels dynamically. Handle the OAuth/OIDC callback for various channels dynamically.
""" """
@ -192,7 +194,10 @@ def oauth_callback(channel):
return redirect("/?error=missing_code") return redirect("/?error=missing_code")
# Exchange authorization code for access token # Exchange authorization code for access token
token_info = auth_cli.exchange_code_for_token(code) if hasattr(auth_cli, "async_exchange_code_for_token"):
token_info = await auth_cli.async_exchange_code_for_token(code)
else:
token_info = auth_cli.exchange_code_for_token(code)
access_token = token_info.get("access_token") access_token = token_info.get("access_token")
if not access_token: if not access_token:
return redirect("/?error=token_failed") return redirect("/?error=token_failed")
@ -200,7 +205,10 @@ def oauth_callback(channel):
id_token = token_info.get("id_token") id_token = token_info.get("id_token")
# Fetch user info # Fetch user info
user_info = auth_cli.fetch_user_info(access_token, id_token=id_token) if hasattr(auth_cli, "async_fetch_user_info"):
user_info = await auth_cli.async_fetch_user_info(access_token, id_token=id_token)
else:
user_info = auth_cli.fetch_user_info(access_token, id_token=id_token)
if not user_info.email: if not user_info.email:
return redirect("/?error=email_missing") return redirect("/?error=email_missing")
@ -259,7 +267,7 @@ def oauth_callback(channel):
@manager.route("/github_callback", methods=["GET"]) # noqa: F821 @manager.route("/github_callback", methods=["GET"]) # noqa: F821
def github_callback(): async def github_callback():
""" """
**Deprecated**, Use `/oauth/callback/<channel>` instead. **Deprecated**, Use `/oauth/callback/<channel>` instead.
@ -279,9 +287,8 @@ def github_callback():
schema: schema:
type: object type: object
""" """
import requests res = await async_request(
"POST",
res = requests.post(
settings.GITHUB_OAUTH.get("url"), settings.GITHUB_OAUTH.get("url"),
data={ data={
"client_id": settings.GITHUB_OAUTH.get("client_id"), "client_id": settings.GITHUB_OAUTH.get("client_id"),
@ -299,7 +306,7 @@ def github_callback():
session["access_token"] = res["access_token"] session["access_token"] = res["access_token"]
session["access_token_from"] = "github" session["access_token_from"] = "github"
user_info = user_info_from_github(session["access_token"]) user_info = await user_info_from_github(session["access_token"])
email_address = user_info["email"] email_address = user_info["email"]
users = UserService.query(email=email_address) users = UserService.query(email=email_address)
user_id = get_uuid() user_id = get_uuid()
@ -348,7 +355,7 @@ def github_callback():
@manager.route("/feishu_callback", methods=["GET"]) # noqa: F821 @manager.route("/feishu_callback", methods=["GET"]) # noqa: F821
def feishu_callback(): async def feishu_callback():
""" """
Feishu OAuth callback endpoint. Feishu OAuth callback endpoint.
--- ---
@ -366,9 +373,8 @@ def feishu_callback():
schema: schema:
type: object type: object
""" """
import requests app_access_token_res = await async_request(
"POST",
app_access_token_res = requests.post(
settings.FEISHU_OAUTH.get("app_access_token_url"), settings.FEISHU_OAUTH.get("app_access_token_url"),
data=json.dumps( data=json.dumps(
{ {
@ -382,7 +388,8 @@ def feishu_callback():
if app_access_token_res["code"] != 0: if app_access_token_res["code"] != 0:
return redirect("/?error=%s" % app_access_token_res) return redirect("/?error=%s" % app_access_token_res)
res = requests.post( res = await async_request(
"POST",
settings.FEISHU_OAUTH.get("user_access_token_url"), settings.FEISHU_OAUTH.get("user_access_token_url"),
data=json.dumps( data=json.dumps(
{ {
@ -403,7 +410,7 @@ def feishu_callback():
return redirect("/?error=contact:user.email:readonly not in scope") return redirect("/?error=contact:user.email:readonly not in scope")
session["access_token"] = res["data"]["access_token"] session["access_token"] = res["data"]["access_token"]
session["access_token_from"] = "feishu" session["access_token_from"] = "feishu"
user_info = user_info_from_feishu(session["access_token"]) user_info = await user_info_from_feishu(session["access_token"])
email_address = user_info["email"] email_address = user_info["email"]
users = UserService.query(email=email_address) users = UserService.query(email=email_address)
user_id = get_uuid() user_id = get_uuid()
@ -451,36 +458,34 @@ def feishu_callback():
return redirect("/?auth=%s" % user.get_id()) return redirect("/?auth=%s" % user.get_id())
def user_info_from_feishu(access_token): async def user_info_from_feishu(access_token):
import requests
headers = { headers = {
"Content-Type": "application/json; charset=utf-8", "Content-Type": "application/json; charset=utf-8",
"Authorization": f"Bearer {access_token}", "Authorization": f"Bearer {access_token}",
} }
res = requests.get("https://open.feishu.cn/open-apis/authen/v1/user_info", headers=headers) res = await async_request("GET", "https://open.feishu.cn/open-apis/authen/v1/user_info", headers=headers)
user_info = res.json()["data"] user_info = res.json()["data"]
user_info["email"] = None if user_info.get("email") == "" else user_info["email"] user_info["email"] = None if user_info.get("email") == "" else user_info["email"]
return user_info return user_info
def user_info_from_github(access_token): async def user_info_from_github(access_token):
import requests
headers = {"Accept": "application/json", "Authorization": f"token {access_token}"} headers = {"Accept": "application/json", "Authorization": f"token {access_token}"}
res = requests.get(f"https://api.github.com/user?access_token={access_token}", headers=headers) res = await async_request("GET", f"https://api.github.com/user?access_token={access_token}", headers=headers)
user_info = res.json() user_info = res.json()
email_info = requests.get( email_info_response = await async_request(
"GET",
f"https://api.github.com/user/emails?access_token={access_token}", f"https://api.github.com/user/emails?access_token={access_token}",
headers=headers, headers=headers,
).json() )
email_info = email_info_response.json()
user_info["email"] = next((email for email in email_info if email["primary"]), None)["email"] user_info["email"] = next((email for email in email_info if email["primary"]), None)["email"]
return user_info return user_info
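Editor's note: the callbacks and the two user-info helpers now go through `common.http_client.async_request` instead of blocking `requests` calls, and their call sites must `await` them. The real helper is not shown in this excerpt; one plausible shape, assuming an httpx-based implementation whose responses expose `.json()` as the code above expects:

    import httpx

    async def async_request(method: str, url: str, **kwargs) -> httpx.Response:
        # Responses expose .json(), matching how the callbacks above consume them.
        async with httpx.AsyncClient(timeout=30) as client:
            return await client.request(method, url, **kwargs)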
@manager.route("/logout", methods=["GET"]) # noqa: F821 @manager.route("/logout", methods=["GET"]) # noqa: F821
@login_required @login_required
def log_out(): async def log_out():
""" """
User logout endpoint. User logout endpoint.
--- ---
@ -531,7 +536,7 @@ async def setting_user():
type: object type: object
""" """
update_dict = {} update_dict = {}
request_data = await request.json request_data = await get_request_json()
if request_data.get("password"): if request_data.get("password"):
new_password = request_data.get("new_password") new_password = request_data.get("new_password")
if not check_password_hash(current_user.password, decrypt(request_data["password"])): if not check_password_hash(current_user.password, decrypt(request_data["password"])):
@ -570,7 +575,7 @@ async def setting_user():
@manager.route("/info", methods=["GET"]) # noqa: F821 @manager.route("/info", methods=["GET"]) # noqa: F821
@login_required @login_required
def user_profile(): async def user_profile():
""" """
Get user profile information. Get user profile information.
--- ---
@ -698,7 +703,7 @@ async def user_add():
code=RetCode.OPERATING_ERROR, code=RetCode.OPERATING_ERROR,
) )
req = await request.json req = await get_request_json()
email_address = req["email"] email_address = req["email"]
# Validate the email address # Validate the email address
@ -755,7 +760,7 @@ async def user_add():
@manager.route("/tenant_info", methods=["GET"]) # noqa: F821 @manager.route("/tenant_info", methods=["GET"]) # noqa: F821
@login_required @login_required
def tenant_info(): async def tenant_info():
""" """
Get tenant information. Get tenant information.
--- ---
@ -831,14 +836,14 @@ async def set_tenant_info():
schema: schema:
type: object type: object
""" """
req = await request.json req = await get_request_json()
try: try:
tid = req.pop("tenant_id") tid = req.pop("tenant_id")
TenantService.update_by_id(tid, req) TenantService.update_by_id(tid, req)
return get_json_result(data=True) return get_json_result(data=True)
except Exception as e: except Exception as e:
return server_error_response(e) return server_error_response(e)
@manager.route("/forget/captcha", methods=["GET"]) # noqa: F821 @manager.route("/forget/captcha", methods=["GET"]) # noqa: F821
async def forget_get_captcha(): async def forget_get_captcha():
@ -875,7 +880,7 @@ async def forget_send_otp():
- Verify the image captcha stored at captcha:{email} (case-insensitive). - Verify the image captcha stored at captcha:{email} (case-insensitive).
- On success, generate an email OTP (AZ with length = OTP_LENGTH), store hash + salt (and timestamp) in Redis with TTL, reset attempts and cooldown, and send the OTP via email. - On success, generate an email OTP (AZ with length = OTP_LENGTH), store hash + salt (and timestamp) in Redis with TTL, reset attempts and cooldown, and send the OTP via email.
- On success, generate an email OTP (A–Z with length = OTP_LENGTH), store hash + salt (and timestamp) in Redis with TTL, reset attempts and cooldown, and send the OTP via email. - On success, generate an email OTP (A–Z with length = OTP_LENGTH), store hash + salt (and timestamp) in Redis with TTL, reset attempts and cooldown, and send the OTP via email.
""" """
req = await request.get_json() req = await get_request_json()
email = req.get("email") or "" email = req.get("email") or ""
captcha = (req.get("captcha") or "").strip() captcha = (req.get("captcha") or "").strip()
@ -931,7 +936,7 @@ async def forget_send_otp():
) )
except Exception: except Exception:
return get_json_result(data=False, code=RetCode.SERVER_ERROR, message="failed to send email") return get_json_result(data=False, code=RetCode.SERVER_ERROR, message="failed to send email")
return get_json_result(data=True, code=RetCode.SUCCESS, message="verification passed, email sent") return get_json_result(data=True, code=RetCode.SUCCESS, message="verification passed, email sent")
@ -941,7 +946,7 @@ async def forget():
POST: Verify email + OTP and reset password, then log the user in. POST: Verify email + OTP and reset password, then log the user in.
Request JSON: { email, otp, new_password, confirm_new_password } Request JSON: { email, otp, new_password, confirm_new_password }
""" """
req = await request.get_json() req = await get_request_json()
email = req.get("email") or "" email = req.get("email") or ""
otp = (req.get("otp") or "").strip() otp = (req.get("otp") or "").strip()
new_pwd = req.get("new_password") new_pwd = req.get("new_password")
@ -1002,8 +1007,8 @@ async def forget():
# Auto login (reuse login flow) # Auto login (reuse login flow)
user.access_token = get_uuid() user.access_token = get_uuid()
login_user(user) login_user(user)
user.update_time = (current_timestamp(),) user.update_time = current_timestamp()
user.update_date = (datetime_format(datetime.now()),) user.update_date = datetime_format(datetime.now())
user.save() user.save()
msg = "Password reset successful. Logged in." msg = "Password reset successful. Logged in."
return construct_response(data=user.to_json(), auth=user.get_id(), message=msg) return await construct_response(data=user.to_json(), auth=user.get_id(), message=msg)

View File

@ -749,7 +749,7 @@ class Knowledgebase(DataBaseModel):
parser_id = CharField(max_length=32, null=False, help_text="default parser ID", default=ParserType.NAIVE.value, index=True) parser_id = CharField(max_length=32, null=False, help_text="default parser ID", default=ParserType.NAIVE.value, index=True)
pipeline_id = CharField(max_length=32, null=True, help_text="Pipeline ID", index=True) pipeline_id = CharField(max_length=32, null=True, help_text="Pipeline ID", index=True)
parser_config = JSONField(null=False, default={"pages": [[1, 1000000]]}) parser_config = JSONField(null=False, default={"pages": [[1, 1000000]], "table_context_size": 0, "image_context_size": 0})
pagerank = IntegerField(default=0, index=False) pagerank = IntegerField(default=0, index=False)
graphrag_task_id = CharField(max_length=32, null=True, help_text="Graph RAG task ID", index=True) graphrag_task_id = CharField(max_length=32, null=True, help_text="Graph RAG task ID", index=True)
@ -774,7 +774,7 @@ class Document(DataBaseModel):
kb_id = CharField(max_length=256, null=False, index=True) kb_id = CharField(max_length=256, null=False, index=True)
parser_id = CharField(max_length=32, null=False, help_text="default parser ID", index=True) parser_id = CharField(max_length=32, null=False, help_text="default parser ID", index=True)
pipeline_id = CharField(max_length=32, null=True, help_text="pipeline ID", index=True) pipeline_id = CharField(max_length=32, null=True, help_text="pipeline ID", index=True)
parser_config = JSONField(null=False, default={"pages": [[1, 1000000]]}) parser_config = JSONField(null=False, default={"pages": [[1, 1000000]], "table_context_size": 0, "image_context_size": 0})
source_type = CharField(max_length=128, null=False, default="local", help_text="where dose this document come from", index=True) source_type = CharField(max_length=128, null=False, default="local", help_text="where dose this document come from", index=True)
type = CharField(max_length=32, null=False, help_text="file extension", index=True) type = CharField(max_length=32, null=False, help_text="file extension", index=True)
created_by = CharField(max_length=32, null=False, help_text="who created it", index=True) created_by = CharField(max_length=32, null=False, help_text="who created it", index=True)

View File

@ -34,14 +34,17 @@ from common.file_utils import get_project_base_directory
from common import settings from common import settings
from api.common.base64 import encode_to_base64 from api.common.base64 import encode_to_base64
DEFAULT_SUPERUSER_NICKNAME = os.getenv("DEFAULT_SUPERUSER_NICKNAME", "admin")
DEFAULT_SUPERUSER_EMAIL = os.getenv("DEFAULT_SUPERUSER_EMAIL", "admin@ragflow.io")
DEFAULT_SUPERUSER_PASSWORD = os.getenv("DEFAULT_SUPERUSER_PASSWORD", "admin")
def init_superuser(): def init_superuser(nickname=DEFAULT_SUPERUSER_NICKNAME, email=DEFAULT_SUPERUSER_EMAIL, password=DEFAULT_SUPERUSER_PASSWORD, role=UserTenantRole.OWNER):
user_info = { user_info = {
"id": uuid.uuid1().hex, "id": uuid.uuid1().hex,
"password": encode_to_base64("admin"), "password": encode_to_base64(password),
"nickname": "admin", "nickname": nickname,
"is_superuser": True, "is_superuser": True,
"email": "admin@ragflow.io", "email": email,
"creator": "system", "creator": "system",
"status": "1", "status": "1",
} }
@ -58,7 +61,7 @@ def init_superuser():
"tenant_id": user_info["id"], "tenant_id": user_info["id"],
"user_id": user_info["id"], "user_id": user_info["id"],
"invited_by": user_info["id"], "invited_by": user_info["id"],
"role": UserTenantRole.OWNER "role": role
} }
tenant_llm = get_init_tenant_llm(user_info["id"]) tenant_llm = get_init_tenant_llm(user_info["id"])
@ -70,7 +73,7 @@ def init_superuser():
UserTenantService.insert(**usr_tenant) UserTenantService.insert(**usr_tenant)
TenantLLMService.insert_many(tenant_llm) TenantLLMService.insert_many(tenant_llm)
logging.info( logging.info(
"Super user initialized. email: admin@ragflow.io, password: admin. Changing the password after login is strongly recommended.") f"Super user initialized. email: {email}, password: {password}. Changing the password after login is strongly recommended.")
chat_mdl = LLMBundle(tenant["id"], LLMType.CHAT, tenant["llm_id"]) chat_mdl = LLMBundle(tenant["id"], LLMType.CHAT, tenant["llm_id"])
msg = chat_mdl.chat(system="", history=[ msg = chat_mdl.chat(system="", history=[

View File

@ -177,7 +177,7 @@ class UserCanvasService(CommonService):
return True return True
def completion(tenant_id, agent_id, session_id=None, **kwargs): async def completion(tenant_id, agent_id, session_id=None, **kwargs):
query = kwargs.get("query", "") or kwargs.get("question", "") query = kwargs.get("query", "") or kwargs.get("question", "")
files = kwargs.get("files", []) files = kwargs.get("files", [])
inputs = kwargs.get("inputs", {}) inputs = kwargs.get("inputs", {})
@ -219,10 +219,14 @@ def completion(tenant_id, agent_id, session_id=None, **kwargs):
"id": message_id "id": message_id
}) })
txt = "" txt = ""
for ans in canvas.run(query=query, files=files, user_id=user_id, inputs=inputs): async for ans in canvas.run(query=query, files=files, user_id=user_id, inputs=inputs):
ans["session_id"] = session_id ans["session_id"] = session_id
if ans["event"] == "message": if ans["event"] == "message":
txt += ans["data"]["content"] txt += ans["data"]["content"]
if ans["data"].get("start_to_think", False):
txt += "<think>"
elif ans["data"].get("end_to_think", False):
txt += "</think>"
yield "data:" + json.dumps(ans, ensure_ascii=False) + "\n\n" yield "data:" + json.dumps(ans, ensure_ascii=False) + "\n\n"
conv.message.append({"role": "assistant", "content": txt, "created_at": time.time(), "id": message_id}) conv.message.append({"role": "assistant", "content": txt, "created_at": time.time(), "id": message_id})
@ -233,7 +237,7 @@ def completion(tenant_id, agent_id, session_id=None, **kwargs):
API4ConversationService.append_message(conv["id"], conv) API4ConversationService.append_message(conv["id"], conv)
def completion_openai(tenant_id, agent_id, question, session_id=None, stream=True, **kwargs): async def completion_openai(tenant_id, agent_id, question, session_id=None, stream=True, **kwargs):
tiktoken_encoder = tiktoken.get_encoding("cl100k_base") tiktoken_encoder = tiktoken.get_encoding("cl100k_base")
prompt_tokens = len(tiktoken_encoder.encode(str(question))) prompt_tokens = len(tiktoken_encoder.encode(str(question)))
user_id = kwargs.get("user_id", "") user_id = kwargs.get("user_id", "")
@ -241,7 +245,7 @@ def completion_openai(tenant_id, agent_id, question, session_id=None, stream=Tru
if stream: if stream:
completion_tokens = 0 completion_tokens = 0
try: try:
for ans in completion( async for ans in completion(
tenant_id=tenant_id, tenant_id=tenant_id,
agent_id=agent_id, agent_id=agent_id,
session_id=session_id, session_id=session_id,
@ -300,7 +304,7 @@ def completion_openai(tenant_id, agent_id, question, session_id=None, stream=Tru
try: try:
all_content = "" all_content = ""
reference = {} reference = {}
for ans in completion( async for ans in completion(
tenant_id=tenant_id, tenant_id=tenant_id,
agent_id=agent_id, agent_id=agent_id,
session_id=session_id, session_id=session_id,
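Editor's note: completion() and completion_openai() are now async generators, so callers iterate with `async for`, and the new <think>/</think> markers are stitched into the stored transcript. A minimal consumption sketch (identifiers are placeholders):

    async def collect_answer(tenant_id, agent_id, question):
        out = []
        # each yielded item is already an SSE-formatted "data:{...}\n\n" string
        async for sse_line in completion(tenant_id, agent_id, question=question):
            out.append(sse_line)
        return "".join(out)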

View File

@ -15,6 +15,7 @@
# #
import logging import logging
from datetime import datetime from datetime import datetime
import os
from typing import Tuple, List from typing import Tuple, List
from anthropic import BaseModel from anthropic import BaseModel
@ -103,7 +104,8 @@ class SyncLogsService(CommonService):
Knowledgebase.avatar.alias("kb_avatar"), Knowledgebase.avatar.alias("kb_avatar"),
Connector2Kb.auto_parse, Connector2Kb.auto_parse,
cls.model.from_beginning.alias("reindex"), cls.model.from_beginning.alias("reindex"),
cls.model.status cls.model.status,
cls.model.update_time
] ]
if not connector_id: if not connector_id:
fields.append(Connector.config) fields.append(Connector.config)
@ -116,7 +118,11 @@ class SyncLogsService(CommonService):
if connector_id: if connector_id:
query = query.where(cls.model.connector_id == connector_id) query = query.where(cls.model.connector_id == connector_id)
else: else:
interval_expr = SQL("INTERVAL `t2`.`refresh_freq` MINUTE") database_type = os.getenv("DB_TYPE", "mysql")
if "postgres" in database_type.lower():
interval_expr = SQL("make_interval(mins => t2.refresh_freq)")
else:
interval_expr = SQL("INTERVAL `t2`.`refresh_freq` MINUTE")
query = query.where( query = query.where(
Connector.input_type == InputType.POLL, Connector.input_type == InputType.POLL,
Connector.status == TaskStatus.SCHEDULE, Connector.status == TaskStatus.SCHEDULE,
@ -208,9 +214,21 @@ class SyncLogsService(CommonService):
err, doc_blob_pairs = FileService.upload_document(kb, files, tenant_id, src) err, doc_blob_pairs = FileService.upload_document(kb, files, tenant_id, src)
errs.extend(err) errs.extend(err)
# Create a mapping from filename to metadata for later use
metadata_map = {}
for d in docs:
if d.get("metadata"):
filename = d["semantic_identifier"]+(f"{d['extension']}" if d["semantic_identifier"][::-1].find(d['extension'][::-1])<0 else "")
metadata_map[filename] = d["metadata"]
kb_table_num_map = {} kb_table_num_map = {}
for doc, _ in doc_blob_pairs: for doc, _ in doc_blob_pairs:
doc_ids.append(doc["id"]) doc_ids.append(doc["id"])
# Set metadata if available for this document
if doc["name"] in metadata_map:
DocumentService.update_by_id(doc["id"], {"meta_fields": metadata_map[doc["name"]]})
if not auto_parse or auto_parse == "0": if not auto_parse or auto_parse == "0":
continue continue
DocumentService.run(tenant_id, doc, kb_table_num_map) DocumentService.run(tenant_id, doc, kb_table_num_map)
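Editor's note: the filename reconstruction above is terse; a hedged restatement of the same mapping logic follows (field names from the diff; the reversed-string find is equivalent to a plain containment check):

    metadata_map = {}
    for d in docs:
        if not d.get("metadata"):
            continue
        # Append the extension only when the semantic identifier does not
        # already contain it, then remember the metadata under that name.
        name = d["semantic_identifier"]
        if d["extension"] not in name:
            name += d["extension"]
        metadata_map[name] = d["metadata"]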

View File

@ -25,6 +25,7 @@ import trio
from langfuse import Langfuse from langfuse import Langfuse
from peewee import fn from peewee import fn
from agentic_reasoning import DeepResearcher from agentic_reasoning import DeepResearcher
from api.db.services.file_service import FileService
from common.constants import LLMType, ParserType, StatusEnum from common.constants import LLMType, ParserType, StatusEnum
from api.db.db_models import DB, Dialog from api.db.db_models import DB, Dialog
from api.db.services.common_service import CommonService from api.db.services.common_service import CommonService
@ -178,6 +179,9 @@ class DialogService(CommonService):
return res return res
def chat_solo(dialog, messages, stream=True): def chat_solo(dialog, messages, stream=True):
attachments = ""
if "files" in messages[-1]:
attachments = "\n\n".join(FileService.get_files(messages[-1]["files"]))
if TenantLLMService.llm_id2llm_type(dialog.llm_id) == "image2text": if TenantLLMService.llm_id2llm_type(dialog.llm_id) == "image2text":
chat_mdl = LLMBundle(dialog.tenant_id, LLMType.IMAGE2TEXT, dialog.llm_id) chat_mdl = LLMBundle(dialog.tenant_id, LLMType.IMAGE2TEXT, dialog.llm_id)
else: else:
@ -188,6 +192,8 @@ def chat_solo(dialog, messages, stream=True):
if prompt_config.get("tts"): if prompt_config.get("tts"):
tts_mdl = LLMBundle(dialog.tenant_id, LLMType.TTS) tts_mdl = LLMBundle(dialog.tenant_id, LLMType.TTS)
msg = [{"role": m["role"], "content": re.sub(r"##\d+\$\$", "", m["content"])} for m in messages if m["role"] != "system"] msg = [{"role": m["role"], "content": re.sub(r"##\d+\$\$", "", m["content"])} for m in messages if m["role"] != "system"]
if attachments and msg:
msg[-1]["content"] += attachments
if stream: if stream:
last_ans = "" last_ans = ""
delta_ans = "" delta_ans = ""
@ -287,7 +293,7 @@ def convert_conditions(metadata_condition):
] ]
def meta_filter(metas: dict, filters: list[dict]): def meta_filter(metas: dict, filters: list[dict], logic: str = "and"):
doc_ids = set([]) doc_ids = set([])
def filter_out(v2docs, operator, value): def filter_out(v2docs, operator, value):
@ -304,6 +310,8 @@ def meta_filter(metas: dict, filters: list[dict]):
for conds in [ for conds in [
(operator == "contains", str(value).lower() in str(input).lower()), (operator == "contains", str(value).lower() in str(input).lower()),
(operator == "not contains", str(value).lower() not in str(input).lower()), (operator == "not contains", str(value).lower() not in str(input).lower()),
(operator == "in", str(input).lower() in str(value).lower()),
(operator == "not in", str(input).lower() not in str(value).lower()),
(operator == "start with", str(input).lower().startswith(str(value).lower())), (operator == "start with", str(input).lower().startswith(str(value).lower())),
(operator == "end with", str(input).lower().endswith(str(value).lower())), (operator == "end with", str(input).lower().endswith(str(value).lower())),
(operator == "empty", not input), (operator == "empty", not input),
@ -331,7 +339,10 @@ def meta_filter(metas: dict, filters: list[dict]):
if not doc_ids: if not doc_ids:
doc_ids = set(ids) doc_ids = set(ids)
else: else:
doc_ids = doc_ids & set(ids) if logic == "and":
doc_ids = doc_ids & set(ids)
else:
doc_ids = doc_ids | set(ids)
if not doc_ids: if not doc_ids:
return [] return []
return list(doc_ids) return list(doc_ids)
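Editor's note: meta_filter() now takes a `logic` argument, intersecting per-filter matches for "and" and unioning them for "or". A conceptual sketch of that switch (doc IDs are illustrative):

    logic = "and"                                  # or "or"
    matches_per_filter = [{"d1", "d2"}, {"d2", "d3"}]
    doc_ids = set()
    for ids in matches_per_filter:
        if not doc_ids:
            doc_ids = set(ids)
        elif logic == "and":
            doc_ids &= ids                         # intersection -> {"d2"}
        else:
            doc_ids |= ids                         # union -> {"d1", "d2", "d3"}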
@ -375,8 +386,11 @@ def chat(dialog, messages, stream=True, **kwargs):
retriever = settings.retriever retriever = settings.retriever
questions = [m["content"] for m in messages if m["role"] == "user"][-3:] questions = [m["content"] for m in messages if m["role"] == "user"][-3:]
attachments = kwargs["doc_ids"].split(",") if "doc_ids" in kwargs else [] attachments = kwargs["doc_ids"].split(",") if "doc_ids" in kwargs else []
attachments_= ""
if "doc_ids" in messages[-1]: if "doc_ids" in messages[-1]:
attachments = messages[-1]["doc_ids"] attachments = messages[-1]["doc_ids"]
if "files" in messages[-1]:
attachments_ = "\n\n".join(FileService.get_files(messages[-1]["files"]))
prompt_config = dialog.prompt_config prompt_config = dialog.prompt_config
field_map = KnowledgebaseService.get_field_map(dialog.kb_ids) field_map = KnowledgebaseService.get_field_map(dialog.kb_ids)
@ -407,14 +421,15 @@ def chat(dialog, messages, stream=True, **kwargs):
if dialog.meta_data_filter: if dialog.meta_data_filter:
metas = DocumentService.get_meta_by_kbs(dialog.kb_ids) metas = DocumentService.get_meta_by_kbs(dialog.kb_ids)
if dialog.meta_data_filter.get("method") == "auto": if dialog.meta_data_filter.get("method") == "auto":
filters = gen_meta_filter(chat_mdl, metas, questions[-1]) filters: dict = gen_meta_filter(chat_mdl, metas, questions[-1])
attachments.extend(meta_filter(metas, filters)) attachments.extend(meta_filter(metas, filters["conditions"], filters.get("logic", "and")))
if not attachments: if not attachments:
attachments = None attachments = None
elif dialog.meta_data_filter.get("method") == "manual": elif dialog.meta_data_filter.get("method") == "manual":
attachments.extend(meta_filter(metas, dialog.meta_data_filter["manual"])) conds = dialog.meta_data_filter["manual"]
if not attachments: attachments.extend(meta_filter(metas, conds, dialog.meta_data_filter.get("logic", "and")))
attachments = None if conds and not attachments:
attachments = ["-999"]
if prompt_config.get("keyword", False): if prompt_config.get("keyword", False):
questions[-1] += keyword_extraction(chat_mdl, questions[-1]) questions[-1] += keyword_extraction(chat_mdl, questions[-1])
@ -445,7 +460,7 @@ def chat(dialog, messages, stream=True, **kwargs):
), ),
) )
for think in reasoner.thinking(kbinfos, " ".join(questions)): for think in reasoner.thinking(kbinfos, attachments_ + " ".join(questions)):
if isinstance(think, str): if isinstance(think, str):
thought = think thought = think
knowledges = [t for t in think.split("\n") if t] knowledges = [t for t in think.split("\n") if t]
@ -472,6 +487,7 @@ def chat(dialog, messages, stream=True, **kwargs):
cks = retriever.retrieval_by_toc(" ".join(questions), kbinfos["chunks"], tenant_ids, chat_mdl, dialog.top_n) cks = retriever.retrieval_by_toc(" ".join(questions), kbinfos["chunks"], tenant_ids, chat_mdl, dialog.top_n)
if cks: if cks:
kbinfos["chunks"] = cks kbinfos["chunks"] = cks
kbinfos["chunks"] = retriever.retrieval_by_children(kbinfos["chunks"], tenant_ids)
if prompt_config.get("tavily_api_key"): if prompt_config.get("tavily_api_key"):
tav = Tavily(prompt_config["tavily_api_key"]) tav = Tavily(prompt_config["tavily_api_key"])
tav_res = tav.retrieve_chunks(" ".join(questions)) tav_res = tav.retrieve_chunks(" ".join(questions))
@ -497,7 +513,7 @@ def chat(dialog, messages, stream=True, **kwargs):
kwargs["knowledge"] = "\n------\n" + "\n\n------\n\n".join(knowledges) kwargs["knowledge"] = "\n------\n" + "\n\n------\n\n".join(knowledges)
gen_conf = dialog.llm_setting gen_conf = dialog.llm_setting
msg = [{"role": "system", "content": prompt_config["system"].format(**kwargs)}] msg = [{"role": "system", "content": prompt_config["system"].format(**kwargs)+attachments_}]
prompt4citation = "" prompt4citation = ""
if knowledges and (prompt_config.get("quote", True) and kwargs.get("quote", True)): if knowledges and (prompt_config.get("quote", True) and kwargs.get("quote", True)):
prompt4citation = citation_prompt() prompt4citation = citation_prompt()
@ -666,7 +682,11 @@ Please write the SQL, only SQL, without any other explanations or text.
if kb_ids: if kb_ids:
kb_filter = "(" + " OR ".join([f"kb_id = '{kb_id}'" for kb_id in kb_ids]) + ")" kb_filter = "(" + " OR ".join([f"kb_id = '{kb_id}'" for kb_id in kb_ids]) + ")"
if "where" not in sql.lower(): if "where" not in sql.lower():
sql += f" WHERE {kb_filter}" o = sql.lower().split("order by")
if len(o) > 1:
sql = o[0] + f" WHERE {kb_filter} order by " + o[1]
else:
sql += f" WHERE {kb_filter}"
else: else:
sql += f" AND {kb_filter}" sql += f" AND {kb_filter}"
@ -674,10 +694,9 @@ Please write the SQL, only SQL, without any other explanations or text.
tried_times += 1 tried_times += 1
return settings.retriever.sql_retrieval(sql, format="json"), sql return settings.retriever.sql_retrieval(sql, format="json"), sql
tbl, sql = get_table() try:
if tbl is None: tbl, sql = get_table()
return None except Exception as e:
if tbl.get("error") and tried_times <= 2:
user_prompt = """ user_prompt = """
Table name: {}; Table name: {};
Table of database fields are as follows: Table of database fields are as follows:
@ -691,16 +710,14 @@ Please write the SQL, only SQL, without any other explanations or text.
The SQL error you provided last time is as follows: The SQL error you provided last time is as follows:
{} {}
Error issued by database as follows:
{}
Please correct the error and write SQL again, only SQL, without any other explanations or text. Please correct the error and write SQL again, only SQL, without any other explanations or text.
""".format(index_name(tenant_id), "\n".join([f"{k}: {v}" for k, v in field_map.items()]), question, sql, tbl["error"]) """.format(index_name(tenant_id), "\n".join([f"{k}: {v}" for k, v in field_map.items()]), question, e)
tbl, sql = get_table() try:
logging.debug("TRY it again: {}".format(sql)) tbl, sql = get_table()
except Exception:
return
logging.debug("GET table: {}".format(tbl)) if len(tbl["rows"]) == 0:
if tbl.get("error") or len(tbl["rows"]) == 0:
return None return None
docid_idx = set([ii for ii, c in enumerate(tbl["columns"]) if c["name"] == "doc_id"]) docid_idx = set([ii for ii, c in enumerate(tbl["columns"]) if c["name"] == "doc_id"])
@ -778,14 +795,14 @@ def ask(question, kb_ids, tenant_id, chat_llm_name=None, search_config={}):
if meta_data_filter: if meta_data_filter:
metas = DocumentService.get_meta_by_kbs(kb_ids) metas = DocumentService.get_meta_by_kbs(kb_ids)
if meta_data_filter.get("method") == "auto": if meta_data_filter.get("method") == "auto":
filters = gen_meta_filter(chat_mdl, metas, question) filters: dict = gen_meta_filter(chat_mdl, metas, question)
doc_ids.extend(meta_filter(metas, filters)) doc_ids.extend(meta_filter(metas, filters["conditions"], filters.get("logic", "and")))
if not doc_ids: if not doc_ids:
doc_ids = None doc_ids = None
elif meta_data_filter.get("method") == "manual": elif meta_data_filter.get("method") == "manual":
doc_ids.extend(meta_filter(metas, meta_data_filter["manual"])) doc_ids.extend(meta_filter(metas, meta_data_filter["manual"], meta_data_filter.get("logic", "and")))
if not doc_ids: if meta_data_filter["manual"] and not doc_ids:
doc_ids = None doc_ids = ["-999"]
kbinfos = retriever.retrieval( kbinfos = retriever.retrieval(
question=question, question=question,
@ -853,14 +870,14 @@ def gen_mindmap(question, kb_ids, tenant_id, search_config={}):
if meta_data_filter: if meta_data_filter:
metas = DocumentService.get_meta_by_kbs(kb_ids) metas = DocumentService.get_meta_by_kbs(kb_ids)
if meta_data_filter.get("method") == "auto": if meta_data_filter.get("method") == "auto":
filters = gen_meta_filter(chat_mdl, metas, question) filters: dict = gen_meta_filter(chat_mdl, metas, question)
doc_ids.extend(meta_filter(metas, filters)) doc_ids.extend(meta_filter(metas, filters["conditions"], filters.get("logic", "and")))
if not doc_ids: if not doc_ids:
doc_ids = None doc_ids = None
elif meta_data_filter.get("method") == "manual": elif meta_data_filter.get("method") == "manual":
doc_ids.extend(meta_filter(metas, meta_data_filter["manual"])) doc_ids.extend(meta_filter(metas, meta_data_filter["manual"], meta_data_filter.get("logic", "and")))
if not doc_ids: if meta_data_filter["manual"] and not doc_ids:
doc_ids = None doc_ids = ["-999"]
ranks = settings.retriever.retrieval( ranks = settings.retriever.retrieval(
question=question, question=question,

View File

@ -923,7 +923,7 @@ def doc_upload_and_parse(conversation_id, file_objs, user_id):
ParserType.AUDIO.value: audio, ParserType.AUDIO.value: audio,
ParserType.EMAIL.value: email ParserType.EMAIL.value: email
} }
parser_config = {"chunk_token_num": 4096, "delimiter": "\n!?;。;!?", "layout_recognize": "Plain Text"} parser_config = {"chunk_token_num": 4096, "delimiter": "\n!?;。;!?", "layout_recognize": "Plain Text", "table_context_size": 0, "image_context_size": 0}
exe = ThreadPoolExecutor(max_workers=12) exe = ThreadPoolExecutor(max_workers=12)
threads = [] threads = []
doc_nm = {} doc_nm = {}

View File

@ -13,10 +13,15 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# #
import asyncio
import base64
import logging import logging
import re import re
import sys
import time
from concurrent.futures import ThreadPoolExecutor from concurrent.futures import ThreadPoolExecutor
from pathlib import Path from pathlib import Path
from typing import Union
from peewee import fn from peewee import fn
@ -520,7 +525,7 @@ class FileService(CommonService):
if img_base64 and file_type == FileType.VISUAL.value: if img_base64 and file_type == FileType.VISUAL.value:
return GptV4.image2base64(blob) return GptV4.image2base64(blob)
cks = FACTORY.get(FileService.get_parser(filename_type(filename), filename, ""), naive).chunk(filename, blob, **kwargs) cks = FACTORY.get(FileService.get_parser(filename_type(filename), filename, ""), naive).chunk(filename, blob, **kwargs)
return "\n".join([ck["content_with_weight"] for ck in cks]) return f"\n -----------------\nFile: {filename}\nContent as following: \n" + "\n".join([ck["content_with_weight"] for ck in cks])
@staticmethod @staticmethod
def get_parser(doc_type, filename, default): def get_parser(doc_type, filename, default):
@ -588,3 +593,80 @@ class FileService(CommonService):
errors += str(e) errors += str(e)
return errors return errors
@staticmethod
def upload_info(user_id, file, url: str|None=None):
def structured(filename, filetype, blob, content_type):
nonlocal user_id
if filetype == FileType.PDF.value:
blob = read_potential_broken_pdf(blob)
location = get_uuid()
FileService.put_blob(user_id, location, blob)
return {
"id": location,
"name": filename,
"size": sys.getsizeof(blob),
"extension": filename.split(".")[-1].lower(),
"mime_type": content_type,
"created_by": user_id,
"created_at": time.time(),
"preview_url": None
}
if url:
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
DefaultMarkdownGenerator,
PruningContentFilter,
CrawlResult
)
filename = re.sub(r"\?.*", "", url.split("/")[-1])
async def adownload():
browser_config = BrowserConfig(
headless=True,
verbose=False,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter()
),
pdf=True,
screenshot=False
)
result: CrawlResult = await crawler.arun(
url=url,
config=crawler_config
)
return result
page = asyncio.run(adownload())
if page.pdf:
if filename.split(".")[-1].lower() != "pdf":
filename += ".pdf"
return structured(filename, "pdf", page.pdf, page.response_headers["content-type"])
return structured(filename, "html", str(page.markdown).encode("utf-8"), page.response_headers["content-type"], user_id)
DocumentService.check_doc_health(user_id, file.filename)
return structured(file.filename, filename_type(file.filename), file.read(), file.content_type)
@staticmethod
def get_files(files: Union[None, list[dict]]) -> list[str]:
if not files:
return []
def image_to_base64(file):
return "data:{};base64,{}".format(file["mime_type"],
base64.b64encode(FileService.get_blob(file["created_by"], file["id"])).decode("utf-8"))
exe = ThreadPoolExecutor(max_workers=5)
threads = []
for file in files:
if file["mime_type"].find("image") >=0:
threads.append(exe.submit(image_to_base64, file))
continue
threads.append(exe.submit(FileService.parse, file["name"], FileService.get_blob(file["created_by"], file["id"]), True, file["created_by"]))
return [th.result() for th in threads]
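Editor's note: get_files() turns uploaded attachments into prompt-ready strings, base64 data URIs for images and parsed text for everything else, using a small thread pool. A hedged usage sketch mirroring how chat_solo()/chat() consume it earlier in this diff (`messages` is assumed from that context):

    files = messages[-1].get("files") or []
    attachments = "\n\n".join(FileService.get_files(files))
    if attachments:
        messages[-1]["content"] += attachments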

View File

@ -13,9 +13,11 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# #
import asyncio
import inspect import inspect
import logging import logging
import re import re
import threading
from common.token_utils import num_tokens_from_string from common.token_utils import num_tokens_from_string
from functools import partial from functools import partial
from typing import Generator from typing import Generator
@ -183,6 +185,66 @@ class LLMBundle(LLM4Tenant):
return txt return txt
def stream_transcription(self, audio):
mdl = self.mdl
supports_stream = hasattr(mdl, "stream_transcription") and callable(getattr(mdl, "stream_transcription"))
if supports_stream:
if self.langfuse:
generation = self.langfuse.start_generation(
trace_context=self.trace_context,
name="stream_transcription",
metadata={"model": self.llm_name}
)
final_text = ""
used_tokens = 0
try:
for evt in mdl.stream_transcription(audio):
if evt.get("event") == "final":
final_text = evt.get("text", "")
yield evt
except Exception as e:
err = {"event": "error", "text": str(e)}
yield err
final_text = final_text or ""
finally:
if final_text:
used_tokens = num_tokens_from_string(final_text)
TenantLLMService.increase_usage(self.tenant_id, self.llm_type, used_tokens)
if self.langfuse:
generation.update(
output={"output": final_text},
usage_details={"total_tokens": used_tokens}
)
generation.end()
return
if self.langfuse:
generation = self.langfuse.start_generation(trace_context=self.trace_context, name="stream_transcription", metadata={"model": self.llm_name})
full_text, used_tokens = mdl.transcription(audio)
if not TenantLLMService.increase_usage(
self.tenant_id, self.llm_type, used_tokens
):
logging.error(
f"LLMBundle.stream_transcription can't update token usage for {self.tenant_id}/SEQUENCE2TXT used_tokens: {used_tokens}"
)
if self.langfuse:
generation.update(
output={"output": full_text},
usage_details={"total_tokens": used_tokens}
)
generation.end()
yield {
"event": "final",
"text": full_text,
"streaming": False
}
def tts(self, text: str) -> Generator[bytes, None, None]: def tts(self, text: str) -> Generator[bytes, None, None]:
if self.langfuse: if self.langfuse:
generation = self.langfuse.start_generation(trace_context=self.trace_context, name="tts", input={"text": text}) generation = self.langfuse.start_generation(trace_context=self.trace_context, name="tts", input={"text": text})
@ -242,7 +304,7 @@ class LLMBundle(LLM4Tenant):
if not self.verbose_tool_use: if not self.verbose_tool_use:
txt = re.sub(r"<tool_call>.*?</tool_call>", "", txt, flags=re.DOTALL) txt = re.sub(r"<tool_call>.*?</tool_call>", "", txt, flags=re.DOTALL)
if isinstance(txt, int) and not TenantLLMService.increase_usage(self.tenant_id, self.llm_type, used_tokens, self.llm_name): if used_tokens and not TenantLLMService.increase_usage(self.tenant_id, self.llm_type, used_tokens, self.llm_name):
logging.error("LLMBundle.chat can't update token usage for {}/CHAT llm_name: {}, used_tokens: {}".format(self.tenant_id, self.llm_name, used_tokens)) logging.error("LLMBundle.chat can't update token usage for {}/CHAT llm_name: {}, used_tokens: {}".format(self.tenant_id, self.llm_name, used_tokens))
if self.langfuse: if self.langfuse:
@ -279,5 +341,80 @@ class LLMBundle(LLM4Tenant):
yield ans yield ans
if total_tokens > 0: if total_tokens > 0:
if not TenantLLMService.increase_usage(self.tenant_id, self.llm_type, txt, self.llm_name): if not TenantLLMService.increase_usage(self.tenant_id, self.llm_type, total_tokens, self.llm_name):
logging.error("LLMBundle.chat_streamly can't update token usage for {}/CHAT llm_name: {}, content: {}".format(self.tenant_id, self.llm_name, txt)) logging.error("LLMBundle.chat_streamly can't update token usage for {}/CHAT llm_name: {}, used_tokens: {}".format(self.tenant_id, self.llm_name, total_tokens))
def _bridge_sync_stream(self, gen):
loop = asyncio.get_running_loop()
queue: asyncio.Queue = asyncio.Queue()
def worker():
try:
for item in gen:
loop.call_soon_threadsafe(queue.put_nowait, item)
except Exception as e: # pragma: no cover
loop.call_soon_threadsafe(queue.put_nowait, e)
finally:
loop.call_soon_threadsafe(queue.put_nowait, StopAsyncIteration)
threading.Thread(target=worker, daemon=True).start()
return queue
async def async_chat(self, system: str, history: list, gen_conf: dict = {}, **kwargs):
chat_partial = partial(self.mdl.chat, system, history, gen_conf, **kwargs)
if self.is_tools and self.mdl.is_tools and hasattr(self.mdl, "chat_with_tools"):
chat_partial = partial(self.mdl.chat_with_tools, system, history, gen_conf, **kwargs)
use_kwargs = self._clean_param(chat_partial, **kwargs)
if hasattr(self.mdl, "async_chat_with_tools") and self.is_tools and self.mdl.is_tools:
txt, used_tokens = await self.mdl.async_chat_with_tools(system, history, gen_conf, **use_kwargs)
elif hasattr(self.mdl, "async_chat"):
txt, used_tokens = await self.mdl.async_chat(system, history, gen_conf, **use_kwargs)
else:
txt, used_tokens = await asyncio.to_thread(chat_partial, **use_kwargs)
txt = self._remove_reasoning_content(txt)
if not self.verbose_tool_use:
txt = re.sub(r"<tool_call>.*?</tool_call>", "", txt, flags=re.DOTALL)
if used_tokens and not TenantLLMService.increase_usage(self.tenant_id, self.llm_type, used_tokens, self.llm_name):
logging.error("LLMBundle.async_chat can't update token usage for {}/CHAT llm_name: {}, used_tokens: {}".format(self.tenant_id, self.llm_name, used_tokens))
return txt
async def async_chat_streamly(self, system: str, history: list, gen_conf: dict = {}, **kwargs):
total_tokens = 0
if self.is_tools and self.mdl.is_tools:
stream_fn = getattr(self.mdl, "async_chat_streamly_with_tools", None)
else:
stream_fn = getattr(self.mdl, "async_chat_streamly", None)
if stream_fn:
chat_partial = partial(stream_fn, system, history, gen_conf)
use_kwargs = self._clean_param(chat_partial, **kwargs)
async for txt in chat_partial(**use_kwargs):
if isinstance(txt, int):
total_tokens = txt
break
yield txt
if total_tokens and not TenantLLMService.increase_usage(self.tenant_id, self.llm_type, total_tokens, self.llm_name):
logging.error("LLMBundle.async_chat_streamly can't update token usage for {}/CHAT llm_name: {}, used_tokens: {}".format(self.tenant_id, self.llm_name, total_tokens))
return
chat_partial = partial(self.mdl.chat_streamly_with_tools if (self.is_tools and self.mdl.is_tools) else self.mdl.chat_streamly, system, history, gen_conf)
use_kwargs = self._clean_param(chat_partial, **kwargs)
queue = self._bridge_sync_stream(chat_partial(**use_kwargs))
while True:
item = await queue.get()
if item is StopAsyncIteration:
break
if isinstance(item, Exception):
raise item
if isinstance(item, int):
total_tokens = item
break
yield item
if total_tokens and not TenantLLMService.increase_usage(self.tenant_id, self.llm_type, total_tokens, self.llm_name):
logging.error("LLMBundle.async_chat_streamly can't update token usage for {}/CHAT llm_name: {}, used_tokens: {}".format(self.tenant_id, self.llm_name, total_tokens))

View File

@ -20,16 +20,15 @@
from common.log_utils import init_root_logger from common.log_utils import init_root_logger
from plugin import GlobalPluginManager from plugin import GlobalPluginManager
init_root_logger("ragflow_server")
import logging import logging
import os import os
import signal import signal
import sys import sys
import time
import traceback import traceback
import threading import threading
import uuid import uuid
import faulthandler
from api.apps import app, smtp_mail_server from api.apps import app, smtp_mail_server
from api.db.runtime_config import RuntimeConfig from api.db.runtime_config import RuntimeConfig
@ -37,7 +36,7 @@ from api.db.services.document_service import DocumentService
from common.file_utils import get_project_base_directory from common.file_utils import get_project_base_directory
from common import settings from common import settings
from api.db.db_models import init_database_tables as init_web_db from api.db.db_models import init_database_tables as init_web_db
from api.db.init_data import init_web_data from api.db.init_data import init_web_data, init_superuser
from common.versions import get_ragflow_version from common.versions import get_ragflow_version
from common.config_utils import show_configs from common.config_utils import show_configs
from common.mcp_tool_call_conn import shutdown_all_mcp_sessions from common.mcp_tool_call_conn import shutdown_all_mcp_sessions
@ -69,10 +68,12 @@ def signal_handler(sig, frame):
logging.info("Received interrupt signal, shutting down...") logging.info("Received interrupt signal, shutting down...")
shutdown_all_mcp_sessions() shutdown_all_mcp_sessions()
stop_event.set() stop_event.set()
time.sleep(1) stop_event.wait(1)
sys.exit(0) sys.exit(0)
if __name__ == '__main__': if __name__ == '__main__':
faulthandler.enable()
init_root_logger("ragflow_server")
logging.info(r""" logging.info(r"""
____ ___ ______ ______ __ ____ ___ ______ ______ __
/ __ \ / | / ____// ____// /____ _ __ / __ \ / | / ____// ____// /____ _ __
@ -109,11 +110,16 @@ if __name__ == '__main__':
parser.add_argument( parser.add_argument(
"--debug", default=False, help="debug mode", action="store_true" "--debug", default=False, help="debug mode", action="store_true"
) )
parser.add_argument(
"--init-superuser", default=False, help="init superuser", action="store_true"
)
args = parser.parse_args() args = parser.parse_args()
if args.version: if args.version:
print(get_ragflow_version()) print(get_ragflow_version())
sys.exit(0) sys.exit(0)
if args.init_superuser:
init_superuser()
RuntimeConfig.DEBUG = args.debug RuntimeConfig.DEBUG = args.debug
if RuntimeConfig.DEBUG: if RuntimeConfig.DEBUG:
logging.info("run on debug mode") logging.info("run on debug mode")
@ -156,5 +162,5 @@ if __name__ == '__main__':
except Exception: except Exception:
traceback.print_exc() traceback.print_exc()
stop_event.set() stop_event.set()
time.sleep(1) stop_event.wait(1)
os.kill(os.getpid(), signal.SIGKILL) os.kill(os.getpid(), signal.SIGKILL)

View File

@ -22,6 +22,7 @@ import os
import time import time
from copy import deepcopy from copy import deepcopy
from functools import wraps from functools import wraps
from typing import Any
import requests import requests
import trio import trio
@ -45,11 +46,40 @@ from common import settings
requests.models.complexjson.dumps = functools.partial(json.dumps, cls=CustomJSONEncoder) requests.models.complexjson.dumps = functools.partial(json.dumps, cls=CustomJSONEncoder)
async def request_json(): async def _coerce_request_data() -> dict:
"""Fetch JSON body with sane defaults; fallback to form data."""
payload: Any = None
last_error: Exception | None = None
try: try:
return await request.json payload = await request.get_json(force=True, silent=True)
except Exception: except Exception as e:
return {} last_error = e
payload = None
if payload is None:
try:
form = await request.form
payload = form.to_dict()
except Exception as e:
last_error = e
payload = None
if payload is None:
if last_error is not None:
raise last_error
raise ValueError("No JSON body or form data found in request.")
if isinstance(payload, dict):
return payload or {}
if isinstance(payload, str):
raise AttributeError("'str' object has no attribute 'get'")
raise TypeError(f"Unsupported request payload type: {type(payload)!r}")
async def get_request_json():
return await _coerce_request_data()
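Editor's note: _coerce_request_data() first tries a silent, forced JSON parse, then falls back to form data, and only re-raises when both fail; get_request_json() is the thin public wrapper the routes above switched to. A hedged call-site sketch (route and fields are illustrative):

    @manager.route("/example", methods=["POST"])  # noqa: F821  (illustrative)
    @login_required
    @validate_request("name")
    async def example():
        req = await get_request_json()   # dict for JSON bodies and form posts alike
        name = req["name"]               # guaranteed present by validate_request
        tags = req.get("tags", [])
        return get_json_result(data={"name": name, "tags": tags})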
def serialize_for_json(obj): def serialize_for_json(obj):
""" """
@ -89,7 +119,8 @@ def get_data_error_result(code=RetCode.DATA_ERROR, message="Sorry! Data missing!
def server_error_response(e): def server_error_response(e):
logging.exception(e) # Quart invokes this handler outside the original except block, so we must pass exc_info manually.
logging.error("Unhandled exception during request", exc_info=(type(e), e, e.__traceback__))
try: try:
msg = repr(e).lower() msg = repr(e).lower()
if getattr(e, "code", None) == 401 or ("unauthorized" in msg) or ("401" in msg): if getattr(e, "code", None) == 401 or ("unauthorized" in msg) or ("401" in msg):
@ -136,7 +167,7 @@ def validate_request(*args, **kwargs):
def wrapper(func): def wrapper(func):
@wraps(func) @wraps(func)
async def decorated_function(*_args, **_kwargs): async def decorated_function(*_args, **_kwargs):
errs = process_args(await request.json or (await request.form).to_dict()) errs = process_args(await _coerce_request_data())
if errs: if errs:
return get_json_result(code=RetCode.ARGUMENT_ERROR, message=errs) return get_json_result(code=RetCode.ARGUMENT_ERROR, message=errs)
if inspect.iscoroutinefunction(func): if inspect.iscoroutinefunction(func):
@ -151,7 +182,7 @@ def validate_request(*args, **kwargs):
def not_allowed_parameters(*params): def not_allowed_parameters(*params):
def decorator(func): def decorator(func):
async def wrapper(*args, **kwargs): async def wrapper(*args, **kwargs):
input_arguments = await request.json or (await request.form).to_dict() input_arguments = await _coerce_request_data()
for param in params: for param in params:
if param in input_arguments: if param in input_arguments:
return get_json_result(code=RetCode.ARGUMENT_ERROR, message=f"Parameter {param} isn't allowed") return get_json_result(code=RetCode.ARGUMENT_ERROR, message=f"Parameter {param} isn't allowed")
@ -312,6 +343,10 @@ def get_parser_config(chunk_method, parser_config):
chunk_method = "naive" chunk_method = "naive"
# Define default configurations for each chunking method # Define default configurations for each chunking method
base_defaults = {
"table_context_size": 0,
"image_context_size": 0,
}
key_mapping = { key_mapping = {
"naive": { "naive": {
"layout_recognize": "DeepDOC", "layout_recognize": "DeepDOC",
@ -364,16 +399,19 @@ def get_parser_config(chunk_method, parser_config):
default_config = key_mapping[chunk_method] default_config = key_mapping[chunk_method]
# If no parser_config provided, return default # If no parser_config provided, return default merged with base defaults
if not parser_config: if not parser_config:
return default_config if default_config is None:
return deep_merge(base_defaults, {})
return deep_merge(base_defaults, default_config)
# If parser_config is provided, merge with defaults to ensure required fields exist # If parser_config is provided, merge with defaults to ensure required fields exist
if default_config is None: if default_config is None:
return parser_config return deep_merge(base_defaults, parser_config)
# Ensure raptor and graphrag fields have default values if not provided # Ensure raptor and graphrag fields have default values if not provided
merged_config = deep_merge(default_config, parser_config) merged_config = deep_merge(base_defaults, default_config)
merged_config = deep_merge(merged_config, parser_config)
return merged_config return merged_config
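Editor's note: get_parser_config() now layers three dicts, base_defaults (the new context-size knobs), the per-chunk-method defaults, and finally the caller-supplied parser_config, with later layers winning. A small worked example, assuming deep_merge(a, b) lets b override a, which is the order used above (the chunk-method values here are placeholders):

    base_defaults  = {"table_context_size": 0, "image_context_size": 0}
    method_default = {"layout_recognize": "DeepDOC", "chunk_token_num": 512}
    user_config    = {"chunk_token_num": 1024, "table_context_size": 2}

    merged = deep_merge(deep_merge(base_defaults, method_default), user_config)
    # -> {"table_context_size": 2, "image_context_size": 0,
    #     "layout_recognize": "DeepDOC", "chunk_token_num": 1024}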

View File

@@ -14,6 +14,7 @@
# limitations under the License.
#
from collections import Counter
import string
from typing import Annotated, Any, Literal
from uuid import UUID
@@ -25,6 +26,7 @@ from pydantic import (
StringConstraints,
ValidationError,
field_validator,
model_validator,
)
from pydantic_core import PydanticCustomError
from werkzeug.exceptions import BadRequest, UnsupportedMediaType
@@ -361,10 +363,9 @@ class CreateDatasetReq(Base):
description: Annotated[str | None, Field(default=None, max_length=65535)]
embedding_model: Annotated[str | None, Field(default=None, max_length=255, serialization_alias="embd_id")]
permission: Annotated[Literal["me", "team"], Field(default="me", min_length=1, max_length=16)]
- chunk_method: Annotated[
-     Literal["naive", "book", "email", "laws", "manual", "one", "paper", "picture", "presentation", "qa", "table", "tag"],
-     Field(default="naive", min_length=1, max_length=32, serialization_alias="parser_id"),
- ]
+ chunk_method: Annotated[str | None, Field(default=None, serialization_alias="parser_id")]
+ parse_type: Annotated[int | None, Field(default=None, ge=0, le=64)]
+ pipeline_id: Annotated[str | None, Field(default=None, min_length=32, max_length=32, serialization_alias="pipeline_id")]
parser_config: Annotated[ParserConfig | None, Field(default=None)]
@field_validator("avatar", mode="after") @field_validator("avatar", mode="after")
@@ -525,6 +526,93 @@ class CreateDatasetReq(Base):
raise PydanticCustomError("string_too_long", "Parser config exceeds size limit (max 65,535 characters). Current size: {actual}", {"actual": len(json_str)})
return v
@field_validator("pipeline_id", mode="after")
@classmethod
def validate_pipeline_id(cls, v: str | None) -> str | None:
"""Validate pipeline_id as 32-char lowercase hex string if provided.
Rules:
- None or empty string: treat as None (not set)
- Must be exactly length 32
- Must contain only hex digits (0-9a-fA-F); normalized to lowercase
"""
if v is None:
return None
if v == "":
return None
if len(v) != 32:
raise PydanticCustomError("format_invalid", "pipeline_id must be 32 hex characters")
if any(ch not in string.hexdigits for ch in v):
raise PydanticCustomError("format_invalid", "pipeline_id must be hexadecimal")
return v.lower()
@model_validator(mode="after")
def validate_parser_dependency(self) -> "CreateDatasetReq":
"""
Mixed conditional validation:
- If parser_id is omitted (field not set):
* If both parse_type and pipeline_id are omitted → default chunk_method = "naive"
* If both parse_type and pipeline_id are provided → allow ingestion pipeline mode
- If parser_id is provided (valid enum) → parse_type and pipeline_id must be None (disallow mixed usage)
Raises:
PydanticCustomError with code 'dependency_error' on violation.
"""
# Omitted chunk_method (not in fields) logic
if self.chunk_method is None and "chunk_method" not in self.model_fields_set:
# All three absent → default naive
if self.parse_type is None and self.pipeline_id is None:
object.__setattr__(self, "chunk_method", "naive")
return self
# parser_id omitted: require BOTH parse_type & pipeline_id present (no partial allowed)
if self.parse_type is None or self.pipeline_id is None:
missing = []
if self.parse_type is None:
missing.append("parse_type")
if self.pipeline_id is None:
missing.append("pipeline_id")
raise PydanticCustomError(
"dependency_error",
"parser_id omitted → required fields missing: {fields}",
{"fields": ", ".join(missing)},
)
# Both provided → allow pipeline mode
return self
# parser_id provided (valid): MUST NOT have parse_type or pipeline_id
if isinstance(self.chunk_method, str):
if self.parse_type is not None or self.pipeline_id is not None:
invalid = []
if self.parse_type is not None:
invalid.append("parse_type")
if self.pipeline_id is not None:
invalid.append("pipeline_id")
raise PydanticCustomError(
"dependency_error",
"parser_id provided → disallowed fields present: {fields}",
{"fields": ", ".join(invalid)},
)
return self
@field_validator("chunk_method", mode="wrap")
@classmethod
def validate_chunk_method(cls, v: Any, handler) -> Any:
"""Wrap validation to unify error messages, including type errors (e.g. list)."""
allowed = {"naive", "book", "email", "laws", "manual", "one", "paper", "picture", "presentation", "qa", "table", "tag"}
error_msg = "Input should be 'naive', 'book', 'email', 'laws', 'manual', 'one', 'paper', 'picture', 'presentation', 'qa', 'table' or 'tag'"
# Omitted field: handler won't be invoked (wrap still gets value); None treated as explicit invalid
if v is None:
raise PydanticCustomError("literal_error", error_msg)
try:
# Run inner validation (type checking)
result = handler(v)
except Exception:
raise PydanticCustomError("literal_error", error_msg)
# After handler, enforce enumeration
if not isinstance(result, str) or result == "" or result not in allowed:
raise PydanticCustomError("literal_error", error_msg)
return result
class UpdateDatasetReq(CreateDatasetReq):
dataset_id: Annotated[str, Field(...)]
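A hedged sketch of the dependency rules above, written as plain Python rather than the actual Pydantic model; the field names follow the diff and the helper function is purely illustrative:

def check_dataset_request(chunk_method=None, parse_type=None, pipeline_id=None):
    # Mirrors validate_parser_dependency: either classic chunk_method mode
    # or the new ingestion-pipeline mode, never a mix of the two.
    if chunk_method is None:
        if parse_type is None and pipeline_id is None:
            return "naive"                      # all omitted -> default chunk method
        if parse_type is None or pipeline_id is None:
            raise ValueError("parse_type and pipeline_id must be provided together")
        return "pipeline"                       # ingestion-pipeline mode
    if parse_type is not None or pipeline_id is not None:
        raise ValueError("chunk_method cannot be combined with parse_type/pipeline_id")
    return chunk_method

check_dataset_request()                                       # -> "naive"
check_dataset_request(parse_type=1, pipeline_id="a" * 32)     # -> "pipeline"
check_dataset_request(chunk_method="qa", parse_type=1)        # raises ValueError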

View File

@@ -49,6 +49,7 @@ class RetCode(IntEnum, CustomEnum):
RUNNING = 106
PERMISSION_ERROR = 108
AUTHENTICATION_ERROR = 109
BAD_REQUEST = 400
UNAUTHORIZED = 401
SERVER_ERROR = 500
FORBIDDEN = 403
@@ -118,6 +119,9 @@ class FileSource(StrEnum):
SHAREPOINT = "sharepoint"
SLACK = "slack"
TEAMS = "teams"
WEBDAV = "webdav"
MOODLE = "moodle"
DROPBOX = "dropbox"
class PipelineTaskType(StrEnum):

View File

@@ -14,6 +14,8 @@ from .google_drive.connector import GoogleDriveConnector
from .jira.connector import JiraConnector
from .sharepoint_connector import SharePointConnector
from .teams_connector import TeamsConnector
from .webdav_connector import WebDAVConnector
from .moodle_connector import MoodleConnector
from .config import BlobType, DocumentSource
from .models import Document, TextSection, ImageSection, BasicExpertInfo
from .exceptions import (
@@ -36,6 +38,8 @@ __all__ = [
"JiraConnector",
"SharePointConnector",
"TeamsConnector",
"WebDAVConnector",
"MoodleConnector",
"BlobType",
"DocumentSource",
"Document",

View File

@@ -90,7 +90,7 @@ class BlobStorageConnector(LoadConnector, PollConnector):
elif self.bucket_type == BlobType.S3_COMPATIBLE:
if not all(
credentials.get(key)
- for key in ["endpoint_url", "aws_access_key_id", "aws_secret_access_key"]
+ for key in ["endpoint_url", "aws_access_key_id", "aws_secret_access_key", "addressing_style"]
):
raise ConnectorMissingCredentialError("S3 Compatible Storage")

View File

@@ -48,7 +48,10 @@ class DocumentSource(str, Enum):
GOOGLE_DRIVE = "google_drive"
GMAIL = "gmail"
DISCORD = "discord"
WEBDAV = "webdav"
MOODLE = "moodle"
S3_COMPATIBLE = "s3_compatible"
DROPBOX = "dropbox"
class FileOrigin(str, Enum):
@@ -214,6 +217,7 @@ OAUTH_GOOGLE_DRIVE_CLIENT_SECRET = os.environ.get(
"OAUTH_GOOGLE_DRIVE_CLIENT_SECRET", ""
)
GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI = os.environ.get("GOOGLE_DRIVE_WEB_OAUTH_REDIRECT_URI", "http://localhost:9380/v1/connector/google-drive/oauth/web/callback")
GMAIL_WEB_OAUTH_REDIRECT_URI = os.environ.get("GMAIL_WEB_OAUTH_REDIRECT_URI", "http://localhost:9380/v1/connector/gmail/oauth/web/callback")
CONFLUENCE_OAUTH_TOKEN_URL = "https://auth.atlassian.com/oauth/token"
RATE_LIMIT_MESSAGE_LOWERCASE = "Rate limit exceeded".lower()

View File

@@ -1562,6 +1562,7 @@ class ConfluenceConnector(
size_bytes=len(page_content.encode("utf-8")),  # Calculate size in bytes
doc_updated_at=datetime_from_string(page["version"]["when"]),
primary_owners=primary_owners if primary_owners else None,
metadata=metadata if metadata else None,
)
except Exception as e:
logging.error(f"Error converting page {page.get('id', 'unknown')}: {e}")

View File

@@ -65,6 +65,7 @@ def _convert_message_to_document(
blob=message.content.encode("utf-8"),
extension=".txt",
size_bytes=len(message.content.encode("utf-8")),
metadata=metadata if metadata else None,
)

View File

@@ -1,13 +1,24 @@
"""Dropbox connector"""
import logging
from datetime import timezone
from typing import Any
from dropbox import Dropbox
from dropbox.exceptions import ApiError, AuthError
from dropbox.files import FileMetadata, FolderMetadata
- from common.data_source.config import INDEX_BATCH_SIZE
+ from common.data_source.config import INDEX_BATCH_SIZE, DocumentSource
- from common.data_source.exceptions import ConnectorValidationError, InsufficientPermissionsError, ConnectorMissingCredentialError
+ from common.data_source.exceptions import (
+     ConnectorMissingCredentialError,
+     ConnectorValidationError,
+     InsufficientPermissionsError,
+ )
from common.data_source.interfaces import LoadConnector, PollConnector, SecondsSinceUnixEpoch
from common.data_source.models import Document, GenerateDocumentsOutput
from common.data_source.utils import get_file_ext
logger = logging.getLogger(__name__)
class DropboxConnector(LoadConnector, PollConnector): class DropboxConnector(LoadConnector, PollConnector):
@@ -19,29 +30,29 @@ class DropboxConnector(LoadConnector, PollConnector):
def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
"""Load Dropbox credentials"""
- try:
access_token = credentials.get("dropbox_access_token")
if not access_token:
raise ConnectorMissingCredentialError("Dropbox access token is required")
self.dropbox_client = Dropbox(access_token)
return None
- except Exception as e:
-     raise ConnectorMissingCredentialError(f"Dropbox: {e}")
def validate_connector_settings(self) -> None:
"""Validate Dropbox connector settings"""
- if not self.dropbox_client:
+ if self.dropbox_client is None:
raise ConnectorMissingCredentialError("Dropbox")
try:
- # Test connection by getting current account info
- self.dropbox_client.users_get_current_account()
- except (AuthError, ApiError) as e:
-     if "invalid_access_token" in str(e).lower():
-         raise InsufficientPermissionsError("Invalid Dropbox access token")
-     else:
-         raise ConnectorValidationError(f"Dropbox validation error: {e}")
+ self.dropbox_client.files_list_folder(path="", limit=1)
+ except AuthError as e:
+     logger.exception("[Dropbox]: Failed to validate Dropbox credentials")
+     raise ConnectorValidationError(f"Dropbox credential is invalid: {e}")
+ except ApiError as e:
+     if e.error is not None and "insufficient_permissions" in str(e.error).lower():
+         raise InsufficientPermissionsError("Your Dropbox token does not have sufficient permissions.")
+     raise ConnectorValidationError(f"Unexpected Dropbox error during validation: {e.user_message_text or e}")
+ except Exception as e:
+     raise ConnectorValidationError(f"Unexpected error during Dropbox settings validation: {e}")
def _download_file(self, path: str) -> bytes:
"""Download a single file from Dropbox."""
@@ -54,26 +65,105 @@ class DropboxConnector(LoadConnector, PollConnector):
"""Create a shared link for a file in Dropbox."""
if self.dropbox_client is None:
raise ConnectorMissingCredentialError("Dropbox")
try:
- # Try to get existing shared links first
shared_links = self.dropbox_client.sharing_list_shared_links(path=path)
if shared_links.links:
return shared_links.links[0].url
- # Create a new shared link
- link_settings = self.dropbox_client.sharing_create_shared_link_with_settings(path)
- return link_settings.url
- except Exception:
-     # Fallback to basic link format
-     return f"https://www.dropbox.com/home{path}"
+ link_metadata = self.dropbox_client.sharing_create_shared_link_with_settings(path)
+ return link_metadata.url
+ except ApiError as err:
+     logger.exception(f"[Dropbox]: Failed to create a shared link for {path}: {err}")
+     return ""
def _yield_files_recursive(
self,
path: str,
start: SecondsSinceUnixEpoch | None,
end: SecondsSinceUnixEpoch | None,
) -> GenerateDocumentsOutput:
"""Yield files in batches from a specified Dropbox folder, including subfolders."""
if self.dropbox_client is None:
raise ConnectorMissingCredentialError("Dropbox")
result = self.dropbox_client.files_list_folder(
path,
limit=self.batch_size,
recursive=False,
include_non_downloadable_files=False,
)
while True:
batch: list[Document] = []
for entry in result.entries:
if isinstance(entry, FileMetadata):
modified_time = entry.client_modified
if modified_time.tzinfo is None:
modified_time = modified_time.replace(tzinfo=timezone.utc)
else:
modified_time = modified_time.astimezone(timezone.utc)
time_as_seconds = modified_time.timestamp()
if start is not None and time_as_seconds <= start:
continue
if end is not None and time_as_seconds > end:
continue
try:
downloaded_file = self._download_file(entry.path_display)
except Exception:
logger.exception(f"[Dropbox]: Error downloading file {entry.path_display}")
continue
batch.append(
Document(
id=f"dropbox:{entry.id}",
blob=downloaded_file,
source=DocumentSource.DROPBOX,
semantic_identifier=entry.name,
extension=get_file_ext(entry.name),
doc_updated_at=modified_time,
size_bytes=entry.size if getattr(entry, "size", None) is not None else len(downloaded_file),
)
)
elif isinstance(entry, FolderMetadata):
yield from self._yield_files_recursive(entry.path_lower, start, end)
if batch:
yield batch
if not result.has_more:
break
result = self.dropbox_client.files_list_folder_continue(result.cursor)
- def poll_source(self, start: SecondsSinceUnixEpoch, end: SecondsSinceUnixEpoch) -> Any:
+ def poll_source(self, start: SecondsSinceUnixEpoch, end: SecondsSinceUnixEpoch) -> GenerateDocumentsOutput:
"""Poll Dropbox for recent file changes"""
- # Simplified implementation - in production this would handle actual polling
- return []
+ if self.dropbox_client is None:
+     raise ConnectorMissingCredentialError("Dropbox")
+ for batch in self._yield_files_recursive("", start, end):
+     yield batch
- def load_from_state(self) -> Any:
+ def load_from_state(self) -> GenerateDocumentsOutput:
"""Load files from Dropbox state"""
- # Simplified implementation
- return []
+ return self._yield_files_recursive("", None, None)
if __name__ == "__main__":
import os
logging.basicConfig(level=logging.DEBUG)
connector = DropboxConnector()
connector.load_credentials({"dropbox_access_token": os.environ.get("DROPBOX_ACCESS_TOKEN")})
connector.validate_connector_settings()
document_batches = connector.load_from_state()
try:
first_batch = next(document_batches)
print(f"Loaded {len(first_batch)} documents in first batch.")
for doc in first_batch:
print(f"- {doc.semantic_identifier} ({doc.size_bytes} bytes)")
except StopIteration:
print("No documents available in Dropbox.")

View File

@@ -1,6 +1,6 @@
import logging
import os
from typing import Any
from google.oauth2.credentials import Credentials as OAuthCredentials
from google.oauth2.service_account import Credentials as ServiceAccountCredentials
from googleapiclient.errors import HttpError
@@ -9,10 +9,10 @@ from common.data_source.config import INDEX_BATCH_SIZE, SLIM_BATCH_SIZE, Documen
from common.data_source.google_util.auth import get_google_creds
from common.data_source.google_util.constant import DB_CREDENTIALS_PRIMARY_ADMIN_KEY, MISSING_SCOPES_ERROR_STR, SCOPE_INSTRUCTIONS, USER_FIELDS
from common.data_source.google_util.resource import get_admin_service, get_gmail_service
- from common.data_source.google_util.util import _execute_single_retrieval, execute_paginated_retrieval
+ from common.data_source.google_util.util import _execute_single_retrieval, execute_paginated_retrieval, sanitize_filename, clean_string
from common.data_source.interfaces import LoadConnector, PollConnector, SecondsSinceUnixEpoch, SlimConnectorWithPermSync
from common.data_source.models import BasicExpertInfo, Document, ExternalAccess, GenerateDocumentsOutput, GenerateSlimDocumentOutput, SlimDocument, TextSection
- from common.data_source.utils import build_time_range_query, clean_email_and_extract_name, get_message_body, is_mail_service_disabled_error, time_str_to_utc
+ from common.data_source.utils import build_time_range_query, clean_email_and_extract_name, get_message_body, is_mail_service_disabled_error, gmail_time_str_to_utc
# Constants for Gmail API fields
THREAD_LIST_FIELDS = "nextPageToken, threads(id)"
@@ -67,7 +67,6 @@ def message_to_section(message: dict[str, Any]) -> tuple[TextSection, dict[str,
message_data += f"{name}: {value}\n"
message_body_text: str = get_message_body(payload)
return TextSection(link=link, text=message_body_text + message_data), metadata
@@ -97,13 +96,15 @@ def thread_to_document(full_thread: dict[str, Any], email_used_to_fetch_thread:
if not semantic_identifier:
semantic_identifier = message_metadata.get("subject", "")
semantic_identifier = clean_string(semantic_identifier)
semantic_identifier = sanitize_filename(semantic_identifier)
if message_metadata.get("updated_at"):
updated_at = message_metadata.get("updated_at")
updated_at_datetime = None
if updated_at:
- updated_at_datetime = time_str_to_utc(updated_at)
+ updated_at_datetime = gmail_time_str_to_utc(updated_at)
thread_id = full_thread.get("id")
if not thread_id:
@@ -115,15 +116,24 @@ def thread_to_document(full_thread: dict[str, Any], email_used_to_fetch_thread:
if not semantic_identifier:
semantic_identifier = "(no subject)"
combined_sections = "\n\n".join(
sec.text for sec in sections if hasattr(sec, "text")
)
blob = combined_sections
size_bytes = len(blob)
extension = '.txt'
return Document(
id=thread_id,
semantic_identifier=semantic_identifier,
- sections=sections,
+ blob=blob,
size_bytes=size_bytes,
extension=extension,
source=DocumentSource.GMAIL,
primary_owners=primary_owners,
secondary_owners=secondary_owners,
doc_updated_at=updated_at_datetime,
- metadata={},
+ metadata=message_metadata,
external_access=ExternalAccess(
external_user_emails={email_used_to_fetch_thread},
external_user_group_ids=set(),
@@ -214,15 +224,13 @@ class GmailConnector(LoadConnector, PollConnector, SlimConnectorWithPermSync):
q=query,
continue_on_404_or_403=True,
):
- full_threads = _execute_single_retrieval(
+ full_thread = _execute_single_retrieval(
retrieval_function=gmail_service.users().threads().get,
- list_key=None,
userId=user_email,
fields=THREAD_FIELDS,
id=thread["id"],
continue_on_404_or_403=True,
)
- full_thread = list(full_threads)[0]
doc = thread_to_document(full_thread, user_email)
if doc is None:
continue
@@ -310,4 +318,30 @@ class GmailConnector(LoadConnector, PollConnector, SlimConnectorWithPermSync):
if __name__ == "__main__":
- pass
+ import time
import os
from common.data_source.google_util.util import get_credentials_from_env
logging.basicConfig(level=logging.INFO)
try:
email = os.environ.get("GMAIL_TEST_EMAIL", "newyorkupperbay@gmail.com")
creds = get_credentials_from_env(email, oauth=True, source="gmail")
print("Credentials loaded successfully")
print(f"{creds=}")
connector = GmailConnector(batch_size=2)
print("GmailConnector initialized")
connector.load_credentials(creds)
print("Credentials loaded into connector")
print("Gmail is ready to use")
for file in connector._fetch_threads(
int(time.time()) - 1 * 24 * 60 * 60,
int(time.time()),
):
print("new batch","-"*80)
for f in file:
print(f)
print("\n\n")
except Exception as e:
logging.exception(f"Error loading credentials: {e}")

View File

@@ -1,7 +1,6 @@
"""Google Drive connector"""
import copy
- import json
import logging
import os
import sys
@@ -32,7 +31,6 @@ from common.data_source.google_drive.file_retrieval import (
from common.data_source.google_drive.model import DriveRetrievalStage, GoogleDriveCheckpoint, GoogleDriveFileType, RetrievedDriveFile, StageCompletion
from common.data_source.google_util.auth import get_google_creds
from common.data_source.google_util.constant import DB_CREDENTIALS_PRIMARY_ADMIN_KEY, MISSING_SCOPES_ERROR_STR, USER_FIELDS
- from common.data_source.google_util.oauth_flow import ensure_oauth_token_dict
from common.data_source.google_util.resource import GoogleDriveService, get_admin_service, get_drive_service
from common.data_source.google_util.util import GoogleFields, execute_paginated_retrieval, get_file_owners
from common.data_source.google_util.util_threadpool_concurrency import ThreadSafeDict
@@ -1138,39 +1136,6 @@ class GoogleDriveConnector(SlimConnectorWithPermSync, CheckpointedConnectorWithP
return GoogleDriveCheckpoint.model_validate_json(checkpoint_json)
def get_credentials_from_env(email: str, oauth: bool = False) -> dict:
try:
if oauth:
raw_credential_string = os.environ["GOOGLE_DRIVE_OAUTH_CREDENTIALS_JSON_STR"]
else:
raw_credential_string = os.environ["GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON_STR"]
except KeyError:
raise ValueError("Missing Google Drive credentials in environment variables")
try:
credential_dict = json.loads(raw_credential_string)
except json.JSONDecodeError:
raise ValueError("Invalid JSON in Google Drive credentials")
if oauth:
credential_dict = ensure_oauth_token_dict(credential_dict, DocumentSource.GOOGLE_DRIVE)
refried_credential_string = json.dumps(credential_dict)
DB_CREDENTIALS_DICT_TOKEN_KEY = "google_tokens"
DB_CREDENTIALS_DICT_SERVICE_ACCOUNT_KEY = "google_service_account_key"
DB_CREDENTIALS_PRIMARY_ADMIN_KEY = "google_primary_admin"
DB_CREDENTIALS_AUTHENTICATION_METHOD = "authentication_method"
cred_key = DB_CREDENTIALS_DICT_TOKEN_KEY if oauth else DB_CREDENTIALS_DICT_SERVICE_ACCOUNT_KEY
return {
cred_key: refried_credential_string,
DB_CREDENTIALS_PRIMARY_ADMIN_KEY: email,
DB_CREDENTIALS_AUTHENTICATION_METHOD: "uploaded",
}
class CheckpointOutputWrapper:
"""
Wraps a CheckpointOutput generator to give things back in a more digestible format.
@@ -1236,7 +1201,7 @@ def yield_all_docs_from_checkpoint_connector(
if __name__ == "__main__":
import time
from common.data_source.google_util.util import get_credentials_from_env
logging.basicConfig(level=logging.DEBUG)
try:
@@ -1245,7 +1210,7 @@ if __name__ == "__main__":
creds = get_credentials_from_env(email, oauth=True)
print("Credentials loaded successfully")
print(f"{creds=}")
sys.exit(0)
connector = GoogleDriveConnector(
include_shared_drives=False,
shared_drive_urls=None,

View File

@@ -49,11 +49,11 @@ MISSING_SCOPES_ERROR_STR = "client not authorized for any of the scopes requeste
SCOPE_INSTRUCTIONS = ""
- GOOGLE_DRIVE_WEB_OAUTH_POPUP_TEMPLATE = """<!DOCTYPE html>
+ GOOGLE_WEB_OAUTH_POPUP_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
- <title>Google Drive Authorization</title>
+ <title>{title}</title>
<style>
body {{
font-family: Arial, sans-serif;

View File

@@ -1,12 +1,17 @@
import json
import logging
import os
import re
import socket
from collections.abc import Callable, Iterator
from enum import Enum
from typing import Any
import unicodedata
from googleapiclient.errors import HttpError  # type: ignore
from common.data_source.config import DocumentSource
from common.data_source.google_drive.model import GoogleDriveFileType
from common.data_source.google_util.oauth_flow import ensure_oauth_token_dict
# See https://developers.google.com/drive/api/reference/rest/v3/files/list for more
@@ -117,6 +122,7 @@ def _execute_single_retrieval(
"""Execute a single retrieval from Google Drive API"""
try:
results = retrieval_function(**request_kwargs).execute()
except HttpError as e:
if e.resp.status >= 500:
results = retrieval_function()
@@ -148,5 +154,110 @@ def _execute_single_retrieval(
error,
)
results = retrieval_function()
return results
def get_credentials_from_env(email: str, oauth: bool = False, source="drive") -> dict:
try:
if oauth:
raw_credential_string = os.environ["GOOGLE_OAUTH_CREDENTIALS_JSON_STR"]
else:
raw_credential_string = os.environ["GOOGLE_SERVICE_ACCOUNT_JSON_STR"]
except KeyError:
raise ValueError("Missing Google Drive credentials in environment variables")
try:
credential_dict = json.loads(raw_credential_string)
except json.JSONDecodeError:
raise ValueError("Invalid JSON in Google Drive credentials")
if oauth and source == "drive":
credential_dict = ensure_oauth_token_dict(credential_dict, DocumentSource.GOOGLE_DRIVE)
else:
credential_dict = ensure_oauth_token_dict(credential_dict, DocumentSource.GMAIL)
refried_credential_string = json.dumps(credential_dict)
DB_CREDENTIALS_DICT_TOKEN_KEY = "google_tokens"
DB_CREDENTIALS_DICT_SERVICE_ACCOUNT_KEY = "google_service_account_key"
DB_CREDENTIALS_PRIMARY_ADMIN_KEY = "google_primary_admin"
DB_CREDENTIALS_AUTHENTICATION_METHOD = "authentication_method"
cred_key = DB_CREDENTIALS_DICT_TOKEN_KEY if oauth else DB_CREDENTIALS_DICT_SERVICE_ACCOUNT_KEY
return {
cred_key: refried_credential_string,
DB_CREDENTIALS_PRIMARY_ADMIN_KEY: email,
DB_CREDENTIALS_AUTHENTICATION_METHOD: "uploaded",
}
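A hedged sketch of driving this helper from the environment; the JSON blob is a placeholder (it must contain whatever ensure_oauth_token_dict expects) and the email address is illustrative:

import json
import os

# OAuth path; service accounts would set GOOGLE_SERVICE_ACCOUNT_JSON_STR instead.
os.environ["GOOGLE_OAUTH_CREDENTIALS_JSON_STR"] = json.dumps({"token": "..."})  # placeholder

creds = get_credentials_from_env("admin@example.com", oauth=True, source="gmail")
# roughly -> {"google_tokens": "...", "google_primary_admin": "admin@example.com", "authentication_method": "uploaded"}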
def sanitize_filename(name: str) -> str:
"""
Soft sanitize for MinIO/S3:
- Replace only prohibited characters with a space.
- Preserve readability (no ugly underscores).
- Collapse multiple spaces.
"""
if name is None:
return "file.txt"
name = str(name).strip()
# Characters that MUST NOT appear in S3/MinIO object keys
# Replace them with a space (not underscore)
forbidden = r'[\\\?\#\%\*\:\|\<\>"]'
name = re.sub(forbidden, " ", name)
# Replace slashes "/" (S3 interprets as folder) with space
name = name.replace("/", " ")
# Collapse multiple spaces into one
name = re.sub(r"\s+", " ", name)
# Trim both ends
name = name.strip()
# Enforce reasonable max length
if len(name) > 200:
base, ext = os.path.splitext(name)
name = base[:180].rstrip() + ext
# Ensure there is an extension (your original logic)
if not os.path.splitext(name)[1]:
name += ".txt"
return name
def clean_string(text: str | None) -> str | None:
"""
Clean a string to make it safe for insertion into MySQL (utf8mb4).
- Normalize Unicode
- Remove control characters / zero-width characters
- Optionally remove high-plane emoji and symbols
"""
if text is None:
return None
# 0. Ensure the value is a string
text = str(text)
# 1. Normalize Unicode (NFC)
text = unicodedata.normalize("NFC", text)
# 2. Remove ASCII control characters (except tab, newline, carriage return)
text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
# 3. Remove zero-width characters / BOM
text = re.sub(r"[\u200b-\u200d\uFEFF]", "", text)
# 4. Remove high Unicode characters (emoji, special symbols)
text = re.sub(r"[\U00010000-\U0010FFFF]", "", text)
# 5. Final fallback: strip any invalid UTF-8 sequences
try:
text.encode("utf-8")
except UnicodeEncodeError:
text = text.encode("utf-8", errors="ignore").decode("utf-8")
return text
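Roughly how the two helpers behave on a messy Gmail subject line; a hedged sketch, with the expected results shown in comments according to the rules above:

subject = "Re: Q3 *final* report?\u200b"

cleaned = clean_string(subject)       # zero-width space stripped -> "Re: Q3 *final* report?"
safe = sanitize_filename(cleaned)     # ':', '*', '?' become spaces, runs of spaces collapse,
                                      # and a missing extension gets ".txt" appended
# safe is roughly "Re Q3 final report.txt"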

View File

@@ -30,7 +30,6 @@ class LoadConnector(ABC):
"""Load documents from state"""
pass
- @abstractmethod
def validate_connector_settings(self) -> None:
"""Validate connector settings"""
pass
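Since validate_connector_settings is no longer abstract, a subclass may rely on the inherited no-op. A hedged sketch, assuming load_from_state is the only abstract member visible in this fragment (any other abstract methods on LoadConnector would still need implementations):

class StaticConnector(LoadConnector):
    # Hypothetical connector used only to illustrate the relaxed interface.
    def load_from_state(self):
        yield []  # no documents

connector = StaticConnector()
connector.validate_connector_settings()  # inherited default instead of a mandatory override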

View File

@@ -94,6 +94,7 @@ class Document(BaseModel):
blob: bytes
doc_updated_at: datetime
size_bytes: int
metadata: Optional[dict[str, Any]] = None
class BasicExpertInfo(BaseModel):
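A hedged construction sketch for the extended Document model above, filling only the fields visible in this diff plus the id/source/identifier/extension fields the connectors pass; all values are placeholders, and any further required fields the model defines would also need values:

from datetime import datetime, timezone

blob = b"%PDF-1.7 ..."  # placeholder file bytes
doc = Document(
    id="moodle_resource_42",
    source="moodle",
    semantic_identifier="Course / Section / syllabus.pdf",
    extension=".pdf",
    blob=blob,
    doc_updated_at=datetime.now(timezone.utc),
    size_bytes=len(blob),
    metadata={"course_id": 7, "module_type": "resource"},  # the new optional field
)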

View File

@@ -0,0 +1,539 @@
from __future__ import annotations
import logging
import os
from collections.abc import Generator
from datetime import datetime, timezone
from retry import retry
from typing import Any, Optional
from markdownify import markdownify as md
from moodle import Moodle as MoodleClient, MoodleException
from common.data_source.config import INDEX_BATCH_SIZE
from common.data_source.exceptions import (
ConnectorMissingCredentialError,
CredentialExpiredError,
InsufficientPermissionsError,
ConnectorValidationError,
)
from common.data_source.interfaces import (
LoadConnector,
PollConnector,
SecondsSinceUnixEpoch,
)
from common.data_source.models import Document
from common.data_source.utils import batch_generator, rl_requests
logger = logging.getLogger(__name__)
class MoodleConnector(LoadConnector, PollConnector):
"""Moodle LMS connector for accessing course content"""
def __init__(self, moodle_url: str, batch_size: int = INDEX_BATCH_SIZE) -> None:
self.moodle_url = moodle_url.rstrip("/")
self.batch_size = batch_size
self.moodle_client: Optional[MoodleClient] = None
def _add_token_to_url(self, file_url: str) -> str:
"""Append Moodle token to URL if missing"""
if not self.moodle_client:
return file_url
token = getattr(self.moodle_client, "token", "")
if "token=" in file_url.lower():
return file_url
delimiter = "&" if "?" in file_url else "?"
return f"{file_url}{delimiter}token={token}"
def _log_error(
self, context: str, error: Exception, level: str = "warning"
) -> None:
"""Simplified logging wrapper"""
msg = f"{context}: {error}"
if level == "error":
logger.error(msg)
else:
logger.warning(msg)
def _get_latest_timestamp(self, *timestamps: int) -> int:
"""Return latest valid timestamp"""
return max((t for t in timestamps if t and t > 0), default=0)
def _yield_in_batches(
self, generator: Generator[Document, None, None]
) -> Generator[list[Document], None, None]:
for batch in batch_generator(generator, self.batch_size):
yield batch
def load_credentials(self, credentials: dict[str, Any]) -> None:
token = credentials.get("moodle_token")
if not token:
raise ConnectorMissingCredentialError("Moodle API token is required")
try:
self.moodle_client = MoodleClient(
self.moodle_url + "/webservice/rest/server.php", token
)
self.moodle_client.core.webservice.get_site_info()
except MoodleException as e:
if "invalidtoken" in str(e).lower():
raise CredentialExpiredError("Moodle token is invalid or expired")
raise ConnectorMissingCredentialError(
f"Failed to initialize Moodle client: {e}"
)
def validate_connector_settings(self) -> None:
if not self.moodle_client:
raise ConnectorMissingCredentialError("Moodle client not initialized")
try:
site_info = self.moodle_client.core.webservice.get_site_info()
if not site_info.sitename:
raise InsufficientPermissionsError("Invalid Moodle API response")
except MoodleException as e:
msg = str(e).lower()
if "invalidtoken" in msg:
raise CredentialExpiredError("Moodle token is invalid or expired")
if "accessexception" in msg:
raise InsufficientPermissionsError(
"Insufficient permissions. Ensure web services are enabled and permissions are correct."
)
raise ConnectorValidationError(f"Moodle validation error: {e}")
except Exception as e:
raise ConnectorValidationError(f"Unexpected validation error: {e}")
# -------------------------------------------------------------------------
# Data loading & polling
# -------------------------------------------------------------------------
def load_from_state(self) -> Generator[list[Document], None, None]:
if not self.moodle_client:
raise ConnectorMissingCredentialError("Moodle client not initialized")
logger.info("Starting full load from Moodle workspace")
courses = self._get_enrolled_courses()
if not courses:
logger.warning("No courses found to process")
return
yield from self._yield_in_batches(self._process_courses(courses))
def poll_source(
self, start: SecondsSinceUnixEpoch, end: SecondsSinceUnixEpoch
) -> Generator[list[Document], None, None]:
if not self.moodle_client:
raise ConnectorMissingCredentialError("Moodle client not initialized")
logger.info(
f"Polling Moodle updates between {datetime.fromtimestamp(start)} and {datetime.fromtimestamp(end)}"
)
courses = self._get_enrolled_courses()
if not courses:
logger.warning("No courses found to poll")
return
yield from self._yield_in_batches(
self._get_updated_content(courses, start, end)
)
@retry(tries=3, delay=1, backoff=2)
def _get_enrolled_courses(self) -> list:
if not self.moodle_client:
raise ConnectorMissingCredentialError("Moodle client not initialized")
try:
return self.moodle_client.core.course.get_courses()
except MoodleException as e:
self._log_error("fetching courses", e, "error")
raise ConnectorValidationError(f"Failed to fetch courses: {e}")
@retry(tries=3, delay=1, backoff=2)
def _get_course_contents(self, course_id: int):
if not self.moodle_client:
raise ConnectorMissingCredentialError("Moodle client not initialized")
try:
return self.moodle_client.core.course.get_contents(courseid=course_id)
except MoodleException as e:
self._log_error(f"fetching course contents for {course_id}", e)
return []
def _process_courses(self, courses) -> Generator[Document, None, None]:
for course in courses:
try:
contents = self._get_course_contents(course.id)
for section in contents:
for module in section.modules:
doc = self._process_module(course, section, module)
if doc:
yield doc
except Exception as e:
self._log_error(f"processing course {course.fullname}", e)
def _get_updated_content(
self, courses, start: float, end: float
) -> Generator[Document, None, None]:
for course in courses:
try:
contents = self._get_course_contents(course.id)
for section in contents:
for module in section.modules:
times = [
getattr(module, "timecreated", 0),
getattr(module, "timemodified", 0),
]
if hasattr(module, "contents"):
times.extend(
getattr(c, "timemodified", 0)
for c in module.contents
if c and getattr(c, "timemodified", 0)
)
last_mod = self._get_latest_timestamp(*times)
if start < last_mod <= end:
doc = self._process_module(course, section, module)
if doc:
yield doc
except Exception as e:
self._log_error(f"polling course {course.fullname}", e)
def _process_module(self, course, section, module) -> Optional[Document]:
try:
mtype = module.modname
if mtype in ["label", "url"]:
return None
if mtype == "resource":
return self._process_resource(course, section, module)
if mtype == "forum":
return self._process_forum(course, section, module)
if mtype == "page":
return self._process_page(course, section, module)
if mtype in ["assign", "quiz"]:
return self._process_activity(course, section, module)
if mtype == "book":
return self._process_book(course, section, module)
except Exception as e:
self._log_error(f"processing module {getattr(module, 'name', '?')}", e)
return None
def _process_resource(self, course, section, module) -> Optional[Document]:
if not getattr(module, "contents", None):
return None
file_info = module.contents[0]
if not getattr(file_info, "fileurl", None):
return None
file_name = os.path.basename(file_info.filename)
ts = self._get_latest_timestamp(
getattr(module, "timecreated", 0),
getattr(module, "timemodified", 0),
getattr(file_info, "timemodified", 0),
)
try:
resp = rl_requests.get(
self._add_token_to_url(file_info.fileurl), timeout=60
)
resp.raise_for_status()
blob = resp.content
ext = os.path.splitext(file_name)[1] or ".bin"
semantic_id = f"{course.fullname} / {section.name} / {file_name}"
# Create metadata dictionary with relevant information
metadata = {
"moodle_url": self.moodle_url,
"course_id": getattr(course, "id", None),
"course_name": getattr(course, "fullname", None),
"course_shortname": getattr(course, "shortname", None),
"section_id": getattr(section, "id", None),
"section_name": getattr(section, "name", None),
"section_number": getattr(section, "section", None),
"module_id": getattr(module, "id", None),
"module_name": getattr(module, "name", None),
"module_type": getattr(module, "modname", None),
"module_instance": getattr(module, "instance", None),
"file_url": getattr(file_info, "fileurl", None),
"file_name": file_name,
"file_size": getattr(file_info, "filesize", len(blob)),
"file_type": getattr(file_info, "mimetype", None),
"time_created": getattr(module, "timecreated", None),
"time_modified": getattr(module, "timemodified", None),
"visible": getattr(module, "visible", None),
"groupmode": getattr(module, "groupmode", None),
}
return Document(
id=f"moodle_resource_{module.id}",
source="moodle",
semantic_identifier=semantic_id,
extension=ext,
blob=blob,
doc_updated_at=datetime.fromtimestamp(ts or 0, tz=timezone.utc),
size_bytes=len(blob),
metadata=metadata,
)
except Exception as e:
self._log_error(f"downloading resource {file_name}", e, "error")
return None
def _process_forum(self, course, section, module) -> Optional[Document]:
if not self.moodle_client or not getattr(module, "instance", None):
return None
try:
result = self.moodle_client.mod.forum.get_forum_discussions(
forumid=module.instance
)
disc_list = getattr(result, "discussions", [])
if not disc_list:
return None
markdown = [f"# {module.name}\n"]
latest_ts = self._get_latest_timestamp(
getattr(module, "timecreated", 0),
getattr(module, "timemodified", 0),
)
for d in disc_list:
markdown.append(f"## {d.name}\n\n{md(d.message or '')}\n\n---\n")
latest_ts = max(latest_ts, getattr(d, "timemodified", 0))
blob = "\n".join(markdown).encode("utf-8")
semantic_id = f"{course.fullname} / {section.name} / {module.name}"
# Create metadata dictionary with relevant information
metadata = {
"moodle_url": self.moodle_url,
"course_id": getattr(course, "id", None),
"course_name": getattr(course, "fullname", None),
"course_shortname": getattr(course, "shortname", None),
"section_id": getattr(section, "id", None),
"section_name": getattr(section, "name", None),
"section_number": getattr(section, "section", None),
"module_id": getattr(module, "id", None),
"module_name": getattr(module, "name", None),
"module_type": getattr(module, "modname", None),
"forum_id": getattr(module, "instance", None),
"discussion_count": len(disc_list),
"time_created": getattr(module, "timecreated", None),
"time_modified": getattr(module, "timemodified", None),
"visible": getattr(module, "visible", None),
"groupmode": getattr(module, "groupmode", None),
"discussions": [
{
"id": getattr(d, "id", None),
"name": getattr(d, "name", None),
"user_id": getattr(d, "userid", None),
"user_fullname": getattr(d, "userfullname", None),
"time_created": getattr(d, "timecreated", None),
"time_modified": getattr(d, "timemodified", None),
}
for d in disc_list
],
}
return Document(
id=f"moodle_forum_{module.id}",
source="moodle",
semantic_identifier=semantic_id,
extension=".md",
blob=blob,
doc_updated_at=datetime.fromtimestamp(latest_ts or 0, tz=timezone.utc),
size_bytes=len(blob),
metadata=metadata,
)
except Exception as e:
self._log_error(f"processing forum {module.name}", e)
return None
def _process_page(self, course, section, module) -> Optional[Document]:
if not getattr(module, "contents", None):
return None
file_info = module.contents[0]
if not getattr(file_info, "fileurl", None):
return None
file_name = os.path.basename(file_info.filename)
ts = self._get_latest_timestamp(
getattr(module, "timecreated", 0),
getattr(module, "timemodified", 0),
getattr(file_info, "timemodified", 0),
)
try:
resp = rl_requests.get(
self._add_token_to_url(file_info.fileurl), timeout=60
)
resp.raise_for_status()
blob = resp.content
ext = os.path.splitext(file_name)[1] or ".html"
semantic_id = f"{course.fullname} / {section.name} / {module.name}"
# Create metadata dictionary with relevant information
metadata = {
"moodle_url": self.moodle_url,
"course_id": getattr(course, "id", None),
"course_name": getattr(course, "fullname", None),
"course_shortname": getattr(course, "shortname", None),
"section_id": getattr(section, "id", None),
"section_name": getattr(section, "name", None),
"section_number": getattr(section, "section", None),
"module_id": getattr(module, "id", None),
"module_name": getattr(module, "name", None),
"module_type": getattr(module, "modname", None),
"module_instance": getattr(module, "instance", None),
"page_url": getattr(file_info, "fileurl", None),
"file_name": file_name,
"file_size": getattr(file_info, "filesize", len(blob)),
"file_type": getattr(file_info, "mimetype", None),
"time_created": getattr(module, "timecreated", None),
"time_modified": getattr(module, "timemodified", None),
"visible": getattr(module, "visible", None),
"groupmode": getattr(module, "groupmode", None),
}
return Document(
id=f"moodle_page_{module.id}",
source="moodle",
semantic_identifier=semantic_id,
extension=ext,
blob=blob,
doc_updated_at=datetime.fromtimestamp(ts or 0, tz=timezone.utc),
size_bytes=len(blob),
metadata=metadata,
)
except Exception as e:
self._log_error(f"processing page {file_name}", e, "error")
return None
def _process_activity(self, course, section, module) -> Optional[Document]:
desc = getattr(module, "description", "")
if not desc:
return None
mtype, mname = module.modname, module.name
markdown = f"# {mname}\n\n**Type:** {mtype.capitalize()}\n\n{md(desc)}"
ts = self._get_latest_timestamp(
getattr(module, "timecreated", 0),
getattr(module, "timemodified", 0),
getattr(module, "added", 0),
)
semantic_id = f"{course.fullname} / {section.name} / {mname}"
blob = markdown.encode("utf-8")
# Create metadata dictionary with relevant information
metadata = {
"moodle_url": self.moodle_url,
"course_id": getattr(course, "id", None),
"course_name": getattr(course, "fullname", None),
"course_shortname": getattr(course, "shortname", None),
"section_id": getattr(section, "id", None),
"section_name": getattr(section, "name", None),
"section_number": getattr(section, "section", None),
"module_id": getattr(module, "id", None),
"module_name": getattr(module, "name", None),
"module_type": getattr(module, "modname", None),
"activity_type": mtype,
"activity_instance": getattr(module, "instance", None),
"description": desc,
"time_created": getattr(module, "timecreated", None),
"time_modified": getattr(module, "timemodified", None),
"added": getattr(module, "added", None),
"visible": getattr(module, "visible", None),
"groupmode": getattr(module, "groupmode", None),
}
return Document(
id=f"moodle_{mtype}_{module.id}",
source="moodle",
semantic_identifier=semantic_id,
extension=".md",
blob=blob,
doc_updated_at=datetime.fromtimestamp(ts or 0, tz=timezone.utc),
size_bytes=len(blob),
metadata=metadata,
)
def _process_book(self, course, section, module) -> Optional[Document]:
if not getattr(module, "contents", None):
return None
contents = module.contents
chapters = [
c
for c in contents
if getattr(c, "fileurl", None)
and os.path.basename(c.filename) == "index.html"
]
if not chapters:
return None
latest_ts = self._get_latest_timestamp(
getattr(module, "timecreated", 0),
getattr(module, "timemodified", 0),
*[getattr(c, "timecreated", 0) for c in contents],
*[getattr(c, "timemodified", 0) for c in contents],
)
markdown_parts = [f"# {module.name}\n"]
chapter_info = []
for ch in chapters:
try:
resp = rl_requests.get(self._add_token_to_url(ch.fileurl), timeout=60)
resp.raise_for_status()
html = resp.content.decode("utf-8", errors="ignore")
markdown_parts.append(md(html) + "\n\n---\n")
# Collect chapter information for metadata
chapter_info.append(
{
"chapter_id": getattr(ch, "chapterid", None),
"title": getattr(ch, "title", None),
"filename": getattr(ch, "filename", None),
"fileurl": getattr(ch, "fileurl", None),
"time_created": getattr(ch, "timecreated", None),
"time_modified": getattr(ch, "timemodified", None),
"size": getattr(ch, "filesize", None),
}
)
except Exception as e:
self._log_error(f"processing book chapter {ch.filename}", e)
blob = "\n".join(markdown_parts).encode("utf-8")
semantic_id = f"{course.fullname} / {section.name} / {module.name}"
# Create metadata dictionary with relevant information
metadata = {
"moodle_url": self.moodle_url,
"course_id": getattr(course, "id", None),
"course_name": getattr(course, "fullname", None),
"course_shortname": getattr(course, "shortname", None),
"section_id": getattr(section, "id", None),
"section_name": getattr(section, "name", None),
"section_number": getattr(section, "section", None),
"module_id": getattr(module, "id", None),
"module_name": getattr(module, "name", None),
"module_type": getattr(module, "modname", None),
"book_id": getattr(module, "instance", None),
"chapter_count": len(chapters),
"chapters": chapter_info,
"time_created": getattr(module, "timecreated", None),
"time_modified": getattr(module, "timemodified", None),
"visible": getattr(module, "visible", None),
"groupmode": getattr(module, "groupmode", None),
}
return Document(
id=f"moodle_book_{module.id}",
source="moodle",
semantic_identifier=semantic_id,
extension=".md",
blob=blob,
doc_updated_at=datetime.fromtimestamp(latest_ts or 0, tz=timezone.utc),
size_bytes=len(blob),
metadata=metadata,
)
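A hedged end-to-end sketch of using the connector above; the site URL, token environment variable, and time window are illustrative:

import os
import time

connector = MoodleConnector(moodle_url="https://moodle.example.edu")
connector.load_credentials({"moodle_token": os.environ["MOODLE_WS_TOKEN"]})  # assumed env var name
connector.validate_connector_settings()

# Poll the last week of changes; each yielded batch is a list of Document objects
# whose metadata now carries course, section, and module details.
end = time.time()
for batch in connector.poll_source(end - 7 * 24 * 3600, end):
    for doc in batch:
        print(doc.semantic_identifier, (doc.metadata or {}).get("module_type"))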

View File

@@ -1,38 +1,45 @@
import html
import logging
from collections.abc import Generator
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional
from urllib.parse import urlparse
from retry import retry
- from common.data_source.config import (
-     INDEX_BATCH_SIZE,
-     DocumentSource, NOTION_CONNECTOR_DISABLE_RECURSIVE_PAGE_LOOKUP
- )
- from common.data_source.interfaces import (
-     LoadConnector,
-     PollConnector,
-     SecondsSinceUnixEpoch
- )
- from common.data_source.models import (
-     Document,
-     TextSection, GenerateDocumentsOutput
- )
- from common.data_source.exceptions import (
-     ConnectorValidationError,
-     CredentialExpiredError,
-     InsufficientPermissionsError,
-     UnexpectedValidationError, ConnectorMissingCredentialError
- )
- from common.data_source.models import (
-     NotionPage,
-     NotionBlock,
-     NotionSearchResponse
- )
- from common.data_source.utils import (
-     batch_generator,
-     fetch_notion_data,
-     properties_to_str,
-     filter_pages_by_time, datetime_from_string
- )
+ from common.data_source.config import (
+     INDEX_BATCH_SIZE,
+     NOTION_CONNECTOR_DISABLE_RECURSIVE_PAGE_LOOKUP,
+     DocumentSource,
+ )
+ from common.data_source.exceptions import (
+     ConnectorMissingCredentialError,
+     ConnectorValidationError,
+     CredentialExpiredError,
+     InsufficientPermissionsError,
+     UnexpectedValidationError,
+ )
+ from common.data_source.interfaces import (
+     LoadConnector,
+     PollConnector,
+     SecondsSinceUnixEpoch,
+ )
+ from common.data_source.models import (
+     Document,
+     GenerateDocumentsOutput,
+     NotionBlock,
+     NotionPage,
+     NotionSearchResponse,
+     TextSection,
+ )
+ from common.data_source.utils import (
+     batch_generator,
+     datetime_from_string,
+     fetch_notion_data,
+     filter_pages_by_time,
+     properties_to_str,
+     rl_requests,
+ )
@@ -61,11 +68,9 @@ class NotionConnector(LoadConnector, PollConnector):
self.recursive_index_enabled = recursive_index_enabled or bool(root_page_id)
@retry(tries=3, delay=1, backoff=2)
- def _fetch_child_blocks(
-     self, block_id: str, cursor: Optional[str] = None
- ) -> dict[str, Any] | None:
+ def _fetch_child_blocks(self, block_id: str, cursor: Optional[str] = None) -> dict[str, Any] | None:
"""Fetch all child blocks via the Notion API."""
- logging.debug(f"Fetching children of block with ID '{block_id}'")
+ logging.debug(f"[Notion]: Fetching children of block with ID {block_id}")
block_url = f"https://api.notion.com/v1/blocks/{block_id}/children"
query_params = {"start_cursor": cursor} if cursor else None
@@ -79,49 +84,42 @@ class NotionConnector(LoadConnector, PollConnector):
response.raise_for_status()
return response.json()
except Exception as e:
- if hasattr(e, 'response') and e.response.status_code == 404:
-     logging.error(
-         f"Unable to access block with ID '{block_id}'. "
-         f"This is likely due to the block not being shared with the integration."
-     )
+ if hasattr(e, "response") and e.response.status_code == 404:
+     logging.error(f"[Notion]: Unable to access block with ID {block_id}. This is likely due to the block not being shared with the integration.")
return None
else:
- logging.exception(f"Error fetching blocks: {e}")
+ logging.exception(f"[Notion]: Error fetching blocks: {e}")
raise
@retry(tries=3, delay=1, backoff=2)
def _fetch_page(self, page_id: str) -> NotionPage:
"""Fetch a page from its ID via the Notion API."""
- logging.debug(f"Fetching page for ID '{page_id}'")
+ logging.debug(f"[Notion]: Fetching page for ID {page_id}")
page_url = f"https://api.notion.com/v1/pages/{page_id}"
try:
data = fetch_notion_data(page_url, self.headers, "GET")
return NotionPage(**data)
except Exception as e:
- logging.warning(f"Failed to fetch page, trying database for ID '{page_id}': {e}")
+ logging.warning(f"[Notion]: Failed to fetch page, trying database for ID {page_id}: {e}")
return self._fetch_database_as_page(page_id)
@retry(tries=3, delay=1, backoff=2)
def _fetch_database_as_page(self, database_id: str) -> NotionPage:
"""Attempt to fetch a database as a page."""
- logging.debug(f"Fetching database for ID '{database_id}' as a page")
+ logging.debug(f"[Notion]: Fetching database for ID {database_id} as a page")
database_url = f"https://api.notion.com/v1/databases/{database_id}"
data = fetch_notion_data(database_url, self.headers, "GET")
database_name = data.get("title")
- database_name = (
-     database_name[0].get("text", {}).get("content") if database_name else None
- )
+ database_name = database_name[0].get("text", {}).get("content") if database_name else None
return NotionPage(**data, database_name=database_name)
@retry(tries=3, delay=1, backoff=2)
- def _fetch_database(
-     self, database_id: str, cursor: Optional[str] = None
- ) -> dict[str, Any]:
+ def _fetch_database(self, database_id: str, cursor: Optional[str] = None) -> dict[str, Any]:
"""Fetch a database from its ID via the Notion API."""
- logging.debug(f"Fetching database for ID '{database_id}'")
+ logging.debug(f"[Notion]: Fetching database for ID {database_id}")
block_url = f"https://api.notion.com/v1/databases/{database_id}/query"
body = {"start_cursor": cursor} if cursor else None
@ -129,17 +127,12 @@ class NotionConnector(LoadConnector, PollConnector):
data = fetch_notion_data(block_url, self.headers, "POST", body) data = fetch_notion_data(block_url, self.headers, "POST", body)
return data return data
except Exception as e: except Exception as e:
if hasattr(e, 'response') and e.response.status_code in [404, 400]: if hasattr(e, "response") and e.response.status_code in [404, 400]:
logging.error( logging.error(f"[Notion]: Unable to access database with ID {database_id}. This is likely due to the database not being shared with the integration.")
f"Unable to access database with ID '{database_id}'. "
f"This is likely due to the database not being shared with the integration."
)
return {"results": [], "next_cursor": None} return {"results": [], "next_cursor": None}
raise raise
def _read_pages_from_database( def _read_pages_from_database(self, database_id: str) -> tuple[list[NotionBlock], list[str]]:
self, database_id: str
) -> tuple[list[NotionBlock], list[str]]:
"""Returns a list of top level blocks and all page IDs in the database.""" """Returns a list of top level blocks and all page IDs in the database."""
result_blocks: list[NotionBlock] = [] result_blocks: list[NotionBlock] = []
result_pages: list[str] = [] result_pages: list[str] = []
@ -158,10 +151,10 @@ class NotionConnector(LoadConnector, PollConnector):
if self.recursive_index_enabled: if self.recursive_index_enabled:
if obj_type == "page": if obj_type == "page":
logging.debug(f"Found page with ID '{obj_id}' in database '{database_id}'") logging.debug(f"[Notion]: Found page with ID {obj_id} in database {database_id}")
result_pages.append(result["id"]) result_pages.append(result["id"])
elif obj_type == "database": elif obj_type == "database":
logging.debug(f"Found database with ID '{obj_id}' in database '{database_id}'") logging.debug(f"[Notion]: Found database with ID {obj_id} in database {database_id}")
_, child_pages = self._read_pages_from_database(obj_id) _, child_pages = self._read_pages_from_database(obj_id)
result_pages.extend(child_pages) result_pages.extend(child_pages)
@ -172,44 +165,229 @@ class NotionConnector(LoadConnector, PollConnector):
return result_blocks, result_pages return result_blocks, result_pages
def _read_blocks(self, base_block_id: str) -> tuple[list[NotionBlock], list[str]]: def _extract_rich_text(self, rich_text_array: list[dict[str, Any]]) -> str:
"""Reads all child blocks for the specified block, returns blocks and child page ids.""" collected_text: list[str] = []
for rich_text in rich_text_array:
content = ""
r_type = rich_text.get("type")
if r_type == "equation":
expr = rich_text.get("equation", {}).get("expression")
if expr:
content = expr
elif r_type == "mention":
mention = rich_text.get("mention", {}) or {}
mention_type = mention.get("type")
mention_value = mention.get(mention_type, {}) if mention_type else {}
if mention_type == "date":
start = mention_value.get("start")
end = mention_value.get("end")
if start and end:
content = f"{start} - {end}"
elif start:
content = start
elif mention_type in {"page", "database"}:
content = mention_value.get("id", rich_text.get("plain_text", ""))
elif mention_type == "link_preview":
content = mention_value.get("url", rich_text.get("plain_text", ""))
else:
content = rich_text.get("plain_text", "") or str(mention_value)
else:
if rich_text.get("plain_text"):
content = rich_text["plain_text"]
elif "text" in rich_text and rich_text["text"].get("content"):
content = rich_text["text"]["content"]
href = rich_text.get("href")
if content and href:
content = f"{content} ({href})"
if content:
collected_text.append(content)
return "".join(collected_text).strip()
def _build_table_html(self, table_block_id: str) -> str | None:
rows: list[str] = []
cursor = None
while True:
data = self._fetch_child_blocks(table_block_id, cursor)
if data is None:
break
for result in data["results"]:
if result.get("type") != "table_row":
continue
cells_html: list[str] = []
for cell in result["table_row"].get("cells", []):
cell_text = self._extract_rich_text(cell)
cell_html = html.escape(cell_text) if cell_text else ""
cells_html.append(f"<td>{cell_html}</td>")
rows.append(f"<tr>{''.join(cells_html)}</tr>")
if data.get("next_cursor") is None:
break
cursor = data["next_cursor"]
if not rows:
return None
return "<table>\n" + "\n".join(rows) + "\n</table>"
def _download_file(self, url: str) -> bytes | None:
try:
response = rl_requests.get(url, timeout=60)
response.raise_for_status()
return response.content
except Exception as exc:
logging.warning(f"[Notion]: Failed to download Notion file from {url}: {exc}")
return None
def _extract_file_metadata(self, result_obj: dict[str, Any], block_id: str) -> tuple[str | None, str, str | None]:
file_source_type = result_obj.get("type")
file_source = result_obj.get(file_source_type, {}) if file_source_type else {}
url = file_source.get("url")
name = result_obj.get("name") or file_source.get("name")
if url and not name:
parsed_name = Path(urlparse(url).path).name
name = parsed_name or f"notion_file_{block_id}"
elif not name:
name = f"notion_file_{block_id}"
caption = self._extract_rich_text(result_obj.get("caption", [])) if "caption" in result_obj else None
return url, name, caption
def _build_attachment_document(
self,
block_id: str,
url: str,
name: str,
caption: Optional[str],
page_last_edited_time: Optional[str],
) -> Document | None:
file_bytes = self._download_file(url)
if file_bytes is None:
return None
extension = Path(name).suffix or Path(urlparse(url).path).suffix or ".bin"
if extension and not extension.startswith("."):
extension = f".{extension}"
if not extension:
extension = ".bin"
updated_at = datetime_from_string(page_last_edited_time) if page_last_edited_time else datetime.now(timezone.utc)
semantic_identifier = caption or name or f"Notion file {block_id}"
return Document(
id=block_id,
blob=file_bytes,
source=DocumentSource.NOTION,
semantic_identifier=semantic_identifier,
extension=extension,
size_bytes=len(file_bytes),
doc_updated_at=updated_at,
)
def _read_blocks(self, base_block_id: str, page_last_edited_time: Optional[str] = None) -> tuple[list[NotionBlock], list[str], list[Document]]:
result_blocks: list[NotionBlock] = [] result_blocks: list[NotionBlock] = []
child_pages: list[str] = [] child_pages: list[str] = []
attachments: list[Document] = []
cursor = None cursor = None
while True: while True:
data = self._fetch_child_blocks(base_block_id, cursor) data = self._fetch_child_blocks(base_block_id, cursor)
if data is None: if data is None:
return result_blocks, child_pages return result_blocks, child_pages, attachments
for result in data["results"]: for result in data["results"]:
logging.debug(f"Found child block for block with ID '{base_block_id}': {result}") logging.debug(f"[Notion]: Found child block for block with ID {base_block_id}: {result}")
result_block_id = result["id"] result_block_id = result["id"]
result_type = result["type"] result_type = result["type"]
result_obj = result[result_type] result_obj = result[result_type]
if result_type in ["ai_block", "unsupported", "external_object_instance_page"]: if result_type in ["ai_block", "unsupported", "external_object_instance_page"]:
logging.warning(f"Skipping unsupported block type '{result_type}'") logging.warning(f"[Notion]: Skipping unsupported block type {result_type}")
continue
if result_type == "table":
table_html = self._build_table_html(result_block_id)
if table_html:
result_blocks.append(
NotionBlock(
id=result_block_id,
text=table_html,
prefix="\n\n",
)
)
continue
if result_type == "equation":
expr = result_obj.get("expression")
if expr:
result_blocks.append(
NotionBlock(
id=result_block_id,
text=expr,
prefix="\n",
)
)
continue continue
cur_result_text_arr = [] cur_result_text_arr = []
if "rich_text" in result_obj: if "rich_text" in result_obj:
for rich_text in result_obj["rich_text"]: text = self._extract_rich_text(result_obj["rich_text"])
if "text" in rich_text: if text:
text = rich_text["text"]["content"] cur_result_text_arr.append(text)
cur_result_text_arr.append(text)
if result_type == "bulleted_list_item":
if cur_result_text_arr:
cur_result_text_arr[0] = f"- {cur_result_text_arr[0]}"
else:
cur_result_text_arr = ["- "]
if result_type == "numbered_list_item":
if cur_result_text_arr:
cur_result_text_arr[0] = f"1. {cur_result_text_arr[0]}"
else:
cur_result_text_arr = ["1. "]
if result_type == "to_do":
checked = result_obj.get("checked")
checkbox_prefix = "[x]" if checked else "[ ]"
if cur_result_text_arr:
cur_result_text_arr = [f"{checkbox_prefix} {cur_result_text_arr[0]}"] + cur_result_text_arr[1:]
else:
cur_result_text_arr = [checkbox_prefix]
if result_type in {"file", "image", "pdf", "video", "audio"}:
file_url, file_name, caption = self._extract_file_metadata(result_obj, result_block_id)
if file_url:
attachment_doc = self._build_attachment_document(
block_id=result_block_id,
url=file_url,
name=file_name,
caption=caption,
page_last_edited_time=page_last_edited_time,
)
if attachment_doc:
attachments.append(attachment_doc)
attachment_label = caption or file_name
if attachment_label:
cur_result_text_arr.append(f"{result_type.capitalize()}: {attachment_label}")
if result["has_children"]: if result["has_children"]:
if result_type == "child_page": if result_type == "child_page":
child_pages.append(result_block_id) child_pages.append(result_block_id)
else: else:
logging.debug(f"Entering sub-block: {result_block_id}") logging.debug(f"[Notion]: Entering sub-block: {result_block_id}")
subblocks, subblock_child_pages = self._read_blocks(result_block_id) subblocks, subblock_child_pages, subblock_attachments = self._read_blocks(result_block_id, page_last_edited_time)
logging.debug(f"Finished sub-block: {result_block_id}") logging.debug(f"[Notion]: Finished sub-block: {result_block_id}")
result_blocks.extend(subblocks) result_blocks.extend(subblocks)
child_pages.extend(subblock_child_pages) child_pages.extend(subblock_child_pages)
attachments.extend(subblock_attachments)
if result_type == "child_database": if result_type == "child_database":
inner_blocks, inner_child_pages = self._read_pages_from_database(result_block_id) inner_blocks, inner_child_pages = self._read_pages_from_database(result_block_id)
@ -231,7 +409,7 @@ class NotionConnector(LoadConnector, PollConnector):
cursor = data["next_cursor"] cursor = data["next_cursor"]
return result_blocks, child_pages return result_blocks, child_pages, attachments
def _read_page_title(self, page: NotionPage) -> Optional[str]: def _read_page_title(self, page: NotionPage) -> Optional[str]:
"""Extracts the title from a Notion page.""" """Extracts the title from a Notion page."""
@ -245,9 +423,7 @@ class NotionConnector(LoadConnector, PollConnector):
return None return None
def _read_pages( def _read_pages(self, pages: list[NotionPage], start: SecondsSinceUnixEpoch | None = None, end: SecondsSinceUnixEpoch | None = None) -> Generator[Document, None, None]:
self, pages: list[NotionPage]
) -> Generator[Document, None, None]:
"""Reads pages for rich text content and generates Documents.""" """Reads pages for rich text content and generates Documents."""
all_child_page_ids: list[str] = [] all_child_page_ids: list[str] = []
@ -255,11 +431,17 @@ class NotionConnector(LoadConnector, PollConnector):
if isinstance(page, dict): if isinstance(page, dict):
page = NotionPage(**page) page = NotionPage(**page)
if page.id in self.indexed_pages: if page.id in self.indexed_pages:
logging.debug(f"Already indexed page with ID '{page.id}'. Skipping.") logging.debug(f"[Notion]: Already indexed page with ID {page.id}. Skipping.")
continue continue
logging.info(f"Reading page with ID '{page.id}', with url {page.url}") if start is not None and end is not None:
page_blocks, child_page_ids = self._read_blocks(page.id) page_ts = datetime_from_string(page.last_edited_time).timestamp()
if not (page_ts > start and page_ts <= end):
logging.debug(f"[Notion]: Skipping page {page.id} outside polling window.")
continue
logging.info(f"[Notion]: Reading page with ID {page.id}, with url {page.url}")
page_blocks, child_page_ids, attachment_docs = self._read_blocks(page.id, page.last_edited_time)
all_child_page_ids.extend(child_page_ids) all_child_page_ids.extend(child_page_ids)
self.indexed_pages.add(page.id) self.indexed_pages.add(page.id)
@ -268,14 +450,12 @@ class NotionConnector(LoadConnector, PollConnector):
if not page_blocks: if not page_blocks:
if not raw_page_title: if not raw_page_title:
logging.warning(f"No blocks OR title found for page with ID '{page.id}'. Skipping.") logging.warning(f"[Notion]: No blocks OR title found for page with ID {page.id}. Skipping.")
continue continue
text = page_title text = page_title
if page.properties: if page.properties:
text += "\n\n" + "\n".join( text += "\n\n" + "\n".join([f"{key}: {value}" for key, value in page.properties.items()])
[f"{key}: {value}" for key, value in page.properties.items()]
)
sections = [TextSection(link=page.url, text=text)] sections = [TextSection(link=page.url, text=text)]
else: else:
sections = [ sections = [
@ -286,45 +466,39 @@ class NotionConnector(LoadConnector, PollConnector):
for block in page_blocks for block in page_blocks
] ]
blob = ("\n".join([sec.text for sec in sections])).encode("utf-8") joined_text = "\n".join(sec.text for sec in sections)
blob = joined_text.encode("utf-8")
yield Document( yield Document(
id=page.id, id=page.id, blob=blob, source=DocumentSource.NOTION, semantic_identifier=page_title, extension=".txt", size_bytes=len(blob), doc_updated_at=datetime_from_string(page.last_edited_time)
blob=blob,
source=DocumentSource.NOTION,
semantic_identifier=page_title,
extension=".txt",
size_bytes=len(blob),
doc_updated_at=datetime_from_string(page.last_edited_time)
) )
for attachment_doc in attachment_docs:
yield attachment_doc
if self.recursive_index_enabled and all_child_page_ids: if self.recursive_index_enabled and all_child_page_ids:
for child_page_batch_ids in batch_generator(all_child_page_ids, INDEX_BATCH_SIZE): for child_page_batch_ids in batch_generator(all_child_page_ids, INDEX_BATCH_SIZE):
child_page_batch = [ child_page_batch = [self._fetch_page(page_id) for page_id in child_page_batch_ids if page_id not in self.indexed_pages]
self._fetch_page(page_id) yield from self._read_pages(child_page_batch, start, end)
for page_id in child_page_batch_ids
if page_id not in self.indexed_pages
]
yield from self._read_pages(child_page_batch)
@retry(tries=3, delay=1, backoff=2) @retry(tries=3, delay=1, backoff=2)
def _search_notion(self, query_dict: dict[str, Any]) -> NotionSearchResponse: def _search_notion(self, query_dict: dict[str, Any]) -> NotionSearchResponse:
"""Search for pages from a Notion database.""" """Search for pages from a Notion database."""
logging.debug(f"Searching for pages in Notion with query_dict: {query_dict}") logging.debug(f"[Notion]: Searching for pages in Notion with query_dict: {query_dict}")
data = fetch_notion_data("https://api.notion.com/v1/search", self.headers, "POST", query_dict) data = fetch_notion_data("https://api.notion.com/v1/search", self.headers, "POST", query_dict)
return NotionSearchResponse(**data) return NotionSearchResponse(**data)
def _recursive_load(self) -> Generator[list[Document], None, None]: def _recursive_load(self, start: SecondsSinceUnixEpoch | None = None, end: SecondsSinceUnixEpoch | None = None) -> Generator[list[Document], None, None]:
"""Recursively load pages starting from root page ID.""" """Recursively load pages starting from root page ID."""
if self.root_page_id is None or not self.recursive_index_enabled: if self.root_page_id is None or not self.recursive_index_enabled:
raise RuntimeError("Recursive page lookup is not enabled") raise RuntimeError("Recursive page lookup is not enabled")
logging.info(f"Recursively loading pages from Notion based on root page with ID: {self.root_page_id}") logging.info(f"[Notion]: Recursively loading pages from Notion based on root page with ID: {self.root_page_id}")
pages = [self._fetch_page(page_id=self.root_page_id)] pages = [self._fetch_page(page_id=self.root_page_id)]
yield from batch_generator(self._read_pages(pages), self.batch_size) yield from batch_generator(self._read_pages(pages, start, end), self.batch_size)
def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None: def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
"""Applies integration token to headers.""" """Applies integration token to headers."""
self.headers["Authorization"] = f'Bearer {credentials["notion_integration_token"]}' self.headers["Authorization"] = f"Bearer {credentials['notion_integration_token']}"
return None return None
def load_from_state(self) -> GenerateDocumentsOutput: def load_from_state(self) -> GenerateDocumentsOutput:
@ -348,12 +522,10 @@ class NotionConnector(LoadConnector, PollConnector):
else: else:
break break
def poll_source( def poll_source(self, start: SecondsSinceUnixEpoch, end: SecondsSinceUnixEpoch) -> GenerateDocumentsOutput:
self, start: SecondsSinceUnixEpoch, end: SecondsSinceUnixEpoch
) -> GenerateDocumentsOutput:
"""Poll Notion for updated pages within a time period.""" """Poll Notion for updated pages within a time period."""
if self.recursive_index_enabled and self.root_page_id: if self.recursive_index_enabled and self.root_page_id:
yield from self._recursive_load() yield from self._recursive_load(start, end)
return return
query_dict = { query_dict = {
@ -367,7 +539,7 @@ class NotionConnector(LoadConnector, PollConnector):
pages = filter_pages_by_time(db_res.results, start, end, "last_edited_time") pages = filter_pages_by_time(db_res.results, start, end, "last_edited_time")
if pages: if pages:
yield from batch_generator(self._read_pages(pages), self.batch_size) yield from batch_generator(self._read_pages(pages, start, end), self.batch_size)
if db_res.has_more: if db_res.has_more:
query_dict["start_cursor"] = db_res.next_cursor query_dict["start_cursor"] = db_res.next_cursor
else: else:
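As an aside, a minimal sketch of the window check that _read_pages now applies when poll_source passes start/end. The timestamp parsing below is a hypothetical stand-in for the connector's datetime_from_string helper, and the epoch values are placeholders:

from datetime import datetime
from typing import Optional

def _to_epoch(last_edited_time: str) -> float:
    # Stand-in for datetime_from_string: Notion returns ISO-8601 strings such as "2025-12-01T10:00:00.000Z".
    return datetime.fromisoformat(last_edited_time.replace("Z", "+00:00")).timestamp()

def page_in_window(last_edited_time: str, start: Optional[float], end: Optional[float]) -> bool:
    # Mirrors the (start, end] filter added to _read_pages above.
    if start is None or end is None:
        return True  # load mode: no window filtering
    ts = _to_epoch(last_edited_time)
    return start < ts <= end

# Example usage with placeholder epoch-second bounds.
print(page_in_window("2025-12-01T10:00:00.000Z", start=1_764_000_000, end=1_765_000_000))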


@ -312,12 +312,15 @@ def create_s3_client(bucket_type: BlobType, credentials: dict[str, Any], europea
region_name=credentials["region"], region_name=credentials["region"],
) )
elif bucket_type == BlobType.S3_COMPATIBLE: elif bucket_type == BlobType.S3_COMPATIBLE:
addressing_style = credentials.get("addressing_style", "virtual")
return boto3.client( return boto3.client(
"s3", "s3",
endpoint_url=credentials["endpoint_url"], endpoint_url=credentials["endpoint_url"],
aws_access_key_id=credentials["aws_access_key_id"], aws_access_key_id=credentials["aws_access_key_id"],
aws_secret_access_key=credentials["aws_secret_access_key"], aws_secret_access_key=credentials["aws_secret_access_key"],
) config=Config(s3={'addressing_style': addressing_style}),
)
else: else:
raise ValueError(f"Unsupported bucket type: {bucket_type}") raise ValueError(f"Unsupported bucket type: {bucket_type}")
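For reference, a small sketch of the client the new S3_COMPATIBLE branch builds. boto3 and botocore.config.Config are the real libraries; the endpoint and keys below are placeholders:

import boto3
from botocore.config import Config

credentials = {
    "endpoint_url": "https://minio.example.com:9000",   # placeholder
    "aws_access_key_id": "ACCESS_KEY",                  # placeholder
    "aws_secret_access_key": "SECRET_KEY",              # placeholder
    "addressing_style": "path",                         # or "virtual", the default used above
}

s3 = boto3.client(
    "s3",
    endpoint_url=credentials["endpoint_url"],
    aws_access_key_id=credentials["aws_access_key_id"],
    aws_secret_access_key=credentials["aws_secret_access_key"],
    config=Config(s3={"addressing_style": credentials.get("addressing_style", "virtual")}),
)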
@ -730,7 +733,7 @@ def build_time_range_query(
"""Build time range query for Gmail API""" """Build time range query for Gmail API"""
query = "" query = ""
if time_range_start is not None and time_range_start != 0: if time_range_start is not None and time_range_start != 0:
query += f"after:{int(time_range_start)}" query += f"after:{int(time_range_start) + 1}"
if time_range_end is not None and time_range_end != 0: if time_range_end is not None and time_range_end != 0:
query += f" before:{int(time_range_end)}" query += f" before:{int(time_range_end)}"
query = query.strip() query = query.strip()
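To illustrate the change, the lower bound of the Gmail query is now after:<start+1>; a sketch that mirrors the updated helper, with placeholder epoch values:

def build_time_range_query(time_range_start=None, time_range_end=None) -> str:
    # Mirrors the diff above: the lower bound is shifted by one second.
    query = ""
    if time_range_start is not None and time_range_start != 0:
        query += f"after:{int(time_range_start) + 1}"
    if time_range_end is not None and time_range_end != 0:
        query += f" before:{int(time_range_end)}"
    return query.strip()

print(build_time_range_query(1_700_000_000, 1_700_086_400))  # -> "after:1700000001 before:1700086400"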
@ -775,6 +778,15 @@ def time_str_to_utc(time_str: str):
return datetime.fromisoformat(time_str.replace("Z", "+00:00")) return datetime.fromisoformat(time_str.replace("Z", "+00:00"))
def gmail_time_str_to_utc(time_str: str):
"""Convert Gmail RFC 2822 time string to UTC."""
from email.utils import parsedate_to_datetime
from datetime import timezone
dt = parsedate_to_datetime(time_str)
return dt.astimezone(timezone.utc)
# Notion Utilities # Notion Utilities
T = TypeVar("T") T = TypeVar("T")
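A quick usage sketch of the new gmail_time_str_to_utc helper; parsedate_to_datetime is part of the standard library and the date string is a made-up RFC 2822 header value:

from email.utils import parsedate_to_datetime
from datetime import timezone

def gmail_time_str_to_utc(time_str: str):
    # Convert an RFC 2822 Date header to an aware UTC datetime, as added above.
    dt = parsedate_to_datetime(time_str)
    return dt.astimezone(timezone.utc)

print(gmail_time_str_to_utc("Tue, 02 Dec 2025 19:12:43 +0800"))  # 2025-12-02 11:12:43+00:00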


@ -0,0 +1,370 @@
"""WebDAV connector"""
import logging
import os
from datetime import datetime, timezone
from typing import Any, Optional
from webdav4.client import Client as WebDAVClient
from common.data_source.utils import (
get_file_ext,
)
from common.data_source.config import DocumentSource, INDEX_BATCH_SIZE, BLOB_STORAGE_SIZE_THRESHOLD
from common.data_source.exceptions import (
ConnectorMissingCredentialError,
ConnectorValidationError,
CredentialExpiredError,
InsufficientPermissionsError
)
from common.data_source.interfaces import LoadConnector, PollConnector
from common.data_source.models import Document, SecondsSinceUnixEpoch, GenerateDocumentsOutput
class WebDAVConnector(LoadConnector, PollConnector):
"""WebDAV connector for syncing files from WebDAV servers"""
def __init__(
self,
base_url: str,
remote_path: str = "/",
batch_size: int = INDEX_BATCH_SIZE,
) -> None:
"""Initialize WebDAV connector
Args:
base_url: Base URL of the WebDAV server (e.g., "https://webdav.example.com")
remote_path: Remote path to sync from (default: "/")
batch_size: Number of documents per batch
"""
self.base_url = base_url.rstrip("/")
if not remote_path:
remote_path = "/"
if not remote_path.startswith("/"):
remote_path = f"/{remote_path}"
if remote_path.endswith("/") and remote_path != "/":
remote_path = remote_path.rstrip("/")
self.remote_path = remote_path
self.batch_size = batch_size
self.client: Optional[WebDAVClient] = None
self._allow_images: bool | None = None
self.size_threshold: int | None = BLOB_STORAGE_SIZE_THRESHOLD
def set_allow_images(self, allow_images: bool) -> None:
"""Set whether to process images"""
logging.info(f"Setting allow_images to {allow_images}.")
self._allow_images = allow_images
def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
"""Load credentials and initialize WebDAV client
Args:
credentials: Dictionary containing 'username' and 'password'
Returns:
None
Raises:
ConnectorMissingCredentialError: If required credentials are missing
"""
logging.debug(f"Loading credentials for WebDAV server {self.base_url}")
username = credentials.get("username")
password = credentials.get("password")
if not username or not password:
raise ConnectorMissingCredentialError(
"WebDAV requires 'username' and 'password' credentials"
)
try:
# Initialize WebDAV client
self.client = WebDAVClient(
base_url=self.base_url,
auth=(username, password)
)
# Test connection
self.client.exists(self.remote_path)
except Exception as e:
logging.error(f"Failed to connect to WebDAV server: {e}")
raise ConnectorMissingCredentialError(
f"Failed to authenticate with WebDAV server: {e}"
)
return None
def _list_files_recursive(
self,
path: str,
start: datetime,
end: datetime,
) -> list[tuple[str, dict]]:
"""Recursively list all files in the given path
Args:
path: Path to list files from
start: Start datetime for filtering
end: End datetime for filtering
Returns:
List of tuples containing (file_path, file_info)
"""
if self.client is None:
raise ConnectorMissingCredentialError("WebDAV client not initialized")
files = []
try:
logging.debug(f"Listing directory: {path}")
for item in self.client.ls(path, detail=True):
item_path = item['name']
if item_path == path or item_path == path + '/':
continue
logging.debug(f"Found item: {item_path}, type: {item.get('type')}")
if item.get('type') == 'directory':
try:
files.extend(self._list_files_recursive(item_path, start, end))
except Exception as e:
logging.error(f"Error recursing into directory {item_path}: {e}")
continue
else:
try:
modified_time = item.get('modified')
if modified_time:
if isinstance(modified_time, datetime):
modified = modified_time
if modified.tzinfo is None:
modified = modified.replace(tzinfo=timezone.utc)
elif isinstance(modified_time, str):
try:
modified = datetime.strptime(modified_time, '%a, %d %b %Y %H:%M:%S %Z')
modified = modified.replace(tzinfo=timezone.utc)
except (ValueError, TypeError):
try:
modified = datetime.fromisoformat(modified_time.replace('Z', '+00:00'))
except (ValueError, TypeError):
logging.warning(f"Could not parse modified time for {item_path}: {modified_time}")
modified = datetime.now(timezone.utc)
else:
modified = datetime.now(timezone.utc)
else:
modified = datetime.now(timezone.utc)
logging.debug(f"File {item_path}: modified={modified}, start={start}, end={end}, include={start < modified <= end}")
if start < modified <= end:
files.append((item_path, item))
else:
logging.debug(f"File {item_path} filtered out by time range")
except Exception as e:
logging.error(f"Error processing file {item_path}: {e}")
continue
except Exception as e:
logging.error(f"Error listing directory {path}: {e}")
return files
def _yield_webdav_documents(
self,
start: datetime,
end: datetime,
) -> GenerateDocumentsOutput:
"""Generate documents from WebDAV server
Args:
start: Start datetime for filtering
end: End datetime for filtering
Yields:
Batches of documents
"""
if self.client is None:
raise ConnectorMissingCredentialError("WebDAV client not initialized")
logging.info(f"Searching for files in {self.remote_path} between {start} and {end}")
files = self._list_files_recursive(self.remote_path, start, end)
logging.info(f"Found {len(files)} files matching time criteria")
batch: list[Document] = []
for file_path, file_info in files:
file_name = os.path.basename(file_path)
size_bytes = file_info.get('size', 0)
if (
self.size_threshold is not None
and isinstance(size_bytes, int)
and size_bytes > self.size_threshold
):
logging.warning(
f"{file_name} exceeds size threshold of {self.size_threshold}. Skipping."
)
continue
try:
logging.debug(f"Downloading file: {file_path}")
from io import BytesIO
buffer = BytesIO()
self.client.download_fileobj(file_path, buffer)
blob = buffer.getvalue()
if blob is None or len(blob) == 0:
logging.warning(f"Downloaded content is empty for {file_path}")
continue
modified_time = file_info.get('modified')
if modified_time:
if isinstance(modified_time, datetime):
modified = modified_time
if modified.tzinfo is None:
modified = modified.replace(tzinfo=timezone.utc)
elif isinstance(modified_time, str):
try:
modified = datetime.strptime(modified_time, '%a, %d %b %Y %H:%M:%S %Z')
modified = modified.replace(tzinfo=timezone.utc)
except (ValueError, TypeError):
try:
modified = datetime.fromisoformat(modified_time.replace('Z', '+00:00'))
except (ValueError, TypeError):
logging.warning(f"Could not parse modified time for {file_path}: {modified_time}")
modified = datetime.now(timezone.utc)
else:
modified = datetime.now(timezone.utc)
else:
modified = datetime.now(timezone.utc)
batch.append(
Document(
id=f"webdav:{self.base_url}:{file_path}",
blob=blob,
source=DocumentSource.WEBDAV,
semantic_identifier=file_name,
extension=get_file_ext(file_name),
doc_updated_at=modified,
size_bytes=size_bytes if size_bytes else 0
)
)
if len(batch) == self.batch_size:
yield batch
batch = []
except Exception as e:
logging.exception(f"Error downloading file {file_path}: {e}")
if batch:
yield batch
def load_from_state(self) -> GenerateDocumentsOutput:
"""Load all documents from WebDAV server
Yields:
Batches of documents
"""
logging.debug(f"Loading documents from WebDAV server {self.base_url}")
return self._yield_webdav_documents(
start=datetime(1970, 1, 1, tzinfo=timezone.utc),
end=datetime.now(timezone.utc),
)
def poll_source(
self, start: SecondsSinceUnixEpoch, end: SecondsSinceUnixEpoch
) -> GenerateDocumentsOutput:
"""Poll WebDAV server for updated documents
Args:
start: Start timestamp (seconds since Unix epoch)
end: End timestamp (seconds since Unix epoch)
Yields:
Batches of documents
"""
if self.client is None:
raise ConnectorMissingCredentialError("WebDAV client not initialized")
start_datetime = datetime.fromtimestamp(start, tz=timezone.utc)
end_datetime = datetime.fromtimestamp(end, tz=timezone.utc)
for batch in self._yield_webdav_documents(start_datetime, end_datetime):
yield batch
def validate_connector_settings(self) -> None:
"""Validate WebDAV connector settings
Raises:
ConnectorMissingCredentialError: If credentials are not loaded
ConnectorValidationError: If settings are invalid
"""
if self.client is None:
raise ConnectorMissingCredentialError(
"WebDAV credentials not loaded."
)
if not self.base_url:
raise ConnectorValidationError(
"No base URL was provided in connector settings."
)
try:
if not self.client.exists(self.remote_path):
raise ConnectorValidationError(
f"Remote path '{self.remote_path}' does not exist on WebDAV server."
)
except Exception as e:
error_message = str(e)
if "401" in error_message or "unauthorized" in error_message.lower():
raise CredentialExpiredError(
"WebDAV credentials appear invalid or expired."
)
if "403" in error_message or "forbidden" in error_message.lower():
raise InsufficientPermissionsError(
f"Insufficient permissions to access path '{self.remote_path}' on WebDAV server."
)
if "404" in error_message or "not found" in error_message.lower():
raise ConnectorValidationError(
f"Remote path '{self.remote_path}' does not exist on WebDAV server."
)
raise ConnectorValidationError(
f"Unexpected WebDAV client error: {e}"
)
if __name__ == "__main__":
credentials_dict = {
"username": os.environ.get("WEBDAV_USERNAME"),
"password": os.environ.get("WEBDAV_PASSWORD"),
}
connector = WebDAVConnector(
base_url=os.environ.get("WEBDAV_URL") or "https://webdav.example.com",
remote_path=os.environ.get("WEBDAV_PATH") or "/",
)
try:
connector.load_credentials(credentials_dict)
connector.validate_connector_settings()
document_batch_generator = connector.load_from_state()
for document_batch in document_batch_generator:
print("First batch of documents:")
for doc in document_batch:
print(f"Document ID: {doc.id}")
print(f"Semantic Identifier: {doc.semantic_identifier}")
print(f"Source: {doc.source}")
print(f"Updated At: {doc.doc_updated_at}")
print("---")
break
except ConnectorMissingCredentialError as e:
print(f"Error: {e}")
except Exception as e:
print(f"An unexpected error occurred: {e}")

common/http_client.py (new file, 157 lines)

@ -0,0 +1,157 @@
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import asyncio
import logging
import os
import time
from typing import Any, Dict, Optional
import httpx
logger = logging.getLogger(__name__)
# Default knobs; keep conservative to avoid unexpected behavioural changes.
DEFAULT_TIMEOUT = float(os.environ.get("HTTP_CLIENT_TIMEOUT", "15"))
# Align with requests default: follow redirects with a max of 30 unless overridden.
DEFAULT_FOLLOW_REDIRECTS = bool(int(os.environ.get("HTTP_CLIENT_FOLLOW_REDIRECTS", "1")))
DEFAULT_MAX_REDIRECTS = int(os.environ.get("HTTP_CLIENT_MAX_REDIRECTS", "30"))
DEFAULT_MAX_RETRIES = int(os.environ.get("HTTP_CLIENT_MAX_RETRIES", "2"))
DEFAULT_BACKOFF_FACTOR = float(os.environ.get("HTTP_CLIENT_BACKOFF_FACTOR", "0.5"))
DEFAULT_PROXY = os.environ.get("HTTP_CLIENT_PROXY")
DEFAULT_USER_AGENT = os.environ.get("HTTP_CLIENT_USER_AGENT", "ragflow-http-client")
def _clean_headers(headers: Optional[Dict[str, str]], auth_token: Optional[str] = None) -> Optional[Dict[str, str]]:
merged_headers: Dict[str, str] = {}
if DEFAULT_USER_AGENT:
merged_headers["User-Agent"] = DEFAULT_USER_AGENT
if auth_token:
merged_headers["Authorization"] = auth_token
if headers is None:
return merged_headers or None
merged_headers.update({str(k): str(v) for k, v in headers.items() if v is not None})
return merged_headers or None
def _get_delay(backoff_factor: float, attempt: int) -> float:
return backoff_factor * (2**attempt)
async def async_request(
method: str,
url: str,
*,
timeout: float | httpx.Timeout | None = None,
follow_redirects: bool | None = None,
max_redirects: Optional[int] = None,
headers: Optional[Dict[str, str]] = None,
auth_token: Optional[str] = None,
retries: Optional[int] = None,
backoff_factor: Optional[float] = None,
proxies: Any = None,
**kwargs: Any,
) -> httpx.Response:
"""Lightweight async HTTP wrapper using httpx.AsyncClient with safe defaults."""
timeout = timeout if timeout is not None else DEFAULT_TIMEOUT
follow_redirects = DEFAULT_FOLLOW_REDIRECTS if follow_redirects is None else follow_redirects
max_redirects = DEFAULT_MAX_REDIRECTS if max_redirects is None else max_redirects
retries = DEFAULT_MAX_RETRIES if retries is None else max(retries, 0)
backoff_factor = DEFAULT_BACKOFF_FACTOR if backoff_factor is None else backoff_factor
headers = _clean_headers(headers, auth_token=auth_token)
proxies = DEFAULT_PROXY if proxies is None else proxies
async with httpx.AsyncClient(
timeout=timeout,
follow_redirects=follow_redirects,
max_redirects=max_redirects,
proxies=proxies,
) as client:
last_exc: Exception | None = None
for attempt in range(retries + 1):
try:
start = time.monotonic()
response = await client.request(method=method, url=url, headers=headers, **kwargs)
duration = time.monotonic() - start
logger.debug(f"async_request {method} {url} -> {response.status_code} in {duration:.3f}s")
return response
except httpx.RequestError as exc:
last_exc = exc
if attempt >= retries:
logger.warning(f"async_request exhausted retries for {method} {url}: {exc}")
raise
delay = _get_delay(backoff_factor, attempt)
logger.warning(f"async_request attempt {attempt + 1}/{retries + 1} failed for {method} {url}: {exc}; retrying in {delay:.2f}s")
await asyncio.sleep(delay)
raise last_exc # pragma: no cover
def sync_request(
method: str,
url: str,
*,
timeout: float | httpx.Timeout | None = None,
follow_redirects: bool | None = None,
max_redirects: Optional[int] = None,
headers: Optional[Dict[str, str]] = None,
auth_token: Optional[str] = None,
retries: Optional[int] = None,
backoff_factor: Optional[float] = None,
proxies: Any = None,
**kwargs: Any,
) -> httpx.Response:
"""Synchronous counterpart to async_request, for CLI/tests or sync contexts."""
timeout = timeout if timeout is not None else DEFAULT_TIMEOUT
follow_redirects = DEFAULT_FOLLOW_REDIRECTS if follow_redirects is None else follow_redirects
max_redirects = DEFAULT_MAX_REDIRECTS if max_redirects is None else max_redirects
retries = DEFAULT_MAX_RETRIES if retries is None else max(retries, 0)
backoff_factor = DEFAULT_BACKOFF_FACTOR if backoff_factor is None else backoff_factor
headers = _clean_headers(headers, auth_token=auth_token)
proxies = DEFAULT_PROXY if proxies is None else proxies
with httpx.Client(
timeout=timeout,
follow_redirects=follow_redirects,
max_redirects=max_redirects,
proxies=proxies,
) as client:
last_exc: Exception | None = None
for attempt in range(retries + 1):
try:
start = time.monotonic()
response = client.request(method=method, url=url, headers=headers, **kwargs)
duration = time.monotonic() - start
logger.debug(f"sync_request {method} {url} -> {response.status_code} in {duration:.3f}s")
return response
except httpx.RequestError as exc:
last_exc = exc
if attempt >= retries:
logger.warning(f"sync_request exhausted retries for {method} {url}: {exc}")
raise
delay = _get_delay(backoff_factor, attempt)
logger.warning(f"sync_request attempt {attempt + 1}/{retries + 1} failed for {method} {url}: {exc}; retrying in {delay:.2f}s")
time.sleep(delay)
raise last_exc # pragma: no cover
__all__ = [
"async_request",
"sync_request",
"DEFAULT_TIMEOUT",
"DEFAULT_FOLLOW_REDIRECTS",
"DEFAULT_MAX_REDIRECTS",
"DEFAULT_MAX_RETRIES",
"DEFAULT_BACKOFF_FACTOR",
"DEFAULT_PROXY",
"DEFAULT_USER_AGENT",
]
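A usage sketch for the new helpers; the URLs and token are placeholders, and retry/backoff behaviour follows the defaults defined above:

import asyncio

from common.http_client import async_request, sync_request

# Synchronous call with an explicit timeout and a single retry.
resp = sync_request("GET", "https://example.com/health", timeout=5, retries=1)
print(resp.status_code)

# Asynchronous call; auth_token is passed through as the Authorization header value.
async def main():
    resp = await async_request("POST", "https://example.com/api",
                               json={"ping": True}, auth_token="Bearer <token>")
    print(resp.status_code)

asyncio.run(main())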


@ -23,6 +23,8 @@ import subprocess
import sys import sys
import os import os
import logging import logging
from pathlib import Path
from typing import Dict
def get_uuid(): def get_uuid():
return uuid.uuid1().hex return uuid.uuid1().hex
@ -106,3 +108,152 @@ def pip_install_torch():
logging.info("Installing pytorch") logging.info("Installing pytorch")
pkg_names = ["torch>=2.5.0,<3.0.0"] pkg_names = ["torch>=2.5.0,<3.0.0"]
subprocess.check_call([sys.executable, "-m", "pip", "install", *pkg_names]) subprocess.check_call([sys.executable, "-m", "pip", "install", *pkg_names])
def parse_mineru_paths() -> Dict[str, Path]:
"""
Parse MinerU-related paths based on the MINERU_EXECUTABLE environment variable.
Expected layout (default convention):
MINERU_EXECUTABLE = /home/user/uv_tools/.venv/bin/mineru
From this path we derive:
- mineru_exec : full path to the mineru executable
- venv_dir : the virtual environment directory (.venv)
- tools_dir : the parent tools directory (e.g. uv_tools)
If MINERU_EXECUTABLE is not set, we fall back to the default layout:
$HOME/uv_tools/.venv/bin/mineru
Returns:
A dict with keys:
- "mineru_exec": Path
- "venv_dir": Path
- "tools_dir": Path
"""
mineru_exec_env = os.getenv("MINERU_EXECUTABLE")
if mineru_exec_env:
# Use the path from the environment variable
mineru_exec = Path(mineru_exec_env).expanduser().resolve()
venv_dir = mineru_exec.parent.parent
tools_dir = venv_dir.parent
else:
# Fall back to default convention: $HOME/uv_tools/.venv/bin/mineru
home = Path(os.path.expanduser("~"))
tools_dir = home / "uv_tools"
venv_dir = tools_dir / ".venv"
mineru_exec = venv_dir / "bin" / "mineru"
return {
"mineru_exec": mineru_exec,
"venv_dir": venv_dir,
"tools_dir": tools_dir,
}
@once
def install_mineru() -> None:
"""
Ensure MinerU is installed.
Behavior:
1. MinerU is enabled only when USE_MINERU is true/yes/1/y.
2. Resolve mineru_exec / venv_dir / tools_dir.
3. If mineru exists and works, log success and exit.
4. Otherwise:
- Create tools_dir
- Create venv if missing
- Install mineru[core], fallback to mineru[all]
- Validate with `--help`
5. Log installation success.
NOTE:
This function intentionally does NOT return the path.
Logging is used to indicate status.
"""
# Check if MinerU is enabled
use_mineru = os.getenv("USE_MINERU", "").strip().lower()
if use_mineru == "false":
logging.info("USE_MINERU=%r. Skipping MinerU installation.", use_mineru)
return
# Resolve expected paths
paths = parse_mineru_paths()
mineru_exec: Path = paths["mineru_exec"]
venv_dir: Path = paths["venv_dir"]
tools_dir: Path = paths["tools_dir"]
# Construct environment variables for installation/execution
env = os.environ.copy()
env["VIRTUAL_ENV"] = str(venv_dir)
env["PATH"] = str(venv_dir / "bin") + os.pathsep + env.get("PATH", "")
# Configure HuggingFace endpoint
env.setdefault("HUGGINGFACE_HUB_ENDPOINT", os.getenv("HF_ENDPOINT") or "https://hf-mirror.com")
# Helper: check whether mineru works
def mineru_works() -> bool:
try:
subprocess.check_call(
[str(mineru_exec), "--help"],
stdout=subprocess.DEVNULL,
stderr=subprocess.PIPE,
env=env,
)
return True
except Exception:
return False
# If MinerU is already installed and functional
if mineru_exec.is_file() and os.access(mineru_exec, os.X_OK) and mineru_works():
logging.info("MinerU already installed.")
os.environ["MINERU_EXECUTABLE"] = str(mineru_exec)
return
logging.info("MinerU not found. Installing into virtualenv: %s", venv_dir)
# Ensure parent directory exists
tools_dir.mkdir(parents=True, exist_ok=True)
# Create venv if missing
if not venv_dir.exists():
subprocess.check_call(
["uv", "venv", str(venv_dir)],
cwd=str(tools_dir),
env=env,
# stdout=subprocess.DEVNULL,
# stderr=subprocess.PIPE,
)
else:
logging.info("Virtual environment exists at %s. Reusing it.", venv_dir)
# Helper for pip install
def pip_install(pkg: str) -> None:
subprocess.check_call(
[
"uv", "pip", "install", "-U", pkg,
"-i", "https://mirrors.aliyun.com/pypi/simple",
"--extra-index-url", "https://pypi.org/simple",
],
cwd=str(tools_dir),
# stdout=subprocess.DEVNULL,
# stderr=subprocess.PIPE,
env=env,
)
# Install core version first; fallback to all
try:
logging.info("Installing mineru[core] ...")
pip_install("mineru[core]")
except subprocess.CalledProcessError:
logging.warning("mineru[core] installation failed. Installing mineru[all] ...")
pip_install("mineru[all]")
# Validate installation
if not mineru_works():
logging.error("MinerU installation failed: %s does not work.", mineru_exec)
raise RuntimeError(f"MinerU installation failed: {mineru_exec} is not functional")
os.environ["MINERU_EXECUTABLE"] = str(mineru_exec)
logging.info("MinerU installation completed successfully. Executable: %s", mineru_exec)


@ -27,6 +27,7 @@ from common.constants import SVR_QUEUE_NAME, Storage
import rag.utils import rag.utils
import rag.utils.es_conn import rag.utils.es_conn
import rag.utils.infinity_conn import rag.utils.infinity_conn
import rag.utils.ob_conn
import rag.utils.opensearch_conn import rag.utils.opensearch_conn
from rag.utils.azure_sas_conn import RAGFlowAzureSasBlob from rag.utils.azure_sas_conn import RAGFlowAzureSasBlob
from rag.utils.azure_spn_conn import RAGFlowAzureSpnBlob from rag.utils.azure_spn_conn import RAGFlowAzureSpnBlob
@ -73,6 +74,8 @@ GITHUB_OAUTH = None
FEISHU_OAUTH = None FEISHU_OAUTH = None
OAUTH_CONFIG = None OAUTH_CONFIG = None
DOC_ENGINE = os.getenv('DOC_ENGINE', 'elasticsearch') DOC_ENGINE = os.getenv('DOC_ENGINE', 'elasticsearch')
DOC_ENGINE_INFINITY = (DOC_ENGINE.lower() == "infinity")
docStoreConn = None docStoreConn = None
@ -103,6 +106,7 @@ INFINITY = {}
AZURE = {} AZURE = {}
S3 = {} S3 = {}
MINIO = {} MINIO = {}
OB = {}
OSS = {} OSS = {}
OS = {} OS = {}
@ -137,7 +141,7 @@ def _get_or_create_secret_key():
import logging import logging
new_key = secrets.token_hex(32) new_key = secrets.token_hex(32)
logging.warning(f"SECURITY WARNING: Using auto-generated SECRET_KEY. Generated key: {new_key}") logging.warning("SECURITY WARNING: Using auto-generated SECRET_KEY.")
return new_key return new_key
class StorageFactory: class StorageFactory:
@ -227,9 +231,9 @@ def init_settings():
FEISHU_OAUTH = get_base_config("oauth", {}).get("feishu") FEISHU_OAUTH = get_base_config("oauth", {}).get("feishu")
OAUTH_CONFIG = get_base_config("oauth", {}) OAUTH_CONFIG = get_base_config("oauth", {})
global DOC_ENGINE, docStoreConn, ES, OS, INFINITY global DOC_ENGINE, DOC_ENGINE_INFINITY, docStoreConn, ES, OB, OS, INFINITY
DOC_ENGINE = os.environ.get("DOC_ENGINE", "elasticsearch") DOC_ENGINE = os.environ.get("DOC_ENGINE", "elasticsearch")
# DOC_ENGINE = os.environ.get('DOC_ENGINE', "opensearch") DOC_ENGINE_INFINITY = (DOC_ENGINE.lower() == "infinity")
lower_case_doc_engine = DOC_ENGINE.lower() lower_case_doc_engine = DOC_ENGINE.lower()
if lower_case_doc_engine == "elasticsearch": if lower_case_doc_engine == "elasticsearch":
ES = get_base_config("es", {}) ES = get_base_config("es", {})
@ -240,6 +244,9 @@ def init_settings():
elif lower_case_doc_engine == "opensearch": elif lower_case_doc_engine == "opensearch":
OS = get_base_config("os", {}) OS = get_base_config("os", {})
docStoreConn = rag.utils.opensearch_conn.OSConnection() docStoreConn = rag.utils.opensearch_conn.OSConnection()
elif lower_case_doc_engine == "oceanbase":
OB = get_base_config("oceanbase", {})
docStoreConn = rag.utils.ob_conn.OBConnection()
else: else:
raise Exception(f"Not supported doc engine: {DOC_ENGINE}") raise Exception(f"Not supported doc engine: {DOC_ENGINE}")


@ -35,6 +35,12 @@ def num_tokens_from_string(string: str) -> int:
return 0 return 0
def total_token_count_from_response(resp): def total_token_count_from_response(resp):
"""
Extract token count from LLM response in various formats.
Handles None responses and different response structures from various LLM providers.
Returns 0 if token count cannot be determined.
"""
if resp is None: if resp is None:
return 0 return 0
@ -50,19 +56,19 @@ def total_token_count_from_response(resp):
except Exception: except Exception:
pass pass
if 'usage' in resp and 'total_tokens' in resp['usage']: if isinstance(resp, dict) and 'usage' in resp and 'total_tokens' in resp['usage']:
try: try:
return resp["usage"]["total_tokens"] return resp["usage"]["total_tokens"]
except Exception: except Exception:
pass pass
if 'usage' in resp and 'input_tokens' in resp['usage'] and 'output_tokens' in resp['usage']: if isinstance(resp, dict) and 'usage' in resp and 'input_tokens' in resp['usage'] and 'output_tokens' in resp['usage']:
try: try:
return resp["usage"]["input_tokens"] + resp["usage"]["output_tokens"] return resp["usage"]["input_tokens"] + resp["usage"]["output_tokens"]
except Exception: except Exception:
pass pass
if 'meta' in resp and 'tokens' in resp['meta'] and 'input_tokens' in resp['meta']['tokens'] and 'output_tokens' in resp['meta']['tokens']: if isinstance(resp, dict) and 'meta' in resp and 'tokens' in resp['meta'] and 'input_tokens' in resp['meta']['tokens'] and 'output_tokens' in resp['meta']['tokens']:
try: try:
return resp["meta"]["tokens"]["input_tokens"] + resp["meta"]["tokens"]["output_tokens"] return resp["meta"]["tokens"]["input_tokens"] + resp["meta"]["tokens"]["output_tokens"]
except Exception: except Exception:
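To make the guarded branches concrete, a few response shapes the helper now handles without raising; the values are made up:

total_style = {"usage": {"total_tokens": 123}}
in_out_style = {"usage": {"input_tokens": 80, "output_tokens": 43}}
meta_style = {"meta": {"tokens": {"input_tokens": 80, "output_tokens": 43}}}

for resp in (None, total_style, in_out_style, meta_style):
    print(total_token_count_from_response(resp))  # 0, 123, 123, 123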


@ -5,20 +5,13 @@
"create_time": {"type": "varchar", "default": ""}, "create_time": {"type": "varchar", "default": ""},
"create_timestamp_flt": {"type": "float", "default": 0.0}, "create_timestamp_flt": {"type": "float", "default": 0.0},
"img_id": {"type": "varchar", "default": ""}, "img_id": {"type": "varchar", "default": ""},
"docnm_kwd": {"type": "varchar", "default": ""}, "docnm": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "docnm_kwd, title_tks, title_sm_tks"},
"title_tks": {"type": "varchar", "default": "", "analyzer": "whitespace"},
"title_sm_tks": {"type": "varchar", "default": "", "analyzer": "whitespace"},
"name_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"}, "name_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"important_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"tag_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"}, "tag_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"important_tks": {"type": "varchar", "default": "", "analyzer": "whitespace"}, "important_keywords": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "important_kwd, important_tks"},
"question_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"}, "questions": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "question_kwd, question_tks"},
"question_tks": {"type": "varchar", "default": "", "analyzer": "whitespace"}, "content": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "content_with_weight, content_ltks, content_sm_ltks"},
"content_with_weight": {"type": "varchar", "default": ""}, "authors": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "authors_tks, authors_sm_tks"},
"content_ltks": {"type": "varchar", "default": "", "analyzer": "whitespace"},
"content_sm_ltks": {"type": "varchar", "default": "", "analyzer": "whitespace"},
"authors_tks": {"type": "varchar", "default": "", "analyzer": "whitespace"},
"authors_sm_tks": {"type": "varchar", "default": "", "analyzer": "whitespace"},
"page_num_int": {"type": "varchar", "default": ""}, "page_num_int": {"type": "varchar", "default": ""},
"top_int": {"type": "varchar", "default": ""}, "top_int": {"type": "varchar", "default": ""},
"position_int": {"type": "varchar", "default": ""}, "position_int": {"type": "varchar", "default": ""},


@ -7,6 +7,20 @@
"status": "1", "status": "1",
"rank": "999", "rank": "999",
"llm": [ "llm": [
{
"llm_name": "gpt-5.1",
"tags": "LLM,CHAT,400k,IMAGE2TEXT",
"max_tokens": 400000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "gpt-5.1-chat-latest",
"tags": "LLM,CHAT,400k,IMAGE2TEXT",
"max_tokens": 400000,
"model_type": "chat",
"is_tools": true
},
{ {
"llm_name": "gpt-5", "llm_name": "gpt-5",
"tags": "LLM,CHAT,400k,IMAGE2TEXT", "tags": "LLM,CHAT,400k,IMAGE2TEXT",
@ -269,20 +283,6 @@
"model_type": "chat", "model_type": "chat",
"is_tools": true "is_tools": true
}, },
{
"llm_name": "glm-4.5",
"tags": "LLM,CHAT,131K",
"max_tokens": 131000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "deepseek-v3.1",
"tags": "LLM,CHAT,128k",
"max_tokens": 128000,
"model_type": "chat",
"is_tools": true
},
{ {
"llm_name": "hunyuan-a13b-instruct", "llm_name": "hunyuan-a13b-instruct",
"tags": "LLM,CHAT,256k", "tags": "LLM,CHAT,256k",
@ -324,6 +324,34 @@
"max_tokens": 262000, "max_tokens": 262000,
"model_type": "chat", "model_type": "chat",
"is_tools": true "is_tools": true
},
{
"llm_name": "deepseek-ocr",
"tags": "LLM,8k",
"max_tokens": 8000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "qwen3-235b-a22b-instruct-2507",
"tags": "LLM,CHAT,256k",
"max_tokens": 256000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "glm-4.6",
"tags": "LLM,CHAT,200k",
"max_tokens": 200000,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "minimax-m2",
"tags": "LLM,CHAT,200k",
"max_tokens": 200000,
"model_type": "chat",
"is_tools": true
} }
] ]
}, },
@ -686,19 +714,13 @@
"model_type": "rerank" "model_type": "rerank"
}, },
{ {
"llm_name": "qwen-audio-asr", "llm_name": "qwen3-asr-flash",
"tags": "SPEECH2TEXT,8k", "tags": "SPEECH2TEXT,8k",
"max_tokens": 8000, "max_tokens": 8000,
"model_type": "speech2text" "model_type": "speech2text"
}, },
{ {
"llm_name": "qwen-audio-asr-latest", "llm_name": "qwen3-asr-flash-2025-09-08",
"tags": "SPEECH2TEXT,8k",
"max_tokens": 8000,
"model_type": "speech2text"
},
{
"llm_name": "qwen-audio-asr-1204",
"tags": "SPEECH2TEXT,8k", "tags": "SPEECH2TEXT,8k",
"max_tokens": 8000, "max_tokens": 8000,
"model_type": "speech2text" "model_type": "speech2text"
@ -1166,6 +1188,12 @@
"tags": "TEXT EMBEDDING", "tags": "TEXT EMBEDDING",
"max_tokens": 8196, "max_tokens": 8196,
"model_type": "embedding" "model_type": "embedding"
},
{
"llm_name": "jina-embeddings-v4",
"tags": "TEXT EMBEDDING",
"max_tokens": 32768,
"model_type": "embedding"
} }
] ]
}, },
@ -1198,39 +1226,14 @@
{ {
"name": "MiniMax", "name": "MiniMax",
"logo": "", "logo": "",
"tags": "LLM,TEXT EMBEDDING", "tags": "LLM",
"status": "1", "status": "1",
"rank": "810", "rank": "810",
"llm": [ "llm": [
{ {
"llm_name": "abab6.5-chat", "llm_name": "MiniMax-M2",
"tags": "LLM,CHAT,8k", "tags": "LLM,CHAT,200k",
"max_tokens": 8192, "max_tokens": 200000,
"model_type": "chat"
},
{
"llm_name": "abab6.5s-chat",
"tags": "LLM,CHAT,245k",
"max_tokens": 245760,
"model_type": "chat",
"is_tools": true
},
{
"llm_name": "abab6.5t-chat",
"tags": "LLM,CHAT,8k",
"max_tokens": 8192,
"model_type": "chat"
},
{
"llm_name": "abab6.5g-chat",
"tags": "LLM,CHAT,8k",
"max_tokens": 8192,
"model_type": "chat"
},
{
"llm_name": "abab5.5s-chat",
"tags": "LLM,CHAT,8k",
"max_tokens": 8192,
"model_type": "chat" "model_type": "chat"
} }
] ]
@ -3218,6 +3221,13 @@
"status": "1", "status": "1",
"rank": "990", "rank": "990",
"llm": [ "llm": [
{
"llm_name": "claude-opus-4-5-20251101",
"tags": "LLM,CHAT,IMAGE2TEXT,200k",
"max_tokens": 204800,
"model_type": "chat",
"is_tools": true
},
{ {
"llm_name": "claude-opus-4-1-20250805", "llm_name": "claude-opus-4-1-20250805",
"tags": "LLM,CHAT,IMAGE2TEXT,200k", "tags": "LLM,CHAT,IMAGE2TEXT,200k",


@ -28,8 +28,17 @@ os:
infinity: infinity:
uri: 'localhost:23817' uri: 'localhost:23817'
db_name: 'default_db' db_name: 'default_db'
oceanbase:
scheme: 'oceanbase' # set 'mysql' to create connection using mysql config
config:
db_name: 'test'
user: 'root@ragflow'
password: 'infini_rag_flow'
host: 'localhost'
port: 2881
redis: redis:
db: 1 db: 1
username: ''
password: 'infini_rag_flow' password: 'infini_rag_flow'
host: 'localhost:6379' host: 'localhost:6379'
task_executor: task_executor:
@ -139,5 +148,3 @@ user_default_llm:
# secret_id: 'tencent_secret_id' # secret_id: 'tencent_secret_id'
# secret_key: 'tencent_secret_key' # secret_key: 'tencent_secret_key'
# region: 'tencent_region' # region: 'tencent_region'
# table_result_type: '1'
# markdown_image_response_type: '1'


@ -187,7 +187,7 @@ class DoclingParser(RAGFlowPdfParser):
bbox = _BBox(int(pn), bb[0], bb[1], bb[2], bb[3]) bbox = _BBox(int(pn), bb[0], bb[1], bb[2], bb[3])
yield (DoclingContentType.EQUATION.value, text, bbox) yield (DoclingContentType.EQUATION.value, text, bbox)
def _transfer_to_sections(self, doc) -> list[tuple[str, str]]: def _transfer_to_sections(self, doc, parse_method: str) -> list[tuple[str, str]]:
sections: list[tuple[str, str]] = [] sections: list[tuple[str, str]] = []
for typ, payload, bbox in self._iter_doc_items(doc): for typ, payload, bbox in self._iter_doc_items(doc):
if typ == DoclingContentType.TEXT.value: if typ == DoclingContentType.TEXT.value:
@ -200,7 +200,12 @@ class DoclingParser(RAGFlowPdfParser):
continue continue
tag = self._make_line_tag(bbox) if isinstance(bbox,_BBox) else "" tag = self._make_line_tag(bbox) if isinstance(bbox,_BBox) else ""
sections.append((section, tag)) if parse_method == "manual":
sections.append((section, typ, tag))
elif parse_method == "paper":
sections.append((section + tag, typ))
else:
sections.append((section, tag))
return sections return sections
def cropout_docling_table(self, page_no: int, bbox: tuple[float, float, float, float], zoomin: int = 1): def cropout_docling_table(self, page_no: int, bbox: tuple[float, float, float, float], zoomin: int = 1):
@ -282,7 +287,8 @@ class DoclingParser(RAGFlowPdfParser):
output_dir: Optional[str] = None, output_dir: Optional[str] = None,
lang: Optional[str] = None, lang: Optional[str] = None,
method: str = "auto", method: str = "auto",
delete_output: bool = True, delete_output: bool = True,
parse_method: str = "raw"
): ):
if not self.check_installation(): if not self.check_installation():
@ -318,7 +324,7 @@ class DoclingParser(RAGFlowPdfParser):
if callback: if callback:
callback(0.7, f"[Docling] Parsed doc: {getattr(doc, 'num_pages', 'n/a')} pages") callback(0.7, f"[Docling] Parsed doc: {getattr(doc, 'num_pages', 'n/a')} pages")
sections = self._transfer_to_sections(doc) sections = self._transfer_to_sections(doc, parse_method=parse_method)
tables = self._transfer_to_tables(doc) tables = self._transfer_to_tables(doc)
if callback: if callback:


@ -138,7 +138,6 @@ class RAGFlowHtmlParser:
"metadata": {"table_id": table_id, "index": table_list.index(t)}}) "metadata": {"table_id": table_id, "index": table_list.index(t)}})
return table_info_list return table_info_list
else: else:
block_id = None
if str.lower(element.name) in BLOCK_TAGS: if str.lower(element.name) in BLOCK_TAGS:
block_id = str(uuid.uuid1()) block_id = str(uuid.uuid1())
for child in element.children: for child in element.children:
@ -172,7 +171,7 @@ class RAGFlowHtmlParser:
if tag_name == "table": if tag_name == "table":
table_info_list.append(item) table_info_list.append(item)
else: else:
current_content += (" " if current_content else "" + content) current_content += (" " if current_content else "") + content
if current_content: if current_content:
block_content.append(current_content) block_content.append(current_content)
return block_content, table_info_list return block_content, table_info_list
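The one-line change above fixes an operator-precedence bug in the string concatenation; a tiny illustration with placeholder values:

# The old expression parsed as: " " if current_content else ("" + content),
# so the content was dropped whenever current_content was non-empty.
current_content, content = "hello", "world"

old = current_content + (" " if current_content else "" + content)   # -> "hello "
new = current_content + (" " if current_content else "") + content   # -> "hello world"
print(old, "|", new)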


@@ -72,9 +72,8 @@ class RAGFlowMarkdownParser:
         # Replace any TAGS e.g. <table ...> to <table>
         TAGS = ["table", "td", "tr", "th", "tbody", "thead", "div"]
-        table_with_attributes_pattern = re.compile(
-            rf"<(?:{'|'.join(TAGS)})[^>]*>", re.IGNORECASE
-        )
+        table_with_attributes_pattern = re.compile(rf"<(?:{'|'.join(TAGS)})[^>]*>", re.IGNORECASE)

         def replace_tag(m):
             tag_name = re.match(r"<(\w+)", m.group()).group(1)
             return "<{}>".format(tag_name)

@@ -128,23 +127,48 @@ class MarkdownElementExtractor:
         self.markdown_content = markdown_content
         self.lines = markdown_content.split("\n")

-    def get_delimiters(self,delimiters):
+    def get_delimiters(self, delimiters):
         toks = re.findall(r"`([^`]+)`", delimiters)
         toks = sorted(set(toks), key=lambda x: -len(x))
         return "|".join(re.escape(t) for t in toks if t)

-    def extract_elements(self,delimiter=None):
+    def extract_elements(self, delimiter=None, include_meta=False):
         """Extract individual elements (headers, code blocks, lists, etc.)"""
         sections = []
         i = 0
-        dels=""
+        dels = ""
         if delimiter:
             dels = self.get_delimiters(delimiter)
         if len(dels) > 0:
             text = "\n".join(self.lines)
-            parts = re.split(dels, text)
-            sections = [p.strip() for p in parts if p and p.strip()]
+            if include_meta:
+                pattern = re.compile(dels)
+                last_end = 0
+                for m in pattern.finditer(text):
+                    part = text[last_end : m.start()]
+                    if part and part.strip():
+                        sections.append(
+                            {
+                                "content": part.strip(),
+                                "start_line": text.count("\n", 0, last_end),
+                                "end_line": text.count("\n", 0, m.start()),
+                            }
+                        )
+                    last_end = m.end()
+                part = text[last_end:]
+                if part and part.strip():
+                    sections.append(
+                        {
+                            "content": part.strip(),
+                            "start_line": text.count("\n", 0, last_end),
+                            "end_line": text.count("\n", 0, len(text)),
+                        }
+                    )
+            else:
+                parts = re.split(dels, text)
+                sections = [p.strip() for p in parts if p and p.strip()]
             return sections

         while i < len(self.lines):
             line = self.lines[i]

@@ -152,32 +176,35 @@ class MarkdownElementExtractor:
             if re.match(r"^#{1,6}\s+.*$", line):
                 # header
                 element = self._extract_header(i)
-                sections.append(element["content"])
+                sections.append(element if include_meta else element["content"])
                 i = element["end_line"] + 1
             elif line.strip().startswith("```"):
                 # code block
                 element = self._extract_code_block(i)
-                sections.append(element["content"])
+                sections.append(element if include_meta else element["content"])
                 i = element["end_line"] + 1
             elif re.match(r"^\s*[-*+]\s+.*$", line) or re.match(r"^\s*\d+\.\s+.*$", line):
                 # list block
                 element = self._extract_list_block(i)
-                sections.append(element["content"])
+                sections.append(element if include_meta else element["content"])
                 i = element["end_line"] + 1
             elif line.strip().startswith(">"):
                 # blockquote
                 element = self._extract_blockquote(i)
-                sections.append(element["content"])
+                sections.append(element if include_meta else element["content"])
                 i = element["end_line"] + 1
             elif line.strip():
                 # text block (paragraphs and inline elements until next block element)
                 element = self._extract_text_block(i)
-                sections.append(element["content"])
+                sections.append(element if include_meta else element["content"])
                 i = element["end_line"] + 1
             else:
                 i += 1

-        sections = [section for section in sections if section.strip()]
+        if include_meta:
+            sections = [section for section in sections if section["content"].strip()]
+        else:
+            sections = [section for section in sections if section.strip()]
         return sections

     def _extract_header(self, start_pos):
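A brief usage sketch (hypothetical caller, assuming the `MarkdownElementExtractor` API as shown in this diff): with `include_meta=True`, each element comes back as a dict carrying at least `content` and `end_line` instead of a bare string.

```python
# Hypothetical caller; assumes MarkdownElementExtractor as defined above.
md = "# Title\n\nFirst paragraph.\n\n- item one\n- item two\n"
extractor = MarkdownElementExtractor(md)

plain = extractor.extract_elements()                  # list[str], the old behaviour
rich = extractor.extract_elements(include_meta=True)  # list[dict], the new behaviour

for el in rich:
    print(el["end_line"], el["content"][:40])
```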


@@ -190,7 +190,7 @@ class MinerUParser(RAGFlowPdfParser):
             self._run_mineru_executable(input_path, output_dir, method, backend, lang, server_url, callback)

     def _run_mineru_api(self, input_path: Path, output_dir: Path, method: str = "auto", backend: str = "pipeline", lang: Optional[str] = None, callback: Optional[Callable] = None):
-        OUTPUT_ZIP_PATH = os.path.join(str(output_dir), "output.zip")
+        output_zip_path = os.path.join(str(output_dir), "output.zip")
         pdf_file_path = str(input_path)

@@ -230,16 +230,16 @@ class MinerUParser(RAGFlowPdfParser):
             response.raise_for_status()

             if response.headers.get("Content-Type") == "application/zip":
-                self.logger.info(f"[MinerU] zip file returned, saving to {OUTPUT_ZIP_PATH}...")
+                self.logger.info(f"[MinerU] zip file returned, saving to {output_zip_path}...")
                 if callback:
-                    callback(0.30, f"[MinerU] zip file returned, saving to {OUTPUT_ZIP_PATH}...")
+                    callback(0.30, f"[MinerU] zip file returned, saving to {output_zip_path}...")
-                with open(OUTPUT_ZIP_PATH, "wb") as f:
+                with open(output_zip_path, "wb") as f:
                     f.write(response.content)

                 self.logger.info(f"[MinerU] Unzip to {output_path}...")
-                self._extract_zip_no_root(OUTPUT_ZIP_PATH, output_path, pdf_file_name + "/")
+                self._extract_zip_no_root(output_zip_path, output_path, pdf_file_name + "/")
                 if callback:
                     callback(0.40, f"[MinerU] Unzip to {output_path}...")

@@ -459,13 +459,36 @@ class MinerUParser(RAGFlowPdfParser):
         return poss

     def _read_output(self, output_dir: Path, file_stem: str, method: str = "auto", backend: str = "pipeline") -> list[dict[str, Any]]:
-        subdir = output_dir / file_stem / method
-        if backend.startswith("vlm-"):
-            subdir = output_dir / file_stem / "vlm"
-        json_file = subdir / f"{file_stem}_content_list.json"
-        if not json_file.exists():
-            raise FileNotFoundError(f"[MinerU] Missing output file: {json_file}")
+        candidates = []
+        seen = set()
+
+        def add_candidate_path(p: Path):
+            if p not in seen:
+                seen.add(p)
+                candidates.append(p)
+
+        if backend.startswith("vlm-"):
+            add_candidate_path(output_dir / file_stem / "vlm")
+            if method:
+                add_candidate_path(output_dir / file_stem / method)
+            add_candidate_path(output_dir / file_stem / "auto")
+        else:
+            if method:
+                add_candidate_path(output_dir / file_stem / method)
+            add_candidate_path(output_dir / file_stem / "vlm")
+            add_candidate_path(output_dir / file_stem / "auto")
+
+        json_file = None
+        subdir = None
+        for sub in candidates:
+            jf = sub / f"{file_stem}_content_list.json"
+            if jf.exists():
+                subdir = sub
+                json_file = jf
+                break
+
+        if not json_file:
+            raise FileNotFoundError(f"[MinerU] Missing output file, tried: {', '.join(str(c / (file_stem + '_content_list.json')) for c in candidates)}")

         with open(json_file, "r", encoding="utf-8") as f:
             data = json.load(f)

@@ -476,7 +499,7 @@ class MinerUParser(RAGFlowPdfParser):
                     item[key] = str((subdir / item[key]).resolve())
         return data

-    def _transfer_to_sections(self, outputs: list[dict[str, Any]]):
+    def _transfer_to_sections(self, outputs: list[dict[str, Any]], parse_method: str = None):
         sections = []
         for output in outputs:
             match output["type"]:

@@ -497,7 +520,11 @@ class MinerUParser(RAGFlowPdfParser):
                 case MinerUContentType.DISCARDED:
                     pass

-            if section:
+            if section and parse_method == "manual":
+                sections.append((section, output["type"], self._line_tag(output)))
+            elif section and parse_method == "paper":
+                sections.append((section + self._line_tag(output), output["type"]))
+            else:
                 sections.append((section, self._line_tag(output)))
         return sections

@@ -516,6 +543,7 @@ class MinerUParser(RAGFlowPdfParser):
         method: str = "auto",
         server_url: Optional[str] = None,
         delete_output: bool = True,
+        parse_method: str = "raw",
     ) -> tuple:
         import shutil

@@ -565,7 +593,8 @@ class MinerUParser(RAGFlowPdfParser):
             self.logger.info(f"[MinerU] Parsed {len(outputs)} blocks from PDF.")
             if callback:
                 callback(0.75, f"[MinerU] Parsed {len(outputs)} blocks from PDF.")
-            return self._transfer_to_sections(outputs), self._transfer_to_tables(outputs)
+
+            return self._transfer_to_sections(outputs, parse_method), self._transfer_to_tables(outputs)
         finally:
             if temp_pdf and temp_pdf.exists():
                 try:
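For reference, a standalone sketch of the fallback search the new `_read_output` performs (a hypothetical helper written only to restate the candidate ordering; paths and names mirror the diff):

```python
from pathlib import Path

def find_content_list(output_dir: Path, file_stem: str, method: str, backend: str) -> Path:
    """Return the first existing <stem>_content_list.json among the candidate subdirs."""
    if backend.startswith("vlm-"):
        order = ["vlm", method, "auto"]   # vlm backends prefer the vlm/ output dir
    else:
        order = [method, "vlm", "auto"]   # pipeline backends prefer the chosen method
    tried = []
    for sub in dict.fromkeys(d for d in order if d):  # de-duplicate, keep order
        candidate = output_dir / file_stem / sub / f"{file_stem}_content_list.json"
        tried.append(candidate)
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"[MinerU] Missing output file, tried: {', '.join(map(str, tried))}")
```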


@@ -33,6 +33,8 @@ import xgboost as xgb
 from huggingface_hub import snapshot_download
 from PIL import Image
 from pypdf import PdfReader as pdf2_read
+from sklearn.cluster import KMeans
+from sklearn.metrics import silhouette_score

 from common.file_utils import get_project_base_directory
 from common.misc_utils import pip_install_torch

@@ -353,7 +355,6 @@ class RAGFlowPdfParser:
     def _assign_column(self, boxes, zoomin=3):
         if not boxes:
             return boxes
-
         if all("col_id" in b for b in boxes):
             return boxes
@@ -361,61 +362,79 @@
         for b in boxes:
             by_page[b["page_number"]].append(b)

-        page_info = {}  # pg -> dict(page_w, left_edge, cand_cols)
-        counter = Counter()
+        page_cols = {}
         for pg, bxs in by_page.items():
             if not bxs:
-                page_info[pg] = {"page_w": 1.0, "left_edge": 0.0, "cand": 1}
-                counter[1] += 1
+                page_cols[pg] = 1
                 continue

-            if hasattr(self, "page_images") and self.page_images and len(self.page_images) >= pg:
-                page_w = self.page_images[pg - 1].size[0] / max(1, zoomin)
-                left_edge = 0.0
-            else:
-                xs0 = [box["x0"] for box in bxs]
-                xs1 = [box["x1"] for box in bxs]
-                left_edge = float(min(xs0))
-                page_w = max(1.0, float(max(xs1) - left_edge))
-
-            widths = [max(1.0, (box["x1"] - box["x0"])) for box in bxs]
-            median_w = float(np.median(widths)) if widths else 1.0
-
-            raw_cols = int(page_w / max(1.0, median_w))
-
-            # cand = raw_cols if (raw_cols >= 2 and median_w < page_w / raw_cols * 0.8) else 1
-            cand = raw_cols
-
-            page_info[pg] = {"page_w": page_w, "left_edge": left_edge, "cand": cand}
-            counter[cand] += 1
-
-            logging.info(f"[Page {pg}] median_w={median_w:.2f}, page_w={page_w:.2f}, raw_cols={raw_cols}, cand={cand}")
-
-        global_cols = counter.most_common(1)[0][0]
+            x0s_raw = np.array([b["x0"] for b in bxs], dtype=float)
+            min_x0 = np.min(x0s_raw)
+            max_x1 = np.max([b["x1"] for b in bxs])
+            width = max_x1 - min_x0
+
+            INDENT_TOL = width * 0.12
+            x0s = []
+            for x in x0s_raw:
+                if abs(x - min_x0) < INDENT_TOL:
+                    x0s.append([min_x0])
+                else:
+                    x0s.append([x])
+            x0s = np.array(x0s, dtype=float)
+
+            max_try = min(4, len(bxs))
+            if max_try < 2:
+                max_try = 1
+
+            best_k = 1
+            best_score = -1
+            for k in range(1, max_try + 1):
+                km = KMeans(n_clusters=k, n_init="auto")
+                labels = km.fit_predict(x0s)
+                centers = np.sort(km.cluster_centers_.flatten())
+                if len(centers) > 1:
+                    try:
+                        score = silhouette_score(x0s, labels)
+                    except ValueError:
+                        continue
+                else:
+                    score = 0
+                if score > best_score:
+                    best_score = score
+                    best_k = k
+
+            page_cols[pg] = best_k
+            logging.info(f"[Page {pg}] best_score={best_score:.2f}, best_k={best_k}")
+
+        global_cols = Counter(page_cols.values()).most_common(1)[0][0]
         logging.info(f"Global column_num decided by majority: {global_cols}")

         for pg, bxs in by_page.items():
             if not bxs:
                 continue

-            page_w = page_info[pg]["page_w"]
-            left_edge = page_info[pg]["left_edge"]
-            if global_cols == 1:
-                for box in bxs:
-                    box["col_id"] = 0
-                continue
-
-            for box in bxs:
-                w = box["x1"] - box["x0"]
-                if w >= 0.8 * page_w:
-                    box["col_id"] = 0
-                    continue
-                cx = 0.5 * (box["x0"] + box["x1"])
-                norm_cx = (cx - left_edge) / page_w
-                norm_cx = max(0.0, min(norm_cx, 0.999999))
-                box["col_id"] = int(min(global_cols - 1, norm_cx * global_cols))
+            k = page_cols[pg]
+            if len(bxs) < k:
+                k = 1
+
+            x0s = np.array([[b["x0"]] for b in bxs], dtype=float)
+            km = KMeans(n_clusters=k, n_init="auto")
+            labels = km.fit_predict(x0s)
+
+            centers = km.cluster_centers_.flatten()
+            order = np.argsort(centers)
+            remap = {orig: new for new, orig in enumerate(order)}
+
+            for b, lb in zip(bxs, labels):
+                b["col_id"] = remap[lb]
+
+            grouped = defaultdict(list)
+            for b in bxs:
+                grouped[b["col_id"]].append(b)

         return boxes
@@ -1071,7 +1090,7 @@
         logging.debug("Images converted.")
         self.is_english = [
-            re.search(r"[a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join(random.choices([c["text"] for c in self.page_chars[i]], k=min(100, len(self.page_chars[i])))))
+            re.search(r"[ a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join(random.choices([c["text"] for c in self.page_chars[i]], k=min(100, len(self.page_chars[i])))))
             for i in range(len(self.page_chars))
         ]
         if sum([1 if e else 0 for e in self.is_english]) > len(self.page_images) / 2:

@@ -1128,7 +1147,7 @@
         if not self.is_english and not any([c for c in self.page_chars]) and self.boxes:
             bxes = [b for bxs in self.boxes for b in bxs]
-            self.is_english = re.search(r"[\na-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join([b["text"] for b in random.choices(bxes, k=min(30, len(bxes)))]))
+            self.is_english = re.search(r"[ \na-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join([b["text"] for b in random.choices(bxes, k=min(30, len(bxes)))]))

         logging.debug(f"Is it English: {self.is_english}")

@@ -1303,7 +1322,10 @@
         positions = []
         for ii, (pns, left, right, top, bottom) in enumerate(poss):
-            right = left + max_width
+            if 0 < ii < len(poss) - 1:
+                right = max(left + 10, right)
+            else:
+                right = left + max_width
             bottom *= ZM
             for pn in pns[1:]:
                 if 0 <= pn - 1 < page_count:
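The new `_assign_column` estimates the number of text columns per page by clustering box left edges (`x0`) with k-means and keeping the k with the best silhouette score. A self-contained sketch of that selection step (illustrative only; the 4-column cap mirrors the diff, everything else is simplified):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_column_count(x0s, max_cols=4):
    """Pick the k in [1, max_cols] whose k-means clustering of x0 values scores best."""
    X = np.asarray(x0s, dtype=float).reshape(-1, 1)
    best_k, best_score = 1, -1.0
    for k in range(1, min(max_cols, len(X)) + 1):
        labels = KMeans(n_clusters=k, n_init="auto").fit_predict(X)
        if len(set(labels)) > 1:
            try:
                score = silhouette_score(X, labels)
            except ValueError:
                continue
        else:
            score = 0.0  # a single cluster gets a neutral score
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Two well-separated groups of left edges -> most likely a two-column page.
print(estimate_column_count([50, 52, 51, 300, 302, 298]))
```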


@@ -192,12 +192,16 @@ class TencentCloudAPIClient:

 class TCADPParser(RAGFlowPdfParser):
-    def __init__(self, secret_id: str = None, secret_key: str = None, region: str = "ap-guangzhou"):
+    def __init__(self, secret_id: str = None, secret_key: str = None, region: str = "ap-guangzhou",
+                 table_result_type: str = None, markdown_image_response_type: str = None):
         super().__init__()
         # First initialize logger
         self.logger = logging.getLogger(self.__class__.__name__)

+        # Log received parameters
+        self.logger.info(f"[TCADP] Initializing with parameters - table_result_type: {table_result_type}, markdown_image_response_type: {markdown_image_response_type}")
+
         # Priority: read configuration from RAGFlow configuration system (service_conf.yaml)
         try:
             tcadp_parser = get_base_config("tcadp_config", {})

@@ -205,14 +209,30 @@ class TCADPParser(RAGFlowPdfParser):
                 self.secret_id = secret_id or tcadp_parser.get("secret_id")
                 self.secret_key = secret_key or tcadp_parser.get("secret_key")
                 self.region = region or tcadp_parser.get("region", "ap-guangzhou")
-                self.table_result_type = tcadp_parser.get("table_result_type", "1")
-                self.markdown_image_response_type = tcadp_parser.get("markdown_image_response_type", "1")
-                self.logger.info("[TCADP] Configuration read from service_conf.yaml")
+                # Set table_result_type and markdown_image_response_type from config or parameters
+                self.table_result_type = table_result_type if table_result_type is not None else tcadp_parser.get("table_result_type", "1")
+                self.markdown_image_response_type = markdown_image_response_type if markdown_image_response_type is not None else tcadp_parser.get("markdown_image_response_type", "1")
             else:
                 self.logger.error("[TCADP] Please configure tcadp_config in service_conf.yaml first")
+                # If config file is empty, use provided parameters or defaults
+                self.secret_id = secret_id
+                self.secret_key = secret_key
+                self.region = region or "ap-guangzhou"
+                self.table_result_type = table_result_type if table_result_type is not None else "1"
+                self.markdown_image_response_type = markdown_image_response_type if markdown_image_response_type is not None else "1"
         except ImportError:
             self.logger.info("[TCADP] Configuration module import failed")
+            # If config file is not available, use provided parameters or defaults
+            self.secret_id = secret_id
+            self.secret_key = secret_key
+            self.region = region or "ap-guangzhou"
+            self.table_result_type = table_result_type if table_result_type is not None else "1"
+            self.markdown_image_response_type = markdown_image_response_type if markdown_image_response_type is not None else "1"
+
+        # Log final values
+        self.logger.info(f"[TCADP] Final values - table_result_type: {self.table_result_type}, markdown_image_response_type: {self.markdown_image_response_type}")

         if not self.secret_id or not self.secret_key:
             raise ValueError("[TCADP] Please set Tencent Cloud API keys, configure tcadp_config in service_conf.yaml")

@@ -400,6 +420,8 @@ class TCADPParser(RAGFlowPdfParser):
                 "TableResultType": self.table_result_type,
                 "MarkdownImageResponseType": self.markdown_image_response_type
             }
+
+            self.logger.info(f"[TCADP] API request config - TableResultType: {self.table_result_type}, MarkdownImageResponseType: {self.markdown_image_response_type}")
             result = client.reconstruct_document_sse(
                 file_type=file_type,
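A minimal sketch of the precedence the updated constructor applies (explicit argument first, then `service_conf.yaml`, then the hard-coded default); the helper name is made up for illustration:

```python
def resolve_option(explicit, conf: dict, key: str, default: str = "1") -> str:
    # Explicit constructor argument > value from service_conf.yaml > default.
    return explicit if explicit is not None else conf.get(key, default)

conf = {"table_result_type": "0"}
print(resolve_option(None, conf, "table_result_type"))  # "0"  (falls back to config)
print(resolve_option("1", conf, "table_result_type"))   # "1"  (explicit argument wins)
print(resolve_option(None, {}, "table_result_type"))    # "1"  (default)
```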


@@ -17,7 +17,7 @@
 import logging
 import math
 import os
-import re
+# import re
 from collections import Counter
 from copy import deepcopy

@@ -62,8 +62,9 @@ class LayoutRecognizer(Recognizer):
     def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True):
         def __is_garbage(b):
-            patt = [r"^•+$", "^[0-9]{1,2} / ?[0-9]{1,2}$", r"^[0-9]{1,2} of [0-9]{1,2}$", "^http://[^ ]{12,}", "\\(cid *: *[0-9]+ *\\)"]
-            return any([re.search(p, b["text"]) for p in patt])
+            return False
+            # patt = [r"^•+$", "^[0-9]{1,2} / ?[0-9]{1,2}$", r"^[0-9]{1,2} of [0-9]{1,2}$", "^http://[^ ]{12,}", "\\(cid *: *[0-9]+ *\\)"]
+            # return any([re.search(p, b["text"]) for p in patt])

         if self.client:
             layouts = self.client.predict(image_list)


@@ -7,6 +7,7 @@
 # Available options:
 # - `elasticsearch` (default)
 # - `infinity` (https://github.com/infiniflow/infinity)
+# - `oceanbase` (https://github.com/oceanbase/oceanbase)
 # - `opensearch` (https://github.com/opensearch-project/OpenSearch)
 DOC_ENGINE=${DOC_ENGINE:-elasticsearch}

@@ -62,6 +63,27 @@ INFINITY_THRIFT_PORT=23817
 INFINITY_HTTP_PORT=23820
 INFINITY_PSQL_PORT=5432

+# The hostname where the OceanBase service is exposed
+OCEANBASE_HOST=oceanbase
+# The port used to expose the OceanBase service
+OCEANBASE_PORT=2881
+# The username for OceanBase
+OCEANBASE_USER=root@ragflow
+# The password for OceanBase
+OCEANBASE_PASSWORD=infini_rag_flow
+# The doc database of the OceanBase service to use
+OCEANBASE_DOC_DBNAME=ragflow_doc
+
+# OceanBase container configuration
+OB_CLUSTER_NAME=${OB_CLUSTER_NAME:-ragflow}
+OB_TENANT_NAME=${OB_TENANT_NAME:-ragflow}
+OB_SYS_PASSWORD=${OCEANBASE_PASSWORD:-infini_rag_flow}
+OB_TENANT_PASSWORD=${OCEANBASE_PASSWORD:-infini_rag_flow}
+OB_MEMORY_LIMIT=${OB_MEMORY_LIMIT:-10G}
+OB_SYSTEM_MEMORY=${OB_SYSTEM_MEMORY:-2G}
+OB_DATAFILE_SIZE=${OB_DATAFILE_SIZE:-20G}
+OB_LOG_DISK_SIZE=${OB_LOG_DISK_SIZE:-20G}
+
 # The password for MySQL.
 MYSQL_PASSWORD=infini_rag_flow
 # The hostname where the MySQL service is exposed

@@ -208,9 +230,16 @@ REGISTER_ENABLED=1
 # SANDBOX_MAX_MEMORY=256m # b, k, m, g
 # SANDBOX_TIMEOUT=10s # s, m, 1m30s

-# Enable DocLing and Mineru
+# Enable DocLing
 USE_DOCLING=false
+# Enable Mineru
 USE_MINERU=false
+MINERU_EXECUTABLE="$HOME/uv_tools/.venv/bin/mineru"
+MINERU_DELETE_OUTPUT=0 # keep output directory
+MINERU_BACKEND=pipeline # or another backend you prefer

 # pptx support
 DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=1


@@ -138,6 +138,15 @@ The [.env](./.env) file contains important environment variables for Docker.
   - `password`: The password for MinIO.
   - `host`: The MinIO serving IP *and* port inside the Docker container. Defaults to `minio:9000`.

+- `oceanbase`
+  - `scheme`: The connection scheme. Set to `mysql` to reuse the MySQL config, or any other value to use the `config` section below.
+  - `config`:
+    - `db_name`: The OceanBase database name.
+    - `user`: The username for OceanBase.
+    - `password`: The password for OceanBase.
+    - `host`: The hostname of the OceanBase service.
+    - `port`: The port of OceanBase.
+
 - `oss`
   - `access_key`: The access key ID used to authenticate requests to the OSS service.
   - `secret_key`: The secret access key used to authenticate requests to the OSS service.


@@ -72,7 +72,7 @@ services:
   infinity:
     profiles:
       - infinity
-    image: infiniflow/infinity:v0.6.5
+    image: infiniflow/infinity:v0.6.8
     volumes:
       - infinity_data:/var/infinity
       - ./infinity_conf.toml:/infinity_conf.toml

@@ -96,6 +96,31 @@ services:
       retries: 120
     restart: on-failure

+  oceanbase:
+    profiles:
+      - oceanbase
+    image: oceanbase/oceanbase-ce:4.4.1.0-100000032025101610
+    volumes:
+      - ./oceanbase/data:/root/ob
+      - ./oceanbase/conf:/root/.obd/cluster
+      - ./oceanbase/init.d:/root/boot/init.d
+    ports:
+      - ${OCEANBASE_PORT:-2881}:2881
+    env_file: .env
+    environment:
+      - MODE=normal
+      - OB_SERVER_IP=127.0.0.1
+    mem_limit: ${MEM_LIMIT}
+    healthcheck:
+      test: [ 'CMD-SHELL', 'obclient -h127.0.0.1 -P2881 -uroot@${OB_TENANT_NAME:-ragflow} -p${OB_TENANT_PASSWORD:-infini_rag_flow} -e "CREATE DATABASE IF NOT EXISTS ${OCEANBASE_DOC_DBNAME:-ragflow_doc};"' ]
+      interval: 10s
+      retries: 30
+      start_period: 30s
+      timeout: 10s
+    networks:
+      - ragflow
+    restart: on-failure
+
   sandbox-executor-manager:
     profiles:
       - sandbox

@@ -154,7 +179,7 @@ services:
   minio:
     image: quay.io/minio/minio:RELEASE.2025-06-13T11-33-47Z
-    command: server --console-address ":9001" /data
+    command: ["server", "--console-address", ":9001", "/data"]
     ports:
       - ${MINIO_PORT}:9000
      - ${MINIO_CONSOLE_PORT}:9001

@@ -176,7 +201,7 @@ services:
   redis:
     # swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/valkey/valkey:8
     image: valkey/valkey:8
-    command: redis-server --requirepass ${REDIS_PASSWORD} --maxmemory 128mb --maxmemory-policy allkeys-lru
+    command: ["redis-server", "--requirepass", "${REDIS_PASSWORD}", "--maxmemory", "128mb", "--maxmemory-policy", "allkeys-lru"]
     env_file: .env
     ports:
       - ${REDIS_PORT}:6379

@@ -256,6 +281,8 @@ volumes:
     driver: local
   infinity_data:
     driver: local
+  ob_data:
+    driver: local
   mysql_data:
     driver: local
   minio_data:


@@ -13,6 +13,7 @@ function usage() {
   echo "  --disable-datasync           Disables synchronization of datasource workers."
   echo "  --enable-mcpserver           Enables the MCP server."
   echo "  --enable-adminserver         Enables the Admin server."
+  echo "  --init-superuser             Initializes the superuser."
   echo "  --consumer-no-beg=<num>      Start range for consumers (if using range-based)."
   echo "  --consumer-no-end=<num>      End range for consumers (if using range-based)."
   echo "  --workers=<num>              Number of task executors to run (if range is not used)."

@@ -24,6 +25,7 @@ function usage() {
   echo "  $0 --disable-webserver --workers=2 --host-id=myhost123"
   echo "  $0 --enable-mcpserver"
   echo "  $0 --enable-adminserver"
+  echo "  $0 --init-superuser"
   exit 1
 }

@@ -32,6 +34,7 @@ ENABLE_TASKEXECUTOR=1 # Default to enable task executor
 ENABLE_DATASYNC=1
 ENABLE_MCP_SERVER=0
 ENABLE_ADMIN_SERVER=0 # Default close admin server
+INIT_SUPERUSER_ARGS="" # Default to not initialize superuser
 CONSUMER_NO_BEG=0
 CONSUMER_NO_END=0
 WORKERS=1

@@ -83,6 +86,10 @@ for arg in "$@"; do
     --enable-adminserver)
       ENABLE_ADMIN_SERVER=1
       shift
       ;;
+    --init-superuser)
+      INIT_SUPERUSER_ARGS="--init-superuser"
+      shift
+      ;;
     --mcp-host=*)
       MCP_HOST="${arg#*=}"
       shift

@@ -240,7 +247,7 @@ if [[ "${ENABLE_WEBSERVER}" -eq 1 ]]; then
   echo "Starting ragflow_server..."
   while true; do
-    "$PY" api/ragflow_server.py &
+    "$PY" api/ragflow_server.py ${INIT_SUPERUSER_ARGS} &
     wait;
     sleep 1;
   done &


@@ -1,5 +1,5 @@
 [general]
-version = "0.6.5"
+version = "0.6.8"
 time_zone = "utc-8"

 [network]

@@ -54,4 +54,3 @@ memindex_memory_quota = "1GB"
 wal_dir = "/var/infinity/wal"

 [resource]
-resource_dir = "/var/infinity/resource"


@@ -23,12 +23,12 @@ server {
     gzip_disable "MSIE [1-6]\.";

     location ~ ^/api/v1/admin {
-        proxy_pass http://ragflow:9381;
+        proxy_pass http://localhost:9381;
         include proxy.conf;
     }

     location ~ ^/(v1|api) {
-        proxy_pass http://ragflow:9380;
+        proxy_pass http://localhost:9380;
         include proxy.conf;
     }


@@ -0,0 +1 @@
+ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;


@@ -28,8 +28,17 @@ os:
 infinity:
   uri: '${INFINITY_HOST:-infinity}:23817'
   db_name: 'default_db'
+oceanbase:
+  scheme: 'oceanbase' # set 'mysql' to create connection using mysql config
+  config:
+    db_name: '${OCEANBASE_DOC_DBNAME:-test}'
+    user: '${OCEANBASE_USER:-root@ragflow}'
+    password: '${OCEANBASE_PASSWORD:-infini_rag_flow}'
+    host: '${OCEANBASE_HOST:-oceanbase}'
+    port: ${OCEANBASE_PORT:-2881}
 redis:
   db: 1
+  username: '${REDIS_USERNAME:-}'
   password: '${REDIS_PASSWORD:-infini_rag_flow}'
   host: '${REDIS_HOST:-redis}:6379'

 user_default_llm:

@@ -142,5 +151,3 @@ user_default_llm:
 #   secret_id: '${TENCENT_SECRET_ID}'
 #   secret_key: '${TENCENT_SECRET_KEY}'
 #   region: '${TENCENT_REGION}'
-#   table_result_type: '1'
-#   markdown_image_response_type: '1'


@@ -89,6 +89,8 @@ RAGFlow utilizes MinIO as its object storage solution, leveraging its scalability
 - `REDIS_PORT`
   The port used to expose the Redis service to the host machine, allowing **external** access to the Redis service running inside the Docker container. Defaults to `6379`.

+- `REDIS_USERNAME`
+  Optional Redis ACL username when using Redis 6+ authentication.
+
 - `REDIS_PASSWORD`
   The password for Redis.

@@ -160,6 +162,13 @@ If you cannot download the RAGFlow Docker image, try the following mirrors.
 - `password`: The password for MinIO.
 - `host`: The MinIO serving IP *and* port inside the Docker container. Defaults to `minio:9000`.

+### `redis`
+
+- `host`: The Redis serving IP *and* port inside the Docker container. Defaults to `redis:6379`.
+- `db`: The Redis database index to use. Defaults to `1`.
+- `username`: Optional Redis ACL username (Redis 6+).
+- `password`: The password for the specified Redis user.
+
 ### `oauth`

 The OAuth configuration for signing up or signing in to RAGFlow using a third-party account.


@@ -323,9 +323,9 @@
 2. Follow [this document](./guides/run_health_check.md) to check the health status of the Elasticsearch service.

 :::danger IMPORTANT
 The status of a Docker container does not necessarily reflect the status of the service. You may find that your services are unhealthy even when the corresponding Docker containers are up and running. Possible reasons for this include network failures, incorrect port numbers, or DNS issues.
 :::

 3. If your container keeps restarting, ensure `vm.max_map_count` >= 262144 as per [this README](https://github.com/infiniflow/ragflow?tab=readme-ov-file#-start-up-the-server). Updating the `vm.max_map_count` value in **/etc/sysctl.conf** is required if you wish to keep your change permanent. Note that this configuration works only for Linux.

@@ -456,9 +456,9 @@
 ```bash
 $ docker compose -f docker/docker-compose.yml down -v
 ```

 :::caution WARNING
 `-v` will delete all Docker container volumes, and the existing data will be cleared.
 :::

 2. In **docker/.env**, set `DOC_ENGINE=${DOC_ENGINE:-infinity}`
 3. Restart your Docker image:

@@ -497,20 +497,6 @@ MinerU PDF document parsing is available starting from v0.22.0.
 1. Prepare MinerU

-   - **If you deploy RAGFlow from source**, install MinerU into an isolated virtual environment (recommended path: `$HOME/uv_tools`):
-
-     ```bash
-     mkdir -p "$HOME/uv_tools"
-     cd "$HOME/uv_tools"
-     uv venv .venv
-     source .venv/bin/activate
-     uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
-     # or
-     # uv pip install -U "mineru[all]" -i https://mirrors.aliyun.com/pypi/simple
-     ```
-
-   - **If you deploy RAGFlow with Docker**, you usually only need to turn on MinerU support in `docker/.env`:

     ```bash
     # docker/.env
     ...

@@ -518,18 +504,15 @@
     ...
     ```

-   Enabling `USE_MINERU=true` will internally perform the same setup as the manual configuration (including setting the MinerU executable path and related environment variables). You only need the manual installation above if you are running from source or want full control over the MinerU installation.
+   Enabling `USE_MINERU=true` will internally perform the same setup as the manual configuration (including setting the MinerU executable path and related environment variables).

 2. Start RAGFlow with MinerU enabled:

-   - **Source deployment**: in the RAGFlow repo, export the key MinerU-related variables and start the backend service:
+   - **Source deployment**: in the RAGFlow repo, continue to start the backend service:

     ```bash
-     # in RAGFlow repo
-     export MINERU_EXECUTABLE="$HOME/uv_tools/.venv/bin/mineru"
-     export MINERU_DELETE_OUTPUT=0  # keep output directory
-     export MINERU_BACKEND=pipeline  # or another backend you prefer
+     ...
     source .venv/bin/activate
     export PYTHONPATH=$(pwd)
     bash docker/launch_backend_service.sh

Some files were not shown because too many files have changed in this diff.