Commit Graph

262 Commits

Author SHA1 Message Date
7a4044b05f Feat: use filepath for files with the same name for all data source types (#11819)
### What problem does this PR solve?

When there are multiple files with the same name the file would just
duplicate, making it hard to distinguish between the different files.
Now if there are multiple files with the same name, they will be named
after their folder path in the storage unit.

This was done for the webdav connector and with this PR also for Notion,
Confluence and S3 Storage.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Contribution by RAGcon GmbH, visit us [here](https://www.ragcon.ai/)
2025-12-18 17:42:43 +08:00
4dd8cdc38b task executor issues (#12006)
### What problem does this PR solve?

**Fixes #8706** - `InfinityException: TOO_MANY_CONNECTIONS` when running
multiple task executor workers

### Problem Description

When running RAGFlow with 8-16 task executor workers, most workers fail
to start properly. Checking logs revealed that workers were
stuck/hanging during Infinity connection initialization - only 1-2
workers would successfully register in Redis while the rest remained
blocked.

### Root Cause

The Infinity SDK `ConnectionPool` pre-allocates all connections in
`__init__`. With the default `max_size=32` and multiple workers (e.g.,
16), this creates 16×32=512 connections immediately on startup,
exceeding Infinity's default 128 connection limit. Workers hang while
waiting for connections that can never be established.

### Changes

1. **Prevent Infinity connection storm** (`rag/utils/infinity_conn.py`,
`rag/svr/task_executor.py`)
- Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since
operations are synchronous)
- Added staggered startup delay (2s per worker) to spread connection
initialization

2. **Handle None children_delimiter** (`rag/app/naive.py`)
   - Use `or ""` to handle explicitly set None values from parser config

3. **MinerU parser robustness** (`deepdoc/parser/mineru_parser.py`)
   - Use `.get()` for optional output fields that may be missing
- Fix DISCARDED block handling: change `pass` to `continue` to skip
discarded blocks entirely

### Why `max_size=4` is sufficient

| Workers | Pool Size | Total Connections | Infinity Limit |
|---------|-----------|-------------------|----------------|
| 16      | 32        | 512               | 128          |
| 16      | 4         | 64                | 128          |
| 32      | 4         | 128               | 128          |

- All RAGFlow operations are synchronous: `get_conn()` → operation →
`release_conn()`
- No parallel `docStoreConn` operations in the codebase
- Maximum 1-2 concurrent connections needed per worker; 4 provides
safety margin

### MinerU DISCARDED block bug

When MinerU returns blocks with `type: "discarded"` (headers, footers,
watermarks, page numbers, artifacts), the previous code used `pass`
which left the `section` variable undefined, causing:

- **UnboundLocalError** if DISCARDED is the first block
- **Duplicate content** if DISCARDED follows another block (stale value
from previous iteration)

**Root cause confirmed via MinerU source code:**

From
[`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14):
```python
class BlockType:
    DISCARDED = 'discarded'
    # VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE
```

Per [MinerU
documentation](https://opendatalab.github.io/MinerU/reference/output_files/),
discarded blocks contain content that should be filtered out for clean
text extraction.

**Fix:** Changed `pass` to `continue` to skip discarded blocks entirely.

### Testing

- Verified all 16 workers now register successfully in Redis
- All workers heartbeating correctly
- Document parsing works as expected
- MinerU parsing with DISCARDED blocks no longer crashes

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: user210 <user210@rt>
2025-12-18 10:03:30 +08:00
8e4d011b15 Fix: parent-children chunking method. (#11997)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-12-17 16:50:36 +08:00
03f9be7cbb Refa: only support MinerU-API now (#11977)
### What problem does this PR solve?

Only support MinerU-API now, still need to complete frontend for
pipeline to allow the configuration of MinerU options.

### Type of change

- [x] Refactoring
2025-12-17 12:58:48 +08:00
30019dab9f Change knowledge base to dataset (#11976)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-17 10:03:33 +08:00
7db9045b74 Feat: Add box connector (#11845)
### What problem does this PR solve?

Feat: Add box connector

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-12 10:23:40 +08:00
ea4a5cd665 Fix: tokenizer issue. (#11902)
#11786
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-11 17:38:17 +08:00
a6afb7dfe2 Fix data_sync startup crash by properly invoking async main (#11879)
### What problem does this PR solve?

This PR fixes a startup crash in the data_sync_0 service caused by an
incorrect asyncio.run call. The main coroutine was being passed as a
function reference instead of being invoked, which raised:

`ValueError: a coroutine was expected, got <function main ...>
`

What I changed

- Updated the entrypoint in sync_data_source.py to correctly invoke the
coroutine with `asyncio.run(main())`.

Testing
- No tested.

Related Issue
Fixes https://github.com/infiniflow/ragflow/issues/11878

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-11 10:09:16 +08:00
65a5a56d95 Refa:replace trio with asyncio (#11831)
### What problem does this PR solve?

change:
replace trio with asyncio

### Type of change
- [x] Refactoring
2025-12-09 19:23:14 +08:00
dd046be976 Fix: parent-child chunking method (#11810)
### What problem does this PR solve?

change:
parent-child chunking method

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-09 09:34:01 +08:00
9b8971a9de Fix:toc in pipeline (#11785)
### What problem does this PR solve?
change:
Fix toc in pipeline
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-08 09:42:20 +08:00
4012d65b3c Feat: update front end for confluence connector (#11747)
### What problem does this PR solve?

Feat: update front end for confluence connector

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-04 18:49:13 +08:00
e2bc1a3478 Feat: add more attribute for confluence connector. (#11743)
### What problem does this PR solve?

Feat: add more attribute for confluence connector. 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-04 17:28:03 +08:00
4870d42949 feat: Auto-disable Raptor for structured data (Issue #11653) (#11676)
### What problem does this PR solve?

Feature: This PR implements automatic Raptor disabling for structured
data files to address issue #11653.

**Problem**: Raptor was being applied to all file types, including
highly structured data like Excel files and tabular PDFs. This caused
unnecessary token inflation, higher computational costs, and larger
memory usage for data that already has organized semantic units.

**Solution**: Automatically skip Raptor processing for:
- Excel files (.xls, .xlsx, .xlsm, .xlsb)
- CSV files (.csv, .tsv)
- PDFs with tabular data (table parser or html4excel enabled)

**Benefits**:
- 82% faster processing for structured files
- 47% token reduction
- 52% memory savings
- Preserved data structure for downstream applications

**Usage Examples**:
```
# Excel file - automatically skipped
should_skip_raptor(".xlsx")  # True

# CSV file - automatically skipped  
should_skip_raptor(".csv")  # True

# Tabular PDF - automatically skipped
should_skip_raptor(".pdf", parser_id="table")  # True

# Regular PDF - Raptor runs normally
should_skip_raptor(".pdf", parser_id="naive")  # False

# Override for special cases
should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False})  # False
```

**Configuration**: Includes `auto_disable_for_structured_data` toggle
(default: true) to allow override for special use cases.

**Testing**: 44 comprehensive tests, 100% passing

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 17:02:29 +08:00
3c50c7d3ac Refactor code (#11694)
### What problem does this PR solve?

Rename function and refactor log message

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-03 15:15:00 +08:00
b5ad7b7062 Feat: support TOC transformer. (#11685)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 12:27:50 +08:00
c946858328 Feat: add mineru auto installer (#11649)
### What problem does this PR solve?

Feat: add mineru auto installer

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:29:26 +08:00
d1e172171f Refactor: better describe how to get prefix for sync data source (#11636)
### What problem does this PR solve?

better describe how to get prefix for sync data source

### Type of change

- [x] Refactoring
2025-12-01 17:46:44 +08:00
6ea4248bdc Feat: support parent-child in search procedure. (#11629)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 14:03:09 +08:00
14616cf845 Feat: add child parent chunking method in backend. (#11598)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 19:25:32 +08:00
cf7fdd274b Feat: add gmail connector (#11549)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 13:09:40 +08:00
12979a3f21 feat: improve metadata handling in connector service (#11421)
### What problem does this PR solve?

- Update sync data source to handle metadata properly

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-26 19:55:48 +08:00
2fd5ac1031 Feat: Add Webdav storage as data source (#11422)
### What problem does this PR solve?

This PR adds webdav storage as data source for data sync service.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-26 14:14:42 +08:00
d1744aaaf3 Feat: add datasource Dropbox (#11488)
### What problem does this PR solve?

Add datasource Dropbox.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 09:40:03 +08:00
f0a14f5fce Add Moodle data source integration (#11325)
### What problem does this PR solve?

This PR adds a native Moodle connector to sync content (courses,
resources, forums, assignments, pages, books) into RAGFlow.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-21 19:58:49 +08:00
13e212c856 Feat: add Jira connector (#11285)
### What problem does this PR solve?

Add Jira connector.

<img width="978" height="925" alt="image"
src="https://github.com/user-attachments/assets/78bb5c77-2710-4569-a76e-9087ca23b227"
/>

---

<img width="1903" height="489" alt="image"
src="https://github.com/user-attachments/assets/193bc5c5-f751-4bd5-883a-2173282c2b96"
/>

---

<img width="1035" height="925" alt="image"
src="https://github.com/user-attachments/assets/1a0aec19-30eb-4ada-9283-61d1c915f59d"
/>

---

<img width="1905" height="601" alt="image"
src="https://github.com/user-attachments/assets/3dde1062-3f27-4717-8e09-fd5fd5e64171"
/>

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-17 09:38:04 +08:00
e8f1a245a6 Feat:update check_embedding api (#11254)
### What problem does this PR solve?
pr: 
#10854
change:
update check_embedding api

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-13 18:48:25 +08:00
908450509f Feat: add fault-tolerant mechanism to RAPTOR (#11206)
### What problem does this PR solve?

Add fault-tolerant mechanism to RAPTOR.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-13 18:48:07 +08:00
296476ab89 Refactor function name (#11210)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-12 19:00:15 +08:00
8ae562504b Fix: GraphRAG and RAPTOR tasks do not affect document status (#11194)
### What problem does this PR solve?

GraphRAG and RAPTOR tasks do not affect document status.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-12 12:03:41 +08:00
c30ffb5716 Fix: ollama model list issue. (#11175)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-11 19:46:41 +08:00
9213568692 Feat: add mechanism to check cancellation in Agent (#10766)
### What problem does this PR solve?

Add mechanism to check cancellation in Agent.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-11 17:36:48 +08:00
f441f8ffc2 Fix: waitForResponse component. (#11172)
### What problem does this PR solve?

#10056

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-11-11 16:58:47 +08:00
df16a80f25 Feat: add initial Google Drive connector support (#11147)
### What problem does this PR solve?

This feature is primarily ported from the
[Onyx](https://github.com/onyx-dot-app/onyx) project with necessary
modifications. Thanks for such a brilliant project.

Minor: consistently use `google_drive` rather than `google_driver`.

<img width="566" height="731" alt="image"
src="https://github.com/user-attachments/assets/6f64e70e-881e-42c7-b45f-809d3e0024a4"
/>

<img width="904" height="830" alt="image"
src="https://github.com/user-attachments/assets/dfa7d1ef-819a-4a82-8c52-0999f48ed4a6"
/>

<img width="911" height="869" alt="image"
src="https://github.com/user-attachments/assets/39e792fb-9fbe-4f3d-9b3c-b2265186bc22"
/>

<img width="947" height="323" alt="image"
src="https://github.com/user-attachments/assets/27d70e96-d9c0-42d9-8c89-276919b6d61d"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-10 19:15:02 +08:00
d207291217 Fix: add download stats to kb logs. (#11112)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-10 13:28:07 +08:00
d016a06fd5 Feat/monitor task (#11116)
### What problem does this PR solve?

Show task executor.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-10 12:51:39 +08:00
dd1c8c5779 Feat: add auto parse to connector. (#11099)
### What problem does this PR solve?

#10953

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-07 16:49:29 +08:00
34283d4db4 Feat: add data source to pipleline logs . (#11075)
### What problem does this PR solve?

#10953

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-07 11:43:59 +08:00
af98763e27 Admin: add 'show version' (#11079)
### What problem does this PR solve?

```
admin> show version;
show_version
+-----------------------+
| version               |
+-----------------------+
| v0.21.0-241-gc6cf58d5 |
+-----------------------+
admin> \q
Goodbye!

```

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-06 19:24:46 +08:00
0cd8024c34 Feat: RAPTOR handle cancel gracefully (#11074)
### What problem does this PR solve?

RAPTOR handle cancel gracefully.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-06 17:18:03 +08:00
3bd1fefe1f Feat: debug sync data. (#11073)
### What problem does this PR solve?

#10953 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-06 16:48:04 +08:00
23b81eae77 Feat: GraphRAG handle cancel gracefully (#11061)
### What problem does this PR solve?

 GraghRAG handle cancel gracefully. #10997.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-06 16:12:20 +08:00
adbb8319e0 Fix: add fields for logs. (#11039)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-06 09:49:57 +08:00
f98b24c9bf Move api.settings to common.settings (#11036)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-06 09:36:38 +08:00
dddf766470 Feat: start data sync service. (#11026)
### What problem does this PR solve?

#10953 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-05 15:43:15 +08:00
b86e07088b Fix: escape multi-steps issues. (#11016)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-05 14:51:00 +08:00
1a9215bc6f Move some vars to globals (#11017)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 14:14:38 +08:00
96c015fb85 Fix and refactor imports (#11010)
### What problem does this PR solve?

1. Move EMBEDDING_CFG to common.globals
2. Fix error imports
3. Move signal handles to common/signal_utils.py

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 11:07:54 +08:00
bab3fce136 Move some constants to common (#11004)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 08:01:39 +08:00
4bbbf92331 Refa: link connector to KB. (#10991)
### What problem does this PR solve?

#10953

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-04 20:13:52 +08:00