Commit Graph

249 Commits

Author SHA1 Message Date
4870d42949 feat: Auto-disable Raptor for structured data (Issue #11653) (#11676)
### What problem does this PR solve?

Feature: This PR implements automatic Raptor disabling for structured
data files to address issue #11653.

**Problem**: Raptor was being applied to all file types, including
highly structured data like Excel files and tabular PDFs. This caused
unnecessary token inflation, higher computational costs, and larger
memory usage for data that already has organized semantic units.

**Solution**: Automatically skip Raptor processing for:
- Excel files (.xls, .xlsx, .xlsm, .xlsb)
- CSV files (.csv, .tsv)
- PDFs with tabular data (table parser or html4excel enabled)

**Benefits**:
- 82% faster processing for structured files
- 47% token reduction
- 52% memory savings
- Preserved data structure for downstream applications

**Usage Examples**:
```
# Excel file - automatically skipped
should_skip_raptor(".xlsx")  # True

# CSV file - automatically skipped  
should_skip_raptor(".csv")  # True

# Tabular PDF - automatically skipped
should_skip_raptor(".pdf", parser_id="table")  # True

# Regular PDF - Raptor runs normally
should_skip_raptor(".pdf", parser_id="naive")  # False

# Override for special cases
should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False})  # False
```

**Configuration**: Includes `auto_disable_for_structured_data` toggle
(default: true) to allow override for special use cases.

**Testing**: 44 comprehensive tests, 100% passing

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 17:02:29 +08:00
3c50c7d3ac Refactor code (#11694)
### What problem does this PR solve?

Rename function and refactor log message

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-03 15:15:00 +08:00
b5ad7b7062 Feat: support TOC transformer. (#11685)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 12:27:50 +08:00
c946858328 Feat: add mineru auto installer (#11649)
### What problem does this PR solve?

Feat: add mineru auto installer

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:29:26 +08:00
d1e172171f Refactor: better describe how to get prefix for sync data source (#11636)
### What problem does this PR solve?

better describe how to get prefix for sync data source

### Type of change

- [x] Refactoring
2025-12-01 17:46:44 +08:00
6ea4248bdc Feat: support parent-child in search procedure. (#11629)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 14:03:09 +08:00
14616cf845 Feat: add child parent chunking method in backend. (#11598)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 19:25:32 +08:00
cf7fdd274b Feat: add gmail connector (#11549)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 13:09:40 +08:00
12979a3f21 feat: improve metadata handling in connector service (#11421)
### What problem does this PR solve?

- Update sync data source to handle metadata properly

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-26 19:55:48 +08:00
2fd5ac1031 Feat: Add Webdav storage as data source (#11422)
### What problem does this PR solve?

This PR adds webdav storage as data source for data sync service.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-26 14:14:42 +08:00
d1744aaaf3 Feat: add datasource Dropbox (#11488)
### What problem does this PR solve?

Add datasource Dropbox.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 09:40:03 +08:00
f0a14f5fce Add Moodle data source integration (#11325)
### What problem does this PR solve?

This PR adds a native Moodle connector to sync content (courses,
resources, forums, assignments, pages, books) into RAGFlow.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-21 19:58:49 +08:00
13e212c856 Feat: add Jira connector (#11285)
### What problem does this PR solve?

Add Jira connector.

<img width="978" height="925" alt="image"
src="https://github.com/user-attachments/assets/78bb5c77-2710-4569-a76e-9087ca23b227"
/>

---

<img width="1903" height="489" alt="image"
src="https://github.com/user-attachments/assets/193bc5c5-f751-4bd5-883a-2173282c2b96"
/>

---

<img width="1035" height="925" alt="image"
src="https://github.com/user-attachments/assets/1a0aec19-30eb-4ada-9283-61d1c915f59d"
/>

---

<img width="1905" height="601" alt="image"
src="https://github.com/user-attachments/assets/3dde1062-3f27-4717-8e09-fd5fd5e64171"
/>

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-17 09:38:04 +08:00
e8f1a245a6 Feat:update check_embedding api (#11254)
### What problem does this PR solve?
pr: 
#10854
change:
update check_embedding api

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-13 18:48:25 +08:00
908450509f Feat: add fault-tolerant mechanism to RAPTOR (#11206)
### What problem does this PR solve?

Add fault-tolerant mechanism to RAPTOR.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-13 18:48:07 +08:00
296476ab89 Refactor function name (#11210)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-12 19:00:15 +08:00
8ae562504b Fix: GraphRAG and RAPTOR tasks do not affect document status (#11194)
### What problem does this PR solve?

GraphRAG and RAPTOR tasks do not affect document status.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-12 12:03:41 +08:00
c30ffb5716 Fix: ollama model list issue. (#11175)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-11 19:46:41 +08:00
9213568692 Feat: add mechanism to check cancellation in Agent (#10766)
### What problem does this PR solve?

Add mechanism to check cancellation in Agent.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-11 17:36:48 +08:00
f441f8ffc2 Fix: waitForResponse component. (#11172)
### What problem does this PR solve?

#10056

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-11-11 16:58:47 +08:00
df16a80f25 Feat: add initial Google Drive connector support (#11147)
### What problem does this PR solve?

This feature is primarily ported from the
[Onyx](https://github.com/onyx-dot-app/onyx) project with necessary
modifications. Thanks for such a brilliant project.

Minor: consistently use `google_drive` rather than `google_driver`.

<img width="566" height="731" alt="image"
src="https://github.com/user-attachments/assets/6f64e70e-881e-42c7-b45f-809d3e0024a4"
/>

<img width="904" height="830" alt="image"
src="https://github.com/user-attachments/assets/dfa7d1ef-819a-4a82-8c52-0999f48ed4a6"
/>

<img width="911" height="869" alt="image"
src="https://github.com/user-attachments/assets/39e792fb-9fbe-4f3d-9b3c-b2265186bc22"
/>

<img width="947" height="323" alt="image"
src="https://github.com/user-attachments/assets/27d70e96-d9c0-42d9-8c89-276919b6d61d"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-10 19:15:02 +08:00
d207291217 Fix: add download stats to kb logs. (#11112)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-10 13:28:07 +08:00
d016a06fd5 Feat/monitor task (#11116)
### What problem does this PR solve?

Show task executor.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-10 12:51:39 +08:00
dd1c8c5779 Feat: add auto parse to connector. (#11099)
### What problem does this PR solve?

#10953

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-07 16:49:29 +08:00
34283d4db4 Feat: add data source to pipleline logs . (#11075)
### What problem does this PR solve?

#10953

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-07 11:43:59 +08:00
af98763e27 Admin: add 'show version' (#11079)
### What problem does this PR solve?

```
admin> show version;
show_version
+-----------------------+
| version               |
+-----------------------+
| v0.21.0-241-gc6cf58d5 |
+-----------------------+
admin> \q
Goodbye!

```

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-06 19:24:46 +08:00
0cd8024c34 Feat: RAPTOR handle cancel gracefully (#11074)
### What problem does this PR solve?

RAPTOR handle cancel gracefully.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-06 17:18:03 +08:00
3bd1fefe1f Feat: debug sync data. (#11073)
### What problem does this PR solve?

#10953 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-06 16:48:04 +08:00
23b81eae77 Feat: GraphRAG handle cancel gracefully (#11061)
### What problem does this PR solve?

 GraghRAG handle cancel gracefully. #10997.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-06 16:12:20 +08:00
adbb8319e0 Fix: add fields for logs. (#11039)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-06 09:49:57 +08:00
f98b24c9bf Move api.settings to common.settings (#11036)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-06 09:36:38 +08:00
dddf766470 Feat: start data sync service. (#11026)
### What problem does this PR solve?

#10953 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-05 15:43:15 +08:00
b86e07088b Fix: escape multi-steps issues. (#11016)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-05 14:51:00 +08:00
1a9215bc6f Move some vars to globals (#11017)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 14:14:38 +08:00
96c015fb85 Fix and refactor imports (#11010)
### What problem does this PR solve?

1. Move EMBEDDING_CFG to common.globals
2. Fix error imports
3. Move signal handles to common/signal_utils.py

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 11:07:54 +08:00
bab3fce136 Move some constants to common (#11004)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 08:01:39 +08:00
4bbbf92331 Refa: link connector to KB. (#10991)
### What problem does this PR solve?

#10953

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-04 20:13:52 +08:00
880a6a0428 Move some enumerate type to constants.py (#10998)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-04 19:25:25 +08:00
465a140727 Feat: refine Confluence connector (#10994)
### What problem does this PR solve?

Refine Confluence connector.
#10953

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2025-11-04 17:29:11 +08:00
16d2be623c Minor tweaks (#10987)
### What problem does this PR solve?

1. Rename identifier name
2. Fix some return statement
3. Fix some typos

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-04 14:15:31 +08:00
1e45137284 Move 'timeout' to common folder (#10983)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-04 11:51:12 +08:00
378bdfccfc Refactor log utils (#10973)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 20:25:02 +08:00
3e5a39482e Feat: Support multiple data sources synchronizations (#10954)
### What problem does this PR solve?
#10953

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-03 19:59:18 +08:00
076d811086 Introduce common/config_utils.py (#10968)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 17:25:06 +08:00
d008a4df9f Move base64_image related functions to common directory (#10957)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 15:20:46 +08:00
360f5c1179 Move token related functions to common (#10942)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 08:50:05 +08:00
44f2d6f5da Move 'get_project_base_directory' to common directory (#10940)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-02 21:05:28 +08:00
73144e278b Don't release full image (#10654)
### What problem does this PR solve?

Introduced gpu profile in .env
Added Dockerfile_tei
fix datrie
Removed LIGHTEN flag

### Type of change

- [x] Documentation Update
- [x] Refactoring
2025-10-23 23:02:27 +08:00
2d491188b8 Refa: improve flow of GraphRAG and RAPTOR (#10709)
### What problem does this PR solve?

Improve flow of GraphRAG and RAPTOR.

### Type of change

- [x] Refactoring
2025-10-22 09:29:20 +08:00
deb81810e9 Update message printout when start ingestion server (#10677)
### What problem does this PR solve?

```
     ____                                  __     _                                                                  
    /  _/   ____    ____ _  ___    _____  / /_   (_)  ____    ____           _____  ___    _____ _   __  ___    _____
    / /    / __ \  / __ `/ / _ \  / ___/ / __/  / /  / __ \  / __ \         / ___/ / _ \  / ___/| | / / / _ \  / ___/
 _/ /    / / / / / /_/ / /  __/ (__  ) / /_   / /  / /_/ / / / / /        (__  ) /  __/ / /    | |/ / /  __/ / /    
/___/   /_/ /_/  \__, /  \___/ /____/  \__/  /_/   \____/ /_/ /_/        /____/  \___/ /_/     |___/  \___/ /_/     
                /____/          
```

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-10-21 09:38:20 +08:00