Commit Graph

1105 Commits

Author SHA1 Message Date
257af75ece Fix: relative page_number in boxes (#11712)
page_number in boxes is relative page number,must + from_page

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-04 11:23:34 +08:00
cbdacf21f6 feat(gcs): Add support for Google Cloud Storage (GCS) integration (#11718)
### What problem does this PR solve?

This Pull Request introduces native support for Google Cloud Storage
(GCS) as an optional object storage backend.

Currently, RAGFlow relies on a limited set of storage options. This
feature addresses the need for seamless integration with GCP
environments, allowing users to leverage a fully managed, highly
durable, and scalable storage service (GCS) instead of needing to deploy
and maintain third-party object storage solutions. This simplifies
deployment, especially for users running on GCP infrastructure like GKE
or Cloud Run.

The implementation uses a single GCS bucket defined via configuration,
mapping RAGFlow's internal logical storage units (or "buckets") to
folder prefixes within that GCS container to maintain data separation.
This architectural choice avoids the operational complexities associated
with dynamically creating and managing unique GCS buckets for every
logical unit.

### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2025-12-04 10:44:05 +08:00
3c224c817b Fix: Correct pagination and early termination bugs in chunk_list() (#11692)
## Summary

This PR fixes two critical bugs in `chunk_list()` method that prevent
processing large documents (>128 chunks) in GraphRAG and
  other workflows.

  ## Bugs Fixed

  ### Bug 1: Incorrect pagination offset calculation
  **Location:** `rag/nlp/search.py` lines 530-531

**Problem:** The loop variable `p` was used directly as offset, causing
incorrect pagination:
  ```python
  # BEFORE (BUGGY):
  for p in range(offset, max_count, bs):  # p = 0, 128, 256, 384...
es_res = self.dataStore.search(..., p, bs, ...) # p used as offset

  Fix: Use page number multiplied by batch size:
  # AFTER (FIXED):
  for page_num, p in enumerate(range(offset, max_count, bs)):
      es_res = self.dataStore.search(..., page_num * bs, bs, ...)

  Bug 2: Premature loop termination

  Location: rag/nlp/search.py lines 538-539

Problem: Loop terminates when any page returns fewer than 128 chunks,
even when thousands more remain:
  # BEFORE (BUGGY):
if len(dict_chunks.values()) < bs: # Breaks at 126 chunks even if 3,000+
remain
      break

  Fix: Only terminate when zero chunks returned:
  # AFTER (FIXED):
  if len(dict_chunks.values()) == 0:
      break

  Enhancement: Add max_count parameter to GraphRAG

  Location: graphrag/general/index.py line 60

Added max_count=10000 parameter to chunk loading for both LightRAG and
General GraphRAG paths to ensure all chunks are
  processed.

  Testing

  Validated with a 314-page legal document containing 3,207 chunks:

  Before fixes:
  - Only 2-126 chunks processed
  - GraphRAG generated 25 nodes, 8 edges

  After fixes:
  - All 3,209 chunks processed 
  - GraphRAG processing complete dataset

  Impact

These bugs affect any workflow using chunk_list() with large documents,
particularly:
  - GraphRAG knowledge graph generation
  - RAPTOR hierarchical summarization
  - Document processing pipelines with >128 chunks

  Related Issue

  Fixes #11687

  Checklist

  - Code follows project style guidelines
  - Tested with large documents (3,207+ chunks)
  - Both bugs validated by Dosu bot in issue #11687
  - No breaking changes to API

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-03 19:44:20 +08:00
4870d42949 feat: Auto-disable Raptor for structured data (Issue #11653) (#11676)
### What problem does this PR solve?

Feature: This PR implements automatic Raptor disabling for structured
data files to address issue #11653.

**Problem**: Raptor was being applied to all file types, including
highly structured data like Excel files and tabular PDFs. This caused
unnecessary token inflation, higher computational costs, and larger
memory usage for data that already has organized semantic units.

**Solution**: Automatically skip Raptor processing for:
- Excel files (.xls, .xlsx, .xlsm, .xlsb)
- CSV files (.csv, .tsv)
- PDFs with tabular data (table parser or html4excel enabled)

**Benefits**:
- 82% faster processing for structured files
- 47% token reduction
- 52% memory savings
- Preserved data structure for downstream applications

**Usage Examples**:
```
# Excel file - automatically skipped
should_skip_raptor(".xlsx")  # True

# CSV file - automatically skipped  
should_skip_raptor(".csv")  # True

# Tabular PDF - automatically skipped
should_skip_raptor(".pdf", parser_id="table")  # True

# Regular PDF - Raptor runs normally
should_skip_raptor(".pdf", parser_id="naive")  # False

# Override for special cases
should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False})  # False
```

**Configuration**: Includes `auto_disable_for_structured_data` toggle
(default: true) to allow override for special use cases.

**Testing**: 44 comprehensive tests, 100% passing

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 17:02:29 +08:00
3c50c7d3ac Refactor code (#11694)
### What problem does this PR solve?

Rename function and refactor log message

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-03 15:15:00 +08:00
e3f40db963 Refa: make RAGFlow more asynchronous 2 (#11689)
### What problem does this PR solve?

Make RAGFlow more asynchronous 2. #11551, #11579, #11619.

### Type of change

- [x] Refactoring
- [x] Performance Improvement
2025-12-03 14:19:53 +08:00
b5ad7b7062 Feat: support TOC transformer. (#11685)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 12:27:50 +08:00
5c81e01de5 Fix: incorrect async chat streamly output (#11679)
### What problem does this PR solve?

Incorrect async chat streamly output. #11677.

Disable beartype for #11666.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-03 11:15:45 +08:00
a6681d6366 Revert "Refa: make RAGFlow more asynchronous 2" (#11669)
Reverts infiniflow/ragflow#11664
2025-12-02 19:42:05 +08:00
627c11c429 Refa: make RAGFlow more asynchronous 2 (#11664)
### What problem does this PR solve?

Make RAGFlow more asynchronous 2. #11551, #11579, #11619.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
- [x] Performance Improvement
2025-12-02 18:57:07 +08:00
4ba17361e9 feat: improve presentation PdfParser (#11639)
The old presentation PdfParser lost table format after parse

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:35:14 +08:00
c946858328 Feat: add mineru auto installer (#11649)
### What problem does this PR solve?

Feat: add mineru auto installer

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:29:26 +08:00
2ffe6f7439 Import rag_tokenizer from Infinity (#11647)
### What problem does this PR solve?

- Original rag/nlp/rag_tokenizer.py is put to Infinity and infinity-sdk
via https://github.com/infiniflow/infinity/pull/3117 .
Import rag_tokenizer from infinity and inherit from
rag_tokenizer.RagTokenizer in new rag/nlp/rag_tokenizer.py.

- Bump infinity to 0.6.8

### Type of change
- [x] Refactoring
2025-12-02 14:59:37 +08:00
a713f54732 Refa: add MiniMax-M2 and remove deprecated MiniMax models (#11642)
### What problem does this PR solve?

Add MiniMax-M2 and remove deprecated models.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2025-12-02 14:43:44 +08:00
b8c0fb4572 Feat:new api /sequence2txt and update QWenSeq2txt (#11643)
### What problem does this PR solve?
change:
new api /sequence2txt,
update QWenSeq2txt and ZhipuSeq2txt

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 11:17:31 +08:00
d1e172171f Refactor: better describe how to get prefix for sync data source (#11636)
### What problem does this PR solve?

better describe how to get prefix for sync data source

### Type of change

- [x] Refactoring
2025-12-01 17:46:44 +08:00
81ae6cf78d Feat: support uploading in dialog. (#11634)
### What problem does this PR solve?

#9590

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 16:54:57 +08:00
41cff3e09e Fix: jina embedding issue (#11628)
### What problem does this PR solve?

Fix: jina embedding issue #11614 
Feat: Add jina embedding v4

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 14:24:35 +08:00
b6c4722687 Refa: make RAGFlow more asynchronous (#11601)
### What problem does this PR solve?

Try to make this more asynchronous. Verified in chat and agent
scenarios, reducing blocking behavior. #11551, #11579.

However, the impact of these changes still requires further
investigation to ensure everything works as expected.

### Type of change

- [x] Refactoring
2025-12-01 14:24:06 +08:00
6ea4248bdc Feat: support parent-child in search procedure. (#11629)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 14:03:09 +08:00
88a28212b3 Fix: Table parse method issue. (#11627)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 12:42:35 +08:00
9d0309aedc Fix: [MinerU] Missing output file (#11623)
### What problem does this PR solve?

Add fallbacks for MinerU output path. #11613, #11620.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 12:17:43 +08:00
7499608a8b feat: add Redis username support (#11608)
### What problem does this PR solve?

Support for Redis 6+ ACL authentication (username)

close #11606 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
2025-12-01 11:26:20 +08:00
80f6d22d2a Fix typos (#11607)
### What problem does this PR solve?

Fix typos

### Type of change

- [x] Fix typos
2025-12-01 09:49:46 +08:00
14616cf845 Feat: add child parent chunking method in backend. (#11598)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 19:25:32 +08:00
dbdda0fbab Feat: optimize meta filter generation for better structure handling (#11586)
### What problem does this PR solve?

optimize meta filter generation for better structure handling

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 13:30:53 +08:00
cf7fdd274b Feat: add gmail connector (#11549)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 13:09:40 +08:00
982ed233a2 Fix: doc_aggs not correctly returned when no chunks retrieved. (#11578)
### What problem does this PR solve?

Fix: doc_aggs not correctly returned when no chunks retrieved.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-28 13:09:05 +08:00
8604c4f57c Feat: add GPT-5.1, GPT‑5.1 Instant and Claude-Opus-4.5 (#11559)
### What problem does this PR solve?

Add GPT-5.1, GPT‑5.1 Instant and Claude-Opus-4.5. #11548

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 17:59:17 +08:00
856201c0f2 Fix ft_title_rag_fine (#11555)
### What problem does this PR solve?

Fix ft_title_rag_fine

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 10:26:08 +08:00
9d8b96c1d0 Feat: add context for figure and table (#11547)
### What problem does this PR solve?

Add context for figure table.



![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e)


`==================()` for demonstrating purpose. 
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 10:21:44 +08:00
12979a3f21 feat: improve metadata handling in connector service (#11421)
### What problem does this PR solve?

- Update sync data source to handle metadata properly

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-26 19:55:48 +08:00
2fd5ac1031 Feat: Add Webdav storage as data source (#11422)
### What problem does this PR solve?

This PR adds webdav storage as data source for data sync service.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-26 14:14:42 +08:00
40e84ca41a Use Infinity single-field-multi-index (#11444)
### What problem does this PR solve?

Use Infinity single-field-multi-index

### Type of change

- [x] Refactoring
- [x] Performance Improvement
2025-11-26 11:06:37 +08:00
74e0b58d89 Fix: excel default optimization. (#11519)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:20 +08:00
7c20c964b4 Fix: incorrect image merging for naive markdown parser (#11520)
### What problem does this PR solve?

Fix incorrect image merging for naive markdown parser. #9349 


[ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:06 +08:00
915e385244 Fix: uv lock updates (#11511)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 16:01:12 +08:00
f5faf0c94f Feat: support operator in/not in for metadata filter. (#11503)
### What problem does this PR solve?

#11376 #11378

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 12:44:26 +08:00
bcd70affb5 Fix: unexpected parameter. (#11497)
### What problem does this PR solve?

#11489

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 11:17:27 +08:00
41665b0865 Refactor: Email parser use with to handle buffer (#11496)
### What problem does this PR solve?
 Email parser use with to handle buffer

### Type of change

- [x] Refactoring
2025-11-25 10:03:37 +08:00
d1744aaaf3 Feat: add datasource Dropbox (#11488)
### What problem does this PR solve?

Add datasource Dropbox.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 09:40:03 +08:00
f0a14f5fce Add Moodle data source integration (#11325)
### What problem does this PR solve?

This PR adds a native Moodle connector to sync content (courses,
resources, forums, assignments, pages, books) into RAGFlow.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-21 19:58:49 +08:00
174a2578e8 Feat: add auth header for Ollama chat model (#11452)
### What problem does this PR solve?

Add auth header for Ollama chat model. #11350

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-21 19:47:06 +08:00
db0f6840d9 Feat: ignore chunk size when using custom delimiters (#11434)
### What problem does this PR solve?

Ignore chunk size when using custom delimiter.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-21 14:36:26 +08:00
971c1bcba7 Fix: missing parameters in by_plaintext method for PDF naive mode (#11408)
### What problem does this PR solve?

FIx: missing parameters in by_plaintext method for PDF naive mode

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: lih <dev_lih@139.com>
2025-11-21 09:33:36 +08:00
d3d2ccc76c Feat: add more chunking method (#11413)
### What problem does this PR solve?

Feat: add more chunking method #11311

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 19:07:17 +08:00
b846a0f547 Fix: incorrect retrieval total count with pagination enabled (#11400)
### What problem does this PR solve?

Incorrect retrieval total count with pagination enabled.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-20 15:35:09 +08:00
06cef71ba6 Feat: add or logic operations for meta data filters. (#11404)
### What problem does this PR solve?

#11376 #11387

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 14:31:12 +08:00
420c97199a Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041)
### What problem does this PR solve?

- Added TCADP Parser configuration fields to PDF, PPT, and spreadsheet
parsing forms
- Implemented support for setting table result type (Markdown/HTML) and
Markdown image response type (URL/Text)
- Updated TCADP Parser to handle return format settings from
configuration or parameters
- Enhanced frontend to dynamically show TCADP options based on selected
parsing method
- Modified backend to pass format parameters when calling TCADP API
- Optimized form default value logic for TCADP configuration items
- Updated multilingual resource files for new configuration options

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 10:08:42 +08:00
38234aca53 feat: add OceanBase doc engine (#11228)
### What problem does this PR solve?

Add OceanBase doc engine. Close #5350

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 10:00:14 +08:00