Commit Graph

1119 Commits

Author SHA1 Message Date
bb6022477e Bump infinity to v0.6.11. Requires python>=3.11 (#11814)
### What problem does this PR solve?

Bump infinity to v0.6.11. Requires python>=3.11

### Type of change

- [x] Refactoring
2025-12-09 16:23:37 +08:00
c51e6b2a58 Refa: migrate CV model chat to Async (#11828)
### What problem does this PR solve?

Migrate CV model chat to Async. #11750

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2025-12-09 13:08:37 +08:00
481192300d Fix:[ERROR][Exception]: list index out of range (#11826)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/11821

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-09 09:58:34 +08:00
dd046be976 Fix: parent-child chunking method (#11810)
### What problem does this PR solve?

change:
parent-child chunking method

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-09 09:34:01 +08:00
09a3854ed8 Fix: chunk method error. (#11807)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-08 14:28:23 +08:00
43f51baa96 Fix errors (#11804)
### What problem does this PR solve?

1. typos
2. grammar errors.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-08 12:21:18 +08:00
b66881a371 Refactor:book parser use with to handle bytesIO (#11800)
### What problem does this PR solve?

book parser use with to handle bytesIO

### Type of change

- [x] Refactoring
2025-12-08 10:18:46 +08:00
51ec708c58 Refa: cleanup synchronous functions in chat_model and implement synchronization for conversation and dialog chats (#11779)
### What problem does this PR solve?

Cleanup synchronous functions in chat_model and implement
synchronization for conversation and dialog chats.

### Type of change

- [x] Refactoring
- [x] Performance Improvement
2025-12-08 09:43:03 +08:00
9b8971a9de Fix:toc in pipeline (#11785)
### What problem does this PR solve?
change:
Fix toc in pipeline
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-08 09:42:20 +08:00
7719fd6350 Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702)
### What problem does this PR solve?

This PR addresses **two independent issues** encountered when using the
MinerU engine in Ragflow:

1. **MinerU API output path mismatch for non-ASCII filenames**
MinerU sanitizes the root directory name inside the returned ZIP when
the original filename contains non-ASCII characters (e.g., Chinese).
Ragflow's client-side unzip logic assumed the original filename stem and
therefore failed to locate `_content_list.json`.
   This PR adds:

   * root-directory detection
   * fallback lookup using sanitized names
   * a broadened `_read_output` search with a glob fallback
ensuring output files are consistently located regardless of filename
encoding.

2. **Chunker crash due to tuple-structure mismatch in manual mode**
Some parsers (e.g., MinerU / Docling) return **2-tuple sections**, but
Ragflow’s chunker expects **3-tuple sections**, leading to:
   `ValueError: not enough values to unpack (expected 3, got 2)`
This PR normalizes all sections to a uniform structure `(text, layout,
positions)`:

   * parse position tags when present
   * default to empty positions when missing
     preserving backward compatibility and preventing crashes.

### Type of change

* [x] Bug Fix (non-breaking change which fixes an issue)


[#11136](https://github.com/infiniflow/ragflow/issues/11136)
[#11700](https://github.com/infiniflow/ragflow/issues/11700)
[#11620](https://github.com/infiniflow/ragflow/issues/11620)
[#11701](https://github.com/infiniflow/ragflow/pull/11701)

we need your help [yongtenglei](https://github.com/yongtenglei)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-05 19:25:45 +08:00
4012d65b3c Feat: update front end for confluence connector (#11747)
### What problem does this PR solve?

Feat: update front end for confluence connector

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-04 18:49:13 +08:00
e2bc1a3478 Feat: add more attribute for confluence connector. (#11743)
### What problem does this PR solve?

Feat: add more attribute for confluence connector. 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-04 17:28:03 +08:00
ca4a0ee1b2 Remove huqie.txt from RAGFflow and bump infinity to 0.6.10 (#11661)
### What problem does this PR solve?

huqie.txt and huqie.txt.trie are put to infinity-sdk in
https://github.com/infiniflow/infinity/pull/3127.

Remove huqie.txt from ragflow and bump infinity to 0.6.10 in this PR.

### Type of change

- [x] Refactoring
2025-12-04 14:53:57 +08:00
27b0550876 Refa: cleanup synchronous functions in agent_with_tools (#11736)
### What problem does this PR solve?

Cleanup synchronous functions in agent_with_tools.

### Type of change

- [x] Refactoring
2025-12-04 14:15:05 +08:00
257af75ece Fix: relative page_number in boxes (#11712)
page_number in boxes is relative page number,must + from_page

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-04 11:23:34 +08:00
cbdacf21f6 feat(gcs): Add support for Google Cloud Storage (GCS) integration (#11718)
### What problem does this PR solve?

This Pull Request introduces native support for Google Cloud Storage
(GCS) as an optional object storage backend.

Currently, RAGFlow relies on a limited set of storage options. This
feature addresses the need for seamless integration with GCP
environments, allowing users to leverage a fully managed, highly
durable, and scalable storage service (GCS) instead of needing to deploy
and maintain third-party object storage solutions. This simplifies
deployment, especially for users running on GCP infrastructure like GKE
or Cloud Run.

The implementation uses a single GCS bucket defined via configuration,
mapping RAGFlow's internal logical storage units (or "buckets") to
folder prefixes within that GCS container to maintain data separation.
This architectural choice avoids the operational complexities associated
with dynamically creating and managing unique GCS buckets for every
logical unit.

### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2025-12-04 10:44:05 +08:00
3c224c817b Fix: Correct pagination and early termination bugs in chunk_list() (#11692)
## Summary

This PR fixes two critical bugs in `chunk_list()` method that prevent
processing large documents (>128 chunks) in GraphRAG and
  other workflows.

  ## Bugs Fixed

  ### Bug 1: Incorrect pagination offset calculation
  **Location:** `rag/nlp/search.py` lines 530-531

**Problem:** The loop variable `p` was used directly as offset, causing
incorrect pagination:
  ```python
  # BEFORE (BUGGY):
  for p in range(offset, max_count, bs):  # p = 0, 128, 256, 384...
es_res = self.dataStore.search(..., p, bs, ...) # p used as offset

  Fix: Use page number multiplied by batch size:
  # AFTER (FIXED):
  for page_num, p in enumerate(range(offset, max_count, bs)):
      es_res = self.dataStore.search(..., page_num * bs, bs, ...)

  Bug 2: Premature loop termination

  Location: rag/nlp/search.py lines 538-539

Problem: Loop terminates when any page returns fewer than 128 chunks,
even when thousands more remain:
  # BEFORE (BUGGY):
if len(dict_chunks.values()) < bs: # Breaks at 126 chunks even if 3,000+
remain
      break

  Fix: Only terminate when zero chunks returned:
  # AFTER (FIXED):
  if len(dict_chunks.values()) == 0:
      break

  Enhancement: Add max_count parameter to GraphRAG

  Location: graphrag/general/index.py line 60

Added max_count=10000 parameter to chunk loading for both LightRAG and
General GraphRAG paths to ensure all chunks are
  processed.

  Testing

  Validated with a 314-page legal document containing 3,207 chunks:

  Before fixes:
  - Only 2-126 chunks processed
  - GraphRAG generated 25 nodes, 8 edges

  After fixes:
  - All 3,209 chunks processed 
  - GraphRAG processing complete dataset

  Impact

These bugs affect any workflow using chunk_list() with large documents,
particularly:
  - GraphRAG knowledge graph generation
  - RAPTOR hierarchical summarization
  - Document processing pipelines with >128 chunks

  Related Issue

  Fixes #11687

  Checklist

  - Code follows project style guidelines
  - Tested with large documents (3,207+ chunks)
  - Both bugs validated by Dosu bot in issue #11687
  - No breaking changes to API

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-03 19:44:20 +08:00
4870d42949 feat: Auto-disable Raptor for structured data (Issue #11653) (#11676)
### What problem does this PR solve?

Feature: This PR implements automatic Raptor disabling for structured
data files to address issue #11653.

**Problem**: Raptor was being applied to all file types, including
highly structured data like Excel files and tabular PDFs. This caused
unnecessary token inflation, higher computational costs, and larger
memory usage for data that already has organized semantic units.

**Solution**: Automatically skip Raptor processing for:
- Excel files (.xls, .xlsx, .xlsm, .xlsb)
- CSV files (.csv, .tsv)
- PDFs with tabular data (table parser or html4excel enabled)

**Benefits**:
- 82% faster processing for structured files
- 47% token reduction
- 52% memory savings
- Preserved data structure for downstream applications

**Usage Examples**:
```
# Excel file - automatically skipped
should_skip_raptor(".xlsx")  # True

# CSV file - automatically skipped  
should_skip_raptor(".csv")  # True

# Tabular PDF - automatically skipped
should_skip_raptor(".pdf", parser_id="table")  # True

# Regular PDF - Raptor runs normally
should_skip_raptor(".pdf", parser_id="naive")  # False

# Override for special cases
should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False})  # False
```

**Configuration**: Includes `auto_disable_for_structured_data` toggle
(default: true) to allow override for special use cases.

**Testing**: 44 comprehensive tests, 100% passing

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 17:02:29 +08:00
3c50c7d3ac Refactor code (#11694)
### What problem does this PR solve?

Rename function and refactor log message

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-03 15:15:00 +08:00
e3f40db963 Refa: make RAGFlow more asynchronous 2 (#11689)
### What problem does this PR solve?

Make RAGFlow more asynchronous 2. #11551, #11579, #11619.

### Type of change

- [x] Refactoring
- [x] Performance Improvement
2025-12-03 14:19:53 +08:00
b5ad7b7062 Feat: support TOC transformer. (#11685)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-03 12:27:50 +08:00
5c81e01de5 Fix: incorrect async chat streamly output (#11679)
### What problem does this PR solve?

Incorrect async chat streamly output. #11677.

Disable beartype for #11666.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-03 11:15:45 +08:00
a6681d6366 Revert "Refa: make RAGFlow more asynchronous 2" (#11669)
Reverts infiniflow/ragflow#11664
2025-12-02 19:42:05 +08:00
627c11c429 Refa: make RAGFlow more asynchronous 2 (#11664)
### What problem does this PR solve?

Make RAGFlow more asynchronous 2. #11551, #11579, #11619.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
- [x] Performance Improvement
2025-12-02 18:57:07 +08:00
4ba17361e9 feat: improve presentation PdfParser (#11639)
The old presentation PdfParser lost table format after parse

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:35:14 +08:00
c946858328 Feat: add mineru auto installer (#11649)
### What problem does this PR solve?

Feat: add mineru auto installer

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:29:26 +08:00
2ffe6f7439 Import rag_tokenizer from Infinity (#11647)
### What problem does this PR solve?

- Original rag/nlp/rag_tokenizer.py is put to Infinity and infinity-sdk
via https://github.com/infiniflow/infinity/pull/3117 .
Import rag_tokenizer from infinity and inherit from
rag_tokenizer.RagTokenizer in new rag/nlp/rag_tokenizer.py.

- Bump infinity to 0.6.8

### Type of change
- [x] Refactoring
2025-12-02 14:59:37 +08:00
a713f54732 Refa: add MiniMax-M2 and remove deprecated MiniMax models (#11642)
### What problem does this PR solve?

Add MiniMax-M2 and remove deprecated models.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2025-12-02 14:43:44 +08:00
b8c0fb4572 Feat:new api /sequence2txt and update QWenSeq2txt (#11643)
### What problem does this PR solve?
change:
new api /sequence2txt,
update QWenSeq2txt and ZhipuSeq2txt

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 11:17:31 +08:00
d1e172171f Refactor: better describe how to get prefix for sync data source (#11636)
### What problem does this PR solve?

better describe how to get prefix for sync data source

### Type of change

- [x] Refactoring
2025-12-01 17:46:44 +08:00
81ae6cf78d Feat: support uploading in dialog. (#11634)
### What problem does this PR solve?

#9590

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 16:54:57 +08:00
41cff3e09e Fix: jina embedding issue (#11628)
### What problem does this PR solve?

Fix: jina embedding issue #11614 
Feat: Add jina embedding v4

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 14:24:35 +08:00
b6c4722687 Refa: make RAGFlow more asynchronous (#11601)
### What problem does this PR solve?

Try to make this more asynchronous. Verified in chat and agent
scenarios, reducing blocking behavior. #11551, #11579.

However, the impact of these changes still requires further
investigation to ensure everything works as expected.

### Type of change

- [x] Refactoring
2025-12-01 14:24:06 +08:00
6ea4248bdc Feat: support parent-child in search procedure. (#11629)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-01 14:03:09 +08:00
88a28212b3 Fix: Table parse method issue. (#11627)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 12:42:35 +08:00
9d0309aedc Fix: [MinerU] Missing output file (#11623)
### What problem does this PR solve?

Add fallbacks for MinerU output path. #11613, #11620.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-01 12:17:43 +08:00
7499608a8b feat: add Redis username support (#11608)
### What problem does this PR solve?

Support for Redis 6+ ACL authentication (username)

close #11606 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
2025-12-01 11:26:20 +08:00
80f6d22d2a Fix typos (#11607)
### What problem does this PR solve?

Fix typos

### Type of change

- [x] Fix typos
2025-12-01 09:49:46 +08:00
14616cf845 Feat: add child parent chunking method in backend. (#11598)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 19:25:32 +08:00
dbdda0fbab Feat: optimize meta filter generation for better structure handling (#11586)
### What problem does this PR solve?

optimize meta filter generation for better structure handling

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 13:30:53 +08:00
cf7fdd274b Feat: add gmail connector (#11549)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 13:09:40 +08:00
982ed233a2 Fix: doc_aggs not correctly returned when no chunks retrieved. (#11578)
### What problem does this PR solve?

Fix: doc_aggs not correctly returned when no chunks retrieved.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-28 13:09:05 +08:00
8604c4f57c Feat: add GPT-5.1, GPT‑5.1 Instant and Claude-Opus-4.5 (#11559)
### What problem does this PR solve?

Add GPT-5.1, GPT‑5.1 Instant and Claude-Opus-4.5. #11548

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 17:59:17 +08:00
856201c0f2 Fix ft_title_rag_fine (#11555)
### What problem does this PR solve?

Fix ft_title_rag_fine

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 10:26:08 +08:00
9d8b96c1d0 Feat: add context for figure and table (#11547)
### What problem does this PR solve?

Add context for figure table.



![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e)


`==================()` for demonstrating purpose. 
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 10:21:44 +08:00
12979a3f21 feat: improve metadata handling in connector service (#11421)
### What problem does this PR solve?

- Update sync data source to handle metadata properly

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-11-26 19:55:48 +08:00
2fd5ac1031 Feat: Add Webdav storage as data source (#11422)
### What problem does this PR solve?

This PR adds webdav storage as data source for data sync service.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-26 14:14:42 +08:00
40e84ca41a Use Infinity single-field-multi-index (#11444)
### What problem does this PR solve?

Use Infinity single-field-multi-index

### Type of change

- [x] Refactoring
- [x] Performance Improvement
2025-11-26 11:06:37 +08:00
74e0b58d89 Fix: excel default optimization. (#11519)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:20 +08:00
7c20c964b4 Fix: incorrect image merging for naive markdown parser (#11520)
### What problem does this PR solve?

Fix incorrect image merging for naive markdown parser. #9349 


[ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:06 +08:00