## Overview
This PR adds support for **Single Bucket Mode** in RAGFlow, allowing
users to configure MinIO/S3 to use a single bucket with a directory
structure instead of creating multiple buckets per Knowledge Base and
user folder.
## Problem Statement
The current implementation creates one bucket per Knowledge Base and one
bucket per user folder, which can be problematic when:
- Cloud providers charge per bucket
- IAM policies restrict bucket creation
- Organizations want centralized data management in a single bucket
## Solution
Added a `prefix_path` configuration option to the MinIO connector that
enables:
- Using a single bucket with directory-based organization
- Backward compatibility with existing multi-bucket deployments
- Support for MinIO, AWS S3, and other S3-compatible storage backends
## Changes
- **`rag/utils/minio_conn.py`**: Enhanced MinIO connector to support
single bucket mode with prefix paths
- **`conf/service_conf.yaml`**: Added new configuration options
(`bucket` and `prefix_path`)
- **`docker/service_conf.yaml.template`**: Updated template with single
bucket configuration examples
- **`docker/.env.single-bucket-example`**: Added example environment
variables for single bucket setup
- **`docs/single-bucket-mode.md`**: Comprehensive documentation covering
usage, migration, and troubleshooting
## Configuration Example
```yaml
minio:
user: "access-key"
password: "secret-key"
host: "minio.example.com:443"
bucket: "ragflow-bucket" # Single bucket name
prefix_path: "ragflow" # Optional prefix path
```
## Backward Compatibility
✅ Fully backward compatible - existing deployments continue to work
without any changes
- If `bucket` is not configured, uses default multi-bucket behavior
- If `bucket` is configured without `prefix_path`, uses bucket root
- If both are configured, uses `bucket/prefix_path/` structure
## Testing
- Tested with MinIO (local and cloud)
- Verified backward compatibility with existing multi-bucket mode
- Validated IAM policy restrictions work correctly
## Documentation
Included comprehensive documentation in `docs/single-bucket-mode.md`
covering:
- Configuration examples
- Migration guide from multi-bucket to single-bucket mode
- IAM policy examples
- Troubleshooting guide
---
**Related Issue**: Addresses use cases where bucket creation is
restricted or costly
### What problem does this PR solve?
change:
async issue and sensitive logging
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR fixes a startup crash in the data_sync_0 service caused by an
incorrect asyncio.run call. The main coroutine was being passed as a
function reference instead of being invoked, which raised:
`ValueError: a coroutine was expected, got <function main ...>
`
What I changed
- Updated the entrypoint in sync_data_source.py to correctly invoke the
coroutine with `asyncio.run(main())`.
Testing
- No tested.
Related Issue
Fixes https://github.com/infiniflow/ragflow/issues/11878
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Enhance OBConnection.search for better performance. Main changes:
1. Use string type of vector array in distance func for better parsing
performance.
2. Manually set max_connections as pool size instead of using default
value.
3. Set 'fulltext_search_columns' when starting.
4. Cache the results of the table existence check (we will never drop
the table).
5. Remove unused 'group_results' logic.
6. Add the `USE_FULLTEXT_FIRST_FUSION_SEARCH` flag, and the
corresponding fusion search SQL when it's false.
### Type of change
- [x] Performance Improvement
### What problem does this PR solve?
Fix: duplicate output by async_chat_streamly
Refact: revert manual modification
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Treat MinerU as an OCR model.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
## What changes were proposed in this pull request?
Added a return statement after the successful completion of the async
for loop in async_chat_streamly.
## Why are the changes needed?
Previously, the code lacked a break/return mechanism inside the try
block. This caused the retry loop (for attempt in range...) to continue
executing even after the LLM response was successfully generated and
yielded, resulting in duplicate requests (up to max_retries times).
## Does this PR introduce any user-facing change?
No (it fixes an internal logic bug).
### What problem does this PR solve?
Migrate CV model chat to Async. #11750
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
### What problem does this PR solve?
Cleanup synchronous functions in chat_model and implement
synchronization for conversation and dialog chats.
### Type of change
- [x] Refactoring
- [x] Performance Improvement
### What problem does this PR solve?
This PR addresses **two independent issues** encountered when using the
MinerU engine in Ragflow:
1. **MinerU API output path mismatch for non-ASCII filenames**
MinerU sanitizes the root directory name inside the returned ZIP when
the original filename contains non-ASCII characters (e.g., Chinese).
Ragflow's client-side unzip logic assumed the original filename stem and
therefore failed to locate `_content_list.json`.
This PR adds:
* root-directory detection
* fallback lookup using sanitized names
* a broadened `_read_output` search with a glob fallback
ensuring output files are consistently located regardless of filename
encoding.
2. **Chunker crash due to tuple-structure mismatch in manual mode**
Some parsers (e.g., MinerU / Docling) return **2-tuple sections**, but
Ragflow’s chunker expects **3-tuple sections**, leading to:
`ValueError: not enough values to unpack (expected 3, got 2)`
This PR normalizes all sections to a uniform structure `(text, layout,
positions)`:
* parse position tags when present
* default to empty positions when missing
preserving backward compatibility and preventing crashes.
### Type of change
* [x] Bug Fix (non-breaking change which fixes an issue)
[#11136](https://github.com/infiniflow/ragflow/issues/11136)
[#11700](https://github.com/infiniflow/ragflow/issues/11700)
[#11620](https://github.com/infiniflow/ragflow/issues/11620)
[#11701](https://github.com/infiniflow/ragflow/pull/11701)
we need your help [yongtenglei](https://github.com/yongtenglei)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Feat: update front end for confluence connector
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: add more attribute for confluence connector.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
huqie.txt and huqie.txt.trie are put to infinity-sdk in
https://github.com/infiniflow/infinity/pull/3127.
Remove huqie.txt from ragflow and bump infinity to 0.6.10 in this PR.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
This Pull Request introduces native support for Google Cloud Storage
(GCS) as an optional object storage backend.
Currently, RAGFlow relies on a limited set of storage options. This
feature addresses the need for seamless integration with GCP
environments, allowing users to leverage a fully managed, highly
durable, and scalable storage service (GCS) instead of needing to deploy
and maintain third-party object storage solutions. This simplifies
deployment, especially for users running on GCP infrastructure like GKE
or Cloud Run.
The implementation uses a single GCS bucket defined via configuration,
mapping RAGFlow's internal logical storage units (or "buckets") to
folder prefixes within that GCS container to maintain data separation.
This architectural choice avoids the operational complexities associated
with dynamically creating and managing unique GCS buckets for every
logical unit.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
## Summary
This PR fixes two critical bugs in `chunk_list()` method that prevent
processing large documents (>128 chunks) in GraphRAG and
other workflows.
## Bugs Fixed
### Bug 1: Incorrect pagination offset calculation
**Location:** `rag/nlp/search.py` lines 530-531
**Problem:** The loop variable `p` was used directly as offset, causing
incorrect pagination:
```python
# BEFORE (BUGGY):
for p in range(offset, max_count, bs): # p = 0, 128, 256, 384...
es_res = self.dataStore.search(..., p, bs, ...) # p used as offset
Fix: Use page number multiplied by batch size:
# AFTER (FIXED):
for page_num, p in enumerate(range(offset, max_count, bs)):
es_res = self.dataStore.search(..., page_num * bs, bs, ...)
Bug 2: Premature loop termination
Location: rag/nlp/search.py lines 538-539
Problem: Loop terminates when any page returns fewer than 128 chunks,
even when thousands more remain:
# BEFORE (BUGGY):
if len(dict_chunks.values()) < bs: # Breaks at 126 chunks even if 3,000+
remain
break
Fix: Only terminate when zero chunks returned:
# AFTER (FIXED):
if len(dict_chunks.values()) == 0:
break
Enhancement: Add max_count parameter to GraphRAG
Location: graphrag/general/index.py line 60
Added max_count=10000 parameter to chunk loading for both LightRAG and
General GraphRAG paths to ensure all chunks are
processed.
Testing
Validated with a 314-page legal document containing 3,207 chunks:
Before fixes:
- Only 2-126 chunks processed
- GraphRAG generated 25 nodes, 8 edges
After fixes:
- All 3,209 chunks processed ✅
- GraphRAG processing complete dataset
Impact
These bugs affect any workflow using chunk_list() with large documents,
particularly:
- GraphRAG knowledge graph generation
- RAPTOR hierarchical summarization
- Document processing pipelines with >128 chunks
Related Issue
Fixes#11687
Checklist
- Code follows project style guidelines
- Tested with large documents (3,207+ chunks)
- Both bugs validated by Dosu bot in issue #11687
- No breaking changes to API
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Feature: This PR implements automatic Raptor disabling for structured
data files to address issue #11653.
**Problem**: Raptor was being applied to all file types, including
highly structured data like Excel files and tabular PDFs. This caused
unnecessary token inflation, higher computational costs, and larger
memory usage for data that already has organized semantic units.
**Solution**: Automatically skip Raptor processing for:
- Excel files (.xls, .xlsx, .xlsm, .xlsb)
- CSV files (.csv, .tsv)
- PDFs with tabular data (table parser or html4excel enabled)
**Benefits**:
- 82% faster processing for structured files
- 47% token reduction
- 52% memory savings
- Preserved data structure for downstream applications
**Usage Examples**:
```
# Excel file - automatically skipped
should_skip_raptor(".xlsx") # True
# CSV file - automatically skipped
should_skip_raptor(".csv") # True
# Tabular PDF - automatically skipped
should_skip_raptor(".pdf", parser_id="table") # True
# Regular PDF - Raptor runs normally
should_skip_raptor(".pdf", parser_id="naive") # False
# Override for special cases
should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False}) # False
```
**Configuration**: Includes `auto_disable_for_structured_data` toggle
(default: true) to allow override for special use cases.
**Testing**: 44 comprehensive tests, 100% passing
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Rename function and refactor log message
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Make RAGFlow more asynchronous 2. #11551, #11579, #11619.
### Type of change
- [x] Refactoring
- [x] Performance Improvement
### What problem does this PR solve?
Incorrect async chat streamly output. #11677.
Disable beartype for #11666.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Make RAGFlow more asynchronous 2. #11551, #11579, #11619.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
- [x] Performance Improvement
### What problem does this PR solve?
Feat: add mineru auto installer
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- Original rag/nlp/rag_tokenizer.py is put to Infinity and infinity-sdk
via https://github.com/infiniflow/infinity/pull/3117 .
Import rag_tokenizer from infinity and inherit from
rag_tokenizer.RagTokenizer in new rag/nlp/rag_tokenizer.py.
- Bump infinity to 0.6.8
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Add MiniMax-M2 and remove deprecated models.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
### What problem does this PR solve?
change:
new api /sequence2txt,
update QWenSeq2txt and ZhipuSeq2txt
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fix: jina embedding issue #11614
Feat: Add jina embedding v4
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Try to make this more asynchronous. Verified in chat and agent
scenarios, reducing blocking behavior. #11551, #11579.
However, the impact of these changes still requires further
investigation to ensure everything works as expected.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Add fallbacks for MinerU output path. #11613, #11620.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Support for Redis 6+ ACL authentication (username)
close#11606
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update