ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2025-12-23 06:46:40 +08:00

Author	SHA1	Message	Date
Magicbook1108	7db9045b74	Feat: Add box connector (#11845 ) ### What problem does this PR solve? Feat: Add box connector ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-12 10:23:40 +08:00
Andrea Bugeja	74afb8d710	feat: Add Single Bucket Mode for MinIO/S3 (#11416 ) ## Overview This PR adds support for Single Bucket Mode in RAGFlow, allowing users to configure MinIO/S3 to use a single bucket with a directory structure instead of creating multiple buckets per Knowledge Base and user folder. ## Problem Statement The current implementation creates one bucket per Knowledge Base and one bucket per user folder, which can be problematic when: - Cloud providers charge per bucket - IAM policies restrict bucket creation - Organizations want centralized data management in a single bucket ## Solution Added a `prefix_path` configuration option to the MinIO connector that enables: - Using a single bucket with directory-based organization - Backward compatibility with existing multi-bucket deployments - Support for MinIO, AWS S3, and other S3-compatible storage backends ## Changes - `rag/utils/minio_conn.py`: Enhanced MinIO connector to support single bucket mode with prefix paths - `conf/service_conf.yaml`: Added new configuration options (`bucket` and `prefix_path`) - `docker/service_conf.yaml.template`: Updated template with single bucket configuration examples - `docker/.env.single-bucket-example`: Added example environment variables for single bucket setup - `docs/single-bucket-mode.md`: Comprehensive documentation covering usage, migration, and troubleshooting ## Configuration Example ```yaml minio: user: "access-key" password: "secret-key" host: "minio.example.com:443" bucket: "ragflow-bucket" # Single bucket name prefix_path: "ragflow" # Optional prefix path ``` ## Backward Compatibility ✅ Fully backward compatible - existing deployments continue to work without any changes - If `bucket` is not configured, uses default multi-bucket behavior - If `bucket` is configured without `prefix_path`, uses bucket root - If both are configured, uses `bucket/prefix_path/` structure ## Testing - Tested with MinIO (local and cloud) - Verified backward compatibility with existing multi-bucket mode - Validated IAM policy restrictions work correctly ## Documentation Included comprehensive documentation in `docs/single-bucket-mode.md` covering: - Configuration examples - Migration guide from multi-bucket to single-bucket mode - IAM policy examples - Troubleshooting guide --- Related Issue: Addresses use cases where bucket creation is restricted or costly	2025-12-11 19:22:47 +08:00
Kevin Hu	ea4a5cd665	Fix: tokenizer issue. (#11902 ) #11786 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 17:38:17 +08:00
Yongteng Lei	e9710b7aa9	Refa: treat MinerU as an OCR model 2 (#11905 ) ### What problem does this PR solve? Treat MinerU as an OCR model 2. #11903 ### Type of change - [x] Refactoring	2025-12-11 17:33:12 +08:00
buua436	e3cfe8e848	Fix:async issue and sensitive logging (#11895 ) ### What problem does this PR solve? change: async issue and sensitive logging ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 13:54:47 +08:00
David López Carrascal	a6afb7dfe2	Fix data_sync startup crash by properly invoking async main (#11879 ) ### What problem does this PR solve? This PR fixes a startup crash in the data_sync_0 service caused by an incorrect asyncio.run call. The main coroutine was being passed as a function reference instead of being invoked, which raised: `ValueError: a coroutine was expected, got <function main ...> ` What I changed - Updated the entrypoint in sync_data_source.py to correctly invoke the coroutine with `asyncio.run(main())`. Testing - Not tested. Related Issue Fixes https://github.com/infiniflow/ragflow/issues/11878 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 10:09:16 +08:00
He Wang	badf33e3b9	feat: enhance OBConnection.search (#11876 ) ### What problem does this PR solve? Enhance OBConnection.search for better performance. Main changes: 1. Use string type of vector array in distance func for better parsing performance. 2. Manually set max_connections as pool size instead of using default value. 3. Set 'fulltext_search_columns' when starting. 4. Cache the results of the table existence check (we will never drop the table). 5. Remove unused 'group_results' logic. 6. Add the `USE_FULLTEXT_FIRST_FUSION_SEARCH` flag, and the corresponding fusion search SQL when it's false. ### Type of change - [x] Performance Improvement	2025-12-10 19:13:37 +08:00
buua436	3cb72377d7	Refa:remove sensitive information (#11873 ) ### What problem does this PR solve? change: remove sensitive information ### Type of change - [x] Refactoring	2025-12-10 19:08:45 +08:00
buua436	ab4b62031f	Fix:csv parse in Table (#11870 ) ### What problem does this PR solve? change: csv parse in Table ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-10 16:44:06 +08:00
buua436	65a5a56d95	Refa:replace trio with asyncio (#11831 ) ### What problem does this PR solve? change: replace trio with asyncio ### Type of change - [x] Refactoring	2025-12-09 19:23:14 +08:00
Magicbook1108	ca2d6f3301	Fix: duplicate output by async_chat_streamly (#11842 ) ### What problem does this PR solve? Fix: duplicate output by async_chat_streamly Refact: revert manual modification ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-09 19:21:52 +08:00
Yongteng Lei	a94b3b9df2	Refa: treat MinerU as an OCR model (#11849 ) ### What problem does this PR solve? Treat MinerU as an OCR model. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-12-09 18:54:14 +08:00
N0bodycan	9863862348	fix: prevent redundant retries in async_chat_streamly upon success (#11832 ) ## What changes were proposed in this pull request? Added a return statement after the successful completion of the async for loop in async_chat_streamly. ## Why are the changes needed? Previously, the code lacked a break/return mechanism inside the try block. This caused the retry loop (for attempt in range...) to continue executing even after the LLM response was successfully generated and yielded, resulting in duplicate requests (up to max_retries times). ## Does this PR introduce any user-facing change? No (it fixes an internal logic bug).	2025-12-09 17:14:30 +08:00
Zhichang Yu	bb6022477e	Bump infinity to v0.6.11. Requires python>=3.11 (#11814 ) ### What problem does this PR solve? Bump infinity to v0.6.11. Requires python>=3.11 ### Type of change - [x] Refactoring	2025-12-09 16:23:37 +08:00
Yongteng Lei	c51e6b2a58	Refa: migrate CV model chat to Async (#11828 ) ### What problem does this PR solve? Migrate CV model chat to Async. #11750 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2025-12-09 13:08:37 +08:00
Stephen Hu	481192300d	Fix:[ERROR][Exception]: list index out of range (#11826 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/11821 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-09 09:58:34 +08:00
buua436	dd046be976	Fix: parent-child chunking method (#11810 ) ### What problem does this PR solve? change: parent-child chunking method ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-09 09:34:01 +08:00
Kevin Hu	09a3854ed8	Fix: chunk method error. (#11807 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-08 14:28:23 +08:00
Jin Hai	43f51baa96	Fix errors (#11804 ) ### What problem does this PR solve? 1. typos 2. grammar errors. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-08 12:21:18 +08:00
Stephen Hu	b66881a371	Refactor:book parser use with to handle bytesIO (#11800 ) ### What problem does this PR solve? The book parser now uses a `with` statement to handle BytesIO. ### Type of change - [x] Refactoring	2025-12-08 10:18:46 +08:00
Yongteng Lei	51ec708c58	Refa: cleanup synchronous functions in chat_model and implement synchronization for conversation and dialog chats (#11779 ) ### What problem does this PR solve? Cleanup synchronous functions in chat_model and implement synchronization for conversation and dialog chats. ### Type of change - [x] Refactoring - [x] Performance Improvement	2025-12-08 09:43:03 +08:00
buua436	9b8971a9de	Fix:toc in pipeline (#11785 ) ### What problem does this PR solve? change: Fix toc in pipeline ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-08 09:42:20 +08:00
少卿	7719fd6350	Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702 ) ### What problem does this PR solve? This PR addresses two independent issues encountered when using the MinerU engine in Ragflow: 1. MinerU API output path mismatch for non-ASCII filenames MinerU sanitizes the root directory name inside the returned ZIP when the original filename contains non-ASCII characters (e.g., Chinese). Ragflow's client-side unzip logic assumed the original filename stem and therefore failed to locate `_content_list.json`. This PR adds: * root-directory detection * fallback lookup using sanitized names * a broadened `_read_output` search with a glob fallback ensuring output files are consistently located regardless of filename encoding. 2. Chunker crash due to tuple-structure mismatch in manual mode Some parsers (e.g., MinerU / Docling) return 2-tuple sections, but Ragflow’s chunker expects 3-tuple sections, leading to: `ValueError: not enough values to unpack (expected 3, got 2)` This PR normalizes all sections to a uniform structure `(text, layout, positions)`: * parse position tags when present * default to empty positions when missing preserving backward compatibility and preventing crashes. ### Type of change * [x] Bug Fix (non-breaking change which fixes an issue) [#11136](https://github.com/infiniflow/ragflow/issues/11136) [#11700](https://github.com/infiniflow/ragflow/issues/11700) [#11620](https://github.com/infiniflow/ragflow/issues/11620) [#11701](https://github.com/infiniflow/ragflow/pull/11701) we need your help [yongtenglei](https://github.com/yongtenglei) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-05 19:25:45 +08:00
Magicbook1108	4012d65b3c	Feat: update front end for confluence connector (#11747 ) ### What problem does this PR solve? Feat: update front end for confluence connector ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-04 18:49:13 +08:00
Magicbook1108	e2bc1a3478	Feat: add more attribute for confluence connector. (#11743 ) ### What problem does this PR solve? Feat: add more attribute for confluence connector. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-04 17:28:03 +08:00
qinling0210	ca4a0ee1b2	Remove huqie.txt from RAGFlow and bump infinity to 0.6.10 (#11661 ) ### What problem does this PR solve? huqie.txt and huqie.txt.trie were moved to infinity-sdk in https://github.com/infiniflow/infinity/pull/3127. This PR removes huqie.txt from ragflow and bumps infinity to 0.6.10. ### Type of change - [x] Refactoring	2025-12-04 14:53:57 +08:00
Yongteng Lei	27b0550876	Refa: cleanup synchronous functions in agent_with_tools (#11736 ) ### What problem does this PR solve? Cleanup synchronous functions in agent_with_tools. ### Type of change - [x] Refactoring	2025-12-04 14:15:05 +08:00
rommy2017	257af75ece	Fix: relative page_number in boxes (#11712 ) page_number in boxes is a relative page number, so from_page must be added to it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-04 11:23:34 +08:00
Wiratama	cbdacf21f6	feat(gcs): Add support for Google Cloud Storage (GCS) integration (#11718 ) ### What problem does this PR solve? This Pull Request introduces native support for Google Cloud Storage (GCS) as an optional object storage backend. Currently, RAGFlow relies on a limited set of storage options. This feature addresses the need for seamless integration with GCP environments, allowing users to leverage a fully managed, highly durable, and scalable storage service (GCS) instead of needing to deploy and maintain third-party object storage solutions. This simplifies deployment, especially for users running on GCP infrastructure like GKE or Cloud Run. The implementation uses a single GCS bucket defined via configuration, mapping RAGFlow's internal logical storage units (or "buckets") to folder prefixes within that GCS container to maintain data separation. This architectural choice avoids the operational complexities associated with dynamically creating and managing unique GCS buckets for every logical unit. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-04 10:44:05 +08:00
David Eberto Domenech Castillo	3c224c817b	Fix: Correct pagination and early termination bugs in chunk_list() (#11692 ) ## Summary This PR fixes two critical bugs in `chunk_list()` method that prevent processing large documents (>128 chunks) in GraphRAG and other workflows. ## Bugs Fixed ### Bug 1: Incorrect pagination offset calculation Location: `rag/nlp/search.py` lines 530-531 Problem: The loop variable `p` was used directly as offset, causing incorrect pagination: ```python # BEFORE (BUGGY): for p in range(offset, max_count, bs): # p = 0, 128, 256, 384... es_res = self.dataStore.search(..., p, bs, ...) # p used as offset Fix: Use page number multiplied by batch size: # AFTER (FIXED): for page_num, p in enumerate(range(offset, max_count, bs)): es_res = self.dataStore.search(..., page_num * bs, bs, ...) Bug 2: Premature loop termination Location: rag/nlp/search.py lines 538-539 Problem: Loop terminates when any page returns fewer than 128 chunks, even when thousands more remain: # BEFORE (BUGGY): if len(dict_chunks.values()) < bs: # Breaks at 126 chunks even if 3,000+ remain break Fix: Only terminate when zero chunks returned: # AFTER (FIXED): if len(dict_chunks.values()) == 0: break Enhancement: Add max_count parameter to GraphRAG Location: graphrag/general/index.py line 60 Added max_count=10000 parameter to chunk loading for both LightRAG and General GraphRAG paths to ensure all chunks are processed. Testing Validated with a 314-page legal document containing 3,207 chunks: Before fixes: - Only 2-126 chunks processed - GraphRAG generated 25 nodes, 8 edges After fixes: - All 3,209 chunks processed ✅ - GraphRAG processing complete dataset Impact These bugs affect any workflow using chunk_list() with large documents, particularly: - GraphRAG knowledge graph generation - RAPTOR hierarchical summarization - Document processing pipelines with >128 chunks Related Issue Fixes #11687 Checklist - Code follows project style guidelines - Tested with large documents (3,207+ chunks) - Both bugs validated by Dosu bot in issue #11687 - No breaking changes to API --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-03 19:44:20 +08:00
hsparks-codes	4870d42949	feat: Auto-disable Raptor for structured data (Issue #11653 ) (#11676 ) ### What problem does this PR solve? Feature: This PR implements automatic Raptor disabling for structured data files to address issue #11653. Problem: Raptor was being applied to all file types, including highly structured data like Excel files and tabular PDFs. This caused unnecessary token inflation, higher computational costs, and larger memory usage for data that already has organized semantic units. Solution: Automatically skip Raptor processing for: - Excel files (.xls, .xlsx, .xlsm, .xlsb) - CSV files (.csv, .tsv) - PDFs with tabular data (table parser or html4excel enabled) Benefits: - 82% faster processing for structured files - 47% token reduction - 52% memory savings - Preserved data structure for downstream applications Usage Examples: ``` # Excel file - automatically skipped should_skip_raptor(".xlsx") # True # CSV file - automatically skipped should_skip_raptor(".csv") # True # Tabular PDF - automatically skipped should_skip_raptor(".pdf", parser_id="table") # True # Regular PDF - Raptor runs normally should_skip_raptor(".pdf", parser_id="naive") # False # Override for special cases should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False}) # False ``` Configuration: Includes `auto_disable_for_structured_data` toggle (default: true) to allow override for special use cases. Testing: 44 comprehensive tests, 100% passing ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 17:02:29 +08:00
Jin Hai	3c50c7d3ac	Refactor code (#11694 ) ### What problem does this PR solve? Rename function and refactor log message ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-03 15:15:00 +08:00
Yongteng Lei	e3f40db963	Refa: make RAGFlow more asynchronous 2 (#11689 ) ### What problem does this PR solve? Make RAGFlow more asynchronous 2. #11551, #11579, #11619. ### Type of change - [x] Refactoring - [x] Performance Improvement	2025-12-03 14:19:53 +08:00
Kevin Hu	b5ad7b7062	Feat: support TOC transformer. (#11685 ) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 12:27:50 +08:00
Yongteng Lei	5c81e01de5	Fix: incorrect async chat streamly output (#11679 ) ### What problem does this PR solve? Incorrect async chat streamly output. #11677. Disable beartype for #11666. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-03 11:15:45 +08:00
Kevin Hu	a6681d6366	Revert "Refa: make RAGFlow more asynchronous 2" (#11669 ) Reverts infiniflow/ragflow#11664	2025-12-02 19:42:05 +08:00
Yongteng Lei	627c11c429	Refa: make RAGFlow more asynchronous 2 (#11664 ) ### What problem does this PR solve? Make RAGFlow more asynchronous 2. #11551, #11579, #11619. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring - [x] Performance Improvement	2025-12-02 18:57:07 +08:00
rommy2017	4ba17361e9	feat: improve presentation PdfParser (#11639 ) The old presentation PdfParser lost table formatting after parsing. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 17:35:14 +08:00
Billy Bao	c946858328	Feat: add mineru auto installer (#11649 ) ### What problem does this PR solve? Feat: add mineru auto installer ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 17:29:26 +08:00
qinling0210	2ffe6f7439	Import rag_tokenizer from Infinity (#11647 ) ### What problem does this PR solve? - The original rag/nlp/rag_tokenizer.py was moved to Infinity and infinity-sdk via https://github.com/infiniflow/infinity/pull/3117 . Import rag_tokenizer from infinity and inherit from rag_tokenizer.RagTokenizer in the new rag/nlp/rag_tokenizer.py. - Bump infinity to 0.6.8 ### Type of change - [x] Refactoring	2025-12-02 14:59:37 +08:00
Yongteng Lei	a713f54732	Refa: add MiniMax-M2 and remove deprecated MiniMax models (#11642 ) ### What problem does this PR solve? Add MiniMax-M2 and remove deprecated models. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-12-02 14:43:44 +08:00
buua436	b8c0fb4572	Feat:new api /sequence2txt and update QWenSeq2txt (#11643 ) ### What problem does this PR solve? change: new api /sequence2txt, update QWenSeq2txt and ZhipuSeq2txt ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 11:17:31 +08:00
Stephen Hu	d1e172171f	Refactor: better describe how to get prefix for sync data source (#11636 ) ### What problem does this PR solve? better describe how to get prefix for sync data source ### Type of change - [x] Refactoring	2025-12-01 17:46:44 +08:00
Kevin Hu	81ae6cf78d	Feat: support uploading in dialog. (#11634 ) ### What problem does this PR solve? #9590 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-01 16:54:57 +08:00
Billy Bao	41cff3e09e	Fix: jina embedding issue (#11628 ) ### What problem does this PR solve? Fix: jina embedding issue #11614 Feat: Add jina embedding v4 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 14:24:35 +08:00
Yongteng Lei	b6c4722687	Refa: make RAGFlow more asynchronous (#11601 ) ### What problem does this PR solve? Try to make this more asynchronous. Verified in chat and agent scenarios, reducing blocking behavior. #11551, #11579. However, the impact of these changes still requires further investigation to ensure everything works as expected. ### Type of change - [x] Refactoring	2025-12-01 14:24:06 +08:00
Kevin Hu	6ea4248bdc	Feat: support parent-child in search procedure. (#11629 ) ### What problem does this PR solve? #7996 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-01 14:03:09 +08:00
Kevin Hu	88a28212b3	Fix: Table parse method issue. (#11627 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 12:42:35 +08:00
Yongteng Lei	9d0309aedc	Fix: [MinerU] Missing output file (#11623 ) ### What problem does this PR solve? Add fallbacks for MinerU output path. #11613, #11620. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 12:17:43 +08:00
Lei Zhang	7499608a8b	feat: add Redis username support (#11608 ) ### What problem does this PR solve? Support for Redis 6+ ACL authentication (username) close #11606 ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2025-12-01 11:26:20 +08:00
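
The `asyncio.run` fix described in commit a6afb7dfe2 above can be illustrated with a minimal sketch. The `main` coroutine and the two helper functions below are hypothetical stand-ins written for this illustration; they are not the code in sync_data_source.py. The sketch only shows the general difference between passing the function object and passing the coroutine it returns.

```python
import asyncio


async def main() -> str:
    # Hypothetical stand-in for a service entrypoint coroutine.
    await asyncio.sleep(0)
    return "done"


def run_wrong() -> str:
    # Passing the function object itself: asyncio.run rejects it because
    # it is not a coroutine object. The commit quotes a ValueError;
    # catching TypeError as well is a defensive assumption for other
    # Python versions.
    try:
        asyncio.run(main)  # type: ignore[arg-type]
    except (ValueError, TypeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return "no error"


def run_right() -> str:
    # Calling main() creates the coroutine object that asyncio.run expects.
    return asyncio.run(main())


if __name__ == "__main__":
    print(run_wrong())
    print(run_right())
```

The only difference between the two helpers is the pair of parentheses after `main`, which matches the one-line change the commit message describes.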

1132 Commits