ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-01-23 03:26:53 +08:00

Author	SHA1	Message	Date
Magicbook1108	b4e06237ef	Feat: detect docx support via header-byte inspection (#11731 ) ## What problem does this PR solve? Feat: detect docx support via header-byte inspection, a further optimization based on #11684. Not all files with a .doc extension are truly legacy .doc files; some are internally valid .docx documents. The previous implementation relied on URL suffix checks, which misclassified these cases and was therefore unreliable. A .doc file that can be previewed: [en2zh.doc](https://github.com/user-attachments/files/23921131/en2zh.doc) A .doc file that cannot be previewed: [file-sample_100kB.doc](https://github.com/user-attachments/files/23921134/file-sample_100kB.doc) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-04 13:41:18 +08:00
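The header-byte idea in b4e06237ef can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `sniff_word_format` is hypothetical, and it relies only on two well-known file signatures (the ZIP local file header used by .docx containers and the OLE2 compound file header used by legacy .doc files).

```python
# Well-known file signatures ("magic numbers").
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP container, used by .docx
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE2 compound file, used by legacy .doc


def sniff_word_format(header: bytes) -> str:
    """Classify a Word file by its leading bytes, ignoring the file extension.

    Hypothetical helper for illustration only.
    """
    if header.startswith(ZIP_MAGIC):
        # A stricter check would also confirm the archive contains word/document.xml.
        return "docx"
    if header.startswith(OLE2_MAGIC):
        return "doc"
    return "unknown"
```

Reading the first eight bytes of the file is enough for this check, so a file named `report.doc` that starts with `PK\x03\x04` would be treated as .docx regardless of its suffix.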
chanx	751a13fb64	Feature: Add a loading status to the agent canvas page. (#11733 ) ### What problem does this PR solve? Feature: Add a loading status to the agent canvas page. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-04 13:40:49 +08:00
shirukai	fa7b857aa9	fix: resolve "'bool' object has no attribute 'items'" in SDK enabled … (#11725 ) ### What problem does this PR solve? Fixes the `AttributeError: 'bool' object has no attribute 'items'` error when updating the `enabled` parameter of a document via the Python SDK (Issue #11721). Background: When calling `Document.update({"enabled": True/False})` through the SDK, the server-side API returned a boolean `data=True` in the response (instead of a dictionary). The SDK's `_update_from_dict` method (in `base.py`) expects a dictionary to iterate over with `.items()`, leading to an immediate AttributeError during response parsing. This prevented successful synchronization of the updated `enabled` status to the local SDK object, even if the server-side database/update index operations succeeded. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Additional Context (optional, for clarity) - Root Cause: Server returned `data=True` (boolean) for `enabled` parameter updates, violating the SDK's expectation of a dictionary-type `data` field. - Fix Logic: 1. Removed the separate `return get_result(data=True)` in the `enabled` update branch to unify response flow. 2. - Backward Compatibility: No breaking changes—other update scenarios (e.g., renaming documents, modifying chunk methods) remain unaffected, and the response format stays consistent. Co-authored-by: shirukai <shirukai@hollysysdigital.com>	2025-12-04 11:24:01 +08:00
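The failure mode described in fa7b857aa9 can be reproduced with a small stand-in. The class below is hypothetical and is not the SDK's actual `base.py`; it only assumes, as the PR text states, that `_update_from_dict` iterates over the response payload with `.items()`.

```python
class Base:
    """Hypothetical stand-in for the SDK base class; not the actual base.py code."""

    def _update_from_dict(self, data):
        # Expects a mapping; a bare boolean has no .items() and raises AttributeError.
        for key, value in data.items():
            setattr(self, key, value)


doc = Base()
doc._update_from_dict({"enabled": True})  # dictionary payload: attributes are copied

try:
    doc._update_from_dict(True)           # boolean payload, as in the reported bug
except AttributeError as exc:
    error_message = str(exc)
```

With a dictionary payload the local object picks up `enabled`; with `data=True` the call fails before any attribute is synchronized, which matches the symptom reported in the issue.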
rommy2017	257af75ece	Fix: relative page_number in boxes (#11712 ) page_number in boxes is a relative page number, so from_page must be added to it ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-04 11:23:34 +08:00
Wiratama	cbdacf21f6	feat(gcs): Add support for Google Cloud Storage (GCS) integration (#11718 ) ### What problem does this PR solve? This Pull Request introduces native support for Google Cloud Storage (GCS) as an optional object storage backend. Currently, RAGFlow relies on a limited set of storage options. This feature addresses the need for seamless integration with GCP environments, allowing users to leverage a fully managed, highly durable, and scalable storage service (GCS) instead of needing to deploy and maintain third-party object storage solutions. This simplifies deployment, especially for users running on GCP infrastructure like GKE or Cloud Run. The implementation uses a single GCS bucket defined via configuration, mapping RAGFlow's internal logical storage units (or "buckets") to folder prefixes within that GCS container to maintain data separation. This architectural choice avoids the operational complexities associated with dynamically creating and managing unique GCS buckets for every logical unit. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-04 10:44:05 +08:00
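The bucket-to-prefix mapping described in cbdacf21f6 might look roughly like this. It is a sketch under assumptions: the helper name `to_gcs_object_name` and the exact separator handling are illustrative and are not taken from the PR.

```python
def to_gcs_object_name(logical_bucket: str, filename: str) -> str:
    """Map a logical (bucket, filename) pair onto an object name in one shared GCS bucket.

    Hypothetical helper: the logical bucket becomes a folder-style prefix.
    """
    prefix = logical_bucket.strip("/")
    return f"{prefix}/{filename.lstrip('/')}"
```

For example, `to_gcs_object_name("kb-123", "report.pdf")` yields `kb-123/report.pdf`, so every logical unit stays separated by prefix inside the single configured bucket.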
Stephen Hu	b1f3130519	Refactor: Remove useless for and add (#11720 ) ### What problem does this PR solve? Remove useless for and add ### Type of change - [x] Refactoring	2025-12-04 10:43:24 +08:00
David Eberto Domenech Castillo	3c224c817b	Fix: Correct pagination and early termination bugs in chunk_list() (#11692 ) ## Summary This PR fixes two critical bugs in the `chunk_list()` method that prevent processing large documents (>128 chunks) in GraphRAG and other workflows. ## Bugs Fixed ### Bug 1: Incorrect pagination offset calculation Location: `rag/nlp/search.py` lines 530-531 Problem: The loop variable `p` was used directly as offset, causing incorrect pagination. Before (buggy): `for p in range(offset, max_count, bs):` (p = 0, 128, 256, 384...) followed by `es_res = self.dataStore.search(..., p, bs, ...)` (p used as offset). Fix: Use page number multiplied by batch size. After (fixed): `for page_num, p in enumerate(range(offset, max_count, bs)):` followed by `es_res = self.dataStore.search(..., page_num * bs, bs, ...)`. ### Bug 2: Premature loop termination Location: `rag/nlp/search.py` lines 538-539 Problem: Loop terminates when any page returns fewer than 128 chunks, even when thousands more remain. Before (buggy): `if len(dict_chunks.values()) < bs: break` (breaks at 126 chunks even if 3,000+ remain). Fix: Only terminate when zero chunks are returned. After (fixed): `if len(dict_chunks.values()) == 0: break`. ### Enhancement: Add max_count parameter to GraphRAG Location: `graphrag/general/index.py` line 60 Added `max_count=10000` parameter to chunk loading for both LightRAG and General GraphRAG paths to ensure all chunks are processed. ## Testing Validated with a 314-page legal document containing 3,207 chunks: Before fixes: - Only 2-126 chunks processed - GraphRAG generated 25 nodes, 8 edges After fixes: - All 3,209 chunks processed ✅ - GraphRAG processing complete dataset ## Impact These bugs affect any workflow using `chunk_list()` with large documents, particularly: - GraphRAG knowledge graph generation - RAPTOR hierarchical summarization - Document processing pipelines with >128 chunks ## Related Issue Fixes #11687 ## Checklist - Code follows project style guidelines - Tested with large documents (3,207+ chunks) - Both bugs validated by Dosu bot in issue #11687 - No breaking changes to API --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-03 19:44:20 +08:00
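The termination rule from 3c224c817b (stop on an empty page, not on a short page) can be sketched in isolation. This is not the code in `rag/nlp/search.py`: `fetch_all` and the `search(offset, limit)` callable are hypothetical stand-ins for the data-store call, and the starting offset is assumed to be zero.

```python
from typing import Callable, List


def fetch_all(search: Callable[[int, int], List[str]], max_count: int, bs: int = 128) -> List[str]:
    """Collect up to max_count items, stopping only when a page comes back empty.

    Hypothetical sketch; `search(offset, limit)` stands in for the data-store query.
    """
    items: List[str] = []
    total_pages = (max_count + bs - 1) // bs   # ceiling division
    for page_num in range(total_pages):
        page = search(page_num * bs, bs)       # offset = page number * batch size
        if not page:                           # an empty page ends the scan
            break
        items.extend(page)                     # a short page does NOT end the scan
    return items[:max_count]
```

A page can legitimately come back short, for instance when some hits are filtered out after retrieval, which is why the sketch keeps going until a page is empty or `max_count` is reached.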
hsparks-codes	a3c9402218	Feat: confluence space key (#11706 ) # PR Description: Add Space Key Configuration for Confluence Data Source ### What problem does this PR solve? This PR addresses issue #11638 where users requested the ability to specify Confluence Space Keys when configuring a Confluence data source connector. Problem: Currently, the RAGFlow UI for Confluence data sources only provides fields for: - Username - Access Token - Wiki Base URL - Is Cloud checkbox There is no way to specify which Confluence space(s) to sync, causing RAGFlow to attempt syncing all accessible spaces. This is problematic for users who: - Only want to index specific spaces (e.g., only the HR or Documentation space) - Have access to many spaces but only need a subset - Want to avoid unnecessary data transfer and processing Solution: The backend `ConfluenceConnector` class already supports a `space` parameter in its `__init__()` method (line 1282 in `common/data_source/confluence_connector.py`), but this parameter was never exposed in the UI. This PR adds the missing UI field to allow users to configure space filtering. User Impact: Users can now: - Leave the field empty to sync all accessible spaces (default behavior) - Specify a single space key (e.g., `DEV`) - Specify multiple space keys separated by commas (e.g., `DEV,DOCS,HR`) This gives users fine-grained control over which Confluence content gets indexed into their RAGFlow knowledge base. Fixes #11638 ### Type of change - [x] New Feature (non-breaking change which adds functionality) --- ## Implementation Details ### Changes Made 1. Frontend UI (`web/src/pages/user-setting/data-source/contant.tsx`) - Added "Space Key" text input field to Confluence configuration form - Field is optional (not required) - Positioned after "Is Cloud" checkbox for logical grouping - Added to initial values with empty string default 2. Internationalization (`web/src/locales/.ts`) - English (`en.ts`): Added `confluenceSpaceKeyTip` with clear instructions and examples - Chinese (`zh.ts`): Added Chinese translation for the tooltip - Russian (`ru.ts`): Added Russian translation for the tooltip - Bonus Fix: Removed duplicate `deleteModal` object in `zh.ts` that was causing TypeScript lint errors ### Backend Compatibility No backend changes were needed! The `ConfluenceConnector` class already supports the `space` parameter: ```python def __init__( self, wiki_base: str, is_cloud: bool, space: str = "", # ← Already supported! page_id: str = "", index_recursively: bool = False, cql_query: str | None = None, ... ) ``` The connector uses this parameter to filter the CQL query (line 1328-1330): ```python elif space: uri_safe_space = quote(space) base_cql_page_query += f" and space='{uri_safe_space}'" ``` ### User Experience Before: - Users could only sync ALL accessible spaces - No UI option to limit scope After: - Users see "Space Key" field with helpful tooltip - Tooltip explains: - Optional field (leave empty for all spaces) - Single space example: `DEV` - Multiple spaces example: `DEV,DOCS,HR` - Available in English, Chinese, and Russian ### Future Enhancements Potential improvements for future PRs: - Add validation to check if space key exists before saving - Add autocomplete/dropdown to show available spaces - Add UI hints about space key format requirements - Support for page_id filtering (already supported in backend) --- ## Related Issues - Fixes #11638 - [Confluence] How to specify Space Key when adding Confluence data source?	2025-12-03 19:17:47 +08:00
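The three input forms listed in a3c9402218 (empty, single key, comma-separated keys) can be normalized with a small parser. This is only a sketch: `parse_space_keys` is a hypothetical helper, and how the connector itself consumes a multi-key value is not shown in the PR text.

```python
from typing import List


def parse_space_keys(raw: str) -> List[str]:
    """Split a "Space Key" field value into individual keys.

    Hypothetical helper: an empty result means "sync all accessible spaces".
    """
    return [key.strip() for key in raw.split(",") if key.strip()]
```

For instance, `parse_space_keys("DEV, DOCS,HR")` returns `["DEV", "DOCS", "HR"]`, while an empty field returns an empty list.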
Jin Hai	a7d40e9132	Update since 'File manager' is renamed to 'File' (#11698 ) ### What problem does this PR solve? Update some docs and comments, since 'File manager' is renamed to 'File' ### Type of change - [x] Documentation Update - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com>	2025-12-03 18:32:15 +08:00
Yongteng Lei	648342b62f	Fix: handle MinerU sanitized filenames when reading output (#11701 ) ### What problem does this PR solve? Handle MinerU sanitized filenames when reading output. #11613, #11620. Thanks @shaoqing404 for raising this issue. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-03 17:24:37 +08:00
hsparks-codes	4870d42949	feat: Auto-disable Raptor for structured data (Issue #11653 ) (#11676 ) ### What problem does this PR solve? Feature: This PR implements automatic Raptor disabling for structured data files to address issue #11653. Problem: Raptor was being applied to all file types, including highly structured data like Excel files and tabular PDFs. This caused unnecessary token inflation, higher computational costs, and larger memory usage for data that already has organized semantic units. Solution: Automatically skip Raptor processing for: - Excel files (.xls, .xlsx, .xlsm, .xlsb) - CSV files (.csv, .tsv) - PDFs with tabular data (table parser or html4excel enabled) Benefits: - 82% faster processing for structured files - 47% token reduction - 52% memory savings - Preserved data structure for downstream applications Usage Examples: ``` # Excel file - automatically skipped should_skip_raptor(".xlsx") # True # CSV file - automatically skipped should_skip_raptor(".csv") # True # Tabular PDF - automatically skipped should_skip_raptor(".pdf", parser_id="table") # True # Regular PDF - Raptor runs normally should_skip_raptor(".pdf", parser_id="naive") # False # Override for special cases should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False}) # False ``` Configuration: Includes `auto_disable_for_structured_data` toggle (default: true) to allow override for special use cases. Testing: 44 comprehensive tests, 100% passing ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 17:02:29 +08:00
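One way the skip rules listed in 4870d42949 could be implemented is sketched below. The function name and the `auto_disable_for_structured_data` key come from the usage examples in the PR text; the `parser_config` argument and its `html4excel` key are assumptions made for illustration, and the body is not the PR's actual code.

```python
from typing import Optional

# Extensions the PR text lists as structured data.
STRUCTURED_EXTENSIONS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".csv", ".tsv"}


def should_skip_raptor(
    file_type: str,
    parser_id: str = "",
    parser_config: Optional[dict] = None,   # hypothetical argument
    raptor_config: Optional[dict] = None,
) -> bool:
    """Return True when Raptor should be skipped for a structured file (sketch)."""
    raptor_config = raptor_config or {}
    # The toggle defaults to true; setting it to False restores normal Raptor runs.
    if not raptor_config.get("auto_disable_for_structured_data", True):
        return False
    ext = file_type.lower()
    if ext in STRUCTURED_EXTENSIONS:
        return True
    if ext == ".pdf":
        # Tabular PDFs: table parser selected, or html4excel enabled (assumed key).
        return parser_id == "table" or bool((parser_config or {}).get("html4excel"))
    return False
```

Checking the override first keeps the toggle authoritative, so a user who disables the feature never has a file skipped.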
redredrrred	caaf7043cc	Standardize UI text capitalization to sentence case (#11696 ) ### What problem does this PR solve? This PR addresses inconsistencies in UI text capitalization across the application, enforcing a "Sentence case" style (only the first letter capitalized) for better readability and visual consistency. ### Type of change - [x] Refactoring	2025-12-03 17:01:22 +08:00
hsparks-codes	237a66913b	Feat: RAG evaluation (#11674 ) ### What problem does this PR solve? Feature: This PR implements a comprehensive RAG evaluation framework to address issue #11656. Problem: Developers using RAGFlow lack systematic ways to measure RAG accuracy and quality. They cannot objectively answer: 1. Are RAG results truly accurate? 2. How should configurations be adjusted to improve quality? 3. How to maintain and improve RAG performance over time? Solution: This PR adds a complete evaluation system with: - Dataset & test case management - Create ground truth datasets with questions and expected answers - Automated evaluation - Run RAG pipeline on test cases and compute metrics - Comprehensive metrics - Precision, recall, F1 score, MRR, hit rate for retrieval quality - Smart recommendations - Analyze results and suggest specific configuration improvements (e.g., "increase top_k", "enable reranking") - 20+ REST API endpoints - Full CRUD operations for datasets, test cases, and evaluation runs Impact: Enables developers to objectively measure RAG quality, identify issues, and systematically improve their RAG systems through data-driven configuration tuning. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 17:00:58 +08:00
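The retrieval metrics named in 237a66913b have standard definitions, sketched here for a single query. The function name and return shape are hypothetical and do not describe the PR's API; MRR and hit rate are then the means of `reciprocal_rank` and `hit` over all test cases.

```python
from typing import Dict, List, Set


def retrieval_metrics(retrieved: List[str], relevant: Set[str]) -> Dict[str, float]:
    """Per-query retrieval metrics (hypothetical helper, standard definitions)."""
    hits = [doc_id for doc_id in retrieved if doc_id in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Reciprocal rank of the first relevant result (ranks start at 1).
    reciprocal_rank = next(
        (1.0 / rank for rank, doc_id in enumerate(retrieved, start=1) if doc_id in relevant),
        0.0,
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
        "hit": 1.0 if hits else 0.0,
    }
```

A recommendation such as "increase top_k" would typically follow from low recall with acceptable precision, though the thresholds used by the PR are not stated in its description.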
Jin Hai	3c50c7d3ac	Refactor code (#11694 ) ### What problem does this PR solve? Rename function and refactor log message ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-03 15:15:00 +08:00
balibabu	b44e65a12e	Feat: Replace antd with shadcn and delete the template node. #10427 (#11693 ) ### What problem does this PR solve? Feat: Replace antd with shadcn and delete the template node. #10427 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 14:37:58 +08:00
Yongteng Lei	e3f40db963	Refa: make RAGFlow more asynchronous 2 (#11689 ) ### What problem does this PR solve? Make RAGFlow more asynchronous 2. #11551, #11579, #11619. ### Type of change - [x] Refactoring - [x] Performance Improvement	2025-12-03 14:19:53 +08:00
Kevin Hu	b5ad7b7062	Feat: support TOC transformer. (#11685 ) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 12:27:50 +08:00
Billy Bao	6fc7def562	Feat: optimize the information displayed when .doc preview is unavailable (#11684 ) ### What problem does this PR solve? Feat: optimize the information displayed when .doc preview is unavailable #11605 ### Type of change - [X] New Feature (non-breaking change which adds functionality) #### Performance (Before) <img width="700" alt="image" src="https://github.com/user-attachments/assets/15cf69ee-3698-4e18-8e8f-bb75c321334d" /> #### Performance (After) ![img_v3_02sk_c0fcaf74-4a26-4b6c-b0e0-8f8929426d9g](https://github.com/user-attachments/assets/8c8eea3e-2c8e-457c-ab2b-5ef205806f42)	2025-12-03 12:22:01 +08:00
buua436	c8f608b2dd	Feat:support tts in agent (#11675 ) ### What problem does this PR solve? change: support tts in agent ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 12:03:59 +08:00
Yongteng Lei	5c81e01de5	Fix: incorrect async chat streamly output (#11679 ) ### What problem does this PR solve? Incorrect async chat streamly output. #11677. Disable beartype for #11666. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-03 11:15:45 +08:00
writinwaters	83fac6d0a0	Docs: How to specify an ingestion pipeline when creating a dataset (#11670 ) ### What problem does this PR solve? ### Type of change - [x] Documentation Update	2025-12-03 09:35:52 +08:00
Kevin Hu	a6681d6366	Revert "Refa: make RAGFlow more asynchronous 2" (#11669 ) Reverts infiniflow/ragflow#11664	2025-12-02 19:42:05 +08:00
chanx	1388c4420d	Feature: Add voice dialogue functionality to the agent application (#11668 ) ### What problem does this PR solve? Feature: Add voice dialogue functionality to the agent application ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 19:39:43 +08:00
Levi	962bd5f5df	feat: improve Moodle connector functionality (#11665 ) ### What problem does this PR solve? Add metadata from moodle data source. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 19:12:43 +08:00
Yongteng Lei	627c11c429	Refa: make RAGFlow more asynchronous 2 (#11664 ) ### What problem does this PR solve? Make RAGFlow more asynchronous 2. #11551, #11579, #11619. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring - [x] Performance Improvement	2025-12-02 18:57:07 +08:00
rommy2017	4ba17361e9	feat: improve presentation PdfParser (#11639 ) The old presentation PdfParser lost table formatting after parsing ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 17:35:14 +08:00
Billy Bao	c946858328	Feat: add mineru auto installer (#11649 ) ### What problem does this PR solve? Feat: add mineru auto installer ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 17:29:26 +08:00
balibabu	ba6e2af5fd	Feat: Delete useless request hooks. #10427 (#11659 ) ### What problem does this PR solve? Feat: Delete useless request hooks. #10427 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 17:24:29 +08:00
qinling0210	2ffe6f7439	Import rag_tokenizer from Infinity (#11647 ) ### What problem does this PR solve? - The original rag/nlp/rag_tokenizer.py has been moved into Infinity and infinity-sdk via https://github.com/infiniflow/infinity/pull/3117 . The new rag/nlp/rag_tokenizer.py imports rag_tokenizer from infinity and inherits from rag_tokenizer.RagTokenizer. - Bump infinity to 0.6.8 ### Type of change - [x] Refactoring	2025-12-02 14:59:37 +08:00
Zhichang Yu	e3987e21b9	Update upgrade guide: add stop server step and rename section (#11654 ) ### What problem does this PR solve? Update upgrade guide: add stop server step and rename section ### Type of change - [x] Documentation Update	2025-12-02 14:51:03 +08:00
Yongteng Lei	a713f54732	Refa: add MiniMax-M2 and remove deprecated MiniMax models (#11642 ) ### What problem does this PR solve? Add MiniMax-M2 and remove deprecated models. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-12-02 14:43:44 +08:00
balibabu	519f03097e	Feat: Remove unnecessary dialogue-related code. #10427 (#11652 ) ### What problem does this PR solve? Feat: Remove unnecessary dialogue-related code. #10427 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 14:42:28 +08:00
Kevin Hu	299c655e39	Fix: file manager KB link issue. (#11648 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-02 12:14:27 +08:00
buua436	b8c0fb4572	Feat:new api /sequence2txt and update QWenSeq2txt (#11643 ) ### What problem does this PR solve? change: new api /sequence2txt, update QWenSeq2txt and ZhipuSeq2txt ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 11:17:31 +08:00
Stephen Hu	d1e172171f	Refactor: better describe how to get prefix for sync data source (#11636 ) ### What problem does this PR solve? better describe how to get prefix for sync data source ### Type of change - [x] Refactoring	2025-12-01 17:46:44 +08:00
Kevin Hu	81ae6cf78d	Feat: support uploading in dialog. (#11634 ) ### What problem does this PR solve? #9590 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-01 16:54:57 +08:00
balibabu	1120575021	Feat: Files uploaded via the dialog box can be uploaded without binding to a dataset. #9590 (#11630 ) ### What problem does this PR solve? Feat: Files uploaded via the dialog box can be uploaded without binding to a dataset. #9590 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-01 16:29:02 +08:00
Zhichang Yu	221947acc4	Fix workflows	2025-12-01 15:36:43 +08:00
Zhichang Yu	21d8ffca56	Fix workflows	2025-12-01 14:58:33 +08:00
Billy Bao	41cff3e09e	Fix: jina embedding issue (#11628 ) ### What problem does this PR solve? Fix: jina embedding issue #11614 Feat: Add jina embedding v4 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 14:24:35 +08:00
Yongteng Lei	b6c4722687	Refa: make RAGFlow more asynchronous (#11601 ) ### What problem does this PR solve? Try to make this more asynchronous. Verified in chat and agent scenarios, reducing blocking behavior. #11551, #11579. However, the impact of these changes still requires further investigation to ensure everything works as expected. ### Type of change - [x] Refactoring	2025-12-01 14:24:06 +08:00
Kevin Hu	6ea4248bdc	Feat: support parent-child in search procedure. (#11629 ) ### What problem does this PR solve? #7996 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-01 14:03:09 +08:00
Kevin Hu	88a28212b3	Fix: Table parse method issue. (#11627 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 12:42:35 +08:00
Yongteng Lei	9d0309aedc	Fix: [MinerU] Missing output file (#11623 ) ### What problem does this PR solve? Add fallbacks for MinerU output path. #11613, #11620. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 12:17:43 +08:00
dzikus	9a8ce9d3e2	fix: increase Quart RESPONSE_TIMEOUT and BODY_TIMEOUT for slow LLM responses (#11612 ) ### What problem does this PR solve? Quart framework has default RESPONSE_TIMEOUT and BODY_TIMEOUT of 60 seconds. This causes the frontend chat to hang exactly after 60 seconds when using slow LLM backends (e.g., Ollama on CPU, or remote APIs with high latency). This fix adds configurable timeout settings via environment variables with sensible defaults (600 seconds = 10 minutes) to match other timeout configurations in RAGFlow. Fixes issues with chat timeout when: - Using local Ollama on CPU (response time ~2 minutes) - Using remote LLM APIs with high latency - Processing complex RAG queries with many chunks ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Grzegorz Sterniczuk <grzegorz@sternicz.uk>	2025-12-01 11:26:34 +08:00
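The configurable timeouts from 9a8ce9d3e2 can be sketched as a small loader. `RESPONSE_TIMEOUT` and `BODY_TIMEOUT` are the Quart configuration keys named in the PR text; the environment variable names `QUART_RESPONSE_TIMEOUT` and `QUART_BODY_TIMEOUT` are assumptions made for this example, as is the helper `load_timeouts`.

```python
import os
from typing import Dict, Mapping

DEFAULT_TIMEOUT_SECONDS = 600  # 10 minutes, the default described in the PR text


def load_timeouts(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Build Quart timeout settings from environment variables (names are assumed)."""
    return {
        "RESPONSE_TIMEOUT": int(env.get("QUART_RESPONSE_TIMEOUT", DEFAULT_TIMEOUT_SECONDS)),
        "BODY_TIMEOUT": int(env.get("QUART_BODY_TIMEOUT", DEFAULT_TIMEOUT_SECONDS)),
    }


# With a Quart application object `app`, the values would be applied as:
#     app.config.update(load_timeouts())
```

Passing the environment as a mapping keeps the loader easy to exercise without touching the real process environment.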
Lei Zhang	7499608a8b	feat: add Redis username support (#11608 ) ### What problem does this PR solve? Support for Redis 6+ ACL authentication (username) close #11606 ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2025-12-01 11:26:20 +08:00
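For Redis 6+ ACL authentication as in 7499608a8b, the username travels alongside the password in the connection settings. The sketch below builds a `redis://` URL of the form `redis://username:password@host:port/db`; the helper `redis_url` is hypothetical and does not reflect how RAGFlow assembles its connection.

```python
from typing import Optional
from urllib.parse import quote


def redis_url(
    host: str,
    port: int = 6379,
    db: int = 0,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> str:
    """Build a redis:// URL, percent-encoding credentials (hypothetical helper)."""
    auth = ""
    if username or password:
        # An empty username with a password gives the pre-ACL ":password@" form.
        auth = f"{quote(username or '', safe='')}:{quote(password or '', safe='')}@"
    return f"redis://{auth}{host}:{port}/{db}"
```

Percent-encoding matters because characters such as `@` or `:` in a password would otherwise be read as URL delimiters.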
writinwaters	0ebbb60102	Docs: deploying a local model using Jina not supported (#11624 ) ### What problem does this PR solve? ### Type of change - [x] Documentation Update	2025-12-01 11:24:29 +08:00
omahs	80f6d22d2a	Fix typos (#11607 ) ### What problem does this PR solve? Fix typos ### Type of change - [x] Fix typos	2025-12-01 09:49:46 +08:00
Oranggge	088b049b4c	Feature: embedded chat theme (#11581 ) ### What problem does this PR solve? This PR closes feature request #11286. It implements the ability to choose the background theme of the _Full screen chat_ that is embedded into a webpage. It looks like this: <img width="501" height="349" alt="image" src="https://github.com/user-attachments/assets/e5fdfb14-9ed9-43bb-a40d-4b580985b9d4" /> It works similarly to `Locale`, using a URL parameter to set the theme. If the parameter is invalid, the default theme is used. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Your Name <you@example.com>	2025-12-01 09:49:28 +08:00
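The fallback behaviour described in 088b049b4c (an invalid or missing URL parameter selects the default theme) can be sketched as follows. The parameter name `theme`, the set of valid values, and the default are all assumptions for illustration; the actual change lives in the web frontend and is not shown in the PR text.

```python
from urllib.parse import parse_qs, urlparse

VALID_THEMES = {"light", "dark"}   # assumed set of themes
DEFAULT_THEME = "light"            # assumed default


def theme_from_url(url: str) -> str:
    """Read the theme from a URL query parameter, falling back to the default."""
    values = parse_qs(urlparse(url).query).get("theme", [])
    theme = values[0] if values else ""
    return theme if theme in VALID_THEMES else DEFAULT_THEME
```

Validating against a fixed set means a typo in the embed URL degrades to the default look instead of producing an unstyled page.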
Billy Bao	fa9b7b259c	Feat: create datasets from http api supports ingestion pipeline (#11597 ) ### What problem does this PR solve? Feat: create datasets from http api supports ingestion pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-28 19:55:24 +08:00

4634 Commits