ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2025-12-23 23:16:58 +08:00

Author	SHA1	Message	Date
Magicbook1108	f8fd1ea7e1	Feat: Further update Bedrock model configs (#12029) ### What problem does this PR solve? Feat: Further update Bedrock model configs #12020 #12008 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-19 11:32:20 +08:00
buua436	57edc215d7	Feat:update webhook component (#11739 ) ### What problem does this PR solve? issue: https://github.com/infiniflow/ragflow/issues/10427 https://github.com/infiniflow/ragflow/issues/8115 change: - Support for Multiple HTTP Methods (POST / GET / PUT / PATCH / DELETE / HEAD) - Security Validation 1. max_body_size 2. IP whitelist 3. rate limit 4. token / basic / jwt authentication - File Upload Support - Unified Content-Type Handling - Full Schema-Based Extraction & Type Validation - Two Execution Modes: Immediately / Streaming ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-18 19:34:39 +08:00
Jonah Hartmann	7a4044b05f	Feat: use filepath for files with the same name for all data source types (#11819 ) ### What problem does this PR solve? When there are multiple files with the same name the file would just duplicate, making it hard to distinguish between the different files. Now if there are multiple files with the same name, they will be named after their folder path in the storage unit. This was done for the webdav connector and with this PR also for Notion, Confluence and S3 Storage. ### Type of change - [x] New Feature (non-breaking change which adds functionality) Contribution by RAGcon GmbH, visit us [here](https://www.ragcon.ai/)	2025-12-18 17:42:43 +08:00
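The naming rule described in the commit above (files that share a name are shown by their folder path) can be sketched as follows. This is only an illustration: `display_names` is a hypothetical helper, not a function from the RAGFlow connectors, and it assumes `/`-separated storage paths.

```python
from collections import Counter


def display_names(paths):
    """Return one display name per path (hypothetical helper).

    A file keeps its bare name unless another file shares that name,
    in which case the full folder path is used to tell them apart.
    """
    base_names = [p.rsplit("/", 1)[-1] for p in paths]
    counts = Counter(base_names)
    return [
        path if counts[base] > 1 else base
        for path, base in zip(paths, base_names)
    ]
```

With this sketch, two `report.pdf` files in different folders keep their paths, while a uniquely named file is shown by name alone.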
Magicbook1108	e84d5412bc	Feat: bedrock iam authentication (#12020 ) ### What problem does this PR solve? Feat: bedrock iam authentication #12008 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-18 17:13:09 +08:00
Magicbook1108	2331b3a270	Refact: Update logging (#12014) ### What problem does this PR solve? Refact: Update logging ### Type of change - [x] Refactoring	2025-12-18 14:18:03 +08:00
Stephen Hu	a63dcfed6f	Refactor: improve cohere calculate total counts (#12007 ) ### What problem does this PR solve? improve cohere calculate total counts ### Type of change - [x] Refactoring	2025-12-18 10:04:28 +08:00
concertdictate	4dd8cdc38b	task executor issues (#12006 ) ### What problem does this PR solve? Fixes #8706 - `InfinityException: TOO_MANY_CONNECTIONS` when running multiple task executor workers ### Problem Description When running RAGFlow with 8-16 task executor workers, most workers fail to start properly. Checking logs revealed that workers were stuck/hanging during Infinity connection initialization - only 1-2 workers would successfully register in Redis while the rest remained blocked. ### Root Cause The Infinity SDK `ConnectionPool` pre-allocates all connections in `__init__`. With the default `max_size=32` and multiple workers (e.g., 16), this creates 16×32=512 connections immediately on startup, exceeding Infinity's default 128 connection limit. Workers hang while waiting for connections that can never be established. ### Changes 1. Prevent Infinity connection storm (`rag/utils/infinity_conn.py`, `rag/svr/task_executor.py`) - Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since operations are synchronous) - Added staggered startup delay (2s per worker) to spread connection initialization 2. Handle None children_delimiter (`rag/app/naive.py`) - Use `or ""` to handle explicitly set None values from parser config 3. 
MinerU parser robustness (`deepdoc/parser/mineru_parser.py`) - Use `.get()` for optional output fields that may be missing - Fix DISCARDED block handling: change `pass` to `continue` to skip discarded blocks entirely ### Why `max_size=4` is sufficient \| Workers \| Pool Size \| Total Connections \| Infinity Limit \| \|---------\|-----------\|-------------------\|----------------\| \| 16 \| 32 \| 512 \| 128 ❌ \| \| 16 \| 4 \| 64 \| 128 ✅ \| \| 32 \| 4 \| 128 \| 128 ✅ \| - All RAGFlow operations are synchronous: `get_conn()` → operation → `release_conn()` - No parallel `docStoreConn` operations in the codebase - Maximum 1-2 concurrent connections needed per worker; 4 provides safety margin ### MinerU DISCARDED block bug When MinerU returns blocks with `type: "discarded"` (headers, footers, watermarks, page numbers, artifacts), the previous code used `pass` which left the `section` variable undefined, causing: - UnboundLocalError if DISCARDED is the first block - Duplicate content if DISCARDED follows another block (stale value from previous iteration) Root cause confirmed via MinerU source code: From [`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14): ```python class BlockType: DISCARDED = 'discarded' # VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE ``` Per [MinerU documentation](https://opendatalab.github.io/MinerU/reference/output_files/), discarded blocks contain content that should be filtered out for clean text extraction. Fix: Changed `pass` to `continue` to skip discarded blocks entirely. ### Testing - Verified all 16 workers now register successfully in Redis - All workers heartbeating correctly - Document parsing works as expected - MinerU parsing with DISCARDED blocks no longer crashes ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: user210 <user210@rt>	2025-12-18 10:03:30 +08:00
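The `pass` versus `continue` bug described in the commit above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the block dictionaries and both function names are hypothetical stand-ins, not the code in `deepdoc/parser/mineru_parser.py`. Only the `'discarded'` value comes from the commit message.

```python
DISCARDED = "discarded"  # value of BlockType.DISCARDED quoted in the commit message


def collect_sections_buggy(blocks):
    """Old behavior (sketch): `pass` falls through to the append below."""
    sections = []
    for block in blocks:
        if block["type"] == DISCARDED:
            pass  # `section` is not reassigned for this block
        else:
            section = block["text"]
        # For a discarded block this appends a stale value, or raises
        # UnboundLocalError if no earlier block ever assigned `section`.
        sections.append(section)
    return sections


def collect_sections_fixed(blocks):
    """Fixed behavior (sketch): `continue` skips the discarded block entirely."""
    sections = []
    for block in blocks:
        if block["type"] == DISCARDED:
            continue
        sections.append(block["text"])
    return sections
```

Feeding both functions a list where a discarded block follows a text block shows the duplicate content; putting the discarded block first triggers the `UnboundLocalError` in the buggy version only.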
Yongteng Lei	672958a192	Fix: model not authorized (#12001 ) ### What problem does this PR solve? Fix model not authorized. #11973. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-17 19:48:24 +08:00
Jin Hai	d38f8a1562	Add license and Fix IDE warnings (#11985 ) ### What problem does this PR solve? - Add license - Fix IDE warnings ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-17 17:04:44 +08:00
Kevin Hu	8e4d011b15	Fix: parent-children chunking method. (#11997 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-12-17 16:50:36 +08:00
Magicbook1108	82d4e5fb87	Ref: update logging (#11987) ### What problem does this PR solve? Ref: update logging ### Type of change - [x] Refactoring	2025-12-17 15:43:25 +08:00
Yongteng Lei	03f9be7cbb	Refa: only support MinerU-API now (#11977 ) ### What problem does this PR solve? Only support MinerU-API now, still need to complete frontend for pipeline to allow the configuration of MinerU options. ### Type of change - [x] Refactoring	2025-12-17 12:58:48 +08:00
Jin Hai	30019dab9f	Change knowledge base to dataset (#11976 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-17 10:03:33 +08:00
Jin Hai	0e8b9588ba	Fix error and format issue (#11975 ) ### What problem does this PR solve? 1. Fix error of book chunking. 2. Fix format issues. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-16 19:29:37 +08:00
concertdictate	49c74d08e8	Feature/mineru improvements (#11938) I have repeated this description in Chinese in the comments below. ### What problem does this PR solve? ## Summary This PR enhances the MinerU document parser with additional configuration options, giving users more control over PDF parsing behavior and improving support for multilingual documents. ## Changes ### Backend (`deepdoc/parser/mineru_parser.py`) - Added configurable parsing options: - Parse Method: `auto`, `txt`, or `ocr` (lets users choose the extraction strategy) - Formula Recognition: Toggle for enabling/disabling formula extraction (useful to disable for Cyrillic documents where it may cause issues) - Table Recognition: Toggle for enabling/disabling table extraction - Added language code mapping (`LANGUAGE_TO_MINERU_MAP`) to translate RAGFlow language settings to MinerU-compatible language codes for better OCR accuracy - Improved parser configuration handling to pass these options through the processing pipeline ### Frontend (`web/`) - Created new `MinerUOptionsFormField` component that conditionally renders when MinerU is selected as the layout recognition engine - Added UI controls for: - Parse method selection (dropdown) - Formula recognition toggle (switch) - Table recognition toggle (switch) - Added i18n translations for English and Chinese - Integrated the options into both the dataset creation dialog and dataset settings page ### Integration - Updated `rag/app/naive.py` to forward MinerU options to the parser - Updated task service to handle the new configuration parameters ## Why MinerU is a powerful document parser, but the default settings don't work well for all document types. This PR allows users to: 1. Choose the best parsing method for their documents 2. Disable formula recognition for Cyrillic/non-Latin scripts where it causes issues 3. Control table extraction based on document needs 4. Benefit from automatic language detection for better OCR results ## Testing - [x] Tested MinerU parsing with different parse methods - [x] Verified UI renders correctly when MinerU is selected/deselected - [x] Confirmed settings persist correctly in dataset configuration ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: user210 <user210@rt> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-16 13:15:25 +08:00
Stephen Hu	ef5d1d4b74	Fix: 'AzureEmbed' object has no attribute 'total_token_count_from_response' (#11962 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/11956 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-16 11:29:07 +08:00
Yongteng Lei	ad6f7fd4b0	Fix: pipeline ignore MinerU backend config and vllm module is missing (#11955 ) ### What problem does this PR solve? Fix pipeline ignore MinerU backend config and vllm module is missing. #11944, #11947. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-15 18:03:34 +08:00
Stephen Hu	2a0f835ffe	Refactor: Improve the logic to calculate embedding total token count (#11943 ) ### What problem does this PR solve? Improve the logic to calculate embedding total token count ### Type of change - [x] Refactoring	2025-12-15 11:33:57 +08:00
YngvarHuang	81eb03d230	Support uploading encrypted files to object storage (#11837 ) (#11838 ) ### What problem does this PR solve? Support uploading encrypted files to object storage. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: virgilwong <hyhvirgil@gmail.com>	2025-12-15 09:45:18 +08:00
Magicbook1108	7d23c3aed0	Fix: presentation parsing & Embedding encode exception handling (#11933) ### What problem does this PR solve? Fix: presentation parsing #11920 Fix: Embedding encode exception handling ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-13 11:37:42 +08:00
Yongteng Lei	6be0338aa0	Fix: Azure-OpenAI resource not found (#11934) ### What problem does this PR solve? Azure-OpenAI resource not found. #11750 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-13 11:32:46 +08:00
Yongteng Lei	2b260901df	Fix: raptor doesn't have attribute chat (#11936) ### What problem does this PR solve? Raptor doesn't have attribute chat. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-12 20:08:18 +08:00
Magicbook1108	948bc93786	Feat: Add GPT-5.2 & pro (#11929 ) ### What problem does this PR solve? Feat: Add GPT-5.2 & pro ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-12 17:35:08 +08:00
Magicbook1108	7db9045b74	Feat: Add box connector (#11845 ) ### What problem does this PR solve? Feat: Add box connector ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-12 10:23:40 +08:00
Andrea Bugeja	74afb8d710	feat: Add Single Bucket Mode for MinIO/S3 (#11416 ) ## Overview This PR adds support for Single Bucket Mode in RAGFlow, allowing users to configure MinIO/S3 to use a single bucket with a directory structure instead of creating multiple buckets per Knowledge Base and user folder. ## Problem Statement The current implementation creates one bucket per Knowledge Base and one bucket per user folder, which can be problematic when: - Cloud providers charge per bucket - IAM policies restrict bucket creation - Organizations want centralized data management in a single bucket ## Solution Added a `prefix_path` configuration option to the MinIO connector that enables: - Using a single bucket with directory-based organization - Backward compatibility with existing multi-bucket deployments - Support for MinIO, AWS S3, and other S3-compatible storage backends ## Changes - `rag/utils/minio_conn.py`: Enhanced MinIO connector to support single bucket mode with prefix paths - `conf/service_conf.yaml`: Added new configuration options (`bucket` and `prefix_path`) - `docker/service_conf.yaml.template`: Updated template with single bucket configuration examples - `docker/.env.single-bucket-example`: Added example environment variables for single bucket setup - `docs/single-bucket-mode.md`: Comprehensive documentation covering usage, migration, and troubleshooting ## Configuration Example ```yaml minio: user: "access-key" password: "secret-key" host: "minio.example.com:443" bucket: "ragflow-bucket" # Single bucket name prefix_path: "ragflow" # Optional prefix path ``` ## Backward Compatibility ✅ Fully backward compatible - existing deployments continue to work without any changes - If `bucket` is not configured, uses default multi-bucket behavior - If `bucket` is configured without `prefix_path`, uses bucket root - If both are configured, uses `bucket/prefix_path/` structure ## Testing - Tested with MinIO (local and cloud) - Verified backward compatibility with 
existing multi-bucket mode - Validated IAM policy restrictions work correctly ## Documentation Included comprehensive documentation in `docs/single-bucket-mode.md` covering: - Configuration examples - Migration guide from multi-bucket to single-bucket mode - IAM policy examples - Troubleshooting guide --- Related Issue: Addresses use cases where bucket creation is restricted or costly	2025-12-11 19:22:47 +08:00
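The three backward-compatibility cases listed in the commit above can be sketched as a small path-resolution function. This is an assumption-laden illustration, not the logic in `rag/utils/minio_conn.py`: `resolve_location` is a hypothetical name, and mapping the former per-dataset bucket name to a directory inside the single bucket is a guess at how the "directory structure" is laid out.

```python
def resolve_location(logical_bucket, object_name, bucket=None, prefix_path=None):
    """Return the (physical_bucket, object_key) pair to use (hypothetical helper).

    - no `bucket` configured: multi-bucket mode, names are used unchanged
    - `bucket` only: objects live under the root of the single bucket
    - `bucket` and `prefix_path`: objects live under `bucket/prefix_path/`
    """
    if not bucket:
        return logical_bucket, object_name
    # Assumption: the old bucket name becomes a directory component.
    parts = [p.strip("/") for p in (prefix_path, logical_bucket, object_name) if p]
    return bucket, "/".join(parts)
```

Using the values from the configuration example, `resolve_location("kb1", "a.pdf", bucket="ragflow-bucket", prefix_path="ragflow")` would place the object at `ragflow/kb1/a.pdf` inside `ragflow-bucket`, where `kb1` and `a.pdf` are made-up names.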
Kevin Hu	ea4a5cd665	Fix: tokenizer issue. (#11902 ) #11786 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 17:38:17 +08:00
Yongteng Lei	e9710b7aa9	Refa: treat MinerU as an OCR model 2 (#11905 ) ### What problem does this PR solve? Treat MinerU as an OCR model 2. #11903 ### Type of change - [x] Refactoring	2025-12-11 17:33:12 +08:00
buua436	e3cfe8e848	Fix: async issue and sensitive logging (#11895) ### What problem does this PR solve? change: async issue and sensitive logging ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 13:54:47 +08:00
David López Carrascal	a6afb7dfe2	Fix data_sync startup crash by properly invoking async main (#11879) ### What problem does this PR solve? This PR fixes a startup crash in the data_sync_0 service caused by an incorrect asyncio.run call. The main coroutine was being passed as a function reference instead of being invoked, which raised: `ValueError: a coroutine was expected, got <function main ...>` What I changed - Updated the entrypoint in sync_data_source.py to correctly invoke the coroutine with `asyncio.run(main())`. Testing - Not tested. Related Issue Fixes https://github.com/infiniflow/ragflow/issues/11878 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 10:09:16 +08:00
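The difference between passing the function and passing the coroutine object can be shown with a toy entrypoint. This is a minimal sketch: the `main` below is a placeholder, not the one in sync_data_source.py, and `run_wrong` and `run_right` are hypothetical names.

```python
import asyncio


async def main():
    # Placeholder for the real service loop.
    await asyncio.sleep(0)
    return "synced"


def run_wrong():
    """Pass the function itself; asyncio.run rejects it."""
    try:
        asyncio.run(main)  # bug: `main` is a function, not a coroutine object
    except (ValueError, TypeError) as exc:
        # The commit reports ValueError; TypeError is caught as well in case
        # another Python version reports the same mistake differently.
        return f"{type(exc).__name__}: {exc}"
    return None


def run_right():
    """Call main() to create the coroutine object, then run it."""
    return asyncio.run(main())
```

`run_wrong()` returns the error text instead of crashing, and `run_right()` returns the coroutine's result.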
He Wang	badf33e3b9	feat: enhance OBConnection.search (#11876 ) ### What problem does this PR solve? Enhance OBConnection.search for better performance. Main changes: 1. Use string type of vector array in distance func for better parsing performance. 2. Manually set max_connections as pool size instead of using default value. 3. Set 'fulltext_search_columns' when starting. 4. Cache the results of the table existence check (we will never drop the table). 5. Remove unused 'group_results' logic. 6. Add the `USE_FULLTEXT_FIRST_FUSION_SEARCH` flag, and the corresponding fusion search SQL when it's false. ### Type of change - [x] Performance Improvement	2025-12-10 19:13:37 +08:00
buua436	3cb72377d7	Refa:remove sensitive information (#11873 ) ### What problem does this PR solve? change: remove sensitive information ### Type of change - [x] Refactoring	2025-12-10 19:08:45 +08:00
buua436	ab4b62031f	Fix:csv parse in Table (#11870 ) ### What problem does this PR solve? change: csv parse in Table ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-10 16:44:06 +08:00
buua436	65a5a56d95	Refa:replace trio with asyncio (#11831 ) ### What problem does this PR solve? change: replace trio with asyncio ### Type of change - [x] Refactoring	2025-12-09 19:23:14 +08:00
Magicbook1108	ca2d6f3301	Fix: duplicate output by async_chat_streamly (#11842 ) ### What problem does this PR solve? Fix: duplicate output by async_chat_streamly Refact: revert manual modification ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-09 19:21:52 +08:00
Yongteng Lei	a94b3b9df2	Refa: treat MinerU as an OCR model (#11849 ) ### What problem does this PR solve? Treat MinerU as an OCR model. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-12-09 18:54:14 +08:00
N0bodycan	9863862348	fix: prevent redundant retries in async_chat_streamly upon success (#11832 ) ## What changes were proposed in this pull request? Added a return statement after the successful completion of the async for loop in async_chat_streamly. ## Why are the changes needed? Previously, the code lacked a break/return mechanism inside the try block. This caused the retry loop (for attempt in range...) to continue executing even after the LLM response was successfully generated and yielded, resulting in duplicate requests (up to max_retries times). ## Does this PR introduce any user-facing change? No (it fixes an internal logic bug).	2025-12-09 17:14:30 +08:00
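The retry bug described in the commit above can be reproduced with a fake stream. This is a sketch under stated assumptions: `fake_stream`, `stream_with_retries`, and the `return_on_success` switch are hypothetical and exist only to contrast the two behaviors; they are not the signature of `async_chat_streamly`.

```python
import asyncio


async def fake_stream(calls):
    """Stand-in for one LLM request; records that a request was made."""
    calls.append(1)
    for token in ("Hel", "lo"):
        yield token


async def stream_with_retries(calls, max_retries=3, return_on_success=True):
    for _attempt in range(max_retries):
        try:
            async for token in fake_stream(calls):
                yield token
            if return_on_success:
                return  # the fix: stop retrying once the stream completed
        except ConnectionError:
            continue  # only a failure should lead to another attempt


async def collect(return_on_success):
    calls = []
    tokens = [
        t async for t in stream_with_retries(calls, return_on_success=return_on_success)
    ]
    return tokens, len(calls)
```

Running `asyncio.run(collect(False))` issues one request per retry slot and repeats the output, while `asyncio.run(collect(True))` issues a single request.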
Zhichang Yu	bb6022477e	Bump infinity to v0.6.11. Requires python>=3.11 (#11814 ) ### What problem does this PR solve? Bump infinity to v0.6.11. Requires python>=3.11 ### Type of change - [x] Refactoring	2025-12-09 16:23:37 +08:00
Yongteng Lei	c51e6b2a58	Refa: migrate CV model chat to Async (#11828 ) ### What problem does this PR solve? Migrate CV model chat to Async. #11750 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2025-12-09 13:08:37 +08:00
Stephen Hu	481192300d	Fix:[ERROR][Exception]: list index out of range (#11826 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/11821 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-09 09:58:34 +08:00
buua436	dd046be976	Fix: parent-child chunking method (#11810 ) ### What problem does this PR solve? change: parent-child chunking method ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-09 09:34:01 +08:00
Kevin Hu	09a3854ed8	Fix: chunk method error. (#11807 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-08 14:28:23 +08:00
Jin Hai	43f51baa96	Fix errors (#11804 ) ### What problem does this PR solve? 1. typos 2. grammar errors. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-08 12:21:18 +08:00
Stephen Hu	b66881a371	Refactor:book parser use with to handle bytesIO (#11800 ) ### What problem does this PR solve? book parser use with to handle bytesIO ### Type of change - [x] Refactoring	2025-12-08 10:18:46 +08:00
Yongteng Lei	51ec708c58	Refa: cleanup synchronous functions in chat_model and implement synchronization for conversation and dialog chats (#11779 ) ### What problem does this PR solve? Cleanup synchronous functions in chat_model and implement synchronization for conversation and dialog chats. ### Type of change - [x] Refactoring - [x] Performance Improvement	2025-12-08 09:43:03 +08:00
buua436	9b8971a9de	Fix:toc in pipeline (#11785 ) ### What problem does this PR solve? change: Fix toc in pipeline ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-08 09:42:20 +08:00
少卿	7719fd6350	Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702 ) ### What problem does this PR solve? This PR addresses two independent issues encountered when using the MinerU engine in Ragflow: 1. MinerU API output path mismatch for non-ASCII filenames MinerU sanitizes the root directory name inside the returned ZIP when the original filename contains non-ASCII characters (e.g., Chinese). Ragflow's client-side unzip logic assumed the original filename stem and therefore failed to locate `_content_list.json`. This PR adds: * root-directory detection * fallback lookup using sanitized names * a broadened `_read_output` search with a glob fallback ensuring output files are consistently located regardless of filename encoding. 2. Chunker crash due to tuple-structure mismatch in manual mode Some parsers (e.g., MinerU / Docling) return 2-tuple sections, but Ragflow’s chunker expects 3-tuple sections, leading to: `ValueError: not enough values to unpack (expected 3, got 2)` This PR normalizes all sections to a uniform structure `(text, layout, positions)`: * parse position tags when present * default to empty positions when missing preserving backward compatibility and preventing crashes. ### Type of change * [x] Bug Fix (non-breaking change which fixes an issue) [#11136](https://github.com/infiniflow/ragflow/issues/11136) [#11700](https://github.com/infiniflow/ragflow/issues/11700) [#11620](https://github.com/infiniflow/ragflow/issues/11620) [#11701](https://github.com/infiniflow/ragflow/pull/11701) we need your help [yongtenglei](https://github.com/yongtenglei) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-05 19:25:45 +08:00
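The tuple normalization described in the commit above can be sketched as below. Several details are assumptions: `normalize_section` is a hypothetical name, the second element of a 2-tuple is treated as the layout value, and parsing of position tags is left out because the tag format is not given in the message.

```python
def normalize_section(section):
    """Coerce a parser section to a uniform (text, layout, positions) triple."""
    if isinstance(section, str):
        return section, "", []
    if len(section) == 2:
        text, layout = section  # assumption: second element is the layout
        return text, layout, []  # default to empty positions when missing
    text, layout, positions = section
    return text, layout, positions


def normalize_sections(sections):
    return [normalize_section(s) for s in sections]
```

After normalization, a loop such as `for text, layout, positions in normalize_sections(sections)` no longer fails with "not enough values to unpack" on 2-tuple input.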
Magicbook1108	4012d65b3c	Feat: update front end for confluence connector (#11747 ) ### What problem does this PR solve? Feat: update front end for confluence connector ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-04 18:49:13 +08:00
Magicbook1108	e2bc1a3478	Feat: add more attribute for confluence connector. (#11743 ) ### What problem does this PR solve? Feat: add more attribute for confluence connector. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-04 17:28:03 +08:00
qinling0210	ca4a0ee1b2	Remove huqie.txt from RAGFlow and bump infinity to 0.6.10 (#11661) ### What problem does this PR solve? huqie.txt and huqie.txt.trie were added to infinity-sdk in https://github.com/infiniflow/infinity/pull/3127. This PR removes huqie.txt from RAGFlow and bumps infinity to 0.6.10. ### Type of change - [x] Refactoring	2025-12-04 14:53:57 +08:00
Yongteng Lei	27b0550876	Refa: cleanup synchronous functions in agent_with_tools (#11736 ) ### What problem does this PR solve? Cleanup synchronous functions in agent_with_tools. ### Type of change - [x] Refactoring	2025-12-04 14:15:05 +08:00

1155 Commits