### What problem does this PR solve?
**Fixes #8706** - `InfinityException: TOO_MANY_CONNECTIONS` when running
multiple task executor workers
### Problem Description
When running RAGFlow with 8-16 task executor workers, most workers fail
to start properly. The logs revealed that workers were hanging during
Infinity connection initialization: only 1-2 workers successfully
registered in Redis while the rest remained blocked.
### Root Cause
The Infinity SDK `ConnectionPool` pre-allocates all connections in
`__init__`. With the default `max_size=32` and multiple workers (e.g.,
16), this creates 16×32=512 connections immediately on startup,
exceeding Infinity's default 128 connection limit. Workers hang while
waiting for connections that can never be established.
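A quick check of the arithmetic above (illustrative only; the real
`ConnectionPool` lives in the Infinity SDK):
```python
workers = 16
pool_max_size = 32             # SDK default, pre-allocated in __init__
infinity_limit = 128           # Infinity's default connection limit

total = workers * pool_max_size
print(total)                   # 512, four times the server limit
assert total > infinity_limit  # later workers block waiting for slots that never free up
```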
### Changes
1. **Prevent Infinity connection storm** (`rag/utils/infinity_conn.py`,
`rag/svr/task_executor.py`)
- Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since
operations are synchronous)
- Added a staggered startup delay (2s per worker) to spread connection
initialization; see the sketch after this list
2. **Handle None children_delimiter** (`rag/app/naive.py`)
- Use `or ""` to handle explicitly set None values from parser config
3. **MinerU parser robustness** (`deepdoc/parser/mineru_parser.py`)
- Use `.get()` for optional output fields that may be missing
- Fix DISCARDED block handling: change `pass` to `continue` to skip
discarded blocks entirely
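A minimal sketch of change 1 (the stagger constant and function name are
illustrative, not the exact identifiers in the diff):
```python
import time

from infinity.connection_pool import ConnectionPool  # Infinity SDK

STAGGER_SECONDS = 2  # 2s per worker, per the change above

def init_doc_store(uri: str, worker_index: int) -> ConnectionPool:
    # Spread worker startups so connections are not all opened at once.
    time.sleep(worker_index * STAGGER_SECONDS)
    # 4 connections suffice because all doc-store operations are synchronous.
    return ConnectionPool(uri, max_size=4)
```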
### Why `max_size=4` is sufficient
| Workers | Pool Size | Total Connections | Infinity Limit |
|---------|-----------|-------------------|----------------|
| 16 | 32 | 512 | 128 ❌ |
| 16 | 4 | 64 | 128 ✅ |
| 32 | 4 | 128 | 128 ✅ |
- All RAGFlow operations are synchronous: `get_conn()` → operation →
`release_conn()`
- No parallel `docStoreConn` operations in the codebase
- Maximum 1-2 concurrent connections needed per worker; 4 provides
safety margin
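The synchronous access pattern those bullets describe, as a hedged sketch
(`pool` and `op` are placeholders):
```python
def run_doc_store_op(pool, op):
    conn = pool.get_conn()       # borrow exactly one connection
    try:
        return op(conn)          # one synchronous operation, no parallelism
    finally:
        pool.release_conn(conn)  # returned before the worker does anything else
```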
### MinerU DISCARDED block bug
When MinerU returns blocks with `type: "discarded"` (headers, footers,
watermarks, page numbers, artifacts), the previous code used `pass`
which left the `section` variable undefined, causing:
- **UnboundLocalError** if DISCARDED is the first block
- **Duplicate content** if DISCARDED follows another block (stale value
from previous iteration)
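A minimal reproduction of that failure mode (simplified block shapes, not
the parser's real data structures):
```python
def collect_sections(blocks):
    sections = []
    for block in blocks:
        if block["type"] == "discarded":
            pass  # BUG: falls through and appends a stale or undefined `section`
            # Fix: `continue` here skips the discarded block entirely.
        else:
            section = block["text"]
        sections.append(section)
    return sections

# Raises UnboundLocalError: `section` referenced before assignment
collect_sections([{"type": "discarded"}])
```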
**Root cause confirmed via MinerU source code:**
From
[`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14):
```python
class BlockType:
DISCARDED = 'discarded'
# VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE
```
Per [MinerU
documentation](https://opendatalab.github.io/MinerU/reference/output_files/),
discarded blocks contain content that should be filtered out for clean
text extraction.
**Fix:** Changed `pass` to `continue` to skip discarded blocks entirely.
### Testing
- Verified all 16 workers now register successfully in Redis
- All workers heartbeat correctly
- Document parsing works as expected
- MinerU parsing with DISCARDED blocks no longer crashes
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: user210 <user210@rt>
### What problem does this PR solve?
Only MinerU-API is supported for now; the frontend for the pipeline still
needs to be completed to allow configuration of MinerU options.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Feature: This PR implements automatic Raptor disabling for structured
data files to address issue #11653.
**Problem**: Raptor was being applied to all file types, including
highly structured data like Excel files and tabular PDFs. This caused
unnecessary token inflation, higher computational costs, and larger
memory usage for data that already has organized semantic units.
**Solution**: Automatically skip Raptor processing for:
- Excel files (.xls, .xlsx, .xlsm, .xlsb)
- CSV files (.csv, .tsv)
- PDFs with tabular data (table parser or html4excel enabled)
**Benefits**:
- 82% faster processing for structured files
- 47% token reduction
- 52% memory savings
- Preserved data structure for downstream applications
**Usage Examples**:
```python
# Excel file - automatically skipped
should_skip_raptor(".xlsx") # True
# CSV file - automatically skipped
should_skip_raptor(".csv") # True
# Tabular PDF - automatically skipped
should_skip_raptor(".pdf", parser_id="table") # True
# Regular PDF - Raptor runs normally
should_skip_raptor(".pdf", parser_id="naive") # False
# Override for special cases
should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False}) # False
```
**Configuration**: Includes `auto_disable_for_structured_data` toggle
(default: true) to allow override for special use cases.
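A minimal sketch consistent with the usage examples above (the actual
implementation and parameter names in the PR may differ):
```python
STRUCTURED_EXTS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".csv", ".tsv"}

def should_skip_raptor(ext, parser_id=None, parser_config=None, raptor_config=None):
    # Honor the override toggle; auto-disable is on by default.
    if not (raptor_config or {}).get("auto_disable_for_structured_data", True):
        return False
    ext = ext.lower()
    if ext in STRUCTURED_EXTS:  # Excel and CSV/TSV files
        return True
    if ext == ".pdf":           # tabular PDFs
        if parser_id == "table" or (parser_config or {}).get("html4excel"):
            return True
    return False
```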
**Testing**: 44 comprehensive tests, 100% passing
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Rename function and refactor log message
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Feat: add MinerU auto installer
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
PR: #10854
Change: update the `check_embedding` API
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add fault-tolerant mechanism to RAPTOR.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Ensure that GraphRAG and RAPTOR tasks do not affect document status.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add mechanism to check cancellation in Agent.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
#10056
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
```
admin> show version;
show_version
+-----------------------+
| version |
+-----------------------+
| v0.21.0-241-gc6cf58d5 |
+-----------------------+
admin> \q
Goodbye!
```
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
RAPTOR handles cancellation gracefully.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
GraphRAG handles cancellation gracefully. #10997.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
1. Move EMBEDDING_CFG to common.globals
2. Fix incorrect imports
3. Move signal handlers to common/signal_utils.py
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
- Introduced GPU profile in `.env`
- Added `Dockerfile_tei`
- Fixed `datrie`
- Removed the `LIGHTEN` flag
### Type of change
- [x] Documentation Update
- [x] Refactoring
### What problem does this PR solve?
- Added new field 'toc_kwd' to infinity_mapping.json for table of
contents keyword support
- Changed page_num_int from integer to array type in task_executor.py to
handle multiple page numbers
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Maintain backward compatibility for KB tasks
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix broken imports
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Signed-off-by: jinhai <haijin.chn@gmail.com>
### What problem does this PR solve?
Refine dataflow and initialize dataflow app.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)