## Summary
This PR fixes two critical bugs in the `chunk_list()` method that prevent large documents (>128 chunks) from being processed correctly in GraphRAG and other workflows.
## Bugs Fixed
### Bug 1: Incorrect pagination offset calculation
**Location:** `rag/nlp/search.py` lines 530-531
**Problem:** The loop variable `p` was used directly as the offset, causing incorrect pagination:
```python
# BEFORE (BUGGY):
for p in range(offset, max_count, bs):  # p = 0, 128, 256, 384...
    es_res = self.dataStore.search(..., p, bs, ...)  # p used as offset
```

**Fix:** Use the page number multiplied by the batch size:

```python
# AFTER (FIXED):
for page_num, p in enumerate(range(offset, max_count, bs)):
    es_res = self.dataStore.search(..., page_num * bs, bs, ...)
```
### Bug 2: Premature loop termination

**Location:** `rag/nlp/search.py` lines 538-539

**Problem:** The loop terminates as soon as any page returns fewer than 128 chunks, even when thousands more remain:

```python
# BEFORE (BUGGY):
if len(dict_chunks.values()) < bs:  # breaks at 126 chunks even if 3,000+ remain
    break
```

**Fix:** Only terminate when zero chunks are returned:

```python
# AFTER (FIXED):
if len(dict_chunks.values()) == 0:
    break
```
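
For context, here is a minimal, self-contained sketch of the paginated-fetch pattern the fixed loop follows; `fetch_page` and the parameter names are illustrative, not the real `dataStore.search` signature:

```python
# Minimal sketch (illustrative names, not the real dataStore.search signature):
# request pages of size bs, advance the offset by page number, and stop only
# when a page comes back empty, so a short page mid-stream cannot truncate
# the results.
def fetch_all(fetch_page, bs=128, max_count=10000):
    chunks = []
    for page_num in range(max_count // bs + 1):
        page = fetch_page(offset=page_num * bs, limit=bs)
        if len(page) == 0:  # terminate only on an empty page
            break
        chunks.extend(page)
    return chunks
```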
### Enhancement: Add max_count parameter to GraphRAG

**Location:** `graphrag/general/index.py` line 60

Added a `max_count=10000` parameter to chunk loading for both the LightRAG and General GraphRAG paths to ensure all chunks are processed.
## Testing

Validated with a 314-page legal document containing 3,207 chunks.

**Before the fixes:**
- Only 2-126 chunks processed
- GraphRAG generated 25 nodes and 8 edges

**After the fixes:**
- All 3,209 chunks processed ✅
- GraphRAG processes the complete dataset
## Impact

These bugs affect any workflow that uses `chunk_list()` with large documents, particularly:

- GraphRAG knowledge graph generation
- RAPTOR hierarchical summarization
- Document processing pipelines with >128 chunks
## Related Issue

Fixes #11687
## Checklist

- [x] Code follows project style guidelines
- [x] Tested with large documents (3,207+ chunks)
- [x] Both bugs validated by Dosu bot in issue #11687
- [x] No breaking changes to API
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
- The original rag/nlp/rag_tokenizer.py was moved into Infinity and infinity-sdk via https://github.com/infiniflow/infinity/pull/3117 . The new rag/nlp/rag_tokenizer.py imports rag_tokenizer from infinity and inherits from rag_tokenizer.RagTokenizer (see the sketch after this list).
- Bump infinity to 0.6.8
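
A hypothetical sketch of the new shim module described in the first bullet; the exact import path and whether any methods are overridden are assumptions:

```python
# rag/nlp/rag_tokenizer.py (hypothetical sketch): re-export the tokenizer now
# shipped with infinity-sdk so existing `from rag.nlp.rag_tokenizer import
# RagTokenizer` call sites keep working. The import path is assumed.
from infinity import rag_tokenizer as infinity_rag_tokenizer


class RagTokenizer(infinity_rag_tokenizer.RagTokenizer):
    """Thin wrapper over the infinity-sdk tokenizer; override methods here
    only if ragflow-specific behavior is needed."""
    pass
```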
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Fix: doc_aggs not correctly returned when no chunks retrieved.
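
An illustrative shape of the fix (field names follow the description; the actual response structure may differ): the aggregation field should still be present, if empty, when retrieval yields no chunks.

```python
# Illustrative only: always include doc_aggs in the retrieval result, even
# when no chunks were retrieved, instead of omitting it.
def build_retrieval_result(chunks, doc_aggs=None):
    return {
        "total": len(chunks),
        "chunks": chunks,
        "doc_aggs": doc_aggs if doc_aggs is not None else [],
    }
```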
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Ignore chunk size when using custom delimiter.
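
A rough sketch of the behavior (function and parameter names are hypothetical): when a custom delimiter is supplied, split on it directly rather than re-merging segments to hit a token budget.

```python
# Hypothetical sketch: a user-supplied delimiter wins over size-based chunking.
def split_chunks(text, delimiter=None, chunk_token_num=512):
    if delimiter:
        # Custom delimiter: honor it verbatim, ignoring the chunk size.
        return [seg for seg in text.split(delimiter) if seg.strip()]
    # Fallback: crude size-based chunking by whitespace tokens.
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_token_num])
            for i in range(0, len(tokens), chunk_token_num)]
```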
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Incorrect retrieval total count with pagination enabled.
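
Illustrative only (the real retrieval response is richer): the reported total should count all matches, not just the chunks on the requested page.

```python
# Sketch: total reflects every matched chunk, while only one page is returned.
def paginate(all_hits, page, page_size):
    start = (page - 1) * page_size
    return {
        "total": len(all_hits),                      # not the size of the slice
        "chunks": all_hits[start:start + page_size],
    }
```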
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add OceanBase doc engine. Close #5350
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fix: concat images in a Word document. Partially solves the issues in #11063.
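
A hedged sketch of the image-concatenation step (PIL-based, not the parser's actual code): stack the images extracted from a .docx vertically onto one canvas.

```python
# Illustrative: vertically concatenate extracted images into a single image.
from PIL import Image


def concat_images(images):
    if not images:
        return None
    width = max(img.width for img in images)
    height = sum(img.height for img in images)
    canvas = Image.new("RGB", (width, height), "white")
    y = 0
    for img in images:
        canvas.paste(img, (0, y))
        y += img.height
    return canvas
```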
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix: OpenSearch retrieval returning no results #11006
Add documentation #11072
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Documentation Update
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com>
### What problem does this PR solve?
Fix: OpenSearch retrieval error #10828
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
- rename rmSpace to remove_redundant_spaces
- move clean_markdown_block to common module
- add unit tests for remove_redundant_spaces and clean_markdown_block
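
A hedged sketch of the kind of unit tests added; the expected behaviors (collapsing runs of spaces, stripping a surrounding markdown code fence) are assumptions about the two helpers, and real tests would import them from the common module rather than redefining them.

```python
# Hedged sketch: stand-in implementations plus the style of tests added.
# Behaviors are assumed, not the helpers' documented contracts.
import re


def remove_redundant_spaces(text: str) -> str:
    # assumed behavior: collapse runs of spaces into a single space
    return re.sub(r" {2,}", " ", text)


def clean_markdown_block(text: str) -> str:
    # assumed behavior: strip a surrounding ```markdown ... ``` fence
    text = re.sub(r"^\s*```(?:markdown)?\s*", "", text)
    return re.sub(r"\s*```\s*$", "", text)


def test_remove_redundant_spaces():
    assert remove_redundant_spaces("a   b  c") == "a b c"


def test_clean_markdown_block():
    assert clean_markdown_block("```markdown\n# Title\n```") == "# Title"
```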
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Fix: parsing Excel files with a chartsheet #10815
Fix: Clamp begin to a minimum of 0 to prevent negative indexing #10804
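
The second fix boils down to a standard clamp; illustrative only, with the variable name taken from the description.

```python
def clamp_begin(begin: int) -> int:
    # Prevent negative indexing when a caller passes a negative begin value.
    return max(begin, 0)
```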
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix: prioritize synonym-table matches over WordNet for English
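
Illustrative ordering only (names are hypothetical, not the query-analysis module's API): check the configured synonym table first and fall back to WordNet only when it has no entry.

```python
# Hypothetical sketch: the domain synonym table takes priority over WordNet.
def lookup_synonyms(term, synonym_table, wordnet_lookup):
    hits = synonym_table.get(term, [])
    if hits:
        return hits               # configured synonyms win
    return wordnet_lookup(term)   # WordNet only as a fallback
```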
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Reranking is not needed for Infinity, since Infinity normalizes each way's score before fusion.
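
The rationale in one line of arithmetic (illustrative weights, not Infinity's actual fusion code): once each retrieval way's scores are normalized to a common range before fusion, a weighted sum is already comparable across ways, so no separate rerank pass is required.

```python
# Illustrative: fusing pre-normalized scores (both assumed in [0, 1]) directly.
def fuse(text_score: float, vector_score: float, vector_weight: float = 0.5) -> float:
    return (1 - vector_weight) * text_score + vector_weight * vector_score
```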
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Added "怎么办" to the regex pattern in rmWWW method to improve query
cleaning by removing this common question phrase along with other
question words.
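
A minimal illustration of the cleaning step (the pattern below is a simplified stand-in, not the actual rmWWW regex):

```python
# Simplified stand-in for the rmWWW cleanup: strip common Chinese question
# words, including the newly added "怎么办", from the query.
import re

QUESTION_WORDS = re.compile(r"(怎么办|怎么|如何|为什么|什么)")


def rm_question_words(query: str) -> str:
    return QUESTION_WORDS.sub("", query)


print(rm_question_words("服务器宕机了怎么办"))  # -> "服务器宕机了"
```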
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add tree_merge for law parsers, significantly outperforming hierarchical_merge. Solves #8637.
1. Add tree_merge for law parsers, including build_tree and get_tree via DFS (see the sketch below this list).
2. Add a copyright statement for helath_utils.
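
A hedged sketch of the tree_merge idea referenced in item 1 (names, the level convention, and the merging heuristic are illustrative, not the parser's real API): build a tree of clauses under their parent headings, then emit merged chunks by depth-first traversal.

```python
# Illustrative only: group statute clauses under their headings (build_tree)
# and emit one chunk per leaf, carrying its heading path, via DFS (get_tree).
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int                      # smaller level = higher-ranking heading
    children: list = field(default_factory=list)


def build_tree(sections):
    """sections: list of (text, level) pairs in document order."""
    root = Node("", 0)
    stack = [root]
    for text, level in sections:
        node = Node(text, level)
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def get_tree(node, prefix=""):
    """DFS flatten: each leaf becomes a chunk prefixed by its heading path."""
    path = (prefix + " " + node.text).strip()
    if not node.children:
        return [path] if path else []
    chunks = []
    for child in node.children:
        chunks.extend(get_tree(child, path))
    return chunks


sections = [("第一章 总则", 1), ("第一条 ……", 2), ("第二条 ……", 2), ("第二章 附则", 1)]
print(get_tree(build_tree(sections)))
```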
### Type of change
- [x] Documentation Update
- [x] Performance Improvement
### What problem does this PR solve?
Dataflow supports Spreadsheet and Word processor documents.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
#9082 #6365
<u>**WARNING: this is not compatible with older versions of the `Agent` module, which means that `Agent` from older versions can no longer work.**</u>
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fixes the issue where no chunks are parsed out for Law. #5113
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)