ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-01-30 15:16:45 +08:00

Author	SHA1	Message	Date
Jin Hai	766d900a41	Refactor: rename rmSpace to remove_redundant_spaces (#10796 ) ### What problem does this PR solve? - rename rmSpace to remove_redundant_spaces - move clean_markdown_block to common module - add unit tests for remove_redundant_spaces and clean_markdown_block ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-10-28 09:46:32 +08:00
Billy Bao	e59458c36b	Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819 ) ### What problem does this PR solve? Fix: parsing excel with chartsheet #10815 Fix: Clamp begin to a minimum of 0 to prevent negative indexing #10804 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-28 09:40:37 +08:00
Billy Bao	501b7d4d01	Fix: prioritize synonym match over WordNet for English (#10762 ) ### What problem does this PR solve? Fix: prioritize synonym match over WordNet for English ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-27 09:32:55 +08:00
Stephen Hu	1d57801c0c	Fix:ERROR 20 Method rag.nlp.search.Dealer.search() parameter highlight="None" violates type hint bool | list, as <class "builtins.NoneType"> "None" not list or bool. (#10743 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/10733 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-27 09:29:39 +08:00
Kevin Hu	ea73f13ebf	Fix: infinity rerank error. (#10760 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-23 17:38:54 +08:00
Billy Bao	863c3e3d9c	Fix: tree merge (#10691 ) ### What problem does this PR solve? Fix: Fix tree merge, solved #10636 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-21 13:02:01 +08:00
Kevin Hu	43ea312144	Fix: search highlight. (#10616 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-16 18:45:43 +08:00
Zhichang Yu	e48bec1cbf	Don't rerank for infinity (#10579 ) ### What problem does this PR solve? Don't need rerank for infinity since Infinity normalizes each way score before fusion. ### Type of change - [x] Refactoring	2025-10-15 20:15:49 +08:00
Kevin Hu	7d2f65671f	Feat: debugging toc part. (#10486 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:45:21 +08:00
Kevin Hu	0d8791936e	Feat: TOC retrieval (#10456 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-10 17:07:55 +08:00
Kevin Hu	cbf04ee470	Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423 ) ### What problem does this PR solve? #9869 ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: jinhai <haijin.chn@gmail.com> Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: chanx <1243304602@qq.com> Co-authored-by: balibabu <cike8899@users.noreply.github.com> Co-authored-by: Lynn <lynn_inf@hotmail.com> Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com> Co-authored-by: huangzl <huangzl@shinemo.com> Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com> Co-authored-by: Wilmer <33392318@qq.com> Co-authored-by: Adrian Weidig <adrianweidig@gmx.net> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Yongteng Lei <yongtengrey@outlook.com> Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com> Co-authored-by: BadwomanCraZY <511528396@qq.com> Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com> Co-authored-by: Russell Valentine <russ@coldstonelabs.org> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Billy Bao <newyorkupperbay@gmail.com> Co-authored-by: Zhedong Cen <cenzhedong2@126.com> Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com> Co-authored-by: TensorNull <tensor.null@gmail.com> Co-authored-by: TeslaZY <TeslaZY@outlook.com> Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com> Co-authored-by: AB <aj@Ajays-MacBook-Air.local> Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com> Co-authored-by: He Wang <wanghechn@qq.com> Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Mohamed Mathari <155896313+melmathari@users.noreply.github.com> Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box> Co-authored-by: Stephen Hu <stephenhu@seismic.com> Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com> Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com> Co-authored-by: mxc <mxc@example.com> Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com> Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com> Co-authored-by: mcoder6425 <mcoder64@gmail.com> Co-authored-by: lemsn <lemsn@msn.com> Co-authored-by: lemsn <lemsn@126.com> Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com> Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com> Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>	2025-10-09 12:36:19 +08:00
Liu An	9e323a9351	Feat(nlp): add "怎么办" pattern to question word removal (#10284 ) ### What problem does this PR solve? Added "怎么办" to the regex pattern in rmWWW method to improve query cleaning by removing this common question phrase along with other question words. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-25 16:47:56 +08:00
Billy Bao	ca9f30e1a1	Add tree_merge for law parsers, significantly outperforming hierarchical_merge (#10202 ) ### What problem does this PR solve? Add tree_merge for law parsers, significantly outperforming hierarchical_merge, solved: #8637 1. Add tree_merge for law parsers, including build_tree and get_tree by DFS. 2. Add copyright statement for health_utils ### Type of change - [x] Documentation Update - [x] Performance Improvement	2025-09-22 16:33:21 +08:00
Yongteng Lei	0d9c1f1c3c	Feat: dataflow supports Spreadsheet and Word processor document (#9996 ) ### What problem does this PR solve? Dataflow supports Spreadsheet and Word processor document ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-10 13:02:53 +08:00
Kevin Hu	e9ee9269f5	Feat: user defined prompt. (#9972 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-08 14:05:01 +08:00
Stephen Hu	4e16936fa4	Refactor: Use re compile for weight method (#9929 ) ### What problem does this PR solve? Use re compile for the weight method ### Type of change - [x] Refactoring - [x] Performance Improvement	2025-09-05 12:29:44 +08:00
Kevin Hu	c27172b3bc	Feat: init dataflow. (#9791 ) ### What problem does this PR solve? #9790 Close #9782 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-28 18:40:32 +08:00
Jin Hai	5abd0bbac1	Fix typo (#9766 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-08-27 18:56:40 +08:00
Kevin Hu	b5b8032a56	Feat: Support metadata auto filter for Search. (#9524 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-19 10:27:24 +08:00
Kevin Hu	153e430b00	Feat: add meta data filter. (#9405 ) ### What problem does this PR solve? #8531 #7417 #6761 #6573 #6477 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-12 14:12:56 +08:00
Stephen Hu	0a0bfc02a0	Refactor:naive_merge_with_images close useless images (#9296 ) ### What problem does this PR solve? naive_merge_with_images close useless images ### Type of change - [x] Refactoring	2025-08-07 11:07:29 +08:00
Stephen Hu	7efeaf6548	Fix:remove a img close which can not operate (#9267 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/9149#issuecomment-3157129587 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-06 10:59:49 +08:00
Stephen Hu	667c5812d0	Fix:Repeated images when parsing markdown files with images (#9196 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/9149 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-04 13:35:58 +08:00
Kevin Hu	d9fe279dde	Feat: Redesign and refactor agent module (#9113 ) ### What problem does this PR solve? #9082 #6365 <u> WARNING: it's not compatible with the older version of `Agent` module, which means that `Agent` from older versions can not work anymore.</u> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-07-30 19:41:09 +08:00
Zhichang Yu	342a04ec8a	Added infinity rank_feature support (#9044 ) ### What problem does this PR solve? Added infinity rank_feature support ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-29 09:14:23 +08:00
Yongteng Lei	dbc2a8689a	Fix: no chunks parsed out for Law (#8842 ) ### What problem does this PR solve? Fixes no chunks parsed out for Law. #5113 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-15 13:01:56 +08:00
Stephen Hu	f569401398	Fix: better_handle_different_types (#8775 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8719#issuecomment-3055883271 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-11 18:21:39 +08:00
Stephen Hu	00c954755e	Fix:use the same logic to handle pos in tokenize_chunks_with_images (#8732 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8719 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-09 09:31:40 +08:00
Stephen Hu	8af0d04ad0	Refactor:Improve the logic in search.py (#8716 ) ### What problem does this PR solve? 1. Remove the useless pop logic due to already been checked at the if logic 2. merge log logic ### Type of change - [x] Refactoring	2025-07-08 12:32:01 +08:00
Yongteng Lei	b705ff08fe	Refa: improve GraphRAG similarity sensitivity to numeric differences (#8479 ) ### What problem does this PR solve? Improve GraphRAG similarity sensitivity to numeric differences. #8444. ### Type of change - [x] Refactoring	2025-06-25 16:20:59 +08:00
Kevin Hu	d4e6e2bd21	Fix: doc_aggs issue. (#8418 ) ### What problem does this PR solve? #8406 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-23 14:54:01 +08:00
Wesley	3d0b440e9f	fix(search.py):remove hard page_size (#8242 ) ### What problem does this PR solve? Fix the restriction of forcing similarity_threshold=0 and page_size=30 when doc_ids is not empty #8228 --------- Co-authored-by: shiqing.wusq <shiqing.wusq@dtzhejiang.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-06-13 14:56:25 +08:00
Kevin Hu	93f5df716f	Fix: order chunks from docx by positions. (#7979 ) ### What problem does this PR solve? #7934 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-30 17:20:53 +08:00
Yongteng Lei	bd4678bca6	Fix: Unnecessary truncation in markdown parser (#7972 ) ### What problem does this PR solve? Fix unnecessary truncation in markdown parser. So that markdown can work perfectly like [this](https://github.com/infiniflow/ragflow/issues/7824#issuecomment-2921312576) in #7824, supporting multiple special delimiters. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-30 15:04:21 +08:00
Yongteng Lei	46963ab1ca	Fix: add advanced delimiter detection for naive merge (#7941 ) ### What problem does this PR solve? Add advanced delimiter detection for naive merge. #7824 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-05-29 16:17:22 +08:00
Yongteng Lei	0c562f0a9f	Refa: change citation mark as [ID:n] (#7923 ) ### What problem does this PR solve? Change citation mark as [ID:n], it's easier for LLMs to follow the instruction :) #7904 ### Type of change - [x] Refactoring	2025-05-29 10:03:51 +08:00
Sol	0d7cfce6e1	Update rag/nlp/query.py (#7816 ) ### What problem does this PR solve? Fix tokenizer resulting in low recall ![37743d3a495f734aa69f1e173fa77457](https://github.com/user-attachments/assets/1394757e-8fcb-4f87-96af-a92716144884) ![4aba633a17f34269a4e17e84fafb34c4](https://github.com/user-attachments/assets/a1828e32-3e17-4394-a633-ba3f09bd506d) ![image](https://github.com/user-attachments/assets/61308f32-2a4f-44d5-a034-d65bbec554ef) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-05-23 17:13:37 +08:00
Stephen Hu	db4371c745	Fix: Improve First Chunk Size (#7806 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/7790 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-23 14:30:19 +08:00
Emmanuel Ferdman	d4a123d6dd	Fix: resolve regex library warnings (#7782 ) ### What problem does this PR solve? This small PR resolves the regex library warnings showing in Python3.11: ```python DeprecationWarning: 'count' is passed as positional argument ``` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-22 10:06:28 +08:00
Kevin Hu	321a280031	Feat: add image preview to retrieval test. (#7610 ) ### What problem does this PR solve? #7608 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-05-13 14:30:36 +08:00
Stephen Hu	573d46a4ef	FIX:ZeroDivisionError when using large page_size in client.retrieve() (#7595 ) ### What problem does this PR solve? Close #7592 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-13 10:46:31 +08:00
Kevin Hu	a14865e6bb	Fix: empty query issue. (#7551 ) ### What problem does this PR solve? #5214 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-09 12:20:19 +08:00
Kevin Hu	c7310f7fb2	Refa: similarity calculations. (#7381 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-04-28 19:17:11 +08:00
Stephen Hu	1662c7eda3	Feat: Markdown add image (#7124 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/6984 1. Markdown parser supports get pictures 2. For Native, when handling Markdown, it will handle images 3. improve merge and ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-04-25 18:35:28 +08:00
Yongteng Lei	67dee2d74e	Fix: fix retrieval testing wrong pagination (#7174 ) ### What problem does this PR solve? Fix retrieval testing wrong pagination. #7171 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-04-22 15:16:04 +08:00
alulala	d9266ed65a	Fix: incorrect total chunks count in retrieval function after similarity filtering (#6741 ) (#6932 ) ### Related Issue: https://github.com/infiniflow/ragflow/issues/6741 ### Environment: Using nightly version. Commit version: `6051abb4a3` ### Bug Description: The retrieval function in rag/nlp/search.py returns the original total chunk count even after chunks are filtered by similarity_threshold. This creates an inconsistency between the chunks actually returned and the reported total. ### Changes Made: Added code to count how many search results meet or exceed the configured similarity threshold. Positioned the calculation after the doc_ids conditional logic so that special cases are handled correctly. Updated the ranks["total"] value to store this filtered count instead of the raw search result count. Used NumPy's optimized C-level batch operations for speed.	2025-04-11 12:31:36 +08:00
kaiyuan Zhang	ead5f7aba9	Fix infinite recursion in RagTokenizer when processing repetitive characters (#6109 ) ### What problem does this PR solve? fix #6085 RagTokenizer's dfs_() function falls into infinite recursion when processing text with repetitive Chinese characters (e.g., "一一一一一十一十一十一..." or "一一一一一一十十十十十十十二十二十二..."), causing memory leaks. ### Type of change Implemented three optimizations to the dfs_() function: 1. Added memoization with _memo dictionary to cache computed results 2. Added recursion depth limiting with _depth parameter (max 10 levels) 3. Implemented special handling for repetitive character sequences - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-04-01 13:59:52 +08:00
Kevin Hu	0758c04941	Refa: token similarity calculations. (#6614 ) ### What problem does this PR solve? #6507 ### Type of change - [x] Performance Improvement	2025-03-28 09:33:08 +08:00
Kevin Hu	cc8029a732	Fix: uploading in chat box issue. (#6547 ) ### What problem does this PR solve? #6228 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-26 15:37:48 +08:00
Kevin Hu	ee5aa51d43	Fix: point in tag issue. (#6436 ) ### What problem does this PR solve? #6414 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-24 10:45:29 +08:00

215 Commits