ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-01-28 14:16:34 +08:00

Author	SHA1	Message	Date
Pegasus	d8192f8f17	Fix: validate regex pattern in split_with_pattern to prevent crash (#12633 ) ### What problem does this PR solve? Fix regex pattern validation in split_with_pattern (#12605) - Add try-except block to validate user-provided regex patterns before use - Gracefully fallback to single chunk when invalid regex is provided - Prevent server crash during DOCX parsing with malformed delimiters ## Problem Parsing DOCX files with custom regex delimiters crashes with `re.error: nothing to repeat at position 9` when users provide invalid regex patterns. Closes #12605 ## Solution Validate and compile regex pattern before use. On invalid pattern, log warning and return content as single chunk instead of crashing. ## Changes - `rag/nlp/__init__.py`: Add regex validation in `split_with_pattern()` function ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=42954461	2026-01-15 14:24:51 +08:00
lys1313013	b226e06e2d	refactor: remove debug print statements (#12534 ) ### What problem does this PR solve? refactor: remove debug print statements ### Type of change - [x] Refactoring	2026-01-09 19:23:50 +08:00
Magicbook1108	011bbe9556	Feat: support context window for docx (#12455 ) ### What problem does this PR solve? Feat: support context window for docx #12303 Done: - [x] naive.py - [x] one.py TODO: - [ ] book.py - [ ] manual.py Fix: incorrect image position Fix: incorrect chunk type tag ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-01-07 15:08:17 +08:00
Yongteng Lei	4cd4526492	Feat: PDF vision figure parser supports reading context (#12416 ) ### What problem does this PR solve? PDF vision figure parser supports reading context. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-05 09:55:43 +08:00
Kevin Hu	52f91c2388	Refine: image/table context. (#12336 ) ### What problem does this PR solve? #12303 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-30 20:24:27 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Yongteng Lei	51bc41b2e8	Refa: improve image table context (#12244 ) ### What problem does this PR solve? Improve image table context. Current strategy in attach_media_context: - Order by position when possible: if any chunk has page/position info, sort by (page, top, left), otherwise keep original order. - Apply only to media chunks: images use image_context_size, tables use table_context_size. - Primary matching: on the same page, choose a text chunk whose vertical span overlaps the media, then pick the one with the closest vertical midpoint. - Fallback matching: if no overlap on that page, choose the nearest text chunk on the same page (page-head uses the next text; page-tail uses the previous text). - Context extraction: inside the chosen text chunk, find a mid-sentence boundary near the text midpoint, then take context_size tokens split before/after (total budget). - No multi-chunk stitching: context comes from a single text chunk to avoid mixing unrelated segments. ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-26 17:55:32 +08:00
Kevin Hu	8197f9a873	Fix: table tag on chunks. (#12126 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 11:25:38 +08:00
Kevin Hu	8e4d011b15	Fix: parent-children chunking method. (#11997 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-12-17 16:50:36 +08:00
qinling0210	2ffe6f7439	Import rag_tokenizer from Infinity (#11647 ) ### What problem does this PR solve? - Original rag/nlp/rag_tokenizer.py is put to Infinity and infinity-sdk via https://github.com/infiniflow/infinity/pull/3117 . Import rag_tokenizer from infinity and inherit from rag_tokenizer.RagTokenizer in new rag/nlp/rag_tokenizer.py. - Bump infinity to 0.6.8 ### Type of change - [x] Refactoring	2025-12-02 14:59:37 +08:00
Kevin Hu	14616cf845	Feat: add child parent chunking method in backend. (#11598 ) ### What problem does this PR solve? #7996 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-28 19:25:32 +08:00
Yongteng Lei	9d8b96c1d0	Feat: add context for figure and table (#11547 ) ### What problem does this PR solve? Add context for figure table. ![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e) `==================()` for demonstrating purpose. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-27 10:21:44 +08:00
Kevin Hu	74e0b58d89	Fix: excel default optimization. (#11519 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 19:54:20 +08:00
Yongteng Lei	db0f6840d9	Feat: ignore chunk size when using custom delimiters (#11434 ) ### What problem does this PR solve? Ignore chunk size when using custom delimiter. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-21 14:36:26 +08:00
Billy Bao	7264fb6978	Fix: concat images in word document. (#11310 ) ### What problem does this PR solve? Fix: concat images in word document. Partially solved issues in #11063 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-17 19:38:26 +08:00
Jin Hai	bd4bc57009	Refactor: move mcp connection utilities to common (#11304 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-17 15:34:17 +08:00
Billy Bao	93422fa8cc	Fix: Law parser (#11246 ) ### What problem does this PR solve? Fix: Law parser ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-13 15:19:02 +08:00
Jin Hai	360f5c1179	Move token related functions to common (#10942 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 08:50:05 +08:00
Billy Bao	863c3e3d9c	Fix: tree merge (#10691 ) ### What problem does this PR solve? Fix: Fix tree merge, solved #10636 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-21 13:02:01 +08:00
Kevin Hu	7d2f65671f	Feat: debugging toc part. (#10486 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:45:21 +08:00
Kevin Hu	cbf04ee470	Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423 ) ### What problem does this PR solve? #9869 ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: jinhai <haijin.chn@gmail.com> Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: chanx <1243304602@qq.com> Co-authored-by: balibabu <cike8899@users.noreply.github.com> Co-authored-by: Lynn <lynn_inf@hotmail.com> Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com> Co-authored-by: huangzl <huangzl@shinemo.com> Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com> Co-authored-by: Wilmer <33392318@qq.com> Co-authored-by: Adrian Weidig <adrianweidig@gmx.net> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Yongteng Lei <yongtengrey@outlook.com> Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com> Co-authored-by: BadwomanCraZY <511528396@qq.com> Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com> Co-authored-by: Russell Valentine <russ@coldstonelabs.org> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Billy Bao <newyorkupperbay@gmail.com> Co-authored-by: Zhedong Cen <cenzhedong2@126.com> Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com> Co-authored-by: TensorNull <tensor.null@gmail.com> Co-authored-by: TeslaZY <TeslaZY@outlook.com> Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com> Co-authored-by: AB <aj@Ajays-MacBook-Air.local> Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com> Co-authored-by: He Wang <wanghechn@qq.com> Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Mohamed Mathari <155896313+melmathari@users.noreply.github.com> Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box> Co-authored-by: Stephen Hu <stephenhu@seismic.com> Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com> Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com> Co-authored-by: mxc <mxc@example.com> Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com> Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com> Co-authored-by: mcoder6425 <mcoder64@gmail.com> Co-authored-by: lemsn <lemsn@msn.com> Co-authored-by: lemsn <lemsn@126.com> Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com> Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com> Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>	2025-10-09 12:36:19 +08:00
Billy Bao	ca9f30e1a1	Add tree_merge for law parsers, significantly outperforming hierarchical_merge (#10202 ) ### What problem does this PR solve? Add tree_merge for law parsers, significantly outperforming hierarchical_merge, solved: #8637 1. Add tree_merge for law parsers, include build_tree and get_tree by dfs. 2. add Copyright statement for helath_utils ### Type of change - [x] Documentation Update - [x] Performance Improvement	2025-09-22 16:33:21 +08:00
Yongteng Lei	0d9c1f1c3c	Feat: dataflow supports Spreadsheet and Word processor document (#9996 ) ### What problem does this PR solve? Dataflow supports Spreadsheet and Word processor document ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-10 13:02:53 +08:00
Kevin Hu	e9ee9269f5	Feat: user defined prompt. (#9972 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-08 14:05:01 +08:00
Kevin Hu	c27172b3bc	Feat: init dataflow. (#9791 ) ### What problem does this PR solve? #9790 Close #9782 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-28 18:40:32 +08:00
Jin Hai	5abd0bbac1	Fix typo (#9766 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-08-27 18:56:40 +08:00
Kevin Hu	b5b8032a56	Feat: Support metadata auto filer for Search. (#9524 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-19 10:27:24 +08:00
Stephen Hu	0a0bfc02a0	Refactor:naive_merge_with_images close useless images (#9296 ) ### What problem does this PR solve? naive_merge_with_images close useless images ### Type of change - [x] Refactoring	2025-08-07 11:07:29 +08:00
Stephen Hu	7efeaf6548	Fix:remove a img close which can not operate (#9267 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/9149#issuecomment-3157129587 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-06 10:59:49 +08:00
Stephen Hu	667c5812d0	Fix:Repeated images when parsing markdown files with images (#9196 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/9149 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-04 13:35:58 +08:00
Kevin Hu	d9fe279dde	Feat: Redesign and refactor agent module (#9113 ) ### What problem does this PR solve? #9082 #6365 <u> WARNING: it's not compatible with the older version of `Agent` module, which means that `Agent` from older versions can not work anymore.</u> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-07-30 19:41:09 +08:00
Yongteng Lei	dbc2a8689a	Fix: no chunks parsed out for Law (#8842 ) ### What problem does this PR solve? Fixes no chunks parsed out for Law. #5113 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-15 13:01:56 +08:00
Stephen Hu	f569401398	Fix: better_handle_different_types (#8775 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8719#issuecomment-3055883271 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-11 18:21:39 +08:00
Stephen Hu	00c954755e	Fix:use the same logic to handle pos in tokenize_chunks_with_images (#8732 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8719 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-09 09:31:40 +08:00
Yongteng Lei	b705ff08fe	Refa: improve GraphRAG similarity sensitivity to numeric differences (#8479 ) ### What problem does this PR solve? Improve GraphRAG similarity sensitivity to numeric differences. #8444. ### Type of change - [x] Refactoring	2025-06-25 16:20:59 +08:00
Kevin Hu	93f5df716f	Fix: order chunks from docx by positions. (#7979 ) ### What problem does this PR solve? #7934 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-30 17:20:53 +08:00
Yongteng Lei	bd4678bca6	Fix: Unnecessary truncation in markdown parser (#7972 ) ### What problem does this PR solve? Fix unnecessary truncation in markdown parser. So that markdown can work perfectly like [this](https://github.com/infiniflow/ragflow/issues/7824#issuecomment-2921312576) in #7824, supporting multiple special delimiters. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-30 15:04:21 +08:00
Yongteng Lei	46963ab1ca	Fix: add advanced delimiter detection for naive merge (#7941 ) ### What problem does this PR solve? Add advanced delimiter detection for naive merge. #7824 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-05-29 16:17:22 +08:00
Stephen Hu	db4371c745	Fix: Improve First Chunk Size (#7806 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/7790 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-23 14:30:19 +08:00
Emmanuel Ferdman	d4a123d6dd	Fix: resolve regex library warnings (#7782 ) ### What problem does this PR solve? This small PR resolves the regex library warnings showing in Python3.11: ```python DeprecationWarning: 'count' is passed as positional argument ``` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-22 10:06:28 +08:00
Kevin Hu	321a280031	Feat: add image preview to retrieval test. (#7610 ) ### What problem does this PR solve? #7608 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-05-13 14:30:36 +08:00
Stephen Hu	1662c7eda3	Feat: Markdown add image (#7124 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/6984 1. Markdown parser supports get pictures 2. For Native, when handling Markdown, it will handle images 3. improve merge and ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-04-25 18:35:28 +08:00
Kevin Hu	a087d13ccb	Feat: text file support position retaining. (#6231 ) ### What problem does this PR solve? #5832 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-18 16:55:11 +08:00
Kevin Hu	e05cdc2f9c	Fix: encode detect error. (#6006 ) ### What problem does this PR solve? #5967 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-13 10:47:58 +08:00
Kevin Hu	daddfc9e1b	Remove dup gb2312, solve currupt error. (#5326 ) ### What problem does this PR solve? #5252 #5325 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-25 12:22:37 +08:00
Kevin Hu	3444cb15e3	Refine search query. (#5235 ) ### What problem does this PR solve? #5173 #5214 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-21 18:32:32 +08:00
Kevin Hu	7b3d700d5f	Apply agentic searching. (#5196 ) ### What problem does this PR solve? #5173 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-02-20 17:41:01 +08:00
Kevin Hu	f374dd38b6	Fix divided by zero issue. (#4784 ) ### What problem does this PR solve? #4779 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-08 10:36:26 +08:00
Luo Pan	68d46b2a1e	Fix bug in hierarchical_merge function (#4006 ) ### What problem does this PR solve? Fix hierarchical_merge function. From idx vs. actual value to actual value vs. actual value. Related issue #4003 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: luopan <luopan@example.com>	2024-12-13 08:50:58 +08:00
Zhichang Yu	03f00c9e6f	Rename page_num_list, top_list, position_list (#3940 ) ### What problem does this PR solve? Rename page_num_list, top_list, position_list to page_num_int, top_int, position_int ### Type of change - [x] Refactoring	2024-12-10 16:32:58 +08:00

1 2

96 Commits