ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-01-23 11:36:38 +08:00

Author	SHA1	Message	Date
Kevin Hu	927db0b373	Refa: asyncio.to_thread to ThreadPoolExecutor to break thread limitat… (#12716 ) ### Type of change - [x] Refactoring	2026-01-20 13:29:37 +08:00
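The commit title above is truncated, so the exact motivation is not stated here. As a rough sketch only: `asyncio.to_thread` runs work on the event loop's default executor, whereas passing an explicit `ThreadPoolExecutor` to `loop.run_in_executor` lets the caller choose the worker count. The names `blocking_work`, `run_all`, and `max_workers=64` are hypothetical and not taken from the RAGFlow code.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_work(n: int) -> int:
    """Stand-in for a blocking call (hypothetical)."""
    time.sleep(0.01)
    return n * n


async def run_all(values, max_workers: int = 64):
    """Run blocking_work for each value on a dedicated thread pool."""
    loop = asyncio.get_running_loop()
    # A dedicated pool: its size is set here, not by the loop's default executor.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [loop.run_in_executor(pool, blocking_work, v) for v in values]
        return await asyncio.gather(*futures)


if __name__ == "__main__":
    print(asyncio.run(run_all(range(8))))
```

`asyncio.gather` preserves input order, so the results line up with `values` regardless of which thread finishes first.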
Magicbook1108	045314a1aa	Fix: duplicate content in chunk (#12655 ) ### What problem does this PR solve? Fix: duplicate content in chunk #12336 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-16 15:32:04 +08:00
Yongteng Lei	64c75d558e	Fix: zip extraction vulnerabilities in MinerU and TCADP (#12527 ) ### What problem does this PR solve? Fix zip extraction vulnerabilities: - Block symlink entries in zip files. - Reject encrypted zip entries. - Prevent absolute path attacks (including Windows paths). - Block path traversal attempts (../). - Stop zip slip exploits (directory escape). - Use streaming for memory-safe file handling. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-13 12:24:50 +08:00
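The checks listed in this commit message can be sketched with the standard library as follows. This is a minimal illustration under assumptions, not the code from the PR: the function names `unsafe_reason` and `safe_extract` are hypothetical, and the real MinerU/TCADP handling may differ.

```python
import io
import os
import re
import shutil
import stat
import zipfile
from typing import List, Optional


def unsafe_reason(info: zipfile.ZipInfo) -> Optional[str]:
    """Return why an entry should be rejected, or None if it looks safe."""
    name = info.filename.replace("\\", "/")
    if stat.S_ISLNK(info.external_attr >> 16):  # Unix mode lives in the high 16 bits
        return "symlink entry"
    if info.flag_bits & 0x1:  # bit 0 marks an encrypted entry
        return "encrypted entry"
    if name.startswith("/") or re.match(r"^[A-Za-z]:", name):
        return "absolute path"
    if ".." in name.split("/"):
        return "path traversal"
    return None


def safe_extract(zip_bytes: bytes, dest_dir: str) -> List[str]:
    """Extract an in-memory zip into dest_dir, rejecting unsafe entries."""
    dest_root = os.path.realpath(dest_dir)
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            reason = unsafe_reason(info)
            if reason:
                raise ValueError(f"rejected {info.filename!r}: {reason}")
            rel = info.filename.replace("\\", "/")
            target = os.path.realpath(os.path.join(dest_root, rel))
            # Zip slip guard: the resolved target must stay under dest_root.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"rejected {info.filename!r}: escapes destination")
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Stream the entry to disk instead of reading it fully into memory.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Rejecting the whole archive on the first bad entry is one possible policy; skipping the entry and continuing is another.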
Lin Manhui	4fe3c24198	feat: PaddleOCR PDF parser supports thumbnails and positions (#12565 ) ### What problem does this PR solve? 1. PaddleOCR PDF parser supports thumbnails and positions. 2. Add FAQ documentation for PaddleOCR PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-13 09:51:08 +08:00
lys1313013	b226e06e2d	refactor: remove debug print statements (#12534 ) ### What problem does this PR solve? refactor: remove debug print statements ### Type of change - [x] Refactoring	2026-01-09 19:23:50 +08:00
Lin Manhui	2e09db02f3	feat: add paddleocr parser (#12513 ) ### What problem does this PR solve? Add PaddleOCR as a new PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-09 17:48:45 +08:00
Magicbook1108	011bbe9556	Feat: support context window for docx (#12455 ) ### What problem does this PR solve? Feat: support context window for docx #12303 Done: - [x] naive.py - [x] one.py TODO: - [ ] book.py - [ ] manual.py Fix: incorrect image position Fix: incorrect chunk type tag ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-01-07 15:08:17 +08:00
Yongteng Lei	4cd4526492	Feat: PDF vision figure parser supports reading context (#12416 ) ### What problem does this PR solve? PDF vision figure parser supports reading context. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-05 09:55:43 +08:00
Kevin Hu	52f91c2388	Refine: image/table context. (#12336 ) ### What problem does this PR solve? #12303 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-30 20:24:27 +08:00
Jin Hai	df3cbb9b9e	Refactor code (#12305 ) ### What problem does this PR solve? as title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-30 11:09:18 +08:00
Stephen Hu	0b5d1ebefa	refactor: docling parser will close bytes io (#12280 ) ### What problem does this PR solve? docling parser will close bytes io ### Type of change - [x] Refactoring	2025-12-29 13:33:27 +08:00
Rin	651d9fff9f	security: replace unsafe eval with ast.literal_eval in vision operators (#12236 ) Addresses a potential RCE vulnerability in NormalizeImage by using ast.literal_eval for safer string parsing. --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-29 13:28:09 +08:00
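To illustrate why the swap described above matters, here is a small sketch. The helper name `parse_literal` is hypothetical and this is not the `NormalizeImage` code itself. `ast.literal_eval` accepts only Python literals (numbers, strings, lists, tuples, dicts, and so on) and raises an error for anything else, where `eval` would execute it.

```python
import ast


def parse_literal(value):
    """Parse a config string into a Python literal without executing code."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value


# A literal list parses as expected.
mean = parse_literal("[0.485, 0.456, 0.406]")

# An expression that eval() would execute is rejected instead.
try:
    parse_literal("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

One consequence of this design is that arithmetic expressions such as divisions are not evaluated either, so values have to be written as plain literals.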
Kevin Hu	bc9e1e3b9a	Fix: parent-children pipeline bad case. (#12246 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-26 18:57:16 +08:00
Kevin Hu	f0dac1d90e	Fix: loopitem None issue. (#12166 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 12:12:38 +08:00
Magicbook1108	712d537d66	Fix: vision_figure_parser_docx/pdf_wrapper (#12104 ) ### What problem does this PR solve? Fix: vision_figure_parser_docx/pdf_wrapper #11735 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-23 11:51:28 +08:00
Stephen Hu	ba7e087aef	Refactor:remove useless try catch for ppt parser (#12063 ) ### What problem does this PR solve? remove useless try catch for ppt parser ### Type of change - [x] Refactoring	2025-12-22 13:09:42 +08:00
buua436	b49eb6826b	Feat: enhance Excel image extraction with vision-based descriptions (#12054 ) ### What problem does this PR solve? issue: [#11618](https://github.com/infiniflow/ragflow/issues/11618) change: enhance Excel image extraction with vision-based descriptions ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-22 10:17:44 +08:00
concertdictate	4dd8cdc38b	task executor issues (#12006 ) ### What problem does this PR solve? Fixes #8706 - `InfinityException: TOO_MANY_CONNECTIONS` when running multiple task executor workers ### Problem Description When running RAGFlow with 8-16 task executor workers, most workers fail to start properly. Checking logs revealed that workers were stuck/hanging during Infinity connection initialization - only 1-2 workers would successfully register in Redis while the rest remained blocked. ### Root Cause The Infinity SDK `ConnectionPool` pre-allocates all connections in `__init__`. With the default `max_size=32` and multiple workers (e.g., 16), this creates 16×32=512 connections immediately on startup, exceeding Infinity's default 128 connection limit. Workers hang while waiting for connections that can never be established. ### Changes 1. Prevent Infinity connection storm (`rag/utils/infinity_conn.py`, `rag/svr/task_executor.py`) - Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since operations are synchronous) - Added staggered startup delay (2s per worker) to spread connection initialization 2. Handle None children_delimiter (`rag/app/naive.py`) - Use `or ""` to handle explicitly set None values from parser config 3. MinerU parser robustness (`deepdoc/parser/mineru_parser.py`) - Use `.get()` for optional output fields that may be missing - Fix DISCARDED block handling: change `pass` to `continue` to skip discarded blocks entirely ### Why `max_size=4` is sufficient | Workers | Pool Size | Total Connections | Infinity Limit | |---------|-----------|-------------------|----------------| | 16 | 32 | 512 | 128 ❌ | | 16 | 4 | 64 | 128 ✅ | | 32 | 4 | 128 | 128 ✅ | - All RAGFlow operations are synchronous: `get_conn()` → operation → `release_conn()` - No parallel `docStoreConn` operations in the codebase - Maximum 1-2 concurrent connections needed per worker; 4 provides safety margin ### MinerU DISCARDED block bug When MinerU returns blocks with `type: "discarded"` (headers, footers, watermarks, page numbers, artifacts), the previous code used `pass` which left the `section` variable undefined, causing: - UnboundLocalError if DISCARDED is the first block - Duplicate content if DISCARDED follows another block (stale value from previous iteration) Root cause confirmed via MinerU source code: From [`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14): ```python class BlockType: DISCARDED = 'discarded' # VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE ``` Per [MinerU documentation](https://opendatalab.github.io/MinerU/reference/output_files/), discarded blocks contain content that should be filtered out for clean text extraction. Fix: Changed `pass` to `continue` to skip discarded blocks entirely. ### Testing - Verified all 16 workers now register successfully in Redis - All workers heartbeating correctly - Document parsing works as expected - MinerU parsing with DISCARDED blocks no longer crashes ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: user210 <user210@rt>	2025-12-18 10:03:30 +08:00
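The `pass` versus `continue` behaviour described in this commit message can be reproduced with a small standalone sketch. The block dictionaries and function names below are hypothetical simplifications, not the actual `mineru_parser.py` code.

```python
def collect_sections_buggy(blocks):
    """Mimics the described bug: `pass` falls through to the append."""
    sections = []
    for block in blocks:
        if block["type"] == "discarded":
            pass  # nothing assigned; `section` is stale or not yet bound
        else:
            section = block["text"]
        sections.append(section)
    return sections


def collect_sections_fixed(blocks):
    """Mimics the described fix: `continue` skips discarded blocks entirely."""
    sections = []
    for block in blocks:
        if block["type"] == "discarded":
            continue
        section = block["text"]
        sections.append(section)
    return sections


blocks = [
    {"type": "text", "text": "Intro"},
    {"type": "discarded", "text": "Page 1"},
    {"type": "text", "text": "Body"},
]
```

With these inputs the buggy version repeats the previous block's text in place of the discarded one, while the fixed version drops it. If the discarded block comes first, the buggy version raises `UnboundLocalError`.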
Yongteng Lei	672958a192	Fix: model not authorized (#12001 ) ### What problem does this PR solve? Fix model not authorized. #11973. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-17 19:48:24 +08:00
Jin Hai	d38f8a1562	Add license and Fix IDE warnings (#11985 ) ### What problem does this PR solve? - Add license - Fix IDE warnings ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-17 17:04:44 +08:00
Yongteng Lei	03f9be7cbb	Refa: only support MinerU-API now (#11977 ) ### What problem does this PR solve? Only support MinerU-API now, still need to complete frontend for pipeline to allow the configuration of MinerU options. ### Type of change - [x] Refactoring	2025-12-17 12:58:48 +08:00
concertdictate	49c74d08e8	Feature/mineru improvements (#11938 ) I have repeated the explanation in Chinese in the comments below. ### What problem does this PR solve? ## Summary This PR enhances the MinerU document parser with additional configuration options, giving users more control over PDF parsing behavior and improving support for multilingual documents. ## Changes ### Backend (`deepdoc/parser/mineru_parser.py`) - Added configurable parsing options: - Parse Method: `auto`, `txt`, or `ocr`: allows users to choose the extraction strategy - Formula Recognition: Toggle for enabling/disabling formula extraction (useful to disable for Cyrillic documents where it may cause issues) - Table Recognition: Toggle for enabling/disabling table extraction - Added language code mapping (`LANGUAGE_TO_MINERU_MAP`) to translate RAGFlow language settings to MinerU-compatible language codes for better OCR accuracy - Improved parser configuration handling to pass these options through the processing pipeline ### Frontend (`web/`) - Created new `MinerUOptionsFormField` component that conditionally renders when MinerU is selected as the layout recognition engine - Added UI controls for: - Parse method selection (dropdown) - Formula recognition toggle (switch) - Table recognition toggle (switch) - Added i18n translations for English and Chinese - Integrated the options into both the dataset creation dialog and dataset settings page ### Integration - Updated `rag/app/naive.py` to forward MinerU options to the parser - Updated task service to handle the new configuration parameters ## Why MinerU is a powerful document parser, but the default settings don't work well for all document types. This PR allows users to: 1. Choose the best parsing method for their documents 2. Disable formula recognition for Cyrillic/non-Latin scripts where it causes issues 3. Control table extraction based on document needs 4. Benefit from automatic language detection for better OCR results ## Testing - [x] Tested MinerU parsing with different parse methods - [x] Verified UI renders correctly when MinerU is selected/deselected - [x] Confirmed settings persist correctly in dataset configuration ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: user210 <user210@rt> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-16 13:15:25 +08:00
Yongteng Lei	e9710b7aa9	Refa: treat MinerU as an OCR model 2 (#11905 ) ### What problem does this PR solve? Treat MinerU as an OCR model 2. #11903 ### Type of change - [x] Refactoring	2025-12-11 17:33:12 +08:00
TeslaZY	7b96113d4c	MinerU support for the new backend vlm-mlx-engine (#11864 ) ### What problem does this PR solve? The new MinerU version supports the new backend vlm-mlx-engine, https://github.com/opendatalab/MinerU . ### Type of change - [ x ] New Feature (non-breaking change which adds functionality)	2025-12-11 09:59:38 +08:00
buua436	65a5a56d95	Refa:replace trio with asyncio (#11831 ) ### What problem does this PR solve? change: replace trio with asyncio ### Type of change - [x] Refactoring	2025-12-09 19:23:14 +08:00
Yongteng Lei	a94b3b9df2	Refa: treat MinerU as an OCR model (#11849 ) ### What problem does this PR solve? Treat MinerU as an OCR model. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-12-09 18:54:14 +08:00
Kevin Hu	09a3854ed8	Fix: chunk method error. (#11807 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-08 14:28:23 +08:00
Jin Hai	43f51baa96	Fix errors (#11804 ) ### What problem does this PR solve? 1. typos 2. grammar errors. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-08 12:21:18 +08:00
Jin Hai	6546f86b4e	Fix errors (#11795 ) ### What problem does this PR solve? - typos - IDE warnings ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-08 09:42:10 +08:00
Kevin Hu	797e03f843	Fix: none type error. (#11735 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-04 14:14:38 +08:00
Yongteng Lei	648342b62f	Fix: handle MinerU sanitized filenames when reading output (#11701 ) ### What problem does this PR solve? Handle MinerU sanitized filenames when reading output. #11613, #11620. Thanks @shaoqing404 for raising this issue. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-03 17:24:37 +08:00
Yongteng Lei	9d0309aedc	Fix: [MinerU] Missing output file (#11623 ) ### What problem does this PR solve? Add fallbacks for MinerU output path. #11613, #11620. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 12:17:43 +08:00
buua436	a674338c21	Fix: remove garbage filtering rules (#11567 ) ### What problem does this PR solve? change: remove garbage filtering rules ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-27 17:54:49 +08:00
buua436	b6314164c5	Feat:new component Loop (#11447 ) ### What problem does this PR solve? issue: #10427 change: new component Loop ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-27 15:55:32 +08:00
myoldcat	8c28587821	Fix issue where HTML file parsing may lose content. (#11536 ) ### What problem does this PR solve? ##### Problem Description When parsing HTML files, some page content may be lost. For example, text inside nested `<font>` tags within multiple `<div>` elements (e.g., `<div><font>Text_1</font></div><div><font>Text_2</font></div>`) fails to be preserved correctly. ###### Root Cause #1: Block ID propagation is interrupted 1. Block ID generation: When the parser encounters a `<div>`, it generates a new `block_id` because `<div>` belongs to `BLOCK_TAGS`. 2. Recursive processing: This `block_id` is passed down recursively to process the `<div>`’s child nodes. 3. Interruption occurs: When processing a child `<font>` tag, the code enters the `else` branch of `read_text_recursively` (since `<font>` is a Tag). 4. Bug location: The first line in this `else` branch explicitly sets `block_id = None`. - This discards the valid `block_id` inherited from the parent `<div>`. - Since `<font>` is not in `BLOCK_TAGS`, it does not generate a new `block_id`, so it passes `None` to its child text nodes. 5. Consequence: The extracted text nodes have an empty `block_id` in their `metadata`. During the subsequent `merge_block_text` step, these texts cannot be correctly associated with their original `<div>` block due to the missing ID. As a result, all text from `<font>` tags gets merged together, which then triggers a second issue during concatenation. 6. Solution: Remove the forced reset of `block_id` to `None`. When the current tag (e.g., `<font>`) is not a block-level element, it should inherit the `block_id` passed down from its parent. This ensures consistent ownership across the hierarchy: `div` → `font` → `text`. ###### Root Cause #2: Data loss during text concatenation 1. The line `current_content += (" " if current_content else "" + content)` has a misplaced parenthesis. When `current_content` is non-empty (`True`): - The ternary expression evaluates to `" "` (a single space). - The code executes `current_content += " "`. - Result: Only a space is appended; the new `content` string is completely discarded. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-27 09:40:10 +08:00
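The precedence issue in Root Cause #2 can be checked in isolation. The buggy expression below is copied from the commit message; the corrected form and the wrapper function names are an illustrative guess at the intended behaviour, not the patch itself.

```python
def append_buggy(current_content: str, content: str) -> str:
    # Conditional expressions bind more loosely than `+`, so this parses as
    # `" " if current_content else ("" + content)`.
    current_content += (" " if current_content else "" + content)
    return current_content


def append_fixed(current_content: str, content: str) -> str:
    # Parenthesise only the separator, then always append the content.
    current_content += (" " if current_content else "") + content
    return current_content
```

With an empty accumulator both versions behave the same, which is likely why the bug only shows up from the second piece of text onward.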
Yongteng Lei	7c20c964b4	Fix: incorrect image merging for naive markdown parser (#11520 ) ### What problem does this PR solve? Fix incorrect image merging for naive markdown parser. #9349 [ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 19:54:06 +08:00
FallingSnowFlake	1033a3ae26	Fix: improve PDF text type detection by expanding regex content (#11432 ) - Add whitespace validation to the PDF English text checking regex - Reduce false negatives in English PDF content recognition ### What problem does this PR solve? The core idea is to expand the regex content used for English text detection so it can accommodate more valid characters commonly found in English PDFs. The modifications include: - Adding support for space in the regex. - Ensuring the update does not reduce existing detection accuracy. ### Type of change - [✅] Bug Fix (non-breaking change which fixes an issue)	2025-11-21 14:33:29 +08:00
Billy Bao	d3d2ccc76c	Feat: add more chunking method (#11413 ) ### What problem does this PR solve? Feat: add more chunking method #11311 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 19:07:17 +08:00
buua436	c8ab9079b3	Fix:improve multi-column document detection (#11415 ) ### What problem does this PR solve? change: improve multi-column document detection ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-20 19:00:38 +08:00
aidan	420c97199a	Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041 ) ### What problem does this PR solve? - Added TCADP Parser configuration fields to PDF, PPT, and spreadsheet parsing forms - Implemented support for setting table result type (Markdown/HTML) and Markdown image response type (URL/Text) - Updated TCADP Parser to handle return format settings from configuration or parameters - Enhanced frontend to dynamically show TCADP options based on selected parsing method - Modified backend to pass format parameters when calling TCADP API - Optimized form default value logic for TCADP configuration items - Updated multilingual resource files for new configuration options ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 10:08:42 +08:00
Billy Bao	0884e9a4d9	Fix: bbox not included in mineru output (#11365 ) ### What problem does this PR solve? Fix: bbox not included in mineru output #11315 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-19 13:59:32 +08:00
Yongteng Lei	c2b7c305fa	Fix: crop index may be out of range (#11341 ) ### What problem does this PR solve? Crop index may be out of range. #11323 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-18 17:01:54 +08:00
Billy Bao	fea157ba08	Fix: manual parser with mineru (#11336 ) ### What problem does this PR solve? Fix: manual parser with mineru #11320 Fix: missing parameter in mineru #11334 Fix: add outlines parameter for pdf parsers ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-18 15:22:52 +08:00
Billy Bao	e7e89d3ecb	Doc: style fix (#11295 ) ### What problem does this PR solve? Style fix based on #11283 ### Type of change - [x] Documentation Update	2025-11-17 11:16:34 +08:00
Stephen Hu	12db62b9c7	Refactor: improve mineru_parser get property logic (#11268 ) ### What problem does this PR solve? improve mineru_parser get property logic ### Type of change - [x] Refactoring	2025-11-14 16:32:35 +08:00
Kevin Hu	ba71160b14	Refa: rm useless code. (#11238 ) ### Type of change - [x] Refactoring	2025-11-13 09:59:55 +08:00
buua436	8ef2f79d0a	Fix:reset the agent component’s output (#11222 ) ### What problem does this PR solve? change: “After each dialogue turn, the agent component’s output is not reset.” ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-13 09:49:12 +08:00
Jin Hai	f98b24c9bf	Move api.settings to common.settings (#11036 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-06 09:36:38 +08:00
Billy Bao	121c51661d	Fix: Markdown table extractor (#11018 ) ### What problem does this PR solve? Now markdown table extractor supports <table ...>. #10966 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-05 16:10:21 +08:00
Jin Hai	bab3fce136	Move some constants to common (#11004 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-05 08:01:39 +08:00


274 Commits