ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-01-31 07:36:46 +08:00

Author	SHA1	Message	Date
Yongteng Lei	9d0309aedc	Fix: [MinerU] Missing output file (#11623 ) ### What problem does this PR solve? Add fallbacks for MinerU output path. #11613, #11620. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-01 12:17:43 +08:00
buua436	a674338c21	Fix: remove garbage filtering rules (#11567 ) ### What problem does this PR solve? change: remove garbage filtering rules ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-27 17:54:49 +08:00
buua436	b6314164c5	Feat:new component Loop (#11447 ) ### What problem does this PR solve? issue: #10427 change: new component Loop ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-27 15:55:32 +08:00
myoldcat	8c28587821	Fix issue where HTML file parsing may lose content. (#11536 ) ### What problem does this PR solve? ##### Problem Description When parsing HTML files, some page content may be lost. For example, text inside nested `<font>` tags within multiple `<div>` elements (e.g., `<div><font>Text_1</font></div><div><font>Text_2</font></div>`) fails to be preserved correctly. ###### Root Cause #1: Block ID propagation is interrupted 1. Block ID generation: When the parser encounters a `<div>`, it generates a new `block_id` because `<div>` belongs to `BLOCK_TAGS`. 2. Recursive processing: This `block_id` is passed down recursively to process the `<div>`’s child nodes. 3. Interruption occurs: When processing a child `<font>` tag, the code enters the `else` branch of `read_text_recursively` (since `<font>` is a Tag). 4. Bug location: The first line in this `else` branch explicitly sets `block_id = None`. - This discards the valid `block_id` inherited from the parent `<div>`. - Since `<font>` is not in `BLOCK_TAGS`, it does not generate a new `block_id`, so it passes `None` to its child text nodes. 5. Consequence: The extracted text nodes have an empty `block_id` in their `metadata`. During the subsequent `merge_block_text` step, these texts cannot be correctly associated with their original `<div>` block due to the missing ID. As a result, all text from `<font>` tags gets merged together, which then triggers a second issue during concatenation. 6. Solution: Remove the forced reset of `block_id` to `None`. When the current tag (e.g., `<font>`) is not a block-level element, it should inherit the `block_id` passed down from its parent. This ensures consistent ownership across the hierarchy: `div` → `font` → `text`. ###### Root Cause #2: Data loss during text concatenation 1. The line `current_content += (" " if current_content else "" + content)` has a misplaced parenthesis. 
When `current_content` is non-empty (`True`): - The ternary expression evaluates to `" "` (a single space). - The code executes `current_content += " "`. - Result: Only a space is appended—the new `content` string is completely discarded. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-27 09:40:10 +08:00
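The precedence problem described under Root Cause #2 can be reproduced in isolation. The sketch below is illustrative only: the function names `append_buggy` and `append_fixed` are hypothetical, and the corrected line is one plausible way to move the parenthesis, not necessarily the exact change made in the PR.

```python
def append_buggy(current_content: str, content: str) -> str:
    # Python parses this as: " " if current_content else ("" + content),
    # so a non-empty current_content only ever receives a single space.
    current_content += (" " if current_content else "" + content)
    return current_content


def append_fixed(current_content: str, content: str) -> str:
    # Assumed fix: close the parenthesis around the separator only,
    # then always append the new content.
    current_content += (" " if current_content else "") + content
    return current_content


if __name__ == "__main__":
    print(repr(append_buggy("Text_1", "Text_2")))
    print(repr(append_fixed("Text_1", "Text_2")))
```

The conditional expression has lower precedence than `+`, which is why the string concatenation binds to the `else` branch alone.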
Yongteng Lei	7c20c964b4	Fix: incorrect image merging for naive markdown parser (#11520 ) ### What problem does this PR solve? Fix incorrect image merging for naive markdown parser. #9349 [ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 19:54:06 +08:00
FallingSnowFlake	1033a3ae26	Fix: improve PDF text type detection by expanding regex content (#11432 ) - Add whitespace validation to the PDF English text checking regex - Reduce false negatives in English PDF content recognition ### What problem does this PR solve? The core idea is to expand the regex content used for English text detection so it can accommodate more valid characters commonly found in English PDFs. The modifications include: - Adding support for space in the regex. - Ensuring the update does not reduce existing detection accuracy. ### Type of change - [✅] Bug Fix (non-breaking change which fixes an issue)	2025-11-21 14:33:29 +08:00
Billy Bao	d3d2ccc76c	Feat: add more chunking method (#11413 ) ### What problem does this PR solve? Feat: add more chunking method #11311 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 19:07:17 +08:00
buua436	c8ab9079b3	Fix:improve multi-column document detection (#11415 ) ### What problem does this PR solve? change: improve multi-column document detection ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-20 19:00:38 +08:00
aidan	420c97199a	Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041 ) ### What problem does this PR solve? - Added TCADP Parser configuration fields to PDF, PPT, and spreadsheet parsing forms - Implemented support for setting table result type (Markdown/HTML) and Markdown image response type (URL/Text) - Updated TCADP Parser to handle return format settings from configuration or parameters - Enhanced frontend to dynamically show TCADP options based on selected parsing method - Modified backend to pass format parameters when calling TCADP API - Optimized form default value logic for TCADP configuration items - Updated multilingual resource files for new configuration options ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 10:08:42 +08:00
Billy Bao	0884e9a4d9	Fix: bbox not included in mineru output (#11365 ) ### What problem does this PR solve? Fix: bbox not included in mineru output #11315 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-19 13:59:32 +08:00
Yongteng Lei	c2b7c305fa	Fix: crop index may be out of range (#11341 ) ### What problem does this PR solve? Crop index may be out of range. #11323 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-18 17:01:54 +08:00
Billy Bao	fea157ba08	Fix: manual parser with mineru (#11336 ) ### What problem does this PR solve? Fix: manual parser with mineru #11320 Fix: missing parameter in mineru #11334 Fix: add outlines parameter for pdf parsers ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-18 15:22:52 +08:00
Billy Bao	e7e89d3ecb	Doc: style fix (#11295 ) ### What problem does this PR solve? Style fix based on #11283 ### Type of change - [x] Documentation Update	2025-11-17 11:16:34 +08:00
Stephen Hu	12db62b9c7	Refactor: improve mineru_parser get property logic (#11268 ) ### What problem does this PR solve? improve mineru_parser get property logic ### Type of change - [x] Refactoring	2025-11-14 16:32:35 +08:00
Kevin Hu	ba71160b14	Refa: rm useless code. (#11238 ) ### Type of change - [x] Refactoring	2025-11-13 09:59:55 +08:00
buua436	8ef2f79d0a	Fix:reset the agent component’s output (#11222 ) ### What problem does this PR solve? change: “After each dialogue turn, the agent component’s output is not reset.” ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-13 09:49:12 +08:00
Jin Hai	f98b24c9bf	Move api.settings to common.settings (#11036 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-06 09:36:38 +08:00
Billy Bao	121c51661d	Fix: Markdown table extractor (#11018 ) ### What problem does this PR solve? Now markdown table extractor supports <table ...>. #10966 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-05 16:10:21 +08:00
Jin Hai	bab3fce136	Move some constants to common (#11004 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-05 08:01:39 +08:00
Yongteng Lei	2677617f93	Feat: supports MinerU http-client/server method (#10961 ) ### What problem does this PR solve? Add support for MinerU http-client/server method. To use MinerU with vLLM server: 1. Set up a vLLM server running MinerU: ```bash mineru-vllm-server --port 30000 ``` 2. Configure the following environment variables: - `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru` (or the path to your MinerU executable) - `MINERU_BACKEND="vlm-http-client"` - `MINERU_SERVER_URL="http://your-vllm-server-ip:30000"` 3. Follow the standard MinerU setup steps as described above. With this configuration, RAGFlow will connect to your vLLM server to perform document parsing, which can significantly improve parsing performance for complex documents while reducing the resource requirements on your RAGFlow server. ![1](https://github.com/user-attachments/assets/46624a0c-0f3b-423e-ace8-81801e97a27d) ![2](https://github.com/user-attachments/assets/66ccc004-a598-47d4-93cb-fe176834f83b) ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2025-11-04 16:03:30 +08:00
Jin Hai	1e45137284	Move 'timeout' to common folder (#10983 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-04 11:51:12 +08:00
Kevin Hu	3e5a39482e	Feat: Support multiple data sources synchronizations (#10954 ) ### What problem does this PR solve? #10953 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-03 19:59:18 +08:00
Jin Hai	1284647694	Refactor file utils (#10970 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 18:54:55 +08:00
Jin Hai	076d811086	Introduce common/config_utils.py (#10968 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 17:25:06 +08:00
Jin Hai	78631a3fd3	Move some functions out of 'api/utils/common.py' (#10948 ) ### What problem does this PR solve? as title. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 12:34:47 +08:00
Jin Hai	360f5c1179	Move token related functions to common (#10942 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 08:50:05 +08:00
Jin Hai	44f2d6f5da	Move 'get_project_base_directory' to common directory (#10940 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-02 21:05:28 +08:00
Stephen Hu	09dd786674	Fix:KeyError: 'table_body' of mineru parser (#10773 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/10769 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-31 10:07:56 +08:00
buua436	bb9504d1cc	Fix:enhance delimiters in markdown parser (#10896 ) ### What problem does this PR solve? issue: [#10890](https://github.com/infiniflow/ragflow/issues/10890) change: enhance delimiters in markdown parser ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-30 17:36:51 +08:00
Edward Chen	b52f09adfe	Mineru api support (#10874 ) ### What problem does this PR solve? Support a local MinerU API from a Docker instance, for example when WSL on Windows has no GPU but a MinerU API with GPU support is available. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-10-30 17:31:46 +08:00
Billy Bao	057ae646f2	Fix: logging issues (#10836 ) ### What problem does this PR solve? Fix: logging issues #10835 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-28 14:10:47 +08:00
buua436	60a6cf7c7a	Fix:remove unexpected keyword argument in table_structure_recognizer logging (#10831 ) ### What problem does this PR solve? issue: #10825 change: remove unexpected keyword argument in table_structure_recognizer logging ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-28 11:02:43 +08:00
Billy Bao	e59458c36b	Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819 ) ### What problem does this PR solve? Fix: parsing excel with chartsheet #10815 Fix: Clamp begin to a minimum of 0 to prevent negative indexing #10804 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-28 09:40:37 +08:00
Yongteng Lei	5acc407240	Feat: MinerU supports VLM-Transformers backend (#10809 ) ### What problem does this PR solve? MinerU supports VLM-Transformers backend. Set `MINERU_BACKEND="pipeline"` to choose the backend. (Options: pipeline | vlm-transformers, default is pipeline) ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-10-27 17:04:13 +08:00
aidan	33a189f620	Feat: add TCADP Parser (#10775 ) ### What problem does this PR solve? This PR adds a new TCADP (Tencent Cloud Advanced Document Processing) parser to RAGFlow, enabling users to leverage Tencent Cloud's document parsing capabilities for more accurate and structured document processing. The implementation includes: New TCADP Parser: A complete implementation of Tencent Cloud's document parsing API without SDK dependency Configuration Support: Added configuration options in service_conf.yaml for Tencent Cloud API credentials Frontend Integration: Updated UI components to support the new TCADP parser option Error Handling: Comprehensive error handling and retry mechanisms for API calls Result Processing: Support for both SSE streaming and JSON response formats from Tencent Cloud API ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-10-27 15:14:58 +08:00
Zhichang Yu	73144e278b	Don't release full image (#10654 ) ### What problem does this PR solve? - Introduced gpu profile in .env - Added Dockerfile_tei - fix datrie - Removed LIGHTEN flag ### Type of change - [x] Documentation Update - [x] Refactoring	2025-10-23 23:02:27 +08:00
buua436	0ff2042fc1	Feat: add Docling parser (#10759 ) ### What problem does this PR solve? issue: #3945 change: add Docling parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-23 19:44:25 +08:00
buua436	41fade3fe6	Fix:wrong param in manual chunk (#10710 ) ### What problem does this PR solve? change: wrong param in manual chunk ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-21 20:10:54 +08:00
Stephen Hu	9d12380806	Fix: Excel2HTML can't support XLS (Excel 97-2003) (#10660 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/10602 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-21 09:52:59 +08:00
buua436	6ab96287c9	Feat:Vision Model Image Enhancement in Manual/Paper/Book/One chunker (#10640 ) ### What problem does this PR solve? issue: [#7472](https://github.com/infiniflow/ragflow/issues/7472) change: Vision Model Image Enhancement in Manual chunker ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-21 09:36:27 +08:00
Yongteng Lei	387baf858f	Feat: add MinerU parser (#10621 ) ### What problem does this PR solve? Add MinerU parser. #3945, #8092. Set `MINERU_EXECUTABLE` to the MinerU executable path, defaults to `mineru`. Set `MINERU_DELETE_OUTPUT=0` to preserve MinerU's output, default is 1, which deletes temporary output. Set `MINERU_OUTPUT_DIR` to choose the MinerU output directory (uses the temporary directory if unset). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-17 09:55:39 +08:00
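The three environment variables named in this commit message could be read along the following lines. This is a minimal sketch, not RAGFlow's actual code: `MinerUSettings` and `load_mineru_settings` are hypothetical names, treating any value other than `0` as "delete" is an assumption, and `tempfile.gettempdir()` stands in for whatever temporary directory the parser really uses.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class MinerUSettings:
    executable: str      # path or command name of the MinerU executable
    delete_output: bool  # whether temporary output is removed afterwards
    output_dir: str      # where MinerU writes its output


def load_mineru_settings(env=None) -> MinerUSettings:
    """Read the MinerU variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return MinerUSettings(
        # Defaults to `mineru`, as stated in the commit message.
        executable=env.get("MINERU_EXECUTABLE", "mineru"),
        # Default is 1 (delete); 0 preserves the output. Other values are
        # treated as "delete" here, which is an assumption.
        delete_output=env.get("MINERU_DELETE_OUTPUT", "1") != "0",
        # Falls back to a temporary directory when unset (assumed location).
        output_dir=env.get("MINERU_OUTPUT_DIR") or tempfile.gettempdir(),
    )


if __name__ == "__main__":
    print(load_mineru_settings({"MINERU_DELETE_OUTPUT": "0"}))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without touching the real environment.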
Yongteng Lei	5200711441	Feat: add support for multi-column PDF parsing (#10475 ) ### What problem does this PR solve? Add support for multi-columns PDF parsing. #9878, #9919. Two-column sample: <img width="1885" height="1020" alt="image" src="https://github.com/user-attachments/assets/0270c028-2db8-4ca6-a4b7-cd5830882d28" /> Three-column sample: <img width="1881" height="992" alt="image" src="https://github.com/user-attachments/assets/9ee88844-d5b1-4927-9e4e-3bd810d6e03a" /> Single-column sample: <img width="1883" height="1042" alt="image" src="https://github.com/user-attachments/assets/e93d3d18-43c3-4067-b5fa-e454ed0ab093" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:46:09 +08:00
Kevin Hu	7d2f65671f	Feat: debugging toc part. (#10486 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:45:21 +08:00
Billy Bao	534fa60b2a	Fix: Agent.reset() argument wrong #10463 & Unable to converse with agent through Python API. #10415 (#10472 ) ### What problem does this PR solve? Fix: Agent.reset() argument wrong #10463 & Unable to converse with agent through Python API. #10415 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-10 20:44:05 +08:00
Kevin Hu	0d8791936e	Feat: TOC retrieval (#10456 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-10 17:07:55 +08:00
XIANG LI	f631073ac2	Fix OCR GPU provider mem limit handling (#10407 ) ### What problem does this PR solve? - Running DeepDoc OCR on large PDFs inside the GPU docker-compose setup would intermittently fail with [ONNXRuntimeError] ... p2o.Clip.6 ... Available memory of 0 is smaller than requested bytes ... - Root cause: load_model() in deepdoc/vision/ocr.py treated device_id=None as-is. torch.cuda.device_count() > device_id then raised a TypeError, the helper returned False, and ONNXRuntime quietly fell back to CPUExecutionProvider with the hard-coded 512 MB limit, which then triggered the allocator failure. - Environment where this reproduces: Windows 11, AMD 5900x, 64 GB RAM, RTX 3090 (24 GB), docker-compose-gpu.yml from upstream, default DeepDoc + GraphRAG parser settings, ingesting heavy PDF such as 《内科学》（第10版）.pdf (~180 MB). Fixes: - Normalize device_id to 0 when it is None before calling any CUDA APIs, so the GPU path is considered available. - Allow configuring the CUDA provider’s memory cap via OCR_GPU_MEM_LIMIT_MB (default 2048 MB) and expose OCR_ARENA_EXTEND_STRATEGY; the calculated byte limit is logged to confirm the effective settings. After the change, ragflow_server.log shows for example load_model ... uses GPU (device 0, gpu_mem_limit=21474836480, arena_strategy=kNextPowerOfTwo) and the same document finishes OCR without allocator errors. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-10 11:03:12 +08:00
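The two fixes listed in this commit message (normalizing `device_id` and deriving a byte limit from `OCR_GPU_MEM_LIMIT_MB`) can be sketched as below. The helper name `resolve_cuda_options` is hypothetical, and the fallback value for `OCR_ARENA_EXTEND_STRATEGY` is an assumption taken from the example log line; the commit message does not state that variable's default. The dictionary keys follow ONNX Runtime's CUDA execution provider option names, but no ONNX Runtime call is made here.

```python
import os


def resolve_cuda_options(device_id=None, env=None) -> dict:
    """Build CUDA provider options from the environment (illustrative only)."""
    env = os.environ if env is None else env

    # Normalize None to device 0 before any comparison with a device count;
    # comparing an int with None raises TypeError in Python 3.
    device_id = 0 if device_id is None else int(device_id)

    # Memory cap is given in MB (default 2048) and converted to bytes.
    mem_limit_mb = int(env.get("OCR_GPU_MEM_LIMIT_MB", "2048"))
    gpu_mem_limit = mem_limit_mb * 1024 * 1024

    # Assumed default, matching the strategy shown in the example log line.
    strategy = env.get("OCR_ARENA_EXTEND_STRATEGY", "kNextPowerOfTwo")

    return {
        "device_id": device_id,
        "gpu_mem_limit": gpu_mem_limit,
        "arena_extend_strategy": strategy,
    }


if __name__ == "__main__":
    # 20480 MB corresponds to the gpu_mem_limit value quoted in the log example.
    print(resolve_cuda_options(None, {"OCR_GPU_MEM_LIMIT_MB": "20480"}))
```

With `OCR_GPU_MEM_LIMIT_MB=20480` the computed limit is 21474836480 bytes, which matches the figure in the quoted `ragflow_server.log` line.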
Billy Bao	f04c9e2937	Fix: correctly update parser method & correct vllm pdf parser (#10441 ) ### What problem does this PR solve? Fix: correctly update parser method ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2025-10-09 19:03:12 +08:00
Kevin Hu	cbf04ee470	Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423 ) ### What problem does this PR solve? #9869 ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: jinhai <haijin.chn@gmail.com> Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: chanx <1243304602@qq.com> Co-authored-by: balibabu <cike8899@users.noreply.github.com> Co-authored-by: Lynn <lynn_inf@hotmail.com> Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com> Co-authored-by: huangzl <huangzl@shinemo.com> Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com> Co-authored-by: Wilmer <33392318@qq.com> Co-authored-by: Adrian Weidig <adrianweidig@gmx.net> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Yongteng Lei <yongtengrey@outlook.com> Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com> Co-authored-by: BadwomanCraZY <511528396@qq.com> Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com> Co-authored-by: Russell Valentine <russ@coldstonelabs.org> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Billy Bao <newyorkupperbay@gmail.com> Co-authored-by: Zhedong Cen <cenzhedong2@126.com> Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com> Co-authored-by: TensorNull <tensor.null@gmail.com> Co-authored-by: TeslaZY <TeslaZY@outlook.com> Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com> Co-authored-by: AB <aj@Ajays-MacBook-Air.local> Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com> Co-authored-by: He Wang <wanghechn@qq.com> Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Mohamed Mathari 
<155896313+melmathari@users.noreply.github.com> Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box> Co-authored-by: Stephen Hu <stephenhu@seismic.com> Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com> Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com> Co-authored-by: mxc <mxc@example.com> Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com> Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com> Co-authored-by: mcoder6425 <mcoder64@gmail.com> Co-authored-by: lemsn <lemsn@msn.com> Co-authored-by: lemsn <lemsn@126.com> Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com> Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com> Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>	2025-10-09 12:36:19 +08:00
Jin Hai	b0b866c8fd	Refactor: move some functions out of api/utils/__init__.py (#10216 ) ### What problem does this PR solve? Refactor import modules. ### Type of change - [x] Refactoring --------- Signed-off-by: jinhai <haijin.chn@gmail.com> Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-09-25 18:04:49 +08:00
Jin Hai	4eb7659499	Fix bug: broken import from rag.prompts.prompts (#10217 ) ### What problem does this PR solve? Fix broken imports ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	2025-09-23 10:19:25 +08:00

243 Commits