ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-01-25 20:56:33 +08:00

Author	SHA1	Message	Date
buua436	bb9504d1cc	Fix:enhance delimiters in markdown parser (#10896 ) ### What problem does this PR solve? issue: [#10890](https://github.com/infiniflow/ragflow/issues/10890) change： enhance delimiters in markdown parser ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-30 17:36:51 +08:00
Edward Chen	b52f09adfe	Mineru api support (#10874 ) ### What problem does this PR solve? Support a local MinerU API in a Docker instance, for example when WSL on Windows has no GPU but a MinerU API with GPU support is available. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-10-30 17:31:46 +08:00
Billy Bao	057ae646f2	Fix: logging issues (#10836 ) ### What problem does this PR solve? Fix: logging issues #10835 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-28 14:10:47 +08:00
Billy Bao	e59458c36b	Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819 ) ### What problem does this PR solve? Fix: parsing excel with chartsheet #10815 Fix: Clamp begin to a minimum of 0 to prevent negative indexing #10804 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-28 09:40:37 +08:00
Yongteng Lei	5acc407240	Feat: MinerU supports VLM-Transformers backend (#10809 ) ### What problem does this PR solve? MinerU supports the VLM-Transformers backend. Set `MINERU_BACKEND` to choose the backend, for example `MINERU_BACKEND="pipeline"`. (Options: pipeline | vlm-transformers; default is pipeline.) ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-10-27 17:04:13 +08:00
aidan	33a189f620	Feat: add TCADP Parser (#10775 ) ### What problem does this PR solve? This PR adds a new TCADP (Tencent Cloud Advanced Document Processing) parser to RAGFlow, enabling users to leverage Tencent Cloud's document parsing capabilities for more accurate and structured document processing. The implementation includes: - New TCADP Parser: A complete implementation of Tencent Cloud's document parsing API without SDK dependency - Configuration Support: Added configuration options in service_conf.yaml for Tencent Cloud API credentials - Frontend Integration: Updated UI components to support the new TCADP parser option - Error Handling: Comprehensive error handling and retry mechanisms for API calls - Result Processing: Support for both SSE streaming and JSON response formats from Tencent Cloud API ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-10-27 15:14:58 +08:00
Zhichang Yu	73144e278b	Don't release full image (#10654 ) ### What problem does this PR solve? - Introduced gpu profile in .env - Added Dockerfile_tei - fix datrie - Removed LIGHTEN flag ### Type of change - [x] Documentation Update - [x] Refactoring	2025-10-23 23:02:27 +08:00
buua436	0ff2042fc1	Feat: add Docling parser (#10759 ) ### What problem does this PR solve? issue: #3945 change: add Docling parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-23 19:44:25 +08:00
buua436	41fade3fe6	Fix:wrong param in manual chunk (#10710 ) ### What problem does this PR solve? change: wrong param in manual chunk ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-21 20:10:54 +08:00
Stephen Hu	9d12380806	Fix: Excel2HTML can't support XLS（Excel 97-2003） (#10660 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/10602 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-21 09:52:59 +08:00
buua436	6ab96287c9	Feat:Vision Model Image Enhancement in Manual/Paper/Book/One chunker (#10640 ) ### What problem does this PR solve? issue: [#7472](https://github.com/infiniflow/ragflow/issues/7472) change: Vision Model Image Enhancement in Manual chunker ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-21 09:36:27 +08:00
Yongteng Lei	387baf858f	Feat: add MinerU parser (#10621 ) ### What problem does this PR solve? Add MinerU parser. #3945, #8092. Set `MINERU_EXECUTABLE` to the MinerU executable path, defaults to `mineru`. Set `MINERU_DELETE_OUTPUT=0` to preserve MinerU's output, default is 1, which deletes temporary output. Set `MINERU_OUTPUT_DIR` to choose the MinerU output directory (uses the temporary directory if unset). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-17 09:55:39 +08:00
Yongteng Lei	5200711441	Feat: add support for multi-column PDF parsing (#10475 ) ### What problem does this PR solve? Add support for multi-columns PDF parsing. #9878, #9919. Two-column sample: <img width="1885" height="1020" alt="image" src="https://github.com/user-attachments/assets/0270c028-2db8-4ca6-a4b7-cd5830882d28" /> Three-column sample: <img width="1881" height="992" alt="image" src="https://github.com/user-attachments/assets/9ee88844-d5b1-4927-9e4e-3bd810d6e03a" /> Single-column sample: <img width="1883" height="1042" alt="image" src="https://github.com/user-attachments/assets/e93d3d18-43c3-4067-b5fa-e454ed0ab093" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:46:09 +08:00
Kevin Hu	7d2f65671f	Feat: debugging toc part. (#10486 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:45:21 +08:00
Billy Bao	534fa60b2a	Fix: Agent.reset() argument wrong #10463 & Unable to converse with agent through Python API. #10415 (#10472 ) ### What problem does this PR solve? Fix: Agent.reset() argument wrong #10463 & Unable to converse with agent through Python API. #10415 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-10 20:44:05 +08:00
Kevin Hu	0d8791936e	Feat: TOC retrieval (#10456 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-10 17:07:55 +08:00
Billy Bao	f04c9e2937	Fix: correctly update parser method & correct vllm pdf parser (#10441 ) ### What problem does this PR solve? Fix: correctly update parser method ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2025-10-09 19:03:12 +08:00
Kevin Hu	cbf04ee470	Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423 ) ### What problem does this PR solve? #9869 ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: jinhai <haijin.chn@gmail.com> Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: chanx <1243304602@qq.com> Co-authored-by: balibabu <cike8899@users.noreply.github.com> Co-authored-by: Lynn <lynn_inf@hotmail.com> Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com> Co-authored-by: huangzl <huangzl@shinemo.com> Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com> Co-authored-by: Wilmer <33392318@qq.com> Co-authored-by: Adrian Weidig <adrianweidig@gmx.net> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Yongteng Lei <yongtengrey@outlook.com> Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com> Co-authored-by: BadwomanCraZY <511528396@qq.com> Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com> Co-authored-by: Russell Valentine <russ@coldstonelabs.org> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Billy Bao <newyorkupperbay@gmail.com> Co-authored-by: Zhedong Cen <cenzhedong2@126.com> Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com> Co-authored-by: TensorNull <tensor.null@gmail.com> Co-authored-by: TeslaZY <TeslaZY@outlook.com> Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com> Co-authored-by: AB <aj@Ajays-MacBook-Air.local> Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com> Co-authored-by: He Wang <wanghechn@qq.com> Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Mohamed Mathari 
<155896313+melmathari@users.noreply.github.com> Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box> Co-authored-by: Stephen Hu <stephenhu@seismic.com> Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com> Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com> Co-authored-by: mxc <mxc@example.com> Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com> Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com> Co-authored-by: mcoder6425 <mcoder64@gmail.com> Co-authored-by: lemsn <lemsn@msn.com> Co-authored-by: lemsn <lemsn@126.com> Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com> Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com> Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>	2025-10-09 12:36:19 +08:00
Jin Hai	4eb7659499	Fix bug: broken import from rag.prompts.prompts (#10217 ) ### What problem does this PR solve? Fix broken imports ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	2025-09-23 10:19:25 +08:00
Lynn	62d35b1b73	Fix: handle zero (#10149 ) ### What problem does this PR solve? Handle zero and nan in calculate. #10125 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-09-18 16:28:03 +08:00
Lynn	d353f7f7f8	Feat/parse audio (#10133 ) ### What problem does this PR solve? Dataflow supports audio. Also fixes giteeAI's sequence2text model. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-09-18 09:31:32 +08:00
buua436	c9ea22ef69	Fix: set default chunk_token_num in html_parser (#10118 ) ### What problem does this PR solve? issue: [Bug]: Agent component (HTTP Request) "'>' not supported between instances of 'int' and 'NoneType'" [#10096](https://github.com/infiniflow/ragflow/issues/10096) Change: When the Invoke class instantiates HtmlParser without providing the chunk_token_num parameter, the value defaults to None, leading to a comparison error with block_token_count. This change sets the default chunk_token_num to 512 to prevent such errors. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: BadwomanCraZY <511528396@qq.com>	2025-09-17 09:36:31 +08:00
Yongteng Lei	bc0281040b	Feat: add support for the Ascend layout recognizer (#10105 ) ### What problem does this PR solve? Supports Ascend layout recognizer. Use the environment variable `LAYOUT_RECOGNIZER_TYPE=ascend` to enable the Ascend layout recognizer, and `ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID=n` (for example, n=0) to specify the Ascend device ID. Ensure that you have installed the [ais tools](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench) properly. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-16 09:51:15 +08:00
Yongteng Lei	0d9c1f1c3c	Feat: dataflow supports Spreadsheet and Word processor document (#9996 ) ### What problem does this PR solve? Dataflow supports Spreadsheet and Word processor document ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-10 13:02:53 +08:00
湛露先生	1ee9c0b8d9	fix xss in excel_parser (#9909 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>	2025-09-05 09:58:03 +08:00
Kevin Hu	c27172b3bc	Feat: init dataflow. (#9791 ) ### What problem does this PR solve? #9790 Close #9782 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-28 18:40:32 +08:00
pingguoCooler	cf0011be67	Feat: Upgrade html parser (#9675 ) ### What problem does this PR solve? parse more html content. ### Type of change - [x] Other (please describe):	2025-08-27 12:43:55 +08:00
Yongteng Lei	382458ace7	Feat: advanced markdown parsing (#9607 ) ### What problem does this PR solve? Using AST parsing to handle markdown more accurately, preventing components from being cut off by chunking. #9564 <img width="1746" height="993" alt="image" src="https://github.com/user-attachments/assets/4aaf4bf6-5714-4d48-a9cf-864f59633f7f" /> <img width="1739" height="982" alt="image" src="https://github.com/user-attachments/assets/dc00233f-7a55-434f-bbb7-74ce7f57a6cf" /> <img width="559" height="100" alt="image" src="https://github.com/user-attachments/assets/4a556b5b-d9c6-4544-a486-8ac342bd504e" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-21 09:36:18 +08:00
Yongteng Lei	eef43fa25c	Fix: unexpected truncated Excel files (#9500 ) ### What problem does this PR solve? Handle unexpected truncated Excel files. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-15 17:00:34 +08:00
Jay Xu	79e2edc835	Fix "File contains no valid workbook part" (#9360 ) ### What problem does this PR solve? fix "File contains no valid workbook part" stacktrace:
```
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/excel_parser.py", line 54, in _load_excel_to_workbook
    return RAGFlowExcelParser._dataframe_to_workbook(df)
  File "/ragflow/deepdoc/parser/excel_parser.py", line 69, in _dataframe_to_workbook
    ws.cell(row=row_num, column=col_num, value=value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/worksheet/worksheet.py", line 246, in cell
    cell.value = value
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 218, in value
    self._bind_value(value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 197, in _bind_value
    value = self.check_string(value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 165, in check_string
    raise IllegalCharacterError(f"{value} cannot be used in worksheets.")
```
### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2025-08-12 14:58:36 +08:00
Jay Xu	569ab011c4	Add fallback to use 'calamine' parse engine in excel_parser.py (#9374 ) ### What problem does this PR solve? Add a fallback to the `calamine` engine when a parse error is raised by the default `openpyxl` / `xlrd` engine. For example, the following error can be fixed:
```
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/excel_parser.py", line 53, in _load_excel_to_workbook
    df = pd.read_excel(file_like_object)
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 495, in read_excel
    io = ExcelFile(
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1567, in __init__
    self._reader = self._engines[engine](
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 46, in __init__
    super().__init__(
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 573, in __init__
    self.book = self.load_workbook(self.handles.handle, engine_kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 63, in load_workbook
    return open_workbook(file_contents=data, **engine_kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/__init__.py", line 172, in open_workbook
    bk = open_workbook_xls(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 68, in open_workbook_xls
    bk.biff2_8_load(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 641, in biff2_8_load
    cd.locate_named_stream(UNICODE_LITERAL(qname))
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 398, in locate_named_stream
    result = self._locate_stream(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 429, in _locate_stream
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4
```
### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-12 12:41:33 +08:00
Jay Xu	6ad8b54754	fix "TypeError: '<' not supported between instances of 'Emu' and 'NoneType'" (#9209 ) ### What problem does this PR solve? fix "TypeError: '<' not supported between instances of 'Emu' and 'NoneType'" ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-04 16:07:03 +08:00
Kevin Hu	a16cd4f110	Refa: add result to callback for agent tool use. (#9137 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-08-01 21:49:39 +08:00
Yongteng Lei	39ef2ffba9	Feat: parsing supports jsonl or ldjson format (#9087 ) ### What problem does this PR solve? Supports jsonl or ldjson format. Feature request from [discussion](https://github.com/orgs/infiniflow/discussions/8774). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-07-30 09:48:20 +08:00
Jin Hai	03daf4618c	Refactor parser code (#9042 ) ### What problem does this PR solve? Refactor code ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-07-25 12:04:07 +08:00
Kevin Hu	ecdb1701df	Perf: test llm before RAPTOR. (#8897 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-07-17 16:48:50 +08:00
Yongteng Lei	51a8604dcb	Fix: fixed context loss caused by separating markdown tables from original text (#8844 ) ### What problem does this PR solve? Fix context loss caused by separating markdown tables from original text. #6871, #8804. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-15 13:03:01 +08:00
Kevin Hu	6d256ff0f5	Perf: ignore concate between rows. (#8507 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-06-26 14:55:37 +08:00
Jin Hai	4a2ff633e0	Fix typo in code (#8327 ) ### What problem does this PR solve? Fix typo in code ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-06-18 09:41:09 +08:00
Stephen Hu	2e44c3b743	Fix:Unimplemented function in ppt_parser (#8095 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8088 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-06 10:05:58 +08:00
giiiiiithub	6ba5a4348a	set PARALLEL_DEVICES default value= 0 (#7935 ) ### What problem does this PR solve? The OCR class fails if PARALLEL_DEVICES is None, because it passes 0 to the TextDetector and TextRecognizer init methods. It is simpler to set 0 as the default value for PARALLEL_DEVICES. ### Type of change - [x] Refactoring	2025-05-29 13:32:16 +08:00
Emmanuel Ferdman	d4a123d6dd	Fix: resolve regex library warnings (#7782 ) ### What problem does this PR solve? This small PR resolves the regex library warnings showing in Python3.11: ```python DeprecationWarning: 'count' is passed as positional argument ``` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-22 10:06:28 +08:00
Yongteng Lei	b908c33464	Fix: uncaptured image data with position information (#7683 ) ### What problem does this PR solve? Fixed uncaptured figure data with position information. #7466, #7681 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-05-19 19:33:28 +08:00
liuzhenghua	ea5e8caa69	feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562 ) ### What problem does this PR solve? When the PDF uses vector fonts, the rendered text in the captured page image often has missing strokes, leading to numerous OCR errors and incorrect characters. Similar issues also occur in the extracted chart images. Before ![0089e1f76205b5b3](https://github.com/user-attachments/assets/a84f8cd7-48ae-4da4-81ca-fc0bd93320f1) After ![03053149e919773a](https://github.com/user-attachments/assets/45fa5ebb-a2de-42b1-9535-1ea087877eb2) You can use the following document for testing. [Casio说明书.pdf](https://github.com/user-attachments/files/20119690/Casio.pdf) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>	2025-05-12 09:50:21 +08:00
Kevin Hu	a14865e6bb	Fix: empty query issue. (#7551 ) ### What problem does this PR solve? #5214 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-09 12:20:19 +08:00
Kevin Hu	9d3dd13fef	Refa: text order be robuster. (#7525 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-05-08 12:58:10 +08:00
Stephen Hu	953b3e1b3f	Fix: Sometimes VisionFigureParser.figures may be a tuple (#7477 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/7466 I think this happens because sometimes we cannot get the position. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-06 17:38:22 +08:00
liuzhenghua	2f768b96e8	perf: optimize figure parser (#7392 ) ### What problem does this PR solve? When parsing documents containing images, the current code uses a single-threaded approach to call the VL model, resulting in extremely slow parsing speed (e.g., parsing a Word document with dozens of images takes over 20 minutes). By switching to a multithreaded approach to call the VL model, the parsing speed can be improved to an acceptable level. ### Type of change - [x] Performance Improvement --------- Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>	2025-05-06 14:39:45 +08:00
zhudongwork	10432a1be7	Refa: Optimize pptx shape extraction to reduce content loss (#6703 ) ### What problem does this PR solve? When parsing pptx files, some shapes do not contain the `shape_type` attribute, which causes the original code to throw an exception during extraction, leading to failure in content extraction. This optimization introduces handling logic for such anomalous shapes, providing a safer and more robust processing mechanism. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [x] Performance Improvement - [ ] Other (please describe):	2025-04-22 10:16:24 +08:00
gsmini	53c653b099	fix RAGFlowPdfParser AttributeError: 'PdfReader' object has no attribute 'close' err (#6859 ) I use PdfParser locally (refer to this case: https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like this:
```
import re
import openpyxl
from ragflow.api.db import ParserType
from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \
    title_frequency, \
    tokenize_chunks
from ragflow.rag.utils import num_tokens_from_string
from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser, PlainParser


def logger(prog=None, msg=""):
    print(msg)


class Pdf(PdfParser):
    def __init__(self):
        self.model_speciess = ParserType.MANUAL.value
        super().__init__()

    def __call__(self, filename, binary=None, from_page=0,
                 to_page=100000, zoomin=3, callback=None):
        from timeit import default_timer as timer
        start = timer()
        callback(msg="OCR is running...")
        self.__images__(
            filename if not binary else binary,
            zoomin,
            from_page,
            to_page,
            callback
        )
        callback(msg="OCR finished.")
        print("OCR:", timer() - start)

        self._layouts_rec(zoomin)
        callback(0.65, "Layout analysis finished.")
        print("layouts:", timer() - start)
        self._table_transformer_job(zoomin)
        callback(0.67, "Table analysis finished.")
        self._text_merge()
        tbls = self._extract_table_figure(True, zoomin, True, True)
        self._concat_downward()
        self._filter_forpages()
        callback(0.68, "Text merging finished")

        # clean mess
        for b in self.boxes:
            b["text"] = re.sub(r"([\t 　]|\u3000){2,}", " ", b["text"].strip())

        return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin))
                for i, b in enumerate(self.boxes)], tbls
```
It shows an error like this:
```
File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__
    self.pdf.close()
AttributeError: 'PdfReader' object has no attribute 'close'
```
I found that the ragflow source code uses `pdfplumber.open` (https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43) and replaces `self.pdf` with `pdf2_read` (`from pypdf import PdfReader as pdf2_read`) in line 1024 (https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024):
```
self.pdf = pdf2_read
```
--- I also found that `pdfplumber` can be used in this way:
```
file_path="xxx.pdf"
res = pdfplumber.open(file_path)
res.close()
```
but the `pypdf.PdfReader` source code does not have a `close` function; the source code opens the stream like this:
```
with open(stream, "rb") as fh:
    stream = BytesIO(fh.read())
self._stream_opened = True
```
> https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156 so I moved the `self.pdf.close` function call and fixed this problem, hoping to help the project😊	2025-04-14 09:40:13 +08:00
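
The MinerU settings named in #10621 and #10809 can be sketched as a small loader. This is only an illustration of the defaults stated in those commit messages, not RAGFlow's actual code: `MineruSettings` and `load_mineru_settings` are hypothetical names, and using the system temporary directory for an unset `MINERU_OUTPUT_DIR` is an assumption about what "the temporary directory" means.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional

# Backend options as listed in the #10809 commit message.
ALLOWED_BACKENDS = ("pipeline", "vlm-transformers")


@dataclass
class MineruSettings:
    executable: str
    delete_output: bool
    output_dir: str
    backend: str


def load_mineru_settings(env: Optional[Mapping[str, str]] = None) -> MineruSettings:
    """Hypothetical helper: resolve the MinerU variables with their stated defaults."""
    env = os.environ if env is None else env
    backend = env.get("MINERU_BACKEND", "pipeline")
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported MINERU_BACKEND: {backend!r}")
    return MineruSettings(
        # Defaults to the `mineru` executable on PATH.
        executable=env.get("MINERU_EXECUTABLE", "mineru"),
        # Default is 1 (delete temporary output); "0" preserves it.
        delete_output=env.get("MINERU_DELETE_OUTPUT", "1") != "0",
        # Assumption: fall back to the system temporary directory when unset.
        output_dir=env.get("MINERU_OUTPUT_DIR") or tempfile.gettempdir(),
        backend=backend,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise without touching the real environment.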

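The failure described in #10118 (comparing an integer token count against a `chunk_token_num` of `None`) can be reproduced and avoided with a default value. The sketch below is a simplified stand-in, not the real `HtmlParser`: `split_blocks` and `DEFAULT_CHUNK_TOKEN_NUM` are hypothetical names, and only the default of 512 comes from the commit message.

```python
from typing import List, Optional

DEFAULT_CHUNK_TOKEN_NUM = 512  # default value named in #10118


def split_blocks(block_token_counts: List[int],
                 chunk_token_num: Optional[int] = None) -> List[List[int]]:
    """Group consecutive blocks so each group stays within chunk_token_num tokens."""
    # Without this fallback, the `>` comparison below raises
    # TypeError when the caller leaves chunk_token_num as None.
    if chunk_token_num is None:
        chunk_token_num = DEFAULT_CHUNK_TOKEN_NUM
    chunks: List[List[int]] = []
    current: List[int] = []
    total = 0
    for count in block_token_counts:
        # Start a new chunk when adding this block would exceed the limit.
        if current and total + count > chunk_token_num:
            chunks.append(current)
            current, total = [], 0
        current.append(count)
        total += count
    if current:
        chunks.append(current)
    return chunks
```

Resolving the default inside the function, rather than only in the signature, also covers callers that pass `None` explicitly.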

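The change in #7392 replaces sequential vision-model calls with multithreaded ones. A minimal sketch of that idea, assuming a caller-supplied `describe` function that wraps the model call (both `describe_figures` and `describe` are hypothetical names, and the worker count is arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def describe_figures(figures: Sequence[bytes],
                     describe: Callable[[bytes], str],
                     max_workers: int = 8) -> List[str]:
    """Call describe on every figure concurrently; results keep the input order."""
    if not figures:
        return []
    # Threads suit this workload because each call mostly waits on the model.
    with ThreadPoolExecutor(max_workers=min(max_workers, len(figures))) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(describe, figures))
```

Because `pool.map` preserves input order, each description can still be matched to its figure by index.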
169 Commits