ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-02-02 00:25:06 +08:00

Author	SHA1	Message	Date
Stephen Hu	2e44c3b743	Fix: Unimplemented function in ppt_parser (#8095) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8088 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-06 10:05:58 +08:00
giiiiiithub	6ba5a4348a	Set PARALLEL_DEVICES default value to 0 (#7935) ### What problem does this PR solve? It would fail if PARALLEL_DEVICES = None in the OCR class, because it passes 0 to the TextDetector and TextRecognizer init methods. It is simpler to set 0 as the default value for PARALLEL_DEVICES. ### Type of change - [x] Refactoring	2025-05-29 13:32:16 +08:00
Emmanuel Ferdman	d4a123d6dd	Fix: resolve regex library warnings (#7782 ) ### What problem does this PR solve? This small PR resolves the regex library warnings showing in Python3.11: ```python DeprecationWarning: 'count' is passed as positional argument ``` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-22 10:06:28 +08:00
Yongteng Lei	b908c33464	Fix: uncaptured image data with position information (#7683 ) ### What problem does this PR solve? Fixed uncaptured figure data with position information. #7466, #7681 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-05-19 19:33:28 +08:00
liuzhenghua	ea5e8caa69	feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562 ) ### What problem does this PR solve? When the PDF uses vector fonts, the rendered text in the captured page image often has missing strokes, leading to numerous OCR errors and incorrect characters. Similar issues also occur in the extracted chart images. Before ![0089e1f76205b5b3](https://github.com/user-attachments/assets/a84f8cd7-48ae-4da4-81ca-fc0bd93320f1) After ![03053149e919773a](https://github.com/user-attachments/assets/45fa5ebb-a2de-42b1-9535-1ea087877eb2) You can use the following document for testing. [Casio说明书.pdf](https://github.com/user-attachments/files/20119690/Casio.pdf) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>	2025-05-12 09:50:21 +08:00
Kevin Hu	a14865e6bb	Fix: empty query issue. (#7551 ) ### What problem does this PR solve? #5214 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-09 12:20:19 +08:00
Kevin Hu	9d3dd13fef	Refa: make text ordering more robust. (#7525) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-05-08 12:58:10 +08:00
Stephen Hu	953b3e1b3f	Fix: VisionFigureParser.figures may sometimes be a tuple (#7477) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/7466 I think this happens because sometimes we cannot get the position. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-06 17:38:22 +08:00
liuzhenghua	2f768b96e8	perf: optimize figure parser (#7392) ### What problem does this PR solve? When parsing documents containing images, the current code uses a single-threaded approach to call the VL model, resulting in extremely slow parsing speed (e.g., parsing a Word document with dozens of images takes over 20 minutes). By switching to a multithreaded approach to call the VL model, the parsing speed can be improved to an acceptable level. ### Type of change - [x] Performance Improvement --------- Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>	2025-05-06 14:39:45 +08:00
zhudongwork	10432a1be7	Refa: Optimize pptx shape extraction to reduce content loss (#6703 ) ### What problem does this PR solve? When parsing pptx files, some shapes do not contain the `shape_type` attribute, which causes the original code to throw an exception during extraction, leading to failure in content extraction. This optimization introduces handling logic for such anomalous shapes, providing a safer and more robust processing mechanism. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [x] Performance Improvement - [ ] Other (please describe):	2025-04-22 10:16:24 +08:00
Kevin Hu	ed5f81b02e	Fix: abnormal cell merging. (#6991) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-04-14 11:00:11 +08:00
gsmini	53c653b099	Fix RAGFlowPdfParser AttributeError: 'PdfReader' object has no attribute 'close' (#6859) I use PdfParser locally (following this example: https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like this: ``` import re import openpyxl from ragflow.api.db import ParserType from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \ title_frequency, \ tokenize_chunks from ragflow.rag.utils import num_tokens_from_string from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser,PlainParser def logger(prog=None, msg=""): print(msg) class Pdf(PdfParser): def __init__(self): self.model_speciess = ParserType.MANUAL.value super().__init__() def __call__(self, filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None): from timeit import default_timer as timer start = timer() callback(msg="OCR is running...") self.__images__( filename if not binary else binary, zoomin, from_page, to_page, callback ) callback(msg="OCR finished.") print("OCR:", timer() - start) self._layouts_rec(zoomin) callback(0.65, "Layout analysis finished.") print("layouts:", timer() - start) self._table_transformer_job(zoomin) callback(0.67, "Table analysis finished.") self._text_merge() tbls = self._extract_table_figure(True, zoomin, True, True) self._concat_downward() self._filter_forpages() callback(0.68, "Text merging finished") # clean mess for b in self.boxes: b["text"] = re.sub(r"([\t 　]\|\u3000){2,}", " ", b["text"].strip()) return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin)) for i, b in enumerate(self.boxes)], tbls ``` It raises this error: ``` File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__ self.pdf.close() AttributeError: 'PdfReader' object has no attribute 'close' ``` I found that the ragflow source code uses `pdfplumber.open` (https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43) and then replaces `self.pdf` with `pdf2_read` (from pypdf import PdfReader as pdf2_read) on line 1024 (https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024): ``` self.pdf = pdf2_read ``` --- I also found that `pdfplumber` can be used this way: ``` file_path="xxx.pdf" res = pdfplumber.open(file_path) res.close() ``` but the `pypdf.PdfReader` source code does not have a `close` function; the source code works like this: ``` with open(stream, "rb") as fh: stream = BytesIO(fh.read()) self._stream_opened = True ``` > https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156 So I moved the `self.pdf.close` function call, which fixed this problem. Hoping this helps the project 😊	2025-04-14 09:40:13 +08:00
Kevin Hu	3bb1e012e6	Fix: assistant deletion issue. (#6906) ### What problem does this PR solve? #6875 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-04-09 20:29:40 +08:00
Kevin Hu	2caf15b24c	Refa: trivial. (#6802) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-04-03 19:01:24 +08:00
donblack01	0b48a2e0d1	Fix: when an Excel cell contains a formula, the parsed result is the formula and cannot be correctly parsed as a value type (#6613) ### What problem does this PR solve? Fix: when an Excel cell contains a formula, the parsed result is the formula and cannot be correctly parsed as a value type ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tangyu <1@1.com>	2025-03-28 09:33:49 +08:00
Stephen Hu	d77380f024	Feat: support picture-based bullets for PPT (#6406) ### What problem does this PR solve? Support picture-based bullets for PPT. Fix one mistake in the documentation. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-24 09:31:31 +08:00
Yongteng Lei	9611185eb4	Feat: add VLM-boosted DocX parser (#6307 ) ### What problem does this PR solve? Add VLM-boosted DocX parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 11:24:44 +08:00
Yongteng Lei	1d6760dd84	Feat: add VLM-boosted PDF parser (#6278 ) ### What problem does this PR solve? Add VLM-boosted PDF parser if VLM is set. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 09:39:32 +08:00
Yongteng Lei	5cf610af40	Feat: add vision LLM PDF parser (#6173 ) ### What problem does this PR solve? Add vision LLM PDF parser ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-03-18 14:52:20 +08:00
Stephen Hu	b0b4b7ba33	Feat: Improve Recognizer.py performance (#6185) ### What problem does this PR solve? In the create_inputs method, replace the for loop with NumPy operations. ### Type of change - [x] Performance Improvement	2025-03-18 09:39:49 +08:00
Stephen Hu	79482ff672	Refa: Improve ppt_parser to better handle lists (#6162) ### What problem does this PR solve? This pull request (PR) incorporates code for parsing PPTX files, aiming to more precisely depict text in list formats (hint list by .). ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2025-03-17 17:02:39 +08:00
Kevin Hu	3a99c2b5f4	Refa: PARALLEL_DEVICES is a static parameter. (#6168 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-03-17 16:49:54 +08:00
Kevin Hu	bfa8d342b3	Fix: retrieval debug mode issue. (#6150 ) ### What problem does this PR solve? #6139 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-17 13:07:13 +08:00
Debug Doctor	3e19044dee	Feat: add OCR's multi-GPU and parallel processing support (#5972) ### What problem does this PR solve? Add OCR's multi-GPU and parallel processing support ### Type of change - [x] New Feature (non-breaking change which adds functionality) @yuzhichang I've tried to resolve the comments in #5697. OCR jobs can now be done on both CPU and GPU. (By the way, I've encountered a “Generate embedding error” issue #5954 that might be due to my outdated GPUs? idk.) Please review it and give me suggestions. GPU: ![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e) ![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d) CPU: ![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)	2025-03-17 11:58:40 +08:00
Yongteng Lei	4ff609b6a8	Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027 ) ### What problem does this PR solve? Optimize OCR garbage identification to reduce unnecessary filtering. #5713 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-13 18:48:32 +08:00
Yongteng Lei	7cd37c37cd	Feat: add CSV file parsing support (#5989 ) ### What problem does this PR solve? Add CSV file parsing support #4552, #5849, #5870 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-12 19:20:50 +08:00
donblack01	b1a46d5adc	Fix: when start with source code not in docker env report 'UnicodeDec… (#5802) ### What problem does this PR solve? Fix: when starting from source code outside a Docker environment on Windows, it reports "UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 5: illegal multibyte sequence" ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tangyu <1@1.com>	2025-03-10 11:22:06 +08:00
liwenju0	5b0e38060a	Feat: Optimize the table extraction logic in the Markdown parser: (#5663) Enhance the recognition of both borderless and bordered Markdown tables. Add support for extracting HTML tables, including various scenarios with nested HTML tags. Improve performance by using conditional checks to reduce unnecessary regular expression matching. ### What problem does this PR solve? Optimize the table extraction logic in the Markdown parser: Enhance the recognition of both borderless and bordered Markdown tables. Add support for extracting HTML tables, including various scenarios with nested HTML tags. Improve performance by using conditional checks to reduce unnecessary regular expression matching. ### Type of change - [x] Performance Improvement Co-authored-by: wenju.li <wenju.li@deepctr.cn>	2025-03-07 17:02:35 +08:00
Kevin Hu	8fb8374dfc	Fix: delimiter issue. (#5720 ) ### What problem does this PR solve? #5704 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-06 17:51:22 +08:00
yihong	4326873af6	refactor: no need to inherit in python3 clean the code (#5659 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: yihong0618 <zouzou0208@gmail.com>	2025-03-05 18:03:53 +08:00
非法操作	ca04ae9540	Minor: improve doc and rm unused file (#5634 ) ### What problem does this PR solve? The `ocr.res` file is already included in the model directory `rag/res/deepdoc`, but it doesn't seem to be utilized here. ### Type of change - [x] Documentation Update	2025-03-05 12:59:54 +08:00
hy89	b0c21b00d9	Refactor: Optimize error handling and support parsing of XLS (Excel 97-2003) files. (#5633) Optimize error handling and support parsing of XLS (Excel 97-2003) files.	2025-03-05 11:55:27 +08:00
Zhichang Yu	c813c1ff4c	Made task_executor async to speed up parsing (#5530) ### What problem does this PR solve? Made task_executor async to speed up parsing ### Type of change - [x] Performance Improvement	2025-03-03 18:59:49 +08:00
yihong	8a2542157f	Fix: possible memory leaks, close #5277 (#5500) ### What problem does this PR solve? Close #5277 by making sure the file is closed ### Type of change - [x] Performance Improvement --------- Signed-off-by: yihong0618 <zouzou0208@gmail.com>	2025-03-03 10:26:45 +08:00
yihong	37aacb3960	Refa: drop useless fasttext (#5470) ### What problem does this PR solve? This patch drops fasttext, which seems to be unused in the code base and is quite hard to install. Should close #4498 ### Type of change - [x] Refactoring Signed-off-by: yihong0618 <zouzou0208@gmail.com>	2025-02-28 14:30:56 +08:00
Yongteng Lei	83d0949498	Fix: fix special delimiter parsing issue (#5448 ) ### What problem does this PR solve? Fix special delimiter parsing issue #5382 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-27 18:33:55 +08:00
Zhichang Yu	db42d0e0ae	Optimize ocr (#5297 ) ### What problem does this PR solve? Introduced OCR.recognize_batch ### Type of change - [x] Performance Improvement	2025-02-24 16:21:55 +08:00
Zhichang Yu	0151d42156	Reuse loaded modules if possible (#5231 ) ### What problem does this PR solve? Reuse loaded modules if possible ### Type of change - [x] Refactoring	2025-02-21 17:21:01 +08:00
Zhichang Yu	c326f14fed	Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly (#5182 ) ### What problem does this PR solve? Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly ### Type of change - [x] Performance Improvement	2025-02-20 15:41:12 +08:00
Kevin Hu	b08bb56f6c	Display thinking for deepseek r1 (#4904 ) ### What problem does this PR solve? #4903 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-02-12 15:43:13 +08:00
Mathias Panzenböck	6b389e01b5	Remove use of eval() from operators.py (#4888 ) Use `np.float32()` instead. ### What problem does this PR solve? Using `eval()` can lead to code injections. I think `eval()` is only used to parse a floating point number here. This change preserves the correct behavior if the string `"None"` is supplied. But if that behavior isn't intended then this part could be just deleted instead, since `np.float32()` is parsing strings anyway: ```Python if isinstance(scale, str): scale = eval(scale) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-12 12:53:42 +08:00
SkyfireWXY	8fcca1b958	fix: big xls file error (#4859) ### What problem does this PR solve? If a *.xls file is too large, e.g. >50M, I get an error. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-12 12:39:25 +08:00
Zhichang Yu	3411d0a2ce	Added cuda_is_available (#4725 ) ### What problem does this PR solve? Added cuda_is_available ### Type of change - [x] Refactoring	2025-02-05 18:01:23 +08:00
Zhichang Yu	e1526846da	Fixed GPU detection on CPU only environment (#4711 ) ### What problem does this PR solve? Fixed GPU detection on CPU only environment. Close #4692 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-05 12:02:43 +08:00
Kevin Hu	6f30397bb5	Infinity adapt to graphrag. (#4663 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-01-27 18:35:18 +08:00
Kevin Hu	1bff6b7333	Fix t_ocr.py for PNG image. (#4625 ) ### What problem does this PR solve? #4586 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-01-24 11:47:27 +08:00
Zhichang Yu	4230402fbb	deepdoc use GPU if possible (#4618 ) ### What problem does this PR solve? deepdoc use GPU if possible ### Type of change - [x] Refactoring	2025-01-24 09:48:02 +08:00
Mathias Panzenböck	1a367664f1	Remove usage of eval() from postprocess.py (#4571 ) Remove usage of `eval()` from postprocess.py ### What problem does this PR solve? The use of `eval()` is a potential security risk. While the use of `eval()` is guarded and thus not a security risk normally, `assert`s aren't run if `-O` or `-OO` is passed to the interpreter, and as such then the guard would not apply. In any case there is no reason to use `eval()` here at all. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Other (please describe): Potential security fix if somehow the passed `modul_name` could be user controlled.	2025-01-22 19:37:24 +08:00
Jin Hai	3894de895b	Update comments (#4569 ) ### What problem does this PR solve? Add license statement. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-01-21 20:52:28 +08:00
Mathias Panzenböck	75e1981e13	Remove use of eval() from recognizer.py (#4480 ) `eval(op_type)` -> `getattr(operators, op_type)` ### What problem does this PR solve? Using `eval()` can lead to code injections and is entirely unnecessary here. ### Type of change - [x] Other (please describe): Best practice code improvement, preventing the possibility of code injection.	2025-01-20 09:52:47 +08:00
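
The three `eval()` removals listed above (75e1981e13, 1a367664f1, 6b389e01b5) share one idea: resolve a name against a known namespace, or parse a value with a numeric constructor, instead of evaluating a string. The following is a minimal sketch of that pattern, not the repository's code. The `operators` namespace, the two classes, and `build_operator` are hypothetical names made up for illustration, and `float()` stands in for `np.float32()` to keep the sketch dependency-free.

```python
import types


# Hypothetical stand-ins for operator classes; illustrative only,
# not the classes defined in deepdoc.
class NormalizeImage:
    def __init__(self, scale=1.0):
        # float() parses a numeric string without executing arbitrary code
        # (the commit itself uses np.float32() for this).
        self.scale = float(scale)


class ToCHWImage:
    pass


# Hypothetical namespace playing the role of an `operators` module.
operators = types.SimpleNamespace(
    NormalizeImage=NormalizeImage,
    ToCHWImage=ToCHWImage,
)


def build_operator(op_type, **params):
    """Look up op_type by name with getattr() instead of eval(op_type)."""
    op_class = None
    if not op_type.startswith("_"):
        op_class = getattr(operators, op_type, None)
    if op_class is None:
        # An explicit check still runs under `python -O`, unlike an assert.
        raise ValueError(f"unknown operator: {op_type!r}")
    return op_class(**params)
```

With this shape, a string such as `"__import__('os')"` is only ever treated as an attribute name, so it is rejected with a `ValueError` rather than executed.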

216 Commits