ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-01-26 13:16:34 +08:00

Author	SHA1	Message	Date
Yongteng Lei	5200711441	Feat: add support for multi-column PDF parsing (#10475 ) ### What problem does this PR solve? Add support for multi-column PDF parsing. #9878, #9919. Two-column sample: <img width="1885" height="1020" alt="image" src="https://github.com/user-attachments/assets/0270c028-2db8-4ca6-a4b7-cd5830882d28" /> Three-column sample: <img width="1881" height="992" alt="image" src="https://github.com/user-attachments/assets/9ee88844-d5b1-4927-9e4e-3bd810d6e03a" /> Single-column sample: <img width="1883" height="1042" alt="image" src="https://github.com/user-attachments/assets/e93d3d18-43c3-4067-b5fa-e454ed0ab093" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:46:09 +08:00
Kevin Hu	7d2f65671f	Feat: debugging toc part. (#10486 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:45:21 +08:00
Billy Bao	534fa60b2a	Fix: Agent.reset() argument wrong #10463 & Unable to converse with agent through Python API. #10415 (#10472 ) ### What problem does this PR solve? Fix: Agent.reset() argument wrong #10463 & Unable to converse with agent through Python API. #10415 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-10 20:44:05 +08:00
Kevin Hu	0d8791936e	Feat: TOC retrieval (#10456 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-10 17:07:55 +08:00
Billy Bao	f04c9e2937	Fix: correctly update parser method & correct vllm pdf parser (#10441 ) ### What problem does this PR solve? Fix: correctly update parser method ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2025-10-09 19:03:12 +08:00
Kevin Hu	cbf04ee470	Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423 ) ### What problem does this PR solve? #9869 ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: jinhai <haijin.chn@gmail.com> Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: chanx <1243304602@qq.com> Co-authored-by: balibabu <cike8899@users.noreply.github.com> Co-authored-by: Lynn <lynn_inf@hotmail.com> Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com> Co-authored-by: huangzl <huangzl@shinemo.com> Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com> Co-authored-by: Wilmer <33392318@qq.com> Co-authored-by: Adrian Weidig <adrianweidig@gmx.net> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Yongteng Lei <yongtengrey@outlook.com> Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com> Co-authored-by: BadwomanCraZY <511528396@qq.com> Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com> Co-authored-by: Russell Valentine <russ@coldstonelabs.org> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Billy Bao <newyorkupperbay@gmail.com> Co-authored-by: Zhedong Cen <cenzhedong2@126.com> Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com> Co-authored-by: TensorNull <tensor.null@gmail.com> Co-authored-by: TeslaZY <TeslaZY@outlook.com> Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com> Co-authored-by: AB <aj@Ajays-MacBook-Air.local> Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com> Co-authored-by: He Wang <wanghechn@qq.com> Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Mohamed Mathari 
<155896313+melmathari@users.noreply.github.com> Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box> Co-authored-by: Stephen Hu <stephenhu@seismic.com> Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com> Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com> Co-authored-by: mxc <mxc@example.com> Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com> Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com> Co-authored-by: mcoder6425 <mcoder64@gmail.com> Co-authored-by: lemsn <lemsn@msn.com> Co-authored-by: lemsn <lemsn@126.com> Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com> Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com> Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>	2025-10-09 12:36:19 +08:00
Jin Hai	4eb7659499	Fix bug: broken import from rag.prompts.prompts (#10217 ) ### What problem does this PR solve? Fix broken imports ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	2025-09-23 10:19:25 +08:00
Lynn	62d35b1b73	Fix: handle zero (#10149 ) ### What problem does this PR solve? Handle zero and nan in calculate. #10125 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-09-18 16:28:03 +08:00
Lynn	d353f7f7f8	Feat/parse audio (#10133 ) ### What problem does this PR solve? Dataflow now supports audio. Also fixes GiteeAI's sequence2text model. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-09-18 09:31:32 +08:00
buua436	c9ea22ef69	Fix: set default chunk_token_num in html_parser (#10118 ) ### What problem does this PR solve? issue: [Bug]: Agent component (HTTP Request) "'>' not supported between instances of 'int' and 'NoneType'" [#10096](https://github.com/infiniflow/ragflow/issues/10096) Change: When the Invoke class instantiates HtmlParser without providing the chunk_token_num parameter, the value defaults to None, leading to a comparison error with block_token_count. This change sets the default chunk_token_num to 512 to prevent such errors. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: BadwomanCraZY <511528396@qq.com>	2025-09-17 09:36:31 +08:00
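The failure described in this commit message can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical `needs_split` helper; it is not RAGFlow's `HtmlParser` code, and only the default of 512 comes from the message above.

```python
DEFAULT_CHUNK_TOKEN_NUM = 512  # default value named in the commit message


def needs_split(block_token_count, chunk_token_num=None):
    """Return True when a block exceeds the chunk budget.

    Hypothetical helper: it falls back to the default instead of
    comparing an int with None.
    """
    if chunk_token_num is None:
        chunk_token_num = DEFAULT_CHUNK_TOKEN_NUM
    return block_token_count > chunk_token_num


# Without the fallback, `600 > None` raises:
# TypeError: '>' not supported between instances of 'int' and 'NoneType'
print(needs_split(600))        # compared against the 512 default
print(needs_split(600, 1024))  # compared against an explicit budget
```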
Yongteng Lei	bc0281040b	Feat: add support for the Ascend layout recognizer (#10105 ) ### What problem does this PR solve? Supports Ascend layout recognizer. Use the environment variable `LAYOUT_RECOGNIZER_TYPE=ascend` to enable the Ascend layout recognizer, and `ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID=n` (for example, n=0) to specify the Ascend device ID. Ensure that you have installed the [ais tools](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench) properly. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-16 09:51:15 +08:00
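Reading the two environment variables named above might look like the following. This is a sketch under assumptions: the function name and the fallback values (`"onnx"`, device `0`) are hypothetical and are not taken from the RAGFlow source.

```python
import os


def layout_recognizer_settings(env=None):
    """Read the two variables named in the commit message.

    The fallback values ("onnx", device 0) are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    kind = env.get("LAYOUT_RECOGNIZER_TYPE", "onnx").strip().lower()
    device_id = int(env.get("ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID", "0"))
    return kind, device_id


# Pass a plain dict to try the helper without touching the real environment.
print(layout_recognizer_settings({
    "LAYOUT_RECOGNIZER_TYPE": "ascend",
    "ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID": "0",
}))
```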
Yongteng Lei	0d9c1f1c3c	Feat: dataflow supports Spreadsheet and Word processor document (#9996 ) ### What problem does this PR solve? Dataflow supports Spreadsheet and Word processor document ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-10 13:02:53 +08:00
湛露先生	1ee9c0b8d9	fix xss in excel_parser (#9909 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>	2025-09-05 09:58:03 +08:00
Kevin Hu	c27172b3bc	Feat: init dataflow. (#9791 ) ### What problem does this PR solve? #9790 Close #9782 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-28 18:40:32 +08:00
pingguoCooler	cf0011be67	Feat: Upgrade html parser (#9675 ) ### What problem does this PR solve? Parse more HTML content. ### Type of change - [x] Other (please describe):	2025-08-27 12:43:55 +08:00
Yongteng Lei	382458ace7	Feat: advanced markdown parsing (#9607 ) ### What problem does this PR solve? Using AST parsing to handle markdown more accurately, preventing components from being cut off by chunking. #9564 <img width="1746" height="993" alt="image" src="https://github.com/user-attachments/assets/4aaf4bf6-5714-4d48-a9cf-864f59633f7f" /> <img width="1739" height="982" alt="image" src="https://github.com/user-attachments/assets/dc00233f-7a55-434f-bbb7-74ce7f57a6cf" /> <img width="559" height="100" alt="image" src="https://github.com/user-attachments/assets/4a556b5b-d9c6-4544-a486-8ac342bd504e" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-21 09:36:18 +08:00
Yongteng Lei	eef43fa25c	Fix: unexpected truncated Excel files (#9500 ) ### What problem does this PR solve? Handle unexpected truncated Excel files. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-15 17:00:34 +08:00
Jay Xu	79e2edc835	Fix "File contains no valid workbook part" (#9360 ) ### What problem does this PR solve? fix "File contains no valid workbook part" stacktrace: ``` Traceback (most recent call last): File "/ragflow/deepdoc/parser/excel_parser.py", line 54, in _load_excel_to_workbook return RAGFlowExcelParser._dataframe_to_workbook(df) File "/ragflow/deepdoc/parser/excel_parser.py", line 69, in _dataframe_to_workbook ws.cell(row=row_num, column=col_num, value=value) File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/worksheet/worksheet.py", line 246, in cell cell.value = value File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 218, in value self._bind_value(value) File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 197, in _bind_value value = self.check_string(value) File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 165, in check_string raise IllegalCharacterError(f"{value} cannot be used in worksheets.") ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2025-08-12 14:58:36 +08:00
Jay Xu	569ab011c4	Add fallback to use 'calamine' parse engine in excel_parser.py (#9374 ) ### What problem does this PR solve? add fallback to `calamine` engine when parse error raised using the default `openpyxl` / `xlrd` engine. e.g. the following error can be fixed: ``` Traceback (most recent call last): File "/ragflow/deepdoc/parser/excel_parser.py", line 53, in _load_excel_to_workbook df = pd.read_excel(file_like_object) File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 495, in read_excel io = ExcelFile( File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1567, in __init__ self._reader = self._engines[engine]( File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 46, in __init__ super().__init__( File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 573, in __init__ self.book = self.load_workbook(self.handles.handle, engine_kwargs) File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 63, in load_workbook return open_workbook(file_contents=data, **engine_kwargs) File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/__init__.py", line 172, in open_workbook bk = open_workbook_xls( File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 68, in open_workbook_xls bk.biff2_8_load( File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 641, in biff2_8_load cd.locate_named_stream(UNICODE_LITERAL(qname)) File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 398, in locate_named_stream result = self._locate_stream( File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 429, in _locate_stream raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s])) xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4 ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-12 12:41:33 +08:00
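The fallback idea in this commit can be sketched as a loop over engines. This is an illustration, not the code in `excel_parser.py`: the function name and the `reader` parameter are hypothetical, and `engine="calamine"` assumes a pandas release that supports it with the `python-calamine` package installed.

```python
from io import BytesIO


def read_excel_with_fallback(data, reader=None, engines=(None, "calamine")):
    """Try each engine in order and return the first successful result.

    `reader` defaults to pandas.read_excel; engine=None lets pandas pick
    its default engine for the file type.
    """
    if reader is None:
        import pandas as pd  # imported lazily so the sketch loads without pandas
        reader = pd.read_excel
    last_error = None
    for engine in engines:
        try:
            return reader(BytesIO(data), engine=engine)
        except Exception as exc:  # error types differ per engine, so catch broadly
            last_error = exc
    raise last_error
```

Keeping the engine list as a parameter makes the order explicit and lets a test substitute a fake reader for pandas.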
Jay Xu	6ad8b54754	fix "TypeError: '<' not supported between instances of 'Emu' and 'Non… (#9209 ) …eType'" ### What problem does this PR solve? fix "TypeError: '<' not supported between instances of 'Emu' and 'NoneType'" ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-04 16:07:03 +08:00
Kevin Hu	a16cd4f110	Refa: add result to callback for agent tool use. (#9137 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-08-01 21:49:39 +08:00
Yongteng Lei	39ef2ffba9	Feat: parsing supports jsonl or ldjson format (#9087 ) ### What problem does this PR solve? Supports jsonl or ldjson format. Feature request from [discussion](https://github.com/orgs/infiniflow/discussions/8774). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-07-30 09:48:20 +08:00
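For readers unfamiliar with the format: JSON Lines (jsonl, ldjson) stores one JSON value per line. A minimal reader using only the standard library could look like this; the function name is hypothetical and this is not RAGFlow's parser.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    records = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_no}: {exc.msg}") from exc
    return records


print(parse_json_lines('{"a": 1}\n\n{"b": [2, 3]}\n'))
```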
Jin Hai	03daf4618c	Refactor parser code (#9042 ) ### What problem does this PR solve? Refactor code ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-07-25 12:04:07 +08:00
Kevin Hu	ecdb1701df	Perf: test llm before RAPTOR. (#8897 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-07-17 16:48:50 +08:00
Yongteng Lei	51a8604dcb	Fix: fixed context loss caused by separating markdown tables from original text (#8844 ) ### What problem does this PR solve? Fix context loss caused by separating markdown tables from original text. #6871, #8804. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-15 13:03:01 +08:00
Kevin Hu	6d256ff0f5	Perf: ignore concate between rows. (#8507 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-06-26 14:55:37 +08:00
Jin Hai	4a2ff633e0	Fix typo in code (#8327 ) ### What problem does this PR solve? Fix typo in code ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-06-18 09:41:09 +08:00
Stephen Hu	2e44c3b743	Fix:Unimplemented function in ppt_parser (#8095 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8088 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-06 10:05:58 +08:00
giiiiiithub	6ba5a4348a	set PARALLEL_DEVICES default value= 0 (#7935 ) ### What problem does this PR solve? It would fail if PARALLEL_DEVICES = None in the OCR class, because it passes 0 to the TextDetector and TextRecognizer init methods. It would be simpler to set 0 as the default value for PARALLEL_DEVICES. ### Type of change - [x] Refactoring	2025-05-29 13:32:16 +08:00
Emmanuel Ferdman	d4a123d6dd	Fix: resolve regex library warnings (#7782 ) ### What problem does this PR solve? This small PR resolves the regex library warnings showing in Python3.11: ```python DeprecationWarning: 'count' is passed as positional argument ``` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-22 10:06:28 +08:00
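The usual remedy for this warning is to pass `count` by keyword. The sketch below assumes the standard-library `re` module; the commit message says "regex library", so the calls actually changed in the PR may differ.

```python
import re

text = "a-b-c-d"

# Positional form that triggers the deprecation warning on newer interpreters:
#   re.sub("-", "+", text, 1)
# Keyword form, which is unambiguous to both readers and the interpreter:
first_only = re.sub("-", "+", text, count=1)
print(first_only)
```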
Yongteng Lei	b908c33464	Fix: uncaptured image data with position information (#7683 ) ### What problem does this PR solve? Fixed uncaptured figure data with position information. #7466, #7681 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-05-19 19:33:28 +08:00
liuzhenghua	ea5e8caa69	feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562 ) ### What problem does this PR solve? When the PDF uses vector fonts, the rendered text in the captured page image often has missing strokes, leading to numerous OCR errors and incorrect characters. Similar issues also occur in the extracted chart images. Before ![0089e1f76205b5b3](https://github.com/user-attachments/assets/a84f8cd7-48ae-4da4-81ca-fc0bd93320f1) After ![03053149e919773a](https://github.com/user-attachments/assets/45fa5ebb-a2de-42b1-9535-1ea087877eb2) You can use the following document for testing. [Casio说明书.pdf](https://github.com/user-attachments/files/20119690/Casio.pdf) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>	2025-05-12 09:50:21 +08:00
Kevin Hu	a14865e6bb	Fix: empty query issue. (#7551 ) ### What problem does this PR solve? #5214 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-09 12:20:19 +08:00
Kevin Hu	9d3dd13fef	Refa: text order be robuster. (#7525 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-05-08 12:58:10 +08:00
Stephen Hu	953b3e1b3f	Fix: Sometimes VisionFigureParser.figures may is tuple (#7477 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/7466 I think this happens because sometimes we cannot get the position. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-06 17:38:22 +08:00
liuzhenghua	2f768b96e8	perf: optimze figure parser (#7392 ) ### What problem does this PR solve? When parsing documents containing images, the current code uses a single-threaded approach to call the VL model, resulting in extremely slow parsing speed (e.g., parsing a Word document with dozens of images takes over 20 minutes). By switching to a multithreaded approach to call the VL model, the parsing speed can be improved to an acceptable level. ### Type of change - [x] Performance Improvement --------- Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>	2025-05-06 14:39:45 +08:00
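The switch from sequential to multithreaded model calls can be sketched with a thread pool. The names `describe_figures` and `describe` are hypothetical stand-ins, and the worker count is an arbitrary choice; the PR's actual code may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor


def describe_figures(images, describe, max_workers=8):
    """Call `describe` on every image concurrently, preserving input order.

    `describe` stands in for a blocking call to a vision-language model.
    """
    if not images:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `images`, not completion order.
        return list(pool.map(describe, images))


print(describe_figures(["fig-1", "fig-2"], str.upper))
```

Threads suit this workload because each call mostly waits on a remote model, so the interpreter lock is released during the wait.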
zhudongwork	10432a1be7	Refa: Optimize pptx shape extraction to reduce content loss (#6703 ) ### What problem does this PR solve? When parsing pptx files, some shapes do not contain the `shape_type` attribute, which causes the original code to throw an exception during extraction, leading to failure in content extraction. This optimization introduces handling logic for such anomalous shapes, providing a safer and more robust processing mechanism. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [x] Performance Improvement - [ ] Other (please describe):	2025-04-22 10:16:24 +08:00
gsmini	53c653b099	fix RAGFlowPdfParser AttributeError: 'PdfReader' object has no attribute 'close' err (#6859 )
I use PdfParser locally (following this case: https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like this:
```
import re
import openpyxl
from ragflow.api.db import ParserType
from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \
    title_frequency, \
    tokenize_chunks
from ragflow.rag.utils import num_tokens_from_string
from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser, PlainParser


def logger(prog=None, msg=""):
    print(msg)


class Pdf(PdfParser):
    def __init__(self):
        self.model_speciess = ParserType.MANUAL.value
        super().__init__()

    def __call__(self, filename, binary=None, from_page=0,
                 to_page=100000, zoomin=3, callback=None):
        from timeit import default_timer as timer
        start = timer()
        callback(msg="OCR is running...")
        self.__images__(
            filename if not binary else binary,
            zoomin,
            from_page,
            to_page,
            callback
        )
        callback(msg="OCR finished.")
        print("OCR:", timer() - start)
        self._layouts_rec(zoomin)
        callback(0.65, "Layout analysis finished.")
        print("layouts:", timer() - start)
        self._table_transformer_job(zoomin)
        callback(0.67, "Table analysis finished.")
        self._text_merge()
        tbls = self._extract_table_figure(True, zoomin, True, True)
        self._concat_downward()
        self._filter_forpages()
        callback(0.68, "Text merging finished")
        # clean mess
        for b in self.boxes:
            b["text"] = re.sub(r"([\t 　]\|\u3000){2,}", " ", b["text"].strip())
        return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin))
                for i, b in enumerate(self.boxes)], tbls
```
This raises the following error:
```
File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__
    self.pdf.close()
AttributeError: 'PdfReader' object has no attribute 'close'
```
I found that the RAGFlow source code uses `pdfplumber.open` (https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43) and then replaces `self.pdf` with `pdf2_read` (`from pypdf import PdfReader as pdf2_read`) in line 1024 (https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024):
```
self.pdf = pdf2_read
```
---
I also found that `pdfplumber` can be used in this way:
```
file_path = "xxx.pdf"
res = pdfplumber.open(file_path)
res.close()
```
but the `pypdf.PdfReader` source code does not have a `close` function; the source code opens the stream like this:
```
with open(stream, "rb") as fh:
    stream = BytesIO(fh.read())
self._stream_opened = True
```
> https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156

So I moved the `self.pdf.close` function call and fixed this problem, hoping to help the project😊	2025-04-14 09:40:13 +08:00
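A more general guard against this class of error is to close a resource only when it exposes `close()`. Note that this is an alternative illustration: the commit above fixed the problem by moving the `self.pdf.close` call, and the helper name here is hypothetical.

```python
import io


def close_if_possible(resource):
    """Call resource.close() when it exists; return whether it was called."""
    close = getattr(resource, "close", None)
    if callable(close):
        close()
        return True
    return False


buffer = io.BytesIO(b"%PDF-")
print(close_if_possible(buffer), buffer.closed)
print(close_if_possible(object()))  # plain objects have no close()
```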
donblack01	0b48a2e0d1	Fix: When Excel is a formula, the parsed result is a formula, but cannot be correctly parsed as a value type (#6613 ) ### What problem does this PR solve? Fix: when an Excel cell contains a formula, the parsed result is the formula itself and cannot be correctly parsed as a value type. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tangyu <1@1.com>	2025-03-28 09:33:49 +08:00
Stephen Hu	d77380f024	Feat: support pic base bullet for PPT (#6406 ) ### What problem does this PR solve? Support picture-based bullets for PPT. Also correct one mistake in the document. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-24 09:31:31 +08:00
Yongteng Lei	9611185eb4	Feat: add VLM-boosted DocX parser (#6307 ) ### What problem does this PR solve? Add VLM-boosted DocX parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 11:24:44 +08:00
Yongteng Lei	1d6760dd84	Feat: add VLM-boosted PDF parser (#6278 ) ### What problem does this PR solve? Add VLM-boosted PDF parser if VLM is set. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 09:39:32 +08:00
Yongteng Lei	5cf610af40	Feat: add vision LLM PDF parser (#6173 ) ### What problem does this PR solve? Add vision LLM PDF parser ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-03-18 14:52:20 +08:00
Stephen Hu	79482ff672	Refa: Improve ppt_parser better handle list (#6162 ) ### What problem does this PR solve? This pull request (PR) adds code for parsing PPTX files, aiming to represent text in list formats more precisely (hint list by .). ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2025-03-17 17:02:39 +08:00
Kevin Hu	3a99c2b5f4	Refa: PARALLEL_DEVICES is a static parameter. (#6168 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-03-17 16:49:54 +08:00
Kevin Hu	bfa8d342b3	Fix: retrieval debug mode issue. (#6150 ) ### What problem does this PR solve? #6139 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-17 13:07:13 +08:00
Debug Doctor	3e19044dee	Feat: add OCR's muti-gpus and parallel processing support (#5972 ) ### What problem does this PR solve? Add OCR's muti-gpus and parallel processing support ### Type of change - [x] New Feature (non-breaking change which adds functionality) @yuzhichang I've tried to resolve the comments in #5697. OCR jobs can now be done on both CPU and GPU. ( By the way, I've encountered a “Generate embedding error” issue #5954 that might be due to my outdated GPUs? idk. ) Please review it and give me suggestions. GPU: ![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e) ![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d) CPU: ![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)	2025-03-17 11:58:40 +08:00
Yongteng Lei	7cd37c37cd	Feat: add CSV file parsing support (#5989 ) ### What problem does this PR solve? Add CSV file parsing support #4552, #5849, #5870 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-12 19:20:50 +08:00
donblack01	b1a46d5adc	Fix:when start with source code not in docker env report 'UnicodeDec… (#5802 ) ### What problem does this PR solve? Fix: on Windows, starting from source code outside the Docker environment reports "UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 5: illegal multibyte sequence". ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tangyu <1@1.com>	2025-03-10 11:22:06 +08:00
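Errors like this typically arise when a text file is opened without an explicit encoding, so Python falls back to the locale default (GBK on some Windows setups). A common remedy is to name the encoding; whether the PR does exactly this is an assumption, and the helper name below is hypothetical.

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a text file as UTF-8 regardless of the platform's default encoding."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


# Round trip: write UTF-8 bytes, then read them back with an explicit encoding.
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fh:
    fh.write("配置: ok\n".encode("utf-8"))
print(read_text_utf8(tmp_path))
os.remove(tmp_path)
```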
liwenju0	5b0e38060a	Feat：Optimize the table extraction logic in the Markdown parser: (#5663 ) Enhance the recognition of both borderless and bordered Markdown tables. Add support for extracting HTML tables, including various scenarios with nested HTML tags. Improve performance by using conditional checks to reduce unnecessary regular expression matching. ### What problem does this PR solve? Optimize the table extraction logic in the Markdown parser: Enhance the recognition of both borderless and bordered Markdown tables. Add support for extracting HTML tables, including various scenarios with nested HTML tags. Improve performance by using conditional checks to reduce unnecessary regular expression matching. ### Type of change - [x] Performance Improvement Co-authored-by: wenju.li <wenju.li@deepctr.cn>	2025-03-07 17:02:35 +08:00
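Regex-based table extraction of the kind described above can be sketched for the simplest case. This example handles only bordered Markdown tables (rows that start and end with `|`); borderless and HTML tables from the commit message are not covered, and the pattern and function name are illustrative rather than the parser's actual code.

```python
import re

# Header row, separator row (dashes, colons, pipes), then one or more body rows.
TABLE_RE = re.compile(
    r"(?:^\|.*\|[ \t]*\n)"
    r"(?:^\|[ \t:|-]*-[ \t:|-]*\|[ \t]*\n)"
    r"(?:^\|.*\|[ \t]*(?:\n|$))+",
    re.MULTILINE,
)


def extract_tables(text):
    """Return (tables, remainder): bordered tables and the text without them."""
    tables = [m.group(0).strip() for m in TABLE_RE.finditer(text)]
    remainder = TABLE_RE.sub("", text)
    return tables, remainder


sample = "Intro\n| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |\nOutro"
tables, rest = extract_tables(sample)
print(tables)
print(rest)
```

The conditional check mentioned in the commit message could be approximated by testing `"|" in text` before running the pattern, which avoids regex work on table-free documents.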


157 Commits