Commit Graph

169 Commits

Author SHA1 Message Date
bb9504d1cc Fix:enhance delimiters in markdown parser (#10896)
### What problem does this PR solve?
issue:
[#10890](https://github.com/infiniflow/ragflow/issues/10890)
change:
enhance delimiters in markdown parser
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-30 17:36:51 +08:00
b52f09adfe Mineru api support (#10874)
### What problem does this PR solve?

support local mineru api in docker instance. like no gpu in wsl on
windows, but has mineru api with gpu support.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-10-30 17:31:46 +08:00
057ae646f2 Fix: logging issues (#10836)
### What problem does this PR solve?

Fix: logging issues #10835 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-28 14:10:47 +08:00
e59458c36b Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819)
### What problem does this PR solve?

Fix: parsing excel with chartsheet #10815

Fix: Clamp begin to a minimum of 0 to prevent negative indexing #10804
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-28 09:40:37 +08:00
5acc407240 Feat: MinerU supports VLM-Transfomers backend (#10809)
### What problem does this PR solve?

MinerU supports VLM-Transfomers backend.

Set `MINERU_BACKEND="pipeline"` to choose the backend. (Options:
pipeline | vlm-transformers, default is pipeline)

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-10-27 17:04:13 +08:00
33a189f620 Feat: add TCADP Parser (#10775)
### What problem does this PR solve?

This PR adds a new TCADP (Tencent Cloud Advanced Document Processing)
parser to RAGFlow, enabling users to leverage Tencent Cloud's document
parsing capabilities for more accurate and structured document
processing. The implementation includes:
New TCADP Parser: A complete implementation of Tencent Cloud's document
parsing API without SDK dependency
Configuration Support: Added configuration options in service_conf.yaml
for Tencent Cloud API credentials
Frontend Integration: Updated UI components to support the new TCADP
parser option
Error Handling: Comprehensive error handling and retry mechanisms for
API calls
Result Processing: Support for both SSE streaming and JSON response
formats from Tencent Cloud API

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-10-27 15:14:58 +08:00
73144e278b Don't release full image (#10654)
### What problem does this PR solve?

Introduced gpu profile in .env
Added Dockerfile_tei
fix datrie
Removed LIGHTEN flag

### Type of change

- [x] Documentation Update
- [x] Refactoring
2025-10-23 23:02:27 +08:00
0ff2042fc1 Feat: add Docling parser (#10759)
### What problem does this PR solve?
issue:
#3945
change:
add Docling parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-23 19:44:25 +08:00
41fade3fe6 Fix:wrong param in manual chunk (#10710)
### What problem does this PR solve?

change:
wrong param in manual chunk

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-21 20:10:54 +08:00
9d12380806 Fix: Excel2HTML can't support XLS(Excel 97-2003) (#10660)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/10602

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-21 09:52:59 +08:00
6ab96287c9 Feat:Vision Model Image Enhancement in Manual/Paper/Book/One chunker (#10640)
### What problem does this PR solve?
issue:
[#7472](https://github.com/infiniflow/ragflow/issues/7472)
change:
Vision Model Image Enhancement in Manual chunker
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-21 09:36:27 +08:00
387baf858f Feat: add MinerU parser (#10621)
### What problem does this PR solve?

Add MinerU parser. #3945, #8092.

Set `MINERU_EXECUTABLE` to the MinerU executable path, defaults to
`mineru`.

Set `MINERU_DELETE_OUTPUT=0` to preserve MinerU's output, default is 1,
which deletes temporary output.

Set `MINERU_OUTPUT_DIR` to choose the MinerU output directory (uses the
temporary directory if unset).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-17 09:55:39 +08:00
5200711441 Feat: add support for multi-column PDF parsing (#10475)
### What problem does this PR solve?

Add support for multi-columns PDF parsing. #9878, #9919.

Two-column sample:
<img width="1885" height="1020" alt="image"
src="https://github.com/user-attachments/assets/0270c028-2db8-4ca6-a4b7-cd5830882d28"
/>

Three-column sample: 
<img width="1881" height="992" alt="image"
src="https://github.com/user-attachments/assets/9ee88844-d5b1-4927-9e4e-3bd810d6e03a"
/>

Single-column sample:
<img width="1883" height="1042" alt="image"
src="https://github.com/user-attachments/assets/e93d3d18-43c3-4067-b5fa-e454ed0ab093"
/>



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-10-11 18:46:09 +08:00
7d2f65671f Feat: debugging toc part. (#10486)
### What problem does this PR solve?

#10436

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-11 18:45:21 +08:00
534fa60b2a Fix: Agent.reset() argument wrong #10463 & Unable to converse with agent through Python API. #10415 (#10472)
### What problem does this PR solve?
Fix: Agent.reset() argument wrong #10463 & Unable to converse with agent
through Python API. #10415

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-10 20:44:05 +08:00
0d8791936e Feat: TOC retrieval (#10456)
### What problem does this PR solve?

#10436

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-10 17:07:55 +08:00
f04c9e2937 Fix: correctly update parser method & correct vllm pdf parser (#10441)
### What problem does this PR solve?

Fix: correctly update parser method

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)
2025-10-09 19:03:12 +08:00
cbf04ee470 Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423)
### What problem does this PR solve?

#9869

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: jinhai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: chanx <1243304602@qq.com>
Co-authored-by: balibabu <cike8899@users.noreply.github.com>
Co-authored-by: Lynn <lynn_inf@hotmail.com>
Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com>
Co-authored-by: huangzl <huangzl@shinemo.com>
Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com>
Co-authored-by: Wilmer <33392318@qq.com>
Co-authored-by: Adrian Weidig <adrianweidig@gmx.net>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yongteng Lei <yongtengrey@outlook.com>
Co-authored-by: Liu An <asiro@qq.com>
Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com>
Co-authored-by: BadwomanCraZY <511528396@qq.com>
Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com>
Co-authored-by: Russell Valentine <russ@coldstonelabs.org>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Billy Bao <newyorkupperbay@gmail.com>
Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com>
Co-authored-by: TensorNull <tensor.null@gmail.com>
Co-authored-by: TeslaZY <TeslaZY@outlook.com>
Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com>
Co-authored-by: AB <aj@Ajays-MacBook-Air.local>
Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com>
Co-authored-by: He Wang <wanghechn@qq.com>
Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Mohamed Mathari <155896313+melmathari@users.noreply.github.com>
Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box>
Co-authored-by: Stephen Hu <stephenhu@seismic.com>
Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com>
Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com>
Co-authored-by: mxc <mxc@example.com>
Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com>
Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com>
Co-authored-by: mcoder6425 <mcoder64@gmail.com>
Co-authored-by: lemsn <lemsn@msn.com>
Co-authored-by: lemsn <lemsn@126.com>
Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com>
Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com>
Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>
2025-10-09 12:36:19 +08:00
4eb7659499 Fix bug: broken import from rag.prompts.prompts (#10217)
### What problem does this PR solve?

Fix broken imports

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: jinhai <haijin.chn@gmail.com>
2025-09-23 10:19:25 +08:00
62d35b1b73 Fix: handle zero (#10149)
### What problem does this PR solve?

Handle zero and nan in calculate.
#10125

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-18 16:28:03 +08:00
d353f7f7f8 Feat/parse audio (#10133)
### What problem does this PR solve?

Dataflow support audio.  And fix giteeAI's sequence2text model. 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-09-18 09:31:32 +08:00
c9ea22ef69 Fix: set default chunk_token_num in html_parser (#10118)
### What problem does this PR solve?

issue:
[Bug]: Agent component (HTTP Request) "'>' not supported between
instances of 'int' and 'NoneType'"
[#10096](https://github.com/infiniflow/ragflow/issues/10096)

Change:
When the Invoke class instantiates HtmlParser without providing the
chunk_token_num parameter, the value defaults to None, leading to a
comparison error with block_token_count.

This change sets the default chunk_token_num to 512 to prevent such
errors.
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: BadwomanCraZY <511528396@qq.com>
2025-09-17 09:36:31 +08:00
bc0281040b Feat: add support for the Ascend layout recognizer (#10105)
### What problem does this PR solve?

Supports Ascend layout recognizer.

Use the environment variable `LAYOUT_RECOGNIZER_TYPE=ascend` to enable
the Ascend layout recognizer, and `ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID=n`
(for example, n=0) to specify the Ascend device ID.

Ensure that you have installed the [ais
tools](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench)
properly.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-16 09:51:15 +08:00
0d9c1f1c3c Feat: dataflow supports Spreadsheet and Word processor document (#9996)
### What problem does this PR solve?

Dataflow supports Spreadsheet and Word processor document

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-10 13:02:53 +08:00
1ee9c0b8d9 fix xss in excel_parser (#9909)
### What problem does this PR solve?



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Performance Improvement

Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-09-05 09:58:03 +08:00
c27172b3bc Feat: init dataflow. (#9791)
### What problem does this PR solve?

#9790

Close #9782

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-08-28 18:40:32 +08:00
cf0011be67 Feat: Upgrade html parser (#9675)
### What problem does this PR solve?

parse more html content.

### Type of change

- [x] Other (please describe):
2025-08-27 12:43:55 +08:00
382458ace7 Feat: advanced markdown parsing (#9607)
### What problem does this PR solve?

Using AST parsing to handle markdown more accurately, preventing
components from being cut off by chunking. #9564

<img width="1746" height="993" alt="image"
src="https://github.com/user-attachments/assets/4aaf4bf6-5714-4d48-a9cf-864f59633f7f"
/>

<img width="1739" height="982" alt="image"
src="https://github.com/user-attachments/assets/dc00233f-7a55-434f-bbb7-74ce7f57a6cf"
/>

<img width="559" height="100" alt="image"
src="https://github.com/user-attachments/assets/4a556b5b-d9c6-4544-a486-8ac342bd504e"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-08-21 09:36:18 +08:00
eef43fa25c Fix: unexpected truncated Excel files (#9500)
### What problem does this PR solve?

Handle unexpected truncated Excel files.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-08-15 17:00:34 +08:00
79e2edc835 Fix "File contains no valid workbook part" (#9360)
### What problem does this PR solve?

fix "File contains no valid workbook part"

stacktrace:
```
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/excel_parser.py", line 54, in _load_excel_to_workbook
    return RAGFlowExcelParser._dataframe_to_workbook(df)
  File "/ragflow/deepdoc/parser/excel_parser.py", line 69, in _dataframe_to_workbook
    ws.cell(row=row_num, column=col_num, value=value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/worksheet/worksheet.py", line 246, in cell
    cell.value = value
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 218, in value
    self._bind_value(value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 197, in _bind_value
    value = self.check_string(value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 165, in check_string
    raise IllegalCharacterError(f"{value} cannot be used in worksheets.")
```

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2025-08-12 14:58:36 +08:00
569ab011c4 Add fallback to use 'calamine' parse engine in excel_parser.py (#9374)
### What problem does this PR solve?

add fallback to `calamine` engine when parse error raised using the
default `openpyxl` / `xlrd` engine.
e.g. the following error can be fixed:
```
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/excel_parser.py", line 53, in _load_excel_to_workbook
    df = pd.read_excel(file_like_object)
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 495, in read_excel
    io = ExcelFile(
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1567, in __init__
    self._reader = self._engines[engine](
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 46, in __init__
    super().__init__(
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 573, in __init__
    self.book = self.load_workbook(self.handles.handle, engine_kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 63, in load_workbook
    return open_workbook(file_contents=data, **engine_kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/__init__.py", line 172, in open_workbook
    bk = open_workbook_xls(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 68, in open_workbook_xls
    bk.biff2_8_load(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 641, in biff2_8_load
    cd.locate_named_stream(UNICODE_LITERAL(qname))
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 398, in locate_named_stream
    result = self._locate_stream(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 429, in _locate_stream
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4
```

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-08-12 12:41:33 +08:00
6ad8b54754 fix "TypeError: '<' not supported between instances of 'Emu' and 'Non… (#9209)
…eType'"

### What problem does this PR solve?

fix "TypeError: '<' not supported between instances of 'Emu' and
'NoneType'"

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-08-04 16:07:03 +08:00
a16cd4f110 Refa: add result to callback for agent tool use. (#9137)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2025-08-01 21:49:39 +08:00
39ef2ffba9 Feat: parsing supports jsonl or ldjson format (#9087)
### What problem does this PR solve?

Supports jsonl or ldjson format. Feature request from
[discussion](https://github.com/orgs/infiniflow/discussions/8774).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-07-30 09:48:20 +08:00
03daf4618c Refactor parser code (#9042)
### What problem does this PR solve?

Refactor code

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-07-25 12:04:07 +08:00
ecdb1701df Perf: test llm before RAPTOR. (#8897)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2025-07-17 16:48:50 +08:00
51a8604dcb Fix: fixed context loss caused by separating markdown tables from original text (#8844)
### What problem does this PR solve?

Fix context loss caused by separating markdown tables from original
text. #6871, #8804.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-07-15 13:03:01 +08:00
6d256ff0f5 Perf: ignore concate between rows. (#8507)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2025-06-26 14:55:37 +08:00
4a2ff633e0 Fix typo in code (#8327)
### What problem does this PR solve?

Fix typo in code

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-06-18 09:41:09 +08:00
2e44c3b743 Fix:Unimplemented function in ppt_parser (#8095)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/8088

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-06-06 10:05:58 +08:00
6ba5a4348a set PARALLEL_DEVICES default value= 0 (#7935)
### What problem does this PR solve?


it would be fail if PARALLEL_DEVICES = None in OCR class , because it
pass 0 to TextDetector and TextRecognizer init method.

and It would be simpler to set 0 as the default value for
PARALLEL_DEVICES.

### Type of change

- [x] Refactoring
2025-05-29 13:32:16 +08:00
d4a123d6dd Fix: resolve regex library warnings (#7782)
### What problem does this PR solve?
This small PR resolves the regex library warnings showing in Python3.11:
```python
DeprecationWarning: 'count' is passed as positional argument
```

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-22 10:06:28 +08:00
b908c33464 Fix: uncaptured image data with position information (#7683)
### What problem does this PR solve?

Fixed uncaptured figure data with position information. #7466, #7681

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-05-19 19:33:28 +08:00
ea5e8caa69 feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562)
### What problem does this PR solve?

When the PDF uses vector fonts, the rendered text in the captured page
image often has missing strokes, leading to numerous OCR errors and
incorrect characters. Similar issues also occur in the extracted chart
images.

**Before**

![0089e1f76205b5b3](https://github.com/user-attachments/assets/a84f8cd7-48ae-4da4-81ca-fc0bd93320f1)

**After**

![03053149e919773a](https://github.com/user-attachments/assets/45fa5ebb-a2de-42b1-9535-1ea087877eb2)

You can use the following document for testing.

[Casio说明书.pdf](https://github.com/user-attachments/files/20119690/Casio.pdf)


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
2025-05-12 09:50:21 +08:00
a14865e6bb Fix: empty query issue. (#7551)
### What problem does this PR solve?

#5214

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-09 12:20:19 +08:00
9d3dd13fef Refa: text order be robuster. (#7525)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2025-05-08 12:58:10 +08:00
953b3e1b3f Fix: Sometimes VisionFigureParser.figures may is tuple (#7477)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/7466
I think due to some times we can not get position 

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-06 17:38:22 +08:00
2f768b96e8 perf: optimze figure parser (#7392)
### What problem does this PR solve?

When parsing documents containing images, the current code uses a
single-threaded approach to call the VL model, resulting in extremely
slow parsing speed (e.g., parsing a Word document with dozens of images
takes over 20 minutes).

By switching to a multithreaded approach to call the VL model, the
parsing speed can be improved to an acceptable level.

### Type of change

- [x] Performance Improvement

---------

Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
2025-05-06 14:39:45 +08:00
10432a1be7 Refa: Optimize pptx shape extraction to reduce content loss (#6703)
### What problem does this PR solve?

When parsing pptx files, some shapes do not contain the `shape_type`
attribute, which causes the original code to throw an exception during
extraction, leading to failure in content extraction. This optimization
introduces handling logic for such anomalous shapes, providing a safer
and more robust processing mechanism.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):
2025-04-22 10:16:24 +08:00
53c653b099 fix RAGFlowPdfParser AttributeError: 'PdfReader' object has no attribute 'close' err (#6859)
i use PdfParser in local(refer to this case:
https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like
this:
```
import re
import openpyxl

from ragflow.api.db import ParserType
from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \
    title_frequency, \
    tokenize_chunks
from ragflow.rag.utils import num_tokens_from_string
from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser,PlainParser


def logger(prog=None, msg=""):
    print(msg)


class Pdf(PdfParser):
    def __init__(self):
        self.model_speciess = ParserType.MANUAL.value
        super().__init__()

    def __call__(self, filename, binary=None, from_page=0,
                 to_page=100000, zoomin=3, callback=None):
        from timeit import default_timer as timer
        start = timer()
        callback(msg="OCR is running...")

        self.__images__(
            filename if not binary else binary,
            zoomin,
            from_page,
            to_page,
            callback
        )
        callback(msg="OCR finished.")
        print("OCR:", timer() - start)
   
        self._layouts_rec(zoomin)
        callback(0.65, "Layout analysis finished.")
        print("layouts:", timer() - start)

        self._table_transformer_job(zoomin)
        callback(0.67, "Table analysis finished.")


        self._text_merge()
        tbls = self._extract_table_figure(True, zoomin, True, True)
        self._concat_downward()  
        self._filter_forpages()   
        callback(0.68, "Text merging finished")

        # clean mess
        for b in self.boxes:
            b["text"] = re.sub(r"([\t  ]|\u3000){2,}", " ", b["text"].strip())

        return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin))
                for i, b in enumerate(self.boxes)], tbls


```

show err like this:
```
  File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__
    self.pdf.close()
AttributeError: 'PdfReader' object has no attribute 'close'
```

i found ragflow source code use
`pdfplumber.open`(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43)

and replace` self.pdf `with ` pdf2_read` (from pypdf import PdfReader as
pdf2_read)in line 1024
(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024)
```
self.pdf = pdf2_read
```


---
and I found that `pdfplumber` can be used in this way:
```
file_path="xxx.pdf"
res = pdfplumber.open(file_path)
res.close()
```

but `pypdf.PdfReader` source code do not has `close` func, source code
use like this

```
 with open(stream, "rb") as fh:
         stream = BytesIO(fh.read())
          self._stream_opened = True
```
> https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156

so I moved the `self.pdf.close` function call and fixed this problem
hoping to help the project😊
2025-04-14 09:40:13 +08:00