mirror of
https://github.com/infiniflow/ragflow.git
synced 2025-12-08 04:22:28 +08:00
Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702)
### What problem does this PR solve?
This PR addresses **two independent issues** encountered when using the
MinerU engine in Ragflow:
1. **MinerU API output path mismatch for non-ASCII filenames**
MinerU sanitizes the root directory name inside the returned ZIP when
the original filename contains non-ASCII characters (e.g., Chinese).
Ragflow's client-side unzip logic assumed the original filename stem and
therefore failed to locate `_content_list.json`.
This PR adds:
* root-directory detection
* fallback lookup using sanitized names
* a broadened `_read_output` search with a glob fallback
ensuring output files are consistently located regardless of filename
encoding.
2. **Chunker crash due to tuple-structure mismatch in manual mode**
Some parsers (e.g., MinerU / Docling) return **2-tuple sections**, but
Ragflow’s chunker expects **3-tuple sections**, leading to:
`ValueError: not enough values to unpack (expected 3, got 2)`
This PR normalizes all sections to a uniform structure `(text, layout,
positions)`:
* parse position tags when present
* default to empty positions when missing
preserving backward compatibility and preventing crashes.
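The lookup order described in item 1 (assumed path, then root-directory detection, then a glob fallback) could be sketched roughly as follows. This is a minimal illustration, not the PR's actual `_read_output` code: the helper name `find_content_list`, its parameters, and the assumed `<root>/<root>_content_list.json` layout are hypothetical, and MinerU's real sanitization rule is deliberately not modelled.

```python
from pathlib import Path
from typing import Optional


def find_content_list(extract_dir: Path, file_stem: str) -> Optional[Path]:
    """Locate a `*_content_list.json` file under an extracted ZIP (hypothetical helper)."""
    suffix = "_content_list.json"

    # 1. Fast path: the layout the client originally assumed (original filename stem).
    expected = extract_dir / file_stem / f"{file_stem}{suffix}"
    if expected.is_file():
        return expected

    # 2. Root-directory detection: if the archive unpacked into exactly one
    #    directory, treat its name as the (possibly sanitized) stem.
    subdirs = [p for p in extract_dir.iterdir() if p.is_dir()]
    if len(subdirs) == 1:
        root = subdirs[0]
        candidate = root / f"{root.name}{suffix}"
        if candidate.is_file():
            return candidate

    # 3. Glob fallback: accept any matching file anywhere below extract_dir.
    matches = sorted(extract_dir.rglob(f"*{suffix}"))
    return matches[0] if matches else None
```

Trying the cheap, exact path first keeps the common ASCII-filename case unchanged; the recursive glob only runs when both earlier guesses miss.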
### Type of change
* [x] Bug Fix (non-breaking change which fixes an issue)
[#11136](https://github.com/infiniflow/ragflow/issues/11136)
[#11700](https://github.com/infiniflow/ragflow/issues/11700)
[#11620](https://github.com/infiniflow/ragflow/issues/11620)
[#11701](https://github.com/infiniflow/ragflow/pull/11701)
We need your help, [yongtenglei](https://github.com/yongtenglei).
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
@@ -219,23 +219,27 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
         )

         def _normalize_section(section):
-            # pad section to length 3: (txt, sec_id, poss)
-            if len(section) == 1:
+            # Pad/normalize to (txt, layout, positions)
+            if not isinstance(section, (list, tuple)):
+                section = (section, "", [])
+            elif len(section) == 1:
                 section = (section[0], "", [])
             elif len(section) == 2:
                 section = (section[0], "", section[1])
-            elif len(section) != 3:
-                raise ValueError(f"Unexpected section length: {len(section)} (value={section!r})")
+            else:
+                section = (section[0], section[1], section[2])

             txt, layoutno, poss = section
             if isinstance(poss, str):
                 poss = pdf_parser.extract_positions(poss)
-                first = poss[0]  # tuple: ([pn], x1, x2, y1, y2)
-                pn = first[0]
-
-                if isinstance(pn, list):
-                    pn = pn[0]  # [pn] -> pn
-                    poss[0] = (pn, *first[1:])
+                if poss:
+                    first = poss[0]  # tuple: ([pn], x1, x2, y1, y2)
+                    pn = first[0]
+                    if isinstance(pn, list) and pn:
+                        pn = pn[0]  # [pn] -> pn
+                        poss[0] = (pn, *first[1:])
+            if not poss:
+                poss = []

             return (txt, layoutno, poss)

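To see what the new normalization does for each input shape, the logic can be exercised outside Ragflow. The sketch below is a standalone approximation under stated assumptions: `normalize_section` takes the position parser as an argument instead of closing over `pdf_parser`, and `fake_extract_positions` is a hypothetical stand-in that only mimics the `([pn], x1, x2, y1, y2)` shape noted in the code comment; it does not parse real position tags.

```python
from typing import Any, Callable, List, Tuple

Position = Tuple[Any, float, float, float, float]


def normalize_section(
    section: Any,
    extract_positions: Callable[[str], List[Position]],
) -> Tuple[Any, Any, list]:
    """Normalize a parser section to (txt, layout, positions)."""
    if not isinstance(section, (list, tuple)):
        # Bare value (e.g. a plain string): no layout, no positions.
        section = (section, "", [])
    elif len(section) == 1:
        section = (section[0], "", [])
    elif len(section) == 2:
        # A 2-tuple carries positions, not a layout label, in second place.
        section = (section[0], "", section[1])
    else:
        # Three or more elements: keep the first three.
        section = (section[0], section[1], section[2])

    txt, layout, poss = section
    if isinstance(poss, str):
        poss = extract_positions(poss)
        if poss:
            first = poss[0]
            pn = first[0]
            if isinstance(pn, list) and pn:
                # Unwrap [pn] -> pn for the first position only.
                poss[0] = (pn[0], *first[1:])
    return (txt, layout, poss or [])


def fake_extract_positions(tag: str) -> List[Position]:
    # Hypothetical stub: one fixed position for any non-empty tag string.
    return [([0], 1.0, 2.0, 3.0, 4.0)] if tag else []
```

With this stub, a 2-tuple such as `("text", "tag")` no longer raises the unpacking `ValueError`, and a section whose tag string yields no positions comes back with an empty list rather than failing on `poss[0]`.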