Mirror of https://github.com/infiniflow/ragflow.git (synced 2025-12-31 17:15:32 +08:00)
Fix: parent-children pipeline bad case. (#12246)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
@@ -1206,7 +1206,7 @@ class RAGFlowPdfParser:
         start = timer()
         self._text_merge()
         self._concat_downward()
-        self._naive_vertical_merge(zoomin)
+        #self._naive_vertical_merge(zoomin)
         if callback:
             callback(0.92, "Text merged ({:.2f}s)".format(timer() - start))

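For orientation, the hunk above sits in the text-merge stage of the PDF parser: several merge passes run in sequence, timed with `timer()`, and progress is reported through an optional `callback`; the patch simply comments out the `_naive_vertical_merge` pass. Below is a minimal standalone sketch of that timing/progress-callback pattern; the stage functions are placeholders, not RAGFlow's real implementations, and `timer` is assumed to be `timeit.default_timer`.

```python
# Sketch of the stage-timing / progress-callback pattern seen in the hunk above.
from timeit import default_timer as timer  # assumed to match the parser's `timer`


def _text_merge():
    pass  # placeholder: merge adjacent text boxes


def _concat_downward():
    pass  # placeholder: concatenate boxes downward across lines


def _naive_vertical_merge():
    pass  # placeholder: the pass the patch disables


def merge_text(callback=None, use_naive_vertical_merge=False):
    start = timer()
    _text_merge()
    _concat_downward()
    if use_naive_vertical_merge:  # the patch comments this call out
        _naive_vertical_merge()
    # Same reporting style as the diff: progress fraction plus elapsed seconds.
    if callback:
        callback(0.92, "Text merged ({:.2f}s)".format(timer() - start))


merge_text(callback=lambda prog, msg: print(prog, msg))
```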
@@ -92,9 +92,9 @@ class Splitter(ProcessBase):
                 continue
             split_sec = re.split(r"(%s)" % custom_pattern, c, flags=re.DOTALL)
             if split_sec:
-                for txt in split_sec:
+                for j in range(0, len(split_sec), 2):
                     docs.append({
-                        "text": txt,
+                        "text": split_sec[j],
                         "mom": c
                     })
             else:
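Why stepping by 2 fixes the bad case: because the splitting pattern is wrapped in a capturing group, Python's `re.split` keeps the matched delimiters in its result, alternating content segments (even indices) with captured separators (odd indices). Iterating every element therefore also emitted the bare delimiters as chunks; `range(0, len(split_sec), 2)` keeps only the text segments. A small self-contained illustration with a made-up pattern and input:

```python
import re

custom_pattern = r"\n#{1,3} "  # hypothetical heading-style delimiter
c = "intro text\n# Part A\nbody A\n## Part B\nbody B"

# With a capturing group, re.split interleaves segments with the captured
# delimiters: [seg0, sep0, seg1, sep1, seg2, ...]
split_sec = re.split(r"(%s)" % custom_pattern, c, flags=re.DOTALL)
print(split_sec)
# -> ['intro text', '\n# ', 'Part A\nbody A', '\n## ', 'Part B\nbody B']

# Old behaviour: every element, including the bare delimiters, became a chunk.
# New behaviour: even indices only, i.e. just the content segments.
texts = [split_sec[j] for j in range(0, len(split_sec), 2)]
print(texts)
# -> ['intro text', 'Part A\nbody A', 'Part B\nbody B']
```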
@@ -155,9 +155,9 @@ class Splitter(ProcessBase):
             split_sec = re.split(r"(%s)" % custom_pattern, c["text"], flags=re.DOTALL)
             if split_sec:
                 c["mom"] = c["text"]
-                for txt in split_sec:
+                for j in range(0, len(split_sec), 2):
                     cc = deepcopy(c)
-                    cc["text"] = txt
+                    cc["text"] = split_sec[j]
                     docs.append(cc)
             else:
                 docs.append(c)
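The second call site applies the same fix to chunks that already carry fields: the full parent text is stashed under `"mom"` and each even-indexed segment becomes a child chunk copied from the parent. A hedged, self-contained sketch of that parent/child expansion (the `doc_id` field and the paragraph delimiter below are illustrative, not taken from the diff):

```python
import re
from copy import deepcopy

custom_pattern = r"\n\n"  # hypothetical paragraph delimiter
c = {"text": "parent paragraph one\n\nparent paragraph two", "doc_id": "demo"}

docs = []
split_sec = re.split(r"(%s)" % custom_pattern, c["text"], flags=re.DOTALL)
if split_sec:
    c["mom"] = c["text"]                   # keep the full parent text on the chunk
    for j in range(0, len(split_sec), 2):  # even indices = content segments
        cc = deepcopy(c)                   # each child inherits the parent's fields
        cc["text"] = split_sec[j]
        docs.append(cc)
else:
    docs.append(c)

for d in docs:
    print(repr(d["text"]), "| mom:", repr(d["mom"]))
```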
@@ -1,13 +1,17 @@
-Extract important structured information from the given content.
-Output ONLY a valid JSON string with no additional text.
-If no important structured information is found, output an empty JSON object: {}.
+## Role: Metadata extraction expert
+## Constraints:
+- Core Directive: Extract important structured information from the given content. Output ONLY a valid JSON string. No Markdown (e.g., ```json), no explanations, and no notes.
+- Schema Parsing: In the `properties` object provided in Schema, the attribute name (e.g., 'author') is the target Key. Extract values based on the `description`; if no `description` is provided, refer to the key's literal meaning.
+- Extraction Rules: Extract only when there is an explicit semantic correlation. If multiple values or data points match a field's definition, extract and include all of them. Strictly follow the Schema below and only output matched key-value pairs. If the content is irrelevant or no matching information is identified, you **MUST** output {}.
+- Data Source: Extraction must be based solely on content below. Semantic mapping (synonyms) is allowed, but strictly prohibit hallucinations or fabricated facts.
 
-Important structured information structure as following:
+## Enum Rules (Triggered ONLY if an enum list is present):
+- Value Lock: All extracted values MUST strictly match the provided enum list.
+- Normalization: Map synonyms or variants in the text back to the standard enum value (e.g., "Dec" to "December").
+- Fallback: Output {} if no explicit match or synonym is identified.
 
+## Schema for extraction:
 {{ schema }}
 
----------------------------
-The given content as following:
-
-{{ content }}
-
+## Content to analyze:
+{{ content }}
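The rewritten prompt is a template with `{{ schema }}` and `{{ content }}` placeholders and demands a bare JSON object as the reply. A minimal sketch of how such a template could be rendered and its output parsed, assuming a Jinja2-style renderer (whether RAGFlow renders this particular file with `jinja2` is not shown in the diff):

```python
import json
from jinja2 import Template

# Abbreviated template body; the real file also carries the Role/Constraints
# and Enum Rules sections shown in the diff above.
prompt_tmpl = Template(
    "## Schema for extraction:\n{{ schema }}\n\n"
    "## Content to analyze:\n{{ content }}"
)

schema = json.dumps({
    "properties": {
        "author": {"description": "Name of the document's author"},
        "month": {"enum": ["January", "February", "December"]},
    }
}, indent=2)

prompt = prompt_tmpl.render(
    schema=schema,
    content="Written by Jane Doe in Dec 2024.",
)
print(prompt)

# The prompt demands a bare JSON object, so a well-behaved reply can be fed
# straight to json.loads; an empty object means nothing matched the schema.
fake_llm_reply = '{"author": "Jane Doe", "month": "December"}'
print(json.loads(fake_llm_reply))
```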