From 8c28587821fa8ef5cd8bb24b3e1e3c020987444e Mon Sep 17 00:00:00 2001
From: myoldcat
Date: Thu, 27 Nov 2025 09:40:10 +0800
Subject: [PATCH] Fix issue where HTML file parsing may lose content. (#11536)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What problem does this PR solve?

##### Problem Description

When parsing HTML files, some page content may be lost. For example, text inside nested `<font>` tags within multiple `<div>` elements (e.g., `<div><font>Text_1</font></div><div><font>Text_2</font></div>`) fails to be preserved correctly.

###### Root Cause #1: Block ID propagation is interrupted

1. **Block ID generation**: When the parser encounters a `<div>`, it generates a new `block_id` because `<div>` belongs to `BLOCK_TAGS`.
2. **Recursive processing**: This `block_id` is passed down recursively to process the `<div>`'s child nodes.
3. **Interruption occurs**: When processing a child `<font>` tag, the code enters the `else` branch of `read_text_recursively` (since `<font>` is a Tag).
4. **Bug location**: The first line in this `else` branch explicitly sets **`block_id = None`**.
   - This discards the valid `block_id` inherited from the parent `<div>`.
   - Since `<font>` is not in `BLOCK_TAGS`, it does not generate a new `block_id`, so it passes `None` to its child text nodes.
5. **Consequence**: The extracted text nodes have an empty `block_id` in their `metadata`. During the subsequent `merge_block_text` step, these texts cannot be correctly associated with their original `<div>` block due to the missing ID. As a result, all text from `<font>` tags gets merged together, which then triggers a second issue during concatenation.
6. **Solution:** Remove the forced reset of `block_id` to `None`. When the current tag (e.g., `<font>`) is not a block-level element, it should inherit the `block_id` passed down from its parent. This ensures consistent ownership across the hierarchy: `div` → `font` → `text`.

###### Root Cause #2: Data loss during text concatenation

1. The line `current_content += (" " if current_content else "" + content)` has a misplaced parenthesis. When `current_content` is non-empty (`True`):
   - The ternary expression evaluates to `" "` (a single space).
   - The code executes `current_content += " "`.
   - **Result**: Only a space is appended; **the new `content` string is completely discarded**.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
---
 deepdoc/parser/html_parser.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/deepdoc/parser/html_parser.py b/deepdoc/parser/html_parser.py
index 44ff10389..7e4467c16 100644
--- a/deepdoc/parser/html_parser.py
+++ b/deepdoc/parser/html_parser.py
@@ -138,7 +138,6 @@ class RAGFlowHtmlParser:
                                         "metadata": {"table_id": table_id, "index": table_list.index(t)}})
                 return table_info_list
             else:
-                block_id = None
                 if str.lower(element.name) in BLOCK_TAGS:
                     block_id = str(uuid.uuid1())
                 for child in element.children:
@@ -172,7 +171,7 @@ class RAGFlowHtmlParser:
                 if tag_name == "table":
                     table_info_list.append(item)
                 else:
-                    current_content += (" " if current_content else "" + content)
+                    current_content += (" " if current_content else "") + content
         if current_content:
             block_content.append(current_content)
         return block_content, table_info_list
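
The concatenation bug from Root Cause #2 can be reproduced in isolation. The sketch below is not code from `html_parser.py`: `join_buggy` and `join_fixed` are hypothetical helpers that wrap the old and new expressions in a loop. It assumes each `content` value is a plain string and shows how Python's conditional expression binds more loosely than `+`.

```python
def join_buggy(parts):
    """Hypothetical helper using the original expression (misplaced parenthesis)."""
    current_content = ""
    for content in parts:
        # Python parses this as: " " if current_content else ("" + content)
        # so `content` is only appended while current_content is still empty.
        current_content += (" " if current_content else "" + content)
    return current_content


def join_fixed(parts):
    """Hypothetical helper using the corrected expression from this patch."""
    current_content = ""
    for content in parts:
        # The separator is chosen first, then `content` is always appended.
        current_content += (" " if current_content else "") + content
    return current_content


if __name__ == "__main__":
    parts = ["Text_1", "Text_2", "Text_3"]
    print(repr(join_buggy(parts)))  # 'Text_1  ' (only spaces added after the first item)
    print(repr(join_fixed(parts)))  # 'Text_1 Text_2 Text_3'
```

With the buggy expression, every item after the first contributes only a space, which matches the lost content described above.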