Feat: add support for multi-column PDF parsing (#10475)

### What problem does this PR solve?

Add support for multi-column PDF parsing (#9878, #9919).

Two-column sample:

<img width="1885" height="1020" alt="image" src="https://github.com/user-attachments/assets/0270c028-2db8-4ca6-a4b7-cd5830882d28" />

Three-column sample:

<img width="1881" height="992" alt="image" src="https://github.com/user-attachments/assets/9ee88844-d5b1-4927-9e4e-3bd810d6e03a" />

Single-column sample:

<img width="1883" height="1042" alt="image" src="https://github.com/user-attachments/assets/e93d3d18-43c3-4067-b5fa-e454ed0ab093" />

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-01-30 15:16:45 +08:00 · 2025-10-11 18:46:09 +08:00
parent c21cea2038
commit 5200711441
3 changed files with 196 additions and 85 deletions
--- a/deepdoc/parser/markdown_parser.py
+++ b/deepdoc/parser/markdown_parser.py
@@ -17,7 +17,6 @@

 import re

-import mistune
 from markdown import markdown


@@ -117,8 +116,6 @@ class MarkdownElementExtractor:
    def __init__(self, markdown_content):
        self.markdown_content = markdown_content
        self.lines = markdown_content.split("\n")
-        self.ast_parser = mistune.create_markdown(renderer="ast")
-        self.ast_nodes = self.ast_parser(markdown_content)

    def extract_elements(self):
        """Extract individual elements (headers, code blocks, lists, etc.)"""