Feat: support context window for docx (#12455)

### What problem does this PR solve?

Feat: support context window for docx

#12303

Done:
- [x] naive.py
- [x] one.py

TODO:
- [ ] book.py
- [ ] manual.py

Fix: incorrect image position
Fix: incorrect chunk type tag

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
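
A minimal sketch of the docx change in this commit, assuming the parser now yields `(text, image, html)` triples as the diff suggests. The helper name `split_main_sections` and the sample data are hypothetical stand-ins, not part of the PR:

```python
def split_main_sections(main_sections):
    """Split (text, image, html) triples into the two lists used downstream.

    Hypothetical helper: it mirrors the loop in the diff, not an actual
    function from the repository.
    """
    sections = []
    tbls = []
    for text, image, html in main_sections:
        # Each section keeps its text together with its (optional) image.
        sections.append((text, image))
        # Each table entry uses the ((image, html), position) shape with no image
        # and an empty position string.
        tbls.append(((None, html), ""))
    return sections, tbls


# Stand-in parser output; real triples would come from the docx parser.
main_sections = [
    ("Intro paragraph", None, ""),
    ("", None, "<table><tr><td>1</td></tr></table>"),
]
sections, tbls = split_main_sections(main_sections)
```

Both lists stay index-aligned with the parser output, so a table's HTML can be traced back to the section it came from.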
Magicbook1108
2026-01-07 15:08:17 +08:00
committed by GitHub
parent a442c9cac6
commit 011bbe9556
7 changed files with 397 additions and 120 deletions

@@ -87,10 +87,18 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
         callback(0.1, "Start to parse.")
         doc_parser = naive.Docx()
         # TODO: table of contents need to be removed
-        sections, tbls = doc_parser(
+        main_sections = doc_parser(
             filename, binary=binary, from_page=from_page, to_page=to_page)
+        sections = []
+        tbls = []
+        for text, image, html in main_sections:
+            sections.append((text, image))
+            tbls.append(((None, html), ""))
         remove_contents_table(sections, eng=is_english(
             random_choices([t for t, _ in sections], k=200)))
         tbls = vision_figure_parser_docx_wrapper(sections=sections, tbls=tbls, callback=callback, **kwargs)
         # tbls = [((None, lns), None) for lns in tbls]
         sections = [(item[0], item[1] if item[1] is not None else "") for item in sections if