Feat: Upgrade html parser (#9675)

### What problem does this PR solve?

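Upgrade the HTML parser so that it extracts more content from HTML files. The parser now also receives the configured `chunk_token_num` (default 128).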

### Type of change

- [x] Other (please describe):
pingguoCooler
2025-08-27 12:43:55 +08:00
committed by GitHub
parent 1f47001c82
commit cf0011be67
2 changed files with 179 additions and 13 deletions


@@ -517,7 +517,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
     elif re.search(r"\.(htm|html)$", filename, re.IGNORECASE):
         callback(0.1, "Start to parse.")
-        sections = HtmlParser()(filename, binary)
+        chunk_token_num = int(parser_config.get("chunk_token_num", 128))
+        sections = HtmlParser()(filename, binary, chunk_token_num)
         sections = [(_, "") for _ in sections if _]
         callback(0.8, "Finish parsing.")
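
The hunk above reads a token budget from `parser_config` and hands it to `HtmlParser`. The parser's internals are not part of this hunk, so the following is only a minimal sketch of how such a budget might be applied. `resolve_chunk_token_num`, `count_tokens`, and `merge_blocks` are hypothetical names, the whitespace-based token count is an assumption, and none of this is the actual `HtmlParser` implementation.

```python
from typing import Any, Dict, List

DEFAULT_CHUNK_TOKEN_NUM = 128  # same fallback value as in the diff


def resolve_chunk_token_num(parser_config: Dict[str, Any]) -> int:
    """Read the token budget the same way the added line in the diff does."""
    return int(parser_config.get("chunk_token_num", DEFAULT_CHUNK_TOKEN_NUM))


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def merge_blocks(blocks: List[str], chunk_token_num: int) -> List[str]:
    """Greedily merge consecutive text blocks without exceeding the budget.

    A single block larger than the budget is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for block in blocks:
        n = count_tokens(block)
        if n == 0:
            continue  # drop empty blocks, similar to the `if _` filter in the diff
        if current and current_tokens + n > chunk_token_num:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    budget = resolve_chunk_token_num({"chunk_token_num": "4"})
    merged = merge_blocks(["a b", "c d", "", "e f g"], budget)
    # Same post-processing shape as the diff: (text, "") pairs.
    sections = [(s, "") for s in merged if s]
    print(sections)
```

Wrapping the lookup in `int(...)` means a value stored as a string such as `"4"` still works, and a missing key falls back to 128, as in the diff.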