Add .doc file parser. (#497)

### What problem does this PR solve?

Add `.doc` file parser, using tika.

```
pip install tika
```

```
from tika import parser
from io import BytesIO

def extract_text_from_doc_bytes(doc_bytes):
    file_like_object = BytesIO(doc_bytes)
    parsed = parser.from_buffer(file_like_object)
    return parsed["content"]
```

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: chrysanthemum-boy <fannc@qq.com>
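The effect of the `filename_type` change in `api/utils/file_utils.py` can be sketched in isolation. This is a minimal illustration, not code from the repository: the two patterns are copied from the diff, while `OLD_PATTERN`, `NEW_PATTERN`, and `matches_doc_pattern` are hypothetical names introduced only for this example.

```python
import re

# Pattern before this change (copied from the removed line of the diff).
OLD_PATTERN = r".*\.(docx|ppt|pptx|yml|xml|htm|json|csv|txt|ini|xls|xlsx|wps|rtf|hlp|pages|numbers|key|md)$"

# Pattern after this change (copied from the added line): `doc` is now listed.
NEW_PATTERN = r".*\.(doc|docx|ppt|pptx|yml|xml|htm|json|csv|txt|ini|xls|xlsx|wps|rtf|hlp|pages|numbers|key|md)$"


def matches_doc_pattern(filename: str, pattern: str) -> bool:
    """Hypothetical helper: True if the filename's extension is in the pattern."""
    return re.match(pattern, filename) is not None


# A `.doc` name is only recognized by the new pattern.
print(matches_doc_pattern("report.doc", OLD_PATTERN))
print(matches_doc_pattern("report.doc", NEW_PATTERN))
```

Because the pattern ends with `$`, `docx` files still match through their own alternative, and a name such as `report.doc.bak` is not matched by either pattern.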
2026-01-31 07:36:46 +08:00 · 2024-04-23 15:31:43 +08:00
parent 0dfc8ddc0f
commit 72384b191d
6 changed files with 47 additions and 6 deletions
--- a/api/utils/file_utils.py
+++ b/api/utils/file_utils.py
@@ -147,7 +147,7 @@ def filename_type(filename):
        return FileType.PDF.value

    if re.match(
-            r".*\.(docx|ppt|pptx|yml|xml|htm|json|csv|txt|ini|xls|xlsx|wps|rtf|hlp|pages|numbers|key|md)$", filename):
+            r".*\.(doc|docx|ppt|pptx|yml|xml|htm|json|csv|txt|ini|xls|xlsx|wps|rtf|hlp|pages|numbers|key|md)$", filename):
        return FileType.DOC.value

    if re.match(