Add .doc file parser. (#497)

### What problem does this PR solve?
Add `.doc` file parser, using tika.
```
pip install tika
```
```
from tika import parser
from io import BytesIO

def extract_text_from_doc_bytes(doc_bytes):
    file_like_object = BytesIO(doc_bytes)
    parsed = parser.from_buffer(file_like_object)
    return parsed["content"]
```
### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: chrysanthemum-boy <fannc@qq.com>
This commit is contained in:
chrysanthemum-boy
2024-04-23 15:31:43 +08:00
committed by GitHub
parent 0dfc8ddc0f
commit 72384b191d
6 changed files with 47 additions and 6 deletions

View File

@ -147,7 +147,7 @@ def filename_type(filename):
return FileType.PDF.value
if re.match(
r".*\.(docx|ppt|pptx|yml|xml|htm|json|csv|txt|ini|xls|xlsx|wps|rtf|hlp|pages|numbers|key|md)$", filename):
r".*\.(doc|docx|ppt|pptx|yml|xml|htm|json|csv|txt|ini|xls|xlsx|wps|rtf|hlp|pages|numbers|key|md)$", filename):
return FileType.DOC.value
if re.match(