Add .doc file parser. (#497)

### What problem does this PR solve?
Add `.doc` file parser, using tika.
```
pip install tika
```
```
from tika import parser
from io import BytesIO

def extract_text_from_doc_bytes(doc_bytes):
    file_like_object = BytesIO(doc_bytes)
    parsed = parser.from_buffer(file_like_object)
    return parsed["content"]
```
### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: chrysanthemum-boy <fannc@qq.com>
This commit is contained in:
chrysanthemum-boy
2024-04-23 15:31:43 +08:00
committed by GitHub
parent 0dfc8ddc0f
commit 72384b191d
6 changed files with 47 additions and 6 deletions

View File

@ -116,6 +116,7 @@ sniffio==1.3.1
StrEnum==0.4.15
sympy==1.12
threadpoolctl==3.3.0
tika==2.6.0
tiktoken==0.6.0
tokenizers==0.15.2
torch==2.2.1
@ -133,4 +134,4 @@ xxhash==3.4.1
yarl==1.9.4
zhipuai==2.0.1
BCEmbedding
loguru==0.7.2
loguru==0.7.2