Mirror of https://github.com/infiniflow/ragflow.git (synced 2025-12-08 20:42:30 +08:00)
Add .doc file parser. (#497)
### What problem does this PR solve?
Add a `.doc` file parser built on Apache Tika, using the `tika` Python package.
```
pip install tika
```
```
from io import BytesIO

from tika import parser


def extract_text_from_doc_bytes(doc_bytes):
    # Wrap the raw bytes in a file-like object for Tika.
    file_like_object = BytesIO(doc_bytes)
    parsed = parser.from_buffer(file_like_object)
    # "content" holds the extracted plain text.
    return parsed["content"]
```
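The extracted text usually needs light cleanup before it is chunked. The following is a minimal sketch, not code from this PR: the function name `doc_bytes_to_lines` and the `parse_buffer` parameter are hypothetical, and it assumes that `"content"` may be absent or `None` when Tika extracts no text.

```python
from io import BytesIO


def doc_bytes_to_lines(doc_bytes, parse_buffer=None):
    """Return the non-empty, stripped text lines of a .doc file given as bytes.

    parse_buffer is a hypothetical injection point (not part of the PR);
    when omitted, tika's parser.from_buffer is used.
    """
    if parse_buffer is None:
        # Imported lazily so the helper can be exercised without tika installed.
        from tika import parser
        parse_buffer = parser.from_buffer

    parsed = parse_buffer(BytesIO(doc_bytes))
    # Assumption: "content" can be missing or None if no text was extracted.
    content = parsed.get("content") or ""
    return [line.strip() for line in content.splitlines() if line.strip()]
```

Passing a stand-in for `parse_buffer` lets the cleanup logic be checked without a running Tika server, which the `tika` package otherwise starts on first use and which needs a Java runtime.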
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: chrysanthemum-boy <fannc@qq.com>
Committed by GitHub
Parent: 0dfc8ddc0f
Commit: 72384b191d
@@ -116,6 +116,7 @@ sniffio==1.3.1
 StrEnum==0.4.15
 sympy==1.12
 threadpoolctl==3.3.0
+tika==2.6.0
 tiktoken==0.6.0
 tokenizers==0.15.2
 torch==2.2.1
@@ -133,4 +134,4 @@ xxhash==3.4.1
 yarl==1.9.4
 zhipuai==2.0.1
 BCEmbedding
-loguru==0.7.2
+loguru==0.7.2