Compare commits


117 Commits

Author SHA1 Message Date
48607c3cfb Update README (#670)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-05-08 12:01:26 +08:00
d15ba37313 update docker file to support low version npm package (#669)
### Type of change

- [x] Refactoring
2024-05-08 10:40:38 +08:00
a553dc8dbd feat: support DeepSeek (#667)
### What problem does this PR solve?

#666 
feat: support DeepSeek
feat: preview word and excel
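For context, DeepSeek exposes an OpenAI-compatible chat API, so a minimal call looks roughly like the sketch below (public endpoint and model name shown; how this PR actually wires the provider into RAGFlow may differ):

```python
# Rough sketch using the OpenAI Python SDK (`pip install openai`); the API key is a placeholder.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```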

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2024-05-08 10:30:18 +08:00
eb27a4309e add support for deepseek (#668)
### What problem does this PR solve?

#666 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-05-08 10:30:02 +08:00
48e1534bf4 Update conversation_api.md 2024-05-08 09:05:35 +08:00
e9d19c4684 Update conversation_api.md 2024-05-08 09:04:23 +08:00
8d6d7f6887 fix task losing issue (#665)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 20:46:45 +08:00
a6e4b74d94 remove unused dependency (#664)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 19:46:17 +08:00
a5aed2412f fix bugs (#662)
### What problem does this PR solve?

Fix import error for task_service.py

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 16:41:56 +08:00
2810c60757 refine doc for v0.5.0 (#660)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-05-07 13:19:33 +08:00
62afcf5ac8 fix bug (#659)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 13:16:12 +08:00
a74c755d83 Update .env 2024-05-07 12:56:14 +08:00
7013d7f620 refine text decode (#657)
### What problem does this PR solve?
#651 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 12:25:47 +08:00
de839fc3f0 optimize srv broker and executor logic (#630)
### What problem does this PR solve?

Optimize the task broker and executor to reduce memory usage and deployment
complexity.

### Type of change
- [x] Performance Improvement
- [x] Refactoring

### Change Log
- Enhance the Redis utils for the message queue (use streams); a sketch of the pattern follows this list.
- Modify the task broker logic to go through the message queue (1. get parse events from the message queue; 2. use a ThreadPoolExecutor as an async executor).
- Rename the document and task table column `process_duation` to `process_duration` (probably just a spelling mistake).
- Reformat some code style (only what was noticed in passing).
- Add requirement_dev.txt for developers.
- Add a Redis container to Docker Compose.
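A rough sketch of the queue pattern described above (stream, group, and payload names are invented for illustration and are not the repo's actual ones; requires a running Redis and `pip install redis`):

```python
from concurrent.futures import ThreadPoolExecutor
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
STREAM, GROUP, CONSUMER = "task_events", "executors", "executor-1"  # assumed names

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
except redis.exceptions.ResponseError:
    pass

def publish_parse_event(doc_id: str) -> None:
    # Broker side: enqueue a parse event onto the stream.
    r.xadd(STREAM, {"payload": json.dumps({"doc_id": doc_id})})

def handle(entry_id: bytes, fields: dict) -> None:
    # Executor side: process one event, then acknowledge it.
    event = json.loads(fields[b"payload"])
    print("parsing document", event["doc_id"])
    r.xack(STREAM, GROUP, entry_id)

def run_executor() -> None:
    # Consume pending events with a small thread pool, blocking up to 5 s per poll.
    with ThreadPoolExecutor(max_workers=4) as pool:
        while True:
            for _stream, entries in r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=8, block=5000):
                for entry_id, fields in entries:
                    pool.submit(handle, entry_id, fields)
```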

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-05-07 11:43:33 +08:00
c6b6c748ae fix file encoding detection bug (#653)
### What problem does this PR solve?

#651 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 10:01:24 +08:00
ca5acc151a Refactor: Use TaskStatus enum for task status handling (#646)
### What problem does this PR solve?

This commit changes the status 'not started' from being hard-coded to
being maintained by the TaskStatus enum. This enhancement ensures
consistency across the codebase and improves maintainability.
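A minimal sketch of the idea (member names and values below are placeholders, not the repo's actual definitions):

```python
from enum import Enum

class TaskStatus(Enum):
    # Placeholder values; the project defines its own members.
    UNSTART = "0"
    RUNNING = "1"
    DONE = "2"
    FAIL = "3"

def is_unstarted(status: str) -> bool:
    # Compare against the enum instead of a hard-coded "not started" literal.
    return status == TaskStatus.UNSTART.value
```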

### Type of change

- [x] Refactoring
2024-05-06 18:39:17 +08:00
385dbe5ab5 fix: add spin to parsing status icon of dataset table (#649)
### What problem does this PR solve?

fix: add spin to parsing status icon of dataset table
#648 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-06 18:37:31 +08:00
3050a8cb07 Update README badge (#639)
### What problem does this PR solve?

Entry to RAGFlow's online demo was not easy to find. Also note that text
"RAGFlow" in the badge is already a given. Hence the change.

### Type of change

- [x] Documentation Update
2024-05-04 15:31:11 +08:00
9c77d367d0 Updated faq.md (#636)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update
2024-05-03 12:11:15 +08:00
5f03a4de11 remove redis (#629)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-30 19:00:41 +08:00
290e5d958d docs: Add instructions for launching service from source (#619)
This commit includes detailed steps for setting up and launching the
service directly from the source code. It covers cloning the repository,
setting up a virtual environment, configuring environment variables, and
starting the service using Docker. This update ensures that developers
have clear guidance on how to get the service running in a development
environment.

### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change
- [x] Documentation Update
2024-04-30 18:45:53 +08:00
9703633a57 fix: filter knowledge list by keywords and clear the selected file list after the file is uploaded successfully and add ellipsis pattern to chunk list (#628)
### What problem does this PR solve?

#627 
fix: filter knowledge list by keywords
fix: clear the selected file list after the file is uploaded
successfully
feat: add ellipsis pattern to chunk list

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 18:43:26 +08:00
7d3b68bb1e refine code (#626)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2024-04-30 17:53:28 +08:00
c89f3c3cdb Fix missing 'ollama' package in requirements.txt (#621)
### What problem does this PR solve?

This commit resolves an issue where the 'ollama' package was
inadvertently omitted from the requirements.txt file. The package has
now been added to ensure all dependencies are correctly installed for
the project.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 16:29:46 +08:00
5d7f573379 Fix: missing 'redis' package in requirements.txt (#622)
### What problem does this PR solve?

This commit resolves an issue where the 'redis' package was
inadvertently omitted from the requirements.txt file. The package has
now been added to ensure all dependencies are correctly installed for
the project.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 16:29:27 +08:00
cab274f560 remove PyMuPDF (#618)
### What problem does this PR solve?
#613 

### Type of change


- [x] Other (please describe):
2024-04-30 12:38:09 +08:00
7059ec2298 fix: fixed the issue that ModelSetting could not be saved #614 (#617)
### What problem does this PR solve?

fix: fixed the issue that ModelSetting  could not be saved #614

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 11:27:10 +08:00
674b3aeafd fix disable and enable llm setting in dialog (#616)
### What problem does this PR solve?
#614 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 11:04:14 +08:00
4c1476032d fix: omit long file names (#608)
### What problem does this PR solve?

#607
fix: omit long file names
fix: change the parsing method from tag to select
fix: replace icon for new chat
fix: change the OK button text of the Chat Bot API modal to close


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-29 18:22:17 +08:00
2af74cc494 refine docker layers (#606)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2024-04-29 17:57:40 +08:00
38f0cc016f fix: #567 use modal to upload files in the knowledge base (#601)
### What problem does this PR solve?

fix:  #567 use modal to upload files in the knowledge base

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-29 15:45:19 +08:00
6874c6f3a7 refine document upload (#602)
### What problem does this PR solve?

#567 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-29 15:45:08 +08:00
8acc01a227 refine redis connection (#599)
### What problem does this PR solve?

#591 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-29 08:52:38 +08:00
8c07992b6c refine code (#595)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 19:13:33 +08:00
aee8b48d2f feat: add FlowCanvas (#593)
### What problem does this PR solve?

feat: handle operator drag
feat: add FlowCanvas
#592

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-28 19:03:54 +08:00
daf215d266 Updated FAQ: Range of input length (#594)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update
2024-04-28 19:03:43 +08:00
cdcc779705 refine document by using latest as version number (#588)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-28 16:16:08 +08:00
d589b0f568 fix exception in pdf parser (#584)
### What problem does this PR solve?
#451 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 14:23:53 +08:00
9d60a84958 refactor code (#583)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 13:19:54 +08:00
aadb9cbec8 remove default redis configuration (#582)
### What problem does this PR solve?
#580 
### Type of change

- [x] Refactoring
2024-04-28 12:14:56 +08:00
038822f3bd make cites in conversation API configurable (#576)
### What problem does this PR solve?

#566 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 11:56:17 +08:00
ae501c58fa fix: display the current language directly at the top and do not display reference symbols for documents in external chat boxes #566 #577 (#579)

### What problem does this PR solve?

fix: display the current language directly at the top and do not display
reference symbols for documents in external chat boxes #566 #577

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 11:50:03 +08:00
944776f207 fix bug about fetching file from minio (#574)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 09:57:40 +08:00
f1c98aad6b Update version info (#564)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update
- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-26 20:07:26 +08:00
ab06f502d7 fix bug of file management (#565)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-26 19:59:21 +08:00
6329339a32 feat: add Tooltip to action icon of FileManager (#561)
### What problem does this PR solve?
#345
feat: add Tooltip to action icon of FileManager 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-26 18:55:37 +08:00
84b39c60f6 fix rename bug (#562)
### What problem does this PR solve?

fix rename file bugs
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-26 18:55:21 +08:00
eb62c669ae feat: translate FileManager #345 (#558)
### What problem does this PR solve?
#345
feat: translate FileManager
feat: batch delete files from the file table in the knowledge base

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-26 17:22:23 +08:00
f69ff39fa0 add file management feature (#560)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-26 17:21:53 +08:00
b1cd203904 Update version to 0.3.2 (#550)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-26 09:58:35 +08:00
b75d75e995 fix youdao bug (#551)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-26 09:58:22 +08:00
76c477f211 chore: disable Kibana volume storage in Docker Compose (#548)
### What problem does this PR solve?

Since the Kibana service is not currently in use, the associated volume
'kibanadata' has been commented out in the Docker Compose file. This change
helps prevent the allocation of unnecessary resources and simplifies the
configuration.

### Type of change

- [x] Refactoring (remove unused Kibana volume storage)
2024-04-26 08:54:27 +08:00
1b01c4fe69 Updated badge link (#545)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update
2024-04-25 19:34:21 +08:00
188f3ddfc5 Update version to v0.3.1 (#544)
### What problem does this PR solve?

Update version to v0.3.1

### Type of change

- [x] Documentation Update

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-25 19:18:04 +08:00
1dcd439c58 feat: add file icon to table of FileManager #345 (#543)
### What problem does this PR solve?

feat: add file icon to table of FileManager #345
fix: modify datasetDescription

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-25 19:06:24 +08:00
26003b5076 Add upload file by knowledge base name API. (#539)
### What problem does this PR solve?
Add upload file by knowledge base name API.
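A rough usage sketch of such an endpoint (the route and field names come from the diff further down this page; the host and API prefix are assumptions about how the blueprint is mounted):

```bash
curl -X POST "http://<RAGFLOW_HOST>/<API_PREFIX>/document/upload" \
  -H "Authorization: Bearer <API_TOKEN>" \
  -F "kb_name=my_knowledge_base" \
  -F "file=@./sample.pdf"
```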

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---------

Co-authored-by: chrysanthemum-boy <fannc@qq.com>
2024-04-25 15:10:19 +08:00
4130e5c5e5 Updated badge (#540)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update
2024-04-25 15:08:57 +08:00
d0af2f92f2 Added release badge (#538)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update
2024-04-25 14:31:54 +08:00
66f8d35632 Refactor (#537)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-25 14:14:28 +08:00
cf9b554c3a there's no need to connect to Redis in order to use Redis (#536)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Documentation Update
2024-04-25 14:01:39 +08:00
aeabc0c9a4 Add disk requirements on the README (#535)
### What problem does this PR solve?

Add disk requirements on the README

### Type of change

- [x] Documentation Update

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-25 14:00:48 +08:00
9db44da992 Add docker support for OpenCloudOS 9 (#526)
### What problem does this PR solve?

This PR aims to add support for running Ragflow on Docker with the
OpenCloudOS 9 distribution.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: edwardewang <edwardewang@tencent.com>
2024-04-25 08:46:53 +08:00
51e7697df7 feat: upload file in FileManager #345 (#529)
### What problem does this PR solve?

feat: upload file in FileManager #345 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-25 08:46:18 +08:00
b06d6395bb Updated minimum RAM capacity (#528)
### What problem does this PR solve?


### Type of change

- [x] Documentation Update
2024-04-24 19:22:00 +08:00
b79f0b0cac Add .DS_Store and docker/ragflow-logs to the git ignore list (#523)
### What problem does this PR solve?

Ignore temporary files to help Mac developers.

### Type of change


- [x] Other (please describe):

Co-authored-by: PLIX870I <plix870i@V-SPDT-XIAOHUI-MB.local>
2024-04-24 17:05:01 +08:00
fe51488973 editorial updates (#525)
### What problem does this PR solve?


### Type of change

- [x] Documentation Update
2024-04-24 17:04:23 +08:00
5d1803c31d Add an entry in Debugging section (#481)
### What problem does this PR solve?

_Add an entry in Debugging section._

### Type of change

- [x] Documentation Update

---------

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2024-04-24 12:21:41 +08:00
bd76a82c1f Update conversation_api.md (#489)
Fixed a spelling error:
save -> safe

### What problem does this PR solve?

Fixed a spelling error:
save -> safe

### Type of change

- [x] Documentation Update
2024-04-24 12:21:14 +08:00
2bc9a7cc18 Add Chinese readme for DeepDoc (#515)
### What problem does this PR solve?

Add Chinese explanation for deepdoc

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-04-24 12:20:56 +08:00
2d228dbf7f feat: create folder #345 (#518)
### What problem does this PR solve?

feat: create folder
feat: ensure that all files in the current folder can be correctly
requested after renaming the folder
#345 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-24 11:07:22 +08:00
369400c483 fix bug of table in docx (#510)
### What problem does this PR solve?
#509 
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-23 19:10:33 +08:00
6405041b4d fix: cannot save the system model setting #468 (#508)
### What problem does this PR solve?

fix: cannot save the system model setting #468
feat: rename file in FileManager
feat: add FileManager
feat: override useSelector type

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-23 17:46:56 +08:00
aa71462a9f fix bug #502 (#504)
### What problem does this PR solve?

#502 
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-23 16:01:46 +08:00
72384b191d Add .doc file parser. (#497)
### What problem does this PR solve?
Add `.doc` file parser, using tika.
```
pip install tika
```
```
from tika import parser
from io import BytesIO

def extract_text_from_doc_bytes(doc_bytes):
    # Wrap the raw bytes in a file-like object and let Tika extract the text.
    file_like_object = BytesIO(doc_bytes)
    parsed = parser.from_buffer(file_like_object)
    return parsed["content"]
```
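A minimal usage sketch (tika-python needs a local Java runtime to launch its parsing server on first use; the file name is illustrative):

```python
with open("sample.doc", "rb") as f:
    print(extract_text_from_doc_bytes(f.read()))
```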
### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: chrysanthemum-boy <fannc@qq.com>
2024-04-23 15:31:43 +08:00
0dfc8ddc0f enlarge docker memory usage (#501)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-23 14:41:10 +08:00
78402d9a57 enlarge docker memory usage (#496)
### What problem does this PR solve?

### Type of change


- [x] Refactoring
2024-04-23 10:28:09 +08:00
b448c212ee Adjust the structure of FAQ (#479)
### Type of change

- [x] Documentation Update
2024-04-22 16:51:28 +08:00
0aaade088b .doc files are not supported yet; fix the regular expression so the right message can be alerted (#487)

### What problem does this PR solve?

.doc files are not supported yet; fix the regular expression so that the right
message can be alerted.

### Type of change

- [ ] Bug Fix: issue #474
2024-04-22 16:44:20 +08:00
a38e163035 remove doc from supported processing types (#488)
### What problem does this PR solve?
#474 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-22 15:46:09 +08:00
3610e1e5b4 fix ollama issue (#486)
### What problem does this PR solve?

#477 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-22 15:13:01 +08:00
11949f9f2e feat: support markdown files (#483)
parse markdown files as txt

### What problem does this PR solve?

support markdown files

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-22 14:43:36 +08:00
b8e58fe27a add redis to accelerate access of minio (#482)
### What problem does this PR solve?

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-22 14:11:09 +08:00
fc87c20bd8 fix: 🐛 Fix duplicate ports in docker-compose (#472)
### What problem does this PR solve?

Fix duplicate ports in docker-compose

![image](https://github.com/infiniflow/ragflow/assets/54298540/32649b74-97dc-4004-b9aa-ac5e77b368a5)


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-21 22:46:07 +08:00
dee6299ddf Update format (#467)
### What problem does this PR solve?

Update README format

### Type of change

- [x] Documentation Update

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-19 20:13:39 +08:00
101df2b470 Refine conversaion docs (#465)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-19 19:15:00 +08:00
c055f40dff feat: #345 even if the backend data returns empty, the skeleton of the chart will be displayed. (#461)

### What problem does this PR solve?

feat: #345 even if the backend data returns empty, the skeleton of the
chart will be displayed.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-19 19:05:30 +08:00
7da3f88e54 refine docs for chat bot api. (#463)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-19 19:05:15 +08:00
10b79effab trivals (#462)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-19 18:54:24 +08:00
7e41b4bc94 change readme for 0.3.0 release (#459)
### What problem does this PR solve?


### Type of change

- [x] Documentation Update
2024-04-19 18:19:15 +08:00
ed6081845a Fit a lot of encodings for text file. (#458)
### What problem does this PR solve?

#384
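A minimal sketch of the general approach (the encoding list and helper below are illustrative, not the repo's actual implementation):

```python
def read_text_best_effort(raw: bytes) -> str:
    # Try a handful of common encodings, then fall back to a lossy decode.
    for enc in ("utf-8", "gb18030", "big5", "utf-16"):
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, UnicodeError):
            continue
    return raw.decode("utf-8", errors="ignore")
```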

### Type of change

- [x] Performance Improvement
2024-04-19 18:02:53 +08:00
cda7b607cb feat: translate EmbedModal #345 (#455)
### What problem does this PR solve?

Embed the chat window into other websites through iframe

#345 

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2024-04-19 16:55:23 +08:00
962c66714e fix divide by zero bug (#447)
### What problem does this PR solve?

#445 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-19 11:26:38 +08:00
39f1feaccb Bug fix pdf parse index out of range (#440)
### What problem does this PR solve?

fix a bug that occurs when parsing some PDF files #436

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-19 08:44:51 +08:00
1dada69daa fix: replace some pictures of chunk method #437 (#438)
### What problem does this PR solve?

some chunk method pictures are not in English #437

feat: set the height of both html and body to 100%
feat: add SharedChat
feat: add shared hooks

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-18 19:27:53 +08:00
fe2f5205fc add lf end-lines in *.sh (#425)
### What problem does this PR solve?

link #279 #266 

### Type of change

- [x] Documentation Update

---------

Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com>
2024-04-18 17:17:54 +08:00
ac574af60a Add env to expose minio port to the host (#426)
### What problem does this PR solve?

The docker-compose file couldn't configure the MinIO-related ports via the
.env file, so the variables `MINIO_CONSOLE_PORT=9001` and `MINIO_PORT=9000`
were added to the .env file.
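A rough sketch of checking the result (the `minio` service name is an assumption, not verified against this PR):

```bash
grep MINIO docker/.env           # expect MINIO_CONSOLE_PORT=9001 and MINIO_PORT=9000
docker compose up -d minio       # assumed service name
docker compose port minio 9000   # prints the host address:port now bound to the S3 API
```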

### Type of change

- [x] Refactoring
2024-04-18 15:45:09 +08:00
0499a3f621 rm page number exception for pdf parser (#424)
### What problem does this PR solve?

#423 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-18 12:09:56 +08:00
453c29170f make sure the models will not be loaded twice (#422)
### What problem does this PR solve?

#381 
### Type of change

- [x] Refactoring
2024-04-18 09:37:23 +08:00
YC
e8570da856 Update table.py to convert clmns to string (#414)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-17 19:48:11 +08:00
dd7559a009 Update PR template (#415)
### What problem does this PR solve?

Update PR template

### Type of change

- [x] Documentation Update

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-17 16:43:08 +08:00
3719ff7299 Added some debugging FAQs (#413)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-17 16:32:36 +08:00
800b5c7aaa fix bulk error for table method (#407)
### What problem does this PR solve?


Issue link:#366

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-17 12:17:14 +08:00
f12f30bb7b Add automation scripts to support displaying environment information such as RAGFlow repository version, operating system, Python version, etc. in a Linux environment for users to report issues. (#396)
### What problem does this PR solve?
Add automation scripts to support displaying environment information
such as RAGFlow repository version, operating system, Python version,
etc. in a Linux environment for users to report issues.
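A rough sketch of what such a report script might look like (the file name and exact fields are assumptions, not the PR's actual script):

```bash
#!/usr/bin/env bash
# Hypothetical environment report for issue filing.
echo "RAGFlow commit : $(git rev-parse --short HEAD 2>/dev/null || echo 'not a git checkout')"
echo "OS             : $(uname -srm)"
echo "Python         : $(python3 --version 2>&1)"
echo "Docker         : $(docker --version 2>/dev/null || echo 'not installed')"
echo "Docker Compose : $(docker compose version 2>/dev/null || echo 'not installed')"
```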

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-17 11:54:06 +08:00
30846c83b2 feat: modify the description of qa (#406)
### What problem does this PR solve?

feat: modify the description of qa

Issue link: #405

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-17 11:51:01 +08:00
2afe7a74b3 Added FAQs (#395)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-16 19:51:20 +08:00
d4e0bfc8a5 fix gb2312 encoding issue (#394)
### What problem does this PR solve?

Issue link:#384
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-16 19:45:14 +08:00
044daff668 Fix document bug (#393)
### What problem does this PR solve?

As title

### Type of change

- [x] Documentation Update

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-16 19:23:09 +08:00
03f8b01b3b fix bug for fastembed (#392)
### What problem does this PR solve?

Issue link:#325

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-16 19:12:12 +08:00
ad6f0a1ce5 feat: add overview (#391)
### What problem does this PR solve?

feat: render stats charts
feat: create api token
feat: delete api token
feat: add ChatApiKeyModal
feat: add RagLineChart


Issue link: #345

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-16 19:06:47 +08:00
b3843138f4 change version number in readme (#390)
### What problem does this PR solve?


### Type of change

- [x] Documentation Update
2024-04-16 19:04:29 +08:00
e0bdcbbeba fix: revert #359 (#388)
### What problem does this PR solve?

![图片](https://github.com/infiniflow/ragflow/assets/106524776/e62dc04d-dd72-4ef6-ab1f-a2a219dc197a)

revert #359

Issue link:#359

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-16 18:55:51 +08:00
582340a184 Added FAQ why document parsing gets stuck (#385)
### What problem does this PR solve?


### Type of change


- [x] Documentation Update
2024-04-16 16:55:44 +08:00
890561703b Add bce-embedding and fastembed (#383)
### What problem does this PR solve?


Issue link:#326

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-16 16:42:19 +08:00
a7be5d4e8b build ragflow image from scratch (#376)
### What problem does this PR solve?

issue: #205 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-16 12:29:58 +08:00
c344486aa0 enlarge max file number per user limit (#373)
### What problem does this PR solve?

Issue link:#370

### Type of change

- [x] Refactoring
2024-04-16 10:00:52 +08:00
111501af5e make <xxxx> visible (#369)
### What problem does this PR solve?


![image](https://github.com/infiniflow/ragflow/assets/106524776/0c526a56-05b1-42f8-8bf5-cb23a97183b8)

make `<xxxx>` visible; it was misinterpreted as part of the HTML tags.

![image](https://github.com/infiniflow/ragflow/assets/106524776/1c42aef0-6989-40c1-b129-47a835b038a7)

Issue link:None
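For background on why a raw angle-bracketed token vanishes when rendered as HTML/markdown (a general illustration, not necessarily this PR's exact fix):

```python
import html

raw = "<xxxx> should stay visible"
print(html.escape(raw))  # &lt;xxxx&gt; survives rendering once escaped
```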

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Breaking Change (fix or feature that could cause existing
functionality not to work as expected)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Test cases
- [ ] Python SDK impacted, Need to update PyPI
- [ ] Other (please describe):
2024-04-16 09:47:57 +08:00
9e75bd4d88 change version number (#368)
### What problem does this PR solve?

Issue link: for release v0.1.0

### Type of change

- [x] Documentation Update
- [x] Refactoring
2024-04-15 19:51:18 +08:00
213 changed files with 11315 additions and 2592 deletions

1
.gitattributes vendored Normal file
View File

@ -0,0 +1 @@
*.sh text eol=lf

View File

@ -1,5 +1,5 @@
name: Bug Report
description: Create a bug issue for infinity
description: Create a bug issue for RAGFlow
title: "[Bug]: "
labels: [bug]
body:

View File

@ -1,7 +1,7 @@
---
name: Feature request
title: '[Feature Request]: '
about: Suggest an idea for Infinity
about: Suggest an idea for RAGFlow
labels: ''
---

View File

@ -1,5 +1,5 @@
name: Feature request
description: Propose a feature request for infinity.
description: Propose a feature request for RAGFlow.
title: "[Feature Request]: "
labels: [feature request]
body:

View File

@ -1,5 +1,5 @@
name: Question
description: Ask questions on infinity
description: Ask questions on RAGFlow
title: "[Question]: "
labels: [question]
body:

View File

@ -1,5 +1,5 @@
name: Subtask
description: "Propose a subtask for infinity"
description: "Propose a subtask for RAGFlow"
title: "[Subtask]: "
labels: [subtask]

View File

@ -2,16 +2,11 @@
_Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._
Issue link:#[Link the issue here]
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Breaking Change (fix or feature that could cause existing functionality not to work as expected)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Test cases
- [ ] Python SDK impacted, Need to update PyPI
- [ ] Other (please describe):

8
.gitignore vendored
View File

@ -21,3 +21,11 @@ Cargo.lock
.idea/
.vscode/
# Exclude Mac generated files
.DS_Store
# Exclude the log folder
docker/ragflow-logs/
/flask_session
/logs

View File

@ -1,20 +1,20 @@
FROM swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow-base:v1.0
USER root
WORKDIR /ragflow
ADD ./web ./web
RUN cd ./web && npm i && npm run build
ADD ./api ./api
ADD ./conf ./conf
ADD ./deepdoc ./deepdoc
ADD ./rag ./rag
ENV PYTHONPATH=/ragflow/
ENV HF_ENDPOINT=https://hf-mirror.com
ADD docker/entrypoint.sh ./entrypoint.sh
RUN chmod +x ./entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
FROM swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow-base:v1.0
USER root
WORKDIR /ragflow
ADD ./web ./web
RUN cd ./web && npm i --force && npm run build
ADD ./api ./api
ADD ./conf ./conf
ADD ./deepdoc ./deepdoc
ADD ./rag ./rag
ENV PYTHONPATH=/ragflow/
ENV HF_ENDPOINT=https://hf-mirror.com
ADD docker/entrypoint.sh ./entrypoint.sh
RUN chmod +x ./entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

View File

@ -9,7 +9,7 @@ RUN /root/miniconda3/envs/py11/bin/pip install onnxruntime-gpu --extra-index-url
ADD ./web ./web
RUN cd ./web && npm i && npm run build
RUN cd ./web && npm i --force && npm run build
ADD ./api ./api
ADD ./conf ./conf

54
Dockerfile.scratch Normal file
View File

@ -0,0 +1,54 @@
FROM ubuntu:22.04
USER root
WORKDIR /ragflow
RUN apt-get update && apt-get install -y wget curl build-essential libopenmpi-dev
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
bash ~/miniconda.sh -b -p /root/miniconda3 && \
rm ~/miniconda.sh && ln -s /root/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
echo ". /root/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc && \
echo "conda activate base" >> ~/.bashrc
ENV PATH /root/miniconda3/bin:$PATH
RUN conda create -y --name py11 python=3.11
ENV CONDA_DEFAULT_ENV py11
ENV CONDA_PREFIX /root/miniconda3/envs/py11
ENV PATH $CONDA_PREFIX/bin:$PATH
RUN curl -sL https://deb.nodesource.com/setup_14.x | bash -
RUN apt-get install -y nodejs
RUN apt-get install -y nginx
ADD ./web ./web
ADD ./api ./api
ADD ./conf ./conf
ADD ./deepdoc ./deepdoc
ADD ./rag ./rag
ADD ./requirements.txt ./requirements.txt
RUN apt install openmpi-bin openmpi-common libopenmpi-dev
ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu/openmpi/lib:$LD_LIBRARY_PATH
RUN rm /root/miniconda3/envs/py11/compiler_compat/ld
RUN cd ./web && npm i --force && npm run build
RUN conda run -n py11 pip install -i https://mirrors.aliyun.com/pypi/simple/ -r ./requirements.txt
RUN apt-get update && \
apt-get install -y libglib2.0-0 libgl1-mesa-glx && \
rm -rf /var/lib/apt/lists/*
RUN conda run -n py11 pip install -i https://mirrors.aliyun.com/pypi/simple/ ollama
RUN conda run -n py11 python -m nltk.downloader punkt
RUN conda run -n py11 python -m nltk.downloader wordnet
ENV PYTHONPATH=/ragflow/
ENV HF_ENDPOINT=https://hf-mirror.com
ADD docker/entrypoint.sh ./entrypoint.sh
RUN chmod +x ./entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

56
Dockerfile.scratch.oc9 Normal file
View File

@ -0,0 +1,56 @@
FROM opencloudos/opencloudos:9.0
USER root
WORKDIR /ragflow
RUN dnf update -y && dnf install -y wget curl gcc-c++ openmpi-devel
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
bash ~/miniconda.sh -b -p /root/miniconda3 && \
rm ~/miniconda.sh && ln -s /root/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
echo ". /root/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc && \
echo "conda activate base" >> ~/.bashrc
ENV PATH /root/miniconda3/bin:$PATH
RUN conda create -y --name py11 python=3.11
ENV CONDA_DEFAULT_ENV py11
ENV CONDA_PREFIX /root/miniconda3/envs/py11
ENV PATH $CONDA_PREFIX/bin:$PATH
# RUN curl -sL https://rpm.nodesource.com/setup_14.x | bash -
RUN dnf install -y nodejs
RUN dnf install -y nginx
ADD ./web ./web
ADD ./api ./api
ADD ./conf ./conf
ADD ./deepdoc ./deepdoc
ADD ./rag ./rag
ADD ./requirements.txt ./requirements.txt
RUN dnf install -y openmpi openmpi-devel python3-openmpi
ENV C_INCLUDE_PATH /usr/include/openmpi-x86_64:$C_INCLUDE_PATH
ENV LD_LIBRARY_PATH /usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
RUN rm /root/miniconda3/envs/py11/compiler_compat/ld
RUN cd ./web && npm i --force && npm run build
RUN conda run -n py11 pip install $(grep -ivE "mpi4py" ./requirements.txt) # without mpi4py==3.1.5
RUN conda run -n py11 pip install redis
RUN dnf update -y && \
dnf install -y glib2 mesa-libGL && \
dnf clean all
RUN conda run -n py11 pip install ollama
RUN conda run -n py11 python -m nltk.downloader punkt
RUN conda run -n py11 python -m nltk.downloader wordnet
ENV PYTHONPATH=/ragflow/
ENV HF_ENDPOINT=https://hf-mirror.com
ADD docker/entrypoint.sh ./entrypoint.sh
RUN chmod +x ./entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

View File

@ -11,13 +11,16 @@
</p>
<p align="center">
<a href="https://github.com/infiniflow/ragflow/releases/latest">
<img src="https://img.shields.io/github/v/release/infiniflow/ragflow?color=blue&label=Latest%20Release" alt="Latest Release">
</a>
<a href="https://demo.ragflow.io" target="_blank">
<img alt="Static Badge" src="https://img.shields.io/badge/RAGFLOW-LLM-white?&labelColor=dd0af7"></a>
<img alt="Static Badge" src="https://img.shields.io/badge/Online-Demo-4e6b99"></a>
<a href="https://hub.docker.com/r/infiniflow/ragflow" target="_blank">
<img src="https://img.shields.io/badge/docker_pull-ragflow:v1.0-brightgreen"
alt="docker pull ragflow:v1.0"></a>
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.5.0-brightgreen"
alt="docker pull infiniflow/ragflow:v0.5.0"></a>
<a href="https://github.com/infiniflow/ragflow/blob/main/LICENSE">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=7d09f1" alt="license">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=1570EF" alt="license">
</a>
</p>
@ -55,10 +58,14 @@
## 📌 Latest Features
- 2024-04-11 Support [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Add a new layout recognization model for analyzing Laws documentation.
- 2024-04-08 Support [Ollama](./docs/ollama.md) for local LLM deployment.
- 2024-04-07 Support Chinese UI.
- 2024-05-08 Integrates LLM DeepSeek.
- 2024-04-26 Adds file management.
- 2024-04-19 Supports conversation API ([detail](./docs/conversation_api.md)).
- 2024-04-16 Integrates an embedding model 'bce-embedding-base_v1' from [BCEmbedding](https://github.com/netease-youdao/BCEmbedding), and [FastEmbed](https://github.com/qdrant/fastembed), which is designed specifically for light and speedy embedding.
- 2024-04-11 Supports [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Adds a new layout recognition model for analyzing Laws documentation.
- 2024-04-08 Supports [Ollama](./docs/ollama.md) for local LLM deployment.
- 2024-04-07 Supports Chinese UI.
## 🔎 System Architecture
@ -70,8 +77,9 @@
### 📝 Prerequisites
- CPU >= 2 cores
- RAM >= 8 GB
- CPU >= 4 cores
- RAM >= 16 GB
- Disk >= 50 GB
- Docker >= 24.0.0 & Docker Compose >= v2.26.1
> If you have not installed Docker on your local machine (Windows, Mac, or Linux), see [Install Docker Engine](https://docs.docker.com/engine/install/).
@ -111,6 +119,7 @@
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
> Please note that running the above commands will automatically download the development version docker image of RAGFlow. If you want to download and run a specific version of docker image, please find the RAGFLOW_VERSION variable in the docker/.env file, change it to the corresponding version, for example, RAGFLOW_VERSION=v0.5.0, and run the above commands.
> The core image is about 9 GB in size and may take a while to load.
@ -135,9 +144,10 @@
* Running on http://x.x.x.x:9380
INFO:werkzeug:Press CTRL+C to quit
```
> If you skip this confirmation step and directly log in to RAGFlow, your browser may prompt a `network anomaly` error because, at that moment, your RAGFlow may not be fully initialized.
5. In your web browser, enter the IP address of your server and log in to RAGFlow.
> In the given scenario, you only need to enter `http://IP_OF_YOUR_MACHINE` (sans port number) as the default HTTP serving port `80` can be omitted when using the default configurations.
> With default settings, you only need to enter `http://IP_OF_YOUR_MACHINE` (**sans** port number) as the default HTTP serving port `80` can be omitted when using the default configurations.
6. In [service_conf.yaml](./docker/service_conf.yaml), select the desired LLM factory in `user_default_llm` and update the `API_KEY` field with the corresponding API key.
> See [./docs/llm_api_key_setup.md](./docs/llm_api_key_setup.md) for more information.
@ -171,12 +181,72 @@ To build the Docker images from source:
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
$ docker build -t infiniflow/ragflow:v1.0 .
$ docker build -t infiniflow/ragflow:dev .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
## 🛠️ Launch Service from Source
To launch the service from source, please follow these steps:
1. Clone the repository
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
```
2. Create a virtual environment (ensure Anaconda or Miniconda is installed)
```bash
$ conda create -n ragflow python=3.11.0
$ conda activate ragflow
$ pip install -r requirements.txt
```
If CUDA version is greater than 12.0, execute the following additional commands:
```bash
$ pip uninstall -y onnxruntime-gpu
$ pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```
3. Copy the entry script and configure environment variables
```bash
$ cp docker/entrypoint.sh .
$ vi entrypoint.sh
```
Use the following commands to obtain the Python path and the ragflow project path:
```bash
$ which python
$ pwd
```
Set the output of `which python` as the value for `PY` and the output of `pwd` as the value for `PYTHONPATH`.
If `LD_LIBRARY_PATH` is already configured, it can be commented out.
```bash
# Adjust configurations according to your actual situation; the two export commands are newly added.
PY=${PY}
export PYTHONPATH=${PYTHONPATH}
# Optional: Add Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
```
4. Start the base services
```bash
$ cd docker
$ docker compose -f docker-compose-base.yml up -d
```
5. Check the configuration files
Ensure that the settings in **docker/.env** match those in **conf/service_conf.yaml**. The IP addresses and ports for related services in **service_conf.yaml** should be changed to the local machine IP and ports exposed by the container.
6. Launch the service
```bash
$ chmod +x ./entrypoint.sh
$ bash ./entrypoint.sh
```
## 📚 Documentation
- [FAQ](./docs/faq.md)

View File

@ -11,13 +11,16 @@
</p>
<p align="center">
<a href="https://github.com/infiniflow/ragflow/releases/latest">
<img src="https://img.shields.io/github/v/release/infiniflow/ragflow?color=blue&label=Latest%20Release" alt="Latest Release">
</a>
<a href="https://demo.ragflow.io" target="_blank">
<img alt="Static Badge" src="https://img.shields.io/badge/RAGFLOW-LLM-white?&labelColor=dd0af7"></a>
<img alt="Static Badge" src="https://img.shields.io/badge/Online-Demo-4e6b99"></a>
<a href="https://hub.docker.com/r/infiniflow/ragflow" target="_blank">
<img src="https://img.shields.io/badge/docker_pull-ragflow:v1.0-brightgreen"
alt="docker pull ragflow:v1.0"></a>
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.5.0-brightgreen"
alt="docker pull infiniflow/ragflow:v0.5.0"></a>
<a href="https://github.com/infiniflow/ragflow/blob/main/LICENSE">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=7d09f1" alt="license">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=1570EF" alt="license">
</a>
</p>
@ -55,6 +58,11 @@
## 📌 最新の機能
- 2024-05-08
- 2024-04-26 「ファイル管理」機能を追加しました。
- 2024-04-19 会話 API をサポートします ([詳細](./docs/conversation_api.md))。
- 2024-04-16 [BCEmbedding](https://github.com/netease-youdao/BCEmbedding) から埋め込みモデル「bce-embedding-base_v1」を追加します。
- 2024-04-16 [FastEmbed](https://github.com/qdrant/fastembed) は、軽量かつ高速な埋め込み用に設計されています。
- 2024-04-11 ローカル LLM デプロイメント用に [Xinference](./docs/xinference.md) をサポートします。
- 2024-04-10 メソッド「Laws」に新しいレイアウト認識モデルを追加します。
- 2024-04-08 [Ollama](./docs/ollama.md) を使用した大規模モデルのローカライズされたデプロイメントをサポートします。
@ -70,8 +78,9 @@
### 📝 必要条件
- CPU >= 2 cores
- RAM >= 8 GB
- CPU >= 4 cores
- RAM >= 16 GB
- Disk >= 50 GB
- Docker >= 24.0.0 & Docker Compose >= v2.26.1
> ローカルマシンWindows、Mac、または Linuxに Docker をインストールしていない場合は、[Docker Engine のインストール](https://docs.docker.com/engine/install/) を参照してください。
@ -112,7 +121,9 @@
$ docker compose up -d
```
> コアイメージのサイズは約 15 GB で、ロードに時間がかかる場合があります
> 上記のコマンドを実行すると、RAGFlowの開発版dockerイメージが自動的にダウンロードされます。 特定のバージョンのDockerイメージをダウンロードして実行したい場合は、docker/.envファイルのRAGFLOW_VERSION変数を見つけて、対応するバージョンに変更してください。 例えば、RAGFLOW_VERSION=v0.5.0として、上記のコマンドを実行してください
> コアイメージのサイズは約 9 GB で、ロードに時間がかかる場合があります。
4. サーバーを立ち上げた後、サーバーの状態を確認する:
@ -135,6 +146,7 @@
* Running on http://x.x.x.x:9380
INFO:werkzeug:Press CTRL+C to quit
```
> もし確認ステップをスキップして直接 RAGFlow にログインした場合、その時点で RAGFlow が完全に初期化されていない可能性があるため、ブラウザーがネットワーク異常エラーを表示するかもしれません。
5. ウェブブラウザで、プロンプトに従ってサーバーの IP アドレスを入力し、RAGFlow にログインします。
> デフォルトの設定を使用する場合、デフォルトの HTTP サービングポート `80` は省略できるので、与えられたシナリオでは、`http://IP_OF_YOUR_MACHINE`(ポート番号は省略)だけを入力すればよい。
@ -171,12 +183,72 @@
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
$ docker build -t infiniflow/ragflow:v1.0 .
$ docker build -t infiniflow/ragflow:v0.5.0 .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
## 🛠️ ソースコードからサービスを起動する方法
ソースコードからサービスを起動する場合は、以下の手順に従ってください:
1. リポジトリをクローンします
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
```
2. 仮想環境を作成しますAnacondaまたはMinicondaがインストールされていることを確認してください
```bash
$ conda create -n ragflow python=3.11.0
$ conda activate ragflow
$ pip install -r requirements.txt
```
CUDAのバージョンが12.0以上の場合、以下の追加コマンドを実行してください:
```bash
$ pip uninstall -y onnxruntime-gpu
$ pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```
3. エントリースクリプトをコピーし、環境変数を設定します
```bash
$ cp docker/entrypoint.sh .
$ vi entrypoint.sh
```
以下のコマンドでPythonのパスとragflowプロジェクトのパスを取得します
```bash
$ which python
$ pwd
```
`which python`の出力を`PY`の値として、`pwd`の出力を`PYTHONPATH`の値として設定します。
`LD_LIBRARY_PATH`が既に設定されている場合は、コメントアウトできます。
```bash
# 実際の状況に応じて設定を調整してください。以下の二つのexportは新たに追加された設定です
PY=${PY}
export PYTHONPATH=${PYTHONPATH}
# オプションHugging Faceミラーを追加
export HF_ENDPOINT=https://hf-mirror.com
```
4. 基本サービスを起動します
```bash
$ cd docker
$ docker compose -f docker-compose-base.yml up -d
```
5. 設定ファイルを確認します
**docker/.env**内の設定が**conf/service_conf.yaml**内の設定と一致していることを確認してください。**service_conf.yaml**内の関連サービスのIPアドレスとポートは、ローカルマシンのIPアドレスとコンテナが公開するポートに変更する必要があります。
6. サービスを起動します
```bash
$ chmod +x ./entrypoint.sh
$ bash ./entrypoint.sh
```
## 📚 ドキュメンテーション
- [FAQ](./docs/faq.md)

View File

@ -11,13 +11,16 @@
</p>
<p align="center">
<a href="https://github.com/infiniflow/ragflow/releases/latest">
<img src="https://img.shields.io/github/v/release/infiniflow/ragflow?color=blue&label=Latest%20Release" alt="Latest Release">
</a>
<a href="https://demo.ragflow.io" target="_blank">
<img alt="Static Badge" src="https://img.shields.io/badge/RAGFLOW-LLM-white?&labelColor=dd0af7"></a>
<img alt="Static Badge" src="https://img.shields.io/badge/Online-Demo-4e6b99"></a>
<a href="https://hub.docker.com/r/infiniflow/ragflow" target="_blank">
<img src="https://img.shields.io/badge/docker_pull-ragflow:v1.0-brightgreen"
alt="docker pull ragflow:v1.0"></a>
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.5.0-brightgreen"
alt="docker pull infiniflow/ragflow:v0.5.0"></a>
<a href="https://github.com/infiniflow/ragflow/blob/main/LICENSE">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=7d09f1" alt="license">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=1570EF" alt="license">
</a>
</p>
@ -55,6 +58,10 @@
## 📌 新增功能
- 2024-05-08 集成大模型 DeepSeek
- 2024-04-26 增添了'文件管理'功能.
- 2024-04-19 支持对话 API ([更多](./docs/conversation_api.md)).
- 2024-04-16 集成嵌入模型 [BCEmbedding](https://github.com/netease-youdao/BCEmbedding) 和 专为轻型和高速嵌入而设计的 [FastEmbed](https://github.com/qdrant/fastembed) 。
- 2024-04-11 支持用 [Xinference](./docs/xinference.md) 本地化部署大模型。
- 2024-04-10 为Laws版面分析增加了底层模型。
- 2024-04-08 支持用 [Ollama](./docs/ollama.md) 本地化部署大模型。
@ -70,8 +77,9 @@
### 📝 前提条件
- CPU >= 2
- RAM >= 8 GB
- CPU >= 4
- RAM >= 16 GB
- Disk >= 50 GB
- Docker >= 24.0.0 & Docker Compose >= v2.26.1
> 如果你并没有在本机安装 DockerWindows、Mac或者 Linux, 可以参考文档 [Install Docker Engine](https://docs.docker.com/engine/install/) 自行安装。
@ -112,7 +120,9 @@
$ docker compose -f docker-compose-CN.yml up -d
```
> 核心镜像文件大约 15 GB可能需要一定时间拉取。请耐心等待
> 请注意,运行上述命令会自动下载 RAGFlow 的开发版本 docker 镜像。如果你想下载并运行特定版本的 docker 镜像,请在 docker/.env 文件中找到 RAGFLOW_VERSION 变量,将其改为对应版本。例如 RAGFLOW_VERSION=v0.5.0,然后运行上述命令
> 核心镜像文件大约 9 GB可能需要一定时间拉取。请耐心等待。
4. 服务器启动成功后再次确认服务器状态:
@ -135,6 +145,7 @@
* Running on http://x.x.x.x:9380
INFO:werkzeug:Press CTRL+C to quit
```
> 如果您跳过这一步系统确认步骤就登录 RAGFlow你的浏览器有可能会提示 `network anomaly` 或 `网络异常`,因为 RAGFlow 可能并未完全启动成功。
5. 在你的浏览器中输入你的服务器对应的 IP 地址并登录 RAGFlow。
> 上面这个例子中,您只需输入 http://IP_OF_YOUR_MACHINE 即可:未改动过配置则无需输入端口(默认的 HTTP 服务端口 80
@ -171,12 +182,72 @@
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
$ docker build -t infiniflow/ragflow:v1.0 .
$ docker build -t infiniflow/ragflow:v0.5.0 .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
## 🛠️ 源码启动服务
如需从源码启动服务,请参考以下步骤:
1. 克隆仓库
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
```
2. 创建虚拟环境(确保已安装 Anaconda 或 Miniconda
```bash
$ conda create -n ragflow python=3.11.0
$ conda activate ragflow
$ pip install -r requirements.txt
```
如果cuda > 12.0,需额外执行以下命令:
```bash
$ pip uninstall -y onnxruntime-gpu
$ pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```
3. 拷贝入口脚本并配置环境变量
```bash
$ cp docker/entrypoint.sh .
$ vi entrypoint.sh
```
使用以下命令获取python路径及ragflow项目路径
```bash
$ which python
$ pwd
```
将上述`which python`的输出作为`PY`的值,将`pwd`的输出作为`PYTHONPATH`的值。
`LD_LIBRARY_PATH`如果环境已经配置好,可以注释掉。
```bash
# 此处配置需要按照实际情况调整两个export为新增配置
PY=${PY}
export PYTHONPATH=${PYTHONPATH}
# 可选添加Hugging Face镜像
export HF_ENDPOINT=https://hf-mirror.com
```
4. 启动基础服务
```bash
$ cd docker
$ docker compose -f docker-compose-base.yml up -d
```
5. 检查配置文件
确保**docker/.env**中的配置与**conf/service_conf.yaml**中配置一致, **service_conf.yaml**中相关服务的IP地址与端口应该改成本机IP地址及容器映射出来的端口。
6. 启动服务
```bash
$ chmod +x ./entrypoint.sh
$ bash ./entrypoint.sh
```
## 📚 技术文档
- [FAQ](./docs/faq.md)

View File

@ -54,7 +54,7 @@ app.errorhandler(Exception)(server_error_response)
#app.config["LOGIN_DISABLED"] = True
app.config["SESSION_PERMANENT"] = False
app.config["SESSION_TYPE"] = "filesystem"
app.config['MAX_CONTENT_LENGTH'] = os.environ.get("MAX_CONTENT_LENGTH", 128 * 1024 * 1024)
app.config['MAX_CONTENT_LENGTH'] = int(os.environ.get("MAX_CONTENT_LENGTH", 128 * 1024 * 1024))
Session(app)
login_manager = LoginManager()

View File

@ -13,18 +13,28 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import re
from datetime import datetime, timedelta
from flask import request
from flask_login import login_required, current_user
from api.db import FileType, ParserType
from api.db.db_models import APIToken, API4Conversation
from api.db.services import duplicate_name
from api.db.services.api_service import APITokenService, API4ConversationService
from api.db.services.dialog_service import DialogService, chat
from api.db.services.document_service import DocumentService
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.db.services.user_service import UserTenantService
from api.settings import RetCode
from api.utils import get_uuid, current_timestamp, datetime_format
from api.utils.api_utils import server_error_response, get_data_error_result, get_json_result, validate_request
from itsdangerous import URLSafeTimedSerializer
from api.utils.file_utils import filename_type, thumbnail
from rag.utils.minio_conn import MINIO
def generate_confirmation_token(tenent_id):
serializer = URLSafeTimedSerializer(tenent_id)
@ -105,8 +115,8 @@ def stats():
res = {
"pv": [(o["dt"], o["pv"]) for o in objs],
"uv": [(o["dt"], o["uv"]) for o in objs],
"speed": [(o["dt"], o["tokens"]/o["duration"]) for o in objs],
"tokens": [(o["dt"], o["tokens"]/1000.) for o in objs],
"speed": [(o["dt"], float(o["tokens"])/(float(o["duration"]+0.1))) for o in objs],
"tokens": [(o["dt"], float(o["tokens"])/1000.) for o in objs],
"round": [(o["dt"], o["round"]) for o in objs],
"thumb_up": [(o["dt"], o["thumb_up"]) for o in objs]
}
@ -115,8 +125,7 @@ def stats():
return server_error_response(e)
@manager.route('/new_conversation', methods=['POST'])
@validate_request("user_id")
@manager.route('/new_conversation', methods=['GET'])
def set_conversation():
token = request.headers.get('Authorization').split()[1]
objs = APIToken.query(token=token)
@ -131,7 +140,7 @@ def set_conversation():
conv = {
"id": get_uuid(),
"dialog_id": dia.id,
"user_id": req["user_id"],
"user_id": request.args.get("user_id", ""),
"message": [{"role": "assistant", "content": dia.prompt_config["prologue"]}]
}
API4ConversationService.save(**conv)
@ -177,7 +186,6 @@ def completion():
conv.reference.append(ans["reference"])
conv.message.append({"role": "assistant", "content": ans["answer"]})
API4ConversationService.append_message(conv.id, conv.to_dict())
APITokenService.APITokenService(token)
return get_json_result(data=ans)
except Exception as e:
return server_error_response(e)
@ -193,4 +201,74 @@ def get(conversation_id):
return get_json_result(data=conv.to_dict())
except Exception as e:
return server_error_response(e)
return server_error_response(e)
@manager.route('/document/upload', methods=['POST'])
@validate_request("kb_name")
def upload():
token = request.headers.get('Authorization').split()[1]
objs = APIToken.query(token=token)
if not objs:
return get_json_result(
data=False, retmsg='Token is not valid!"', retcode=RetCode.AUTHENTICATION_ERROR)
kb_name = request.form.get("kb_name").strip()
tenant_id = objs[0].tenant_id
try:
e, kb = KnowledgebaseService.get_by_name(kb_name, tenant_id)
if not e:
return get_data_error_result(
retmsg="Can't find this knowledgebase!")
kb_id = kb.id
except Exception as e:
return server_error_response(e)
if 'file' not in request.files:
return get_json_result(
data=False, retmsg='No file part!', retcode=RetCode.ARGUMENT_ERROR)
file = request.files['file']
if file.filename == '':
return get_json_result(
data=False, retmsg='No file selected!', retcode=RetCode.ARGUMENT_ERROR)
try:
if DocumentService.get_doc_count(kb.tenant_id) >= int(os.environ.get('MAX_FILE_NUM_PER_USER', 8192)):
return get_data_error_result(
retmsg="Exceed the maximum file number of a free user!")
filename = duplicate_name(
DocumentService.query,
name=file.filename,
kb_id=kb_id)
filetype = filename_type(filename)
if not filetype:
return get_data_error_result(
retmsg="This type of file has not been supported yet!")
location = filename
while MINIO.obj_exist(kb_id, location):
location += "_"
blob = request.files['file'].read()
MINIO.put(kb_id, location, blob)
doc = {
"id": get_uuid(),
"kb_id": kb.id,
"parser_id": kb.parser_id,
"parser_config": kb.parser_config,
"created_by": kb.tenant_id,
"type": filetype,
"name": filename,
"location": location,
"size": len(blob),
"thumbnail": thumbnail(filename, blob)
}
if doc["type"] == FileType.VISUAL:
doc["parser_id"] = ParserType.PICTURE.value
if re.search(r"\.(ppt|pptx|pages)$", filename):
doc["parser_id"] = ParserType.PRESENTATION.value
doc = DocumentService.insert(doc)
return get_json_result(data=doc.to_json())
except Exception as e:
return server_error_response(e)

View File

@ -20,8 +20,9 @@ from flask_login import login_required, current_user
from elasticsearch_dsl import Q
from rag.app.qa import rmPrefix, beAdoc
from rag.nlp import search, huqie
from rag.utils import ELASTICSEARCH, rmSpace
from rag.nlp import search, rag_tokenizer
from rag.utils.es_conn import ELASTICSEARCH
from rag.utils import rmSpace
from api.db import LLMType, ParserType
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.db.services.llm_service import TenantLLMService
@ -124,10 +125,10 @@ def set():
d = {
"id": req["chunk_id"],
"content_with_weight": req["content_with_weight"]}
d["content_ltks"] = huqie.qie(req["content_with_weight"])
d["content_sm_ltks"] = huqie.qieqie(d["content_ltks"])
d["content_ltks"] = rag_tokenizer.tokenize(req["content_with_weight"])
d["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(d["content_ltks"])
d["important_kwd"] = req["important_kwd"]
d["important_tks"] = huqie.qie(" ".join(req["important_kwd"]))
d["important_tks"] = rag_tokenizer.tokenize(" ".join(req["important_kwd"]))
if "available_int" in req:
d["available_int"] = req["available_int"]
@ -151,7 +152,7 @@ def set():
retmsg="Q&A must be separated by TAB/ENTER key.")
q, a = rmPrefix(arr[0]), rmPrefix[arr[1]]
d = beAdoc(d, arr[0], arr[1], not any(
[huqie.is_chinese(t) for t in q + a]))
[rag_tokenizer.is_chinese(t) for t in q + a]))
v, c = embd_mdl.encode([doc.name, req["content_with_weight"]])
v = 0.1 * v[0] + 0.9 * v[1] if doc.parser_id != ParserType.QA else v[1]
@ -201,11 +202,11 @@ def create():
md5 = hashlib.md5()
md5.update((req["content_with_weight"] + req["doc_id"]).encode("utf-8"))
chunck_id = md5.hexdigest()
d = {"id": chunck_id, "content_ltks": huqie.qie(req["content_with_weight"]),
d = {"id": chunck_id, "content_ltks": rag_tokenizer.tokenize(req["content_with_weight"]),
"content_with_weight": req["content_with_weight"]}
d["content_sm_ltks"] = huqie.qieqie(d["content_ltks"])
d["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(d["content_ltks"])
d["important_kwd"] = req.get("important_kwd", [])
d["important_tks"] = huqie.qie(" ".join(req.get("important_kwd", [])))
d["important_tks"] = rag_tokenizer.tokenize(" ".join(req.get("important_kwd", [])))
d["create_time"] = str(datetime.datetime.now()).replace("T", " ")[:19]
d["create_timestamp_flt"] = datetime.datetime.now().timestamp()
@ -252,7 +253,7 @@ def retrieval_test():
return get_data_error_result(retmsg="Knowledgebase not found!")
embd_mdl = TenantLLMService.model_instance(
kb.tenant_id, LLMType.EMBEDDING.value)
kb.tenant_id, LLMType.EMBEDDING.value, llm_name=kb.embd_id)
ranks = retrievaler.retrieval(question, embd_mdl, kb.tenant_id, [kb_id], page, size, similarity_threshold,
vector_similarity_weight, top, doc_ids)
for c in ranks["chunks"]:

View File

@ -35,13 +35,7 @@ def set_dialog():
top_n = req.get("top_n", 6)
similarity_threshold = req.get("similarity_threshold", 0.1)
vector_similarity_weight = req.get("vector_similarity_weight", 0.3)
llm_setting = req.get("llm_setting", {
"temperature": 0.1,
"top_p": 0.3,
"frequency_penalty": 0.7,
"presence_penalty": 0.4,
"max_tokens": 215
})
llm_setting = req.get("llm_setting", {})
default_prompt = {
"system": """你是一个智能助手,请总结知识库的内容来回答问题,请列举知识库中的数据详细回答。当所有知识库内容都与问题无关时,你的回答必须包括“知识库中未找到您要的答案!”这句话。回答需要考虑聊天历史。
以下是知识库:

View File

@ -14,7 +14,7 @@
# limitations under the License
#
import base64
import os
import pathlib
import re
@ -22,8 +22,13 @@ import flask
from elasticsearch_dsl import Q
from flask import request
from flask_login import login_required, current_user
from api.db.db_models import Task
from api.db.services.file2document_service import File2DocumentService
from api.db.services.file_service import FileService
from api.db.services.task_service import TaskService, queue_tasks
from rag.nlp import search
from rag.utils import ELASTICSEARCH
from rag.utils.es_conn import ELASTICSEARCH
from api.db.services import duplicate_name
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.utils.api_utils import server_error_response, get_data_error_result, validate_request
@ -47,54 +52,59 @@ def upload():
if 'file' not in request.files:
return get_json_result(
data=False, retmsg='No file part!', retcode=RetCode.ARGUMENT_ERROR)
file = request.files['file']
if file.filename == '':
file_objs = request.files.getlist('file')
for file_obj in file_objs:
if file_obj.filename == '':
return get_json_result(
data=False, retmsg='No file selected!', retcode=RetCode.ARGUMENT_ERROR)
err = []
for file in file_objs:
try:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
raise LookupError("Can't find this knowledgebase!")
MAX_FILE_NUM_PER_USER = int(os.environ.get('MAX_FILE_NUM_PER_USER', 0))
if MAX_FILE_NUM_PER_USER > 0 and DocumentService.get_doc_count(kb.tenant_id) >= MAX_FILE_NUM_PER_USER:
raise RuntimeError("Exceed the maximum file number of a free user!")
filename = duplicate_name(
DocumentService.query,
name=file.filename,
kb_id=kb.id)
filetype = filename_type(filename)
if filetype == FileType.OTHER.value:
raise RuntimeError("This type of file has not been supported yet!")
location = filename
while MINIO.obj_exist(kb_id, location):
location += "_"
blob = file.read()
MINIO.put(kb_id, location, blob)
doc = {
"id": get_uuid(),
"kb_id": kb.id,
"parser_id": kb.parser_id,
"parser_config": kb.parser_config,
"created_by": current_user.id,
"type": filetype,
"name": filename,
"location": location,
"size": len(blob),
"thumbnail": thumbnail(filename, blob)
}
if doc["type"] == FileType.VISUAL:
doc["parser_id"] = ParserType.PICTURE.value
if re.search(r"\.(ppt|pptx|pages)$", filename):
doc["parser_id"] = ParserType.PRESENTATION.value
DocumentService.insert(doc)
except Exception as e:
err.append(file.filename + ": " + str(e))
if err:
return get_json_result(
data=False, retmsg='No file selected!', retcode=RetCode.ARGUMENT_ERROR)
try:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
return get_data_error_result(
retmsg="Can't find this knowledgebase!")
if DocumentService.get_doc_count(kb.tenant_id) >= 128:
return get_data_error_result(
retmsg="Exceed the maximum file number of a free user!")
filename = duplicate_name(
DocumentService.query,
name=file.filename,
kb_id=kb.id)
filetype = filename_type(filename)
if not filetype:
return get_data_error_result(
retmsg="This type of file has not been supported yet!")
location = filename
while MINIO.obj_exist(kb_id, location):
location += "_"
blob = request.files['file'].read()
MINIO.put(kb_id, location, blob)
doc = {
"id": get_uuid(),
"kb_id": kb.id,
"parser_id": kb.parser_id,
"parser_config": kb.parser_config,
"created_by": current_user.id,
"type": filetype,
"name": filename,
"location": location,
"size": len(blob),
"thumbnail": thumbnail(filename, blob)
}
if doc["type"] == FileType.VISUAL:
doc["parser_id"] = ParserType.PICTURE.value
if re.search(r"\.(ppt|pptx|pages)$", filename):
doc["parser_id"] = ParserType.PRESENTATION.value
doc = DocumentService.insert(doc)
return get_json_result(data=doc.to_json())
except Exception as e:
return server_error_response(e)
data=False, retmsg="\n".join(err), retcode=RetCode.SERVER_ERROR)
return get_json_result(data=True)
@manager.route('/create', methods=['POST'])
@ -216,26 +226,37 @@ def change_status():
@validate_request("doc_id")
def rm():
req = request.json
try:
e, doc = DocumentService.get_by_id(req["doc_id"])
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(req["doc_id"])
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
doc_ids = req["doc_id"]
if isinstance(doc_ids, str): doc_ids = [doc_ids]
errors = ""
for doc_id in doc_ids:
try:
e, doc = DocumentService.get_by_id(doc_id)
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
MINIO.rm(doc.kb_id, doc.location)
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
informs = File2DocumentService.get_by_document_id(doc_id)
if not informs:
MINIO.rm(doc.kb_id, doc.location)
else:
File2DocumentService.delete_by_document_id(doc_id)
except Exception as e:
errors += str(e)
if errors: return get_json_result(data=False, retmsg=errors, retcode=RetCode.SERVER_ERROR)
return get_json_result(data=True)
@manager.route('/run', methods=['POST'])
@ -257,6 +278,14 @@ def run():
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=id), idxnm=search.index_name(tenant_id))
if str(req["run"]) == TaskStatus.RUNNING.value:
TaskService.filter_delete([Task.doc_id == id])
e, doc = DocumentService.get_by_id(id)
doc = doc.to_dict()
doc["tenant_id"] = tenant_id
bucket, name = File2DocumentService.get_minio_address(doc_id=doc["id"])
queue_tasks(doc, bucket, name)
return get_json_result(data=True)
except Exception as e:
@ -287,6 +316,11 @@ def rename():
return get_data_error_result(
retmsg="Database error (Document rename)!")
informs = File2DocumentService.get_by_document_id(req["doc_id"])
if informs:
e, file = FileService.get_by_id(informs[0].file_id)
FileService.update_by_id(file.id, {"name": req["name"]})
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
@ -300,7 +334,13 @@ def get(doc_id):
if not e:
return get_data_error_result(retmsg="Document not found!")
response = flask.make_response(MINIO.get(doc.kb_id, doc.location))
informs = File2DocumentService.get_by_document_id(doc_id)
if not informs:
response = flask.make_response(MINIO.get(doc.kb_id, doc.location))
else:
e, file = FileService.get_by_id(informs[0].file_id)
response = flask.make_response(MINIO.get(file.parent_id, doc.location))
ext = re.search(r"\.([^.]+)$", doc.name)
if ext:
if doc.type == FileType.VISUAL.value:
@ -336,7 +376,8 @@ def change_parser():
return get_data_error_result(retmsg="Not supported yet!")
e = DocumentService.update_by_id(doc.id,
{"parser_id": req["parser_id"], "progress": 0, "progress_msg": "", "run": "0"})
{"parser_id": req["parser_id"], "progress": 0, "progress_msg": "",
"run": TaskStatus.UNSTART.value})
if not e:
return get_data_error_result(retmsg="Document not found!")
if "parser_config" in req:

View File

@ -0,0 +1,137 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
#
from elasticsearch_dsl import Q
from api.db.db_models import File2Document
from api.db.services.file2document_service import File2DocumentService
from api.db.services.file_service import FileService
from flask import request
from flask_login import login_required, current_user
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.utils.api_utils import server_error_response, get_data_error_result, validate_request
from api.utils import get_uuid
from api.db import FileType
from api.db.services.document_service import DocumentService
from api.settings import RetCode
from api.utils.api_utils import get_json_result
from rag.nlp import search
from rag.utils.es_conn import ELASTICSEARCH
@manager.route('/convert', methods=['POST'])
@login_required
@validate_request("file_ids", "kb_ids")
def convert():
req = request.json
kb_ids = req["kb_ids"]
file_ids = req["file_ids"]
file2documents = []
try:
for file_id in file_ids:
e, file = FileService.get_by_id(file_id)
file_ids_list = [file_id]
if file.type == FileType.FOLDER.value:
file_ids_list = FileService.get_all_innermost_file_ids(file_id, [])
for id in file_ids_list:
informs = File2DocumentService.get_by_file_id(id)
# delete
for inform in informs:
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
File2DocumentService.delete_by_file_id(id)
# insert
for kb_id in kb_ids:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
return get_data_error_result(
retmsg="Can't find this knowledgebase!")
e, file = FileService.get_by_id(id)
if not e:
return get_data_error_result(
retmsg="Can't find this file!")
doc = DocumentService.insert({
"id": get_uuid(),
"kb_id": kb.id,
"parser_id": kb.parser_id,
"parser_config": kb.parser_config,
"created_by": current_user.id,
"type": file.type,
"name": file.name,
"location": file.location,
"size": file.size
})
file2document = File2DocumentService.insert({
"id": get_uuid(),
"file_id": id,
"document_id": doc.id,
})
file2documents.append(file2document.to_json())
return get_json_result(data=file2documents)
except Exception as e:
return server_error_response(e)
@manager.route('/rm', methods=['POST'])
@login_required
@validate_request("file_ids")
def rm():
req = request.json
file_ids = req["file_ids"]
if not file_ids:
return get_json_result(
data=False, retmsg='Lack of "Files ID"', retcode=RetCode.ARGUMENT_ERROR)
try:
for file_id in file_ids:
informs = File2DocumentService.get_by_file_id(file_id)
if not informs:
return get_data_error_result(retmsg="Inform not found!")
for inform in informs:
if not inform:
return get_data_error_result(retmsg="Inform not found!")
File2DocumentService.delete_by_file_id(file_id)
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)

api/apps/file_app.py (new file, 347 lines)
View File

@ -0,0 +1,347 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
#
import os
import pathlib
import re
import flask
from elasticsearch_dsl import Q
from flask import request
from flask_login import login_required, current_user
from api.db.services.document_service import DocumentService
from api.db.services.file2document_service import File2DocumentService
from api.utils.api_utils import server_error_response, get_data_error_result, validate_request
from api.utils import get_uuid
from api.db import FileType
from api.db.services import duplicate_name
from api.db.services.file_service import FileService
from api.settings import RetCode
from api.utils.api_utils import get_json_result
from api.utils.file_utils import filename_type
from rag.nlp import search
from rag.utils.es_conn import ELASTICSEARCH
from rag.utils.minio_conn import MINIO
@manager.route('/upload', methods=['POST'])
@login_required
# @validate_request("parent_id")
def upload():
pf_id = request.form.get("parent_id")
if not pf_id:
root_folder = FileService.get_root_folder(current_user.id)
pf_id = root_folder.id
if 'file' not in request.files:
return get_json_result(
data=False, retmsg='No file part!', retcode=RetCode.ARGUMENT_ERROR)
file_objs = request.files.getlist('file')
for file_obj in file_objs:
if file_obj.filename == '':
return get_json_result(
data=False, retmsg='No file selected!', retcode=RetCode.ARGUMENT_ERROR)
file_res = []
try:
for file_obj in file_objs:
e, file = FileService.get_by_id(pf_id)
if not e:
return get_data_error_result(
retmsg="Can't find this folder!")
MAX_FILE_NUM_PER_USER = int(os.environ.get('MAX_FILE_NUM_PER_USER', 0))
if MAX_FILE_NUM_PER_USER > 0 and DocumentService.get_doc_count(current_user.id) >= MAX_FILE_NUM_PER_USER:
return get_data_error_result(
retmsg="Exceed the maximum file number of a free user!")
# split file name path
if not file_obj.filename:
e, file = FileService.get_by_id(pf_id)
file_obj_names = [file.name, file_obj.filename]
else:
full_path = '/' + file_obj.filename
file_obj_names = full_path.split('/')
file_len = len(file_obj_names)
# get folder
file_id_list = FileService.get_id_list_by_id(pf_id, file_obj_names, 1, [pf_id])
len_id_list = len(file_id_list)
# create folder
if file_len != len_id_list:
e, file = FileService.get_by_id(file_id_list[len_id_list - 1])
if not e:
return get_data_error_result(retmsg="Folder not found!")
last_folder = FileService.create_folder(file, file_id_list[len_id_list - 1], file_obj_names,
len_id_list)
else:
e, file = FileService.get_by_id(file_id_list[len_id_list - 2])
if not e:
return get_data_error_result(retmsg="Folder not found!")
last_folder = FileService.create_folder(file, file_id_list[len_id_list - 2], file_obj_names,
len_id_list)
# file type
filetype = filename_type(file_obj_names[file_len - 1])
location = file_obj_names[file_len - 1]
while MINIO.obj_exist(last_folder.id, location):
location += "_"
blob = file_obj.read()
filename = duplicate_name(
FileService.query,
name=file_obj_names[file_len - 1],
parent_id=last_folder.id)
file = {
"id": get_uuid(),
"parent_id": last_folder.id,
"tenant_id": current_user.id,
"created_by": current_user.id,
"type": filetype,
"name": filename,
"location": location,
"size": len(blob),
}
file = FileService.insert(file)
MINIO.put(last_folder.id, location, blob)
file_res.append(file.to_json())
return get_json_result(data=file_res)
except Exception as e:
return server_error_response(e)
@manager.route('/create', methods=['POST'])
@login_required
@validate_request("name")
def create():
req = request.json
pf_id = request.json.get("parent_id")
input_file_type = request.json.get("type")
if not pf_id:
root_folder = FileService.get_root_folder(current_user.id)
pf_id = root_folder.id
try:
if not FileService.is_parent_folder_exist(pf_id):
return get_json_result(
data=False, retmsg="Parent Folder Doesn't Exist!", retcode=RetCode.OPERATING_ERROR)
if FileService.query(name=req["name"], parent_id=pf_id):
return get_data_error_result(
retmsg="Duplicated folder name in the same folder.")
if input_file_type == FileType.FOLDER.value:
file_type = FileType.FOLDER.value
else:
file_type = FileType.VIRTUAL.value
file = FileService.insert({
"id": get_uuid(),
"parent_id": pf_id,
"tenant_id": current_user.id,
"created_by": current_user.id,
"name": req["name"],
"location": "",
"size": 0,
"type": file_type
})
return get_json_result(data=file.to_json())
except Exception as e:
return server_error_response(e)
@manager.route('/list', methods=['GET'])
@login_required
def list():
pf_id = request.args.get("parent_id")
keywords = request.args.get("keywords", "")
page_number = int(request.args.get("page", 1))
items_per_page = int(request.args.get("page_size", 15))
orderby = request.args.get("orderby", "create_time")
desc = request.args.get("desc", True)
if not pf_id:
root_folder = FileService.get_root_folder(current_user.id)
pf_id = root_folder.id
try:
e, file = FileService.get_by_id(pf_id)
if not e:
return get_data_error_result(retmsg="Folder not found!")
files, total = FileService.get_by_pf_id(
current_user.id, pf_id, page_number, items_per_page, orderby, desc, keywords)
parent_folder = FileService.get_parent_folder(pf_id)
if not FileService.get_parent_folder(pf_id):
return get_json_result(retmsg="File not found!")
return get_json_result(data={"total": total, "files": files, "parent_folder": parent_folder.to_json()})
except Exception as e:
return server_error_response(e)
@manager.route('/root_folder', methods=['GET'])
@login_required
def get_root_folder():
try:
root_folder = FileService.get_root_folder(current_user.id)
return get_json_result(data={"root_folder": root_folder.to_json()})
except Exception as e:
return server_error_response(e)
@manager.route('/parent_folder', methods=['GET'])
@login_required
def get_parent_folder():
file_id = request.args.get("file_id")
try:
e, file = FileService.get_by_id(file_id)
if not e:
return get_data_error_result(retmsg="Folder not found!")
parent_folder = FileService.get_parent_folder(file_id)
return get_json_result(data={"parent_folder": parent_folder.to_json()})
except Exception as e:
return server_error_response(e)
@manager.route('/all_parent_folder', methods=['GET'])
@login_required
def get_all_parent_folders():
file_id = request.args.get("file_id")
try:
e, file = FileService.get_by_id(file_id)
if not e:
return get_data_error_result(retmsg="Folder not found!")
parent_folders = FileService.get_all_parent_folders(file_id)
parent_folders_res = []
for parent_folder in parent_folders:
parent_folders_res.append(parent_folder.to_json())
return get_json_result(data={"parent_folders": parent_folders_res})
except Exception as e:
return server_error_response(e)
@manager.route('/rm', methods=['POST'])
@login_required
@validate_request("file_ids")
def rm():
req = request.json
file_ids = req["file_ids"]
try:
for file_id in file_ids:
e, file = FileService.get_by_id(file_id)
if not e:
return get_data_error_result(retmsg="File or Folder not found!")
if not file.tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
if file.type == FileType.FOLDER.value:
file_id_list = FileService.get_all_innermost_file_ids(file_id, [])
for inner_file_id in file_id_list:
e, file = FileService.get_by_id(inner_file_id)
if not e:
return get_data_error_result(retmsg="File not found!")
MINIO.rm(file.parent_id, file.location)
FileService.delete_folder_by_pf_id(current_user.id, file_id)
else:
if not FileService.delete(file):
return get_data_error_result(
retmsg="Database error (File removal)!")
# delete file2document
informs = File2DocumentService.get_by_file_id(file_id)
for inform in informs:
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
File2DocumentService.delete_by_file_id(file_id)
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
@manager.route('/rename', methods=['POST'])
@login_required
@validate_request("file_id", "name")
def rename():
req = request.json
try:
e, file = FileService.get_by_id(req["file_id"])
if not e:
return get_data_error_result(retmsg="File not found!")
if pathlib.Path(req["name"].lower()).suffix != pathlib.Path(
file.name.lower()).suffix:
return get_json_result(
data=False,
retmsg="The extension of file can't be changed",
retcode=RetCode.ARGUMENT_ERROR)
if FileService.query(name=req["name"], pf_id=file.parent_id):
return get_data_error_result(
retmsg="Duplicated file name in the same folder.")
if not FileService.update_by_id(
req["file_id"], {"name": req["name"]}):
return get_data_error_result(
retmsg="Database error (File rename)!")
informs = File2DocumentService.get_by_file_id(req["file_id"])
if informs:
if not DocumentService.update_by_id(
informs[0].document_id, {"name": req["name"]}):
return get_data_error_result(
retmsg="Database error (Document rename)!")
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
@manager.route('/get/<file_id>', methods=['GET'])
# @login_required
def get(file_id):
try:
e, file = FileService.get_by_id(file_id)
if not e:
return get_data_error_result(retmsg="Document not found!")
response = flask.make_response(MINIO.get(file.parent_id, file.location))
ext = re.search(r"\.([^.]+)$", file.name)
if ext:
if file.type == FileType.VISUAL.value:
response.headers.set('Content-Type', 'image/%s' % ext.group(1))
else:
response.headers.set(
'Content-Type',
'application/%s' %
ext.group(1))
return response
except Exception as e:
return server_error_response(e)

View File

@ -28,7 +28,7 @@ from api.db.db_models import Knowledgebase
from api.settings import stat_logger, RetCode
from api.utils.api_utils import get_json_result
from rag.nlp import search
from rag.utils import ELASTICSEARCH
from rag.utils.es_conn import ELASTICSEARCH
@manager.route('/create', methods=['post'])
@ -111,7 +111,7 @@ def detail():
@login_required
def list():
page_number = request.args.get("page", 1)
items_per_page = request.args.get("page_size", 15)
items_per_page = request.args.get("page_size", 150)
orderby = request.args.get("orderby", "create_time")
desc = request.args.get("desc", True)
try:

View File

@ -28,7 +28,7 @@ from rag.llm import EmbeddingModel, ChatModel
def factories():
try:
fac = LLMFactoriesService.get_all()
return get_json_result(data=[f.to_dict() for f in fac])
return get_json_result(data=[f.to_dict() for f in fac if f.name not in ["Youdao", "FastEmbed"]])
except Exception as e:
return server_error_response(e)
@ -174,7 +174,7 @@ def list():
llms = [m.to_dict()
for m in llms if m.status == StatusEnum.VALID.value]
for m in llms:
m["available"] = m["fid"] in facts or m["llm_name"].lower() == "flag-embedding"
m["available"] = m["fid"] in facts or m["llm_name"].lower() == "flag-embedding" or m["fid"] in ["Youdao","FastEmbed"]
llm_set = set([m["llm_name"] for m in llms])
for o in objs:

View File

@ -14,6 +14,7 @@
# limitations under the License.
#
import re
from datetime import datetime
from flask import request, session, redirect
from werkzeug.security import generate_password_hash, check_password_hash
@ -22,11 +23,12 @@ from flask_login import login_required, current_user, login_user, logout_user
from api.db.db_models import TenantLLM
from api.db.services.llm_service import TenantLLMService, LLMService
from api.utils.api_utils import server_error_response, validate_request
from api.utils import get_uuid, get_format_time, decrypt, download_img
from api.db import UserTenantRole, LLMType
from api.utils import get_uuid, get_format_time, decrypt, download_img, current_timestamp, datetime_format
from api.db import UserTenantRole, LLMType, FileType
from api.settings import RetCode, GITHUB_OAUTH, CHAT_MDL, EMBEDDING_MDL, ASR_MDL, IMAGE2TEXT_MDL, PARSERS, API_KEY, \
LLM_FACTORY, LLM_BASE_URL
from api.db.services.user_service import UserService, TenantService, UserTenantService
from api.db.services.file_service import FileService
from api.settings import stat_logger
from api.utils.api_utils import get_json_result, cors_reponse
@ -56,6 +58,8 @@ def login():
response_data = user.to_json()
user.access_token = get_uuid()
login_user(user)
user.update_time = current_timestamp(),
user.update_date = datetime_format(datetime.now()),
user.save()
msg = "Welcome back!"
return cors_reponse(data=response_data, auth=user.get_id(), retmsg=msg)
@ -218,6 +222,17 @@ def user_register(user_id, user):
"invited_by": user_id,
"role": UserTenantRole.OWNER
}
file_id = get_uuid()
file = {
"id": file_id,
"parent_id": file_id,
"tenant_id": user_id,
"created_by": user_id,
"name": "/",
"type": FileType.FOLDER.value,
"size": 0,
"location": "",
}
tenant_llm = []
for llm in LLMService.query(fid=LLM_FACTORY):
tenant_llm.append({"tenant_id": user_id,
@ -233,6 +248,7 @@ def user_register(user_id, user):
TenantService.insert(**tenant)
UserTenantService.insert(**usr_tenant)
TenantLLMService.insert_many(tenant_llm)
FileService.insert(file)
return UserService.query(email=user["email"])

View File

@ -45,6 +45,8 @@ class FileType(StrEnum):
VISUAL = 'visual'
AURAL = 'aural'
VIRTUAL = 'virtual'
FOLDER = 'folder'
OTHER = "other"
class LLMType(StrEnum):
@ -62,6 +64,7 @@ class ChatStyle(StrEnum):
class TaskStatus(StrEnum):
UNSTART = "0"
RUNNING = "1"
CANCEL = "2"
DONE = "3"

View File

@ -629,7 +629,7 @@ class Document(DataBaseModel):
max_length=128,
null=False,
default="local",
help_text="where dose this document from")
help_text="where dose this document come from")
type = CharField(max_length=32, null=False, help_text="file extension")
created_by = CharField(
max_length=32,
@ -669,6 +669,61 @@ class Document(DataBaseModel):
db_table = "document"
class File(DataBaseModel):
id = CharField(
max_length=32,
primary_key=True,
)
parent_id = CharField(
max_length=32,
null=False,
help_text="parent folder id",
index=True)
tenant_id = CharField(
max_length=32,
null=False,
help_text="tenant id",
index=True)
created_by = CharField(
max_length=32,
null=False,
help_text="who created it")
name = CharField(
max_length=255,
null=False,
help_text="file name or folder name",
index=True)
location = CharField(
max_length=255,
null=True,
help_text="where dose it store")
size = IntegerField(default=0)
type = CharField(max_length=32, null=False, help_text="file extension")
class Meta:
db_table = "file"
class File2Document(DataBaseModel):
id = CharField(
max_length=32,
primary_key=True,
)
file_id = CharField(
max_length=32,
null=True,
help_text="file id",
index=True)
document_id = CharField(
max_length=32,
null=True,
help_text="document id",
index=True)
class Meta:
db_table = "file2document"
class Task(DataBaseModel):
id = CharField(max_length=32, primary_key=True)
doc_id = CharField(max_length=32, null=False, index=True)
@ -697,7 +752,7 @@ class Dialog(DataBaseModel):
null=True,
default="Chinese",
help_text="English|Chinese")
llm_id = CharField(max_length=32, null=False, help_text="default llm ID")
llm_id = CharField(max_length=128, null=False, help_text="default llm ID")
llm_setting = JSONField(null=False, default={"temperature": 0.1, "top_p": 0.3, "frequency_penalty": 0.7,
"presence_penalty": 0.4, "max_tokens": 215})
prompt_type = CharField(

View File

@ -18,7 +18,7 @@ import time
import uuid
from api.db import LLMType, UserTenantRole
from api.db.db_models import init_database_tables as init_web_db, LLMFactories, LLM
from api.db.db_models import init_database_tables as init_web_db, LLMFactories, LLM, TenantLLM
from api.db.services import UserService
from api.db.services.llm_service import LLMFactoriesService, LLMService, TenantLLMService, LLMBundle
from api.db.services.user_service import TenantService, UserTenantService
@ -114,12 +114,21 @@ factory_infos = [{
"logo": "",
"tags": "TEXT EMBEDDING",
"status": "1",
},
{
}, {
"name": "Xinference",
"logo": "",
"tags": "LLM,TEXT EMBEDDING,SPEECH2TEXT,MODERATION",
"status": "1",
},{
"name": "Youdao",
"logo": "",
"tags": "LLM,TEXT EMBEDDING,SPEECH2TEXT,MODERATION",
"status": "1",
},{
"name": "DeepSeek",
"logo": "",
"tags": "LLM",
"status": "1",
},
# {
# "name": "文心一言",
@ -254,12 +263,6 @@ def init_llm_factory():
"tags": "LLM,CHAT,",
"max_tokens": 7900,
"model_type": LLMType.CHAT.value
}, {
"fid": factory_infos[4]["name"],
"llm_name": "flag-embedding",
"tags": "TEXT EMBEDDING,",
"max_tokens": 128 * 1000,
"model_type": LLMType.EMBEDDING.value
}, {
"fid": factory_infos[4]["name"],
"llm_name": "moonshot-v1-32k",
@ -325,6 +328,29 @@ def init_llm_factory():
"max_tokens": 2147483648,
"model_type": LLMType.EMBEDDING.value
},
# ------------------------ Youdao -----------------------
{
"fid": factory_infos[7]["name"],
"llm_name": "maidalun1020/bce-embedding-base_v1",
"tags": "TEXT EMBEDDING,",
"max_tokens": 512,
"model_type": LLMType.EMBEDDING.value
},
# ------------------------ DeepSeek -----------------------
{
"fid": factory_infos[8]["name"],
"llm_name": "deepseek-chat",
"tags": "LLM,CHAT,",
"max_tokens": 32768,
"model_type": LLMType.CHAT.value
},
{
"fid": factory_infos[8]["name"],
"llm_name": "deepseek-coder",
"tags": "LLM,CHAT,",
"max_tokens": 16385,
"model_type": LLMType.CHAT.value
},
]
for info in factory_infos:
try:
@ -337,9 +363,13 @@ def init_llm_factory():
except Exception as e:
pass
LLMFactoriesService.filter_delete([LLMFactories.name=="Local"])
LLMService.filter_delete([LLM.fid=="Local"])
LLMFactoriesService.filter_delete([LLMFactories.name == "Local"])
LLMService.filter_delete([LLM.fid == "Local"])
LLMService.filter_delete([LLM.fid == "Moonshot", LLM.llm_name == "flag-embedding"])
TenantLLMService.filter_delete([TenantLLM.llm_factory == "Moonshot", TenantLLM.llm_name == "flag-embedding"])
LLMFactoriesService.filter_delete([LLMFactoriesService.model.name == "QAnything"])
LLMService.filter_delete([LLMService.model.fid == "QAnything"])
TenantLLMService.filter_update([TenantLLMService.model.llm_factory == "QAnything"], {"llm_factory": "Youdao"})
"""
drop table llm;
drop table llm_factories;

View File

@ -40,8 +40,8 @@ class API4ConversationService(CommonService):
@classmethod
@DB.connection_context()
def append_message(cls, id, conversation):
cls.model.update_by_id(id, conversation)
return cls.model.update(round=cls.model.round + 1).where(id=id).execute()
cls.update_by_id(id, conversation)
return cls.model.update(round=cls.model.round + 1).where(cls.model.id==id).execute()
@classmethod
@DB.connection_context()

View File

@ -80,8 +80,12 @@ def chat(dialog, messages, **kwargs):
raise LookupError("LLM(%s) not found" % dialog.llm_id)
max_tokens = 1024
else: max_tokens = llm[0].max_tokens
kbs = KnowledgebaseService.get_by_ids(dialog.kb_ids)
embd_nms = list(set([kb.embd_id for kb in kbs]))
assert len(embd_nms) == 1, "Knowledge bases use different embedding models."
questions = [m["content"] for m in messages if m["role"] == "user"]
embd_mdl = LLMBundle(dialog.tenant_id, LLMType.EMBEDDING)
embd_mdl = LLMBundle(dialog.tenant_id, LLMType.EMBEDDING, embd_nms[0])
chat_mdl = LLMBundle(dialog.tenant_id, LLMType.CHAT, dialog.llm_id)
prompt_config = dialog.prompt_config
@ -132,7 +136,7 @@ def chat(dialog, messages, **kwargs):
chat_logger.info("User: {}|Assistant: {}".format(
msg[-1]["content"], answer))
if knowledges and prompt_config.get("quote", True):
if knowledges and (prompt_config.get("quote", True) and kwargs.get("quote", True)):
answer, idx = retrievaler.insert_citations(answer,
[ck["content_ltks"]
for ck in kbinfos["chunks"]],

View File

@ -13,10 +13,18 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from peewee import Expression
import random
from datetime import datetime
from elasticsearch_dsl import Q
from api.settings import stat_logger
from api.utils import current_timestamp, get_format_time
from rag.utils.es_conn import ELASTICSEARCH
from rag.utils.minio_conn import MINIO
from rag.nlp import search
from api.db import FileType, TaskStatus
from api.db.db_models import DB, Knowledgebase, Tenant
from api.db.db_models import DB, Knowledgebase, Tenant, Task
from api.db.db_models import Document
from api.db.services.common_service import CommonService
from api.db.services.knowledgebase_service import KnowledgebaseService
@ -71,7 +79,21 @@ class DocumentService(CommonService):
@classmethod
@DB.connection_context()
def get_newly_uploaded(cls, tm, mod=0, comm=1, items_per_page=64):
def remove_document(cls, doc, tenant_id):
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
cls.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not cls.delete(doc):
raise RuntimeError("Database error (Document removal)!")
MINIO.rm(doc.kb_id, doc.location)
return cls.delete_by_id(doc.id)
@classmethod
@DB.connection_context()
def get_newly_uploaded(cls):
fields = [
cls.model.id,
cls.model.kb_id,
@ -93,11 +115,9 @@ class DocumentService(CommonService):
cls.model.status == StatusEnum.VALID.value,
~(cls.model.type == FileType.VIRTUAL.value),
cls.model.progress == 0,
cls.model.update_time >= tm,
cls.model.run == TaskStatus.RUNNING.value,
(Expression(cls.model.create_time, "%%", comm) == mod))\
.order_by(cls.model.update_time.asc())\
.paginate(1, items_per_page)
cls.model.update_time >= current_timestamp() - 1000 * 600,
cls.model.run == TaskStatus.RUNNING.value)\
.order_by(cls.model.update_time.asc())
return list(docs.dicts())
@classmethod
@ -177,3 +197,55 @@ class DocumentService(CommonService):
on=(Knowledgebase.id == cls.model.kb_id)).where(
Knowledgebase.tenant_id == tenant_id)
return len(docs)
@classmethod
@DB.connection_context()
def begin2parse(cls, docid):
cls.update_by_id(
docid, {"progress": random.random() * 1 / 100.,
"progress_msg": "Task dispatched...",
"process_begin_at": get_format_time()
})
@classmethod
@DB.connection_context()
def update_progress(cls):
docs = cls.get_unfinished_docs()
for d in docs:
try:
tsks = Task.query(doc_id=d["id"], order_by=Task.create_time)
if not tsks:
continue
msg = []
prg = 0
finished = True
bad = 0
status = TaskStatus.RUNNING.value
for t in tsks:
if 0 <= t.progress < 1:
finished = False
prg += t.progress if t.progress >= 0 else 0
msg.append(t.progress_msg)
if t.progress == -1:
bad += 1
prg /= len(tsks)
if finished and bad:
prg = -1
status = TaskStatus.FAIL.value
elif finished:
status = TaskStatus.DONE.value
msg = "\n".join(msg)
info = {
"process_duation": datetime.timestamp(
datetime.now()) -
d["process_begin_at"].timestamp(),
"run": status}
if prg != 0:
info["progress"] = prg
if msg:
info["progress_msg"] = msg
cls.update_by_id(d["id"], info)
except Exception as e:
stat_logger.error("fetch task exception:" + str(e))

View File

@ -0,0 +1,83 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from datetime import datetime
from api.db.db_models import DB
from api.db.db_models import File, Document, File2Document
from api.db.services.common_service import CommonService
from api.db.services.document_service import DocumentService
from api.db.services.file_service import FileService
from api.utils import current_timestamp, datetime_format
class File2DocumentService(CommonService):
model = File2Document
@classmethod
@DB.connection_context()
def get_by_file_id(cls, file_id):
objs = cls.model.select().where(cls.model.file_id == file_id)
return objs
@classmethod
@DB.connection_context()
def get_by_document_id(cls, document_id):
objs = cls.model.select().where(cls.model.document_id == document_id)
return objs
@classmethod
@DB.connection_context()
def insert(cls, obj):
if not cls.save(**obj):
raise RuntimeError("Database error (File)!")
e, obj = cls.get_by_id(obj["id"])
if not e:
raise RuntimeError("Database error (File retrieval)!")
return obj
@classmethod
@DB.connection_context()
def delete_by_file_id(cls, file_id):
return cls.model.delete().where(cls.model.file_id == file_id).execute()
@classmethod
@DB.connection_context()
def delete_by_document_id(cls, doc_id):
return cls.model.delete().where(cls.model.document_id == doc_id).execute()
@classmethod
@DB.connection_context()
def update_by_file_id(cls, file_id, obj):
obj["update_time"] = current_timestamp()
obj["update_date"] = datetime_format(datetime.now())
num = cls.model.update(obj).where(cls.model.id == file_id).execute()
e, obj = cls.get_by_id(file_id)
return obj
@classmethod
@DB.connection_context()
def get_minio_address(cls, doc_id=None, file_id=None):
if doc_id:
ids = File2DocumentService.get_by_document_id(doc_id)
else:
ids = File2DocumentService.get_by_file_id(file_id)
if ids:
e, file = FileService.get_by_id(ids[0].file_id)
return file.parent_id, file.location
else:
assert doc_id, "please specify doc_id"
e, doc = DocumentService.get_by_id(doc_id)
return doc.kb_id, doc.location
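
The new `get_minio_address` helper centralizes the lookup of where a document's raw bytes live: it returns the linked file's parent folder and location when a `File2Document` record exists, and otherwise falls back to the knowledge base bucket. A minimal usage sketch, not part of the diff (`fetch_document_blob` is a hypothetical helper name):

```python
# Minimal sketch, assuming a valid doc_id and the services added in this changeset.
from api.db.services.file2document_service import File2DocumentService
from rag.utils.minio_conn import MINIO


def fetch_document_blob(doc_id):
    # Resolves to (file.parent_id, file.location) when a File record is linked,
    # otherwise falls back to (doc.kb_id, doc.location).
    bucket, name = File2DocumentService.get_minio_address(doc_id=doc_id)
    return MINIO.get(bucket, name)
```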

View File

@ -0,0 +1,243 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from flask_login import current_user
from peewee import fn
from api.db import FileType
from api.db.db_models import DB, File2Document, Knowledgebase
from api.db.db_models import File, Document
from api.db.services.common_service import CommonService
from api.utils import get_uuid
class FileService(CommonService):
model = File
@classmethod
@DB.connection_context()
def get_by_pf_id(cls, tenant_id, pf_id, page_number, items_per_page,
orderby, desc, keywords):
if keywords:
files = cls.model.select().where(
(cls.model.tenant_id == tenant_id)
& (cls.model.parent_id == pf_id), (fn.LOWER(cls.model.name).like(f"%%{keywords.lower()}%%")))
else:
files = cls.model.select().where((cls.model.tenant_id == tenant_id)
& (cls.model.parent_id == pf_id))
count = files.count()
if desc:
files = files.order_by(cls.model.getter_by(orderby).desc())
else:
files = files.order_by(cls.model.getter_by(orderby).asc())
files = files.paginate(page_number, items_per_page)
res_files = list(files.dicts())
for file in res_files:
if file["type"] == FileType.FOLDER.value:
file["size"] = cls.get_folder_size(file["id"])
file['kbs_info'] = []
continue
kbs_info = cls.get_kb_id_by_file_id(file['id'])
file['kbs_info'] = kbs_info
return res_files, count
@classmethod
@DB.connection_context()
def get_kb_id_by_file_id(cls, file_id):
kbs = (cls.model.select(*[Knowledgebase.id, Knowledgebase.name])
.join(File2Document, on=(File2Document.file_id == file_id))
.join(Document, on=(File2Document.document_id == Document.id))
.join(Knowledgebase, on=(Knowledgebase.id == Document.kb_id))
.where(cls.model.id == file_id))
if not kbs: return []
kbs_info_list = []
for kb in list(kbs.dicts()):
kbs_info_list.append({"kb_id": kb['id'], "kb_name": kb['name']})
return kbs_info_list
@classmethod
@DB.connection_context()
def get_by_pf_id_name(cls, id, name):
file = cls.model.select().where((cls.model.parent_id == id) & (cls.model.name == name))
if file.count():
e, file = cls.get_by_id(file[0].id)
if not e:
raise RuntimeError("Database error (File retrieval)!")
return file
return None
@classmethod
@DB.connection_context()
def get_id_list_by_id(cls, id, name, count, res):
if count < len(name):
file = cls.get_by_pf_id_name(id, name[count])
if file:
res.append(file.id)
return cls.get_id_list_by_id(file.id, name, count + 1, res)
else:
return res
else:
return res
@classmethod
@DB.connection_context()
def get_all_innermost_file_ids(cls, folder_id, result_ids):
subfolders = cls.model.select().where(cls.model.parent_id == folder_id)
if subfolders.exists():
for subfolder in subfolders:
cls.get_all_innermost_file_ids(subfolder.id, result_ids)
else:
result_ids.append(folder_id)
return result_ids
@classmethod
@DB.connection_context()
def create_folder(cls, file, parent_id, name, count):
if count > len(name) - 2:
return file
else:
file = cls.insert({
"id": get_uuid(),
"parent_id": parent_id,
"tenant_id": current_user.id,
"created_by": current_user.id,
"name": name[count],
"location": "",
"size": 0,
"type": FileType.FOLDER.value
})
return cls.create_folder(file, file.id, name, count + 1)
@classmethod
@DB.connection_context()
def is_parent_folder_exist(cls, parent_id):
parent_files = cls.model.select().where(cls.model.id == parent_id)
if parent_files.count():
return True
cls.delete_folder_by_pf_id(parent_id)
return False
@classmethod
@DB.connection_context()
def get_root_folder(cls, tenant_id):
file = cls.model.select().where((cls.model.tenant_id == tenant_id)
& (cls.model.parent_id == cls.model.id))
if not file:
file_id = get_uuid()
file = {
"id": file_id,
"parent_id": file_id,
"tenant_id": tenant_id,
"created_by": tenant_id,
"name": "/",
"type": FileType.FOLDER.value,
"size": 0,
"location": "",
}
cls.save(**file)
else:
file_id = file[0].id
e, file = cls.get_by_id(file_id)
if not e:
raise RuntimeError("Database error (File retrieval)!")
return file
@classmethod
@DB.connection_context()
def get_parent_folder(cls, file_id):
file = cls.model.select().where(cls.model.id == file_id)
if file.count():
e, file = cls.get_by_id(file[0].parent_id)
if not e:
raise RuntimeError("Database error (File retrieval)!")
else:
raise RuntimeError("Database error (File doesn't exist)!")
return file
@classmethod
@DB.connection_context()
def get_all_parent_folders(cls, start_id):
parent_folders = []
current_id = start_id
while current_id:
e, file = cls.get_by_id(current_id)
if file.parent_id != file.id and e:
parent_folders.append(file)
current_id = file.parent_id
else:
parent_folders.append(file)
break
return parent_folders
@classmethod
@DB.connection_context()
def insert(cls, file):
if not cls.save(**file):
raise RuntimeError("Database error (File)!")
e, file = cls.get_by_id(file["id"])
if not e:
raise RuntimeError("Database error (File retrieval)!")
return file
@classmethod
@DB.connection_context()
def delete(cls, file):
return cls.delete_by_id(file.id)
@classmethod
@DB.connection_context()
def delete_by_pf_id(cls, folder_id):
return cls.model.delete().where(cls.model.parent_id == folder_id).execute()
@classmethod
@DB.connection_context()
def delete_folder_by_pf_id(cls, user_id, folder_id):
try:
files = cls.model.select().where((cls.model.tenant_id == user_id)
& (cls.model.parent_id == folder_id))
for file in files:
cls.delete_folder_by_pf_id(user_id, file.id)
return cls.model.delete().where((cls.model.tenant_id == user_id)
& (cls.model.id == folder_id)).execute(),
except Exception as e:
print(e)
raise RuntimeError("Database error (File retrieval)!")
@classmethod
@DB.connection_context()
def get_file_count(cls, tenant_id):
files = cls.model.select(cls.model.id).where(cls.model.tenant_id == tenant_id)
return len(files)
@classmethod
@DB.connection_context()
def get_folder_size(cls, folder_id):
size = 0
def dfs(parent_id):
nonlocal size
for f in cls.model.select(*[cls.model.id, cls.model.size, cls.model.type]).where(
cls.model.parent_id == parent_id, cls.model.id != parent_id):
size += f.size
if f.type == FileType.FOLDER.value:
dfs(f.id)
dfs(folder_id)
return size

View File

@ -27,7 +27,8 @@ class KnowledgebaseService(CommonService):
page_number, items_per_page, orderby, desc):
kbs = cls.model.select().where(
((cls.model.tenant_id.in_(joined_tenant_ids) & (cls.model.permission ==
TenantPermission.TEAM.value)) | (cls.model.tenant_id == user_id))
TenantPermission.TEAM.value)) | (
cls.model.tenant_id == user_id))
& (cls.model.status == StatusEnum.VALID.value)
)
if desc:
@ -56,7 +57,8 @@ class KnowledgebaseService(CommonService):
cls.model.chunk_num,
cls.model.parser_id,
cls.model.parser_config]
kbs = cls.model.select(*fields).join(Tenant, on=((Tenant.id == cls.model.tenant_id) & (Tenant.status == StatusEnum.VALID.value))).where(
kbs = cls.model.select(*fields).join(Tenant, on=(
(Tenant.id == cls.model.tenant_id) & (Tenant.status == StatusEnum.VALID.value))).where(
(cls.model.id == kb_id),
(cls.model.status == StatusEnum.VALID.value)
)
@ -86,6 +88,7 @@ class KnowledgebaseService(CommonService):
old[k] = list(set(old[k] + v))
else:
old[k] = v
dfs_update(m.parser_config, config)
cls.update_by_id(id, {"parser_config": m.parser_config})
@ -97,3 +100,15 @@ class KnowledgebaseService(CommonService):
if k.parser_config and "field_map" in k.parser_config:
conf.update(k.parser_config["field_map"])
return conf
@classmethod
@DB.connection_context()
def get_by_name(cls, kb_name, tenant_id):
kb = cls.model.select().where(
(cls.model.name == kb_name)
& (cls.model.tenant_id == tenant_id)
& (cls.model.status == StatusEnum.VALID.value)
)
if kb:
return True, kb[0]
return False, None

View File

@ -66,7 +66,7 @@ class TenantLLMService(CommonService):
raise LookupError("Tenant not found")
if llm_type == LLMType.EMBEDDING.value:
mdlnm = tenant.embd_id
mdlnm = tenant.embd_id if not llm_name else llm_name
elif llm_type == LLMType.SPEECH2TEXT.value:
mdlnm = tenant.asr_id
elif llm_type == LLMType.IMAGE2TEXT.value:
@ -77,9 +77,19 @@ class TenantLLMService(CommonService):
assert False, "LLM type error"
model_config = cls.get_api_key(tenant_id, mdlnm)
if model_config: model_config = model_config.to_dict()
if not model_config:
raise LookupError("Model({}) not authorized".format(mdlnm))
model_config = model_config.to_dict()
if llm_type == LLMType.EMBEDDING.value:
llm = LLMService.query(llm_name=llm_name)
if llm and llm[0].fid in ["Youdao", "FastEmbed"]:
model_config = {"llm_factory": llm[0].fid, "api_key":"", "llm_name": llm_name, "api_base": ""}
if not model_config:
if llm_name == "flag-embedding":
model_config = {"llm_factory": "Tongyi-Qianwen", "api_key": "",
"llm_name": llm_name, "api_base": ""}
else:
raise LookupError("Model({}) not authorized".format(mdlnm))
if llm_type == LLMType.EMBEDDING.value:
if model_config["llm_factory"] not in EmbeddingModel:
return
@ -118,9 +128,11 @@ class TenantLLMService(CommonService):
else:
assert False, "LLM type error"
num = cls.model.update(used_tokens=cls.model.used_tokens + used_tokens)\
.where(cls.model.tenant_id == tenant_id, cls.model.llm_name == mdlnm)\
.execute()
num = 0
for u in cls.query(tenant_id = tenant_id, llm_name=mdlnm):
num += cls.model.update(used_tokens = u.used_tokens + used_tokens)\
.where(cls.model.tenant_id == tenant_id, cls.model.llm_name == mdlnm)\
.execute()
return num

View File

@ -13,12 +13,21 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from peewee import Expression
from api.db.db_models import DB
import random
from api.db.db_utils import bulk_insert_into_db
from deepdoc.parser import PdfParser
from peewee import JOIN
from api.db.db_models import DB, File2Document, File
from api.db import StatusEnum, FileType, TaskStatus
from api.db.db_models import Task, Document, Knowledgebase, Tenant
from api.db.services.common_service import CommonService
from api.db.services.document_service import DocumentService
from api.utils import current_timestamp, get_uuid
from deepdoc.parser.excel_parser import RAGFlowExcelParser
from rag.settings import SVR_QUEUE_NAME
from rag.utils.minio_conn import MINIO
from rag.utils.redis_conn import REDIS_CONN
class TaskService(CommonService):
@ -26,7 +35,7 @@ class TaskService(CommonService):
@classmethod
@DB.connection_context()
def get_tasks(cls, tm, mod=0, comm=1, items_per_page=64):
def get_tasks(cls, task_id):
fields = [
cls.model.id,
cls.model.doc_id,
@ -41,24 +50,42 @@ class TaskService(CommonService):
Document.size,
Knowledgebase.tenant_id,
Knowledgebase.language,
Tenant.embd_id,
Knowledgebase.embd_id,
Tenant.img2txt_id,
Tenant.asr_id,
cls.model.update_time]
docs = cls.model.select(*fields) \
.join(Document, on=(cls.model.doc_id == Document.id)) \
.join(Knowledgebase, on=(Document.kb_id == Knowledgebase.id)) \
.join(Tenant, on=(Knowledgebase.tenant_id == Tenant.id))\
.where(
Document.status == StatusEnum.VALID.value,
Document.run == TaskStatus.RUNNING.value,
~(Document.type == FileType.VIRTUAL.value),
cls.model.progress == 0,
cls.model.update_time >= tm,
(Expression(cls.model.create_time, "%%", comm) == mod))\
.order_by(cls.model.update_time.asc())\
.paginate(1, items_per_page)
return list(docs.dicts())
.join(Tenant, on=(Knowledgebase.tenant_id == Tenant.id)) \
.where(cls.model.id == task_id)
docs = list(docs.dicts())
if not docs: return []
cls.model.update(progress_msg=cls.model.progress_msg + "\n" + "Task has been received.",
progress=random.random() / 10.).where(
cls.model.id == docs[0]["id"]).execute()
return docs
@classmethod
@DB.connection_context()
def get_ongoing_doc_name(cls):
with DB.lock("get_task", -1):
docs = cls.model.select(*[Document.id, Document.kb_id, Document.location, File.parent_id]) \
.join(Document, on=(cls.model.doc_id == Document.id)) \
.join(File2Document, on=(File2Document.document_id == Document.id), join_type=JOIN.LEFT_OUTER) \
.join(File, on=(File2Document.file_id == File.id), join_type=JOIN.LEFT_OUTER) \
.where(
Document.status == StatusEnum.VALID.value,
Document.run == TaskStatus.RUNNING.value,
~(Document.type == FileType.VIRTUAL.value),
cls.model.progress < 1,
cls.model.create_time >= current_timestamp() - 1000 * 600
)
docs = list(docs.dicts())
if not docs: return []
return list(set([(d["parent_id"] if d["parent_id"] else d["kb_id"], d["location"]) for d in docs]))
@classmethod
@DB.connection_context()
@ -74,9 +101,62 @@ class TaskService(CommonService):
@classmethod
@DB.connection_context()
def update_progress(cls, id, info):
if info["progress_msg"]:
cls.model.update(progress_msg=cls.model.progress_msg + "\n" + info["progress_msg"]).where(
cls.model.id == id).execute()
if "progress" in info:
cls.model.update(progress=info["progress"]).where(
cls.model.id == id).execute()
with DB.lock("update_progress", -1):
if info["progress_msg"]:
cls.model.update(progress_msg=cls.model.progress_msg + "\n" + info["progress_msg"]).where(
cls.model.id == id).execute()
if "progress" in info:
cls.model.update(progress=info["progress"]).where(
cls.model.id == id).execute()
def queue_tasks(doc, bucket, name):
def new_task():
nonlocal doc
return {
"id": get_uuid(),
"doc_id": doc["id"]
}
tsks = []
if doc["type"] == FileType.PDF.value:
file_bin = MINIO.get(bucket, name)
do_layout = doc["parser_config"].get("layout_recognize", True)
pages = PdfParser.total_page_number(doc["name"], file_bin)
page_size = doc["parser_config"].get("task_page_size", 12)
if doc["parser_id"] == "paper":
page_size = doc["parser_config"].get("task_page_size", 22)
if doc["parser_id"] == "one":
page_size = 1000000000
if not do_layout:
page_size = 1000000000
page_ranges = doc["parser_config"].get("pages")
if not page_ranges:
page_ranges = [(1, 100000)]
for s, e in page_ranges:
s -= 1
s = max(0, s)
e = min(e - 1, pages)
for p in range(s, e, page_size):
task = new_task()
task["from_page"] = p
task["to_page"] = min(p + page_size, e)
tsks.append(task)
elif doc["parser_id"] == "table":
file_bin = MINIO.get(bucket, name)
rn = RAGFlowExcelParser.row_number(
doc["name"], file_bin)
for i in range(0, rn, 3000):
task = new_task()
task["from_page"] = i
task["to_page"] = min(i + 3000, rn)
tsks.append(task)
else:
tsks.append(new_task())
bulk_insert_into_db(Task, tsks, True)
DocumentService.begin2parse(doc["id"])
for t in tsks:
REDIS_CONN.queue_product(SVR_QUEUE_NAME, message=t)
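
Together with the changes in document_app.py, the parsing flow is now: the /document/run handler resolves the document's MinIO address, then `queue_tasks` splits the work into page-range (or row-range) tasks and pushes each one onto the Redis stream for the executor. A rough sketch of that call sequence, not part of the diff (`requeue_parsing` is a hypothetical wrapper):

```python
# Rough sketch, assuming doc_id and tenant_id are already validated by the caller.
from api.db.services.document_service import DocumentService
from api.db.services.file2document_service import File2DocumentService
from api.db.services.task_service import queue_tasks


def requeue_parsing(doc_id, tenant_id):
    e, doc = DocumentService.get_by_id(doc_id)
    doc = doc.to_dict()
    doc["tenant_id"] = tenant_id
    bucket, name = File2DocumentService.get_minio_address(doc_id=doc["id"])
    # queue_tasks marks the document as dispatched and pushes one message
    # per task onto SVR_QUEUE_NAME via REDIS_CONN.queue_product.
    queue_tasks(doc, bucket, name)
```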

View File

@ -18,10 +18,14 @@ import logging
import os
import signal
import sys
import time
import traceback
from concurrent.futures import ThreadPoolExecutor
from werkzeug.serving import run_simple
from api.apps import app
from api.db.runtime_config import RuntimeConfig
from api.db.services.document_service import DocumentService
from api.settings import (
HOST, HTTP_PORT, access_logger, database_logger, stat_logger,
)
@ -31,6 +35,16 @@ from api.db.db_models import init_database_tables as init_web_db
from api.db.init_data import init_web_data
from api.versions import get_versions
def update_progress():
while True:
time.sleep(1)
try:
DocumentService.update_progress()
except Exception as e:
stat_logger.error("update_progress exception:" + str(e))
if __name__ == '__main__':
print("""
____ ______ __
@ -71,6 +85,9 @@ if __name__ == '__main__':
peewee_logger.addHandler(database_logger.handlers[0])
peewee_logger.setLevel(database_logger.level)
thr = ThreadPoolExecutor(max_workers=1)
thr.submit(update_progress)
# start http server
try:
stat_logger.info("RAG Flow http server start...")

View File

@ -32,7 +32,7 @@ access_logger = getLogger("access")
database_logger = getLogger("database")
chat_logger = getLogger("chat")
from rag.utils import ELASTICSEARCH
from rag.utils.es_conn import ELASTICSEARCH
from rag.nlp import search
from api.utils import get_base_config, decrypt_database_config

View File

@ -19,7 +19,7 @@ import os
import re
from io import BytesIO
import fitz
import pdfplumber
from PIL import Image
from cachetools import LRUCache, cached
from ruamel.yaml import YAML
@ -66,6 +66,15 @@ def get_rag_python_directory(*args):
return get_rag_directory("python", *args)
def get_home_cache_dir():
dir = os.path.join(os.path.expanduser('~'), ".ragflow")
try:
os.mkdir(dir)
except OSError as error:
pass
return dir
@cached(cache=LRUCache(maxsize=10))
def load_json_conf(conf_path):
if os.path.isabs(conf_path):
@ -147,7 +156,7 @@ def filename_type(filename):
return FileType.PDF.value
if re.match(
r".*\.(docx|doc|ppt|pptx|yml|xml|htm|json|csv|txt|ini|xls|xlsx|wps|rtf|hlp|pages|numbers|key|md)$", filename):
r".*\.(doc|docx|ppt|pptx|yml|xml|htm|json|csv|txt|ini|xls|xlsx|wps|rtf|hlp|pages|numbers|key|md)$", filename):
return FileType.DOC.value
if re.match(
@ -155,17 +164,17 @@ def filename_type(filename):
return FileType.AURAL.value
if re.match(r".*\.(jpg|jpeg|png|tif|gif|pcx|tga|exif|fpx|svg|psd|cdr|pcd|dxf|ufo|eps|ai|raw|WMF|webp|avif|apng|icon|ico|mpg|mpeg|avi|rm|rmvb|mov|wmv|asf|dat|asx|wvx|mpe|mpa|mp4)$", filename):
return FileType.VISUAL
return FileType.VISUAL.value
return FileType.OTHER.value
def thumbnail(filename, blob):
filename = filename.lower()
if re.match(r".*\.pdf$", filename):
pdf = fitz.open(stream=blob, filetype="pdf")
pix = pdf[0].get_pixmap(matrix=fitz.Matrix(0.03, 0.03))
pdf = pdfplumber.open(BytesIO(blob))
buffered = BytesIO()
Image.frombytes("RGB", [pix.width, pix.height],
pix.samples).save(buffered, format="png")
pdf.pages[0].to_image().annotated.save(buffered, format="png")
return "data:image/png;base64," + \
base64.b64encode(buffered.getvalue()).decode("utf-8")

View File

@ -1,7 +1,7 @@
{
"settings": {
"index": {
"number_of_shards": 4,
"number_of_shards": 2,
"number_of_replicas": 0,
"refresh_interval" : "1000ms"
},

View File

@ -15,9 +15,14 @@ minio:
host: 'minio:9000'
es:
hosts: 'http://es01:9200'
redis:
db: 1
password: 'infini_rag_flow'
host: 'redis:6379'
user_default_llm:
factory: 'Tongyi-Qianwen'
api_key: 'sk-xxxxxxxxxxxxx'
base_url: ''
oauth:
github:
client_id: xxxxxxxxxxxxxxxxxxxxxxxxx

View File

@ -1 +1,116 @@
[English](./README.md) | Simplified Chinese
[English](./README.md) | Simplified Chinese
# *Deep*Doc
- [*Deep*Doc](#deepdoc)
- [1. Introduction](#1-introduction)
- [2. Vision processing](#2-vision-processing)
- [3. Parsers](#3-parsers)
- [Resume](#resume)
<a name="1"></a>
## 1. Introduction
With a vast number of documents coming from various domains, in different formats and with different retrieval requirements, accurate analysis becomes a highly challenging task. *Deep*Doc was born for this purpose. So far, *Deep*Doc has two components: vision processing and parsers. If you are interested in our OCR, layout recognition, and TSR results, you can run the test programs below.
```bash
python deepdoc/vision/t_ocr.py -h
usage: t_ocr.py [-h] --inputs INPUTS [--output_dir OUTPUT_DIR]
options:
-h, --help show this help message and exit
--inputs INPUTS Directory where to store images or PDFs, or a file path to a single image or PDF
--output_dir OUTPUT_DIR
Directory where to store the output images. Default: './ocr_outputs'
```
```bash
python deepdoc/vision/t_recognizer.py -h
usage: t_recognizer.py [-h] --inputs INPUTS [--output_dir OUTPUT_DIR] [--threshold THRESHOLD] [--mode {layout,tsr}]
options:
-h, --help show this help message and exit
--inputs INPUTS Directory where to store images or PDFs, or a file path to a single image or PDF
--output_dir OUTPUT_DIR
Directory where to store the output images. Default: './layouts_outputs'
--threshold THRESHOLD
A threshold to filter out detections. Default: 0.5
--mode {layout,tsr} Task mode: layout recognition or table structure recognition
```
Our models are served from HuggingFace. If you have trouble downloading HuggingFace models, this might help:
```bash
export HF_ENDPOINT=https://hf-mirror.com
```
<a name="2"></a>
## 2. Vision processing
As humans, we use visual information to solve problems.
- **OCR (Optical Character Recognition)**. Since many documents are presented as images, or can at least be converted into images, OCR is a very important, fundamental, and even universal solution for text extraction.
```bash
python deepdoc/vision/t_ocr.py --inputs=path_to_images_or_pdfs --output_dir=path_to_store_result
```
The inputs can be a directory of images or PDFs, or a single image or PDF file. You can look into the folder `path_to_store_result`, where there are images demonstrating the positions of the results, together with txt files containing the OCR text.
<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://github.com/infiniflow/ragflow/assets/12318111/f25bee3d-aaf7-4102-baf5-d5208361d110" width="900"/>
</div>
- Layout recognition. Documents from different domains may have different layouts: newspapers, magazines, books, and resumes, for example, all differ in layout. Only when a machine has an accurate layout analysis can it decide whether these text parts are continuous or not, whether a part needs Table Structure Recognition (TSR) to be processed, or whether a component is a figure described by a nearby caption. We have 10 basic layout components that cover most cases:
- Text
- Title
- Figure
- Figure caption
- Table
- Table caption
- Header
- Footer
- Reference
- Equation
Try the following command to see the layout detection results.
```bash
python deepdoc/vision/t_recognizer.py --inputs=path_to_images_or_pdfs --threshold=0.2 --mode=layout --output_dir=path_to_store_result
```
The inputs can be a directory of images or PDFs, or a single image or PDF file. You can look into the folder `path_to_store_result`, where there are images showing the detection results, as follows:
<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://github.com/infiniflow/ragflow/assets/12318111/07e0f625-9b28-43d0-9fbb-5bf586cd286f" width="1000"/>
</div>
- **TSR (Table Structure Recognition)**. A data table is a commonly used structure for presenting data, including numbers and text. The structure of a table can be very complex, such as hierarchical headers, spanning cells, and projected row headers. Besides TSR, we also reassemble the content into sentences that an LLM can understand well. The TSR task has five labels:
- Column
- Row
- Column header
- Row header
- Spanning (merged) cell
Try the following command to see the table structure recognition results.
```bash
python deepdoc/vision/t_recognizer.py --inputs=path_to_images_or_pdfs --threshold=0.2 --mode=tsr --output_dir=path_to_store_result
```
The inputs can be a directory of images or PDFs, or a single image or PDF file. You can look into the folder `path_to_store_result`, which contains images and HTML pages presenting the following detection results:
<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://github.com/infiniflow/ragflow/assets/12318111/cb24e81b-f2ba-49f3-ac09-883d75606f4c" width="1000"/>
</div>
<a name="3"></a>
## 3. Parsers
There is a corresponding parser for each of the four document formats: PDF, DOCX, EXCEL, and PPT. The most complex one is the PDF parser, since PDF is so flexible. The output of the PDF parser includes (a short usage sketch follows the list below):
- Text chunks with their own positions in the PDF (page number and rectangular position).
- Tables with cropped images from the PDF, together with contents that have already been translated into natural-language sentences.
- Figures with captions and the text inside the figures.
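For orientation only, here is a minimal, hypothetical sketch of driving these parsers from Python. The class names (`RAGFlowPdfParser` exported as `PdfParser`, `RAGFlowExcelParser` exported as `ExcelParser`) and the `total_page_number`/`html` signatures are taken from the diffs elsewhere in this changeset; the exact call pattern in your checkout may differ.
```python
# Minimal, hypothetical sketch based on the class names in this changeset.
# Treat the signatures as assumptions, not a reference.
from deepdoc.parser import PdfParser, ExcelParser

with open("sample.pdf", "rb") as f:   # any local PDF file
    pdf_bytes = f.read()

# total_page_number(filename, binary) is called the same way in queue_tasks()
pages = PdfParser.total_page_number("sample.pdf", pdf_bytes)
print("sample.pdf has", pages, "pages")

# RAGFlowExcelParser.html() accepts a file path (or a binary stream)
excel_parser = ExcelParser()
print(excel_parser.html("sample.xlsx"))
```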
### Resume
A resume is a very complicated kind of document. A resume, which is composed of unstructured text with various layouts, can be resolved into structured data made up of nearly a hundred fields. We have not opened up the parser yet, as we open the processing method only after the parsing procedure.

View File

@ -1,6 +1,6 @@
from .pdf_parser import HuParser as PdfParser, PlainParser
from .docx_parser import HuDocxParser as DocxParser
from .excel_parser import HuExcelParser as ExcelParser
from .ppt_parser import HuPptParser as PptParser
from .pdf_parser import RAGFlowPdfParser as PdfParser, PlainParser
from .docx_parser import RAGFlowDocxParser as DocxParser
from .excel_parser import RAGFlowExcelParser as ExcelParser
from .ppt_parser import RAGFlowPptParser as PptParser

View File

@ -3,11 +3,11 @@ from docx import Document
import re
import pandas as pd
from collections import Counter
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from io import BytesIO
class HuDocxParser:
class RAGFlowDocxParser:
def __extract_table_content(self, tb):
df = []
@ -35,14 +35,14 @@ class HuDocxParser:
for p, n in patt:
if re.search(p, b):
return n
tks = [t for t in huqie.qie(b).split(" ") if len(t) > 1]
tks = [t for t in rag_tokenizer.tokenize(b).split(" ") if len(t) > 1]
if len(tks) > 3:
if len(tks) < 12:
return "Tx"
else:
return "Lx"
if len(tks) == 1 and huqie.tag(tks[0]) == "nr":
if len(tks) == 1 and rag_tokenizer.tag(tks[0]) == "nr":
return "Nr"
return "Ot"

View File

@ -3,8 +3,10 @@ from openpyxl import load_workbook
import sys
from io import BytesIO
from rag.nlp import find_codec
class HuExcelParser:
class RAGFlowExcelParser:
def html(self, fnm):
if isinstance(fnm, str):
wb = load_workbook(fnm)
@ -66,10 +68,11 @@ class HuExcelParser:
return total
if fnm.split(".")[-1].lower() in ["csv", "txt"]:
txt = binary.decode("utf-8")
encoding = find_codec(binary)
txt = binary.decode(encoding, errors="ignore")
return len(txt.split("\n"))
if __name__ == "__main__":
psr = HuExcelParser()
psr = RAGFlowExcelParser()
psr(sys.argv[1])

View File

@ -2,7 +2,6 @@
import os
import random
import fitz
import xgboost as xgb
from io import BytesIO
import torch
@ -11,19 +10,19 @@ import pdfplumber
import logging
from PIL import Image, ImageDraw
import numpy as np
from timeit import default_timer as timer
from PyPDF2 import PdfReader as pdf2_read
from api.utils.file_utils import get_project_base_directory
from deepdoc.vision import OCR, Recognizer, LayoutRecognizer, TableStructureRecognizer
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from copy import deepcopy
from huggingface_hub import snapshot_download
logging.getLogger("pdfminer").setLevel(logging.WARNING)
class HuParser:
class RAGFlowPdfParser:
def __init__(self):
self.ocr = OCR()
if hasattr(self, "model_speciess"):
@ -37,17 +36,18 @@ class HuParser:
self.updown_cnt_mdl.set_param({"device": "cuda"})
try:
model_dir = os.path.join(
get_project_base_directory(),
"rag/res/deepdoc")
get_project_base_directory(),
"rag/res/deepdoc")
self.updown_cnt_mdl.load_model(os.path.join(
model_dir, "updown_concat_xgb.model"))
except Exception as e:
model_dir = snapshot_download(
repo_id="InfiniFlow/text_concat_xgb_v1.0")
repo_id="InfiniFlow/text_concat_xgb_v1.0",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False)
self.updown_cnt_mdl.load_model(os.path.join(
model_dir, "updown_concat_xgb.model"))
self.page_from = 0
"""
If you have trouble downloading HuggingFace models, -_^ this might help!!
@ -62,7 +62,7 @@ class HuParser:
"""
def __char_width(self, c):
return (c["x1"] - c["x0"]) // len(c["text"])
return (c["x1"] - c["x0"]) // max(len(c["text"]), 1)
def __height(self, c):
return c["bottom"] - c["top"]
@ -74,7 +74,7 @@ class HuParser:
def _y_dis(
self, a, b):
return (
b["top"] + b["bottom"] - a["top"] - a["bottom"]) / 2
b["top"] + b["bottom"] - a["top"] - a["bottom"]) / 2
def _match_proj(self, b):
proj_patt = [
@ -94,13 +94,13 @@ class HuParser:
h = max(self.__height(up), self.__height(down))
y_dis = self._y_dis(up, down)
LEN = 6
tks_down = huqie.qie(down["text"][:LEN]).split(" ")
tks_up = huqie.qie(up["text"][-LEN:]).split(" ")
tks_down = rag_tokenizer.tokenize(down["text"][:LEN]).split(" ")
tks_up = rag_tokenizer.tokenize(up["text"][-LEN:]).split(" ")
tks_all = up["text"][-LEN:].strip() \
+ (" " if re.match(r"[a-zA-Z0-9]+",
up["text"][-1] + down["text"][0]) else "") \
+ down["text"][:LEN].strip()
tks_all = huqie.qie(tks_all).split(" ")
+ (" " if re.match(r"[a-zA-Z0-9]+",
up["text"][-1] + down["text"][0]) else "") \
+ down["text"][:LEN].strip()
tks_all = rag_tokenizer.tokenize(tks_all).split(" ")
fea = [
up.get("R", -1) == down.get("R", -1),
y_dis / h,
@ -121,7 +121,7 @@ class HuParser:
True if re.search(r"[,][^。.]+$", up["text"]) else False,
True if re.search(r"[,][^。.]+$", up["text"]) else False,
True if re.search(r"[\(][^\)]+$", up["text"])
and re.search(r"[\)]", down["text"]) else False,
and re.search(r"[\)]", down["text"]) else False,
self._match_proj(down),
True if re.match(r"[A-Z]", down["text"]) else False,
True if re.match(r"[A-Z]", up["text"][-1]) else False,
@ -141,8 +141,8 @@ class HuParser:
tks_down[-1] == tks_up[-1],
max(down["in_row"], up["in_row"]),
abs(down["in_row"] - up["in_row"]),
len(tks_down) == 1 and huqie.tag(tks_down[0]).find("n") >= 0,
len(tks_up) == 1 and huqie.tag(tks_up[0]).find("n") >= 0
len(tks_down) == 1 and rag_tokenizer.tag(tks_down[0]).find("n") >= 0,
len(tks_up) == 1 and rag_tokenizer.tag(tks_up[0]).find("n") >= 0
]
return fea
@ -183,7 +183,7 @@ class HuParser:
continue
for tb in tbls: # for table
left, top, right, bott = tb["x0"] - MARGIN, tb["top"] - MARGIN, \
tb["x1"] + MARGIN, tb["bottom"] + MARGIN
tb["x1"] + MARGIN, tb["bottom"] + MARGIN
left *= ZM
top *= ZM
right *= ZM
@ -295,7 +295,7 @@ class HuParser:
for b in bxs:
if not b["text"]:
left, right, top, bott = b["x0"] * ZM, b["x1"] * \
ZM, b["top"] * ZM, b["bottom"] * ZM
ZM, b["top"] * ZM, b["bottom"] * ZM
b["text"] = self.ocr.recognize(np.array(img),
np.array([[left, top], [right, top], [right, bott], [left, bott]],
dtype=np.float32))
@ -469,7 +469,8 @@ class HuParser:
continue
if re.match(r"[0-9]{2,3}/[0-9]{3}$", up["text"]) \
or re.match(r"[0-9]{2,3}/[0-9]{3}$", down["text"]):
or re.match(r"[0-9]{2,3}/[0-9]{3}$", down["text"]) \
or not down["text"].strip():
i += 1
continue
@ -597,7 +598,7 @@ class HuParser:
if b["text"].strip()[0] != b_["text"].strip()[0] \
or b["text"].strip()[0].lower() in set("qwertyuopasdfghjklzxcvbnm") \
or huqie.is_chinese(b["text"].strip()[0]) \
or rag_tokenizer.is_chinese(b["text"].strip()[0]) \
or b["top"] > b_["bottom"]:
i += 1
continue
@ -620,7 +621,7 @@ class HuParser:
i += 1
continue
lout_no = str(self.boxes[i]["page_number"]) + \
"-" + str(self.boxes[i]["layoutno"])
"-" + str(self.boxes[i]["layoutno"])
if TableStructureRecognizer.is_caption(self.boxes[i]) or self.boxes[i]["layout_type"] in ["table caption",
"title",
"figure caption",
@ -828,9 +829,13 @@ class HuParser:
pn = [bx["page_number"]]
top = bx["top"] - self.page_cum_height[pn[0] - 1]
bott = bx["bottom"] - self.page_cum_height[pn[0] - 1]
page_images_cnt = len(self.page_images)
if pn[-1] - 1 >= page_images_cnt: return ""
while bott * ZM > self.page_images[pn[-1] - 1].size[1]:
bott -= self.page_images[pn[-1] - 1].size[1] / ZM
pn.append(pn[-1] + 1)
if pn[-1] - 1 >= page_images_cnt:
return ""
return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##" \
.format("-".join([str(p) for p in pn]),
@ -916,9 +921,7 @@ class HuParser:
fnm) if not binary else pdfplumber.open(BytesIO(binary))
return len(pdf.pages)
except Exception as e:
pdf = fitz.open(fnm) if not binary else fitz.open(
stream=fnm, filetype="pdf")
return len(pdf)
logging.error(str(e))
def __images__(self, fnm, zoomin=3, page_from=0,
page_to=299, callback=None):
@ -930,6 +933,7 @@ class HuParser:
self.page_cum_height = [0]
self.page_layout = []
self.page_from = page_from
st = timer()
try:
self.pdf = pdfplumber.open(fnm) if isinstance(
fnm, str) else pdfplumber.open(BytesIO(fnm))
@ -939,23 +943,7 @@ class HuParser:
self.pdf.pages[page_from:page_to]]
self.total_page = len(self.pdf.pages)
except Exception as e:
self.pdf = fitz.open(fnm) if isinstance(
fnm, str) else fitz.open(
stream=fnm, filetype="pdf")
self.page_images = []
self.page_chars = []
mat = fitz.Matrix(zoomin, zoomin)
self.total_page = len(self.pdf)
for i, page in enumerate(self.pdf):
if i < page_from:
continue
if i >= page_to:
break
pix = page.get_pixmap(matrix=mat)
img = Image.frombytes("RGB", [pix.width, pix.height],
pix.samples)
self.page_images.append(img)
self.page_chars.append([])
logging.error(str(e))
self.outlines = []
try:
@ -968,6 +956,7 @@ class HuParser:
self.outlines.append((a["/Title"], depth))
continue
dfs(a, depth + 1)
dfs(outlines, 0)
except Exception as e:
logging.warning(f"Outlines exception: {e}")
@ -977,13 +966,15 @@ class HuParser:
logging.info("Images converted.")
self.is_english = [re.search(r"[a-zA-Z0-9,/¸;:'\[\]\(\)!@#$%^&*\"?<>._-]{30,}", "".join(
random.choices([c["text"] for c in self.page_chars[i]], k=min(100, len(self.page_chars[i]))))) for i in
range(len(self.page_chars))]
range(len(self.page_chars))]
if sum([1 if e else 0 for e in self.is_english]) > len(
self.page_images) / 2:
self.is_english = True
else:
self.is_english = False
self.is_english = False
st = timer()
for i, img in enumerate(self.page_images):
chars = self.page_chars[i] if not self.is_english else []
self.mean_height.append(
@ -1001,15 +992,11 @@ class HuParser:
chars[j]["width"]) / 2:
chars[j]["text"] += " "
j += 1
# if i > 0:
# if not chars:
# self.page_cum_height.append(img.size[1] / zoomin)
# else:
# self.page_cum_height.append(
# np.max([c["bottom"] for c in chars]))
self.__ocr(i + 1, img, chars, zoomin)
if callback:
if callback and i % 6 == 5:
callback(prog=(i + 1) * 0.6 / len(self.page_images), msg="")
# print("OCR:", timer()-st)
if not self.is_english and not any(
[c for c in self.page_chars]) and self.boxes:
@ -1045,7 +1032,7 @@ class HuParser:
left, right, top, bottom = float(left), float(
right), float(top), float(bottom)
poss.append(([int(p) - 1 for p in pn.split("-")],
left, right, top, bottom))
left, right, top, bottom))
if not poss:
if need_position:
return None, None
@ -1071,7 +1058,7 @@ class HuParser:
self.page_images[pns[0]].crop((left * ZM, top * ZM,
right *
ZM, min(
bottom, self.page_images[pns[0]].size[1])
bottom, self.page_images[pns[0]].size[1])
))
)
if 0 < ii < len(poss) - 1:

View File

@ -14,7 +14,7 @@ from io import BytesIO
from pptx import Presentation
class HuPptParser(object):
class RAGFlowPptParser(object):
def __init__(self):
super().__init__()

View File

@ -1,6 +1,6 @@
import re,json,os
import pandas as pd
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from . import regions
current_file_path = os.path.dirname(os.path.abspath(__file__))
GOODS = pd.read_csv(os.path.join(current_file_path, "res/corp_baike_len.csv"), sep="\t", header=0).fillna(0)
@ -22,14 +22,14 @@ def baike(cid, default_v=0):
def corpNorm(nm, add_region=True):
global CORP_TKS
if not nm or type(nm)!=type(""):return ""
nm = huqie.tradi2simp(huqie.strQ2B(nm)).lower()
nm = rag_tokenizer.tradi2simp(rag_tokenizer.strQ2B(nm)).lower()
nm = re.sub(r"&amp;", "&", nm)
nm = re.sub(r"[\(\)\+'\"\t \*\\【】-]+", " ", nm)
nm = re.sub(r"([—-]+.*| +co\..*|corp\..*| +inc\..*| +ltd.*)", "", nm, 10000, re.IGNORECASE)
nm = re.sub(r"(计算机|技术|(技术|科技|网络)*有限公司|公司|有限|研发中心|中国|总部)$", "", nm, 10000, re.IGNORECASE)
if not nm or (len(nm)<5 and not regions.isName(nm[0:2])):return nm
tks = huqie.qie(nm).split(" ")
tks = rag_tokenizer.tokenize(nm).split(" ")
reg = [t for i,t in enumerate(tks) if regions.isName(t) and (t != "中国" or i > 0)]
nm = ""
for t in tks:

View File

@ -1,9 +1,9 @@
# -*- coding: utf-8 -*-
import re, copy, time, datetime, demjson, \
import re, copy, time, datetime, demjson3, \
traceback, signal
import numpy as np
from deepdoc.parser.resume.entities import degrees, schools, corporations
from rag.nlp import huqie, surname
from rag.nlp import rag_tokenizer, surname
from xpinyin import Pinyin
from contextlib import contextmanager
@ -83,7 +83,7 @@ def forEdu(cv):
if n.get("school_name") and isinstance(n["school_name"], str):
sch.append(re.sub(r"(211|985|重点大学|[,&;-])", "", n["school_name"]))
e["sch_nm_kwd"] = sch[-1]
fea.append(huqie.qieqie(huqie.qie(n.get("school_name", ""))).split(" ")[-1])
fea.append(rag_tokenizer.fine_grained_tokenize(rag_tokenizer.tokenize(n.get("school_name", ""))).split(" ")[-1])
if n.get("discipline_name") and isinstance(n["discipline_name"], str):
maj.append(n["discipline_name"])
@ -166,10 +166,10 @@ def forEdu(cv):
if "tag_kwd" not in cv: cv["tag_kwd"] = []
if "好学历" not in cv["tag_kwd"]: cv["tag_kwd"].append("好学历")
if cv.get("major_kwd"): cv["major_tks"] = huqie.qie(" ".join(maj))
if cv.get("school_name_kwd"): cv["school_name_tks"] = huqie.qie(" ".join(sch))
if cv.get("first_school_name_kwd"): cv["first_school_name_tks"] = huqie.qie(" ".join(fsch))
if cv.get("first_major_kwd"): cv["first_major_tks"] = huqie.qie(" ".join(fmaj))
if cv.get("major_kwd"): cv["major_tks"] = rag_tokenizer.tokenize(" ".join(maj))
if cv.get("school_name_kwd"): cv["school_name_tks"] = rag_tokenizer.tokenize(" ".join(sch))
if cv.get("first_school_name_kwd"): cv["first_school_name_tks"] = rag_tokenizer.tokenize(" ".join(fsch))
if cv.get("first_major_kwd"): cv["first_major_tks"] = rag_tokenizer.tokenize(" ".join(fmaj))
return cv
@ -187,17 +187,17 @@ def forProj(cv):
if n.get("achivement"): desc.append(str(n["achivement"]))
if pro_nms:
# cv["pro_nms_tks"] = huqie.qie(" ".join(pro_nms))
cv["project_name_tks"] = huqie.qie(pro_nms[0])
# cv["pro_nms_tks"] = rag_tokenizer.tokenize(" ".join(pro_nms))
cv["project_name_tks"] = rag_tokenizer.tokenize(pro_nms[0])
if desc:
cv["pro_desc_ltks"] = huqie.qie(rmHtmlTag(" ".join(desc)))
cv["project_desc_ltks"] = huqie.qie(rmHtmlTag(desc[0]))
cv["pro_desc_ltks"] = rag_tokenizer.tokenize(rmHtmlTag(" ".join(desc)))
cv["project_desc_ltks"] = rag_tokenizer.tokenize(rmHtmlTag(desc[0]))
return cv
def json_loads(line):
return demjson.decode(re.sub(r": *(True|False)", r": '\1'", line))
return demjson3.decode(re.sub(r": *(True|False)", r": '\1'", line))
def forWork(cv):
@ -280,25 +280,25 @@ def forWork(cv):
if fea["corporation_id"]: cv["corporation_id"] = fea["corporation_id"]
if fea["position_name"]:
cv["position_name_tks"] = huqie.qie(fea["position_name"][0])
cv["position_name_sm_tks"] = huqie.qieqie(cv["position_name_tks"])
cv["pos_nm_tks"] = huqie.qie(" ".join(fea["position_name"][1:]))
cv["position_name_tks"] = rag_tokenizer.tokenize(fea["position_name"][0])
cv["position_name_sm_tks"] = rag_tokenizer.fine_grained_tokenize(cv["position_name_tks"])
cv["pos_nm_tks"] = rag_tokenizer.tokenize(" ".join(fea["position_name"][1:]))
if fea["industry_name"]:
cv["industry_name_tks"] = huqie.qie(fea["industry_name"][0])
cv["industry_name_sm_tks"] = huqie.qieqie(cv["industry_name_tks"])
cv["indu_nm_tks"] = huqie.qie(" ".join(fea["industry_name"][1:]))
cv["industry_name_tks"] = rag_tokenizer.tokenize(fea["industry_name"][0])
cv["industry_name_sm_tks"] = rag_tokenizer.fine_grained_tokenize(cv["industry_name_tks"])
cv["indu_nm_tks"] = rag_tokenizer.tokenize(" ".join(fea["industry_name"][1:]))
if fea["corporation_name"]:
cv["corporation_name_kwd"] = fea["corporation_name"][0]
cv["corp_nm_kwd"] = fea["corporation_name"]
cv["corporation_name_tks"] = huqie.qie(fea["corporation_name"][0])
cv["corporation_name_sm_tks"] = huqie.qieqie(cv["corporation_name_tks"])
cv["corp_nm_tks"] = huqie.qie(" ".join(fea["corporation_name"][1:]))
cv["corporation_name_tks"] = rag_tokenizer.tokenize(fea["corporation_name"][0])
cv["corporation_name_sm_tks"] = rag_tokenizer.fine_grained_tokenize(cv["corporation_name_tks"])
cv["corp_nm_tks"] = rag_tokenizer.tokenize(" ".join(fea["corporation_name"][1:]))
if fea["responsibilities"]:
cv["responsibilities_ltks"] = huqie.qie(fea["responsibilities"][0])
cv["resp_ltks"] = huqie.qie(" ".join(fea["responsibilities"][1:]))
cv["responsibilities_ltks"] = rag_tokenizer.tokenize(fea["responsibilities"][0])
cv["resp_ltks"] = rag_tokenizer.tokenize(" ".join(fea["responsibilities"][1:]))
if fea["subordinates_count"]: fea["subordinates_count"] = [int(i) for i in fea["subordinates_count"] if
re.match(r"[^0-9]+$", str(i))]
@ -444,15 +444,15 @@ def parse(cv):
if nms:
t = k[:-4]
cv[f"{t}_kwd"] = nms
cv[f"{t}_tks"] = huqie.qie(" ".join(nms))
cv[f"{t}_tks"] = rag_tokenizer.tokenize(" ".join(nms))
except Exception as e:
print("【EXCEPTION】:", str(traceback.format_exc()), cv[k])
cv[k] = []
# tokenize fields
if k in tks_fld:
cv[f"{k}_tks"] = huqie.qie(cv[k])
if k in small_tks_fld: cv[f"{k}_sm_tks"] = huqie.qie(cv[f"{k}_tks"])
cv[f"{k}_tks"] = rag_tokenizer.tokenize(cv[k])
if k in small_tks_fld: cv[f"{k}_sm_tks"] = rag_tokenizer.tokenize(cv[f"{k}_tks"])
# keyword fields
if k in kwd_fld: cv[f"{k}_kwd"] = [n.lower()
@ -492,7 +492,7 @@ def parse(cv):
cv["name_kwd"] = name
cv["name_pinyin_kwd"] = PY.get_pinyins(nm[:20], ' ')[:3]
cv["name_tks"] = (
huqie.qie(name) + " " + (" ".join(list(name)) if not re.match(r"[a-zA-Z ]+$", name) else "")
rag_tokenizer.tokenize(name) + " " + (" ".join(list(name)) if not re.match(r"[a-zA-Z ]+$", name) else "")
) if name else ""
else:
cv["integerity_flt"] /= 2.
@ -515,7 +515,7 @@ def parse(cv):
cv["updated_at_dt"] = f"%s-%02d-%02d 00:00:00" % (y, int(m), int(d))
# long text tokenize
if cv.get("responsibilities"): cv["responsibilities_ltks"] = huqie.qie(rmHtmlTag(cv["responsibilities"]))
if cv.get("responsibilities"): cv["responsibilities_ltks"] = rag_tokenizer.tokenize(rmHtmlTag(cv["responsibilities"]))
# for yes or no field
fea = []

View File

@ -1,12 +1,13 @@
import pdfplumber
from .ocr import OCR
from .recognizer import Recognizer
from .layout_recognizer import LayoutRecognizer
from .table_structure_recognizer import TableStructureRecognizer
def init_in_out(args):
from PIL import Image
import fitz
import os
import traceback
from api.utils.file_utils import traversal_files
@ -18,13 +19,11 @@ def init_in_out(args):
def pdf_pages(fnm, zoomin=3):
nonlocal outputs, images
pdf = fitz.open(fnm)
mat = fitz.Matrix(zoomin, zoomin)
for i, page in enumerate(pdf):
pix = page.get_pixmap(matrix=mat)
img = Image.frombytes("RGB", [pix.width, pix.height],
pix.samples)
images.append(img)
pdf = pdfplumber.open(fnm)
images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
enumerate(pdf.pages)]
for i, page in enumerate(images):
outputs.append(os.path.split(fnm)[-1] + f"_{i}.jpg")
def images_and_outputs(fnm):

View File

@ -43,7 +43,9 @@ class LayoutRecognizer(Recognizer):
"rag/res/deepdoc")
super().__init__(self.labels, domain, model_dir)
except Exception as e:
model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc")
model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False)
super().__init__(self.labels, domain, model_dir)
self.garbage_layouts = ["footer", "header", "reference"]

View File

@ -486,7 +486,9 @@ class OCR(object):
self.text_detector = TextDetector(model_dir)
self.text_recognizer = TextRecognizer(model_dir)
except Exception as e:
model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc")
model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False)
self.text_detector = TextDetector(model_dir)
self.text_recognizer = TextRecognizer(model_dir)

View File

@ -41,7 +41,9 @@ class Recognizer(object):
"rag/res/deepdoc")
model_file_path = os.path.join(model_dir, task_name + ".onnx")
if not os.path.exists(model_file_path):
model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc")
model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False)
model_file_path = os.path.join(model_dir, task_name + ".onnx")
else:
model_file_path = os.path.join(model_dir, task_name + ".onnx")

View File

@ -11,10 +11,6 @@
# limitations under the License.
#
from deepdoc.vision.seeit import draw_box
from deepdoc.vision import OCR, init_in_out
import argparse
import numpy as np
import os
import sys
sys.path.insert(
@ -25,6 +21,11 @@ sys.path.insert(
os.path.abspath(__file__)),
'../../')))
from deepdoc.vision.seeit import draw_box
from deepdoc.vision import OCR, init_in_out
import argparse
import numpy as np
def main(args):
ocr = OCR()

View File

@ -10,17 +10,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from deepdoc.vision.seeit import draw_box
from deepdoc.vision import Recognizer, LayoutRecognizer, TableStructureRecognizer, OCR, init_in_out
from api.utils.file_utils import get_project_base_directory
import argparse
import os
import sys
import re
import numpy as np
import os, sys
sys.path.insert(
0,
os.path.abspath(
@ -29,6 +19,13 @@ sys.path.insert(
os.path.abspath(__file__)),
'../../')))
from deepdoc.vision.seeit import draw_box
from deepdoc.vision import Recognizer, LayoutRecognizer, TableStructureRecognizer, OCR, init_in_out
from api.utils.file_utils import get_project_base_directory
import argparse
import re
import numpy as np
def main(args):
images, outputs = init_in_out(args)

View File

@ -19,7 +19,7 @@ import numpy as np
from huggingface_hub import snapshot_download
from api.utils.file_utils import get_project_base_directory
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from .recognizer import Recognizer
@ -39,7 +39,9 @@ class TableStructureRecognizer(Recognizer):
get_project_base_directory(),
"rag/res/deepdoc"))
except Exception as e:
super().__init__(self.labels, "tsr", snapshot_download(repo_id="InfiniFlow/deepdoc"))
super().__init__(self.labels, "tsr", snapshot_download(repo_id="InfiniFlow/deepdoc",
local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"),
local_dir_use_symlinks=False))
def __call__(self, images, thr=0.2):
tbls = super().__call__(images, thr)
@ -115,14 +117,14 @@ class TableStructureRecognizer(Recognizer):
for p, n in patt:
if re.search(p, b["text"].strip()):
return n
tks = [t for t in huqie.qie(b["text"]).split(" ") if len(t) > 1]
tks = [t for t in rag_tokenizer.tokenize(b["text"]).split(" ") if len(t) > 1]
if len(tks) > 3:
if len(tks) < 12:
return "Tx"
else:
return "Lx"
if len(tks) == 1 and huqie.tag(tks[0]) == "nr":
if len(tks) == 1 and rag_tokenizer.tag(tks[0]) == "nr":
return "Nr"
return "Ot"

View File

@ -11,16 +11,26 @@ ES_PORT=1200
KIBANA_PORT=6601
# Increase or decrease based on the available host memory (in bytes)
MEM_LIMIT=4073741824
MEM_LIMIT=8073741824
MYSQL_PASSWORD=infini_rag_flow
MYSQL_PORT=5455
# Port to expose minio to the host
MINIO_CONSOLE_PORT=9001
MINIO_PORT=9000
MINIO_USER=rag_flow
MINIO_PASSWORD=infini_rag_flow
REDIS_PASSWORD=infini_rag_flow
SVR_HTTP_PORT=9380
RAGFLOW_VERSION=dev
TIMEZONE='Asia/Shanghai'
######## OS setup for ES ###########

View File

@ -50,7 +50,7 @@ The serving port of mysql inside the container. The modification should be synch
The max database connection.
### stale_timeout
The timeout duation in seconds.
The timeout duration in seconds.
## minio

View File

@ -0,0 +1,29 @@
include:
- path: ./docker-compose-base.yml
env_file: ./.env
services:
ragflow:
depends_on:
mysql:
condition: service_healthy
es01:
condition: service_healthy
image: edwardelric233/ragflow:oc9
container_name: ragflow-server
ports:
- ${SVR_HTTP_PORT}:9380
- 80:80
- 443:443
volumes:
- ./service_conf.yaml:/ragflow/conf/service_conf.yaml
- ./ragflow-logs:/ragflow/logs
- ./nginx/ragflow.conf:/etc/nginx/conf.d/ragflow.conf
- ./nginx/proxy.conf:/etc/nginx/proxy.conf
- ./nginx/nginx.conf:/etc/nginx/nginx.conf
environment:
- TZ=${TIMEZONE}
- HF_ENDPOINT=https://hf-mirror.com
networks:
- ragflow
restart: always

View File

@ -9,7 +9,7 @@ services:
condition: service_healthy
es01:
condition: service_healthy
image: swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow:v1.0
image: swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow:${RAGFLOW_VERSION}
container_name: ragflow-server
ports:
- ${SVR_HTTP_PORT}:9380

View File

@ -29,24 +29,6 @@ services:
- ragflow
restart: always
kibana:
depends_on:
es01:
condition: service_healthy
image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
container_name: ragflow-kibana
volumes:
- kibanadata:/usr/share/kibana/data
ports:
- ${KIBANA_PORT}:5601
environment:
- SERVERNAME=kibana
- ELASTICSEARCH_HOSTS=http://es01:9200
- TZ=${TIMEZONE}
mem_limit: ${MEM_LIMIT}
networks:
- ragflow
mysql:
image: mysql:5.7.18
container_name: ragflow-mysql
@ -74,14 +56,13 @@ services:
retries: 3
restart: always
minio:
image: quay.io/minio/minio:RELEASE.2023-12-20T01-00-02Z
container_name: ragflow-minio
command: server --console-address ":9001" /data
ports:
- 9000:9000
- 9001:9001
- ${MINIO_PORT}:9000
- ${MINIO_CONSOLE_PORT}:9001
environment:
- MINIO_ROOT_USER=${MINIO_USER}
- MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
@ -92,16 +73,27 @@ services:
- ragflow
restart: always
redis:
image: redis:7.2.4
container_name: ragflow-redis
command: redis-server --requirepass ${REDIS_PASSWORD} --maxmemory 128mb --maxmemory-policy allkeys-lru
volumes:
- redis_data:/data
networks:
- ragflow
restart: always
volumes:
esdata01:
driver: local
kibanadata:
driver: local
mysql_data:
driver: local
minio_data:
driver: local
redis_data:
driver: local
networks:
ragflow:

View File

@ -9,7 +9,7 @@ services:
condition: service_healthy
es01:
condition: service_healthy
image: infiniflow/ragflow:v1.0
image: infiniflow/ragflow:${RAGFLOW_VERSION}
container_name: ragflow-server
ports:
- ${SVR_HTTP_PORT}:9380
@ -23,7 +23,7 @@ services:
- ./nginx/nginx.conf:/etc/nginx/nginx.conf
environment:
- TZ=${TIMEZONE}
- HF_ENDPOINT=https://huggingface.com
- HF_ENDPOINT=https://huggingface.co
networks:
- ragflow
restart: always

View File

@ -12,29 +12,14 @@ function task_exe(){
done
}
function watch_broker(){
while [ 1 -eq 1 ];do
C=`ps aux|grep "task_broker.py"|grep -v grep|wc -l`;
if [ $C -lt 1 ];then
$PY rag/svr/task_broker.py &
fi
sleep 5;
done
}
function task_bro(){
sleep 160;
watch_broker;
}
task_bro &
WS=2
WS=1
for ((i=0;i<WS;i++))
do
task_exe $i $WS &
done
$PY api/ragflow_server.py
while [ 1 -eq 1 ];do
$PY api/ragflow_server.py
done
wait;
wait;

View File

@ -15,6 +15,10 @@ minio:
host: 'minio:9000'
es:
hosts: 'http://es01:9200'
redis:
db: 1
password: 'infini_rag_flow'
host: 'redis:6379'
user_default_llm:
factory: 'Tongyi-Qianwen'
api_key: 'sk-xxxxxxxxxxxxx'
@ -34,4 +38,4 @@ authentication:
permission:
switch: false
component: false
dataset: false
dataset: false

View File

@ -1,5 +1,9 @@
# Conversation API Instruction
<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://github.com/infiniflow/ragflow/assets/12318111/df0dcc3d-789a-44f7-89f1-7a5f044ab729" width="830"/>
</div>
## Base URL
```buildoutcfg
https://demo.ragflow.io/v1/
@ -7,7 +11,7 @@ https://demo.ragflow.io/v1/
## Authorization
All the APIs are authorized with API-Key. Please keep it save and private. Don't reveal it in any way from the front-end.
All the APIs are authorized with API-Key. Please keep it safe and private. Don't reveal it in any way from the front-end.
The API-Key should be put in the header of the request:
```buildoutcfg
Authorization: Bearer {API_KEY}
@ -217,6 +221,7 @@ This will be called to get the answer to users' questions.
|------|-------|----|----|
| conversation_id| string | No | This is from calling /new_conversation.|
| messages| json | No | All the conversation history stored here including the latest user's question.|
| quote | bool | Yes | Default: true |
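For reference, a request body that follows this parameter table might look like the hedged sketch below; the exact message schema and the endpoint URL are not shown in this excerpt, so treat them as assumptions.
```python
# Hypothetical request body for the completion call described above.
payload = {
    "conversation_id": "a1b2c3d4e5f6",  # placeholder; returned by /new_conversation
    "messages": [                        # full conversation history, latest question last
        {"role": "user", "content": "What is RAGFlow?"}
    ],
    "quote": True,                       # optional; defaults to true
}
```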
### Response
```json
@ -299,5 +304,61 @@ This will be called to get the answer to users' questions.
## Get document content or image
This is usually used when displaying the content of a citation.
### Path: /document/get/\<id\>
### Path: /api/document/get/\<id\>
### Method: GET
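A minimal, hypothetical sketch of fetching a cited document by id is shown below; the base URL and API key follow the Authorization section above, and the document id is a placeholder.
```python
# Hypothetical sketch: download the content or image behind a citation.
import requests

BASE_URL = "https://demo.ragflow.io/v1"      # see the Base URL section above
API_KEY = "YOUR_API_KEY"                     # keep it safe and private
doc_id = "41e9324602cd11ef9f5f3043d7ed348e"  # placeholder document id

resp = requests.get(
    f"{BASE_URL}/api/document/get/{doc_id}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
with open("cited_document.bin", "wb") as out:
    out.write(resp.content)  # raw bytes of the document or image
```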
## Upload file
This is usually used when uploading a file to a knowledge base.
### Path: /api/document/upload/
### Method: POST
### Parameter:
| name | type | optional | description |
|---------|--------|----------|----------------------------------------|
| file | file | No | The file to upload. |
| kb_name | string | No | The name of the knowledge base to upload the file into. |
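A minimal, hypothetical sketch of the upload call using the parameters above; the base URL, API key, file name, and knowledge base name are placeholders to adjust for your deployment.
```python
# Hypothetical sketch: upload a local file into an existing knowledge base.
import requests

BASE_URL = "https://demo.ragflow.io/v1"   # see the Base URL section above
API_KEY = "YOUR_API_KEY"                  # keep it safe and private

with open("readme.txt", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/document/upload/",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},                      # the file to upload
        data={"kb_name": "my_knowledge_base"},  # target knowledge base name
    )
print(resp.json())  # on success, retcode is 0 and "data" holds the document metadata
```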
### Response
```json
{
"data": {
"chunk_num": 0,
"create_date": "Thu, 25 Apr 2024 14:30:06 GMT",
"create_time": 1714026606921,
"created_by": "553ec818fd5711ee8ea63043d7ed348e",
"id": "41e9324602cd11ef9f5f3043d7ed348e",
"kb_id": "06802686c0a311ee85d6246e9694c130",
"location": "readme.txt",
"name": "readme.txt",
"parser_config": {
"field_map": {
},
"pages": [
[
0,
1000000
]
]
},
"parser_id": "general",
"process_begin_at": null,
"process_duation": 0.0,
"progress": 0.0,
"progress_msg": "",
"run": "0",
"size": 929,
"source_type": "local",
"status": "1",
"thumbnail": null,
"token_num": 0,
"type": "doc",
"update_date": "Thu, 25 Apr 2024 14:30:06 GMT",
"update_time": 1714026606921
},
"retcode": 0,
"retmsg": "success"
}
```

View File

@ -2,105 +2,222 @@
## General
### What sets RAGFlow apart from other RAG products?
### 1. What sets RAGFlow apart from other RAG products?
The "garbage in garbage out" status quo remains unchanged despite the fact that LLMs have advanced Natural Language Processing (NLP) significantly. In response, RAGFlow introduces two unique features compared to other Retrieval-Augmented Generation (RAG) products.
- Fine-grained document parsing: Document parsing involves images and tables, with the flexibility for you to intervene as needed.
- Traceable answers with reduced hallucinations: You can trust RAGFlow's responses as you can view the citations and references supporting them.
### Which languages does RAGFlow support?
### 2. Which languages does RAGFlow support?
English, simplified Chinese, traditional Chinese for now.
## Performance
### Why does it take longer for RAGFlow to parse a document than LangChain?
### 1. Why does it take longer for RAGFlow to parse a document than LangChain?
We put painstaking effort into document pre-processing tasks like layout analysis, table structure recognition, and OCR (Optical Character Recognition) using our vision model. This contributes to the additional time required.
### 2. Why does RAGFlow require more resources than other projects?
RAGFlow has a number of built-in models for document structure parsing, which account for the additional computational resources.
## Feature
### Which architectures or devices does RAGFlow support?
### 1. Which architectures or devices does RAGFlow support?
ARM64 and Ascend GPU are not supported.
Currently, we only support x86 CPU and Nvidia GPU.
### Do you offer an API for integration with third-party applications?
### 2. Do you offer an API for integration with third-party applications?
These APIs are still in development. Contributions are welcome.
The corresponding APIs are now available. See the [Conversation API](./conversation_api.md) for more information.
### Do you support stream output?
### 3. Do you support stream output?
No, this feature is still in development. Contributions are welcome.
### Is it possible to share dialogue through URL?
### 4. Is it possible to share dialogue through URL?
Yes, this feature is now available.
### 5. Do you support multiple rounds of dialogues, i.e., referencing previous dialogues as context for the current dialogue?
This feature and the related APIs are still in development. Contributions are welcome.
### Do you support multiple rounds of dialogues, i.e., referencing previous dialogues as context for the current dialogue?
This feature and the related APIs are still in development. Contributions are welcome.
## Troubleshooting
## Configurations
### 1. Issues with docker images
### How to increase the length of RAGFlow responses?
#### 1.1 How to build the RAGFlow image from scratch?
1. Right click the desired dialog to display the **Chat Configuration** window.
2. Switch to the **Model Setting** tab and adjust the **Max Tokens** slider to get the desired length.
3. Click **OK** to confirm your change.
```
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow
$ docker build -t infiniflow/ragflow:latest .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
#### 1.2 `process "/bin/sh -c cd ./web && npm i && npm run build"` failed
### What does Empty response mean? How to set it?
1. Check your network from within Docker, for example:
```bash
curl https://hf-mirror.com
```
You limit what the system responds to what you specify in **Empty response** if nothing is retrieved from your knowledge base. If you do not specify anything in **Empty response**, you let your LLM improvise, giving it a chance to hallucinate.
2. If your network works fine, the issue lies with the Docker network configuration. Replace the Docker building command:
```bash
docker build -t infiniflow/ragflow:vX.Y.Z .
```
With this:
```bash
docker build -t infiniflow/ragflow:vX.Y.Z . --network host
```
### Can I set the base URL for OpenAI somewhere?
### 2. Issues with huggingface models
![](https://github.com/infiniflow/ragflow/assets/93570324/8cfb6fa4-8a97-415d-b9fa-b6f405a055f3)
#### 2.1 Cannot access https://huggingface.co
A *locally* deployed RAGFlow downloads OCR and embedding models from the [Huggingface website](https://huggingface.co) by default. If your machine is unable to access this site, the following error occurs and PDF parsing fails:
```
FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/huggingface/hub/models--InfiniFlow--deepdoc/snapshots/be0c1e50eef6047b412d1800aa89aba4d275f997/ocr.res'
```
To fix this issue, use https://hf-mirror.com instead:
### How to run RAGFlow with a locally deployed LLM?
1. Stop all containers and remove all related resources:
You can use Ollama to deploy local LLM. See [here](https://github.com/infiniflow/ragflow/blob/main/docs/ollama.md) for more information.
```bash
cd ragflow/docker/
docker compose down
```
### How to link up ragflow and ollama servers?
2. Replace `https://huggingface.co` with `https://hf-mirror.com` in **ragflow/docker/docker-compose.yml**.
3. Start up the server:
- If RAGFlow is locally deployed, ensure that your RAGFlow and Ollama are in the same LAN.
- If you are using our online demo, ensure that the IP address of your Ollama server is public and accessible.
```bash
docker compose up -d
```
### How to configure RAGFlow to respond with 100% matched results, rather than utilizing LLM?
#### 2.2. `MaxRetryError: HTTPSConnectionPool(host='hf-mirror.com', port=443)`
1. Click the **Knowledge Base** tab in the middle top of the page.
2. Right click the desired knowledge base to display the **Configuration** dialogue.
3. Choose **Q&A** as the chunk method and click **Save** to confirm your change.
This error suggests that you do not have Internet access or are unable to connect to hf-mirror.com. Try the following:
## Debugging
1. Manually download the resource files from [huggingface.co/InfiniFlow/deepdoc](https://huggingface.co/InfiniFlow/deepdoc) to your local folder **~/deepdoc**.
2. Add a volume to **docker-compose.yml**, for example:
```
- ~/deepdoc:/ragflow/rag/res/deepdoc
```
### How to handle `WARNING: can't find /raglof/rag/res/borker.tm`?
#### 2.3 `FileNotFoundError: [Errno 2] No such file or directory: '/ragflow/rag/res/deepdoc/ocr.res'`
1. Check your network from within Docker, for example:
```bash
curl https://hf-mirror.com
```
2. Run `ifconfig` to check the `mtu` value. If the server's `mtu` is `1450` while the NIC's `mtu` in the container is `1500`, this mismatch may cause network instability. Adjust the `mtu` policy as follows:
```
vim docker-compose-base.yml
# Original configuration
networks:
ragflow:
driver: bridge
# Modified configuration
networks:
ragflow:
driver: bridge
driver_opts:
com.docker.network.driver.mtu: 1450
```
### 3. Issues with RAGFlow servers
#### 3.1 `WARNING: can't find /raglof/rag/res/borker.tm`
Ignore this warning and continue. All system warnings can be ignored.
### How to handle `Realtime synonym is disabled, since no redis connection`?
#### 3.2 `network anomaly There is an abnormality in your network and you cannot connect to the server.`
![anomaly](https://github.com/infiniflow/ragflow/assets/93570324/beb7ad10-92e4-4a58-8886-bfb7cbd09e5d)
You cannot log in to RAGFlow until the server is fully initialized. Run `docker logs -f ragflow-server` to check the server status.
*The server is successfully initialized if your system displays the following:*
```
____ ______ __
/ __ \ ____ _ ____ _ / ____// /____ _ __
/ /_/ // __ `// __ `// /_ / // __ \| | /| / /
/ _, _// /_/ // /_/ // __/ / // /_/ /| |/ |/ /
/_/ |_| \__,_/ \__, //_/ /_/ \____/ |__/|__/
/____/
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:9380
* Running on http://x.x.x.x:9380
INFO:werkzeug:Press CTRL+C to quit
```
### 4. Issues with RAGFlow backend services
#### 4.1 `dependency failed to start: container ragflow-mysql is unhealthy`
`dependency failed to start: container ragflow-mysql is unhealthy` means that your MySQL container failed to start. Try replacing `mysql:5.7.18` with `mariadb:10.5.8` in **docker-compose-base.yml**.
#### 4.2 `Realtime synonym is disabled, since no redis connection`
Ignore this warning and continue. All system warnings can be ignored.
![](https://github.com/infiniflow/ragflow/assets/93570324/ef5a6194-084a-4fe3-bdd5-1c025b40865c)
### Why does it take so long to parse a 2MB document?
#### 4.3 Why does it take so long to parse a 2MB document?
Parsing requests have to wait in queue due to limited server resources. We are currently enhancing our algorithms and increasing computing power.
### How to handle `Index failure`?
#### 4.4 Why does my document parsing stall at under one percent?
![stall](https://github.com/infiniflow/ragflow/assets/93570324/3589cc25-c733-47d5-bbfc-fedb74a3da50)
If your RAGFlow is deployed *locally*, try the following:
1. Check the log of your RAGFlow server to see if it is running properly:
```bash
docker logs -f ragflow-server
```
2. Check if the **task_executor.py** process exists.
3. Check if your RAGFlow server can access hf-mirror.com or huggingface.com.
#### 4.5 Why does my pdf parsing stall near completion, while the log does not show any error?
If your RAGFlow is deployed *locally*, the parsing process is likely killed due to insufficient RAM. Try increasing your memory allocation by increasing the `MEM_LIMIT` value in **docker/.env**.
> Ensure that you restart your RAGFlow server for your changes to take effect!
> ```bash
> docker compose stop
> ```
> ```bash
> docker compose up -d
> ```
![nearcompletion](https://github.com/infiniflow/ragflow/assets/93570324/563974c3-f8bb-4ec8-b241-adcda8929cbb)
#### 4.6 `Index failure`
An index failure usually indicates an unavailable Elasticsearch service.
### How to check the log of RAGFlow?
#### 4.7 How to check the log of RAGFlow?
```bash
tail -f path_to_ragflow/docker/ragflow-logs/rag/*.log
```
### How to check the status of each component in RAGFlow?
#### 4.8 How to check the status of each component in RAGFlow?
```bash
$ docker ps
@ -108,13 +225,13 @@ $ docker ps
*The system displays the following if all your RAGFlow components are running properly:*
```
5bc45806b680 infiniflow/ragflow:v1.0 "./entrypoint.sh" 11 hours ago Up 11 hours 0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp ragflow-server
5bc45806b680 infiniflow/ragflow:latest "./entrypoint.sh" 11 hours ago Up 11 hours 0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp ragflow-server
91220e3285dd docker.elastic.co/elasticsearch/elasticsearch:8.11.3 "/bin/tini -- /usr/l…" 11 hours ago Up 11 hours (healthy) 9300/tcp, 0.0.0.0:9200->9200/tcp, :::9200->9200/tcp ragflow-es-01
d8c86f06c56b mysql:5.7.18 "docker-entrypoint.s…" 7 days ago Up 16 seconds (healthy) 0.0.0.0:3306->3306/tcp, :::3306->3306/tcp ragflow-mysql
cd29bcb254bc quay.io/minio/minio:RELEASE.2023-12-20T01-00-02Z "/usr/bin/docker-ent…" 2 weeks ago Up 11 hours 0.0.0.0:9001->9001/tcp, :::9001->9001/tcp, 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp ragflow-minio
```
### How to handle `Exception: Can't connect to ES cluster`?
#### 4.9 `Exception: Can't connect to ES cluster`
1. Check the status of your Elasticsearch component:
@ -126,7 +243,7 @@ $ docker ps
91220e3285dd docker.elastic.co/elasticsearch/elasticsearch:8.11.3 "/bin/tini -- /usr/l…" 11 hours ago Up 11 hours (healthy) 9300/tcp, 0.0.0.0:9200->9200/tcp, :::9200->9200/tcp ragflow-es-01
```
2. If your container keeps restarting, ensure `vm.max_map_count` >= 262144 as per [this README](https://github.com/infiniflow/ragflow?tab=readme-ov-file#-start-up-the-server).
2. If your container keeps restarting, ensure `vm.max_map_count` >= 262144 as per [this README](https://github.com/infiniflow/ragflow?tab=readme-ov-file#-start-up-the-server). Updating the `vm.max_map_count` value in **/etc/sysctl.conf** is required, if you wish to keep your change permanent. This configuration works only for Linux.
3. If your issue persists, ensure that the ES host setting is correct:
@ -141,13 +258,148 @@ $ docker ps
curl http://<IP_OF_ES>:<PORT_OF_ES>
```
#### 4.10 Can't start ES container and get `Elasticsearch did not exit normally`
### How to handle `{"data":null,"retcode":100,"retmsg":"<NotFound '404: Not Found'>"}`?
This is because you forgot to update the `vm.max_map_count` value in **/etc/sysctl.conf** and your change to this value was reset after a system reboot.
Your IP address or port number may be incorrect. If you are using the default configurations, enter http://<IP_OF_YOUR_MACHINE> (**NOT `localhost`, NOT 9380, AND NO PORT NUMBER REQUIRED!**) in your browser. This should work.
#### 4.11 `{"data":null,"retcode":100,"retmsg":"<NotFound '404: Not Found'>"}`
Your IP address or port number may be incorrect. If you are using the default configurations, enter http://<IP_OF_YOUR_MACHINE> (**NOT 9380, AND NO PORT NUMBER REQUIRED!**) in your browser. This should work.
#### 4.12 `Ollama - Mistral instance running at 127.0.0.1:11434 but cannot add Ollama as model in RagFlow`
A correct Ollama IP address and port is crucial to adding models to Ollama:
- If you are on demo.ragflow.io, ensure that the server hosting Ollama has a publicly accessible IP address. Note that 127.0.0.1 is not a publicly accessible IP address.
- If you deploy RAGFlow locally, ensure that Ollama and RAGFlow are in the same LAN and can communicate with each other.
#### 4.13 Do you offer examples of using deepdoc to parse PDF or other files?
Yes, we do. See the Python files under the **rag/app** folder.
#### 4.14 Why did I fail to upload a 10MB+ file to my locally deployed RAGFlow?
You probably forgot to update the **MAX_CONTENT_LENGTH** environment variable:
1. Add environment variable `MAX_CONTENT_LENGTH` to **ragflow/docker/.env**:
```
MAX_CONTENT_LENGTH=100000000
```
2. Update **docker-compose.yml**:
```
environment:
- MAX_CONTENT_LENGTH=${MAX_CONTENT_LENGTH}
```
3. Restart the RAGFlow server:
```
docker compose up ragflow -d
```
*Now you should be able to upload files of sizes less than 100MB.*
#### 4.15 `Table 'rag_flow.document' doesn't exist`
This exception occurs when starting up the RAGFlow server. Try the following:
1. Prolong the sleep time: Go to **docker/entrypoint.sh**, locate line 26, and replace `sleep 60` with `sleep 280`.
2. If using Windows, ensure that the **entrypoint.sh** has LF end-lines.
3. Go to **docker/docker-compose.yml**, add the following:
```
./entrypoint.sh:/ragflow/entrypoint.sh
```
4. Change directory:
```bash
cd docker
```
5. Stop the RAGFlow server:
```bash
docker compose stop
```
6. Restart the RAGFlow server:
```bash
docker compose up
```
#### 4.16 `hint : 102 Fail to access model Connection error`
![hint102](https://github.com/infiniflow/ragflow/assets/93570324/6633d892-b4f8-49b5-9a0a-37a0a8fba3d2)
1. Ensure that the RAGFlow server can access the base URL.
2. Do not forget to append **/v1/** to **http://IP:port**:
**http://IP:port/v1/**
#### 4.17 `FileNotFoundError: [Errno 2] No such file or directory`
1. Check if the status of your minio container is healthy:
```bash
docker ps
```
2. Ensure that the username and password settings of MySQL and MinIO in **docker/.env** are in line with those in **docker/service_conf.yml**.
## Usage
### 1. How to increase the length of RAGFlow responses?
1. Right click the desired dialog to display the **Chat Configuration** window.
2. Switch to the **Model Setting** tab and adjust the **Max Tokens** slider to get the desired length.
3. Click **OK** to confirm your change.
### 2. What does Empty response mean? How to set it?
You limit what the system responds to what you specify in **Empty response** if nothing is retrieved from your knowledge base. If you do not specify anything in **Empty response**, you let your LLM improvise, giving it a chance to hallucinate.
### 3. Can I set the base URL for OpenAI somewhere?
![](https://github.com/infiniflow/ragflow/assets/93570324/8cfb6fa4-8a97-415d-b9fa-b6f405a055f3)
### 4. How to run RAGFlow with a locally deployed LLM?
You can use Ollama to deploy local LLM. See [here](https://github.com/infiniflow/ragflow/blob/main/docs/ollama.md) for more information.
### 5. How to link up ragflow and ollama servers?
- If RAGFlow is locally deployed, ensure that your RAGFlow and Ollama are in the same LAN.
- If you are using our online demo, ensure that the IP address of your Ollama server is public and accessible.
### 6. How to configure RAGFlow to respond with 100% matched results, rather than utilizing LLM?
1. Click **Knowledge Base** in the middle top of the page.
2. Right click the desired knowledge base to display the **Configuration** dialogue.
3. Choose **Q&A** as the chunk method and click **Save** to confirm your change.
### 7. Do I need to connect to Redis?
No, connecting to Redis is not required.
### 8. `Error: Range of input length should be [1, 30000]`
This error occurs because there are too many chunks matching your search criteria. Try reducing the **TopN** and increasing **Similarity threshold** to fix this issue:
1. Click **Chat** in the middle top of the page.
2. Right click the desired conversation > **Edit** > **Prompt Engine**
3. Reduce the **TopN** and/or raise the **Similarity threshold**.
4. Click **OK** to confirm your changes.
![topn](https://github.com/infiniflow/ragflow/assets/93570324/7ec72ab3-0dd2-4cff-af44-e2663b67b2fc)
### 9. How to update RAGFlow to the latest version?
1. Pull the latest source code
```bash
cd ragflow
git pull
```
2. If you used `docker compose up -d` to start up RAGFlow server:
```bash
docker pull infiniflow/ragflow:dev
```
```bash
docker compose up ragflow -d
```
3. If you used `docker compose -f docker-compose-CN.yml up -d` to start up RAGFlow server:
```bash
docker pull swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow:dev
```
```bash
docker compose -f docker-compose-CN.yml up -d
```

View File

@ -31,7 +31,7 @@ $ docker exec -it ollama ollama run mistral
<img src="https://github.com/infiniflow/ragflow/assets/12318111/a9df198a-226d-4f30-b8d7-829f00256d46" width="1300"/>
</div>
> Base URL: Enter the base URL where the Ollama service is accessible, like, http://<your-ollama-endpoint-domain>:11434
> Base URL: Enter the base URL where the Ollama service is accessible, like, `http://<your-ollama-endpoint-domain>:11434`.
- Use Ollama Models.

View File

@ -31,7 +31,7 @@ $ xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --
<img src="https://github.com/infiniflow/ragflow/assets/12318111/bcbf4d7a-ade6-44c7-ad5f-0a92c8a73789" width="1300"/>
</div>
> Base URL: Enter the base URL where the Xinference service is accessible, like, http://<your-xinference-endpoint-domain>:9997/v1
> Base URL: Enter the base URL where the Xinference service is accessible, like, `http://<your-xinference-endpoint-domain>:9997/v1`.
- Use Xinference Models.

67
printEnvironment.sh Normal file
View File

@ -0,0 +1,67 @@
#!/bin/bash
# The function is used to obtain distribution information
get_distro_info() {
local distro_id=$(lsb_release -i -s 2>/dev/null)
local distro_version=$(lsb_release -r -s 2>/dev/null)
local kernel_version=$(uname -r)
# If lsb_release is not available, try parsing the /etc/*-release files
if [ -z "$distro_id" ] || [ -z "$distro_version" ]; then
distro_id=$(grep '^ID=' /etc/*-release | cut -d= -f2 | tr -d '"')
distro_version=$(grep '^VERSION_ID=' /etc/*-release | cut -d= -f2 | tr -d '"')
fi
echo "$distro_id $distro_version (Kernel version: $kernel_version)"
}
# get Git repo name
git_repo_name=''
if git rev-parse --is-inside-work-tree > /dev/null 2>&1; then
git_repo_name=$(basename "$(git rev-parse --show-toplevel)")
if [ $? -ne 0 ]; then
git_repo_name="(Can't get repo name)"
fi
else
git_repo_name="It is NOT a Git repo"
fi
# get CPU type
cpu_model=$(uname -m)
# get memory size
memory_size=$(free -h | grep Mem | awk '{print $2}')
# get docker version
docker_version=''
if command -v docker &> /dev/null; then
docker_version=$(docker --version | cut -d ' ' -f3)
else
docker_version="Docker not installed"
fi
# get python version
python_version=''
if command -v python &> /dev/null; then
python_version=$(python --version | cut -d ' ' -f2)
else
python_version="Python not installed"
fi
# Print all information
echo "Current Repo: $git_repo_name"
# get Commit ID
git_version=$(git log -1 --pretty=format:'%h')
if [ -z "$git_version" ]; then
echo "Commit Id: The current directory is not a Git repository, or the Git command is not installed."
else
echo "Commit Id: $git_version"
fi
echo "Operating system: $(get_distro_info)"
echo "CPU Type: $cpu_model"
echo "Memory: $memory_size"
echo "Docker Version: $docker_version"
echo "Python Version: $python_version"

View File

@ -11,19 +11,21 @@
# limitations under the License.
#
import copy
from tika import parser
import re
from io import BytesIO
from rag.nlp import bullets_category, is_english, tokenize, remove_contents_table, \
hierarchical_merge, make_colon_as_title, naive_merge, random_choices, tokenize_table, add_positions, tokenize_chunks
from rag.nlp import huqie
hierarchical_merge, make_colon_as_title, naive_merge, random_choices, tokenize_table, add_positions, \
tokenize_chunks, find_codec
from rag.nlp import rag_tokenizer
from deepdoc.parser import PdfParser, DocxParser, PlainParser
class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -36,7 +38,7 @@ class Pdf(PdfParser):
start = timer()
self._layouts_rec(zoomin)
callback(0.67, "Layout analysis finished")
print("paddle layouts:", timer() - start)
print("layouts:", timer() - start)
self._table_transformer_job(zoomin)
callback(0.68, "Table analysis finished")
self._text_merge()
@ -61,12 +63,12 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
"""
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
pdf_parser = None
sections, tbls = [], []
if re.search(r"\.docx?$", filename, re.IGNORECASE):
if re.search(r"\.docx$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
doc_parser = DocxParser()
# TODO: table of contents need to be removed
@ -74,6 +76,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
binary if binary else filename, from_page=from_page, to_page=to_page)
remove_contents_table(sections, eng=is_english(
random_choices([t for t, _ in sections], k=200)))
tbls = [((None, lns), None) for lns in tbls]
callback(0.8, "Finish parsing.")
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
@ -87,7 +90,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback(0.1, "Start to parse.")
txt = ""
if binary:
txt = binary.decode("utf-8")
encoding = find_codec(binary)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -101,9 +105,19 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
random_choices([t for t, _ in sections], k=200)))
callback(0.8, "Finish parsing.")
elif re.search(r"\.doc$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
binary = BytesIO(binary)
doc_parsed = parser.from_buffer(binary)
sections = doc_parsed['content'].split('\n')
sections = [(l, "") for l in sections if l]
remove_contents_table(sections, eng=is_english(
random_choices([t for t, _ in sections], k=200)))
callback(0.8, "Finish parsing.")
else:
raise NotImplementedError(
"file type not supported yet(docx, pdf, txt supported)")
"file type not supported yet(doc, docx, pdf, txt supported)")
make_colon_as_title(sections)
bull = bullets_category(
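Across these chunkers the new legacy .doc branch follows the same pattern: wrap the raw bytes in BytesIO, hand them to Apache Tika, and keep the non-empty lines of the extracted text. A minimal standalone sketch of that flow, assuming tika-python is installed and a Tika server is reachable (the helper name load_doc_sections is illustrative, not part of this PR):

from io import BytesIO
from tika import parser  # tika-python contacts/starts a Tika server on first use

def load_doc_sections(binary: bytes):
    # Parse the legacy .doc payload with Tika and keep non-empty lines,
    # mirroring the elif ".doc" branch added above.
    doc_parsed = parser.from_buffer(BytesIO(binary))
    lines = (doc_parsed.get("content") or "").split("\n")
    return [(l, "") for l in lines if l]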

View File

@ -11,14 +11,15 @@
# limitations under the License.
#
import copy
from tika import parser
import re
from io import BytesIO
from docx import Document
from api.db import ParserType
from rag.nlp import bullets_category, is_english, tokenize, remove_contents_table, hierarchical_merge, \
make_colon_as_title, add_positions, tokenize_chunks
from rag.nlp import huqie
make_colon_as_title, add_positions, tokenize_chunks, find_codec
from rag.nlp import rag_tokenizer
from deepdoc.parser import PdfParser, DocxParser, PlainParser
from rag.settings import cron_logger
@ -57,7 +58,7 @@ class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -71,7 +72,7 @@ class Pdf(PdfParser):
start = timer()
self._layouts_rec(zoomin)
callback(0.67, "Layout analysis finished")
cron_logger.info("paddle layouts:".format(
cron_logger.info("layouts:".format(
(timer() - start) / (self.total_page + 0.1)))
self._naive_vertical_merge()
@ -88,12 +89,12 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
"""
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
pdf_parser = None
sections = []
if re.search(r"\.docx?$", filename, re.IGNORECASE):
if re.search(r"\.docx$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
for txt in Docx()(filename, binary):
sections.append(txt)
@ -111,7 +112,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback(0.1, "Start to parse.")
txt = ""
if binary:
txt = binary.decode("utf-8")
encoding = find_codec(binary)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -122,9 +124,18 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
sections = txt.split("\n")
sections = [l for l in sections if l]
callback(0.8, "Finish parsing.")
elif re.search(r"\.doc$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
binary = BytesIO(binary)
doc_parsed = parser.from_buffer(binary)
sections = doc_parsed['content'].split('\n')
sections = [l for l in sections if l]
callback(0.8, "Finish parsing.")
else:
raise NotImplementedError(
"file type not supported yet(docx, pdf, txt supported)")
"file type not supported yet(doc, docx, pdf, txt supported)")
# is it English
eng = lang.lower() == "english" # is_english(sections)

View File

@ -2,7 +2,7 @@ import copy
import re
from api.db import ParserType
from rag.nlp import huqie, tokenize, tokenize_table, add_positions, bullets_category, title_frequency, tokenize_chunks
from rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, title_frequency, tokenize_chunks
from deepdoc.parser import PdfParser, PlainParser
from rag.utils import num_tokens_from_string
@ -16,7 +16,7 @@ class Pdf(PdfParser):
to_page=100000, zoomin=3, callback=None):
from timeit import default_timer as timer
start = timer()
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -32,7 +32,7 @@ class Pdf(PdfParser):
self._layouts_rec(zoomin)
callback(0.65, "Layout analysis finished.")
print("paddle layouts:", timer() - start)
print("layouts:", timer() - start)
self._table_transformer_job(zoomin)
callback(0.67, "Table analysis finished.")
self._text_merge()
@ -70,8 +70,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
doc = {
"docnm_kwd": filename
}
doc["title_tks"] = huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", doc["docnm_kwd"]))
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_tks"] = rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", doc["docnm_kwd"]))
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
# is it English
eng = lang.lower() == "english" # pdf_parser.is_english

View File

@ -10,12 +10,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from tika import parser
from io import BytesIO
from docx import Document
from timeit import default_timer as timer
import re
from deepdoc.parser.pdf_parser import PlainParser
from rag.app import laws
from rag.nlp import huqie, is_english, tokenize, naive_merge, tokenize_table, add_positions, tokenize_chunks
from rag.nlp import rag_tokenizer, naive_merge, tokenize_table, tokenize_chunks, find_codec
from deepdoc.parser import PdfParser, ExcelParser, DocxParser
from rag.settings import cron_logger
@ -67,7 +68,8 @@ class Docx(DocxParser):
class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
start = timer()
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -76,12 +78,11 @@ class Pdf(PdfParser):
callback
)
callback(msg="OCR finished")
cron_logger.info("OCR({}~{}): {}".format(from_page, to_page, timer() - start))
from timeit import default_timer as timer
start = timer()
self._layouts_rec(zoomin)
callback(0.63, "Layout analysis finished.")
print("paddle layouts:", timer() - start)
self._table_transformer_job(zoomin)
callback(0.65, "Table analysis finished.")
self._text_merge()
@ -91,8 +92,7 @@ class Pdf(PdfParser):
self._concat_downward()
#self._filter_forpages()
cron_logger.info("paddle layouts:".format(
(timer() - start) / (self.total_page + 0.1)))
cron_logger.info("layouts: {}".format(timer() - start))
return [(b["text"], self._line_tag(b, zoomin))
for b in self.boxes], tbls
@ -112,13 +112,13 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
"chunk_token_num": 128, "delimiter": "\n!?。;!?", "layout_recognize": True})
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
res = []
pdf_parser = None
sections = []
if re.search(r"\.docx?$", filename, re.IGNORECASE):
if re.search(r"\.docx$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
sections, tbls = Docx()(filename, binary)
res = tokenize_table(tbls, doc, eng)
@ -136,11 +136,12 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
excel_parser = ExcelParser()
sections = [(excel_parser.html(binary), "")]
elif re.search(r"\.txt$", filename, re.IGNORECASE):
elif re.search(r"\.(txt|md)$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
txt = ""
if binary:
txt = binary.decode("utf-8")
encoding = find_codec(binary)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -152,16 +153,26 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
sections = [(l, "") for l in sections if l]
callback(0.8, "Finish parsing.")
elif re.search(r"\.doc$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
binary = BytesIO(binary)
doc_parsed = parser.from_buffer(binary)
sections = doc_parsed['content'].split('\n')
sections = [(l, "") for l in sections if l]
callback(0.8, "Finish parsing.")
else:
raise NotImplementedError(
"file type not supported yet(docx, pdf, txt supported)")
"file type not supported yet(doc, docx, pdf, txt supported)")
st = timer()
chunks = naive_merge(
sections, parser_config.get(
"chunk_token_num", 128), parser_config.get(
"delimiter", "\n!?。;!?"))
res.extend(tokenize_chunks(chunks, doc, eng, pdf_parser))
cron_logger.info("naive_merge({}): {}".format(filename, timer() - st))
return res

View File

@ -10,16 +10,18 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from tika import parser
from io import BytesIO
import re
from rag.app import laws
from rag.nlp import huqie, tokenize
from rag.nlp import rag_tokenizer, tokenize, find_codec
from deepdoc.parser import PdfParser, ExcelParser, PlainParser
class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -33,7 +35,7 @@ class Pdf(PdfParser):
start = timer()
self._layouts_rec(zoomin, drop=False)
callback(0.63, "Layout analysis finished.")
print("paddle layouts:", timer() - start)
print("layouts:", timer() - start)
self._table_transformer_job(zoomin)
callback(0.65, "Table analysis finished.")
self._text_merge()
@ -60,7 +62,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
eng = lang.lower() == "english" # is_english(cks)
if re.search(r"\.docx?$", filename, re.IGNORECASE):
if re.search(r"\.docx$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
sections = [txt for txt in laws.Docx()(filename, binary) if txt]
callback(0.8, "Finish parsing.")
@ -82,7 +84,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback(0.1, "Start to parse.")
txt = ""
if binary:
txt = binary.decode("utf-8")
encoding = find_codec(binary)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -94,15 +97,23 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
sections = [s for s in sections if s]
callback(0.8, "Finish parsing.")
elif re.search(r"\.doc$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
binary = BytesIO(binary)
doc_parsed = parser.from_buffer(binary)
sections = doc_parsed['content'].split('\n')
sections = [l for l in sections if l]
callback(0.8, "Finish parsing.")
else:
raise NotImplementedError(
"file type not supported yet(docx, pdf, txt supported)")
"file type not supported yet(doc, docx, pdf, txt supported)")
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
tokenize(doc, "\n".join(sections), eng)
return [doc]

View File

@ -15,7 +15,7 @@ import re
from collections import Counter
from api.db import ParserType
from rag.nlp import huqie, tokenize, tokenize_table, add_positions, bullets_category, title_frequency, tokenize_chunks
from rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, title_frequency, tokenize_chunks
from deepdoc.parser import PdfParser, PlainParser
import numpy as np
from rag.utils import num_tokens_from_string
@ -28,7 +28,7 @@ class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -42,7 +42,7 @@ class Pdf(PdfParser):
start = timer()
self._layouts_rec(zoomin)
callback(0.63, "Layout analysis finished")
print("paddle layouts:", timer() - start)
print("layouts:", timer() - start)
self._table_transformer_job(zoomin)
callback(0.68, "Table analysis finished")
self._text_merge()
@ -78,7 +78,7 @@ class Pdf(PdfParser):
title = ""
authors = []
i = 0
while i < min(32, len(self.boxes)):
while i < min(32, len(self.boxes)-1):
b = self.boxes[i]
i += 1
if b.get("layoutno", "").find("title") >= 0:
@ -153,10 +153,10 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
else:
raise NotImplementedError("file type not supported yet(pdf supported)")
doc = {"docnm_kwd": filename, "authors_tks": huqie.qie(paper["authors"]),
"title_tks": huqie.qie(paper["title"] if paper["title"] else filename)}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["authors_sm_tks"] = huqie.qieqie(doc["authors_tks"])
doc = {"docnm_kwd": filename, "authors_tks": rag_tokenizer.tokenize(paper["authors"]),
"title_tks": rag_tokenizer.tokenize(paper["title"] if paper["title"] else filename)}
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
doc["authors_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["authors_tks"])
# is it English
eng = lang.lower() == "english" # pdf_parser.is_english
print("It's English.....", eng)

View File

@ -17,7 +17,7 @@ from io import BytesIO
from PIL import Image
from rag.nlp import tokenize, is_english
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from deepdoc.parser import PdfParser, PptParser, PlainParser
from PyPDF2 import PdfReader as pdf2_read
@ -58,7 +58,7 @@ class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(filename if not binary else binary,
zoomin, from_page, to_page, callback)
callback(0.8, "Page {}~{}: OCR finished".format(
@ -96,9 +96,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
eng = lang.lower() == "english"
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
res = []
if re.search(r"\.pptx?$", filename, re.IGNORECASE):
ppt_parser = Ppt()

View File

@ -15,8 +15,8 @@ from copy import deepcopy
from io import BytesIO
from nltk import word_tokenize
from openpyxl import load_workbook
from rag.nlp import is_english, random_choices
from rag.nlp import huqie
from rag.nlp import is_english, random_choices, find_codec
from rag.nlp import rag_tokenizer
from deepdoc.parser import ExcelParser
@ -73,8 +73,8 @@ def beAdoc(d, q, a, eng):
aprefix = "Answer: " if eng else "回答:"
d["content_with_weight"] = "\t".join(
[qprefix + rmPrefix(q), aprefix + rmPrefix(a)])
d["content_ltks"] = huqie.qie(q)
d["content_sm_ltks"] = huqie.qieqie(d["content_ltks"])
d["content_ltks"] = rag_tokenizer.tokenize(q)
d["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(d["content_ltks"])
return d
@ -94,7 +94,7 @@ def chunk(filename, binary=None, lang="Chinese", callback=None, **kwargs):
res = []
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
if re.search(r"\.xlsx?$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
@ -106,7 +106,8 @@ def chunk(filename, binary=None, lang="Chinese", callback=None, **kwargs):
callback(0.1, "Start to parse.")
txt = ""
if binary:
txt = binary.decode("utf-8")
encoding = find_codec(binary)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -115,18 +116,31 @@ def chunk(filename, binary=None, lang="Chinese", callback=None, **kwargs):
break
txt += l
lines = txt.split("\n")
#is_english([rmPrefix(l) for l in lines[:100]])
comma, tab = 0, 0
for l in lines:
if len(l.split(",")) == 2: comma += 1
if len(l.split("\t")) == 2: tab += 1
delimiter = "\t" if tab >= comma else ","
fails = []
for i, line in enumerate(lines):
arr = [l for l in line.split("\t") if len(l) > 1]
question, answer = "", ""
i = 0
while i < len(lines):
arr = lines[i].split(delimiter)
if len(arr) != 2:
fails.append(str(i))
continue
res.append(beAdoc(deepcopy(doc), arr[0], arr[1], eng))
if question: answer += "\n" + lines[i]
else:
fails.append(str(i+1))
elif len(arr) == 2:
if question and answer: res.append(beAdoc(deepcopy(doc), question, answer, eng))
question, answer = arr
i += 1
if len(res) % 999 == 0:
callback(len(res) * 0.6 / len(lines), ("Extract Q&A: {}".format(len(res)) + (
f"{len(fails)} failure, line: %s..." % (",".join(fails[:3])) if fails else "")))
if question: res.append(beAdoc(deepcopy(doc), question, answer, eng))
callback(0.6, ("Extract Q&A: {}".format(len(res)) + (
f"{len(fails)} failure, line: %s..." % (",".join(fails[:3])) if fails else "")))

View File

@ -18,7 +18,7 @@ import re
import pandas as pd
import requests
from api.db.services.knowledgebase_service import KnowledgebaseService
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from deepdoc.parser.resume import refactor
from deepdoc.parser.resume import step_one, step_two
from rag.settings import cron_logger
@ -131,9 +131,9 @@ def chunk(filename, binary=None, callback=None, **kwargs):
titles.append(str(v))
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie("-".join(titles) + "-简历")
"title_tks": rag_tokenizer.tokenize("-".join(titles) + "-简历")
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
pairs = []
for n, m in field_map.items():
if not resume.get(n):
@ -147,8 +147,8 @@ def chunk(filename, binary=None, callback=None, **kwargs):
doc["content_with_weight"] = "\n".join(
["{}: {}".format(re.sub(r"[^]+", "", k), v) for k, v in pairs])
doc["content_ltks"] = huqie.qie(doc["content_with_weight"])
doc["content_sm_ltks"] = huqie.qieqie(doc["content_ltks"])
doc["content_ltks"] = rag_tokenizer.tokenize(doc["content_with_weight"])
doc["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(doc["content_ltks"])
for n, _ in field_map.items():
if n not in resume:
continue
@ -156,7 +156,7 @@ def chunk(filename, binary=None, callback=None, **kwargs):
len(resume[n]) == 1 or n not in forbidden_select_fields4resume):
resume[n] = resume[n][0]
if n.find("_tks") > 0:
resume[n] = huqie.qieqie(resume[n])
resume[n] = rag_tokenizer.fine_grained_tokenize(resume[n])
doc[n] = resume[n]
print(doc)

View File

@ -20,7 +20,7 @@ from openpyxl import load_workbook
from dateutil.parser import parse as datetime_parse
from api.db.services.knowledgebase_service import KnowledgebaseService
from rag.nlp import huqie, is_english, tokenize
from rag.nlp import rag_tokenizer, is_english, tokenize, find_codec
from deepdoc.parser import ExcelParser
@ -47,6 +47,7 @@ class Excel(ExcelParser):
cell.value for i,
cell in enumerate(
rows[0]) if i not in missed]
if not headers:continue
data = []
for i, r in enumerate(rows[1:]):
rn += 1
@ -147,7 +148,8 @@ def chunk(filename, binary=None, from_page=0, to_page=10000000000,
callback(0.1, "Start to parse.")
txt = ""
if binary:
txt = binary.decode("utf-8")
encoding = find_codec(binary)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -199,7 +201,7 @@ def chunk(filename, binary=None, from_page=0, to_page=10000000000,
re.sub(
r"(/.*|[^]+?|\([^()]+?\))",
"",
n),
str(n)),
'_')[0] for n in clmns]
clmn_tys = []
for j in range(len(clmns)):
@ -208,14 +210,14 @@ def chunk(filename, binary=None, from_page=0, to_page=10000000000,
df[clmns[j]] = cln
if ty == "text":
txts.extend([str(c) for c in cln if c])
clmns_map = [(py_clmns[i].lower() + fieds_map[clmn_tys[i]], clmns[i].replace("_", " "))
clmns_map = [(py_clmns[i].lower() + fieds_map[clmn_tys[i]], str(clmns[i]).replace("_", " "))
for i in range(len(clmns))]
eng = lang.lower() == "english" # is_english(txts)
for ii, row in df.iterrows():
d = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
row_txt = []
for j in range(len(clmns)):
@ -223,10 +225,10 @@ def chunk(filename, binary=None, from_page=0, to_page=10000000000,
continue
if not str(row[clmns[j]]):
continue
#if pd.isna(row[clmns[j]]):
# continue
if pd.isna(row[clmns[j]]):
continue
fld = clmns_map[j][0]
d[fld] = row[clmns[j]] if clmn_tys[j] != "text" else huqie.qie(
d[fld] = row[clmns[j]] if clmn_tys[j] != "text" else rag_tokenizer.tokenize(
row[clmns[j]])
row_txt.append("{}:{}".format(clmns[j], row[clmns[j]]))
if not row_txt:

View File

@ -22,10 +22,10 @@ EmbeddingModel = {
"Ollama": OllamaEmbed,
"OpenAI": OpenAIEmbed,
"Xinference": XinferenceEmbed,
"Tongyi-Qianwen": HuEmbedding, #QWenEmbed,
"Tongyi-Qianwen": DefaultEmbedding, #QWenEmbed,
"ZHIPU-AI": ZhipuEmbed,
"Moonshot": HuEmbedding,
"FastEmbed": FastEmbed
"FastEmbed": FastEmbed,
"Youdao": YoudaoEmbed
}
@ -45,6 +45,7 @@ ChatModel = {
"Tongyi-Qianwen": QWenChat,
"Ollama": OllamaChat,
"Xinference": XinferenceChat,
"Moonshot": MoonshotChat
"Moonshot": MoonshotChat,
"DeepSeek": DeepSeekChat
}
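These dictionaries act as simple factory registries: the provider name stored with a tenant's model configuration selects the class, which is then instantiated with the API key and model name. A hedged usage sketch (the exact call site is not shown in this diff, so treat the arguments as an assumption):

# Hypothetical call site: look the provider up by name and build a chat client.
chat_cls = ChatModel["DeepSeek"]
chat_mdl = chat_cls(key="sk-...", model_name="deepseek-chat")
answer, used_tokens = chat_mdl.chat("You are a helpful assistant.",
                                    [{"role": "user", "content": "Hello"}],
                                    {"temperature": 0.7})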

View File

@ -24,16 +24,7 @@ from rag.utils import num_tokens_from_string
class Base(ABC):
def __init__(self, key, model_name):
pass
def chat(self, system, history, gen_conf):
raise NotImplementedError("Please implement encode method!")
class GptTurbo(Base):
def __init__(self, key, model_name="gpt-3.5-turbo", base_url="https://api.openai.com/v1"):
if not base_url: base_url="https://api.openai.com/v1"
def __init__(self, key, model_name, base_url):
self.client = OpenAI(api_key=key, base_url=base_url)
self.model_name = model_name
@ -54,28 +45,28 @@ class GptTurbo(Base):
return "**ERROR**: " + str(e), 0
class MoonshotChat(GptTurbo):
class GptTurbo(Base):
def __init__(self, key, model_name="gpt-3.5-turbo", base_url="https://api.openai.com/v1"):
if not base_url: base_url="https://api.openai.com/v1"
super().__init__(key, model_name, base_url)
class MoonshotChat(Base):
def __init__(self, key, model_name="moonshot-v1-8k", base_url="https://api.moonshot.cn/v1"):
if not base_url: base_url="https://api.moonshot.cn/v1"
self.client = OpenAI(
api_key=key, base_url=base_url)
self.model_name = model_name
super().__init__(key, model_name, base_url)
def chat(self, system, history, gen_conf):
if system:
history.insert(0, {"role": "system", "content": system})
try:
response = self.client.chat.completions.create(
model=self.model_name,
messages=history,
**gen_conf)
ans = response.choices[0].message.content.strip()
if response.choices[0].finish_reason == "length":
ans += "...\nFor the content length reason, it stopped, continue?" if is_english(
[ans]) else "······\n由于长度的原因,回答被截断了,要继续吗?"
return ans, response.usage.total_tokens
except openai.APIError as e:
return "**ERROR**: " + str(e), 0
class XinferenceChat(Base):
def __init__(self, key=None, model_name="", base_url=""):
key = "xxx"
super().__init__(key, model_name, base_url)
class DeepSeekChat(Base):
def __init__(self, key, model_name="deepseek-chat", base_url="https://api.deepseek.com/v1"):
if not base_url: base_url="https://api.deepseek.com/v1"
super().__init__(key, model_name, base_url)
class QWenChat(Base):
@ -141,41 +132,19 @@ class OllamaChat(Base):
if system:
history.insert(0, {"role": "system", "content": system})
try:
options = {"temperature": gen_conf.get("temperature", 0.1),
"num_predict": gen_conf.get("max_tokens", 128),
"top_k": gen_conf.get("top_p", 0.3),
"presence_penalty": gen_conf.get("presence_penalty", 0.4),
"frequency_penalty": gen_conf.get("frequency_penalty", 0.7),
}
options = {}
if "temperature" in gen_conf: options["temperature"] = gen_conf["temperature"]
if "max_tokens" in gen_conf: options["num_predict"] = gen_conf["max_tokens"]
if "top_p" in gen_conf: options["top_k"] = gen_conf["top_p"]
if "presence_penalty" in gen_conf: options["presence_penalty"] = gen_conf["presence_penalty"]
if "frequency_penalty" in gen_conf: options["frequency_penalty"] = gen_conf["frequency_penalty"]
response = self.client.chat(
model=self.model_name,
messages=history,
options=options
)
ans = response["message"]["content"].strip()
return ans, response["eval_count"] + response["prompt_eval_count"]
return ans, response["eval_count"] + response.get("prompt_eval_count", 0)
except Exception as e:
return "**ERROR**: " + str(e), 0
class XinferenceChat(Base):
def __init__(self, key=None, model_name="", base_url=""):
self.client = OpenAI(api_key="xxx", base_url=base_url)
self.model_name = model_name
def chat(self, system, history, gen_conf):
if system:
history.insert(0, {"role": "system", "content": system})
try:
response = self.client.chat.completions.create(
model=self.model_name,
messages=history,
**gen_conf)
ans = response.choices[0].message.content.strip()
if response.choices[0].finish_reason == "length":
ans += "...\nFor the content length reason, it stopped, continue?" if is_english(
[ans]) else "······\n由于长度的原因,回答被截断了,要继续吗?"
return ans, response.usage.total_tokens
except openai.APIError as e:
return "**ERROR**: " + str(e), 0

View File

@ -14,30 +14,33 @@
# limitations under the License.
#
from typing import Optional
from huggingface_hub import snapshot_download
from zhipuai import ZhipuAI
import os
from abc import ABC
from ollama import Client
import dashscope
from openai import OpenAI
from fastembed import TextEmbedding
from FlagEmbedding import FlagModel
import torch
import numpy as np
from api.utils.file_utils import get_project_base_directory
from api.utils.file_utils import get_project_base_directory, get_home_cache_dir
from rag.utils import num_tokens_from_string
try:
flag_model = FlagModel(os.path.join(
get_project_base_directory(),
"rag/res/bge-large-zh-v1.5"),
flag_model = FlagModel(os.path.join(get_home_cache_dir(), "bge-large-zh-v1.5"),
query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
use_fp16=torch.cuda.is_available())
except Exception as e:
model_dir = snapshot_download(repo_id="BAAI/bge-large-zh-v1.5",
local_dir=os.path.join(get_home_cache_dir(), "bge-large-zh-v1.5"),
local_dir_use_symlinks=False)
flag_model = FlagModel(model_dir,
query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
use_fp16=torch.cuda.is_available())
except Exception as e:
flag_model = FlagModel("BAAI/bge-large-zh-v1.5",
query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
use_fp16=torch.cuda.is_available())
class Base(ABC):
@ -51,7 +54,7 @@ class Base(ABC):
raise NotImplementedError("Please implement encode method!")
class HuEmbedding(Base):
class DefaultEmbedding(Base):
def __init__(self, *args, **kwargs):
"""
If you have trouble downloading HuggingFace models, -_^ this might help!!
@ -82,16 +85,17 @@ class HuEmbedding(Base):
class OpenAIEmbed(Base):
def __init__(self, key, model_name="text-embedding-ada-002", base_url="https://api.openai.com/v1"):
if not base_url: base_url="https://api.openai.com/v1"
def __init__(self, key, model_name="text-embedding-ada-002",
base_url="https://api.openai.com/v1"):
if not base_url:
base_url = "https://api.openai.com/v1"
self.client = OpenAI(api_key=key, base_url=base_url)
self.model_name = model_name
def encode(self, texts: list, batch_size=32):
res = self.client.embeddings.create(input=texts,
model=self.model_name)
return np.array([d.embedding for d in res.data]
), res.usage.total_tokens
return np.array([d.embedding for d in res.data]), res.usage.total_tokens
def encode_queries(self, text):
res = self.client.embeddings.create(input=[text],
@ -142,7 +146,7 @@ class ZhipuEmbed(Base):
tks_num = 0
for txt in texts:
res = self.client.embeddings.create(input=txt,
model=self.model_name)
model=self.model_name)
arr.append(res.data[0].embedding)
tks_num += res.usage.total_tokens
return np.array(arr), tks_num
@ -163,14 +167,14 @@ class OllamaEmbed(Base):
tks_num = 0
for txt in texts:
res = self.client.embeddings(prompt=txt,
model=self.model_name)
model=self.model_name)
arr.append(res["embedding"])
tks_num += 128
return np.array(arr), tks_num
def encode_queries(self, text):
res = self.client.embeddings(prompt=text,
model=self.model_name)
model=self.model_name)
return np.array(res["embedding"]), 128
@ -183,10 +187,12 @@ class FastEmbed(Base):
threads: Optional[int] = None,
**kwargs,
):
from fastembed import TextEmbedding
self._model = TextEmbedding(model_name, cache_dir, threads, **kwargs)
def encode(self, texts: list, batch_size=32):
# Using the internal tokenizer to encode the texts and get the total number of tokens
# Using the internal tokenizer to encode the texts and get the total
# number of tokens
encodings = self._model.model.tokenizer.encode_batch(texts)
total_tokens = sum(len(e) for e in encodings)
@ -195,7 +201,8 @@ class FastEmbed(Base):
return np.array(embeddings), total_tokens
def encode_queries(self, text: str):
# Using the internal tokenizer to encode the texts and get the total number of tokens
# Using the internal tokenizer to encode the texts and get the total
# number of tokens
encoding = self._model.model.tokenizer.encode(text)
embedding = next(self._model.query_embed(text)).tolist()
@ -218,3 +225,33 @@ class XinferenceEmbed(Base):
model=self.model_name)
return np.array(res.data[0].embedding), res.usage.total_tokens
class YoudaoEmbed(Base):
_client = None
def __init__(self, key=None, model_name="maidalun1020/bce-embedding-base_v1", **kwargs):
from BCEmbedding import EmbeddingModel as qanthing
if not YoudaoEmbed._client:
try:
print("LOADING BCE...")
YoudaoEmbed._client = qanthing(model_name_or_path=os.path.join(
get_home_cache_dir(),
"bce-embedding-base_v1"))
except Exception as e:
YoudaoEmbed._client = qanthing(
model_name_or_path=model_name.replace(
"maidalun1020", "InfiniFlow"))
def encode(self, texts: list, batch_size=10):
res = []
token_count = 0
for t in texts:
token_count += num_tokens_from_string(t)
for i in range(0, len(texts), batch_size):
embds = YoudaoEmbed._client.encode(texts[i:i + batch_size])
res.extend(embds)
return np.array(res), token_count
def encode_queries(self, text):
embds = YoudaoEmbed._client.encode([text])
return np.array(embds[0]), num_tokens_from_string(text)
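DefaultEmbedding (and YoudaoEmbed above) now resolve weights in three steps: try a copy already under the home cache directory, fall back to pulling it with huggingface_hub.snapshot_download into that directory, and finally let the library resolve the bare model id itself. A condensed sketch of that fallback chain, assuming the same FlagEmbedding API used in the diff:

import os
import torch
from huggingface_hub import snapshot_download
from FlagEmbedding import FlagModel

def load_bge(cache_dir: str) -> FlagModel:
    local = os.path.join(cache_dir, "bge-large-zh-v1.5")
    kwargs = dict(
        query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
        use_fp16=torch.cuda.is_available(),
    )
    try:
        # 1) Weights already cached locally.
        return FlagModel(local, **kwargs)
    except Exception:
        pass
    try:
        # 2) Download into the cache dir, then load from there.
        model_dir = snapshot_download(repo_id="BAAI/bge-large-zh-v1.5",
                                      local_dir=local,
                                      local_dir_use_symlinks=False)
        return FlagModel(model_dir, **kwargs)
    except Exception:
        # 3) Last resort: let FlagEmbedding resolve the model id itself.
        return FlagModel("BAAI/bge-large-zh-v1.5", **kwargs)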

View File

@ -2,10 +2,45 @@ import random
from collections import Counter
from rag.utils import num_tokens_from_string
from . import huqie
from . import rag_tokenizer
import re
import copy
all_codecs = [
'utf-8', 'gb2312', 'gbk', 'utf_16', 'ascii', 'big5', 'big5hkscs',
'cp037', 'cp273', 'cp424', 'cp437',
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857',
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869',
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125',
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256',
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
'gb2312', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13',
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u',
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213',
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16_be', 'utf_16_le', 'utf_7'
]
def find_codec(blob):
global all_codecs
for c in all_codecs:
try:
blob[:1024].decode(c)
return c
except Exception as e:
pass
try:
blob.decode(c)
return c
except Exception as e:
pass
return "utf-8"
BULLET_PATTERN = [[
r"第[零一二三四五六七八九十百0-9]+(分?编|部分)",
@ -80,8 +115,8 @@ def is_english(texts):
def tokenize(d, t, eng):
d["content_with_weight"] = t
t = re.sub(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>", " ", t)
d["content_ltks"] = huqie.qie(t)
d["content_sm_ltks"] = huqie.qieqie(d["content_ltks"])
d["content_ltks"] = rag_tokenizer.tokenize(t)
d["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(d["content_ltks"])
def tokenize_chunks(chunks, doc, eng, pdf_parser):
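find_codec is what the chunkers above now call before decoding raw uploads: it probes the first 1 KB (then the whole blob) against the candidate codecs and falls back to UTF-8. A minimal usage sketch matching the pattern used throughout this PR:

from rag.nlp import find_codec

def decode_upload(binary: bytes) -> str:
    # Detect the most plausible codec, then decode permissively.
    encoding = find_codec(binary)
    return binary.decode(encoding, errors="ignore")

# e.g. a GBK-encoded payload still comes back as readable text:
# decode_upload("人工智能".encode("gbk"))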

View File

@ -1,475 +0,0 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import re
import os
import copy
import base64
import magic
from dataclasses import dataclass
from typing import List
import numpy as np
from io import BytesIO
class HuChunker:
@dataclass
class Fields:
text_chunks: List = None
table_chunks: List = None
def __init__(self):
self.MAX_LVL = 12
self.proj_patt = [
(r"第[零一二三四五六七八九十百]+章", 1),
(r"第[零一二三四五六七八九十百]+[条节]", 2),
(r"[零一二三四五六七八九十百]+[、  ]", 3),
(r"[\(][零一二三四五六七八九十百]+[\)]", 4),
(r"[0-9]+(、|\.[  ]|\.[^0-9])", 5),
(r"[0-9]+\.[0-9]+(、|[  ]|[^0-9])", 6),
(r"[0-9]+\.[0-9]+\.[0-9]+(、|[  ]|[^0-9])", 7),
(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(、|[  ]|[^0-9])", 8),
(r".{,48}[:?]@", 9),
(r"[0-9]+", 10),
(r"[\(][0-9]+[\)]", 11),
(r"[零一二三四五六七八九十百]+是", 12),
(r"[⚫•➢✓ ]", 12)
]
self.lines = []
def _garbage(self, txt):
patt = [
r"(在此保证|不得以任何形式翻版|请勿传阅|仅供内部使用|未经事先书面授权)",
r"(版权(归本公司)*所有|免责声明|保留一切权力|承担全部责任|特别声明|报告中涉及)",
r"(不承担任何责任|投资者的通知事项:|任何机构和个人|本报告仅为|不构成投资)",
r"(不构成对任何个人或机构投资建议|联系其所在国家|本报告由从事证券交易)",
r"(本研究报告由|「认可投资者」|所有研究报告均以|请发邮件至)",
r"(本报告仅供|市场有风险,投资需谨慎|本报告中提及的)",
r"(本报告反映|此信息仅供|证券分析师承诺|具备证券投资咨询业务资格)",
r"^(时间|签字|签章)[:]",
r"(参考文献|目录索引|图表索引)",
r"[ ]*年[ ]+月[ ]+日",
r"^(中国证券业协会|[0-9]+年[0-9]+月[0-9]+日)$",
r"\.{10,}",
r"(———————END|帮我转发|欢迎收藏|快来关注我吧)"
]
return any([re.search(p, txt) for p in patt])
def _proj_match(self, line):
for p, j in self.proj_patt:
if re.match(p, line):
return j
return
def _does_proj_match(self):
mat = [None for _ in range(len(self.lines))]
for i in range(len(self.lines)):
mat[i] = self._proj_match(self.lines[i])
return mat
def naive_text_chunk(self, text, ti="", MAX_LEN=612):
if text:
self.lines = [l.strip().replace(u'\u3000', u' ')
.replace(u'\xa0', u'')
for l in text.split("\n\n")]
self.lines = [l for l in self.lines if not self._garbage(l)]
self.lines = [re.sub(r"([ ]+|&nbsp;)", " ", l)
for l in self.lines if l]
if not self.lines:
return []
arr = self.lines
res = [""]
i = 0
while i < len(arr):
a = arr[i]
if not a:
i += 1
continue
if len(a) > MAX_LEN:
a_ = a.split("\n")
if len(a_) >= 2:
arr.pop(i)
for j in range(2, len(a_) + 1):
if len("\n".join(a_[:j])) >= MAX_LEN:
arr.insert(i, "\n".join(a_[:j - 1]))
arr.insert(i + 1, "\n".join(a_[j - 1:]))
break
else:
assert False, f"Can't split: {a}"
continue
if len(res[-1]) < MAX_LEN / 3:
res[-1] += "\n" + a
else:
res.append(a)
i += 1
if ti:
for i in range(len(res)):
if res[i].find("——来自") >= 0:
continue
res[i] += f"\t——来自“{ti}"
return res
def _merge(self):
# merge continuous same level text
lines = [self.lines[0]] if self.lines else []
for i in range(1, len(self.lines)):
if self.mat[i] == self.mat[i - 1] \
and len(lines[-1]) < 256 \
and len(self.lines[i]) < 256:
lines[-1] += "\n" + self.lines[i]
continue
lines.append(self.lines[i])
self.lines = lines
self.mat = self._does_proj_match()
return self.mat
def text_chunks(self, text):
if text:
self.lines = [l.strip().replace(u'\u3000', u' ')
.replace(u'\xa0', u'')
for l in re.split(r"[\r\n]", text)]
self.lines = [l for l in self.lines if not self._garbage(l)]
self.lines = [l for l in self.lines if l]
self.mat = self._does_proj_match()
mat = self._merge()
tree = []
for i in range(len(self.lines)):
tree.append({"proj": mat[i],
"children": [],
"read": False})
# find all children
for i in range(len(self.lines) - 1):
if tree[i]["proj"] is None:
continue
ed = i + 1
while ed < len(tree) and (tree[ed]["proj"] is None or
tree[ed]["proj"] > tree[i]["proj"]):
ed += 1
nxt = tree[i]["proj"] + 1
st = set([p["proj"] for p in tree[i + 1: ed] if p["proj"]])
while nxt not in st:
nxt += 1
if nxt > self.MAX_LVL:
break
if nxt <= self.MAX_LVL:
for j in range(i + 1, ed):
if tree[j]["proj"] is not None:
break
tree[i]["children"].append(j)
for j in range(i + 1, ed):
if tree[j]["proj"] != nxt:
continue
tree[i]["children"].append(j)
else:
for j in range(i + 1, ed):
tree[i]["children"].append(j)
# get DFS combinations, find all the paths to leaf
paths = []
def dfs(i, path):
nonlocal tree, paths
path.append(i)
tree[i]["read"] = True
if len(self.lines[i]) > 256:
paths.append(path)
return
if not tree[i]["children"]:
if len(path) > 1 or len(self.lines[i]) >= 32:
paths.append(path)
return
for j in tree[i]["children"]:
dfs(j, copy.deepcopy(path))
for i, t in enumerate(tree):
if t["read"]:
continue
dfs(i, [])
# concat txt on the path for all paths
res = []
lines = np.array(self.lines)
for p in paths:
if len(p) < 2:
tree[p[0]]["read"] = False
continue
txt = "\n".join(lines[p[:-1]]) + "\n" + lines[p[-1]]
res.append(txt)
# concat continuous orphans
assert len(tree) == len(lines)
ii = 0
while ii < len(tree):
if tree[ii]["read"]:
ii += 1
continue
txt = lines[ii]
e = ii + 1
while e < len(tree) and not tree[e]["read"] and len(txt) < 256:
txt += "\n" + lines[e]
e += 1
res.append(txt)
ii = e
# if the node has not been read, find its daddy
def find_daddy(st):
nonlocal lines, tree
proj = tree[st]["proj"]
if len(self.lines[st]) > 512:
return [st]
if proj is None:
proj = self.MAX_LVL + 1
for i in range(st - 1, -1, -1):
if tree[i]["proj"] and tree[i]["proj"] < proj:
a = [st] + find_daddy(i)
return a
return []
return res
class PdfChunker(HuChunker):
def __init__(self, pdf_parser):
self.pdf = pdf_parser
super().__init__()
def tableHtmls(self, pdfnm):
_, tbls = self.pdf(pdfnm, return_html=True)
res = []
for img, arr in tbls:
if arr[0].find("<table>") < 0:
continue
buffered = BytesIO()
if img:
img.save(buffered, format="JPEG")
img_str = base64.b64encode(
buffered.getvalue()).decode('utf-8') if img else ""
res.append({"table": arr[0], "image": img_str})
return res
def html(self, pdfnm):
txts, tbls = self.pdf(pdfnm, return_html=True)
res = []
txt_cks = self.text_chunks(txts)
for txt, img in [(self.pdf.remove_tag(c), self.pdf.crop(c))
for c in txt_cks]:
buffered = BytesIO()
if img:
img.save(buffered, format="JPEG")
img_str = base64.b64encode(
buffered.getvalue()).decode('utf-8') if img else ""
res.append({"table": "<p>%s</p>" % txt.replace("\n", "<br/>"),
"image": img_str})
for img, arr in tbls:
if not arr:
continue
buffered = BytesIO()
if img:
img.save(buffered, format="JPEG")
img_str = base64.b64encode(
buffered.getvalue()).decode('utf-8') if img else ""
res.append({"table": arr[0], "image": img_str})
return res
def __call__(self, pdfnm, return_image=True, naive_chunk=False):
flds = self.Fields()
text, tbls = self.pdf(pdfnm)
fnm = pdfnm
txt_cks = self.text_chunks(text) if not naive_chunk else \
self.naive_text_chunk(text, ti=fnm if isinstance(fnm, str) else "")
flds.text_chunks = [(self.pdf.remove_tag(c),
self.pdf.crop(c) if return_image else None) for c in txt_cks]
flds.table_chunks = [(arr, img if return_image else None)
for img, arr in tbls]
return flds
class DocxChunker(HuChunker):
def __init__(self, doc_parser):
self.doc = doc_parser
super().__init__()
def _does_proj_match(self):
mat = []
for s in self.styles:
s = s.split(" ")[-1]
try:
mat.append(int(s))
except Exception as e:
mat.append(None)
return mat
def _merge(self):
i = 1
while i < len(self.lines):
if self.mat[i] == self.mat[i - 1] \
and len(self.lines[i - 1]) < 256 \
and len(self.lines[i]) < 256:
self.lines[i - 1] += "\n" + self.lines[i]
self.styles.pop(i)
self.lines.pop(i)
self.mat.pop(i)
continue
i += 1
self.mat = self._does_proj_match()
return self.mat
def __call__(self, fnm):
flds = self.Fields()
flds.title = os.path.splitext(
os.path.basename(fnm))[0] if isinstance(
fnm, type("")) else ""
secs, tbls = self.doc(fnm)
self.lines = [l for l, s in secs]
self.styles = [s for l, s in secs]
txt_cks = self.text_chunks("")
flds.text_chunks = [(t, None) for t in txt_cks if not self._garbage(t)]
flds.table_chunks = [(tb, None) for tb in tbls for t in tb if t]
return flds
class ExcelChunker(HuChunker):
def __init__(self, excel_parser):
self.excel = excel_parser
super().__init__()
def __call__(self, fnm):
flds = self.Fields()
flds.text_chunks = [(t, None) for t in self.excel(fnm)]
flds.table_chunks = []
return flds
class PptChunker(HuChunker):
def __init__(self):
super().__init__()
def __extract(self, shape):
if shape.shape_type == 19:
tb = shape.table
rows = []
for i in range(1, len(tb.rows)):
rows.append("; ".join([tb.cell(
0, j).text + ": " + tb.cell(i, j).text for j in range(len(tb.columns)) if tb.cell(i, j)]))
return "\n".join(rows)
if shape.has_text_frame:
return shape.text_frame.text
if shape.shape_type == 6:
texts = []
for p in shape.shapes:
t = self.__extract(p)
if t:
texts.append(t)
return "\n".join(texts)
def __call__(self, fnm):
from pptx import Presentation
ppt = Presentation(fnm) if isinstance(
fnm, str) else Presentation(
BytesIO(fnm))
txts = []
for slide in ppt.slides:
texts = []
for shape in slide.shapes:
txt = self.__extract(shape)
if txt:
texts.append(txt)
txts.append("\n".join(texts))
import aspose.slides as slides
import aspose.pydrawing as drawing
imgs = []
with slides.Presentation(BytesIO(fnm)) as presentation:
for slide in presentation.slides:
buffered = BytesIO()
slide.get_thumbnail(
0.5, 0.5).save(
buffered, drawing.imaging.ImageFormat.jpeg)
imgs.append(buffered.getvalue())
assert len(imgs) == len(
txts), "Slides text and image do not match: {} vs. {}".format(len(imgs), len(txts))
flds = self.Fields()
flds.text_chunks = [(txts[i], imgs[i]) for i in range(len(txts))]
flds.table_chunks = []
return flds
class TextChunker(HuChunker):
@dataclass
class Fields:
text_chunks: List = None
table_chunks: List = None
def __init__(self):
super().__init__()
@staticmethod
def is_binary_file(file_path):
mime = magic.Magic(mime=True)
if isinstance(file_path, str):
file_type = mime.from_file(file_path)
else:
file_type = mime.from_buffer(file_path)
if 'text' in file_type:
return False
else:
return True
def __call__(self, fnm):
flds = self.Fields()
if self.is_binary_file(fnm):
return flds
txt = ""
if isinstance(fnm, str):
with open(fnm, "r") as f:
txt = f.read()
else:
txt = fnm.decode("utf-8")
flds.text_chunks = [(c, None) for c in self.naive_text_chunk(txt)]
flds.table_chunks = []
return flds
if __name__ == "__main__":
import sys
sys.path.append(os.path.dirname(__file__) + "/../")
if sys.argv[1].split(".")[-1].lower() == "pdf":
from deepdoc.parser import PdfParser
ckr = PdfChunker(PdfParser())
if sys.argv[1].split(".")[-1].lower().find("doc") >= 0:
from deepdoc.parser import DocxParser
ckr = DocxChunker(DocxParser())
if sys.argv[1].split(".")[-1].lower().find("xlsx") >= 0:
from deepdoc.parser import ExcelParser
ckr = ExcelChunker(ExcelParser())
# ckr.html(sys.argv[1])
print(ckr(sys.argv[1]))

View File

@ -7,14 +7,13 @@ import logging
import copy
from elasticsearch_dsl import Q
from rag.nlp import huqie, term_weight, synonym
from rag.nlp import rag_tokenizer, term_weight, synonym
class EsQueryer:
def __init__(self, es):
self.tw = term_weight.Dealer()
self.es = es
self.syn = synonym.Dealer(None)
self.syn = synonym.Dealer()
self.flds = ["ask_tks^10", "ask_small_tks"]
@staticmethod
@ -47,13 +46,13 @@ class EsQueryer:
txt = re.sub(
r"[ \r\n\t,,。??/`!&]+",
" ",
huqie.tradi2simp(
huqie.strQ2B(
rag_tokenizer.tradi2simp(
rag_tokenizer.strQ2B(
txt.lower()))).strip()
txt = EsQueryer.rmWWW(txt)
if not self.isChinese(txt):
tks = huqie.qie(txt).split(" ")
tks = rag_tokenizer.tokenize(txt).split(" ")
q = copy.deepcopy(tks)
for i in range(1, len(tks)):
q.append("\"%s %s\"^2" % (tks[i - 1], tks[i]))
@ -65,7 +64,7 @@ class EsQueryer:
boost=1)#, minimum_should_match=min_match)
), tks
def needQieqie(tk):
def need_fine_grained_tokenize(tk):
if len(tk) < 4:
return False
if re.match(r"[0-9a-z\.\+#_\*-]+$", tk):
@ -81,7 +80,7 @@ class EsQueryer:
logging.info(json.dumps(twts, ensure_ascii=False))
tms = []
for tk, w in sorted(twts, key=lambda x: x[1] * -1):
sm = huqie.qieqie(tk).split(" ") if needQieqie(tk) else []
sm = rag_tokenizer.fine_grained_tokenize(tk).split(" ") if need_fine_grained_tokenize(tk) else []
sm = [
re.sub(
r"[ ,\./;'\[\]\\`~!@#$%\^&\*\(\)=\+_<>\?:\"\{\}\|,。;‘’【】、!¥……()——《》?:“”-]+",
@ -110,10 +109,10 @@ class EsQueryer:
if len(twts) > 1:
tms += f" (\"%s\"~4)^1.5" % (" ".join([t for t, _ in twts]))
if re.match(r"[0-9a-z ]+$", tt):
tms = f"(\"{tt}\" OR \"%s\")" % huqie.qie(tt)
tms = f"(\"{tt}\" OR \"%s\")" % rag_tokenizer.tokenize(tt)
syns = " OR ".join(
["\"%s\"^0.7" % EsQueryer.subSpecialChar(huqie.qie(s)) for s in syns])
["\"%s\"^0.7" % EsQueryer.subSpecialChar(rag_tokenizer.tokenize(s)) for s in syns])
if syns:
tms = f"({tms})^5 OR ({syns})^0.7"

View File

@ -8,12 +8,13 @@ import re
import string
import sys
from hanziconv import HanziConv
from huggingface_hub import snapshot_download
from nltk import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from api.utils.file_utils import get_project_base_directory
class Huqie:
class RagTokenizer:
def key_(self, line):
return str(line.lower().encode("utf-8"))[2:-1]
@ -240,7 +241,7 @@ class Huqie:
return self.score_(res[::-1])
def qie(self, line):
def tokenize(self, line):
line = self._strQ2B(line).lower()
line = self._tradi2simp(line)
zh_num = len([1 for c in line if is_chinese(c)])
@ -297,7 +298,7 @@ class Huqie:
print("[TKS]", self.merge_(res))
return self.merge_(res)
def qieqie(self, tks):
def fine_grained_tokenize(self, tks):
tks = tks.split(" ")
zh_num = len([1 for c in tks if c and is_chinese(c[0])])
if zh_num < len(tks) * 0.2:
@ -370,53 +371,53 @@ def naiveQie(txt):
return tks
hq = Huqie()
qie = hq.qie
qieqie = hq.qieqie
tag = hq.tag
freq = hq.freq
loadUserDict = hq.loadUserDict
addUserDict = hq.addUserDict
tradi2simp = hq._tradi2simp
strQ2B = hq._strQ2B
tokenizer = RagTokenizer()
tokenize = tokenizer.tokenize
fine_grained_tokenize = tokenizer.fine_grained_tokenize
tag = tokenizer.tag
freq = tokenizer.freq
loadUserDict = tokenizer.loadUserDict
addUserDict = tokenizer.addUserDict
tradi2simp = tokenizer._tradi2simp
strQ2B = tokenizer._strQ2B
if __name__ == '__main__':
huqie = Huqie(debug=True)
tknzr = RagTokenizer(debug=True)
# huqie.addUserDict("/tmp/tmp.new.tks.dict")
tks = huqie.qie(
tks = tknzr.tokenize(
"哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"公开征求意见稿提出,境外投资者可使用自有人民币或外汇投资。使用外汇投资的,可通过债券持有人在香港人民币业务清算行及香港地区经批准可进入境内银行间外汇市场进行交易的境外人民币业务参加行(以下统称香港结算行)办理外汇资金兑换。香港结算行由此所产生的头寸可到境内银行间外汇市场平盘。使用外汇投资的,在其投资的债券到期或卖出后,原则上应兑换回外汇。")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"多校划片就是一个小区对应多个小学初中,让买了学区房的家庭也不确定到底能上哪个学校。目的是通过这种方式为学区房降温,把就近入学落到实处。南京市长江大桥")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"实际上当时他们已经将业务中心偏移到安全部门和针对政府企业的部门 Scripts are compiled and cached aaaaaaaaa")
print(huqie.qieqie(tks))
tks = huqie.qie("虽然我不怎么玩")
print(huqie.qieqie(tks))
tks = huqie.qie("蓝月亮如何在外资夹击中生存,那是全宇宙最有意思的")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize("虽然我不怎么玩")
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize("蓝月亮如何在外资夹击中生存,那是全宇宙最有意思的")
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"涡轮增压发动机num最大功率,不像别的共享买车锁电子化的手段,我们接过来是否有意义,黄黄爱美食,不过,今天阿奇要讲到的这家农贸市场,说实话,还真蛮有特色的!不仅环境好,还打出了")
print(huqie.qieqie(tks))
tks = huqie.qie("这周日你去吗?这周日你有空吗?")
print(huqie.qieqie(tks))
tks = huqie.qie("Unity3D开发经验 测试开发工程师 c++双11双11 985 211 ")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize("这周日你去吗?这周日你有空吗?")
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize("Unity3D开发经验 测试开发工程师 c++双11双11 985 211 ")
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"数据分析项目经理|数据分析挖掘|数据分析方向|商品数据分析|搜索数据分析 sql python hive tableau Cocos2d-")
print(huqie.qieqie(tks))
print(tknzr.fine_grained_tokenize(tks))
if len(sys.argv) < 2:
sys.exit()
huqie.DEBUG = False
huqie.loadUserDict(sys.argv[1])
tknzr.DEBUG = False
tknzr.loadUserDict(sys.argv[1])
of = open(sys.argv[2], "r")
while True:
line = of.readline()
if not line:
break
print(huqie.qie(line))
print(tknzr.tokenize(line))
of.close()
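For callers, the Huqie to RagTokenizer rename is mechanical: the module still exposes bound functions, only under descriptive names. A before/after sketch of a typical call site:

# Before (old huqie module):
#   from rag.nlp import huqie
#   tks = huqie.qie(text)
#   sm = huqie.qieqie(tks)

# After (this PR):
from rag.nlp import rag_tokenizer

tks = rag_tokenizer.tokenize("多校划片就是一个小区对应多个小学初中")
sm = rag_tokenizer.fine_grained_tokenize(tks)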

View File

@ -9,7 +9,7 @@ from dataclasses import dataclass
from rag.settings import es_logger
from rag.utils import rmSpace
from rag.nlp import huqie, query
from rag.nlp import rag_tokenizer, query
import numpy as np
@ -46,7 +46,7 @@ class Dealer:
"k": topk,
"similarity": sim,
"num_candidates": topk * 2,
"query_vector": list(qv)
"query_vector": [float(v) for v in qv]
}
def search(self, req, idxnm, emb_mdl=None):
@ -68,7 +68,7 @@ class Dealer:
pg = int(req.get("page", 1)) - 1
ps = int(req.get("size", 1000))
topk = int(req.get("topk", 1024))
src = req.get("fields", ["docnm_kwd", "content_ltks", "kb_id", "img_id",
src = req.get("fields", ["docnm_kwd", "content_ltks", "kb_id", "img_id", "title_tks", "important_kwd",
"image_id", "doc_id", "q_512_vec", "q_768_vec", "position_int",
"q_1024_vec", "q_1536_vec", "available_int", "content_with_weight"])
@ -128,7 +128,7 @@ class Dealer:
kwds = set([])
for k in keywords:
kwds.add(k)
for kk in huqie.qieqie(k).split(" "):
for kk in rag_tokenizer.fine_grained_tokenize(k).split(" "):
if len(kk) < 2:
continue
if kk in kwds:
@ -237,13 +237,13 @@ class Dealer:
pieces_.append(t)
es_logger.info("{} => {}".format(answer, pieces_))
if not pieces_:
return answer
return answer, set([])
ans_v, _ = embd_mdl.encode(pieces_)
assert len(ans_v[0]) == len(chunk_v[0]), "The dimension of query and chunk do not match: {} vs. {}".format(
len(ans_v[0]), len(chunk_v[0]))
chunks_tks = [huqie.qie(self.qryr.rmWWW(ck)).split(" ")
chunks_tks = [rag_tokenizer.tokenize(self.qryr.rmWWW(ck)).split(" ")
for ck in chunks]
cites = {}
thr = 0.63
@ -251,7 +251,7 @@ class Dealer:
for i, a in enumerate(pieces_):
sim, tksim, vtsim = self.qryr.hybrid_similarity(ans_v[i],
chunk_v,
huqie.qie(
rag_tokenizer.tokenize(
self.qryr.rmWWW(pieces_[i])).split(" "),
chunks_tks,
tkweight, vtweight)
@ -289,8 +289,18 @@ class Dealer:
sres.field[i].get("q_%d_vec" % len(sres.query_vector), "\t".join(["0"] * len(sres.query_vector)))) for i in sres.ids]
if not ins_embd:
return [], [], []
ins_tw = [sres.field[i][cfield].split(" ")
for i in sres.ids]
for i in sres.ids:
if isinstance(sres.field[i].get("important_kwd", []), str):
sres.field[i]["important_kwd"] = [sres.field[i]["important_kwd"]]
ins_tw = []
for i in sres.ids:
content_ltks = sres.field[i][cfield].split(" ")
title_tks = [t for t in sres.field[i].get("title_tks", "").split(" ") if t]
important_kwd = sres.field[i].get("important_kwd", [])
tks = content_ltks + title_tks + important_kwd
ins_tw.append(tks)
sim, tksim, vtsim = self.qryr.hybrid_similarity(sres.query_vector,
ins_embd,
keywords,
@ -300,8 +310,8 @@ class Dealer:
def hybrid_similarity(self, ans_embd, ins_embd, ans, inst):
return self.qryr.hybrid_similarity(ans_embd,
ins_embd,
huqie.qie(ans).split(" "),
huqie.qie(inst).split(" "))
rag_tokenizer.tokenize(ans).split(" "),
rag_tokenizer.tokenize(inst).split(" "))
def retrieval(self, question, embd_mdl, tenant_id, kb_ids, page, page_size, similarity_threshold=0.2,
vector_similarity_weight=0.3, top=1024, doc_ids=None, aggs=True):
@ -368,14 +378,14 @@ class Dealer:
def sql_retrieval(self, sql, fetch_size=128, format="json"):
from api.settings import chat_logger
sql = re.sub(r"[ ]+", " ", sql)
sql = re.sub(r"[ `]+", " ", sql)
sql = sql.replace("%", "")
es_logger.info(f"Get es sql: {sql}")
replaces = []
for r in re.finditer(r" ([a-z_]+_l?tks)( like | ?= ?)'([^']+)'", sql):
fld, v = r.group(1), r.group(3)
match = " MATCH({}, '{}', 'operator=OR;minimum_should_match=30%') ".format(
fld, huqie.qieqie(huqie.qie(v)))
fld, rag_tokenizer.fine_grained_tokenize(rag_tokenizer.tokenize(v)))
replaces.append(
("{}{}'{}'".format(
r.group(1),

View File

@ -17,7 +17,7 @@ class Dealer:
try:
self.dictionary = json.load(open(path, 'r'))
except Exception as e:
logging.warn("Miss synonym.json")
logging.warn("Missing synonym.json")
self.dictionary = {}
if not redis:

View File

@ -4,7 +4,7 @@ import json
import re
import os
import numpy as np
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from api.utils.file_utils import get_project_base_directory
@ -83,7 +83,7 @@ class Dealer:
txt = re.sub(p, r, txt)
res = []
for t in huqie.qie(txt).split(" "):
for t in rag_tokenizer.tokenize(txt).split(" "):
tk = t
if (stpwd and tk in self.stop_words) or (
re.match(r"[0-9]$", tk) and not num):
@ -161,7 +161,7 @@ class Dealer:
return m[self.ne[t]]
def postag(t):
t = huqie.tag(t)
t = rag_tokenizer.tag(t)
if t in set(["r", "c", "d"]):
return 0.3
if t in set(["ns", "nt"]):
@ -175,14 +175,14 @@ class Dealer:
def freq(t):
if re.match(r"[0-9. -]{2,}$", t):
return 3
s = huqie.freq(t)
s = rag_tokenizer.freq(t)
if not s and re.match(r"[a-z. -]+$", t):
return 300
if not s:
s = 0
if not s and len(t) >= 4:
s = [tt for tt in huqie.qieqie(t).split(" ") if len(tt) > 1]
s = [tt for tt in rag_tokenizer.fine_grained_tokenize(t).split(" ") if len(tt) > 1]
if len(s) > 1:
s = np.min([freq(tt) for tt in s]) / 6.
else:
@ -198,7 +198,7 @@ class Dealer:
elif re.match(r"[a-z. -]+$", t):
return 300
elif len(t) >= 4:
s = [tt for tt in huqie.qieqie(t).split(" ") if len(tt) > 1]
s = [tt for tt in rag_tokenizer.fine_grained_tokenize(t).split(" ") if len(tt) > 1]
if len(s) > 1:
return max(3, np.min([df(tt) for tt in s]) / 6.)

View File

@ -25,6 +25,11 @@ SUBPROCESS_STD_LOG_NAME = "std.log"
ES = get_base_config("es", {})
MINIO = decrypt_database_config(name="minio")
try:
REDIS = decrypt_database_config(name="redis")
except Exception as e:
REDIS = {}
pass
DOC_MAXIMUM_SIZE = 128 * 1024 * 1024
# Logger
@ -39,5 +44,12 @@ LoggerFactory.LEVEL = 30
es_logger = getLogger("es")
minio_logger = getLogger("minio")
cron_logger = getLogger("cron_logger")
cron_logger.setLevel(20)
chunk_logger = getLogger("chunk_logger")
database_logger = getLogger("database")
SVR_QUEUE_NAME = "rag_flow_svr_queue"
SVR_QUEUE_RETENTION = 60*60
SVR_QUEUE_MAX_LEN = 1024
SVR_CONSUMER_NAME = "rag_flow_svr_consumer"
SVR_CONSUMER_GROUP_NAME = "rag_flow_svr_consumer_group"
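These constants name the Redis stream and consumer group the reworked broker/executor pair uses to pass parse tasks. The project's own wrapper lives in rag.utils.redis_conn; purely as an illustration of what a stream-based queue with these names involves, the equivalent plumbing in plain redis-py might look like this (not the PR's actual code):

import redis

SVR_QUEUE_NAME = "rag_flow_svr_queue"
SVR_CONSUMER_GROUP_NAME = "rag_flow_svr_consumer_group"
SVR_CONSUMER_NAME = "rag_flow_svr_consumer"

r = redis.Redis(host="localhost", port=6379)

# Producer side: the broker enqueues a parse task onto the stream.
r.xadd(SVR_QUEUE_NAME, {"task_id": "abc123"}, maxlen=1024)

# Consumer side: the executor reads as part of a consumer group.
try:
    r.xgroup_create(SVR_QUEUE_NAME, SVR_CONSUMER_GROUP_NAME, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists
msgs = r.xreadgroup(SVR_CONSUMER_GROUP_NAME, SVR_CONSUMER_NAME,
                    {SVR_QUEUE_NAME: ">"}, count=1, block=5000)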

rag/svr/cache_file_svr.py (new file, 44 lines)
View File

@ -0,0 +1,44 @@
import random
import time
import traceback
from api.db.db_models import close_connection
from api.db.services.task_service import TaskService
from rag.settings import cron_logger
from rag.utils.minio_conn import MINIO
from rag.utils.redis_conn import REDIS_CONN
def collect():
doc_locations = TaskService.get_ongoing_doc_name()
print(doc_locations)
if len(doc_locations) == 0:
time.sleep(1)
return
return doc_locations
def main():
locations = collect()
if not locations:return
print("TASKS:", len(locations))
for kb_id, loc in locations:
try:
if REDIS_CONN.is_alive():
try:
key = "{}/{}".format(kb_id, loc)
if REDIS_CONN.exist(key):continue
file_bin = MINIO.get(kb_id, loc)
REDIS_CONN.transaction(key, file_bin, 12 * 60)
cron_logger.info("CACHE: {}".format(loc))
except Exception as e:
traceback.print_stack(e)
except Exception as e:
traceback.print_stack(e)
if __name__ == "__main__":
while True:
main()
close_connection()
time.sleep(1)

View File

@ -1,182 +0,0 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import logging
import os
import time
import random
from datetime import datetime
from api.db.db_models import Task
from api.db.db_utils import bulk_insert_into_db
from api.db.services.task_service import TaskService
from deepdoc.parser import PdfParser
from deepdoc.parser.excel_parser import HuExcelParser
from rag.settings import cron_logger
from rag.utils import MINIO
from rag.utils import findMaxTm
import pandas as pd
from api.db import FileType, TaskStatus
from api.db.services.document_service import DocumentService
from api.settings import database_logger
from api.utils import get_format_time, get_uuid
from api.utils.file_utils import get_project_base_directory
def collect(tm):
docs = DocumentService.get_newly_uploaded(tm)
if len(docs) == 0:
return pd.DataFrame()
docs = pd.DataFrame(docs)
mtm = docs["update_time"].max()
cron_logger.info("TOTAL:{}, To:{}".format(len(docs), mtm))
return docs
def set_dispatching(docid):
try:
DocumentService.update_by_id(
docid, {"progress": random.random() * 1 / 100.,
"progress_msg": "Task dispatched...",
"process_begin_at": get_format_time()
})
except Exception as e:
cron_logger.error("set_dispatching:({}), {}".format(docid, str(e)))
def dispatch():
tm_fnm = os.path.join(
get_project_base_directory(),
"rag/res",
f"broker.tm")
tm = findMaxTm(tm_fnm)
rows = collect(tm)
if len(rows) == 0:
return
tmf = open(tm_fnm, "a+")
for _, r in rows.iterrows():
try:
tsks = TaskService.query(doc_id=r["id"])
if tsks:
for t in tsks:
TaskService.delete_by_id(t.id)
except Exception as e:
cron_logger.exception(e)
def new_task():
nonlocal r
return {
"id": get_uuid(),
"doc_id": r["id"]
}
tsks = []
try:
if r["type"] == FileType.PDF.value:
do_layout = r["parser_config"].get("layout_recognize", True)
pages = PdfParser.total_page_number(
r["name"], MINIO.get(r["kb_id"], r["location"]))
page_size = r["parser_config"].get("task_page_size", 12)
if r["parser_id"] == "paper":
page_size = r["parser_config"].get("task_page_size", 22)
if r["parser_id"] == "one":
page_size = 1000000000
if not do_layout:
page_size = 1000000000
page_ranges = r["parser_config"].get("pages")
if not page_ranges:
page_ranges = [(1, 100000)]
for s, e in page_ranges:
s -= 1
s = max(0, s)
e = min(e - 1, pages)
for p in range(s, e, page_size):
task = new_task()
task["from_page"] = p
task["to_page"] = min(p + page_size, e)
tsks.append(task)
elif r["parser_id"] == "table":
rn = HuExcelParser.row_number(
r["name"], MINIO.get(
r["kb_id"], r["location"]))
for i in range(0, rn, 3000):
task = new_task()
task["from_page"] = i
task["to_page"] = min(i + 3000, rn)
tsks.append(task)
else:
tsks.append(new_task())
bulk_insert_into_db(Task, tsks, True)
set_dispatching(r["id"])
except Exception as e:
cron_logger.exception(e)
tmf.write(str(r["update_time"]) + "\n")
tmf.close()
def update_progress():
docs = DocumentService.get_unfinished_docs()
for d in docs:
try:
tsks = TaskService.query(doc_id=d["id"], order_by=Task.create_time)
if not tsks:
continue
msg = []
prg = 0
finished = True
bad = 0
status = TaskStatus.RUNNING.value
for t in tsks:
if 0 <= t.progress < 1:
finished = False
prg += t.progress if t.progress >= 0 else 0
msg.append(t.progress_msg)
if t.progress == -1:
bad += 1
prg /= len(tsks)
if finished and bad:
prg = -1
status = TaskStatus.FAIL.value
elif finished:
status = TaskStatus.DONE.value
msg = "\n".join(msg)
info = {
"process_duation": datetime.timestamp(
datetime.now()) -
d["process_begin_at"].timestamp(),
"run": status}
if prg != 0:
info["progress"] = prg
if msg:
info["progress_msg"] = msg
DocumentService.update_by_id(d["id"], info)
except Exception as e:
cron_logger.error("fetch task exception:" + str(e))
if __name__ == "__main__":
peewee_logger = logging.getLogger('peewee')
peewee_logger.propagate = False
peewee_logger.addHandler(database_logger.handlers[0])
peewee_logger.setLevel(database_logger.level)
while True:
dispatch()
time.sleep(1)
update_progress()
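The polling broker above is deleted in this PR; per the commit message, task events now travel through the Redis stream instead. The producing side is not shown in this excerpt, but from the consumer in task_executor.py (which reads `msg["id"]` and calls `TaskService.get_tasks(msg["id"])`), it presumably looks roughly like this sketch:

```python
# Hypothetical producer sketch (the real enqueue code is not shown in this excerpt).
# After task rows are inserted into the DB, push one event per task onto the stream
# that task_executor.collect() reads from.
from rag.settings import SVR_QUEUE_NAME, cron_logger
from rag.utils.redis_conn import REDIS_CONN


def enqueue_tasks(tasks):
    """tasks: list of dicts like {"id": <task uuid>, "doc_id": ...}."""
    for t in tasks:
        if not REDIS_CONN.queue_product(SVR_QUEUE_NAME, message={"id": t["id"]}):
            cron_logger.error("Failed to queue task event: {}".format(t["id"]))
```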


@@ -25,16 +25,18 @@ import time
import traceback
from functools import partial
from api.db.services.file2document_service import File2DocumentService
from rag.utils.minio_conn import MINIO
from api.db.db_models import close_connection
from rag.settings import database_logger
from rag.settings import database_logger, SVR_QUEUE_NAME
from rag.settings import cron_logger, DOC_MAXIMUM_SIZE
from multiprocessing import Pool
import numpy as np
from elasticsearch_dsl import Q
from multiprocessing.context import TimeoutError
from api.db.services.task_service import TaskService
from rag.utils import ELASTICSEARCH
from rag.utils import MINIO
from rag.utils.es_conn import ELASTICSEARCH
from timeit import default_timer as timer
from rag.utils import rmSpace, findMaxTm
from rag.nlp import search
@@ -47,6 +49,7 @@ from api.db import LLMType, ParserType
from api.db.services.document_service import DocumentService
from api.db.services.llm_service import LLMBundle
from api.utils.file_utils import get_project_base_directory
from rag.utils.redis_conn import REDIS_CONN
BATCH_SIZE = 64
@@ -86,28 +89,38 @@ def set_progress(task_id, from_page=0, to_page=-1,
except Exception as e:
cron_logger.error("set_progress:({}), {}".format(task_id, str(e)))
close_connection()
if cancel:
sys.exit()
def collect(comm, mod, tm):
tasks = TaskService.get_tasks(tm, mod, comm)
if len(tasks) == 0:
time.sleep(1)
def collect():
try:
payload = REDIS_CONN.queue_consumer(SVR_QUEUE_NAME, "rag_flow_svr_task_broker", "rag_flow_svr_task_consumer")
if not payload:
time.sleep(1)
return pd.DataFrame()
except Exception as e:
cron_logger.error("Get task event from queue exception:" + str(e))
return pd.DataFrame()
msg = payload.get_message()
payload.ack()
if not msg: return pd.DataFrame()
if TaskService.do_cancel(msg["id"]):
return pd.DataFrame()
tasks = TaskService.get_tasks(msg["id"])
assert tasks, "{} empty task!".format(msg["id"])
tasks = pd.DataFrame(tasks)
mtm = tasks["update_time"].max()
cron_logger.info("TOTAL:{}, To:{}".format(len(tasks), mtm))
return tasks
def get_minio_binary(bucket, name):
global MINIO
return MINIO.get(bucket, name)
def build(row):
from timeit import default_timer as timer
if row["size"] > DOC_MAXIMUM_SIZE:
set_progress(row["id"], prog=-1, msg="File size exceeds( <= %dMb )" %
(int(DOC_MAXIMUM_SIZE / 1024 / 1024)))
@@ -119,12 +132,10 @@ def build(row):
row["from_page"],
row["to_page"])
chunker = FACTORY[row["parser_id"].lower()]
pool = Pool(processes=1)
try:
st = timer()
thr = pool.apply_async(get_minio_binary, args=(row["kb_id"], row["location"]))
binary = thr.get(timeout=90)
pool.terminate()
bucket, name = File2DocumentService.get_minio_address(doc_id=row["doc_id"])
binary = get_minio_binary(bucket, name)
cron_logger.info(
"From minio({}) {}/{}".format(timer()-st, row["location"], row["name"]))
cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
@@ -143,7 +154,6 @@ def build(row):
else:
callback(-1, f"Internal server error: %s" %
str(e).replace("'", ""))
pool.terminate()
traceback.print_exc()
cron_logger.error(
@@ -156,6 +166,7 @@ def build(row):
"doc_id": row["doc_id"],
"kb_id": [str(row["kb_id"])]
}
el = 0
for ck in cks:
d = copy.deepcopy(doc)
d.update(ck)
@@ -175,10 +186,13 @@ def build(row):
else:
d["image"].save(output_buffer, format='JPEG')
st = timer()
MINIO.put(row["kb_id"], d["_id"], output_buffer.getvalue())
el += timer() - st
d["img_id"] = "{}-{}".format(row["kb_id"], d["_id"])
del d["image"]
docs.append(d)
cron_logger.info("MINIO PUT({}):{}".format(row["name"], el))
return docs
@@ -230,30 +244,26 @@ def embedding(docs, mdl, parser_config={}, callback=None):
return tk_count
def main(comm, mod):
tm_fnm = os.path.join(
get_project_base_directory(),
"rag/res",
f"{comm}-{mod}.tm")
tm = findMaxTm(tm_fnm)
rows = collect(comm, mod, tm)
def main():
rows = collect()
if len(rows) == 0:
return
tmf = open(tm_fnm, "a+")
for _, r in rows.iterrows():
callback = partial(set_progress, r["id"], r["from_page"], r["to_page"])
try:
embd_mdl = LLMBundle(r["tenant_id"], LLMType.EMBEDDING)
embd_mdl = LLMBundle(r["tenant_id"], LLMType.EMBEDDING, llm_name=r["embd_id"], lang=r["language"])
except Exception as e:
traceback.print_stack(e)
callback(prog=-1, msg=str(e))
continue
st = timer()
cks = build(r)
cron_logger.info("Build chunks({}): {}".format(r["name"], timer()-st))
if cks is None:
continue
if not cks:
tmf.write(str(r["update_time"]) + "\n")
callback(1., "No chunk! Done!")
continue
# TODO: exception handler
@@ -261,17 +271,21 @@ def main(comm, mod):
callback(
msg="Finished slicing files(%d). Start to embedding the content." %
len(cks))
st = timer()
try:
tk_count = embedding(cks, embd_mdl, r["parser_config"], callback)
except Exception as e:
callback(-1, "Embedding error:{}".format(str(e)))
cron_logger.error(str(e))
tk_count = 0
cron_logger.info("Embedding elapsed({}): {}".format(r["name"], timer()-st))
callback(msg="Finished embedding! Start to build index!")
callback(msg="Finished embedding({})! Start to build index!".format(timer()-st))
init_kb(r)
chunk_count = len(set([c["_id"] for c in cks]))
st = timer()
es_r = ELASTICSEARCH.bulk(cks, search.index_name(r["tenant_id"]))
cron_logger.info("Indexing elapsed({}): {}".format(r["name"], timer()-st))
if es_r:
callback(-1, "Index failure!")
ELASTICSEARCH.deleteByQuery(
@@ -286,11 +300,9 @@ def main(comm, mod):
DocumentService.increment_chunk_num(
r["doc_id"], r["kb_id"], tk_count, chunk_count, 0)
cron_logger.info(
"Chunk doc({}), token({}), chunks({})".format(
r["id"], tk_count, len(cks)))
"Chunk doc({}), token({}), chunks({}), elapsed:{}".format(
r["id"], tk_count, len(cks), timer()-st))
tmf.write(str(r["update_time"]) + "\n")
tmf.close()
if __name__ == "__main__":
@@ -299,9 +311,5 @@ if __name__ == "__main__":
peewee_logger.addHandler(database_logger.handlers[0])
peewee_logger.setLevel(database_logger.level)
from mpi4py import MPI
comm = MPI.COMM_WORLD
while True:
main(int(sys.argv[2]), int(sys.argv[1]))
close_connection()
main()
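Because the hunks above interleave removed and added lines, the resulting entry point is easy to misread. A reconstruction of how rag/svr/task_executor.py likely ends after this diff (a sketch of the file's own `__main__` block, not a standalone script): the mpi4py rank/size plumbing and the .tm bookkeeping are gone, and each worker simply drains the Redis queue.

```python
# Reconstructed entry point of rag/svr/task_executor.py after this diff (sketch only;
# main() here is the queue-driven main() defined earlier in that file).
import logging
from api.db.db_models import close_connection
from rag.settings import database_logger

if __name__ == "__main__":
    peewee_logger = logging.getLogger("peewee")
    peewee_logger.propagate = False
    peewee_logger.addHandler(database_logger.handlers[0])
    peewee_logger.setLevel(database_logger.level)

    while True:
        main()              # consume one task event from Redis, then build/embed/index it
        close_connection()  # release the peewee connection between iterations
```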


@@ -15,9 +15,6 @@ def singleton(cls, *args, **kw):
return _singleton
from .minio_conn import MINIO
from .es_conn import ELASTICSEARCH
def rmSpace(txt):
txt = re.sub(r"([^a-z0-9.,]) +([^ ])", r"\1\2", txt, flags=re.IGNORECASE)
return re.sub(r"([^ ]) +([^a-z0-9.,])", r"\1\2", txt, flags=re.IGNORECASE)


@@ -2,6 +2,7 @@ import re
import json
import time
import copy
import elasticsearch
from elastic_transport import ConnectionTimeout
from elasticsearch import Elasticsearch
@@ -14,7 +15,7 @@ es_logger.info("Elasticsearch version: "+str(elasticsearch.__version__))
@singleton
class HuEs:
class ESConnection:
def __init__(self):
self.info = {}
self.conn()
@@ -453,4 +454,4 @@ class HuEs:
scroll_size = len(page['hits']['hits'])
ELASTICSEARCH = HuEs()
ELASTICSEARCH = ESConnection()


@@ -8,7 +8,7 @@ from rag.utils import singleton
@singleton
class HuMinio(object):
class RAGFlowMinio(object):
def __init__(self):
self.conn = None
self.__open__()
@@ -35,7 +35,7 @@ class HuMinio(object):
self.conn = None
def put(self, bucket, fnm, binary):
for _ in range(10):
for _ in range(3):
try:
if not self.conn.bucket_exists(bucket):
self.conn.make_bucket(bucket)
@@ -56,7 +56,6 @@ class HuMinio(object):
except Exception as e:
minio_logger.error(f"Fail rm {bucket}/{fnm}: " + str(e))
def get(self, bucket, fnm):
for _ in range(1):
try:
@@ -87,10 +86,12 @@ class HuMinio(object):
time.sleep(1)
return
MINIO = HuMinio()
MINIO = RAGFlowMinio()
if __name__ == "__main__":
conn = HuMinio()
conn = RAGFlowMinio()
fnm = "/opt/home/kevinhu/docgpt/upload/13/11-408.jpg"
from PIL import Image
img = Image.open(fnm)

rag/utils/redis_conn.py (new file, 139 lines)

@@ -0,0 +1,139 @@
import json
import redis
import logging
from rag import settings
from rag.utils import singleton
class Payload:
def __init__(self, consumer, queue_name, group_name, msg_id, message):
self.__consumer = consumer
self.__queue_name = queue_name
self.__group_name = group_name
self.__msg_id = msg_id
self.__message = json.loads(message['message'])
def ack(self):
try:
self.__consumer.xack(self.__queue_name, self.__group_name, self.__msg_id)
return True
except Exception as e:
logging.warning("[EXCEPTION]ack" + str(self.__queue_name) + "||" + str(e))
return False
def get_message(self):
return self.__message
@singleton
class RedisDB:
def __init__(self):
self.REDIS = None
self.config = settings.REDIS
self.__open__()
def __open__(self):
try:
self.REDIS = redis.StrictRedis(host=self.config["host"].split(":")[0],
port=int(self.config.get("host", ":6379").split(":")[1]),
db=int(self.config.get("db", 1)),
password=self.config.get("password"),
decode_responses=True)
except Exception as e:
logging.warning("Redis can't be connected.")
return self.REDIS
def is_alive(self):
return self.REDIS is not None
def exist(self, k):
if not self.REDIS: return
try:
return self.REDIS.exists(k)
except Exception as e:
logging.warning("[EXCEPTION]exist" + str(k) + "||" + str(e))
self.__open__()
def get(self, k):
if not self.REDIS: return
try:
return self.REDIS.get(k)
except Exception as e:
logging.warning("[EXCEPTION]get" + str(k) + "||" + str(e))
self.__open__()
def set_obj(self, k, obj, exp=3600):
try:
self.REDIS.set(k, json.dumps(obj, ensure_ascii=False), exp)
return True
except Exception as e:
logging.warning("[EXCEPTION]set_obj" + str(k) + "||" + str(e))
self.__open__()
return False
def set(self, k, v, exp=3600):
try:
self.REDIS.set(k, v, exp)
return True
except Exception as e:
logging.warning("[EXCEPTION]set" + str(k) + "||" + str(e))
self.__open__()
return False
def transaction(self, key, value, exp=3600):
try:
pipeline = self.REDIS.pipeline(transaction=True)
pipeline.set(key, value, exp, nx=True)
pipeline.execute()
return True
except Exception as e:
logging.warning("[EXCEPTION]set" + str(key) + "||" + str(e))
self.__open__()
return False
def queue_product(self, queue, message, exp=settings.SVR_QUEUE_RETENTION) -> bool:
try:
payload = {"message": json.dumps(message)}
pipeline = self.REDIS.pipeline()
pipeline.xadd(queue, payload)
pipeline.expire(queue, exp)
pipeline.execute()
return True
except Exception as e:
logging.warning("[EXCEPTION]producer" + str(queue) + "||" + str(e))
return False
def queue_consumer(self, queue_name, group_name, consumer_name, msg_id=b">") -> Payload:
try:
group_info = self.REDIS.xinfo_groups(queue_name)
if not any(e["name"] == group_name for e in group_info):
self.REDIS.xgroup_create(
queue_name,
group_name,
id="$",
mkstream=True
)
args = {
"groupname": group_name,
"consumername": consumer_name,
"count": 1,
"block": 10000,
"streams": {queue_name: msg_id},
}
messages = self.REDIS.xreadgroup(**args)
if not messages:
return None
stream, element_list = messages[0]
msg_id, payload = element_list[0]
res = Payload(self.REDIS, queue_name, group_name, msg_id, payload)
return res
except Exception as e:
if 'key' in str(e):
pass
else:
logging.warning("[EXCEPTION]consumer" + str(queue_name) + "||" + str(e))
return None
REDIS_CONN = RedisDB()
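The new connector exposes a small stream-based queue API on top of redis-py: queue_product XADDs a JSON-wrapped message and refreshes the stream TTL, queue_consumer creates the consumer group on first use and blocks up to 10 s for a single entry, and the returned Payload carries the decoded message plus an ack() that XACKs it. A minimal round trip, assuming a Redis instance matching settings.REDIS is reachable (sketch only):

```python
# Minimal round trip through the new queue API (sketch; uses only names from this diff).
from rag.settings import SVR_QUEUE_NAME, SVR_CONSUMER_GROUP_NAME, SVR_CONSUMER_NAME
from rag.utils.redis_conn import REDIS_CONN

# queue_consumer creates the group at id="$" on first use, so on a brand-new stream the
# group should exist before producing; this first call may block up to 10 s and return None.
REDIS_CONN.queue_consumer(SVR_QUEUE_NAME, SVR_CONSUMER_GROUP_NAME, SVR_CONSUMER_NAME)

# Producer side: the dict is wrapped as {"message": json.dumps(...)} and XADDed.
REDIS_CONN.queue_product(SVR_QUEUE_NAME, message={"id": "demo-task-id"})

# Consumer side: blocks up to 10 s, returns a Payload or None.
payload = REDIS_CONN.queue_consumer(SVR_QUEUE_NAME, SVR_CONSUMER_GROUP_NAME, SVR_CONSUMER_NAME)
if payload:
    print(payload.get_message())   # -> {"id": "demo-task-id"}
    payload.ack()                  # XACK so the entry is not redelivered to the group
```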


@@ -19,7 +19,7 @@ cryptography==42.0.5
dashscope==1.14.1
datasets==2.17.1
datrie==0.8.2
demjson==2.2.4
demjson3==3.0.6
dill==0.3.8
distro==1.9.0
elastic-transport==8.12.0
@@ -50,7 +50,6 @@ joblib==1.3.2
lxml==5.1.0
MarkupSafe==2.1.5
minio==7.2.4
mpi4py==3.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
@@ -69,6 +68,7 @@ nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
ollama==0.1.9
onnxruntime-gpu==1.17.1
openai==1.12.0
opencv-python==4.9.0.80
@@ -91,8 +91,6 @@ pycryptodomex==3.20.0
pydantic==2.6.2
pydantic_core==2.16.3
PyJWT==2.8.0
PyMuPDF==1.23.25
PyMuPDFb==1.23.22
PyMySQL==1.1.0
PyPDF2==3.0.1
pypdfium2==4.27.0
@@ -102,6 +100,7 @@ python-dotenv==1.0.1
python-pptx==0.6.23
pytz==2024.1
PyYAML==6.0.1
redis==5.0.3
regex==2023.12.25
requests==2.31.0
ruamel.yaml==0.18.6
@@ -116,6 +115,7 @@ sniffio==1.3.1
StrEnum==0.4.15
sympy==1.12
threadpoolctl==3.3.0
tika==2.6.0
tiktoken==0.6.0
tokenizers==0.15.2
torch==2.2.1
@@ -132,3 +132,5 @@ xpinyin==0.7.6
xxhash==3.4.1
yarl==1.9.4
zhipuai==2.0.1
BCEmbedding
loguru==0.7.2

requirements_dev.txt (new file, 126 lines)

@@ -0,0 +1,126 @@
accelerate==0.27.2
aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
Aspose.Slides==24.2.0
attrs==23.2.0
blinker==1.7.0
cachelib==0.12.0
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
coloredlogs==15.0.1
cryptography==42.0.5
dashscope==1.14.1
datasets==2.17.1
datrie==0.8.2
demjson3==3.0.6
dill==0.3.8
distro==1.9.0
elastic-transport==8.12.0
elasticsearch==8.12.1
elasticsearch-dsl==8.12.0
et-xmlfile==1.1.0
filelock==3.13.1
fastembed==0.2.6
FlagEmbedding==1.2.5
Flask==3.0.2
Flask-Cors==4.0.0
Flask-Login==0.6.3
Flask-Session==0.6.0
flatbuffers==23.5.26
frozenlist==1.4.1
fsspec==2023.10.0
h11==0.14.0
hanziconv==0.3.2
httpcore==1.0.4
httpx==0.27.0
huggingface-hub==0.20.3
humanfriendly==10.0
idna==3.6
install==1.3.5
itsdangerous==2.1.2
Jinja2==3.1.3
joblib==1.3.2
lxml==5.1.0
MarkupSafe==2.1.5
minio==7.2.4
mpi4py==3.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
networkx==3.2.1
nltk==3.8.1
numpy==1.26.4
openai==1.12.0
opencv-python==4.9.0.80
openpyxl==3.1.2
packaging==23.2
pandas==2.2.1
pdfminer.six==20221105
pdfplumber==0.10.4
peewee==3.17.1
pillow==10.2.0
protobuf==4.25.3
psutil==5.9.8
pyarrow==15.0.0
pyarrow-hotfix==0.6
pyclipper==1.3.0.post5
pycparser==2.21
pycryptodome==3.20.0
pycryptodome-test-vectors==1.0.14
pycryptodomex==3.20.0
pydantic==2.6.2
pydantic_core==2.16.3
PyJWT==2.8.0
PyMuPDF==1.23.25
PyMuPDFb==1.23.22
PyMySQL==1.1.0
PyPDF2==3.0.1
pypdfium2==4.27.0
python-dateutil==2.8.2
python-docx==1.1.0
python-dotenv==1.0.1
python-pptx==0.6.23
pytz==2024.1
PyYAML==6.0.1
regex==2023.12.25
requests==2.31.0
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
sentence-transformers==2.4.0
shapely==2.0.3
six==1.16.0
sniffio==1.3.1
StrEnum==0.4.15
sympy==1.12
threadpoolctl==3.3.0
tika==2.6.0
tiktoken==0.6.0
tokenizers==0.15.2
torch==2.2.1
tqdm==4.66.2
transformers==4.38.1
triton==2.2.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
Werkzeug==3.0.1
xgboost==2.0.3
XlsxWriter==3.2.0
xpinyin==0.7.6
xxhash==3.4.1
yarl==1.9.4
zhipuai==2.0.1
BCEmbedding
loguru==0.7.2
ollama==0.1.8
redis==5.0.4

web/externals.d.ts (new vendored file, 138 lines)

@@ -0,0 +1,138 @@
// This file is generated by Umi automatically
// DO NOT CHANGE IT MANUALLY!
type CSSModuleClasses = { readonly [key: string]: string };
declare module '*.css' {
const classes: CSSModuleClasses;
export default classes;
}
declare module '*.scss' {
const classes: CSSModuleClasses;
export default classes;
}
declare module '*.sass' {
const classes: CSSModuleClasses;
export default classes;
}
declare module '*.less' {
const classes: CSSModuleClasses;
export default classes;
}
declare module '*.styl' {
const classes: CSSModuleClasses;
export default classes;
}
declare module '*.stylus' {
const classes: CSSModuleClasses;
export default classes;
}
// images
declare module '*.jpg' {
const src: string;
export default src;
}
declare module '*.jpeg' {
const src: string;
export default src;
}
declare module '*.png' {
const src: string;
export default src;
}
declare module '*.gif' {
const src: string;
export default src;
}
declare module '*.svg' {
import * as React from 'react';
export const ReactComponent: React.FunctionComponent<
React.SVGProps<SVGSVGElement> & { title?: string }
>;
const src: string;
export default src;
}
declare module '*.ico' {
const src: string;
export default src;
}
declare module '*.webp' {
const src: string;
export default src;
}
declare module '*.avif' {
const src: string;
export default src;
}
// media
declare module '*.mp4' {
const src: string;
export default src;
}
declare module '*.webm' {
const src: string;
export default src;
}
declare module '*.ogg' {
const src: string;
export default src;
}
declare module '*.mp3' {
const src: string;
export default src;
}
declare module '*.wav' {
const src: string;
export default src;
}
declare module '*.flac' {
const src: string;
export default src;
}
declare module '*.aac' {
const src: string;
export default src;
}
// fonts
declare module '*.woff' {
const src: string;
export default src;
}
declare module '*.woff2' {
const src: string;
export default src;
}
declare module '*.eot' {
const src: string;
export default src;
}
declare module '*.ttf' {
const src: string;
export default src;
}
declare module '*.otf' {
const src: string;
export default src;
}
// other
declare module '*.wasm' {
const initWasm: (
options: WebAssembly.Imports,
) => Promise<WebAssembly.Exports>;
export default initWasm;
}
declare module '*.webmanifest' {
const src: string;
export default src;
}
declare module '*.pdf' {
const src: string;
export default src;
}
declare module '*.txt' {
const src: string;
export default src;
}

Some files were not shown because too many files have changed in this diff.