Compare commits

49 Commits

Author SHA1 Message Date
48607c3cfb Update README (#670)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-05-08 12:01:26 +08:00
d15ba37313 update docker file to support low version npm package (#669)
### Type of change

- [x] Refactoring
2024-05-08 10:40:38 +08:00
a553dc8dbd feat: support DeepSeek (#667)
### What problem does this PR solve?

#666 
feat: support DeepSeek
feat: preview Word and Excel files

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
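
DeepSeek is reached through an OpenAI-compatible chat endpoint, which is how integrations like this one typically call it. A minimal standalone sketch for context; the API key is a placeholder, and `deepseek-chat` matches the model registered in this compare's `init_llm_factory` hunk:

```python
# Minimal sketch: calling DeepSeek's OpenAI-compatible chat API directly.
# Assumes the openai v1 client; the API key is a placeholder.
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")
resp = client.chat.completions.create(
    model="deepseek-chat",  # model name added in this compare
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```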
2024-05-08 10:30:18 +08:00
eb27a4309e add support for deepseek (#668)
### What problem does this PR solve?

#666 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-05-08 10:30:02 +08:00
48e1534bf4 Update conversation_api.md 2024-05-08 09:05:35 +08:00
e9d19c4684 Update conversation_api.md 2024-05-08 09:04:23 +08:00
8d6d7f6887 fix task losing issue (#665)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 20:46:45 +08:00
a6e4b74d94 remove unused dependency (#664)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 19:46:17 +08:00
a5aed2412f fix bugs (#662)
### What problem does this PR solve?

Fix import error for task_service.py

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 16:41:56 +08:00
2810c60757 refine doc for v0.5.0 (#660)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-05-07 13:19:33 +08:00
62afcf5ac8 fix bug (#659)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 13:16:12 +08:00
a74c755d83 Update .env 2024-05-07 12:56:14 +08:00
7013d7f620 refine text decode (#657)
### What problem does this PR solve?
#651 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 12:25:47 +08:00
de839fc3f0 optimize srv broker and executor logic (#630)
### What problem does this PR solve?

Optimize the task broker and executor to reduce memory usage and deployment
complexity.

### Type of change
- [x] Performance Improvement
- [x] Refactoring

### Change Log
- Enhance the Redis utils for the message queue (using Redis Streams); see the sketch after this list
- Modify the task broker logic to go through the message queue (1. get the parse event from the message queue; 2. use a ThreadPoolExecutor as the async executor)
- Rename the document and task table column process_duation to process_duration (probably just a spelling mistake)
- Reformat some code style (just what I saw)
- Add requirement_dev.txt for developers
- Add a Redis container to docker compose
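
A rough sketch of the stream-based queue pattern this change log describes, assuming illustrative names (the stream, group, and payload fields below are not RAGFlow's exact identifiers):

```python
# Sketch: the broker queues parse tasks on a Redis Stream; executors consume
# them through a consumer group and hand each task to a thread pool.
import json
from concurrent.futures import ThreadPoolExecutor

import redis

REDIS = redis.Redis(host="localhost", port=6379)
STREAM, GROUP, CONSUMER = "svr_queue", "task_executors", "executor_0"  # assumed names


def queue_task(task: dict):
    # Broker side: one stream entry per parse task.
    REDIS.xadd(STREAM, {"message": json.dumps(task)})


def consume(handler, pool: ThreadPoolExecutor):
    # Executor side: read pending entries for this consumer, dispatch, ack.
    try:
        REDIS.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except redis.exceptions.ResponseError:
        pass  # group already exists
    while True:
        for _stream, messages in REDIS.xreadgroup(
                GROUP, CONSUMER, {STREAM: ">"}, count=1, block=5000):
            for msg_id, fields in messages:
                task = json.loads(fields[b"message"])
                pool.submit(handler, task)  # async execution via thread pool
                REDIS.xack(STREAM, GROUP, msg_id)
```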

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-05-07 11:43:33 +08:00
c6b6c748ae fix file encoding detection bug (#653)
### What problem does this PR solve?

#651 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 10:01:24 +08:00
ca5acc151a Refactor: Use TaskStatus enum for task status handling (#646)
### What problem does this PR solve?

This commit changes the status 'not started' from being hard-coded to
being maintained by the TaskStatus enum. This enhancement ensures
consistency across the codebase and improves maintainability.
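
The pattern is simply replacing the literal with an enum member. A self-contained sketch for context; the repo defines `TaskStatus` as a `StrEnum` in `api/db` (its values appear in a hunk later in this compare), while plain `str` + `Enum` is used here for portability:

```python
from enum import Enum


class TaskStatus(str, Enum):
    # Values mirror the api/db enum shown later in this compare.
    UNSTART = "0"
    RUNNING = "1"
    CANCEL = "2"
    DONE = "3"


def is_unstarted(task: dict) -> bool:
    # Before this refactor: task["run"] == "0" (a hard-coded magic string).
    return task["run"] == TaskStatus.UNSTART.value
```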

### Type of change

- [x] Refactoring
2024-05-06 18:39:17 +08:00
385dbe5ab5 fix: add spin to parsing status icon of dataset table (#649)
### What problem does this PR solve?

fix: add spin to parsing status icon of dataset table
#648 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-06 18:37:31 +08:00
3050a8cb07 Update README badge (#639)
### What problem does this PR solve?

Entry to RAGFlow's online demo was not easy to find. Also, the text
"RAGFlow" in the badge is already a given. Hence the change.

### Type of change

- [x] Documentation Update
2024-05-04 15:31:11 +08:00
9c77d367d0 Updated faq.md (#636)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-05-03 12:11:15 +08:00
5f03a4de11 remove redis (#629)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-30 19:00:41 +08:00
290e5d958d docs: Add instructions for launching service from source (#619)
This commit includes detailed steps for setting up and launching the
service directly from the source code. It covers cloning the repository,
setting up a virtual environment, configuring environment variables, and
starting the service using Docker. This update ensures that developers
have clear guidance on how to get the service running in a development
environment.

### What problem does this PR solve?

### Type of change
- [x] Documentation Update
2024-04-30 18:45:53 +08:00
9703633a57 fix: filter knowledge list by keywords and clear the selected file list after the file is uploaded successfully and add ellipsis pattern to chunk list (#628)
### What problem does this PR solve?

#627 
fix: filter knowledge list by keywords
fix: clear the selected file list after the file is uploaded
successfully
feat: add ellipsis pattern to chunk list

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 18:43:26 +08:00
7d3b68bb1e refine code (#626)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2024-04-30 17:53:28 +08:00
c89f3c3cdb Fix missing 'ollama' package in requirements.txt (#621)
### What problem does this PR solve?

This commit resolves an issue where the 'ollama' package was
inadvertently omitted from the requirements.txt file. The package has
now been added to ensure all dependencies are correctly installed for
the project.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 16:29:46 +08:00
5d7f573379 Fix: missing 'redis' package in requirements.txt (#622)
### What problem does this PR solve?

This commit resolves an issue where the 'redis' package was
inadvertently omitted from the requirements.txt file. The package has
now been added to ensure all dependencies are correctly installed for
the project.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 16:29:27 +08:00
cab274f560 remove PyMuPDF (#618)
### What problem does this PR solve?
#613 

### Type of change


- [x] Other (please describe):
2024-04-30 12:38:09 +08:00
7059ec2298 fix: fixed the issue that ModelSetting could not be saved #614 (#617)
### What problem does this PR solve?

fix: fixed the issue that ModelSetting could not be saved #614

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 11:27:10 +08:00
674b3aeafd fix disable and enable llm setting in dialog (#616)
### What problem does this PR solve?
#614 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 11:04:14 +08:00
4c1476032d fix: omit long file names (#608)
### What problem does this PR solve?

#607
fix: omit long file names
fix: change the parsing method from tag to select
fix: replace icon for new chat
fix: change the OK button text of the Chat Bot API modal to close


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-29 18:22:17 +08:00
2af74cc494 refine docker layers (#606)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2024-04-29 17:57:40 +08:00
38f0cc016f fix: #567 use modal to upload files in the knowledge base (#601)
### What problem does this PR solve?

fix: #567 use modal to upload files in the knowledge base

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-29 15:45:19 +08:00
6874c6f3a7 refine document upload (#602)
### What problem does this PR solve?

#567 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-29 15:45:08 +08:00
8acc01a227 refine redis connection (#599)
### What problem does this PR solve?

#591 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-29 08:52:38 +08:00
8c07992b6c refine code (#595)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 19:13:33 +08:00
aee8b48d2f feat: add FlowCanvas (#593)
### What problem does this PR solve?

feat: handle operator drag
feat: add FlowCanvas
#592

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-28 19:03:54 +08:00
daf215d266 Updated FAQ: Range of input length (#594)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-28 19:03:43 +08:00
cdcc779705 refine document by using latest as version number (#588)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-28 16:16:08 +08:00
d589b0f568 fix exception in pdf parser (#584)
### What problem does this PR solve?
#451 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 14:23:53 +08:00
9d60a84958 refactor code (#583)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 13:19:54 +08:00
aadb9cbec8 remove default redis configuration (#582)
### What problem does this PR solve?
#580 
### Type of change

- [x] Refactoring
2024-04-28 12:14:56 +08:00
038822f3bd make cites in conversation API configurable (#576)
### What problem does this PR solve?

#566 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
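
The underlying change (visible in the `chat()` hunk of this compare) reads a `quote` flag from the request kwargs, so callers can turn inline citations off. A hypothetical request; only the `quote` field is confirmed by the diff, and the endpoint path and other fields are assumptions:

```bash
# Hypothetical completion request with inline citations disabled.
curl -s -X POST "$RAGFLOW_HOST/api/completion" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"conversation_id": "<id>",
       "messages": [{"role": "user", "content": "What is RAGFlow?"}],
       "quote": false}'
```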
2024-04-28 11:56:17 +08:00
ae501c58fa fix: display the current language directly at the top and do not disp… (#579)
…lay reference symbols for documents in external chat boxes  #566 #577

### What problem does this PR solve?

fix: display the current language directly at the top and do not display
reference symbols for documents in external chat boxes #566 #577

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 11:50:03 +08:00
944776f207 fix bug about fetching file from minio (#574)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 09:57:40 +08:00
f1c98aad6b Update version info (#564)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-26 20:07:26 +08:00
ab06f502d7 fix bug of file management (#565)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-26 19:59:21 +08:00
6329339a32 feat: add Tooltip to action icon of FileManager (#561)
### What problem does this PR solve?
#345
feat: add Tooltip to action icon of FileManager 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-26 18:55:37 +08:00
84b39c60f6 fix rename bug (#562)
### What problem does this PR solve?

fix rename file bugs
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-26 18:55:21 +08:00
eb62c669ae feat: translate FileManager #345 (#558)
### What problem does this PR solve?
#345
feat: translate FileManager
feat: batch delete files from the file table in the knowledge base

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-26 17:22:23 +08:00
f69ff39fa0 add file management feature (#560)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-26 17:21:53 +08:00
148 changed files with 5632 additions and 1663 deletions

.gitignore (vendored, 2 changes)
View File

@@ -27,3 +27,5 @@ Cargo.lock
# Exclude the log folder
docker/ragflow-logs/
/flask_session
/logs

View File

@@ -4,7 +4,7 @@ USER root
WORKDIR /ragflow
ADD ./web ./web
RUN cd ./web && npm i && npm run build
RUN cd ./web && npm i --force && npm run build
ADD ./api ./api
ADD ./conf ./conf

View File

@@ -9,7 +9,7 @@ RUN /root/miniconda3/envs/py11/bin/pip install onnxruntime-gpu --extra-index-url
ADD ./web ./web
RUN cd ./web && npm i && npm run build
RUN cd ./web && npm i --force && npm run build
ADD ./api ./api
ADD ./conf ./conf

View File

@@ -34,7 +34,7 @@ ADD ./requirements.txt ./requirements.txt
RUN apt install openmpi-bin openmpi-common libopenmpi-dev
ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu/openmpi/lib:$LD_LIBRARY_PATH
RUN rm /root/miniconda3/envs/py11/compiler_compat/ld
RUN cd ./web && npm i && npm run build
RUN cd ./web && npm i --force && npm run build
RUN conda run -n py11 pip install -i https://mirrors.aliyun.com/pypi/simple/ -r ./requirements.txt
RUN apt-get update && \

View File

@@ -35,7 +35,7 @@ RUN dnf install -y openmpi openmpi-devel python3-openmpi
ENV C_INCLUDE_PATH /usr/include/openmpi-x86_64:$C_INCLUDE_PATH
ENV LD_LIBRARY_PATH /usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
RUN rm /root/miniconda3/envs/py11/compiler_compat/ld
RUN cd ./web && npm i && npm run build
RUN cd ./web && npm i --force && npm run build
RUN conda run -n py11 pip install $(grep -ivE "mpi4py" ./requirements.txt) # without mpi4py==3.1.5
RUN conda run -n py11 pip install redis

View File

@@ -15,12 +15,12 @@
<img src="https://img.shields.io/github/v/release/infiniflow/ragflow?color=blue&label=Latest%20Release" alt="Latest Release">
</a>
<a href="https://demo.ragflow.io" target="_blank">
<img alt="Static Badge" src="https://img.shields.io/badge/RAGFLOW-LLM-white?&labelColor=dd0af7"></a>
<img alt="Static Badge" src="https://img.shields.io/badge/Online-Demo-4e6b99"></a>
<a href="https://hub.docker.com/r/infiniflow/ragflow" target="_blank">
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.3.2-brightgreen"
alt="docker pull infiniflow/ragflow:v0.3.2"></a>
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.5.0-brightgreen"
alt="docker pull infiniflow/ragflow:v0.5.0"></a>
<a href="https://github.com/infiniflow/ragflow/blob/main/LICENSE">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=7d09f1" alt="license">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=1570EF" alt="license">
</a>
</p>
@@ -58,13 +58,14 @@
## 📌 Latest Features
- 2024-04-19 Support conversation API ([detail](./docs/conversation_api.md)).
- 2024-04-16 Add an embedding model 'bce-embedding-base_v1' from [BCEmbedding](https://github.com/netease-youdao/BCEmbedding).
- 2024-04-16 Add [FastEmbed](https://github.com/qdrant/fastembed), which is designed specifically for light and speedy embedding.
- 2024-04-11 Support [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Add a new layout recognization model for analyzing Laws documentation.
- 2024-04-08 Support [Ollama](./docs/ollama.md) for local LLM deployment.
- 2024-04-07 Support Chinese UI.
- 2024-05-08 Integrates LLM DeepSeek.
- 2024-04-26 Adds file management.
- 2024-04-19 Supports conversation API ([detail](./docs/conversation_api.md)).
- 2024-04-16 Integrates an embedding model 'bce-embedding-base_v1' from [BCEmbedding](https://github.com/netease-youdao/BCEmbedding), and [FastEmbed](https://github.com/qdrant/fastembed), which is designed specifically for light and speedy embedding.
- 2024-04-11 Supports [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Adds a new layout recognition model for analyzing Laws documentation.
- 2024-04-08 Supports [Ollama](./docs/ollama.md) for local LLM deployment.
- 2024-04-07 Supports Chinese UI.
## 🔎 System Architecture
@@ -118,6 +119,7 @@
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
> Note that running the above commands automatically downloads the development Docker image of RAGFlow. To download and run a specific release, find the RAGFLOW_VERSION variable in the docker/.env file, change it to the corresponding version (for example, RAGFLOW_VERSION=v0.5.0), and then run the above commands.
> The core image is about 9 GB in size and may take a while to load.
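
For example, pinning the image to this release could look like the following one-liner sketch (editing docker/.env by hand works just as well):

```bash
# Switch from the dev image to the v0.5.0 release, then start as usual.
sed -i 's/^RAGFLOW_VERSION=.*/RAGFLOW_VERSION=v0.5.0/' docker/.env
docker compose up -d
```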
@@ -179,12 +181,72 @@ To build the Docker images from source:
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
$ docker build -t infiniflow/ragflow:v0.3.2 .
$ docker build -t infiniflow/ragflow:dev .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
## 🛠️ Launch Service from Source
To launch the service from source, please follow these steps:
1. Clone the repository
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
```
2. Create a virtual environment (ensure Anaconda or Miniconda is installed)
```bash
$ conda create -n ragflow python=3.11.0
$ conda activate ragflow
$ pip install -r requirements.txt
```
If CUDA version is greater than 12.0, execute the following additional commands:
```bash
$ pip uninstall -y onnxruntime-gpu
$ pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```
3. Copy the entry script and configure environment variables
```bash
$ cp docker/entrypoint.sh .
$ vi entrypoint.sh
```
Use the following commands to obtain the Python path and the ragflow project path:
```bash
$ which python
$ pwd
```
Set the output of `which python` as the value for `PY` and the output of `pwd` as the value for `PYTHONPATH`.
If `LD_LIBRARY_PATH` is already configured, it can be commented out.
```bash
# Adjust configurations according to your actual situation; the two export commands are newly added.
PY=${PY}
export PYTHONPATH=${PYTHONPATH}
# Optional: Add Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
```
4. Start the base services
```bash
$ cd docker
$ docker compose -f docker-compose-base.yml up -d
```
5. Check the configuration files
Ensure that the settings in **docker/.env** match those in **conf/service_conf.yaml**. The IP addresses and ports for related services in **service_conf.yaml** should be changed to the local machine's IP address and the ports exposed by the containers, as illustrated below.
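
As an illustration of what matching means here (variable names and ports below are assumptions, not necessarily the shipped defaults): if docker/.env maps Elasticsearch and MySQL to host ports, service_conf.yaml should point at the host machine's IP and those mapped ports rather than at the compose service names:

```yaml
# conf/service_conf.yaml (illustrative excerpt only).
# Suppose docker/.env contains ES_PORT=1200 and MYSQL_PORT=5455 (assumed names).
es:
  hosts: 'http://<local-machine-ip>:1200'
mysql:
  host: '<local-machine-ip>'
  port: 5455
```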
6. Launch the service
```bash
$ chmod +x ./entrypoint.sh
$ bash ./entrypoint.sh
```
## 📚 Documentation
- [FAQ](./docs/faq.md)

View File

@@ -15,12 +15,12 @@
<img src="https://img.shields.io/github/v/release/infiniflow/ragflow?color=blue&label=Latest%20Release" alt="Latest Release">
</a>
<a href="https://demo.ragflow.io" target="_blank">
<img alt="Static Badge" src="https://img.shields.io/badge/RAGFLOW-LLM-white?&labelColor=dd0af7"></a>
<img alt="Static Badge" src="https://img.shields.io/badge/Online-Demo-4e6b99"></a>
<a href="https://hub.docker.com/r/infiniflow/ragflow" target="_blank">
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.3.2-brightgreen"
alt="docker pull infiniflow/ragflow:v0.3.2"></a>
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.5.0-brightgreen"
alt="docker pull infiniflow/ragflow:v0.5.0"></a>
<a href="https://github.com/infiniflow/ragflow/blob/main/LICENSE">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=7d09f1" alt="license">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=1570EF" alt="license">
</a>
</p>
@@ -58,6 +58,8 @@
## 📌 Latest Features
- 2024-05-08 Integrates LLM DeepSeek.
- 2024-04-26 Adds the file management feature.
- 2024-04-19 Supports conversation API ([detail](./docs/conversation_api.md)).
- 2024-04-16 Adds the embedding model 'bce-embedding-base_v1' from [BCEmbedding](https://github.com/netease-youdao/BCEmbedding).
- 2024-04-16 Adds [FastEmbed](https://github.com/qdrant/fastembed), which is designed for light and speedy embedding.
@@ -119,7 +121,9 @@
$ docker compose up -d
```
> The core image is about 15 GB in size and may take a while to load.
> Note that running the above commands automatically downloads the development Docker image of RAGFlow. To download and run a specific release, find the RAGFLOW_VERSION variable in the docker/.env file, change it to the corresponding version (for example, RAGFLOW_VERSION=v0.5.0), and then run the above commands.
> The core image is about 9 GB in size and may take a while to load.
4. After starting the server, check the server status:
@@ -179,12 +183,72 @@
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
$ docker build -t infiniflow/ragflow:v0.3.2 .
$ docker build -t infiniflow/ragflow:v0.5.0 .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
## 🛠️ Launch Service from Source
To launch the service from source, please follow these steps:
1. Clone the repository
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
```
2. Create a virtual environment (ensure Anaconda or Miniconda is installed)
```bash
$ conda create -n ragflow python=3.11.0
$ conda activate ragflow
$ pip install -r requirements.txt
```
If the CUDA version is 12.0 or higher, execute the following additional commands:
```bash
$ pip uninstall -y onnxruntime-gpu
$ pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```
3. Copy the entry script and configure environment variables
```bash
$ cp docker/entrypoint.sh .
$ vi entrypoint.sh
```
Use the following commands to obtain the Python path and the ragflow project path:
```bash
$ which python
$ pwd
```
Set the output of `which python` as the value for `PY` and the output of `pwd` as the value for `PYTHONPATH`.
If `LD_LIBRARY_PATH` is already configured, it can be commented out.
```bash
# Adjust configurations according to your actual situation; the two export commands below are newly added.
PY=${PY}
export PYTHONPATH=${PYTHONPATH}
# Optional: Add a Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
```
4. Start the base services
```bash
$ cd docker
$ docker compose -f docker-compose-base.yml up -d
```
5. Check the configuration files
Ensure that the settings in **docker/.env** match those in **conf/service_conf.yaml**. The IP addresses and ports for related services in **service_conf.yaml** should be changed to the local machine's IP address and the ports exposed by the containers.
6. Launch the service
```bash
$ chmod +x ./entrypoint.sh
$ bash ./entrypoint.sh
```
## 📚 Documentation
- [FAQ](./docs/faq.md)

View File

@@ -15,12 +15,12 @@
<img src="https://img.shields.io/github/v/release/infiniflow/ragflow?color=blue&label=Latest%20Release" alt="Latest Release">
</a>
<a href="https://demo.ragflow.io" target="_blank">
<img alt="Static Badge" src="https://img.shields.io/badge/RAGFLOW-LLM-white?&labelColor=dd0af7"></a>
<img alt="Static Badge" src="https://img.shields.io/badge/Online-Demo-4e6b99"></a>
<a href="https://hub.docker.com/r/infiniflow/ragflow" target="_blank">
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.3.2-brightgreen"
alt="docker pull infiniflow/ragflow:v0.3.2"></a>
<img src="https://img.shields.io/badge/docker_pull-ragflow:v0.5.0-brightgreen"
alt="docker pull infiniflow/ragflow:v0.5.0"></a>
<a href="https://github.com/infiniflow/ragflow/blob/main/LICENSE">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=7d09f1" alt="license">
<img height="21" src="https://img.shields.io/badge/License-Apache--2.0-ffffff?style=flat-square&labelColor=d4eaf7&color=1570EF" alt="license">
</a>
</p>
@@ -58,9 +58,10 @@
## 📌 Latest Features
- 2024-05-08 Integrates LLM DeepSeek.
- 2024-04-26 Adds the file management feature.
- 2024-04-19 Supports conversation API ([detail](./docs/conversation_api.md)).
- 2024-04-16 Adds the embedding model [BCEmbedding](https://github.com/netease-youdao/BCEmbedding).
- 2024-04-16 Adds [FastEmbed](https://github.com/qdrant/fastembed), which is designed for light and speedy embedding.
- 2024-04-16 Integrates the embedding model [BCEmbedding](https://github.com/netease-youdao/BCEmbedding) and [FastEmbed](https://github.com/qdrant/fastembed), which is designed for light and speedy embedding.
- 2024-04-11 Supports [Xinference](./docs/xinference.md) for local LLM deployment.
- 2024-04-10 Adds an underlying model for layout analysis of Laws documents.
- 2024-04-08 Supports [Ollama](./docs/ollama.md) for local LLM deployment.
@@ -119,7 +120,9 @@
$ docker compose -f docker-compose-CN.yml up -d
```
> The core image is about 15 GB and may take a while to pull. Please be patient.
> Note that running the above commands automatically downloads the development Docker image of RAGFlow. To download and run a specific release, find the RAGFLOW_VERSION variable in the docker/.env file, change it to the corresponding version (for example, RAGFLOW_VERSION=v0.5.0), and then run the above commands.
> The core image is about 9 GB and may take a while to pull. Please be patient.
4. After the server starts successfully, confirm the server status again:
@@ -179,12 +182,72 @@
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
$ docker build -t infiniflow/ragflow:v0.3.2 .
$ docker build -t infiniflow/ragflow:v0.5.0 .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
```
## 🛠️ Launch Service from Source
To launch the service from source, please follow these steps:
1. Clone the repository
```bash
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow/
```
2. Create a virtual environment (ensure Anaconda or Miniconda is installed)
```bash
$ conda create -n ragflow python=3.11.0
$ conda activate ragflow
$ pip install -r requirements.txt
```
If the CUDA version is greater than 12.0, execute the following additional commands:
```bash
$ pip uninstall -y onnxruntime-gpu
$ pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```
3. Copy the entry script and configure environment variables
```bash
$ cp docker/entrypoint.sh .
$ vi entrypoint.sh
```
Use the following commands to obtain the Python path and the ragflow project path:
```bash
$ which python
$ pwd
```
Set the output of `which python` above as the value for `PY` and the output of `pwd` as the value for `PYTHONPATH`.
If `LD_LIBRARY_PATH` is already configured, it can be commented out.
```bash
# Adjust this configuration according to your actual situation; the two export commands are newly added.
PY=${PY}
export PYTHONPATH=${PYTHONPATH}
# Optional: Add a Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
```
4. Start the base services
```bash
$ cd docker
$ docker compose -f docker-compose-base.yml up -d
```
5. Check the configuration files
Ensure that the settings in **docker/.env** match those in **conf/service_conf.yaml**. The IP addresses and ports for related services in **service_conf.yaml** should be changed to the local machine's IP address and the ports mapped out by the containers.
6. Launch the service
```bash
$ chmod +x ./entrypoint.sh
$ bash ./entrypoint.sh
```
## 📚 Documentation
- [FAQ](./docs/faq.md)

View File

@@ -33,7 +33,7 @@ from api.utils.api_utils import server_error_response, get_data_error_result, ge
from itsdangerous import URLSafeTimedSerializer
from api.utils.file_utils import filename_type, thumbnail
from rag.utils import MINIO
from rag.utils.minio_conn import MINIO
def generate_confirmation_token(tenent_id):

View File

@@ -20,8 +20,9 @@ from flask_login import login_required, current_user
from elasticsearch_dsl import Q
from rag.app.qa import rmPrefix, beAdoc
from rag.nlp import search, huqie
from rag.utils import ELASTICSEARCH, rmSpace
from rag.nlp import search, rag_tokenizer
from rag.utils.es_conn import ELASTICSEARCH
from rag.utils import rmSpace
from api.db import LLMType, ParserType
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.db.services.llm_service import TenantLLMService
@@ -124,10 +125,10 @@ def set():
d = {
"id": req["chunk_id"],
"content_with_weight": req["content_with_weight"]}
d["content_ltks"] = huqie.qie(req["content_with_weight"])
d["content_sm_ltks"] = huqie.qieqie(d["content_ltks"])
d["content_ltks"] = rag_tokenizer.tokenize(req["content_with_weight"])
d["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(d["content_ltks"])
d["important_kwd"] = req["important_kwd"]
d["important_tks"] = huqie.qie(" ".join(req["important_kwd"]))
d["important_tks"] = rag_tokenizer.tokenize(" ".join(req["important_kwd"]))
if "available_int" in req:
d["available_int"] = req["available_int"]
@@ -151,7 +152,7 @@ def set():
retmsg="Q&A must be separated by TAB/ENTER key.")
q, a = rmPrefix(arr[0]), rmPrefix(arr[1])
d = beAdoc(d, arr[0], arr[1], not any(
[huqie.is_chinese(t) for t in q + a]))
[rag_tokenizer.is_chinese(t) for t in q + a]))
v, c = embd_mdl.encode([doc.name, req["content_with_weight"]])
v = 0.1 * v[0] + 0.9 * v[1] if doc.parser_id != ParserType.QA else v[1]
@@ -201,11 +202,11 @@ def create():
md5 = hashlib.md5()
md5.update((req["content_with_weight"] + req["doc_id"]).encode("utf-8"))
chunck_id = md5.hexdigest()
d = {"id": chunck_id, "content_ltks": huqie.qie(req["content_with_weight"]),
d = {"id": chunck_id, "content_ltks": rag_tokenizer.tokenize(req["content_with_weight"]),
"content_with_weight": req["content_with_weight"]}
d["content_sm_ltks"] = huqie.qieqie(d["content_ltks"])
d["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(d["content_ltks"])
d["important_kwd"] = req.get("important_kwd", [])
d["important_tks"] = huqie.qie(" ".join(req.get("important_kwd", [])))
d["important_tks"] = rag_tokenizer.tokenize(" ".join(req.get("important_kwd", [])))
d["create_time"] = str(datetime.datetime.now()).replace("T", " ")[:19]
d["create_timestamp_flt"] = datetime.datetime.now().timestamp()

View File

@@ -35,13 +35,7 @@ def set_dialog():
top_n = req.get("top_n", 6)
similarity_threshold = req.get("similarity_threshold", 0.1)
vector_similarity_weight = req.get("vector_similarity_weight", 0.3)
llm_setting = req.get("llm_setting", {
"temperature": 0.1,
"top_p": 0.3,
"frequency_penalty": 0.7,
"presence_penalty": 0.4,
"max_tokens": 215
})
llm_setting = req.get("llm_setting", {})
default_prompt = {
"system": """你是一个智能助手,请总结知识库的内容来回答问题,请列举知识库中的数据详细回答。当所有知识库内容都与问题无关时,你的回答必须包括“知识库中未找到您要的答案!”这句话。回答需要考虑聊天历史。
以下是知识库:

View File

@@ -14,7 +14,6 @@
# limitations under the License
#
import base64
import os
import pathlib
import re
@@ -23,8 +22,13 @@ import flask
from elasticsearch_dsl import Q
from flask import request
from flask_login import login_required, current_user
from api.db.db_models import Task
from api.db.services.file2document_service import File2DocumentService
from api.db.services.file_service import FileService
from api.db.services.task_service import TaskService, queue_tasks
from rag.nlp import search
from rag.utils import ELASTICSEARCH
from rag.utils.es_conn import ELASTICSEARCH
from api.db.services import duplicate_name
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.utils.api_utils import server_error_response, get_data_error_result, validate_request
@@ -48,55 +52,59 @@ def upload():
if 'file' not in request.files:
return get_json_result(
data=False, retmsg='No file part!', retcode=RetCode.ARGUMENT_ERROR)
file = request.files['file']
if file.filename == '':
file_objs = request.files.getlist('file')
for file_obj in file_objs:
if file_obj.filename == '':
return get_json_result(
data=False, retmsg='No file selected!', retcode=RetCode.ARGUMENT_ERROR)
err = []
for file in file_objs:
try:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
raise LookupError("Can't find this knowledgebase!")
MAX_FILE_NUM_PER_USER = int(os.environ.get('MAX_FILE_NUM_PER_USER', 0))
if MAX_FILE_NUM_PER_USER > 0 and DocumentService.get_doc_count(kb.tenant_id) >= MAX_FILE_NUM_PER_USER:
raise RuntimeError("Exceed the maximum file number of a free user!")
filename = duplicate_name(
DocumentService.query,
name=file.filename,
kb_id=kb.id)
filetype = filename_type(filename)
if filetype == FileType.OTHER.value:
raise RuntimeError("This type of file has not been supported yet!")
location = filename
while MINIO.obj_exist(kb_id, location):
location += "_"
blob = file.read()
MINIO.put(kb_id, location, blob)
doc = {
"id": get_uuid(),
"kb_id": kb.id,
"parser_id": kb.parser_id,
"parser_config": kb.parser_config,
"created_by": current_user.id,
"type": filetype,
"name": filename,
"location": location,
"size": len(blob),
"thumbnail": thumbnail(filename, blob)
}
if doc["type"] == FileType.VISUAL:
doc["parser_id"] = ParserType.PICTURE.value
if re.search(r"\.(ppt|pptx|pages)$", filename):
doc["parser_id"] = ParserType.PRESENTATION.value
DocumentService.insert(doc)
except Exception as e:
err.append(file.filename + ": " + str(e))
if err:
return get_json_result(
data=False, retmsg='No file selected!', retcode=RetCode.ARGUMENT_ERROR)
try:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
return get_data_error_result(
retmsg="Can't find this knowledgebase!")
MAX_FILE_NUM_PER_USER = int(os.environ.get('MAX_FILE_NUM_PER_USER', 0))
if MAX_FILE_NUM_PER_USER > 0 and DocumentService.get_doc_count(kb.tenant_id) >= MAX_FILE_NUM_PER_USER:
return get_data_error_result(
retmsg="Exceed the maximum file number of a free user!")
filename = duplicate_name(
DocumentService.query,
name=file.filename,
kb_id=kb.id)
filetype = filename_type(filename)
if not filetype:
return get_data_error_result(
retmsg="This type of file has not been supported yet!")
location = filename
while MINIO.obj_exist(kb_id, location):
location += "_"
blob = request.files['file'].read()
MINIO.put(kb_id, location, blob)
doc = {
"id": get_uuid(),
"kb_id": kb.id,
"parser_id": kb.parser_id,
"parser_config": kb.parser_config,
"created_by": current_user.id,
"type": filetype,
"name": filename,
"location": location,
"size": len(blob),
"thumbnail": thumbnail(filename, blob)
}
if doc["type"] == FileType.VISUAL:
doc["parser_id"] = ParserType.PICTURE.value
if re.search(r"\.(ppt|pptx|pages)$", filename):
doc["parser_id"] = ParserType.PRESENTATION.value
doc = DocumentService.insert(doc)
return get_json_result(data=doc.to_json())
except Exception as e:
return server_error_response(e)
data=False, retmsg="\n".join(err), retcode=RetCode.SERVER_ERROR)
return get_json_result(data=True)
@manager.route('/create', methods=['POST'])
@@ -218,26 +226,37 @@ def change_status():
@validate_request("doc_id")
def rm():
req = request.json
try:
e, doc = DocumentService.get_by_id(req["doc_id"])
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(req["doc_id"])
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
doc_ids = req["doc_id"]
if isinstance(doc_ids, str): doc_ids = [doc_ids]
errors = ""
for doc_id in doc_ids:
try:
e, doc = DocumentService.get_by_id(doc_id)
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
MINIO.rm(doc.kb_id, doc.location)
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
informs = File2DocumentService.get_by_document_id(doc_id)
if not informs:
MINIO.rm(doc.kb_id, doc.location)
else:
File2DocumentService.delete_by_document_id(doc_id)
except Exception as e:
errors += str(e)
if errors: return server_error_response(e)
return get_json_result(data=True)
@manager.route('/run', methods=['POST'])
@@ -259,6 +278,14 @@ def run():
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=id), idxnm=search.index_name(tenant_id))
if str(req["run"]) == TaskStatus.RUNNING.value:
TaskService.filter_delete([Task.doc_id == id])
e, doc = DocumentService.get_by_id(id)
doc = doc.to_dict()
doc["tenant_id"] = tenant_id
bucket, name = File2DocumentService.get_minio_address(doc_id=doc["id"])
queue_tasks(doc, bucket, name)
return get_json_result(data=True)
except Exception as e:
@@ -289,6 +316,11 @@ def rename():
return get_data_error_result(
retmsg="Database error (Document rename)!")
informs = File2DocumentService.get_by_document_id(req["doc_id"])
if informs:
e, file = FileService.get_by_id(informs[0].file_id)
FileService.update_by_id(file.id, {"name": req["name"]})
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
@@ -302,7 +334,13 @@ def get(doc_id):
if not e:
return get_data_error_result(retmsg="Document not found!")
response = flask.make_response(MINIO.get(doc.kb_id, doc.location))
informs = File2DocumentService.get_by_document_id(doc_id)
if not informs:
response = flask.make_response(MINIO.get(doc.kb_id, doc.location))
else:
e, file = FileService.get_by_id(informs[0].file_id)
response = flask.make_response(MINIO.get(file.parent_id, doc.location))
ext = re.search(r"\.([^.]+)$", doc.name)
if ext:
if doc.type == FileType.VISUAL.value:
@@ -338,7 +376,8 @@ def change_parser():
return get_data_error_result(retmsg="Not supported yet!")
e = DocumentService.update_by_id(doc.id,
{"parser_id": req["parser_id"], "progress": 0, "progress_msg": "", "run": "0"})
{"parser_id": req["parser_id"], "progress": 0, "progress_msg": "",
"run": TaskStatus.UNSTART.value})
if not e:
return get_data_error_result(retmsg="Document not found!")
if "parser_config" in req:

View File

@@ -0,0 +1,137 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
#
from elasticsearch_dsl import Q
from api.db.db_models import File2Document
from api.db.services.file2document_service import File2DocumentService
from api.db.services.file_service import FileService
from flask import request
from flask_login import login_required, current_user
from api.db.services.knowledgebase_service import KnowledgebaseService
from api.utils.api_utils import server_error_response, get_data_error_result, validate_request
from api.utils import get_uuid
from api.db import FileType
from api.db.services.document_service import DocumentService
from api.settings import RetCode
from api.utils.api_utils import get_json_result
from rag.nlp import search
from rag.utils.es_conn import ELASTICSEARCH
@manager.route('/convert', methods=['POST'])
@login_required
@validate_request("file_ids", "kb_ids")
def convert():
req = request.json
kb_ids = req["kb_ids"]
file_ids = req["file_ids"]
file2documents = []
try:
for file_id in file_ids:
e, file = FileService.get_by_id(file_id)
file_ids_list = [file_id]
if file.type == FileType.FOLDER.value:
file_ids_list = FileService.get_all_innermost_file_ids(file_id, [])
for id in file_ids_list:
informs = File2DocumentService.get_by_file_id(id)
# delete
for inform in informs:
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
File2DocumentService.delete_by_file_id(id)
# insert
for kb_id in kb_ids:
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
return get_data_error_result(
retmsg="Can't find this knowledgebase!")
e, file = FileService.get_by_id(id)
if not e:
return get_data_error_result(
retmsg="Can't find this file!")
doc = DocumentService.insert({
"id": get_uuid(),
"kb_id": kb.id,
"parser_id": kb.parser_id,
"parser_config": kb.parser_config,
"created_by": current_user.id,
"type": file.type,
"name": file.name,
"location": file.location,
"size": file.size
})
file2document = File2DocumentService.insert({
"id": get_uuid(),
"file_id": id,
"document_id": doc.id,
})
file2documents.append(file2document.to_json())
return get_json_result(data=file2documents)
except Exception as e:
return server_error_response(e)
@manager.route('/rm', methods=['POST'])
@login_required
@validate_request("file_ids")
def rm():
req = request.json
file_ids = req["file_ids"]
if not file_ids:
return get_json_result(
data=False, retmsg='Lack of "Files ID"', retcode=RetCode.ARGUMENT_ERROR)
try:
for file_id in file_ids:
informs = File2DocumentService.get_by_file_id(file_id)
if not informs:
return get_data_error_result(retmsg="Inform not found!")
for inform in informs:
if not inform:
return get_data_error_result(retmsg="Inform not found!")
File2DocumentService.delete_by_file_id(file_id)
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)

api/apps/file_app.py (new file, 347 lines)
View File

@@ -0,0 +1,347 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
#
import os
import pathlib
import re
import flask
from elasticsearch_dsl import Q
from flask import request
from flask_login import login_required, current_user
from api.db.services.document_service import DocumentService
from api.db.services.file2document_service import File2DocumentService
from api.utils.api_utils import server_error_response, get_data_error_result, validate_request
from api.utils import get_uuid
from api.db import FileType
from api.db.services import duplicate_name
from api.db.services.file_service import FileService
from api.settings import RetCode
from api.utils.api_utils import get_json_result
from api.utils.file_utils import filename_type
from rag.nlp import search
from rag.utils.es_conn import ELASTICSEARCH
from rag.utils.minio_conn import MINIO
@manager.route('/upload', methods=['POST'])
@login_required
# @validate_request("parent_id")
def upload():
pf_id = request.form.get("parent_id")
if not pf_id:
root_folder = FileService.get_root_folder(current_user.id)
pf_id = root_folder.id
if 'file' not in request.files:
return get_json_result(
data=False, retmsg='No file part!', retcode=RetCode.ARGUMENT_ERROR)
file_objs = request.files.getlist('file')
for file_obj in file_objs:
if file_obj.filename == '':
return get_json_result(
data=False, retmsg='No file selected!', retcode=RetCode.ARGUMENT_ERROR)
file_res = []
try:
for file_obj in file_objs:
e, file = FileService.get_by_id(pf_id)
if not e:
return get_data_error_result(
retmsg="Can't find this folder!")
MAX_FILE_NUM_PER_USER = int(os.environ.get('MAX_FILE_NUM_PER_USER', 0))
if MAX_FILE_NUM_PER_USER > 0 and DocumentService.get_doc_count(current_user.id) >= MAX_FILE_NUM_PER_USER:
return get_data_error_result(
retmsg="Exceed the maximum file number of a free user!")
# split file name path
if not file_obj.filename:
e, file = FileService.get_by_id(pf_id)
file_obj_names = [file.name, file_obj.filename]
else:
full_path = '/' + file_obj.filename
file_obj_names = full_path.split('/')
file_len = len(file_obj_names)
# get folder
file_id_list = FileService.get_id_list_by_id(pf_id, file_obj_names, 1, [pf_id])
len_id_list = len(file_id_list)
# create folder
if file_len != len_id_list:
e, file = FileService.get_by_id(file_id_list[len_id_list - 1])
if not e:
return get_data_error_result(retmsg="Folder not found!")
last_folder = FileService.create_folder(file, file_id_list[len_id_list - 1], file_obj_names,
len_id_list)
else:
e, file = FileService.get_by_id(file_id_list[len_id_list - 2])
if not e:
return get_data_error_result(retmsg="Folder not found!")
last_folder = FileService.create_folder(file, file_id_list[len_id_list - 2], file_obj_names,
len_id_list)
# file type
filetype = filename_type(file_obj_names[file_len - 1])
location = file_obj_names[file_len - 1]
while MINIO.obj_exist(last_folder.id, location):
location += "_"
blob = file_obj.read()
filename = duplicate_name(
FileService.query,
name=file_obj_names[file_len - 1],
parent_id=last_folder.id)
file = {
"id": get_uuid(),
"parent_id": last_folder.id,
"tenant_id": current_user.id,
"created_by": current_user.id,
"type": filetype,
"name": filename,
"location": location,
"size": len(blob),
}
file = FileService.insert(file)
MINIO.put(last_folder.id, location, blob)
file_res.append(file.to_json())
return get_json_result(data=file_res)
except Exception as e:
return server_error_response(e)
@manager.route('/create', methods=['POST'])
@login_required
@validate_request("name")
def create():
req = request.json
pf_id = request.json.get("parent_id")
input_file_type = request.json.get("type")
if not pf_id:
root_folder = FileService.get_root_folder(current_user.id)
pf_id = root_folder.id
try:
if not FileService.is_parent_folder_exist(pf_id):
return get_json_result(
data=False, retmsg="Parent Folder Doesn't Exist!", retcode=RetCode.OPERATING_ERROR)
if FileService.query(name=req["name"], parent_id=pf_id):
return get_data_error_result(
retmsg="Duplicated folder name in the same folder.")
if input_file_type == FileType.FOLDER.value:
file_type = FileType.FOLDER.value
else:
file_type = FileType.VIRTUAL.value
file = FileService.insert({
"id": get_uuid(),
"parent_id": pf_id,
"tenant_id": current_user.id,
"created_by": current_user.id,
"name": req["name"],
"location": "",
"size": 0,
"type": file_type
})
return get_json_result(data=file.to_json())
except Exception as e:
return server_error_response(e)
@manager.route('/list', methods=['GET'])
@login_required
def list():
pf_id = request.args.get("parent_id")
keywords = request.args.get("keywords", "")
page_number = int(request.args.get("page", 1))
items_per_page = int(request.args.get("page_size", 15))
orderby = request.args.get("orderby", "create_time")
desc = request.args.get("desc", True)
if not pf_id:
root_folder = FileService.get_root_folder(current_user.id)
pf_id = root_folder.id
try:
e, file = FileService.get_by_id(pf_id)
if not e:
return get_data_error_result(retmsg="Folder not found!")
files, total = FileService.get_by_pf_id(
current_user.id, pf_id, page_number, items_per_page, orderby, desc, keywords)
parent_folder = FileService.get_parent_folder(pf_id)
if not FileService.get_parent_folder(pf_id):
return get_json_result(retmsg="File not found!")
return get_json_result(data={"total": total, "files": files, "parent_folder": parent_folder.to_json()})
except Exception as e:
return server_error_response(e)
@manager.route('/root_folder', methods=['GET'])
@login_required
def get_root_folder():
try:
root_folder = FileService.get_root_folder(current_user.id)
return get_json_result(data={"root_folder": root_folder.to_json()})
except Exception as e:
return server_error_response(e)
@manager.route('/parent_folder', methods=['GET'])
@login_required
def get_parent_folder():
file_id = request.args.get("file_id")
try:
e, file = FileService.get_by_id(file_id)
if not e:
return get_data_error_result(retmsg="Folder not found!")
parent_folder = FileService.get_parent_folder(file_id)
return get_json_result(data={"parent_folder": parent_folder.to_json()})
except Exception as e:
return server_error_response(e)
@manager.route('/all_parent_folder', methods=['GET'])
@login_required
def get_all_parent_folders():
file_id = request.args.get("file_id")
try:
e, file = FileService.get_by_id(file_id)
if not e:
return get_data_error_result(retmsg="Folder not found!")
parent_folders = FileService.get_all_parent_folders(file_id)
parent_folders_res = []
for parent_folder in parent_folders:
parent_folders_res.append(parent_folder.to_json())
return get_json_result(data={"parent_folders": parent_folders_res})
except Exception as e:
return server_error_response(e)
@manager.route('/rm', methods=['POST'])
@login_required
@validate_request("file_ids")
def rm():
req = request.json
file_ids = req["file_ids"]
try:
for file_id in file_ids:
e, file = FileService.get_by_id(file_id)
if not e:
return get_data_error_result(retmsg="File or Folder not found!")
if not file.tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
if file.type == FileType.FOLDER.value:
file_id_list = FileService.get_all_innermost_file_ids(file_id, [])
for inner_file_id in file_id_list:
e, file = FileService.get_by_id(inner_file_id)
if not e:
return get_data_error_result(retmsg="File not found!")
MINIO.rm(file.parent_id, file.location)
FileService.delete_folder_by_pf_id(current_user.id, file_id)
else:
if not FileService.delete(file):
return get_data_error_result(
retmsg="Database error (File removal)!")
# delete file2document
informs = File2DocumentService.get_by_file_id(file_id)
for inform in informs:
doc_id = inform.document_id
e, doc = DocumentService.get_by_id(doc_id)
if not e:
return get_data_error_result(retmsg="Document not found!")
tenant_id = DocumentService.get_tenant_id(doc_id)
if not tenant_id:
return get_data_error_result(retmsg="Tenant not found!")
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
DocumentService.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not DocumentService.delete(doc):
return get_data_error_result(
retmsg="Database error (Document removal)!")
File2DocumentService.delete_by_file_id(file_id)
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
@manager.route('/rename', methods=['POST'])
@login_required
@validate_request("file_id", "name")
def rename():
req = request.json
try:
e, file = FileService.get_by_id(req["file_id"])
if not e:
return get_data_error_result(retmsg="File not found!")
if pathlib.Path(req["name"].lower()).suffix != pathlib.Path(
file.name.lower()).suffix:
return get_json_result(
data=False,
retmsg="The extension of file can't be changed",
retcode=RetCode.ARGUMENT_ERROR)
if FileService.query(name=req["name"], pf_id=file.parent_id):
return get_data_error_result(
retmsg="Duplicated file name in the same folder.")
if not FileService.update_by_id(
req["file_id"], {"name": req["name"]}):
return get_data_error_result(
retmsg="Database error (File rename)!")
informs = File2DocumentService.get_by_file_id(req["file_id"])
if informs:
if not DocumentService.update_by_id(
informs[0].document_id, {"name": req["name"]}):
return get_data_error_result(
retmsg="Database error (Document rename)!")
return get_json_result(data=True)
except Exception as e:
return server_error_response(e)
@manager.route('/get/<file_id>', methods=['GET'])
# @login_required
def get(file_id):
try:
e, file = FileService.get_by_id(file_id)
if not e:
return get_data_error_result(retmsg="Document not found!")
response = flask.make_response(MINIO.get(file.parent_id, file.location))
ext = re.search(r"\.([^.]+)$", file.name)
if ext:
if file.type == FileType.VISUAL.value:
response.headers.set('Content-Type', 'image/%s' % ext.group(1))
else:
response.headers.set(
'Content-Type',
'application/%s' %
ext.group(1))
return response
except Exception as e:
return server_error_response(e)

View File

@@ -28,7 +28,7 @@ from api.db.db_models import Knowledgebase
from api.settings import stat_logger, RetCode
from api.utils.api_utils import get_json_result
from rag.nlp import search
from rag.utils import ELASTICSEARCH
from rag.utils.es_conn import ELASTICSEARCH
@manager.route('/create', methods=['post'])
@@ -111,7 +111,7 @@ def detail():
@login_required
def list():
page_number = request.args.get("page", 1)
items_per_page = request.args.get("page_size", 15)
items_per_page = request.args.get("page_size", 150)
orderby = request.args.get("orderby", "create_time")
desc = request.args.get("desc", True)
try:

View File

@@ -24,10 +24,11 @@ from api.db.db_models import TenantLLM
from api.db.services.llm_service import TenantLLMService, LLMService
from api.utils.api_utils import server_error_response, validate_request
from api.utils import get_uuid, get_format_time, decrypt, download_img, current_timestamp, datetime_format
from api.db import UserTenantRole, LLMType
from api.db import UserTenantRole, LLMType, FileType
from api.settings import RetCode, GITHUB_OAUTH, CHAT_MDL, EMBEDDING_MDL, ASR_MDL, IMAGE2TEXT_MDL, PARSERS, API_KEY, \
LLM_FACTORY, LLM_BASE_URL
from api.db.services.user_service import UserService, TenantService, UserTenantService
from api.db.services.file_service import FileService
from api.settings import stat_logger
from api.utils.api_utils import get_json_result, cors_reponse
@@ -221,6 +222,17 @@ def user_register(user_id, user):
"invited_by": user_id,
"role": UserTenantRole.OWNER
}
file_id = get_uuid()
file = {
"id": file_id,
"parent_id": file_id,
"tenant_id": user_id,
"created_by": user_id,
"name": "/",
"type": FileType.FOLDER.value,
"size": 0,
"location": "",
}
tenant_llm = []
for llm in LLMService.query(fid=LLM_FACTORY):
tenant_llm.append({"tenant_id": user_id,
@@ -236,6 +248,7 @@
TenantService.insert(**tenant)
UserTenantService.insert(**usr_tenant)
TenantLLMService.insert_many(tenant_llm)
FileService.insert(file)
return UserService.query(email=user["email"])

View File

@@ -45,6 +45,8 @@ class FileType(StrEnum):
VISUAL = 'visual'
AURAL = 'aural'
VIRTUAL = 'virtual'
FOLDER = 'folder'
OTHER = "other"
class LLMType(StrEnum):
@@ -62,6 +64,7 @@ class ChatStyle(StrEnum):
class TaskStatus(StrEnum):
UNSTART = "0"
RUNNING = "1"
CANCEL = "2"
DONE = "3"

View File

@@ -669,6 +669,61 @@ class Document(DataBaseModel):
db_table = "document"
class File(DataBaseModel):
id = CharField(
max_length=32,
primary_key=True,
)
parent_id = CharField(
max_length=32,
null=False,
help_text="parent folder id",
index=True)
tenant_id = CharField(
max_length=32,
null=False,
help_text="tenant id",
index=True)
created_by = CharField(
max_length=32,
null=False,
help_text="who created it")
name = CharField(
max_length=255,
null=False,
help_text="file name or folder name",
index=True)
location = CharField(
max_length=255,
null=True,
help_text="where dose it store")
size = IntegerField(default=0)
type = CharField(max_length=32, null=False, help_text="file extension")
class Meta:
db_table = "file"
class File2Document(DataBaseModel):
id = CharField(
max_length=32,
primary_key=True,
)
file_id = CharField(
max_length=32,
null=True,
help_text="file id",
index=True)
document_id = CharField(
max_length=32,
null=True,
help_text="document id",
index=True)
class Meta:
db_table = "file2document"
class Task(DataBaseModel):
id = CharField(max_length=32, primary_key=True)
doc_id = CharField(max_length=32, null=False, index=True)

View File

@@ -123,7 +123,12 @@ factory_infos = [{
"name": "Youdao",
"logo": "",
"tags": "LLM,TEXT EMBEDDING,SPEECH2TEXT,MODERATION",
"status": "1",
"status": "1",
},{
"name": "DeepSeek",
"logo": "",
"tags": "LLM",
"status": "1",
},
# {
# "name": "文心一言",
@@ -331,6 +336,21 @@ def init_llm_factory():
"max_tokens": 512,
"model_type": LLMType.EMBEDDING.value
},
# ------------------------ DeepSeek -----------------------
{
"fid": factory_infos[8]["name"],
"llm_name": "deepseek-chat",
"tags": "LLM,CHAT,",
"max_tokens": 32768,
"model_type": LLMType.CHAT.value
},
{
"fid": factory_infos[8]["name"],
"llm_name": "deepseek-coder",
"tags": "LLM,CHAT,",
"max_tokens": 16385,
"model_type": LLMType.CHAT.value
},
]
for info in factory_infos:
try:

View File

@@ -136,7 +136,7 @@ def chat(dialog, messages, **kwargs):
chat_logger.info("User: {}|Assistant: {}".format(
msg[-1]["content"], answer))
if knowledges and prompt_config.get("quote", True):
if knowledges and (prompt_config.get("quote", True) and kwargs.get("quote", True)):
answer, idx = retrievaler.insert_citations(answer,
[ck["content_ltks"]
for ck in kbinfos["chunks"]],

View File

@@ -13,10 +13,18 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from peewee import Expression
import random
from datetime import datetime
from elasticsearch_dsl import Q
from api.settings import stat_logger
from api.utils import current_timestamp, get_format_time
from rag.utils.es_conn import ELASTICSEARCH
from rag.utils.minio_conn import MINIO
from rag.nlp import search
from api.db import FileType, TaskStatus
from api.db.db_models import DB, Knowledgebase, Tenant
from api.db.db_models import DB, Knowledgebase, Tenant, Task
from api.db.db_models import Document
from api.db.services.common_service import CommonService
from api.db.services.knowledgebase_service import KnowledgebaseService
@@ -71,7 +79,21 @@ class DocumentService(CommonService):
@classmethod
@DB.connection_context()
def get_newly_uploaded(cls, tm, mod=0, comm=1, items_per_page=64):
def remove_document(cls, doc, tenant_id):
ELASTICSEARCH.deleteByQuery(
Q("match", doc_id=doc.id), idxnm=search.index_name(tenant_id))
cls.increment_chunk_num(
doc.id, doc.kb_id, doc.token_num * -1, doc.chunk_num * -1, 0)
if not cls.delete(doc):
raise RuntimeError("Database error (Document removal)!")
MINIO.rm(doc.kb_id, doc.location)
return cls.delete_by_id(doc.id)
@classmethod
@DB.connection_context()
def get_newly_uploaded(cls):
fields = [
cls.model.id,
cls.model.kb_id,
@ -93,11 +115,9 @@ class DocumentService(CommonService):
cls.model.status == StatusEnum.VALID.value,
~(cls.model.type == FileType.VIRTUAL.value),
cls.model.progress == 0,
cls.model.update_time >= tm,
cls.model.run == TaskStatus.RUNNING.value,
(Expression(cls.model.create_time, "%%", comm) == mod))\
.order_by(cls.model.update_time.asc())\
.paginate(1, items_per_page)
cls.model.update_time >= current_timestamp() - 1000 * 600,
cls.model.run == TaskStatus.RUNNING.value)\
.order_by(cls.model.update_time.asc())
return list(docs.dicts())
@classmethod
@ -177,3 +197,55 @@ class DocumentService(CommonService):
on=(Knowledgebase.id == cls.model.kb_id)).where(
Knowledgebase.tenant_id == tenant_id)
return len(docs)
@classmethod
@DB.connection_context()
def begin2parse(cls, docid):
cls.update_by_id(
docid, {"progress": random.random() * 1 / 100.,
"progress_msg": "Task dispatched...",
"process_begin_at": get_format_time()
})
@classmethod
@DB.connection_context()
def update_progress(cls):
docs = cls.get_unfinished_docs()
for d in docs:
try:
tsks = Task.query(doc_id=d["id"], order_by=Task.create_time)
if not tsks:
continue
msg = []
prg = 0
finished = True
bad = 0
status = TaskStatus.RUNNING.value
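# Average progress across this document's tasks; a task reporting -1 marks a failure.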
for t in tsks:
if 0 <= t.progress < 1:
finished = False
prg += t.progress if t.progress >= 0 else 0
msg.append(t.progress_msg)
if t.progress == -1:
bad += 1
prg /= len(tsks)
if finished and bad:
prg = -1
status = TaskStatus.FAIL.value
elif finished:
status = TaskStatus.DONE.value
msg = "\n".join(msg)
info = {
"process_duation": datetime.timestamp(
datetime.now()) -
d["process_begin_at"].timestamp(),
"run": status}
if prg != 0:
info["progress"] = prg
if msg:
info["progress_msg"] = msg
cls.update_by_id(d["id"], info)
except Exception as e:
stat_logger.error("fetch task exception:" + str(e))

View File

@ -0,0 +1,83 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from datetime import datetime
from api.db.db_models import DB
from api.db.db_models import File, Document, File2Document
from api.db.services.common_service import CommonService
from api.db.services.document_service import DocumentService
from api.db.services.file_service import FileService
from api.utils import current_timestamp, datetime_format
class File2DocumentService(CommonService):
model = File2Document
@classmethod
@DB.connection_context()
def get_by_file_id(cls, file_id):
objs = cls.model.select().where(cls.model.file_id == file_id)
return objs
@classmethod
@DB.connection_context()
def get_by_document_id(cls, document_id):
objs = cls.model.select().where(cls.model.document_id == document_id)
return objs
@classmethod
@DB.connection_context()
def insert(cls, obj):
if not cls.save(**obj):
raise RuntimeError("Database error (File)!")
e, obj = cls.get_by_id(obj["id"])
if not e:
raise RuntimeError("Database error (File retrieval)!")
return obj
@classmethod
@DB.connection_context()
def delete_by_file_id(cls, file_id):
return cls.model.delete().where(cls.model.file_id == file_id).execute()
@classmethod
@DB.connection_context()
def delete_by_document_id(cls, doc_id):
return cls.model.delete().where(cls.model.document_id == doc_id).execute()
@classmethod
@DB.connection_context()
def update_by_file_id(cls, file_id, obj):
obj["update_time"] = current_timestamp()
obj["update_date"] = datetime_format(datetime.now())
num = cls.model.update(obj).where(cls.model.id == file_id).execute()
e, obj = cls.get_by_id(file_id)
return obj
@classmethod
@DB.connection_context()
def get_minio_address(cls, doc_id=None, file_id=None):
if doc_id:
ids = File2DocumentService.get_by_document_id(doc_id)
else:
ids = File2DocumentService.get_by_file_id(file_id)
if ids:
e, file = FileService.get_by_id(ids[0].file_id)
return file.parent_id, file.location
else:
assert doc_id, "please specify doc_id"
e, doc = DocumentService.get_by_id(doc_id)
return doc.kb_id, doc.location
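Documents uploaded through the file manager resolve to their parent folder's bucket, while documents created directly in a knowledge base fall back to `kb_id`. A quick usage sketch (assuming a `doc` dict with an `id`, as elsewhere in this diff):
```python
# Resolve a document's storage bucket and object name, then fetch its bytes.
bucket, name = File2DocumentService.get_minio_address(doc_id=doc["id"])
blob = MINIO.get(bucket, name)
```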

View File

@ -0,0 +1,243 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from flask_login import current_user
from peewee import fn
from api.db import FileType
from api.db.db_models import DB, File2Document, Knowledgebase
from api.db.db_models import File, Document
from api.db.services.common_service import CommonService
from api.utils import get_uuid
class FileService(CommonService):
model = File
@classmethod
@DB.connection_context()
def get_by_pf_id(cls, tenant_id, pf_id, page_number, items_per_page,
orderby, desc, keywords):
if keywords:
files = cls.model.select().where(
(cls.model.tenant_id == tenant_id)
& (cls.model.parent_id == pf_id), (fn.LOWER(cls.model.name).like(f"%%{keywords.lower()}%%")))
else:
files = cls.model.select().where((cls.model.tenant_id == tenant_id)
& (cls.model.parent_id == pf_id))
count = files.count()
if desc:
files = files.order_by(cls.model.getter_by(orderby).desc())
else:
files = files.order_by(cls.model.getter_by(orderby).asc())
files = files.paginate(page_number, items_per_page)
res_files = list(files.dicts())
for file in res_files:
if file["type"] == FileType.FOLDER.value:
file["size"] = cls.get_folder_size(file["id"])
file['kbs_info'] = []
continue
kbs_info = cls.get_kb_id_by_file_id(file['id'])
file['kbs_info'] = kbs_info
return res_files, count
@classmethod
@DB.connection_context()
def get_kb_id_by_file_id(cls, file_id):
kbs = (cls.model.select(*[Knowledgebase.id, Knowledgebase.name])
.join(File2Document, on=(File2Document.file_id == file_id))
.join(Document, on=(File2Document.document_id == Document.id))
.join(Knowledgebase, on=(Knowledgebase.id == Document.kb_id))
.where(cls.model.id == file_id))
if not kbs: return []
kbs_info_list = []
for kb in list(kbs.dicts()):
kbs_info_list.append({"kb_id": kb['id'], "kb_name": kb['name']})
return kbs_info_list
@classmethod
@DB.connection_context()
def get_by_pf_id_name(cls, id, name):
file = cls.model.select().where((cls.model.parent_id == id) & (cls.model.name == name))
if file.count():
e, file = cls.get_by_id(file[0].id)
if not e:
raise RuntimeError("Database error (File retrieval)!")
return file
return None
@classmethod
@DB.connection_context()
def get_id_list_by_id(cls, id, name, count, res):
if count < len(name):
file = cls.get_by_pf_id_name(id, name[count])
if file:
res.append(file.id)
return cls.get_id_list_by_id(file.id, name, count + 1, res)
else:
return res
else:
return res
@classmethod
@DB.connection_context()
def get_all_innermost_file_ids(cls, folder_id, result_ids):
subfolders = cls.model.select().where(cls.model.parent_id == folder_id)
if subfolders.exists():
for subfolder in subfolders:
cls.get_all_innermost_file_ids(subfolder.id, result_ids)
else:
result_ids.append(folder_id)
return result_ids
@classmethod
@DB.connection_context()
def create_folder(cls, file, parent_id, name, count):
if count > len(name) - 2:
return file
else:
file = cls.insert({
"id": get_uuid(),
"parent_id": parent_id,
"tenant_id": current_user.id,
"created_by": current_user.id,
"name": name[count],
"location": "",
"size": 0,
"type": FileType.FOLDER.value
})
return cls.create_folder(file, file.id, name, count + 1)
@classmethod
@DB.connection_context()
def is_parent_folder_exist(cls, parent_id):
parent_files = cls.model.select().where(cls.model.id == parent_id)
if parent_files.count():
return True
cls.delete_folder_by_pf_id(current_user.id, parent_id)
return False
@classmethod
@DB.connection_context()
def get_root_folder(cls, tenant_id):
file = cls.model.select().where((cls.model.tenant_id == tenant_id) &
(cls.model.parent_id == cls.model.id))
if not file:
file_id = get_uuid()
file = {
"id": file_id,
"parent_id": file_id,
"tenant_id": tenant_id,
"created_by": tenant_id,
"name": "/",
"type": FileType.FOLDER.value,
"size": 0,
"location": "",
}
cls.save(**file)
else:
file_id = file[0].id
e, file = cls.get_by_id(file_id)
if not e:
raise RuntimeError("Database error (File retrieval)!")
return file
@classmethod
@DB.connection_context()
def get_parent_folder(cls, file_id):
file = cls.model.select().where(cls.model.id == file_id)
if file.count():
e, file = cls.get_by_id(file[0].parent_id)
if not e:
raise RuntimeError("Database error (File retrieval)!")
else:
raise RuntimeError("Database error (File doesn't exist)!")
return file
@classmethod
@DB.connection_context()
def get_all_parent_folders(cls, start_id):
parent_folders = []
current_id = start_id
while current_id:
e, file = cls.get_by_id(current_id)
if file.parent_id != file.id and e:
parent_folders.append(file)
current_id = file.parent_id
else:
parent_folders.append(file)
break
return parent_folders
@classmethod
@DB.connection_context()
def insert(cls, file):
if not cls.save(**file):
raise RuntimeError("Database error (File)!")
e, file = cls.get_by_id(file["id"])
if not e:
raise RuntimeError("Database error (File retrieval)!")
return file
@classmethod
@DB.connection_context()
def delete(cls, file):
return cls.delete_by_id(file.id)
@classmethod
@DB.connection_context()
def delete_by_pf_id(cls, folder_id):
return cls.model.delete().where(cls.model.parent_id == folder_id).execute()
@classmethod
@DB.connection_context()
def delete_folder_by_pf_id(cls, user_id, folder_id):
try:
files = cls.model.select().where((cls.model.tenant_id == user_id)
& (cls.model.parent_id == folder_id))
for file in files:
cls.delete_folder_by_pf_id(user_id, file.id)
return cls.model.delete().where((cls.model.tenant_id == user_id)
& (cls.model.id == folder_id)).execute(),
except Exception as e:
print(e)
raise RuntimeError("Database error (File retrieval)!")
@classmethod
@DB.connection_context()
def get_file_count(cls, tenant_id):
files = cls.model.select(cls.model.id).where(cls.model.tenant_id == tenant_id)
return len(files)
@classmethod
@DB.connection_context()
def get_folder_size(cls, folder_id):
size = 0
def dfs(parent_id):
nonlocal size
for f in cls.model.select(*[cls.model.id, cls.model.size, cls.model.type]).where(
cls.model.parent_id == parent_id, cls.model.id != parent_id):
size += f.size
if f.type == FileType.FOLDER.value:
dfs(f.id)
dfs(folder_id)
return size

View File

@ -128,9 +128,11 @@ class TenantLLMService(CommonService):
else:
assert False, "LLM type error"
num = cls.model.update(used_tokens=cls.model.used_tokens + used_tokens)\
.where(cls.model.tenant_id == tenant_id, cls.model.llm_name == mdlnm)\
.execute()
num = 0
for u in cls.query(tenant_id = tenant_id, llm_name=mdlnm):
num += cls.model.update(used_tokens = u.used_tokens + used_tokens)\
.where(cls.model.tenant_id == tenant_id, cls.model.llm_name == mdlnm)\
.execute()
return num

View File

@ -15,13 +15,19 @@
#
import random
from peewee import Expression
from api.db.db_models import DB
from api.db.db_utils import bulk_insert_into_db
from deepdoc.parser import PdfParser
from peewee import JOIN
from api.db.db_models import DB, File2Document, File
from api.db import StatusEnum, FileType, TaskStatus
from api.db.db_models import Task, Document, Knowledgebase, Tenant
from api.db.services.common_service import CommonService
from api.db.services.document_service import DocumentService
from api.utils import current_timestamp
from api.utils import current_timestamp, get_uuid
from deepdoc.parser.excel_parser import RAGFlowExcelParser
from rag.settings import SVR_QUEUE_NAME
from rag.utils.minio_conn import MINIO
from rag.utils.redis_conn import REDIS_CONN
class TaskService(CommonService):
@ -29,7 +35,7 @@ class TaskService(CommonService):
@classmethod
@DB.connection_context()
def get_tasks(cls, tm, mod=0, comm=1, items_per_page=1, takeit=True):
def get_tasks(cls, task_id):
fields = [
cls.model.id,
cls.model.doc_id,
@ -48,47 +54,38 @@ class TaskService(CommonService):
Tenant.img2txt_id,
Tenant.asr_id,
cls.model.update_time]
with DB.lock("get_task", -1):
docs = cls.model.select(*fields) \
.join(Document, on=(cls.model.doc_id == Document.id)) \
.join(Knowledgebase, on=(Document.kb_id == Knowledgebase.id)) \
.join(Tenant, on=(Knowledgebase.tenant_id == Tenant.id))\
.where(
Document.status == StatusEnum.VALID.value,
Document.run == TaskStatus.RUNNING.value,
~(Document.type == FileType.VIRTUAL.value),
cls.model.progress == 0,
#cls.model.update_time >= tm,
#(Expression(cls.model.create_time, "%%", comm) == mod)
)\
.order_by(cls.model.update_time.asc())\
.paginate(0, items_per_page)
docs = list(docs.dicts())
if not docs: return []
if not takeit: return docs
docs = cls.model.select(*fields) \
.join(Document, on=(cls.model.doc_id == Document.id)) \
.join(Knowledgebase, on=(Document.kb_id == Knowledgebase.id)) \
.join(Tenant, on=(Knowledgebase.tenant_id == Tenant.id)) \
.where(cls.model.id == task_id)
docs = list(docs.dicts())
if not docs: return []
cls.model.update(progress_msg=cls.model.progress_msg + "\n" + "Task has been received.", progress=random.random()/10.).where(
cls.model.id == docs[0]["id"]).execute()
return docs
cls.model.update(progress_msg=cls.model.progress_msg + "\n" + "Task has been received.",
progress=random.random() / 10.).where(
cls.model.id == docs[0]["id"]).execute()
return docs
@classmethod
@DB.connection_context()
def get_ongoing_doc_name(cls):
with DB.lock("get_task", -1):
docs = cls.model.select(*[Document.kb_id, Document.location]) \
docs = cls.model.select(*[Document.id, Document.kb_id, Document.location, File.parent_id]) \
.join(Document, on=(cls.model.doc_id == Document.id)) \
.join(File2Document, on=(File2Document.document_id == Document.id), join_type=JOIN.LEFT_OUTER) \
.join(File, on=(File2Document.file_id == File.id), join_type=JOIN.LEFT_OUTER) \
.where(
Document.status == StatusEnum.VALID.value,
Document.run == TaskStatus.RUNNING.value,
~(Document.type == FileType.VIRTUAL.value),
cls.model.progress >= 0,
cls.model.progress < 1,
cls.model.create_time >= current_timestamp() - 180000
cls.model.create_time >= current_timestamp() - 1000 * 600
)
docs = list(docs.dicts())
if not docs: return []
return list(set([(d["kb_id"], d["location"]) for d in docs]))
return list(set([(d["parent_id"] if d["parent_id"] else d["kb_id"], d["location"]) for d in docs]))
@classmethod
@DB.connection_context()
@ -111,3 +108,55 @@ class TaskService(CommonService):
if "progress" in info:
cls.model.update(progress=info["progress"]).where(
cls.model.id == id).execute()
def queue_tasks(doc, bucket, name):
def new_task():
nonlocal doc
return {
"id": get_uuid(),
"doc_id": doc["id"]
}
tsks = []
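# PDFs are split into page-range subtasks so a long document can be parsed in parallel.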
if doc["type"] == FileType.PDF.value:
file_bin = MINIO.get(bucket, name)
do_layout = doc["parser_config"].get("layout_recognize", True)
pages = PdfParser.total_page_number(doc["name"], file_bin)
page_size = doc["parser_config"].get("task_page_size", 12)
if doc["parser_id"] == "paper":
page_size = doc["parser_config"].get("task_page_size", 22)
if doc["parser_id"] == "one":
page_size = 1000000000
if not do_layout:
page_size = 1000000000
page_ranges = doc["parser_config"].get("pages")
if not page_ranges:
page_ranges = [(1, 100000)]
for s, e in page_ranges:
s -= 1
s = max(0, s)
e = min(e - 1, pages)
for p in range(s, e, page_size):
task = new_task()
task["from_page"] = p
task["to_page"] = min(p + page_size, e)
tsks.append(task)
elif doc["parser_id"] == "table":
file_bin = MINIO.get(bucket, name)
rn = RAGFlowExcelParser.row_number(
doc["name"], file_bin)
for i in range(0, rn, 3000):
task = new_task()
task["from_page"] = i
task["to_page"] = min(i + 3000, rn)
tsks.append(task)
else:
tsks.append(new_task())
bulk_insert_into_db(Task, tsks, True)
DocumentService.begin2parse(doc["id"])
for t in tsks:
REDIS_CONN.queue_product(SVR_QUEUE_NAME, message=t)
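As a rough illustration of the page-range arithmetic above, a hypothetical 30-page PDF with the default `task_page_size` of 12 yields three tasks:
```python
# Page ranges produced for a hypothetical 30-page PDF (page_size = 12).
pages, page_size = 30, 12
ranges = [(p, min(p + page_size, pages)) for p in range(0, pages, page_size)]
print(ranges)  # [(0, 12), (12, 24), (24, 30)] -> three tasks queued to Redis
```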

View File

@ -18,10 +18,14 @@ import logging
import os
import signal
import sys
import time
import traceback
from concurrent.futures import ThreadPoolExecutor
from werkzeug.serving import run_simple
from api.apps import app
from api.db.runtime_config import RuntimeConfig
from api.db.services.document_service import DocumentService
from api.settings import (
HOST, HTTP_PORT, access_logger, database_logger, stat_logger,
)
@ -31,6 +35,16 @@ from api.db.db_models import init_database_tables as init_web_db
from api.db.init_data import init_web_data
from api.versions import get_versions
def update_progress():
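# Background loop: poll once per second and roll task progress up to documents.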
while True:
time.sleep(1)
try:
DocumentService.update_progress()
except Exception as e:
stat_logger.error("update_progress exception:" + str(e))
if __name__ == '__main__':
print("""
____ ______ __
@ -71,6 +85,9 @@ if __name__ == '__main__':
peewee_logger.addHandler(database_logger.handlers[0])
peewee_logger.setLevel(database_logger.level)
thr = ThreadPoolExecutor(max_workers=1)
thr.submit(update_progress)
# start http server
try:
stat_logger.info("RAG Flow http server start...")

View File

@ -32,7 +32,7 @@ access_logger = getLogger("access")
database_logger = getLogger("database")
chat_logger = getLogger("chat")
from rag.utils import ELASTICSEARCH
from rag.utils.es_conn import ELASTICSEARCH
from rag.nlp import search
from api.utils import get_base_config, decrypt_database_config

View File

@ -19,7 +19,7 @@ import os
import re
from io import BytesIO
import fitz
import pdfplumber
from PIL import Image
from cachetools import LRUCache, cached
from ruamel.yaml import YAML
@ -66,6 +66,15 @@ def get_rag_python_directory(*args):
return get_rag_directory("python", *args)
def get_home_cache_dir():
dir = os.path.join(os.path.expanduser('~'), ".ragflow")
try:
os.mkdir(dir)
except OSError as error:
pass
return dir
@cached(cache=LRUCache(maxsize=10))
def load_json_conf(conf_path):
if os.path.isabs(conf_path):
@ -155,17 +164,17 @@ def filename_type(filename):
return FileType.AURAL.value
if re.match(r".*\.(jpg|jpeg|png|tif|gif|pcx|tga|exif|fpx|svg|psd|cdr|pcd|dxf|ufo|eps|ai|raw|WMF|webp|avif|apng|icon|ico|mpg|mpeg|avi|rm|rmvb|mov|wmv|asf|dat|asx|wvx|mpe|mpa|mp4)$", filename):
return FileType.VISUAL
return FileType.VISUAL.value
return FileType.OTHER.value
def thumbnail(filename, blob):
filename = filename.lower()
if re.match(r".*\.pdf$", filename):
pdf = fitz.open(stream=blob, filetype="pdf")
pix = pdf[0].get_pixmap(matrix=fitz.Matrix(0.03, 0.03))
pdf = pdfplumber.open(BytesIO(blob))
buffered = BytesIO()
Image.frombytes("RGB", [pix.width, pix.height],
pix.samples).save(buffered, format="png")
pdf.pages[0].to_image().annotated.save(buffered, format="png")
return "data:image/png;base64," + \
base64.b64encode(buffered.getvalue()).decode("utf-8")

View File

@ -13,12 +13,12 @@ minio:
user: 'rag_flow'
password: 'infini_rag_flow'
host: 'minio:9000'
es:
hosts: 'http://es01:9200'
redis:
db: 1
password: 'infini_rag_flow'
host: 'redis:6379'
es:
hosts: 'http://es01:9200'
user_default_llm:
factory: 'Tongyi-Qianwen'
api_key: 'sk-xxxxxxxxxxxxx'

View File

@ -1,6 +1,6 @@
from .pdf_parser import HuParser as PdfParser, PlainParser
from .docx_parser import HuDocxParser as DocxParser
from .excel_parser import HuExcelParser as ExcelParser
from .ppt_parser import HuPptParser as PptParser
from .pdf_parser import RAGFlowPdfParser as PdfParser, PlainParser
from .docx_parser import RAGFlowDocxParser as DocxParser
from .excel_parser import RAGFlowExcelParser as ExcelParser
from .ppt_parser import RAGFlowPptParser as PptParser

View File

@ -3,11 +3,11 @@ from docx import Document
import re
import pandas as pd
from collections import Counter
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from io import BytesIO
class HuDocxParser:
class RAGFlowDocxParser:
def __extract_table_content(self, tb):
df = []
@ -35,14 +35,14 @@ class HuDocxParser:
for p, n in patt:
if re.search(p, b):
return n
tks = [t for t in huqie.qie(b).split(" ") if len(t) > 1]
tks = [t for t in rag_tokenizer.tokenize(b).split(" ") if len(t) > 1]
if len(tks) > 3:
if len(tks) < 12:
return "Tx"
else:
return "Lx"
if len(tks) == 1 and huqie.tag(tks[0]) == "nr":
if len(tks) == 1 and rag_tokenizer.tag(tks[0]) == "nr":
return "Nr"
return "Ot"

View File

@ -6,7 +6,7 @@ from io import BytesIO
from rag.nlp import find_codec
class HuExcelParser:
class RAGFlowExcelParser:
def html(self, fnm):
if isinstance(fnm, str):
wb = load_workbook(fnm)
@ -69,10 +69,10 @@ class HuExcelParser:
if fnm.split(".")[-1].lower() in ["csv", "txt"]:
encoding = find_codec(binary)
txt = binary.decode(encoding)
txt = binary.decode(encoding, errors="ignore")
return len(txt.split("\n"))
if __name__ == "__main__":
psr = HuExcelParser()
psr = RAGFlowExcelParser()
psr(sys.argv[1])

View File

@ -2,7 +2,6 @@
import os
import random
import fitz
import xgboost as xgb
from io import BytesIO
import torch
@ -16,14 +15,14 @@ from PyPDF2 import PdfReader as pdf2_read
from api.utils.file_utils import get_project_base_directory
from deepdoc.vision import OCR, Recognizer, LayoutRecognizer, TableStructureRecognizer
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from copy import deepcopy
from huggingface_hub import snapshot_download
logging.getLogger("pdfminer").setLevel(logging.WARNING)
class HuParser:
class RAGFlowPdfParser:
def __init__(self):
self.ocr = OCR()
if hasattr(self, "model_speciess"):
@ -95,13 +94,13 @@ class HuParser:
h = max(self.__height(up), self.__height(down))
y_dis = self._y_dis(up, down)
LEN = 6
tks_down = huqie.qie(down["text"][:LEN]).split(" ")
tks_up = huqie.qie(up["text"][-LEN:]).split(" ")
tks_down = rag_tokenizer.tokenize(down["text"][:LEN]).split(" ")
tks_up = rag_tokenizer.tokenize(up["text"][-LEN:]).split(" ")
tks_all = up["text"][-LEN:].strip() \
+ (" " if re.match(r"[a-zA-Z0-9]+",
up["text"][-1] + down["text"][0]) else "") \
+ down["text"][:LEN].strip()
tks_all = huqie.qie(tks_all).split(" ")
tks_all = rag_tokenizer.tokenize(tks_all).split(" ")
fea = [
up.get("R", -1) == down.get("R", -1),
y_dis / h,
@ -142,8 +141,8 @@ class HuParser:
tks_down[-1] == tks_up[-1],
max(down["in_row"], up["in_row"]),
abs(down["in_row"] - up["in_row"]),
len(tks_down) == 1 and huqie.tag(tks_down[0]).find("n") >= 0,
len(tks_up) == 1 and huqie.tag(tks_up[0]).find("n") >= 0
len(tks_down) == 1 and rag_tokenizer.tag(tks_down[0]).find("n") >= 0,
len(tks_up) == 1 and rag_tokenizer.tag(tks_up[0]).find("n") >= 0
]
return fea
@ -470,7 +469,8 @@ class HuParser:
continue
if re.match(r"[0-9]{2,3}/[0-9]{3}$", up["text"]) \
or re.match(r"[0-9]{2,3}/[0-9]{3}$", down["text"]):
or re.match(r"[0-9]{2,3}/[0-9]{3}$", down["text"]) \
or not down["text"].strip():
i += 1
continue
@ -598,7 +598,7 @@ class HuParser:
if b["text"].strip()[0] != b_["text"].strip()[0] \
or b["text"].strip()[0].lower() in set("qwertyuopasdfghjklzxcvbnm") \
or huqie.is_chinese(b["text"].strip()[0]) \
or rag_tokenizer.is_chinese(b["text"].strip()[0]) \
or b["top"] > b_["bottom"]:
i += 1
continue
@ -921,9 +921,7 @@ class HuParser:
fnm) if not binary else pdfplumber.open(BytesIO(binary))
return len(pdf.pages)
except Exception as e:
pdf = fitz.open(fnm) if not binary else fitz.open(
stream=fnm, filetype="pdf")
return len(pdf)
logging.error(str(e))
def __images__(self, fnm, zoomin=3, page_from=0,
page_to=299, callback=None):
@ -945,23 +943,7 @@ class HuParser:
self.pdf.pages[page_from:page_to]]
self.total_page = len(self.pdf.pages)
except Exception as e:
self.pdf = fitz.open(fnm) if isinstance(
fnm, str) else fitz.open(
stream=fnm, filetype="pdf")
self.page_images = []
self.page_chars = []
mat = fitz.Matrix(zoomin, zoomin)
self.total_page = len(self.pdf)
for i, page in enumerate(self.pdf):
if i < page_from:
continue
if i >= page_to:
break
pix = page.get_pixmap(matrix=mat)
img = Image.frombytes("RGB", [pix.width, pix.height],
pix.samples)
self.page_images.append(img)
self.page_chars.append([])
logging.error(str(e))
self.outlines = []
try:

View File

@ -14,7 +14,7 @@ from io import BytesIO
from pptx import Presentation
class HuPptParser(object):
class RAGFlowPptParser(object):
def __init__(self):
super().__init__()

View File

@ -1,6 +1,6 @@
import re,json,os
import pandas as pd
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from . import regions
current_file_path = os.path.dirname(os.path.abspath(__file__))
GOODS = pd.read_csv(os.path.join(current_file_path, "res/corp_baike_len.csv"), sep="\t", header=0).fillna(0)
@ -22,14 +22,14 @@ def baike(cid, default_v=0):
def corpNorm(nm, add_region=True):
global CORP_TKS
if not nm or type(nm)!=type(""):return ""
nm = huqie.tradi2simp(huqie.strQ2B(nm)).lower()
nm = rag_tokenizer.tradi2simp(rag_tokenizer.strQ2B(nm)).lower()
nm = re.sub(r"&amp;", "&", nm)
nm = re.sub(r"[\(\)\+'\"\t \*\\【】-]+", " ", nm)
nm = re.sub(r"([—-]+.*| +co\..*|corp\..*| +inc\..*| +ltd.*)", "", nm, 10000, re.IGNORECASE)
nm = re.sub(r"(计算机|技术|(技术|科技|网络)*有限公司|公司|有限|研发中心|中国|总部)$", "", nm, 10000, re.IGNORECASE)
if not nm or (len(nm)<5 and not regions.isName(nm[0:2])):return nm
tks = huqie.qie(nm).split(" ")
tks = rag_tokenizer.tokenize(nm).split(" ")
reg = [t for i,t in enumerate(tks) if regions.isName(t) and (t != "中国" or i > 0)]
nm = ""
for t in tks:

View File

@ -3,7 +3,7 @@ import re, copy, time, datetime, demjson3, \
traceback, signal
import numpy as np
from deepdoc.parser.resume.entities import degrees, schools, corporations
from rag.nlp import huqie, surname
from rag.nlp import rag_tokenizer, surname
from xpinyin import Pinyin
from contextlib import contextmanager
@ -83,7 +83,7 @@ def forEdu(cv):
if n.get("school_name") and isinstance(n["school_name"], str):
sch.append(re.sub(r"(211|985|重点大学|[,&;-])", "", n["school_name"]))
e["sch_nm_kwd"] = sch[-1]
fea.append(huqie.qieqie(huqie.qie(n.get("school_name", ""))).split(" ")[-1])
fea.append(rag_tokenizer.fine_grained_tokenize(rag_tokenizer.tokenize(n.get("school_name", ""))).split(" ")[-1])
if n.get("discipline_name") and isinstance(n["discipline_name"], str):
maj.append(n["discipline_name"])
@ -166,10 +166,10 @@ def forEdu(cv):
if "tag_kwd" not in cv: cv["tag_kwd"] = []
if "好学历" not in cv["tag_kwd"]: cv["tag_kwd"].append("好学历")
if cv.get("major_kwd"): cv["major_tks"] = huqie.qie(" ".join(maj))
if cv.get("school_name_kwd"): cv["school_name_tks"] = huqie.qie(" ".join(sch))
if cv.get("first_school_name_kwd"): cv["first_school_name_tks"] = huqie.qie(" ".join(fsch))
if cv.get("first_major_kwd"): cv["first_major_tks"] = huqie.qie(" ".join(fmaj))
if cv.get("major_kwd"): cv["major_tks"] = rag_tokenizer.tokenize(" ".join(maj))
if cv.get("school_name_kwd"): cv["school_name_tks"] = rag_tokenizer.tokenize(" ".join(sch))
if cv.get("first_school_name_kwd"): cv["first_school_name_tks"] = rag_tokenizer.tokenize(" ".join(fsch))
if cv.get("first_major_kwd"): cv["first_major_tks"] = rag_tokenizer.tokenize(" ".join(fmaj))
return cv
@ -187,11 +187,11 @@ def forProj(cv):
if n.get("achivement"): desc.append(str(n["achivement"]))
if pro_nms:
# cv["pro_nms_tks"] = huqie.qie(" ".join(pro_nms))
cv["project_name_tks"] = huqie.qie(pro_nms[0])
# cv["pro_nms_tks"] = rag_tokenizer.tokenize(" ".join(pro_nms))
cv["project_name_tks"] = rag_tokenizer.tokenize(pro_nms[0])
if desc:
cv["pro_desc_ltks"] = huqie.qie(rmHtmlTag(" ".join(desc)))
cv["project_desc_ltks"] = huqie.qie(rmHtmlTag(desc[0]))
cv["pro_desc_ltks"] = rag_tokenizer.tokenize(rmHtmlTag(" ".join(desc)))
cv["project_desc_ltks"] = rag_tokenizer.tokenize(rmHtmlTag(desc[0]))
return cv
@ -280,25 +280,25 @@ def forWork(cv):
if fea["corporation_id"]: cv["corporation_id"] = fea["corporation_id"]
if fea["position_name"]:
cv["position_name_tks"] = huqie.qie(fea["position_name"][0])
cv["position_name_sm_tks"] = huqie.qieqie(cv["position_name_tks"])
cv["pos_nm_tks"] = huqie.qie(" ".join(fea["position_name"][1:]))
cv["position_name_tks"] = rag_tokenizer.tokenize(fea["position_name"][0])
cv["position_name_sm_tks"] = rag_tokenizer.fine_grained_tokenize(cv["position_name_tks"])
cv["pos_nm_tks"] = rag_tokenizer.tokenize(" ".join(fea["position_name"][1:]))
if fea["industry_name"]:
cv["industry_name_tks"] = huqie.qie(fea["industry_name"][0])
cv["industry_name_sm_tks"] = huqie.qieqie(cv["industry_name_tks"])
cv["indu_nm_tks"] = huqie.qie(" ".join(fea["industry_name"][1:]))
cv["industry_name_tks"] = rag_tokenizer.tokenize(fea["industry_name"][0])
cv["industry_name_sm_tks"] = rag_tokenizer.fine_grained_tokenize(cv["industry_name_tks"])
cv["indu_nm_tks"] = rag_tokenizer.tokenize(" ".join(fea["industry_name"][1:]))
if fea["corporation_name"]:
cv["corporation_name_kwd"] = fea["corporation_name"][0]
cv["corp_nm_kwd"] = fea["corporation_name"]
cv["corporation_name_tks"] = huqie.qie(fea["corporation_name"][0])
cv["corporation_name_sm_tks"] = huqie.qieqie(cv["corporation_name_tks"])
cv["corp_nm_tks"] = huqie.qie(" ".join(fea["corporation_name"][1:]))
cv["corporation_name_tks"] = rag_tokenizer.tokenize(fea["corporation_name"][0])
cv["corporation_name_sm_tks"] = rag_tokenizer.fine_grained_tokenize(cv["corporation_name_tks"])
cv["corp_nm_tks"] = rag_tokenizer.tokenize(" ".join(fea["corporation_name"][1:]))
if fea["responsibilities"]:
cv["responsibilities_ltks"] = huqie.qie(fea["responsibilities"][0])
cv["resp_ltks"] = huqie.qie(" ".join(fea["responsibilities"][1:]))
cv["responsibilities_ltks"] = rag_tokenizer.tokenize(fea["responsibilities"][0])
cv["resp_ltks"] = rag_tokenizer.tokenize(" ".join(fea["responsibilities"][1:]))
if fea["subordinates_count"]: fea["subordinates_count"] = [int(i) for i in fea["subordinates_count"] if
re.match(r"[^0-9]+$", str(i))]
@ -444,15 +444,15 @@ def parse(cv):
if nms:
t = k[:-4]
cv[f"{t}_kwd"] = nms
cv[f"{t}_tks"] = huqie.qie(" ".join(nms))
cv[f"{t}_tks"] = rag_tokenizer.tokenize(" ".join(nms))
except Exception as e:
print("【EXCEPTION】:", str(traceback.format_exc()), cv[k])
cv[k] = []
# tokenize fields
if k in tks_fld:
cv[f"{k}_tks"] = huqie.qie(cv[k])
if k in small_tks_fld: cv[f"{k}_sm_tks"] = huqie.qie(cv[f"{k}_tks"])
cv[f"{k}_tks"] = rag_tokenizer.tokenize(cv[k])
if k in small_tks_fld: cv[f"{k}_sm_tks"] = rag_tokenizer.tokenize(cv[f"{k}_tks"])
# keyword fields
if k in kwd_fld: cv[f"{k}_kwd"] = [n.lower()
@ -492,7 +492,7 @@ def parse(cv):
cv["name_kwd"] = name
cv["name_pinyin_kwd"] = PY.get_pinyins(nm[:20], ' ')[:3]
cv["name_tks"] = (
huqie.qie(name) + " " + (" ".join(list(name)) if not re.match(r"[a-zA-Z ]+$", name) else "")
rag_tokenizer.tokenize(name) + " " + (" ".join(list(name)) if not re.match(r"[a-zA-Z ]+$", name) else "")
) if name else ""
else:
cv["integerity_flt"] /= 2.
@ -515,7 +515,7 @@ def parse(cv):
cv["updated_at_dt"] = f"%s-%02d-%02d 00:00:00" % (y, int(m), int(d))
# long text tokenize
if cv.get("responsibilities"): cv["responsibilities_ltks"] = huqie.qie(rmHtmlTag(cv["responsibilities"]))
if cv.get("responsibilities"): cv["responsibilities_ltks"] = rag_tokenizer.tokenize(rmHtmlTag(cv["responsibilities"]))
# for yes or no field
fea = []

View File

@ -1,12 +1,13 @@
import pdfplumber
from .ocr import OCR
from .recognizer import Recognizer
from .layout_recognizer import LayoutRecognizer
from .table_structure_recognizer import TableStructureRecognizer
def init_in_out(args):
from PIL import Image
import fitz
import os
import traceback
from api.utils.file_utils import traversal_files
@ -18,13 +19,11 @@ def init_in_out(args):
def pdf_pages(fnm, zoomin=3):
nonlocal outputs, images
pdf = fitz.open(fnm)
mat = fitz.Matrix(zoomin, zoomin)
for i, page in enumerate(pdf):
pix = page.get_pixmap(matrix=mat)
img = Image.frombytes("RGB", [pix.width, pix.height],
pix.samples)
images.append(img)
pdf = pdfplumber.open(fnm)
images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
enumerate(pdf.pages)]
for i, page in enumerate(images):
outputs.append(os.path.split(fnm)[-1] + f"_{i}.jpg")
def images_and_outputs(fnm):

View File

@ -11,10 +11,6 @@
# limitations under the License.
#
from deepdoc.vision.seeit import draw_box
from deepdoc.vision import OCR, init_in_out
import argparse
import numpy as np
import os
import sys
sys.path.insert(
@ -25,6 +21,11 @@ sys.path.insert(
os.path.abspath(__file__)),
'../../')))
from deepdoc.vision.seeit import draw_box
from deepdoc.vision import OCR, init_in_out
import argparse
import numpy as np
def main(args):
ocr = OCR()

View File

@ -10,17 +10,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from deepdoc.vision.seeit import draw_box
from deepdoc.vision import Recognizer, LayoutRecognizer, TableStructureRecognizer, OCR, init_in_out
from api.utils.file_utils import get_project_base_directory
import argparse
import os
import sys
import re
import numpy as np
import os, sys
sys.path.insert(
0,
os.path.abspath(
@ -29,6 +19,13 @@ sys.path.insert(
os.path.abspath(__file__)),
'../../')))
from deepdoc.vision.seeit import draw_box
from deepdoc.vision import Recognizer, LayoutRecognizer, TableStructureRecognizer, OCR, init_in_out
from api.utils.file_utils import get_project_base_directory
import argparse
import re
import numpy as np
def main(args):
images, outputs = init_in_out(args)

View File

@ -19,7 +19,7 @@ import numpy as np
from huggingface_hub import snapshot_download
from api.utils.file_utils import get_project_base_directory
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from .recognizer import Recognizer
@ -117,14 +117,14 @@ class TableStructureRecognizer(Recognizer):
for p, n in patt:
if re.search(p, b["text"].strip()):
return n
tks = [t for t in huqie.qie(b["text"]).split(" ") if len(t) > 1]
tks = [t for t in rag_tokenizer.tokenize(b["text"]).split(" ") if len(t) > 1]
if len(tks) > 3:
if len(tks) < 12:
return "Tx"
else:
return "Lx"
if len(tks) == 1 and huqie.tag(tks[0]) == "nr":
if len(tks) == 1 and rag_tokenizer.tag(tks[0]) == "nr":
return "Nr"
return "Ot"

View File

@ -25,9 +25,11 @@ MINIO_PORT=9000
MINIO_USER=rag_flow
MINIO_PASSWORD=infini_rag_flow
REDIS_PASSWORD=infini_rag_flow
SVR_HTTP_PORT=9380
RAGFLOW_VERSION=v0.3.2
RAGFLOW_VERSION=dev
TIMEZONE='Asia/Shanghai'

View File

@ -50,7 +50,7 @@ The serving port of mysql inside the container. The modification should be synch
The maximum number of database connections.
### stale_timeout
The timeout duation in seconds.
The timeout duration in seconds.
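For reference, a minimal sketch of this section of **service_conf.yaml**; the key names follow the headings above, and the values are illustrative, not defaults:
```yaml
mysql:
  user: 'root'
  password: 'infini_rag_flow'
  host: 'mysql:3306'
  max_connections: 100
  stale_timeout: 30
```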
## minio

View File

@ -29,24 +29,6 @@ services:
- ragflow
restart: always
#kibana:
# depends_on:
# es01:
# condition: service_healthy
# image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
# container_name: ragflow-kibana
# volumes:
# - kibanadata:/usr/share/kibana/data
# ports:
# - ${KIBANA_PORT}:5601
# environment:
# - SERVERNAME=kibana
# - ELASTICSEARCH_HOSTS=http://es01:9200
# - TZ=${TIMEZONE}
# mem_limit: ${MEM_LIMIT}
# networks:
# - ragflow
mysql:
image: mysql:5.7.18
container_name: ragflow-mysql
@ -74,7 +56,6 @@ services:
retries: 3
restart: always
minio:
image: quay.io/minio/minio:RELEASE.2023-12-20T01-00-02Z
container_name: ragflow-minio
@ -92,16 +73,27 @@ services:
- ragflow
restart: always
redis:
image: redis:7.2.4
container_name: ragflow-redis
command: redis-server --requirepass ${REDIS_PASSWORD} --maxmemory 128mb --maxmemory-policy allkeys-lru
volumes:
- redis_data:/data
networks:
- ragflow
restart: always
volumes:
esdata01:
driver: local
# kibanadata:
# driver: local
mysql_data:
driver: local
minio_data:
driver: local
redis_data:
driver: local
networks:
ragflow:

View File

@ -12,28 +12,14 @@ function task_exe(){
done
}
function watch_broker(){
while [ 1 -eq 1 ];do
C=`ps aux|grep "task_broker.py"|grep -v grep|wc -l`;
if [ $C -lt 1 ];then
$PY rag/svr/task_broker.py &
fi
sleep 5;
done
}
function task_bro(){
watch_broker;
}
task_bro &
WS=1
for ((i=0;i<WS;i++))
do
task_exe $i $WS &
done
$PY api/ragflow_server.py
while [ 1 -eq 1 ];do
$PY api/ragflow_server.py
done
wait;
wait;

View File

@ -13,12 +13,12 @@ minio:
user: 'rag_flow'
password: 'infini_rag_flow'
host: 'minio:9000'
es:
hosts: 'http://es01:9200'
redis:
db: 1
password: 'infini_rag_flow'
host: 'redis:6379'
es:
hosts: 'http://es01:9200'
user_default_llm:
factory: 'Tongyi-Qianwen'
api_key: 'sk-xxxxxxxxxxxxx'
@ -38,4 +38,4 @@ authentication:
permission:
switch: false
component: false
dataset: false
dataset: false

View File

@ -221,6 +221,7 @@ This will be called to get the answer to users' questions.
|------|-------|----|----|
| conversation_id| string | No | This is from calling /new_conversation.|
| messages| json | No | All the conversation history is stored here, including the latest user's question.|
| quote | bool | Yes | Default: true |
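For example, to get an answer without inline citations, pass `quote: false` along with the conversation history; the field values below are placeholders:
```json
{
  "conversation_id": "xxxxxxxxxxxxxxxxx",
  "messages": [
    {"role": "user", "content": "What is RAGFlow?"}
  ],
  "quote": false
}
```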
### Response
```json
@ -360,4 +361,4 @@ This is usually used when uploading a file to.
"retmsg": "success"
}
```
```

View File

@ -55,7 +55,7 @@ This feature and the related APIs are still in development. Contributions are we
```
$ git clone https://github.com/infiniflow/ragflow.git
$ cd ragflow
$ docker build -t infiniflow/ragflow:v0.3.2 .
$ docker build -t infiniflow/ragflow:latest .
$ cd ragflow/docker
$ chmod +x ./entrypoint.sh
$ docker compose up -d
@ -193,18 +193,31 @@ docker logs -f ragflow-server
2. Check if the **task_executor.py** process exists.
3. Check if your RAGFlow server can access hf-mirror.com or huggingface.com.
#### 4.5 Why does my pdf parsing stall near completion, while the log does not show any error?
#### 4.5 `Index failure`
If your RAGFlow is deployed *locally*, the parsing process is likely killed due to insufficient RAM. Try increasing the memory allocation by raising the `MEM_LIMIT` value in **docker/.env**.
> Ensure that you restart your RAGFlow server for your changes to take effect!
> ```bash
> docker compose stop
> ```
> ```bash
> docker compose up -d
> ```
![nearcompletion](https://github.com/infiniflow/ragflow/assets/93570324/563974c3-f8bb-4ec8-b241-adcda8929cbb)
#### 4.6 `Index failure`
An index failure usually indicates an unavailable Elasticsearch service.
#### 4.6 How to check the log of RAGFlow?
#### 4.7 How to check the log of RAGFlow?
```bash
tail -f path_to_ragflow/docker/ragflow-logs/rag/*.log
```
#### 4.7 How to check the status of each component in RAGFlow?
#### 4.8 How to check the status of each component in RAGFlow?
```bash
$ docker ps
@ -212,13 +225,13 @@ $ docker ps
*The system displays the following if all your RAGFlow components are running properly:*
```
5bc45806b680 infiniflow/ragflow:v0.3.2 "./entrypoint.sh" 11 hours ago Up 11 hours 0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp ragflow-server
5bc45806b680 infiniflow/ragflow:latest "./entrypoint.sh" 11 hours ago Up 11 hours 0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp ragflow-server
91220e3285dd docker.elastic.co/elasticsearch/elasticsearch:8.11.3 "/bin/tini -- /usr/l…" 11 hours ago Up 11 hours (healthy) 9300/tcp, 0.0.0.0:9200->9200/tcp, :::9200->9200/tcp ragflow-es-01
d8c86f06c56b mysql:5.7.18 "docker-entrypoint.s…" 7 days ago Up 16 seconds (healthy) 0.0.0.0:3306->3306/tcp, :::3306->3306/tcp ragflow-mysql
cd29bcb254bc quay.io/minio/minio:RELEASE.2023-12-20T01-00-02Z "/usr/bin/docker-ent…" 2 weeks ago Up 11 hours 0.0.0.0:9001->9001/tcp, :::9001->9001/tcp, 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp ragflow-minio
```
#### 4.8 `Exception: Can't connect to ES cluster`
#### 4.9 `Exception: Can't connect to ES cluster`
1. Check the status of your Elasticsearch component:
@ -245,23 +258,26 @@ $ docker ps
curl http://<IP_OF_ES>:<PORT_OF_ES>
```
#### 4.10 Can't start ES container and get `Elasticsearch did not exit normally`
#### 4.9 `{"data":null,"retcode":100,"retmsg":"<NotFound '404: Not Found'>"}`
This is because you forgot to update the `vm.max_map_count` value in **/etc/sysctl.conf** and your change to this value was reset after a system reboot.
#### 4.11 `{"data":null,"retcode":100,"retmsg":"<NotFound '404: Not Found'>"}`
Your IP address or port number may be incorrect. If you are using the default configurations, enter http://<IP_OF_YOUR_MACHINE> (**NOT 9380, AND NO PORT NUMBER REQUIRED!**) in your browser. This should work.
#### 4.10 `Ollama - Mistral instance running at 127.0.0.1:11434 but cannot add Ollama as model in RagFlow`
#### 4.12 `Ollama - Mistral instance running at 127.0.0.1:11434 but cannot add Ollama as model in RagFlow`
A correct Ollama IP address and port is crucial to adding models to Ollama:
- If you are on demo.ragflow.io, ensure that the server hosting Ollama has a publicly accessible IP address. Note that 127.0.0.1 is not a publicly accessible IP address.
- If you deploy RAGFlow locally, ensure that Ollama and RAGFlow are in the same LAN and can communicate with each other.
#### 4.11 Do you offer examples of using deepdoc to parse PDF or other files?
#### 4.13 Do you offer examples of using deepdoc to parse PDF or other files?
Yes, we do. See the Python files under the **rag/app** folder.
#### 4.12 Why did I fail to upload a 10MB+ file to my locally deployed RAGFlow?
#### 4.14 Why did I fail to upload a 10MB+ file to my locally deployed RAGFlow?
You probably forgot to update the **MAX_CONTENT_LENGTH** environment variable:
@ -280,7 +296,7 @@ docker compose up ragflow -d
```
*Now you should be able to upload files of sizes less than 100MB.*
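For reference, a hedged sketch of the variable as it might appear in **docker/.env** (the exact location and value are assumptions; the value is in bytes, so this allows ~100MB uploads):
```bash
MAX_CONTENT_LENGTH=100000000
```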
#### 4.13 `Table 'rag_flow.document' doesn't exist`
#### 4.15 `Table 'rag_flow.document' doesn't exist`
This exception occurs when starting up the RAGFlow server. Try the following:
@ -303,7 +319,7 @@ This exception occurs when starting up the RAGFlow server. Try the following:
docker compose up
```
#### 4.14 `hint : 102 Fail to access model Connection error`
#### 4.16 `hint : 102 Fail to access model Connection error`
![hint102](https://github.com/infiniflow/ragflow/assets/93570324/6633d892-b4f8-49b5-9a0a-37a0a8fba3d2)
@ -311,6 +327,13 @@ This exception occurs when starting up the RAGFlow server. Try the following:
2. Do not forget to append **/v1/** to **http://IP:port**:
**http://IP:port/v1/**
#### 4.17 `FileNotFoundError: [Errno 2] No such file or directory`
1. Check if the status of your minio container is healthy:
```bash
docker ps
```
2. Ensure that the username and password settings of MySQL and MinIO in **docker/.env** are in line with those in **docker/service_conf.yml**.
## Usage
@ -340,10 +363,43 @@ You can use Ollama to deploy local LLM. See [here](https://github.com/infiniflow
### 6. How to configure RAGFlow to respond with 100% matched results, rather than utilizing LLM?
1. Click the **Knowledge Base** tab in the middle top of the page.
1. Click **Knowledge Base** in the middle top of the page.
2. Right click the desired knowledge base to display the **Configuration** dialogue.
3. Choose **Q&A** as the chunk method and click **Save** to confirm your change.
### Do I need to connect to Redis?
### 7 Do I need to connect to Redis?
No, connecting to Redis is not required to use RAGFlow.
No, connecting to Redis is not required.
### 8 `Error: Range of input length should be [1, 30000]`
This error occurs because too many chunks match your search criteria. Try reducing the **TopN** and/or raising the **Similarity threshold** to fix this issue:
1. Click **Chat** in the middle top of the page.
2. Right click the desired conversation > **Edit** > **Prompt Engine**
3. Reduce the **TopN** and/or raise the **Similarity threshold**.
4. Click **OK** to confirm your changes.
![topn](https://github.com/infiniflow/ragflow/assets/93570324/7ec72ab3-0dd2-4cff-af44-e2663b67b2fc)
### 9 How to update RAGFlow to the latest version?
1. Pull the latest source code
```bash
cd ragflow
git pull
```
2. If you used `docker compose up -d` to start up RAGFlow server:
```bash
docker pull infiniflow/ragflow:dev
```
```bash
docker compose up ragflow -d
```
3. If you used `docker compose -f docker-compose-CN.yml up -d` to start up RAGFlow server:
```bash
docker pull swr.cn-north-4.myhuaweicloud.com/infiniflow/ragflow:dev
```
```bash
docker compose -f docker-compose-CN.yml up -d
```

View File

@ -18,14 +18,14 @@ from io import BytesIO
from rag.nlp import bullets_category, is_english, tokenize, remove_contents_table, \
hierarchical_merge, make_colon_as_title, naive_merge, random_choices, tokenize_table, add_positions, \
tokenize_chunks, find_codec
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from deepdoc.parser import PdfParser, DocxParser, PlainParser
class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -63,9 +63,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
"""
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
pdf_parser = None
sections, tbls = [], []
if re.search(r"\.docx$", filename, re.IGNORECASE):
@ -91,7 +91,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
txt = ""
if binary:
encoding = find_codec(binary)
txt = binary.decode(encoding)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:

View File

@ -19,7 +19,7 @@ from docx import Document
from api.db import ParserType
from rag.nlp import bullets_category, is_english, tokenize, remove_contents_table, hierarchical_merge, \
make_colon_as_title, add_positions, tokenize_chunks, find_codec
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from deepdoc.parser import PdfParser, DocxParser, PlainParser
from rag.settings import cron_logger
@ -58,7 +58,7 @@ class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -89,9 +89,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
"""
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
pdf_parser = None
sections = []
if re.search(r"\.docx$", filename, re.IGNORECASE):
@ -113,7 +113,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
txt = ""
if binary:
encoding = find_codec(binary)
txt = binary.decode(encoding)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:

View File

@ -2,7 +2,7 @@ import copy
import re
from api.db import ParserType
from rag.nlp import huqie, tokenize, tokenize_table, add_positions, bullets_category, title_frequency, tokenize_chunks
from rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, title_frequency, tokenize_chunks
from deepdoc.parser import PdfParser, PlainParser
from rag.utils import num_tokens_from_string
@ -16,7 +16,7 @@ class Pdf(PdfParser):
to_page=100000, zoomin=3, callback=None):
from timeit import default_timer as timer
start = timer()
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -70,8 +70,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
doc = {
"docnm_kwd": filename
}
doc["title_tks"] = huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", doc["docnm_kwd"]))
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_tks"] = rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", doc["docnm_kwd"]))
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
# is it English
eng = lang.lower() == "english" # pdf_parser.is_english

View File

@ -16,7 +16,7 @@ from docx import Document
from timeit import default_timer as timer
import re
from deepdoc.parser.pdf_parser import PlainParser
from rag.nlp import huqie, naive_merge, tokenize_table, tokenize_chunks, find_codec
from rag.nlp import rag_tokenizer, naive_merge, tokenize_table, tokenize_chunks, find_codec
from deepdoc.parser import PdfParser, ExcelParser, DocxParser
from rag.settings import cron_logger
@ -69,7 +69,7 @@ class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
start = timer()
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -112,9 +112,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
"chunk_token_num": 128, "delimiter": "\n!?。;!?", "layout_recognize": True})
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
res = []
pdf_parser = None
sections = []
@ -141,7 +141,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
txt = ""
if binary:
encoding = find_codec(binary)
txt = binary.decode(encoding)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:

View File

@ -14,14 +14,14 @@ from tika import parser
from io import BytesIO
import re
from rag.app import laws
from rag.nlp import huqie, tokenize, find_codec
from rag.nlp import rag_tokenizer, tokenize, find_codec
from deepdoc.parser import PdfParser, ExcelParser, PlainParser
class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -85,7 +85,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
txt = ""
if binary:
encoding = find_codec(binary)
txt = binary.decode(encoding)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -111,9 +111,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
tokenize(doc, "\n".join(sections), eng)
return [doc]

View File

@ -15,7 +15,7 @@ import re
from collections import Counter
from api.db import ParserType
from rag.nlp import huqie, tokenize, tokenize_table, add_positions, bullets_category, title_frequency, tokenize_chunks
from rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, title_frequency, tokenize_chunks
from deepdoc.parser import PdfParser, PlainParser
import numpy as np
from rag.utils import num_tokens_from_string
@ -28,7 +28,7 @@ class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
@ -153,10 +153,10 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
else:
raise NotImplementedError("file type not supported yet(pdf supported)")
doc = {"docnm_kwd": filename, "authors_tks": huqie.qie(paper["authors"]),
"title_tks": huqie.qie(paper["title"] if paper["title"] else filename)}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["authors_sm_tks"] = huqie.qieqie(doc["authors_tks"])
doc = {"docnm_kwd": filename, "authors_tks": rag_tokenizer.tokenize(paper["authors"]),
"title_tks": rag_tokenizer.tokenize(paper["title"] if paper["title"] else filename)}
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
doc["authors_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["authors_tks"])
# is it English
eng = lang.lower() == "english" # pdf_parser.is_english
print("It's English.....", eng)

View File

@ -17,7 +17,7 @@ from io import BytesIO
from PIL import Image
from rag.nlp import tokenize, is_english
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from deepdoc.parser import PdfParser, PptParser, PlainParser
from PyPDF2 import PdfReader as pdf2_read
@ -58,7 +58,7 @@ class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
callback(msg="OCR is running...")
callback(msg="OCR is running...")
self.__images__(filename if not binary else binary,
zoomin, from_page, to_page, callback)
callback(0.8, "Page {}~{}: OCR finished".format(
@ -96,9 +96,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
eng = lang.lower() == "english"
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
res = []
if re.search(r"\.pptx?$", filename, re.IGNORECASE):
ppt_parser = Ppt()

View File

@ -16,7 +16,7 @@ from io import BytesIO
from nltk import word_tokenize
from openpyxl import load_workbook
from rag.nlp import is_english, random_choices, find_codec
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from deepdoc.parser import ExcelParser
@ -73,8 +73,8 @@ def beAdoc(d, q, a, eng):
aprefix = "Answer: " if eng else "回答:"
d["content_with_weight"] = "\t".join(
[qprefix + rmPrefix(q), aprefix + rmPrefix(a)])
d["content_ltks"] = huqie.qie(q)
d["content_sm_ltks"] = huqie.qieqie(d["content_ltks"])
d["content_ltks"] = rag_tokenizer.tokenize(q)
d["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(d["content_ltks"])
return d
@ -94,7 +94,7 @@ def chunk(filename, binary=None, lang="Chinese", callback=None, **kwargs):
res = []
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
if re.search(r"\.xlsx?$", filename, re.IGNORECASE):
callback(0.1, "Start to parse.")
@ -107,7 +107,7 @@ def chunk(filename, binary=None, lang="Chinese", callback=None, **kwargs):
txt = ""
if binary:
encoding = find_codec(binary)
txt = binary.decode(encoding)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -116,18 +116,31 @@ def chunk(filename, binary=None, lang="Chinese", callback=None, **kwargs):
break
txt += l
lines = txt.split("\n")
#is_english([rmPrefix(l) for l in lines[:100]])
comma, tab = 0, 0
for l in lines:
if len(l.split(",")) == 2: comma += 1
if len(l.split("\t")) == 2: tab += 1
delimiter = "\t" if tab >= comma else ","
fails = []
for i, line in enumerate(lines):
arr = [l for l in line.split("\t") if len(l) > 1]
question, answer = "", ""
i = 0
while i < len(lines):
arr = lines[i].split(delimiter)
if len(arr) != 2:
fails.append(str(i))
continue
res.append(beAdoc(deepcopy(doc), arr[0], arr[1], eng))
if question: answer += "\n" + lines[i]
else:
fails.append(str(i+1))
elif len(arr) == 2:
if question and answer: res.append(beAdoc(deepcopy(doc), question, answer, eng))
question, answer = arr
i += 1
if len(res) % 999 == 0:
callback(len(res) * 0.6 / len(lines), ("Extract Q&A: {}".format(len(res)) + (
f"{len(fails)} failure, line: %s..." % (",".join(fails[:3])) if fails else "")))
if question: res.append(beAdoc(deepcopy(doc), question, answer, eng))
callback(0.6, ("Extract Q&A: {}".format(len(res)) + (
f"{len(fails)} failure, line: %s..." % (",".join(fails[:3])) if fails else "")))

View File

@ -18,7 +18,7 @@ import re
import pandas as pd
import requests
from api.db.services.knowledgebase_service import KnowledgebaseService
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from deepdoc.parser.resume import refactor
from deepdoc.parser.resume import step_one, step_two
from rag.settings import cron_logger
@ -131,9 +131,9 @@ def chunk(filename, binary=None, callback=None, **kwargs):
titles.append(str(v))
doc = {
"docnm_kwd": filename,
"title_tks": huqie.qie("-".join(titles) + "-简历")
"title_tks": rag_tokenizer.tokenize("-".join(titles) + "-简历")
}
doc["title_sm_tks"] = huqie.qieqie(doc["title_tks"])
doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
pairs = []
for n, m in field_map.items():
if not resume.get(n):
@ -147,8 +147,8 @@ def chunk(filename, binary=None, callback=None, **kwargs):
doc["content_with_weight"] = "\n".join(
["{}: {}".format(re.sub(r"[^]+", "", k), v) for k, v in pairs])
doc["content_ltks"] = huqie.qie(doc["content_with_weight"])
doc["content_sm_ltks"] = huqie.qieqie(doc["content_ltks"])
doc["content_ltks"] = rag_tokenizer.tokenize(doc["content_with_weight"])
doc["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(doc["content_ltks"])
for n, _ in field_map.items():
if n not in resume:
continue
@ -156,7 +156,7 @@ def chunk(filename, binary=None, callback=None, **kwargs):
len(resume[n]) == 1 or n not in forbidden_select_fields4resume):
resume[n] = resume[n][0]
if n.find("_tks") > 0:
resume[n] = huqie.qieqie(resume[n])
resume[n] = rag_tokenizer.fine_grained_tokenize(resume[n])
doc[n] = resume[n]
print(doc)

View File

@ -20,7 +20,7 @@ from openpyxl import load_workbook
from dateutil.parser import parse as datetime_parse
from api.db.services.knowledgebase_service import KnowledgebaseService
from rag.nlp import huqie, is_english, tokenize, find_codec
from rag.nlp import rag_tokenizer, is_english, tokenize, find_codec
from deepdoc.parser import ExcelParser
@ -47,6 +47,7 @@ class Excel(ExcelParser):
cell.value for i,
cell in enumerate(
rows[0]) if i not in missed]
if not headers:continue
data = []
for i, r in enumerate(rows[1:]):
rn += 1
@ -148,7 +149,7 @@ def chunk(filename, binary=None, from_page=0, to_page=10000000000,
txt = ""
if binary:
encoding = find_codec(binary)
txt = binary.decode(encoding)
txt = binary.decode(encoding, errors="ignore")
else:
with open(filename, "r") as f:
while True:
@ -216,7 +217,7 @@ def chunk(filename, binary=None, from_page=0, to_page=10000000000,
for ii, row in df.iterrows():
d = {
"docnm_kwd": filename,
"title_tks": huqie.qie(re.sub(r"\.[a-zA-Z]+$", "", filename))
"title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))
}
row_txt = []
for j in range(len(clmns)):
@ -227,7 +228,7 @@ def chunk(filename, binary=None, from_page=0, to_page=10000000000,
if pd.isna(row[clmns[j]]):
continue
fld = clmns_map[j][0]
d[fld] = row[clmns[j]] if clmn_tys[j] != "text" else huqie.qie(
d[fld] = row[clmns[j]] if clmn_tys[j] != "text" else rag_tokenizer.tokenize(
row[clmns[j]])
row_txt.append("{}:{}".format(clmns[j], row[clmns[j]]))
if not row_txt:

View File

@ -22,7 +22,7 @@ EmbeddingModel = {
"Ollama": OllamaEmbed,
"OpenAI": OpenAIEmbed,
"Xinference": XinferenceEmbed,
"Tongyi-Qianwen": HuEmbedding, #QWenEmbed,
"Tongyi-Qianwen": DefaultEmbedding, #QWenEmbed,
"ZHIPU-AI": ZhipuEmbed,
"FastEmbed": FastEmbed,
"Youdao": YoudaoEmbed
@ -45,6 +45,7 @@ ChatModel = {
"Tongyi-Qianwen": QWenChat,
"Ollama": OllamaChat,
"Xinference": XinferenceChat,
"Moonshot": MoonshotChat
"Moonshot": MoonshotChat,
"DeepSeek": DeepSeekChat
}
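With the new entry registered, provider selection stays a plain dictionary lookup. A hedged usage sketch (import path follows this file's module, rag.llm; the key and model name are placeholders):

from rag.llm import ChatModel

mdl = ChatModel["DeepSeek"](key="sk-...", model_name="deepseek-chat")
answer, used_tokens = mdl.chat(
    system="You are terse.",
    history=[{"role": "user", "content": "ping"}],
    gen_conf={"temperature": 0.1},
)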

View File

@ -24,16 +24,7 @@ from rag.utils import num_tokens_from_string
class Base(ABC):
def __init__(self, key, model_name):
pass
def chat(self, system, history, gen_conf):
raise NotImplementedError("Please implement encode method!")
class GptTurbo(Base):
def __init__(self, key, model_name="gpt-3.5-turbo", base_url="https://api.openai.com/v1"):
if not base_url: base_url="https://api.openai.com/v1"
def __init__(self, key, model_name, base_url):
self.client = OpenAI(api_key=key, base_url=base_url)
self.model_name = model_name
@ -54,28 +45,28 @@ class GptTurbo(Base):
return "**ERROR**: " + str(e), 0
class MoonshotChat(GptTurbo):
class GptTurbo(Base):
def __init__(self, key, model_name="gpt-3.5-turbo", base_url="https://api.openai.com/v1"):
if not base_url: base_url="https://api.openai.com/v1"
super().__init__(key, model_name, base_url)
class MoonshotChat(Base):
def __init__(self, key, model_name="moonshot-v1-8k", base_url="https://api.moonshot.cn/v1"):
if not base_url: base_url="https://api.moonshot.cn/v1"
self.client = OpenAI(
api_key=key, base_url=base_url)
self.model_name = model_name
super().__init__(key, model_name, base_url)
def chat(self, system, history, gen_conf):
if system:
history.insert(0, {"role": "system", "content": system})
try:
response = self.client.chat.completions.create(
model=self.model_name,
messages=history,
**gen_conf)
ans = response.choices[0].message.content.strip()
if response.choices[0].finish_reason == "length":
ans += "...\nFor the content length reason, it stopped, continue?" if is_english(
[ans]) else "······\n由于长度的原因,回答被截断了,要继续吗?"
return ans, response.usage.total_tokens
except openai.APIError as e:
return "**ERROR**: " + str(e), 0
class XinferenceChat(Base):
def __init__(self, key=None, model_name="", base_url=""):
key = "xxx"
super().__init__(key, model_name, base_url)
class DeepSeekChat(Base):
def __init__(self, key, model_name="deepseek-chat", base_url="https://api.deepseek.com/v1"):
if not base_url: base_url="https://api.deepseek.com/v1"
super().__init__(key, model_name, base_url)
class QWenChat(Base):
@ -141,12 +132,12 @@ class OllamaChat(Base):
if system:
history.insert(0, {"role": "system", "content": system})
try:
options = {"temperature": gen_conf.get("temperature", 0.1),
"num_predict": gen_conf.get("max_tokens", 128),
"top_k": gen_conf.get("top_p", 0.3),
"presence_penalty": gen_conf.get("presence_penalty", 0.4),
"frequency_penalty": gen_conf.get("frequency_penalty", 0.7),
}
options = {}
if "temperature" in gen_conf: options["temperature"] = gen_conf["temperature"]
if "max_tokens" in gen_conf: options["num_predict"] = gen_conf["max_tokens"]
if "top_p" in gen_conf: options["top_k"] = gen_conf["top_p"]
if "presence_penalty" in gen_conf: options["presence_penalty"] = gen_conf["presence_penalty"]
if "frequency_penalty" in gen_conf: options["frequency_penalty"] = gen_conf["frequency_penalty"]
response = self.client.chat(
model=self.model_name,
messages=history,
@ -157,25 +148,3 @@ class OllamaChat(Base):
except Exception as e:
return "**ERROR**: " + str(e), 0
class XinferenceChat(Base):
def __init__(self, key=None, model_name="", base_url=""):
self.client = OpenAI(api_key="xxx", base_url=base_url)
self.model_name = model_name
def chat(self, system, history, gen_conf):
if system:
history.insert(0, {"role": "system", "content": system})
try:
response = self.client.chat.completions.create(
model=self.model_name,
messages=history,
**gen_conf)
ans = response.choices[0].message.content.strip()
if response.choices[0].finish_reason == "length":
ans += "...\nFor the content length reason, it stopped, continue?" if is_english(
[ans]) else "······\n由于长度的原因,回答被截断了,要继续吗?"
return ans, response.usage.total_tokens
except openai.APIError as e:
return "**ERROR**: " + str(e), 0

View File

@ -26,19 +26,17 @@ from FlagEmbedding import FlagModel
import torch
import numpy as np
from api.utils.file_utils import get_project_base_directory
from api.utils.file_utils import get_project_base_directory, get_home_cache_dir
from rag.utils import num_tokens_from_string
try:
flag_model = FlagModel(os.path.join(
get_project_base_directory(),
"rag/res/bge-large-zh-v1.5"),
flag_model = FlagModel(os.path.join(get_home_cache_dir(), "bge-large-zh-v1.5"),
query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
use_fp16=torch.cuda.is_available())
except Exception as e:
model_dir = snapshot_download(repo_id="BAAI/bge-large-zh-v1.5",
local_dir=os.path.join(get_project_base_directory(), "rag/res/bge-large-zh-v1.5"),
local_dir=os.path.join(get_home_cache_dir(), "bge-large-zh-v1.5"),
local_dir_use_symlinks=False)
flag_model = FlagModel(model_dir,
query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
@ -56,7 +54,7 @@ class Base(ABC):
raise NotImplementedError("Please implement encode method!")
class HuEmbedding(Base):
class DefaultEmbedding(Base):
def __init__(self, *args, **kwargs):
"""
If you have trouble downloading HuggingFace models, -_^ this might help!!
@ -97,8 +95,7 @@ class OpenAIEmbed(Base):
def encode(self, texts: list, batch_size=32):
res = self.client.embeddings.create(input=texts,
model=self.model_name)
return np.array([d.embedding for d in res.data]
), res.usage.total_tokens
return np.array([d.embedding for d in res.data]), res.usage.total_tokens
def encode_queries(self, text):
res = self.client.embeddings.create(input=[text],
@ -238,8 +235,8 @@ class YoudaoEmbed(Base):
try:
print("LOADING BCE...")
YoudaoEmbed._client = qanthing(model_name_or_path=os.path.join(
get_project_base_directory(),
"rag/res/bce-embedding-base_v1"))
get_home_cache_dir(),
"bce-embedding-base_v1"))
except Exception as e:
YoudaoEmbed._client = qanthing(
model_name_or_path=model_name.replace(

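Both the default BGE model and Youdao's BCE model now load from a per-user cache directory instead of the repository tree, falling back to a HuggingFace download on first run. A hedged sketch of that load-or-download pattern (assumes get_home_cache_dir() resolves to a writable directory such as ~/.ragflow):

import os
from huggingface_hub import snapshot_download
from FlagEmbedding import FlagModel

def load_default_embedding(cache_dir: str) -> FlagModel:
    local = os.path.join(cache_dir, "bge-large-zh-v1.5")
    try:
        return FlagModel(local, use_fp16=False)   # cache hit: load locally
    except Exception:
        # first run: fetch the weights into the cache, then load from there
        path = snapshot_download(repo_id="BAAI/bge-large-zh-v1.5",
                                 local_dir=local,
                                 local_dir_use_symlinks=False)
        return FlagModel(path, use_fp16=False)

Keeping weights out of the repo tree also explains the HuEmbedding → DefaultEmbedding rename: it is the built-in default, not a vendor-specific model.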
View File

@ -2,7 +2,7 @@ import random
from collections import Counter
from rag.utils import num_tokens_from_string
from . import huqie
from . import rag_tokenizer
import re
import copy
@ -28,11 +28,17 @@ all_codecs = [
def find_codec(blob):
global all_codecs
for c in all_codecs:
try:
blob[:1024].decode(c)
return c
except Exception as e:
pass
try:
blob.decode(c)
return c
except Exception as e:
pass
return "utf-8"
@ -109,8 +115,8 @@ def is_english(texts):
def tokenize(d, t, eng):
d["content_with_weight"] = t
t = re.sub(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>", " ", t)
d["content_ltks"] = huqie.qie(t)
d["content_sm_ltks"] = huqie.qieqie(d["content_ltks"])
d["content_ltks"] = rag_tokenizer.tokenize(t)
d["content_sm_ltks"] = rag_tokenizer.fine_grained_tokenize(d["content_ltks"])
def tokenize_chunks(chunks, doc, eng, pdf_parser):

View File

@ -1,475 +0,0 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import re
import os
import copy
import base64
import magic
from dataclasses import dataclass
from typing import List
import numpy as np
from io import BytesIO
class HuChunker:
@dataclass
class Fields:
text_chunks: List = None
table_chunks: List = None
def __init__(self):
self.MAX_LVL = 12
self.proj_patt = [
(r"第[零一二三四五六七八九十百]+章", 1),
(r"第[零一二三四五六七八九十百]+[条节]", 2),
(r"[零一二三四五六七八九十百]+[、  ]", 3),
(r"[\(][零一二三四五六七八九十百]+[\)]", 4),
(r"[0-9]+(、|\.[  ]|\.[^0-9])", 5),
(r"[0-9]+\.[0-9]+(、|[  ]|[^0-9])", 6),
(r"[0-9]+\.[0-9]+\.[0-9]+(、|[  ]|[^0-9])", 7),
(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(、|[  ]|[^0-9])", 8),
(r".{,48}[:?]@", 9),
(r"[0-9]+", 10),
(r"[\(][0-9]+[\)]", 11),
(r"[零一二三四五六七八九十百]+是", 12),
(r"[⚫•➢✓ ]", 12)
]
self.lines = []
def _garbage(self, txt):
patt = [
r"(在此保证|不得以任何形式翻版|请勿传阅|仅供内部使用|未经事先书面授权)",
r"(版权(归本公司)*所有|免责声明|保留一切权力|承担全部责任|特别声明|报告中涉及)",
r"(不承担任何责任|投资者的通知事项:|任何机构和个人|本报告仅为|不构成投资)",
r"(不构成对任何个人或机构投资建议|联系其所在国家|本报告由从事证券交易)",
r"(本研究报告由|「认可投资者」|所有研究报告均以|请发邮件至)",
r"(本报告仅供|市场有风险,投资需谨慎|本报告中提及的)",
r"(本报告反映|此信息仅供|证券分析师承诺|具备证券投资咨询业务资格)",
r"^(时间|签字|签章)[:]",
r"(参考文献|目录索引|图表索引)",
r"[ ]*年[ ]+月[ ]+日",
r"^(中国证券业协会|[0-9]+年[0-9]+月[0-9]+日)$",
r"\.{10,}",
r"(———————END|帮我转发|欢迎收藏|快来关注我吧)"
]
return any([re.search(p, txt) for p in patt])
def _proj_match(self, line):
for p, j in self.proj_patt:
if re.match(p, line):
return j
return
def _does_proj_match(self):
mat = [None for _ in range(len(self.lines))]
for i in range(len(self.lines)):
mat[i] = self._proj_match(self.lines[i])
return mat
def naive_text_chunk(self, text, ti="", MAX_LEN=612):
if text:
self.lines = [l.strip().replace(u'\u3000', u' ')
.replace(u'\xa0', u'')
for l in text.split("\n\n")]
self.lines = [l for l in self.lines if not self._garbage(l)]
self.lines = [re.sub(r"([ ]+|&nbsp;)", " ", l)
for l in self.lines if l]
if not self.lines:
return []
arr = self.lines
res = [""]
i = 0
while i < len(arr):
a = arr[i]
if not a:
i += 1
continue
if len(a) > MAX_LEN:
a_ = a.split("\n")
if len(a_) >= 2:
arr.pop(i)
for j in range(2, len(a_) + 1):
if len("\n".join(a_[:j])) >= MAX_LEN:
arr.insert(i, "\n".join(a_[:j - 1]))
arr.insert(i + 1, "\n".join(a_[j - 1:]))
break
else:
assert False, f"Can't split: {a}"
continue
if len(res[-1]) < MAX_LEN / 3:
res[-1] += "\n" + a
else:
res.append(a)
i += 1
if ti:
for i in range(len(res)):
if res[i].find("——来自") >= 0:
continue
res[i] += f"\t——来自“{ti}"
return res
def _merge(self):
# merge continuous same level text
lines = [self.lines[0]] if self.lines else []
for i in range(1, len(self.lines)):
if self.mat[i] == self.mat[i - 1] \
and len(lines[-1]) < 256 \
and len(self.lines[i]) < 256:
lines[-1] += "\n" + self.lines[i]
continue
lines.append(self.lines[i])
self.lines = lines
self.mat = self._does_proj_match()
return self.mat
def text_chunks(self, text):
if text:
self.lines = [l.strip().replace(u'\u3000', u' ')
.replace(u'\xa0', u'')
for l in re.split(r"[\r\n]", text)]
self.lines = [l for l in self.lines if not self._garbage(l)]
self.lines = [l for l in self.lines if l]
self.mat = self._does_proj_match()
mat = self._merge()
tree = []
for i in range(len(self.lines)):
tree.append({"proj": mat[i],
"children": [],
"read": False})
# find all children
for i in range(len(self.lines) - 1):
if tree[i]["proj"] is None:
continue
ed = i + 1
while ed < len(tree) and (tree[ed]["proj"] is None or
tree[ed]["proj"] > tree[i]["proj"]):
ed += 1
nxt = tree[i]["proj"] + 1
st = set([p["proj"] for p in tree[i + 1: ed] if p["proj"]])
while nxt not in st:
nxt += 1
if nxt > self.MAX_LVL:
break
if nxt <= self.MAX_LVL:
for j in range(i + 1, ed):
if tree[j]["proj"] is not None:
break
tree[i]["children"].append(j)
for j in range(i + 1, ed):
if tree[j]["proj"] != nxt:
continue
tree[i]["children"].append(j)
else:
for j in range(i + 1, ed):
tree[i]["children"].append(j)
# get DFS combinations, find all the paths to leaf
paths = []
def dfs(i, path):
nonlocal tree, paths
path.append(i)
tree[i]["read"] = True
if len(self.lines[i]) > 256:
paths.append(path)
return
if not tree[i]["children"]:
if len(path) > 1 or len(self.lines[i]) >= 32:
paths.append(path)
return
for j in tree[i]["children"]:
dfs(j, copy.deepcopy(path))
for i, t in enumerate(tree):
if t["read"]:
continue
dfs(i, [])
# concat txt on the path for all paths
res = []
lines = np.array(self.lines)
for p in paths:
if len(p) < 2:
tree[p[0]]["read"] = False
continue
txt = "\n".join(lines[p[:-1]]) + "\n" + lines[p[-1]]
res.append(txt)
# concat continuous orphans
assert len(tree) == len(lines)
ii = 0
while ii < len(tree):
if tree[ii]["read"]:
ii += 1
continue
txt = lines[ii]
e = ii + 1
while e < len(tree) and not tree[e]["read"] and len(txt) < 256:
txt += "\n" + lines[e]
e += 1
res.append(txt)
ii = e
# if the node has not been read, find its daddy
def find_daddy(st):
nonlocal lines, tree
proj = tree[st]["proj"]
if len(self.lines[st]) > 512:
return [st]
if proj is None:
proj = self.MAX_LVL + 1
for i in range(st - 1, -1, -1):
if tree[i]["proj"] and tree[i]["proj"] < proj:
a = [st] + find_daddy(i)
return a
return []
return res
class PdfChunker(HuChunker):
def __init__(self, pdf_parser):
self.pdf = pdf_parser
super().__init__()
def tableHtmls(self, pdfnm):
_, tbls = self.pdf(pdfnm, return_html=True)
res = []
for img, arr in tbls:
if arr[0].find("<table>") < 0:
continue
buffered = BytesIO()
if img:
img.save(buffered, format="JPEG")
img_str = base64.b64encode(
buffered.getvalue()).decode('utf-8') if img else ""
res.append({"table": arr[0], "image": img_str})
return res
def html(self, pdfnm):
txts, tbls = self.pdf(pdfnm, return_html=True)
res = []
txt_cks = self.text_chunks(txts)
for txt, img in [(self.pdf.remove_tag(c), self.pdf.crop(c))
for c in txt_cks]:
buffered = BytesIO()
if img:
img.save(buffered, format="JPEG")
img_str = base64.b64encode(
buffered.getvalue()).decode('utf-8') if img else ""
res.append({"table": "<p>%s</p>" % txt.replace("\n", "<br/>"),
"image": img_str})
for img, arr in tbls:
if not arr:
continue
buffered = BytesIO()
if img:
img.save(buffered, format="JPEG")
img_str = base64.b64encode(
buffered.getvalue()).decode('utf-8') if img else ""
res.append({"table": arr[0], "image": img_str})
return res
def __call__(self, pdfnm, return_image=True, naive_chunk=False):
flds = self.Fields()
text, tbls = self.pdf(pdfnm)
fnm = pdfnm
txt_cks = self.text_chunks(text) if not naive_chunk else \
self.naive_text_chunk(text, ti=fnm if isinstance(fnm, str) else "")
flds.text_chunks = [(self.pdf.remove_tag(c),
self.pdf.crop(c) if return_image else None) for c in txt_cks]
flds.table_chunks = [(arr, img if return_image else None)
for img, arr in tbls]
return flds
class DocxChunker(HuChunker):
def __init__(self, doc_parser):
self.doc = doc_parser
super().__init__()
def _does_proj_match(self):
mat = []
for s in self.styles:
s = s.split(" ")[-1]
try:
mat.append(int(s))
except Exception as e:
mat.append(None)
return mat
def _merge(self):
i = 1
while i < len(self.lines):
if self.mat[i] == self.mat[i - 1] \
and len(self.lines[i - 1]) < 256 \
and len(self.lines[i]) < 256:
self.lines[i - 1] += "\n" + self.lines[i]
self.styles.pop(i)
self.lines.pop(i)
self.mat.pop(i)
continue
i += 1
self.mat = self._does_proj_match()
return self.mat
def __call__(self, fnm):
flds = self.Fields()
flds.title = os.path.splitext(
os.path.basename(fnm))[0] if isinstance(
fnm, type("")) else ""
secs, tbls = self.doc(fnm)
self.lines = [l for l, s in secs]
self.styles = [s for l, s in secs]
txt_cks = self.text_chunks("")
flds.text_chunks = [(t, None) for t in txt_cks if not self._garbage(t)]
flds.table_chunks = [(tb, None) for tb in tbls for t in tb if t]
return flds
class ExcelChunker(HuChunker):
def __init__(self, excel_parser):
self.excel = excel_parser
super().__init__()
def __call__(self, fnm):
flds = self.Fields()
flds.text_chunks = [(t, None) for t in self.excel(fnm)]
flds.table_chunks = []
return flds
class PptChunker(HuChunker):
def __init__(self):
super().__init__()
def __extract(self, shape):
if shape.shape_type == 19:
tb = shape.table
rows = []
for i in range(1, len(tb.rows)):
rows.append("; ".join([tb.cell(
0, j).text + ": " + tb.cell(i, j).text for j in range(len(tb.columns)) if tb.cell(i, j)]))
return "\n".join(rows)
if shape.has_text_frame:
return shape.text_frame.text
if shape.shape_type == 6:
texts = []
for p in shape.shapes:
t = self.__extract(p)
if t:
texts.append(t)
return "\n".join(texts)
def __call__(self, fnm):
from pptx import Presentation
ppt = Presentation(fnm) if isinstance(
fnm, str) else Presentation(
BytesIO(fnm))
txts = []
for slide in ppt.slides:
texts = []
for shape in slide.shapes:
txt = self.__extract(shape)
if txt:
texts.append(txt)
txts.append("\n".join(texts))
import aspose.slides as slides
import aspose.pydrawing as drawing
imgs = []
with slides.Presentation(BytesIO(fnm)) as presentation:
for slide in presentation.slides:
buffered = BytesIO()
slide.get_thumbnail(
0.5, 0.5).save(
buffered, drawing.imaging.ImageFormat.jpeg)
imgs.append(buffered.getvalue())
assert len(imgs) == len(
txts), "Slides text and image do not match: {} vs. {}".format(len(imgs), len(txts))
flds = self.Fields()
flds.text_chunks = [(txts[i], imgs[i]) for i in range(len(txts))]
flds.table_chunks = []
return flds
class TextChunker(HuChunker):
@dataclass
class Fields:
text_chunks: List = None
table_chunks: List = None
def __init__(self):
super().__init__()
@staticmethod
def is_binary_file(file_path):
mime = magic.Magic(mime=True)
if isinstance(file_path, str):
file_type = mime.from_file(file_path)
else:
file_type = mime.from_buffer(file_path)
if 'text' in file_type:
return False
else:
return True
def __call__(self, fnm):
flds = self.Fields()
if self.is_binary_file(fnm):
return flds
txt = ""
if isinstance(fnm, str):
with open(fnm, "r") as f:
txt = f.read()
else:
txt = fnm.decode("utf-8")
flds.text_chunks = [(c, None) for c in self.naive_text_chunk(txt)]
flds.table_chunks = []
return flds
if __name__ == "__main__":
import sys
sys.path.append(os.path.dirname(__file__) + "/../")
if sys.argv[1].split(".")[-1].lower() == "pdf":
from deepdoc.parser import PdfParser
ckr = PdfChunker(PdfParser())
if sys.argv[1].split(".")[-1].lower().find("doc") >= 0:
from deepdoc.parser import DocxParser
ckr = DocxChunker(DocxParser())
if sys.argv[1].split(".")[-1].lower().find("xlsx") >= 0:
from deepdoc.parser import ExcelParser
ckr = ExcelChunker(ExcelParser())
# ckr.html(sys.argv[1])
print(ckr(sys.argv[1]))
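The legacy HuChunker family is deleted outright, its callers having moved to the app-specific chunkers earlier in this diff. Its central idea was mapping each line to a heading level through an ordered regex table, then treating the document as a tree of levels; for reference, the matcher reduced to roughly this (patterns abbreviated from the removed list):

import re

PROJ_PATT = [
    (r"第[零一二三四五六七八九十百]+章", 1),       # chapter
    (r"第[零一二三四五六七八九十百]+[条节]", 2),    # article / section
    (r"[0-9]+(、|\.[^0-9])", 5),                  # numbered item
]

def proj_match(line: str):
    for patt, level in PROJ_PATT:
        if re.match(patt, line):
            return level
    return None   # plain body text, attached to the nearest heading above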

View File

@ -7,14 +7,13 @@ import logging
import copy
from elasticsearch_dsl import Q
from rag.nlp import huqie, term_weight, synonym
from rag.nlp import rag_tokenizer, term_weight, synonym
class EsQueryer:
def __init__(self, es):
self.tw = term_weight.Dealer()
self.es = es
self.syn = synonym.Dealer(None)
self.syn = synonym.Dealer()
self.flds = ["ask_tks^10", "ask_small_tks"]
@staticmethod
@ -47,13 +46,13 @@ class EsQueryer:
txt = re.sub(
r"[ \r\n\t,,。??/`!&]+",
" ",
huqie.tradi2simp(
huqie.strQ2B(
rag_tokenizer.tradi2simp(
rag_tokenizer.strQ2B(
txt.lower()))).strip()
txt = EsQueryer.rmWWW(txt)
if not self.isChinese(txt):
tks = huqie.qie(txt).split(" ")
tks = rag_tokenizer.tokenize(txt).split(" ")
q = copy.deepcopy(tks)
for i in range(1, len(tks)):
q.append("\"%s %s\"^2" % (tks[i - 1], tks[i]))
@ -65,7 +64,7 @@ class EsQueryer:
boost=1)#, minimum_should_match=min_match)
), tks
def needQieqie(tk):
def need_fine_grained_tokenize(tk):
if len(tk) < 4:
return False
if re.match(r"[0-9a-z\.\+#_\*-]+$", tk):
@ -81,7 +80,7 @@ class EsQueryer:
logging.info(json.dumps(twts, ensure_ascii=False))
tms = []
for tk, w in sorted(twts, key=lambda x: x[1] * -1):
sm = huqie.qieqie(tk).split(" ") if needQieqie(tk) else []
sm = rag_tokenizer.fine_grained_tokenize(tk).split(" ") if need_fine_grained_tokenize(tk) else []
sm = [
re.sub(
r"[ ,\./;'\[\]\\`~!@#$%\^&\*\(\)=\+_<>\?:\"\{\}\|,。;‘’【】、!¥……()——《》?:“”-]+",
@ -110,10 +109,10 @@ class EsQueryer:
if len(twts) > 1:
tms += f" (\"%s\"~4)^1.5" % (" ".join([t for t, _ in twts]))
if re.match(r"[0-9a-z ]+$", tt):
tms = f"(\"{tt}\" OR \"%s\")" % huqie.qie(tt)
tms = f"(\"{tt}\" OR \"%s\")" % rag_tokenizer.tokenize(tt)
syns = " OR ".join(
["\"%s\"^0.7" % EsQueryer.subSpecialChar(huqie.qie(s)) for s in syns])
["\"%s\"^0.7" % EsQueryer.subSpecialChar(rag_tokenizer.tokenize(s)) for s in syns])
if syns:
tms = f"({tms})^5 OR ({syns})^0.7"

View File

@ -14,7 +14,7 @@ from nltk.stem import PorterStemmer, WordNetLemmatizer
from api.utils.file_utils import get_project_base_directory
class Huqie:
class RagTokenizer:
def key_(self, line):
return str(line.lower().encode("utf-8"))[2:-1]
@ -241,7 +241,7 @@ class Huqie:
return self.score_(res[::-1])
def qie(self, line):
def tokenize(self, line):
line = self._strQ2B(line).lower()
line = self._tradi2simp(line)
zh_num = len([1 for c in line if is_chinese(c)])
@ -298,7 +298,7 @@ class Huqie:
print("[TKS]", self.merge_(res))
return self.merge_(res)
def qieqie(self, tks):
def fine_grained_tokenize(self, tks):
tks = tks.split(" ")
zh_num = len([1 for c in tks if c and is_chinese(c[0])])
if zh_num < len(tks) * 0.2:
@ -371,53 +371,53 @@ def naiveQie(txt):
return tks
hq = Huqie()
qie = hq.qie
qieqie = hq.qieqie
tag = hq.tag
freq = hq.freq
loadUserDict = hq.loadUserDict
addUserDict = hq.addUserDict
tradi2simp = hq._tradi2simp
strQ2B = hq._strQ2B
tokenizer = RagTokenizer()
tokenize = tokenizer.tokenize
fine_grained_tokenize = tokenizer.fine_grained_tokenize
tag = tokenizer.tag
freq = tokenizer.freq
loadUserDict = tokenizer.loadUserDict
addUserDict = tokenizer.addUserDict
tradi2simp = tokenizer._tradi2simp
strQ2B = tokenizer._strQ2B
if __name__ == '__main__':
huqie = Huqie(debug=True)
tknzr = RagTokenizer(debug=True)
# huqie.addUserDict("/tmp/tmp.new.tks.dict")
tks = huqie.qie(
tks = tknzr.tokenize(
"哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"公开征求意见稿提出,境外投资者可使用自有人民币或外汇投资。使用外汇投资的,可通过债券持有人在香港人民币业务清算行及香港地区经批准可进入境内银行间外汇市场进行交易的境外人民币业务参加行(以下统称香港结算行)办理外汇资金兑换。香港结算行由此所产生的头寸可到境内银行间外汇市场平盘。使用外汇投资的,在其投资的债券到期或卖出后,原则上应兑换回外汇。")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"多校划片就是一个小区对应多个小学初中,让买了学区房的家庭也不确定到底能上哪个学校。目的是通过这种方式为学区房降温,把就近入学落到实处。南京市长江大桥")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"实际上当时他们已经将业务中心偏移到安全部门和针对政府企业的部门 Scripts are compiled and cached aaaaaaaaa")
print(huqie.qieqie(tks))
tks = huqie.qie("虽然我不怎么玩")
print(huqie.qieqie(tks))
tks = huqie.qie("蓝月亮如何在外资夹击中生存,那是全宇宙最有意思的")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize("虽然我不怎么玩")
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize("蓝月亮如何在外资夹击中生存,那是全宇宙最有意思的")
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"涡轮增压发动机num最大功率,不像别的共享买车锁电子化的手段,我们接过来是否有意义,黄黄爱美食,不过,今天阿奇要讲到的这家农贸市场,说实话,还真蛮有特色的!不仅环境好,还打出了")
print(huqie.qieqie(tks))
tks = huqie.qie("这周日你去吗?这周日你有空吗?")
print(huqie.qieqie(tks))
tks = huqie.qie("Unity3D开发经验 测试开发工程师 c++双11双11 985 211 ")
print(huqie.qieqie(tks))
tks = huqie.qie(
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize("这周日你去吗?这周日你有空吗?")
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize("Unity3D开发经验 测试开发工程师 c++双11双11 985 211 ")
print(tknzr.fine_grained_tokenize(tks))
tks = tknzr.tokenize(
"数据分析项目经理|数据分析挖掘|数据分析方向|商品数据分析|搜索数据分析 sql python hive tableau Cocos2d-")
print(huqie.qieqie(tks))
print(tknzr.fine_grained_tokenize(tks))
if len(sys.argv) < 2:
sys.exit()
huqie.DEBUG = False
huqie.loadUserDict(sys.argv[1])
tknzr.DEBUG = False
tknzr.loadUserDict(sys.argv[1])
of = open(sys.argv[2], "r")
while True:
line = of.readline()
if not line:
break
print(huqie.qie(line))
print(tknzr.tokenize(line))
of.close()

View File

@ -9,7 +9,7 @@ from dataclasses import dataclass
from rag.settings import es_logger
from rag.utils import rmSpace
from rag.nlp import huqie, query
from rag.nlp import rag_tokenizer, query
import numpy as np
@ -128,7 +128,7 @@ class Dealer:
kwds = set([])
for k in keywords:
kwds.add(k)
for kk in huqie.qieqie(k).split(" "):
for kk in rag_tokenizer.fine_grained_tokenize(k).split(" "):
if len(kk) < 2:
continue
if kk in kwds:
@ -243,7 +243,7 @@ class Dealer:
assert len(ans_v[0]) == len(chunk_v[0]), "The dimension of query and chunk do not match: {} vs. {}".format(
len(ans_v[0]), len(chunk_v[0]))
chunks_tks = [huqie.qie(self.qryr.rmWWW(ck)).split(" ")
chunks_tks = [rag_tokenizer.tokenize(self.qryr.rmWWW(ck)).split(" ")
for ck in chunks]
cites = {}
thr = 0.63
@ -251,7 +251,7 @@ class Dealer:
for i, a in enumerate(pieces_):
sim, tksim, vtsim = self.qryr.hybrid_similarity(ans_v[i],
chunk_v,
huqie.qie(
rag_tokenizer.tokenize(
self.qryr.rmWWW(pieces_[i])).split(" "),
chunks_tks,
tkweight, vtweight)
@ -310,8 +310,8 @@ class Dealer:
def hybrid_similarity(self, ans_embd, ins_embd, ans, inst):
return self.qryr.hybrid_similarity(ans_embd,
ins_embd,
huqie.qie(ans).split(" "),
huqie.qie(inst).split(" "))
rag_tokenizer.tokenize(ans).split(" "),
rag_tokenizer.tokenize(inst).split(" "))
def retrieval(self, question, embd_mdl, tenant_id, kb_ids, page, page_size, similarity_threshold=0.2,
vector_similarity_weight=0.3, top=1024, doc_ids=None, aggs=True):
@ -385,7 +385,7 @@ class Dealer:
for r in re.finditer(r" ([a-z_]+_l?tks)( like | ?= ?)'([^']+)'", sql):
fld, v = r.group(1), r.group(3)
match = " MATCH({}, '{}', 'operator=OR;minimum_should_match=30%') ".format(
fld, huqie.qieqie(huqie.qie(v)))
fld, rag_tokenizer.fine_grained_tokenize(rag_tokenizer.tokenize(v)))
replaces.append(
("{}{}'{}'".format(
r.group(1),

View File

@ -17,7 +17,7 @@ class Dealer:
try:
self.dictionary = json.load(open(path, 'r'))
except Exception as e:
logging.warn("Miss synonym.json")
logging.warn("Missing synonym.json")
self.dictionary = {}
if not redis:

View File

@ -4,7 +4,7 @@ import json
import re
import os
import numpy as np
from rag.nlp import huqie
from rag.nlp import rag_tokenizer
from api.utils.file_utils import get_project_base_directory
@ -83,7 +83,7 @@ class Dealer:
txt = re.sub(p, r, txt)
res = []
for t in huqie.qie(txt).split(" "):
for t in rag_tokenizer.tokenize(txt).split(" "):
tk = t
if (stpwd and tk in self.stop_words) or (
re.match(r"[0-9]$", tk) and not num):
@ -161,7 +161,7 @@ class Dealer:
return m[self.ne[t]]
def postag(t):
t = huqie.tag(t)
t = rag_tokenizer.tag(t)
if t in set(["r", "c", "d"]):
return 0.3
if t in set(["ns", "nt"]):
@ -175,14 +175,14 @@ class Dealer:
def freq(t):
if re.match(r"[0-9. -]{2,}$", t):
return 3
s = huqie.freq(t)
s = rag_tokenizer.freq(t)
if not s and re.match(r"[a-z. -]+$", t):
return 300
if not s:
s = 0
if not s and len(t) >= 4:
s = [tt for tt in huqie.qieqie(t).split(" ") if len(tt) > 1]
s = [tt for tt in rag_tokenizer.fine_grained_tokenize(t).split(" ") if len(tt) > 1]
if len(s) > 1:
s = np.min([freq(tt) for tt in s]) / 6.
else:
@ -198,7 +198,7 @@ class Dealer:
elif re.match(r"[a-z. -]+$", t):
return 300
elif len(t) >= 4:
s = [tt for tt in huqie.qieqie(t).split(" ") if len(tt) > 1]
s = [tt for tt in rag_tokenizer.fine_grained_tokenize(t).split(" ") if len(tt) > 1]
if len(s) > 1:
return max(3, np.min([df(tt) for tt in s]) / 6.)

View File

@ -47,3 +47,9 @@ cron_logger = getLogger("cron_logger")
cron_logger.setLevel(20)
chunk_logger = getLogger("chunk_logger")
database_logger = getLogger("database")
SVR_QUEUE_NAME = "rag_flow_svr_queue"
SVR_QUEUE_RETENTION = 60*60
SVR_QUEUE_MAX_LEN = 1024
SVR_CONSUMER_NAME = "rag_flow_svr_consumer"
SVR_CONSUMER_GROUP_NAME = "rag_flow_svr_consumer_group"
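These constants name the new Redis stream and its consumer group. Producing a task event then reduces to one call (payload shape assumed from the task-handling code later in this diff):

from rag.settings import SVR_QUEUE_NAME
from rag.utils.redis_conn import REDIS_CONN

REDIS_CONN.queue_product(SVR_QUEUE_NAME, message={"id": "task-uuid"})

The 60*60-second retention means the stream key expires an hour after the last write, so an idle deployment does not accumulate stale events.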

View File

@ -4,13 +4,14 @@ import traceback
from api.db.db_models import close_connection
from api.db.services.task_service import TaskService
from rag.utils import MINIO
from rag.settings import cron_logger
from rag.utils.minio_conn import MINIO
from rag.utils.redis_conn import REDIS_CONN
def collect():
doc_locations = TaskService.get_ongoing_doc_name()
#print(tasks)
print(doc_locations)
if len(doc_locations) == 0:
time.sleep(1)
return
@ -28,7 +29,7 @@ def main():
if REDIS_CONN.exist(key):continue
file_bin = MINIO.get(kb_id, loc)
REDIS_CONN.transaction(key, file_bin, 12 * 60)
print("CACHE:", loc)
cron_logger.info("CACHE: {}".format(loc))
except Exception as e:
traceback.print_stack(e)
except Exception as e:

View File

@ -1,193 +0,0 @@
#
# Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import logging
import os
import time
import random
from datetime import datetime
from api.db.db_models import Task
from api.db.db_utils import bulk_insert_into_db
from api.db.services.task_service import TaskService
from deepdoc.parser import PdfParser
from deepdoc.parser.excel_parser import HuExcelParser
from rag.settings import cron_logger
from rag.utils import MINIO
from rag.utils import findMaxTm
import pandas as pd
from api.db import FileType, TaskStatus
from api.db.services.document_service import DocumentService
from api.settings import database_logger
from api.utils import get_format_time, get_uuid
from api.utils.file_utils import get_project_base_directory
from rag.utils.redis_conn import REDIS_CONN
from api.db.db_models import init_database_tables as init_web_db
from api.db.init_data import init_web_data
def collect(tm):
docs = DocumentService.get_newly_uploaded(tm)
if len(docs) == 0:
return pd.DataFrame()
docs = pd.DataFrame(docs)
mtm = docs["update_time"].max()
cron_logger.info("TOTAL:{}, To:{}".format(len(docs), mtm))
return docs
def set_dispatching(docid):
try:
DocumentService.update_by_id(
docid, {"progress": random.random() * 1 / 100.,
"progress_msg": "Task dispatched...",
"process_begin_at": get_format_time()
})
except Exception as e:
cron_logger.error("set_dispatching:({}), {}".format(docid, str(e)))
def dispatch():
tm_fnm = os.path.join(
get_project_base_directory(),
"rag/res",
f"broker.tm")
tm = findMaxTm(tm_fnm)
rows = collect(tm)
if len(rows) == 0:
return
tmf = open(tm_fnm, "a+")
for _, r in rows.iterrows():
try:
tsks = TaskService.query(doc_id=r["id"])
if tsks:
for t in tsks:
TaskService.delete_by_id(t.id)
except Exception as e:
cron_logger.exception(e)
def new_task():
nonlocal r
return {
"id": get_uuid(),
"doc_id": r["id"]
}
tsks = []
try:
file_bin = MINIO.get(r["kb_id"], r["location"])
if REDIS_CONN.is_alive():
try:
REDIS_CONN.set("{}/{}".format(r["kb_id"], r["location"]), file_bin, 12*60)
except Exception as e:
cron_logger.warning("Put into redis[EXCEPTION]:" + str(e))
if r["type"] == FileType.PDF.value:
do_layout = r["parser_config"].get("layout_recognize", True)
pages = PdfParser.total_page_number(r["name"], file_bin)
page_size = r["parser_config"].get("task_page_size", 12)
if r["parser_id"] == "paper":
page_size = r["parser_config"].get("task_page_size", 22)
if r["parser_id"] == "one":
page_size = 1000000000
if not do_layout:
page_size = 1000000000
page_ranges = r["parser_config"].get("pages")
if not page_ranges:
page_ranges = [(1, 100000)]
for s, e in page_ranges:
s -= 1
s = max(0, s)
e = min(e - 1, pages)
for p in range(s, e, page_size):
task = new_task()
task["from_page"] = p
task["to_page"] = min(p + page_size, e)
tsks.append(task)
elif r["parser_id"] == "table":
rn = HuExcelParser.row_number(
r["name"], file_bin)
for i in range(0, rn, 3000):
task = new_task()
task["from_page"] = i
task["to_page"] = min(i + 3000, rn)
tsks.append(task)
else:
tsks.append(new_task())
bulk_insert_into_db(Task, tsks, True)
set_dispatching(r["id"])
except Exception as e:
cron_logger.exception(e)
tmf.write(str(r["update_time"]) + "\n")
tmf.close()
def update_progress():
docs = DocumentService.get_unfinished_docs()
for d in docs:
try:
tsks = TaskService.query(doc_id=d["id"], order_by=Task.create_time)
if not tsks:
continue
msg = []
prg = 0
finished = True
bad = 0
status = TaskStatus.RUNNING.value
for t in tsks:
if 0 <= t.progress < 1:
finished = False
prg += t.progress if t.progress >= 0 else 0
msg.append(t.progress_msg)
if t.progress == -1:
bad += 1
prg /= len(tsks)
if finished and bad:
prg = -1
status = TaskStatus.FAIL.value
elif finished:
status = TaskStatus.DONE.value
msg = "\n".join(msg)
info = {
"process_duation": datetime.timestamp(
datetime.now()) -
d["process_begin_at"].timestamp(),
"run": status}
if prg != 0:
info["progress"] = prg
if msg:
info["progress_msg"] = msg
DocumentService.update_by_id(d["id"], info)
except Exception as e:
cron_logger.error("fetch task exception:" + str(e))
if __name__ == "__main__":
peewee_logger = logging.getLogger('peewee')
peewee_logger.propagate = False
peewee_logger.addHandler(database_logger.handlers[0])
peewee_logger.setLevel(database_logger.level)
# init db
init_web_db()
init_web_data()
while True:
dispatch()
time.sleep(1)
update_progress()
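With the broker deleted, the logic worth remembering from this file is how one document fanned out into tasks: PDFs were sliced into fixed page windows, tables into 3000-row windows. A minimal reconstruction of the page slicing:

def plan_page_tasks(total_pages: int, page_ranges=None, page_size: int = 12):
    if not page_ranges:
        page_ranges = [(1, 100000)]
    tasks = []
    for s, e in page_ranges:
        s = max(0, s - 1)              # user-facing ranges are 1-based
        e = min(e - 1, total_pages)
        for p in range(s, e, page_size):
            tasks.append({"from_page": p, "to_page": min(p + page_size, e)})
    return tasks

The equivalent splitting now lives with the producer that pushes task ids onto the Redis queue, rather than in a polling loop over MySQL.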

View File

@ -24,16 +24,18 @@ import sys
import time
import traceback
from functools import partial
from rag.utils import MINIO
from api.db.services.file2document_service import File2DocumentService
from rag.utils.minio_conn import MINIO
from api.db.db_models import close_connection
from rag.settings import database_logger
from rag.settings import database_logger, SVR_QUEUE_NAME
from rag.settings import cron_logger, DOC_MAXIMUM_SIZE
from multiprocessing import Pool
import numpy as np
from elasticsearch_dsl import Q
from multiprocessing.context import TimeoutError
from api.db.services.task_service import TaskService
from rag.utils import ELASTICSEARCH
from rag.utils.es_conn import ELASTICSEARCH
from timeit import default_timer as timer
from rag.utils import rmSpace, findMaxTm
@ -87,36 +89,34 @@ def set_progress(task_id, from_page=0, to_page=-1,
except Exception as e:
cron_logger.error("set_progress:({}), {}".format(task_id, str(e)))
close_connection()
if cancel:
sys.exit()
def collect(comm, mod, tm):
tasks = TaskService.get_tasks(tm, mod, comm)
#print(tasks)
if len(tasks) == 0:
time.sleep(1)
def collect():
try:
payload = REDIS_CONN.queue_consumer(SVR_QUEUE_NAME, "rag_flow_svr_task_broker", "rag_flow_svr_task_consumer")
if not payload:
time.sleep(1)
return pd.DataFrame()
except Exception as e:
cron_logger.error("Get task event from queue exception:" + str(e))
return pd.DataFrame()
msg = payload.get_message()
payload.ack()
if not msg: return pd.DataFrame()
if TaskService.do_cancel(msg["id"]):
return pd.DataFrame()
tasks = TaskService.get_tasks(msg["id"])
assert tasks, "{} empty task!".format(msg["id"])
tasks = pd.DataFrame(tasks)
mtm = tasks["update_time"].max()
cron_logger.info("TOTAL:{}, To:{}".format(len(tasks), mtm))
return tasks
def get_minio_binary(bucket, name):
global MINIO
if REDIS_CONN.is_alive():
try:
for _ in range(30):
if REDIS_CONN.exist("{}/{}".format(bucket, name)):
time.sleep(1)
break
time.sleep(1)
r = REDIS_CONN.get("{}/{}".format(bucket, name))
if r: return r
cron_logger.warning("Cache missing: {}".format(name))
except Exception as e:
cron_logger.warning("Get redis[EXCEPTION]:" + str(e))
return MINIO.get(bucket, name)
@ -132,12 +132,10 @@ def build(row):
row["from_page"],
row["to_page"])
chunker = FACTORY[row["parser_id"].lower()]
pool = Pool(processes=1)
try:
st = timer()
thr = pool.apply_async(get_minio_binary, args=(row["kb_id"], row["location"]))
binary = thr.get(timeout=90)
pool.terminate()
bucket, name = File2DocumentService.get_minio_address(doc_id=row["doc_id"])
binary = get_minio_binary(bucket, name)
cron_logger.info(
"From minio({}) {}/{}".format(timer()-st, row["location"], row["name"]))
cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
@ -156,7 +154,6 @@ def build(row):
else:
callback(-1, f"Internal server error: %s" %
str(e).replace("'", ""))
pool.terminate()
traceback.print_exc()
cron_logger.error(
@ -247,20 +244,13 @@ def embedding(docs, mdl, parser_config={}, callback=None):
return tk_count
def main(comm, mod):
tm_fnm = os.path.join(
get_project_base_directory(),
"rag/res",
f"{comm}-{mod}.tm")
tm = findMaxTm(tm_fnm)
rows = collect(comm, mod, tm)
def main():
rows = collect()
if len(rows) == 0:
return
tmf = open(tm_fnm, "a+")
for _, r in rows.iterrows():
callback = partial(set_progress, r["id"], r["from_page"], r["to_page"])
#callback(random.random()/10., "Task has been received.")
try:
embd_mdl = LLMBundle(r["tenant_id"], LLMType.EMBEDDING, llm_name=r["embd_id"], lang=r["language"])
except Exception as e:
@ -274,7 +264,6 @@ def main(comm, mod):
if cks is None:
continue
if not cks:
tmf.write(str(r["update_time"]) + "\n")
callback(1., "No chunk! Done!")
continue
# TODO: exception handler
@ -314,8 +303,6 @@ def main(comm, mod):
"Chunk doc({}), token({}), chunks({}), elapsed:{}".format(
r["id"], tk_count, len(cks), timer()-st))
tmf.write(str(r["update_time"]) + "\n")
tmf.close()
if __name__ == "__main__":
@ -324,8 +311,5 @@ if __name__ == "__main__":
peewee_logger.addHandler(database_logger.handlers[0])
peewee_logger.setLevel(database_logger.level)
#from mpi4py import MPI
#comm = MPI.COMM_WORLD
while True:
main(int(sys.argv[2]), int(sys.argv[1]))
close_connection()
main()
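collect() is the heart of the executor rewrite: instead of scanning MySQL with a timestamp file and (comm, mod) sharding, each executor blocks on the consumer group, acks the message, and loads the referenced task rows. A stripped-down version of the new loop:

import time
import pandas as pd

def collect(redis_conn, task_service, queue, group, consumer):
    payload = redis_conn.queue_consumer(queue, group, consumer)   # blocks up to 10 s
    if not payload:
        time.sleep(1)
        return pd.DataFrame()
    msg = payload.get_message()
    payload.ack()                                  # acked before processing
    if not msg or task_service.do_cancel(msg["id"]):
        return pd.DataFrame()
    return pd.DataFrame(task_service.get_tasks(msg["id"]))

Note that acking before the work completes trades delivery guarantees for simplicity: a crash mid-task loses the event rather than redelivering it.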

View File

@ -15,9 +15,6 @@ def singleton(cls, *args, **kw):
return _singleton
from .minio_conn import MINIO
from .es_conn import ELASTICSEARCH
def rmSpace(txt):
txt = re.sub(r"([^a-z0-9.,]) +([^ ])", r"\1\2", txt, flags=re.IGNORECASE)
return re.sub(r"([^ ]) +([^a-z0-9.,])", r"\1\2", txt, flags=re.IGNORECASE)

View File

@ -15,7 +15,7 @@ es_logger.info("Elasticsearch version: "+str(elasticsearch.__version__))
@singleton
class HuEs:
class ESConnection:
def __init__(self):
self.info = {}
self.conn()
@ -454,4 +454,4 @@ class HuEs:
scroll_size = len(page['hits']['hits'])
ELASTICSEARCH = HuEs()
ELASTICSEARCH = ESConnection()

View File

@ -8,7 +8,7 @@ from rag.utils import singleton
@singleton
class HuMinio(object):
class RAGFlowMinio(object):
def __init__(self):
self.conn = None
self.__open__()
@ -35,7 +35,7 @@ class HuMinio(object):
self.conn = None
def put(self, bucket, fnm, binary):
for _ in range(10):
for _ in range(3):
try:
if not self.conn.bucket_exists(bucket):
self.conn.make_bucket(bucket)
@ -86,10 +86,12 @@ class HuMinio(object):
time.sleep(1)
return
MINIO = HuMinio()
MINIO = RAGFlowMinio()
if __name__ == "__main__":
conn = HuMinio()
conn = RAGFlowMinio()
fnm = "/opt/home/kevinhu/docgpt/upload/13/11-408.jpg"
from PIL import Image
img = Image.open(fnm)
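RAGFlowMinio also drops upload retries from ten attempts to three. The retry shape, roughly (uses the standard minio client API; the back-off interval is an assumption):

import time
from io import BytesIO

def put_with_retry(conn, bucket: str, name: str, blob: bytes, attempts: int = 3):
    for _ in range(attempts):
        try:
            if not conn.bucket_exists(bucket):
                conn.make_bucket(bucket)      # lazily create the bucket
            return conn.put_object(bucket, name, BytesIO(blob), len(blob))
        except Exception:
            time.sleep(1)                     # transient hiccup: back off, retry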

View File

@ -5,6 +5,27 @@ import logging
from rag import settings
from rag.utils import singleton
class Payload:
def __init__(self, consumer, queue_name, group_name, msg_id, message):
self.__consumer = consumer
self.__queue_name = queue_name
self.__group_name = group_name
self.__msg_id = msg_id
self.__message = json.loads(message['message'])
def ack(self):
try:
self.__consumer.xack(self.__queue_name, self.__group_name, self.__msg_id)
return True
except Exception as e:
logging.warning("[EXCEPTION]ack" + str(self.__queue_name) + "||" + str(e))
return False
def get_message(self):
return self.__message
@singleton
class RedisDB:
def __init__(self):
@ -14,10 +35,11 @@ class RedisDB:
def __open__(self):
try:
self.REDIS = redis.Redis(host=self.config.get("host", "redis").split(":")[0],
self.REDIS = redis.StrictRedis(host=self.config["host"].split(":")[0],
port=int(self.config.get("host", ":6379").split(":")[1]),
db=int(self.config.get("db", 1)),
password=self.config.get("password"))
password=self.config.get("password"),
decode_responses=True)
except Exception as e:
logging.warning("Redis can't be connected.")
return self.REDIS
@ -70,5 +92,48 @@ class RedisDB:
self.__open__()
return False
def queue_product(self, queue, message, exp=settings.SVR_QUEUE_RETENTION) -> bool:
try:
payload = {"message": json.dumps(message)}
pipeline = self.REDIS.pipeline()
pipeline.xadd(queue, payload)
pipeline.expire(queue, exp)
pipeline.execute()
return True
except Exception as e:
logging.warning("[EXCEPTION]producer" + str(queue) + "||" + str(e))
return False
REDIS_CONN = RedisDB()
def queue_consumer(self, queue_name, group_name, consumer_name, msg_id=b">") -> Payload:
try:
group_info = self.REDIS.xinfo_groups(queue_name)
if not any(e["name"] == group_name for e in group_info):
self.REDIS.xgroup_create(
queue_name,
group_name,
id="$",
mkstream=True
)
args = {
"groupname": group_name,
"consumername": consumer_name,
"count": 1,
"block": 10000,
"streams": {queue_name: msg_id},
}
messages = self.REDIS.xreadgroup(**args)
if not messages:
return None
stream, element_list = messages[0]
msg_id, payload = element_list[0]
res = Payload(self.REDIS, queue_name, group_name, msg_id, payload)
return res
except Exception as e:
if 'key' in str(e):
pass
else:
logging.warning("[EXCEPTION]consumer" + str(queue_name) + "||" + str(e))
return None
REDIS_CONN = RedisDB()
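Payload wraps one consumed stream entry so the executor can ack it explicitly, and queue_consumer creates the group lazily (xgroup_create with mkstream=True) the first time a consumer attaches. The consumer-side contract, end to end (handle is a hypothetical callback):

from rag.settings import SVR_QUEUE_NAME, SVR_CONSUMER_GROUP_NAME, SVR_CONSUMER_NAME
from rag.utils.redis_conn import REDIS_CONN

payload = REDIS_CONN.queue_consumer(SVR_QUEUE_NAME,
                                    SVR_CONSUMER_GROUP_NAME,
                                    SVR_CONSUMER_NAME)
if payload is not None:
    msg = payload.get_message()   # dict parsed from the JSON 'message' field
    handle(msg)                   # hypothetical: do the work
    payload.ack()                 # XACK removes it from the group's pending list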

View File

@ -50,7 +50,6 @@ joblib==1.3.2
lxml==5.1.0
MarkupSafe==2.1.5
minio==7.2.4
mpi4py==3.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
@ -69,6 +68,7 @@ nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
ollama==0.1.9
onnxruntime-gpu==1.17.1
openai==1.12.0
opencv-python==4.9.0.80
@ -91,8 +91,6 @@ pycryptodomex==3.20.0
pydantic==2.6.2
pydantic_core==2.16.3
PyJWT==2.8.0
PyMuPDF==1.23.25
PyMuPDFb==1.23.22
PyMySQL==1.1.0
PyPDF2==3.0.1
pypdfium2==4.27.0
@ -102,6 +100,7 @@ python-dotenv==1.0.1
python-pptx==0.6.23
pytz==2024.1
PyYAML==6.0.1
redis==5.0.3
regex==2023.12.25
requests==2.31.0
ruamel.yaml==0.18.6
@ -134,4 +133,4 @@ xxhash==3.4.1
yarl==1.9.4
zhipuai==2.0.1
BCEmbedding
loguru==0.7.2
loguru==0.7.2

requirements_dev.txt Normal file
View File

@ -0,0 +1,126 @@
accelerate==0.27.2
aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
Aspose.Slides==24.2.0
attrs==23.2.0
blinker==1.7.0
cachelib==0.12.0
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
coloredlogs==15.0.1
cryptography==42.0.5
dashscope==1.14.1
datasets==2.17.1
datrie==0.8.2
demjson3==3.0.6
dill==0.3.8
distro==1.9.0
elastic-transport==8.12.0
elasticsearch==8.12.1
elasticsearch-dsl==8.12.0
et-xmlfile==1.1.0
filelock==3.13.1
fastembed==0.2.6
FlagEmbedding==1.2.5
Flask==3.0.2
Flask-Cors==4.0.0
Flask-Login==0.6.3
Flask-Session==0.6.0
flatbuffers==23.5.26
frozenlist==1.4.1
fsspec==2023.10.0
h11==0.14.0
hanziconv==0.3.2
httpcore==1.0.4
httpx==0.27.0
huggingface-hub==0.20.3
humanfriendly==10.0
idna==3.6
install==1.3.5
itsdangerous==2.1.2
Jinja2==3.1.3
joblib==1.3.2
lxml==5.1.0
MarkupSafe==2.1.5
minio==7.2.4
mpi4py==3.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
networkx==3.2.1
nltk==3.8.1
numpy==1.26.4
openai==1.12.0
opencv-python==4.9.0.80
openpyxl==3.1.2
packaging==23.2
pandas==2.2.1
pdfminer.six==20221105
pdfplumber==0.10.4
peewee==3.17.1
pillow==10.2.0
protobuf==4.25.3
psutil==5.9.8
pyarrow==15.0.0
pyarrow-hotfix==0.6
pyclipper==1.3.0.post5
pycparser==2.21
pycryptodome==3.20.0
pycryptodome-test-vectors==1.0.14
pycryptodomex==3.20.0
pydantic==2.6.2
pydantic_core==2.16.3
PyJWT==2.8.0
PyMuPDF==1.23.25
PyMuPDFb==1.23.22
PyMySQL==1.1.0
PyPDF2==3.0.1
pypdfium2==4.27.0
python-dateutil==2.8.2
python-docx==1.1.0
python-dotenv==1.0.1
python-pptx==0.6.23
pytz==2024.1
PyYAML==6.0.1
regex==2023.12.25
requests==2.31.0
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
sentence-transformers==2.4.0
shapely==2.0.3
six==1.16.0
sniffio==1.3.1
StrEnum==0.4.15
sympy==1.12
threadpoolctl==3.3.0
tika==2.6.0
tiktoken==0.6.0
tokenizers==0.15.2
torch==2.2.1
tqdm==4.66.2
transformers==4.38.1
triton==2.2.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
Werkzeug==3.0.1
xgboost==2.0.3
XlsxWriter==3.2.0
xpinyin==0.7.6
xxhash==3.4.1
yarl==1.9.4
zhipuai==2.0.1
BCEmbedding
loguru==0.7.2
ollama==0.1.8
redis==5.0.4

View File

@ -27,7 +27,7 @@ export default defineConfig({
devtool: 'source-map',
proxy: {
'/v1': {
target: 'http://192.168.200.233:9380/',
target: 'http://123.60.95.134:9380/',
changeOrigin: true,
// pathRewrite: { '^/v1': '/v1' },
},

web/package-lock.json generated

File diff suppressed because it is too large.

View File

@ -3,7 +3,7 @@
"author": "zhaofengchao <13723060510@163.com>",
"scripts": {
"build": "umi build",
"dev": "cross-env PORT=9000 umi dev",
"dev": "cross-env PORT=9200 umi dev",
"postinstall": "umi setup",
"lint": "umi lint --eslint-only",
"setup": "umi setup",
@ -13,6 +13,7 @@
"@ant-design/icons": "^5.2.6",
"@ant-design/pro-components": "^2.6.46",
"@ant-design/pro-layout": "^7.17.16",
"@js-preview/excel": "^1.7.8",
"ahooks": "^3.7.10",
"antd": "^5.12.7",
"axios": "^1.6.3",
@ -25,12 +26,14 @@
"rc-tween-one": "^3.0.6",
"react-chat-elements": "^12.0.13",
"react-copy-to-clipboard": "^5.1.0",
"react-file-viewer": "^1.2.1",
"react-i18next": "^14.0.0",
"react-infinite-scroll-component": "^6.1.0",
"react-markdown": "^9.0.1",
"react-pdf-highlighter": "^6.1.0",
"react-string-replace": "^1.1.1",
"react-syntax-highlighter": "^15.5.0",
"reactflow": "^11.11.2",
"recharts": "^2.12.4",
"remark-gfm": "^4.0.0",
"umi": "^4.0.90",

View File

@ -0,0 +1,6 @@
<svg t="1715133624982" class="icon" viewBox="0 0 1024 1024" version="1.1" xmlns="http://www.w3.org/2000/svg" p-id="4263"
width="200" height="200">
<path
d="M320.512 804.864C46.08 676.864 77.824 274.432 362.496 274.432c34.816 0 86.016-7.168 114.688-14.336 59.392-16.384 99.328-10.24 69.632 10.24-9.216 7.168-15.36 19.456-13.312 28.672 5.12 20.48 158.72 161.792 177.152 161.792 27.648 0 27.648-32.768 1.024-57.344-43.008-38.912-55.296-90.112-35.84-141.312l9.216-26.624 54.272 52.224c35.84 34.816 58.368 49.152 68.608 44.032 9.216-4.096 30.72-9.216 49.152-12.288 18.432-2.048 38.912-10.24 45.056-18.432 19.456-23.552 43.008-17.408 35.84 9.216-3.072 12.288-6.144 27.648-6.144 34.816 0 23.552-62.464 83.968-92.16 90.112-23.552 5.12-30.72 12.288-30.72 30.72 0 46.08-38.912 148.48-75.776 198.656l-37.888 51.2 36.864 15.36c56.32 23.552 40.96 41.984-37.888 43.008-43.008 1.024-75.776 7.168-92.16 18.432-68.608 45.056-198.656 50.176-281.6 12.288z m251.904-86.016c-24.576-27.648-66.56-79.872-93.184-117.76-69.632-98.304-158.72-150.528-256-150.528-37.888 0-38.912 1.024-38.912 34.816 0 94.208 99.328 240.64 175.104 257.024 38.912 9.216 59.392-7.168 39.936-29.696-7.168-9.216-10.24-23.552-6.144-31.744 5.12-14.336 9.216-14.336 38.912 1.024 18.432 9.216 50.176 29.696 69.632 45.056 35.84 27.648 58.368 37.888 96.256 39.936 14.336 1.024 9.216-10.24-25.6-48.128z m88.064-145.408c8.192-13.312-31.744-78.848-56.32-92.16-10.24-6.144-26.624-10.24-34.816-10.24-23.552 0-20.48 27.648 4.096 33.792 13.312 3.072 20.48 14.336 20.48 29.696 0 13.312 5.12 29.696 12.288 36.864 15.36 15.36 46.08 16.384 54.272 2.048z"
fill="#4D6BFE" p-id="4264"></path>
</svg>


View File

@ -74,9 +74,9 @@ export const useFetchParserListOnMount = (
setSelectedTag(parserId);
}, [parserId, documentId]);
const handleChange = (tag: string, checked: boolean) => {
const nextSelectedTag = checked ? tag : selectedTag;
setSelectedTag(nextSelectedTag);
const handleChange = (tag: string) => {
// const nextSelectedTag = checked ? tag : selectedTag;
setSelectedTag(tag);
};
return { parserList: nextParserList, handleChange, selectedTag };

View File

@ -8,3 +8,7 @@
cursor: help;
writing-mode: horizontal-tb;
}
.chunkMethod {
margin-bottom: 0;
}

View File

@ -13,9 +13,9 @@ import {
Form,
InputNumber,
Modal,
Select,
Space,
Switch,
Tag,
Tooltip,
} from 'antd';
import omit from 'lodash/omit';
@ -25,8 +25,6 @@ import { useFetchParserListOnMount } from './hooks';
import { useTranslate } from '@/hooks/commonHooks';
import styles from './index.less';
const { CheckableTag } = Tag;
interface IProps extends Omit<IModalManagerChildrenProps, 'showModal'> {
loading: boolean;
onOk: (
@ -50,6 +48,7 @@ const ChunkMethodModal: React.FC<IProps> = ({
visible,
documentExtension,
parserConfig,
loading,
}) => {
const { parserList, handleChange, selectedTag } = useFetchParserListOnMount(
documentId,
@ -111,23 +110,17 @@ const ChunkMethodModal: React.FC<IProps> = ({
onOk={handleOk}
onCancel={hideModal}
afterClose={afterClose}
confirmLoading={loading}
>
<Space size={[0, 8]} wrap>
<div className={styles.tags}>
{parserList.map((x) => {
return (
<CheckableTag
key={x.value}
checked={selectedTag === x.value}
onChange={(checked) => {
handleChange(x.value, checked);
}}
>
{x.label}
</CheckableTag>
);
})}
</div>
<Form.Item label={t('chunkMethod')} className={styles.chunkMethod}>
<Select
style={{ width: 120 }}
onChange={handleChange}
value={selectedTag}
options={parserList}
/>
</Form.Item>
</Space>
{hideDivider || <Divider></Divider>}
<Form name="dynamic_form_nest_item" autoComplete="off" form={form}>

View File

@ -0,0 +1,8 @@
.uploader {
:global {
.ant-upload-list {
max-height: 40vh;
overflow-y: auto;
}
}
}

View File

@ -1,3 +1,4 @@
import { useTranslate } from '@/hooks/commonHooks';
import { IModalProps } from '@/interfaces/common';
import { InboxOutlined } from '@ant-design/icons';
import {
@ -12,6 +13,8 @@ import {
} from 'antd';
import { Dispatch, SetStateAction, useState } from 'react';
import styles from './index.less';
const { Dragger } = Upload;
const FileUpload = ({
@ -23,6 +26,7 @@ const FileUpload = ({
fileList: UploadFile[];
setFileList: Dispatch<SetStateAction<UploadFile[]>>;
}) => {
const { t } = useTranslate('fileManager');
const props: UploadProps = {
multiple: true,
onRemove: (file) => {
@ -43,17 +47,12 @@ const FileUpload = ({
};
return (
<Dragger {...props}>
<Dragger {...props} className={styles.uploader}>
<p className="ant-upload-drag-icon">
<InboxOutlined />
</p>
<p className="ant-upload-text">
Click or drag file to this area to upload
</p>
<p className="ant-upload-hint">
Support for a single or bulk upload. Strictly prohibited from uploading
company data or other banned files.
</p>
<p className="ant-upload-text">{t('uploadTitle')}</p>
<p className="ant-upload-hint">{t('uploadDescription')}</p>
</Dragger>
);
};
@ -64,18 +63,29 @@ const FileUploadModal = ({
loading,
onOk: onFileUploadOk,
}: IModalProps<UploadFile[]>) => {
const { t } = useTranslate('fileManager');
const [value, setValue] = useState<string | number>('local');
const [fileList, setFileList] = useState<UploadFile[]>([]);
const [directoryFileList, setDirectoryFileList] = useState<UploadFile[]>([]);
const onOk = () => {
return onFileUploadOk?.([...fileList, ...directoryFileList]);
const clearFileList = () => {
setFileList([]);
setDirectoryFileList([]);
};
const onOk = async () => {
const ret = await onFileUploadOk?.([...fileList, ...directoryFileList]);
return ret;
};
const afterClose = () => {
clearFileList();
};
const items: TabsProps['items'] = [
{
key: '1',
label: 'File',
label: t('file'),
children: (
<FileUpload
directory={false}
@ -86,7 +96,7 @@ const FileUploadModal = ({
},
{
key: '2',
label: 'Directory',
label: t('directory'),
children: (
<FileUpload
directory
@ -100,17 +110,18 @@ const FileUploadModal = ({
return (
<>
<Modal
title="File upload"
title={t('uploadFile')}
open={visible}
onOk={onOk}
onCancel={hideModal}
confirmLoading={loading}
afterClose={afterClose}
>
<Flex gap={'large'} vertical>
<Segmented
options={[
{ label: 'Local uploads', value: 'local' },
{ label: 'S3 uploads', value: 's3' },
{ label: t('local'), value: 'local' },
{ label: t('s3'), value: 's3' },
]}
block
value={value}
@ -119,7 +130,7 @@ const FileUploadModal = ({
{value === 'local' ? (
<Tabs defaultActiveKey="1" items={items} />
) : (
'coming soon'
t('comingSoon', { keyPrefix: 'common' })
)}
</Flex>
</Modal>

View File

@ -46,3 +46,25 @@ export const LanguageTranslationMap = {
Chinese: 'zh',
'Traditional Chinese': 'zh-TRADITIONAL',
};
export const FileMimeTypeMap = {
bmp: 'image/bmp',
csv: 'text/csv',
odt: 'application/vnd.oasis.opendocument.text',
doc: 'application/msword',
docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
gif: 'image/gif',
htm: 'text/htm',
html: 'text/html',
jpg: 'image/jpg',
jpeg: 'image/jpeg',
pdf: 'application/pdf',
png: 'image/png',
ppt: 'application/vnd.ms-powerpoint',
pptx: 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
tiff: 'image/tiff',
txt: 'text/plain',
xls: 'application/vnd.ms-excel',
xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
mp4: 'video/mp4',
};

View File

@ -7,6 +7,7 @@ import { useCallback, useMemo, useState } from 'react';
import { IHighlight } from 'react-pdf-highlighter';
import { useDispatch, useSelector } from 'umi';
import { useGetKnowledgeSearchParams } from './routeHook';
import { useOneNamespaceEffectsLoading } from './storeHooks';
export const useGetDocumentUrl = (documentId: string) => {
const url = useMemo(() => {
@ -160,12 +161,12 @@ export const useRemoveDocument = () => {
const { knowledgeId } = useGetKnowledgeSearchParams();
const removeDocument = useCallback(
(documentId: string) => {
(documentIds: string[]) => {
try {
return dispatch<any>({
type: 'kFModel/document_rm',
payload: {
doc_id: documentId,
doc_id: documentIds,
kb_id: knowledgeId,
},
});
@ -184,12 +185,12 @@ export const useUploadDocument = () => {
const { knowledgeId } = useGetKnowledgeSearchParams();
const uploadDocument = useCallback(
(file: UploadFile) => {
(fileList: UploadFile[]) => {
try {
return dispatch<any>({
type: 'kFModel/upload_document',
payload: {
file,
fileList,
kb_id: knowledgeId,
},
});
@ -222,3 +223,8 @@ export const useRunDocument = () => {
return runDocumentByIds;
};
export const useSelectRunDocumentLoading = () => {
const loading = useOneNamespaceEffectsLoading('kFModel', ['document_run']);
return loading;
};

View File

@ -125,13 +125,19 @@ export const useFetchKnowledgeBaseConfiguration = () => {
}, [fetchKnowledgeBaseConfiguration]);
};
export const useSelectKnowledgeList = () => {
const knowledgeModel = useSelector((state) => state.knowledgeModel);
const { data = [] } = knowledgeModel;
return data;
};
export const useFetchKnowledgeList = (
shouldFilterListWithoutDocument: boolean = false,
) => {
const dispatch = useDispatch();
const loading = useOneNamespaceEffectsLoading('knowledgeModel', ['getList']);
const knowledgeModel = useSelector((state: any) => state.knowledgeModel);
const knowledgeModel = useSelector((state) => state.knowledgeModel);
const { data = [] } = knowledgeModel;
const list: IKnowledge[] = useMemo(() => {
return shouldFilterListWithoutDocument
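Extracting `useSelectKnowledgeList` lets read-only consumers subscribe to the list without triggering another `getList` fetch; a sketch of such a consumer (hypothetical component):

```tsx
const KnowledgeCount = () => {
  // Reads state.knowledgeModel.data without dispatching any effect.
  const list = useSelectKnowledgeList();
  return <span>{list.length} knowledge bases</span>;
};
```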

View File

@@ -5,6 +5,7 @@ import {
IThirdOAIModelCollection,
} from '@/interfaces/database/llm';
import { IAddLlmRequestBody } from '@/interfaces/request/llm';
import { sortLLmFactoryListBySpecifiedOrder } from '@/utils/commonUtil';
import { useCallback, useEffect, useMemo } from 'react';
import { useDispatch, useSelector } from 'umi';
@@ -110,13 +111,12 @@ export const useFetchLlmFactoryListOnMount = () => {
const factoryList = useSelectLlmFactoryList();
const myLlmList = useSelectMyLlmList();
const list = useMemo(
() =>
factoryList.filter((x) =>
Object.keys(myLlmList).every((y) => y !== x.name),
),
[factoryList, myLlmList],
);
const list = useMemo(() => {
const currentList = factoryList.filter((x) =>
Object.keys(myLlmList).every((y) => y !== x.name),
);
return sortLLmFactoryListBySpecifiedOrder(currentList);
}, [factoryList, myLlmList]);
const fetchLlmFactoryList = useCallback(() => {
dispatch({
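`sortLLmFactoryListBySpecifiedOrder` is imported from `@/utils/commonUtil` but not shown in this diff. A hypothetical implementation consistent with the name: pin a fixed set of factories to the front and leave the rest in their original order (the order array is assumed, not taken from the repo):

```ts
const SPECIFIED_ORDER = ['OpenAI', 'Tongyi-Qianwen', 'ZHIPU-AI', 'Moonshot']; // assumed

export const sortLLmFactoryListBySpecifiedOrder = <T extends { name: string }>(
  list: T[],
): T[] => {
  // Unlisted factories rank last but keep their relative order,
  // since Array.prototype.sort is stable in ES2019+ engines.
  const rank = (x: T) => {
    const i = SPECIFIED_ORDER.indexOf(x.name);
    return i === -1 ? SPECIFIED_ORDER.length : i;
  };
  return [...list].sort((a, b) => rank(a) - rank(b));
};
```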

View File

@@ -14,5 +14,5 @@ export interface IModalProps<T> {
hideModal(): void;
visible: boolean;
loading?: boolean;
onOk?(payload?: T): Promise<void> | void;
onOk?(payload?: T): Promise<any> | void;
}

View File

@@ -21,11 +21,11 @@ export interface LlmSetting {
}
export interface Variable {
frequency_penalty: number;
max_tokens: number;
presence_penalty: number;
temperature: number;
top_p: number;
frequency_penalty?: number;
max_tokens?: number;
presence_penalty?: number;
temperature?: number;
top_p?: number;
}
export interface IDialog {
@@ -38,7 +38,7 @@ export interface IDialog {
kb_names: string[];
language: string;
llm_id: string;
llm_setting: LlmSetting;
llm_setting: Variable;
llm_setting_type: string;
name: string;
prompt_config: PromptConfig;
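With every `Variable` field now optional and `llm_setting` retyped to `Variable`, a dialog only needs to persist the parameters the user actually overrides; the values here are illustrative:

```ts
const llmSetting: Variable = {
  temperature: 0.1, // only the overridden knobs are stored
  top_p: 0.3,
};
```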

View File

@@ -1,5 +1,5 @@
import { ReactComponent as StarIcon } from '@/assets/svg/chat-star.svg';
// import { ReactComponent as FileIcon } from '@/assets/svg/file-management.svg';
import { ReactComponent as FileIcon } from '@/assets/svg/file-management.svg';
import { ReactComponent as KnowledgeBaseIcon } from '@/assets/svg/knowledge-base.svg';
import { ReactComponent as Logo } from '@/assets/svg/logo.svg';
import { useTranslate } from '@/hooks/commonHooks';
@@ -25,7 +25,7 @@ const RagHeader = () => {
() => [
{ path: '/knowledge', name: t('knowledgeBase'), icon: KnowledgeBaseIcon },
{ path: '/chat', name: t('chat'), icon: StarIcon },
// { path: '/file', name: 'File Management', icon: FileIcon },
{ path: '/file', name: t('fileManager'), icon: FileIcon },
],
[t],
);

View File

@@ -15,3 +15,7 @@
vertical-align: middle;
cursor: pointer;
}
.language {
cursor: pointer;
}

View File

@@ -1,6 +1,5 @@
import { ReactComponent as TranslationIcon } from '@/assets/svg/translation.svg';
import { useTranslate } from '@/hooks/commonHooks';
import { GithubOutlined } from '@ant-design/icons';
import { DownOutlined, GithubOutlined } from '@ant-design/icons';
import { Dropdown, MenuProps, Space } from 'antd';
import camelCase from 'lodash/camelCase';
import React from 'react';
@@ -8,6 +7,7 @@ import User from '../user';
import { LanguageList } from '@/constants/common';
import { useChangeLanguage } from '@/hooks/logicHooks';
import { useSelector } from 'umi';
import styled from './index.less';
const Circle = ({ children, ...restProps }: React.PropsWithChildren) => {
@@ -25,6 +25,7 @@ const handleGithubClick = () => {
const RightToolBar = () => {
const { t } = useTranslate('common');
const changeLanguage = useChangeLanguage();
const { language = '' } = useSelector((state) => state.settingModel.userInfo);
const handleItemClick: MenuProps['onClick'] = ({ key }) => {
changeLanguage(key);
@@ -40,14 +41,15 @@ const RightToolBar = () => {
return (
<div className={styled.toolbarWrapper}>
<Space wrap size={16}>
<Dropdown menu={{ items, onClick: handleItemClick }} placement="bottom">
<Space className={styled.language}>
<b>{t(camelCase(language))}</b>
<DownOutlined />
</Space>
</Dropdown>
<Circle>
<GithubOutlined onClick={handleGithubClick} />
</Circle>
<Dropdown menu={{ items, onClick: handleItemClick }} placement="bottom">
<Circle>
<TranslationIcon />
</Circle>
</Dropdown>
{/* <Circle>
<MonIcon />
</Circle> */}
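The `items` fed to the `Dropdown` are defined in lines elided from this hunk; they are presumably derived from `LanguageList` roughly as follows (a sketch, not the repo's code):

```tsx
const items: MenuProps['items'] = LanguageList.map((x) => ({
  key: x,
  // camelCase('Traditional Chinese') -> 'traditionalChinese', matching the locale keys
  label: <span>{t(camelCase(x))}</span>,
}));
```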

View File

@@ -42,3 +42,17 @@
}
}
}
.textEllipsis() {
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
}
.multipleLineEllipsis(@line) {
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-line-clamp: @line;
overflow: hidden;
text-overflow: ellipsis;
}

View File

@@ -22,6 +22,9 @@ export default {
languagePlaceholder: 'Select your language',
copy: 'Copy',
copied: 'Copied',
comingSoon: 'Coming Soon',
download: 'Download',
close: 'Close',
},
login: {
login: 'Sign in',
@@ -52,6 +55,7 @@ export default {
home: 'Home',
setting: 'User Setting',
logout: 'Log out',
fileManager: 'File Management',
},
knowledgeList: {
welcome: 'Welcome back',
@@ -60,6 +64,7 @@ export default {
name: 'Name',
namePlaceholder: 'Please input name!',
doc: 'Docs',
searchKnowledgePlaceholder: 'Search',
},
knowledgeDetails: {
dataset: 'Dataset',
@@ -274,6 +279,8 @@ export default {
keyword: 'Keyword',
function: 'Function',
chunkMessage: 'Please input value!',
full: 'Full text',
ellipse: 'Ellipsis',
},
chat: {
createAssistant: 'Create an Assistant',
@@ -459,6 +466,7 @@ export default {
renamed: 'Renamed',
operated: 'Operated',
updated: 'Updated',
uploaded: 'Uploaded',
200: 'The server successfully returns the requested data.',
201: 'Create or modify data successfully.',
202: 'A request has been queued in the background (asynchronous task).',
@@ -480,6 +488,24 @@ export default {
networkAnomaly: 'Network anomaly',
hint: 'Hint',
},
fileManager: {
name: 'Name',
uploadDate: 'Upload Date',
knowledgeBase: 'Knowledge Base',
size: 'Size',
action: 'Action',
addToKnowledge: 'Add to Knowledge Base',
pleaseSelect: 'Please select',
newFolder: 'New Folder',
file: 'File',
uploadFile: 'Upload File',
directory: 'Directory',
uploadTitle: 'Click or drag file to this area to upload',
uploadDescription:
'Supports single or bulk uploads. Uploading company data or other banned files is strictly prohibited.',
local: 'Local uploads',
s3: 'S3 uploads',
},
footer: {
profile: 'All rights reserved @ React',
},

View File

@@ -22,6 +22,9 @@ export default {
languagePlaceholder: '請選擇語言',
copy: '複製',
copied: '複製成功',
comingSoon: '即將推出',
download: '下載',
close: '關閉',
},
login: {
login: '登入',
@@ -52,6 +55,7 @@ export default {
home: '首頁',
setting: '用戶設置',
logout: '登出',
fileManager: '文件管理',
},
knowledgeList: {
welcome: '歡迎回來',
@@ -60,6 +64,7 @@ export default {
name: '名稱',
namePlaceholder: '請輸入名稱',
doc: '文件',
searchKnowledgePlaceholder: '搜索',
},
knowledgeDetails: {
dataset: '數據集',
@@ -218,7 +223,7 @@ export default {
您只需與<i>'ragflow'</i>交談即可列出所有符合資格的候選人。
</p>
`,
table: `支持<p><b>excel</b>和<b>csv/txt</b>格式文件。</p><p>以下是一些提示: <ul> <li>对于Csv或Txt文件列之间的分隔符为 <em><b>tab</b></em>。</li> <li>第一行必须是列标题。</li> <li>列标题必须是有意义的术语,以便我们的法学硕士能够理解。列举一些同义词时最好使用斜杠<i>'/'</i>来分隔,甚至更好使用方括号枚举值,例如 <i>“性別/性別(男性,女性)”</i>.<p>以下是标题的一些示例:<ol> <li>供应商/供货商<b>'tab'</b>顏色(黃色、紅色、棕色)<b>'tab'</b>性別(男、女)<b>'tab'</B>尺码m、l、xl、xxl</li> <li>姓名/名字<b>'tab'</b>電話/手機/微信<b>'tab'</b>最高学历高中职高硕士本科博士初中中技中专专科专升本mpambaemba</li> </ol> </p> </li> <li>表中的每一行都将被视为一个块。</li> </ul>`,
table: `支持<p><b>excel</b>和<b>csv/txt</b>格式文件。</p><p>以下是一些提示: <ul> <li>对于Csv或Txt文件列之间的分隔符为 <em><b>tab</b></em>。</li> <li>第一行必须是列标题。</li> <li>列标题必须是有意义的术语,以便我们的大語言模型能够理解。列举一些同义词时最好使用斜杠<i>'/'</i>来分隔,甚至更好使用方括号枚举值,例如 <i>“性別/性別(男性,女性)”</i>.<p>以下是标题的一些示例:<ol> <li>供应商/供货商<b>'tab'</b>顏色(黃色、紅色、棕色)<b>'tab'</b>性別(男、女)<b>'tab'</B>尺码m、l、xl、xxl</li> <li>姓名/名字<b>'tab'</b>電話/手機/微信<b>'tab'</b>最高学历高中职高硕士本科博士初中中技中专专科专升本mpambaemba</li> </ol> </p> </li> <li>表中的每一行都将被视为一个块。</li> </ul>`,
picture: `
<p>支持圖像文件。視頻即將推出。</p><p>
如果圖片中有文字,則應用 OCR 提取文字作為其文字描述。
@@ -247,6 +252,8 @@ export default {
keyword: '關鍵詞',
function: '函數',
chunkMessage: '請輸入值!',
full: '全文',
ellipse: '省略',
},
chat: {
createAssistant: '新建助理',
@@ -424,6 +431,7 @@ export default {
renamed: '重命名成功',
operated: '操作成功',
updated: '更新成功',
uploaded: '上傳成功',
200: '服務器成功返回請求的數據。',
201: '新建或修改數據成功。',
202: '一個請求已經進入後台排隊(異步任務)。',
@@ -444,6 +452,23 @@ export default {
networkAnomaly: '網絡異常',
hint: '提示',
},
fileManager: {
name: '名稱',
uploadDate: '上傳日期',
knowledgeBase: '知識庫',
size: '大小',
action: '操作',
addToKnowledge: '添加到知識庫',
pleaseSelect: '請選擇',
newFolder: '新建文件夾',
uploadFile: '上傳文件',
uploadTitle: '點擊或拖拽文件至此區域即可上傳',
uploadDescription: '支持單次或批量上傳。嚴禁上傳公司數據或其他違禁文件。',
file: '文件',
directory: '文件夾',
local: '本地上傳',
s3: 'S3 上傳',
},
footer: {
profile: '保留所有權利 @ React',
},

View File

@@ -22,6 +22,9 @@ export default {
languagePlaceholder: '请选择语言',
copy: '复制',
copied: '复制成功',
comingSoon: '即将推出',
download: '下载',
close: '关闭',
},
login: {
login: '登录',
@@ -52,6 +55,7 @@ export default {
home: '首页',
setting: '用户设置',
logout: '登出',
fileManager: '文件管理',
},
knowledgeList: {
welcome: '欢迎回来',
@@ -60,6 +64,7 @@ export default {
name: '名称',
namePlaceholder: '请输入名称',
doc: '文档',
searchKnowledgePlaceholder: '搜索',
},
knowledgeDetails: {
dataset: '数据集',
@@ -225,7 +230,7 @@ export default {
<ul>
<li>对于 csv 或 txt 文件,列之间的分隔符为 <em><b>TAB</b></em>。</li>
<li>第一行必须是列标题。</li>
<li>列标题必须是有意义的术语,以便我们的法学硕士能够理解。
<li>列标题必须是有意义的术语,以便我们的大语言模型能够理解。
列举一些同义词时最好使用斜杠<i>'/'</i>来分隔,甚至更好
使用方括号枚举值,例如 <i>'gender/sex(male,female)'</i>.<p>
以下是标题的一些示例:<ol>
@@ -264,6 +269,8 @@ export default {
keyword: '关键词',
function: '函数',
chunkMessage: '请输入值!',
full: '全文',
ellipse: '省略',
},
chat: {
createAssistant: '新建助理',
@@ -298,7 +305,7 @@ export default {
systemTip:
'当LLM回答问题时你需要LLM遵循的说明比如角色设计、答案长度和答案语言等。',
topN: 'Top N',
topNTip: `并非所有相似度得分高于“相似度阈值”的块都会被提供给法学硕士。 LLM 只能看到这些“Top N”块。`,
topNTip: `并非所有相似度得分高于“相似度阈值”的块都会被提供给大语言模型。 LLM 只能看到这些“Top N”块。`,
variable: '变量',
variableTip: `如果您使用对话 API变量可能会帮助您使用不同的策略与客户聊天。
这些变量用于填写提示中的“系统”部分以便给LLM一个提示。
@@ -315,7 +322,7 @@ export default {
improvise: '即兴创作',
precise: '精确',
balance: '平衡',
freedomTip: `“精确”意味着法学硕士会保守并谨慎地回答你的问题。 “即兴发挥”意味着你希望法学硕士能够自由地畅所欲言。 “平衡”是谨慎与自由之间的平衡。`,
freedomTip: `“精确”意味着大语言模型会保守并谨慎地回答你的问题。 “即兴发挥”意味着你希望大语言模型能够自由地畅所欲言。 “平衡”是谨慎与自由之间的平衡。`,
temperature: '温度',
temperatureMessage: '温度是必填项',
temperatureTip:
@@ -441,6 +448,7 @@ export default {
renamed: '重命名成功',
operated: '操作成功',
updated: '更新成功',
uploaded: '上传成功',
200: '服务器成功返回请求的数据。',
201: '新建或修改数据成功。',
202: '一个请求已经进入后台排队(异步任务)。',
@@ -461,6 +469,24 @@ export default {
networkAnomaly: '网络异常',
hint: '提示',
},
fileManager: {
name: '名称',
uploadDate: '上传日期',
knowledgeBase: '知识库',
size: '大小',
action: '操作',
addToKnowledge: '添加到知识库',
pleaseSelect: '请选择',
newFolder: '新建文件夹',
uploadFile: '上传文件',
uploadTitle: '点击或拖拽文件至此区域即可上传',
uploadDescription:
'支持单次或批量上传。严禁上传公司数据或其他违禁文件。',
file: '文件',
directory: '文件夹',
local: '本地上传',
s3: 'S3 上传',
},
footer: {
profile: 'All rights reserved @ React',
},

View File

@@ -14,6 +14,10 @@
.chunkText;
}
.contentEllipsis {
.multipleLineEllipsis(3);
}
.chunkCard {
width: 100%;
}

Some files were not shown because too many files have changed in this diff.