From d17970ebd06bceed3dad69fbd70da6798d8e7c03 Mon Sep 17 00:00:00 2001 From: writinwaters <93570324+writinwaters@users.noreply.github.com> Date: Wed, 26 Mar 2025 09:03:18 +0800 Subject: [PATCH] 0321 chunkmethods (#6520) ### What problem does this PR solve? #6061 ### Type of change - [x] Documentation Update --- docker/.env | 8 ++-- .../agent/agent_component_reference/begin.mdx | 11 +++++- .../agent_component_reference/generate.mdx | 11 +++--- .../dataset/construct_knowledge_graph.md | 4 ++ docs/references/http_api_reference.md | 38 +++++++++---------- docs/references/python_api_reference.md | 33 ++++++++-------- web/src/locales/de.ts | 3 +- web/src/locales/en.ts | 6 +-- web/src/locales/id.ts | 2 +- web/src/locales/ja.ts | 2 +- web/src/locales/pt-br.ts | 5 ++- web/src/locales/vi.ts | 4 +- web/src/locales/zh-traditional.ts | 3 +- web/src/locales/zh.ts | 4 +- 14 files changed, 73 insertions(+), 61 deletions(-) diff --git a/docker/.env b/docker/.env index 40a8a3c5e..2637e5664 100644 --- a/docker/.env +++ b/docker/.env @@ -125,10 +125,12 @@ TIMEZONE='Asia/Shanghai' # Uncomment the following line if your operating system is MacOS: # MACOS=1 -# The maximum file size for each uploaded file, in bytes. -# To change the 1GB file size limit, uncomment the following line and make your changes accordingly. +# The maximum file size limit (in bytes) for each upload to your knowledge base or File Management. +# To change the 1GB file size limit, uncomment the line below and update as needed. # MAX_CONTENT_LENGTH=1073741824 -# After the change, ensure you update `client_max_body_size` in nginx/nginx.conf correspondingly. +# After updating, ensure `client_max_body_size` in nginx/nginx.conf is updated accordingly. +# Note that neither `MAX_CONTENT_LENGTH` nor `client_max_body_size` sets the maximum size for files uploaded to an agent. +# See https://ragflow.io/docs/dev/begin_component for details. # The log level for the RAGFlow's owned packages and imported packages. 
# Available level: diff --git a/docs/guides/agent/agent_component_reference/begin.mdx b/docs/guides/agent/agent_component_reference/begin.mdx index fc1b0e370..3348fed88 100644 --- a/docs/guides/agent/agent_component_reference/begin.mdx +++ b/docs/guides/agent/agent_component_reference/begin.mdx @@ -50,6 +50,10 @@ If your agent's **Begin** component takes a variable, you *cannot* embed it into - **boolean**: Requires the user to toggle between on and off. - **Optional**: A toggle indicating whether the variable is optional. +:::danger IMPORTANT +If you set the key type to **file**, ensure the token count of the uploaded file does not exceed your model provider's maximum token limit; otherwise, the plain text in your file will be truncated and incomplete. +::: + ## Examples As mentioned earlier, the **Begin** component is indispensable for an agent. Still, you can take a look at our three-step interpreter agent template, where the **Begin** component takes two global variables: @@ -64,7 +68,7 @@ As mentioned earlier, the **Begin** component is indispensable for an agent. Sti ### Is the uploaded file in a knowledge base? -No. Files uploaded to an agent as input are not stored in a knowledge base and will not be chunked using RAGFlow's built-in chunk methods. However, RAGFlow's built-in OSR, DLR, and TSR models will still be applied to process the document. +No. Files uploaded to an agent as input are not stored in a knowledge base and hence will not be processed using RAGFlow's built-in OCR, DLR, or TSR models, or chunked using RAGFlow's built-in chunk methods. ### How to upload a webpage or file from a URL? @@ -74,5 +78,8 @@ If you set the type of a variable as **file**, your users will be able to upload ### File size limit for an uploaded file -The maximum file size for each uploaded file is determined by the variable `MAX_CONTENT_LENGTH` in `/docker/.env`. It defaults to 128 MB.
If you change the default file size, ensure you also update the value of `client_max_body_size` in `/docker/nginx/nginx.conf` accordingly. +There is no *specific* file size limit for a file uploaded to an agent. However, note that model providers typically have a default or explicit maximum token setting, which can range from 8192 to 128k. The plain text part of the uploaded file will be passed in as the key value, but if the file's token count exceeds this limit, the string will be truncated and incomplete. +:::tip NOTE +The variables `MAX_CONTENT_LENGTH` in `/docker/.env` and `client_max_body_size` in `/docker/nginx/nginx.conf` set the file size limit for each upload to a knowledge base or **File Management**. These settings DO NOT apply in this scenario. +::: \ No newline at end of file diff --git a/docs/guides/agent/agent_component_reference/generate.mdx b/docs/guides/agent/agent_component_reference/generate.mdx index 3cccfc5fb..c869b34d3 100644 --- a/docs/guides/agent/agent_component_reference/generate.mdx +++ b/docs/guides/agent/agent_component_reference/generate.mdx @@ -56,7 +56,11 @@ Click the dropdown menu of **Model** to show the model configuration window. Typically, you use the system prompt to describe the task for the LLM, specify how it should respond, and outline other miscellaneous requirements. We do not plan to elaborate on this topic, as it can be as extensive as prompt engineering. However, please be aware that the system prompt is often used in conjunction with keys (variables), which serve as various data inputs for the LLM. -Keys in a system prompt should be enclosed in curly braces. Below is a prompt excerpt of a **Generate** component from the **Interpreter** template (component ID: **Reflect**): +:::danger IMPORTANT +A **Generate** component relies on keys (variables) to specify its data inputs. Its immediate upstream component is *not* necessarily its data input, and the arrows in the workflow indicate *only* the processing sequence.
Keys in a **Generate** component are used in conjunction with the system prompt to specify data inputs for the LLM. Use a forward slash `/` or the **(x)** button to show the keys to use. +::: + +Below is a prompt excerpt of a **Generate** component from the **Interpreter** template (component ID: **Reflect**): ```text Your task is to read a source text and a translation to {target_lang}, and give constructive suggestions to improve the translation. The source text and initial translation, delimited by XML tags and , are as follows: @@ -76,11 +80,6 @@ When writing suggestions, pay attention to whether there are ways to improve the Where `{source_text}` and `{target_lang}` are global variables defined by the **Begin** component, while `{translation_1}` is the output of another **Generate** component with the component ID **Translate directly**. - -:::danger IMPORTANT -A **Generate** component relies on keys (variables) to specify its data inputs. Its immediate upstream component is *not* necessarily its data input, and the arrows in the workflow indicate *only* the processing sequence. Keys in a **Generate** component are used in conjunction with the system prompt to specify data inputs for the LLM. Use a forward slash `/` to show the keys to use. -::: - ### Cite This toggle sets whether to cite the original text as reference. diff --git a/docs/guides/dataset/construct_knowledge_graph.md b/docs/guides/dataset/construct_knowledge_graph.md index afd2daa9f..cda957ce4 100644 --- a/docs/guides/dataset/construct_knowledge_graph.md +++ b/docs/guides/dataset/construct_knowledge_graph.md @@ -68,6 +68,10 @@ In a knowledge graph, a community is a cluster of entities linked by relationshi _A **Knowledge graph** entry appears under **Configuration** once a knowledge graph is created._ 3. Click **Knowledge graph** to view the details of the generated graph. +4. 
To use the created knowledge graph, do either of the following: + + - In your **Chat Configuration** dialogue, click the **Assistant Setting** tab to add the corresponding knowledge base(s) and click the **Prompt Engine** tab to switch on the **Use knowledge graph** toggle. + - If you are using an agent, click the **Retrieval** agent component to specify the knowledge base(s) and switch on the **Use knowledge graph** toggle. ## Frequently asked questions diff --git a/docs/references/http_api_reference.md b/docs/references/http_api_reference.md index 4a940c0b0..675ab833d 100644 --- a/docs/references/http_api_reference.md +++ b/docs/references/http_api_reference.md @@ -9,6 +9,22 @@ A complete reference for RAGFlow's RESTful API. Before proceeding, please ensure --- +## ERROR CODES + +--- + +| Code | Message | Description | +|------|-----------------------|----------------------------| +| 400 | Bad Request | Invalid request parameters | +| 401 | Unauthorized | Unauthorized access | +| 403 | Forbidden | Access denied | +| 404 | Not Found | Resource not found | +| 500 | Internal Server Error | Server internal error | +| 1001 | Invalid Chunk ID | Invalid Chunk ID | +| 1002 | Chunk Update Failed | Chunk update failed | + +--- + ## OpenAI-Compatible API --- @@ -531,24 +547,6 @@ Failure: --- -## Error Codes - ---- - -| Code | Message | Description | -| ---- | --------------------- | -------------------------- | -| 400 | Bad Request | Invalid request parameters | -| 401 | Unauthorized | Unauthorized access | -| 403 | Forbidden | Access denied | -| 404 | Not Found | Resource not found | -| 500 | Internal Server Error | Server internal error | -| 1001 | Invalid Chunk ID | Invalid Chunk ID | -| 1002 | Chunk Update Failed | Chunk update failed | - ---- - ---- - ## FILE MANAGEMENT WITHIN DATASET --- @@ -1771,7 +1769,7 @@ Lists chat assistants. 
#### Request - Method: GET -- URL: `/api/v1/chats?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={dataset_name}&id={dataset_id}` +- URL: `/api/v1/chats?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={chat_name}&id={chat_id}` - Headers: - `'Authorization: Bearer '` @@ -1779,7 +1777,7 @@ Lists chat assistants. ```bash curl --request GET \ - --url http://{address}/api/v1/chats?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={dataset_name}&id={dataset_id} \ + --url http://{address}/api/v1/chats?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={chat_name}&id={chat_id} \ --header 'Authorization: Bearer ' ``` diff --git a/docs/references/python_api_reference.md b/docs/references/python_api_reference.md index 10bd62488..d8b077294 100644 --- a/docs/references/python_api_reference.md +++ b/docs/references/python_api_reference.md @@ -18,6 +18,22 @@ pip install ragflow-sdk --- +## ERROR CODES + +--- + +| Code | Message | Description | +|------|----------------------|-----------------------------| +| 400 | Bad Request | Invalid request parameters | +| 401 | Unauthorized | Unauthorized access | +| 403 | Forbidden | Access denied | +| 404 | Not Found | Resource not found | +| 500 | Internal Server Error| Server internal error | +| 1001 | Invalid Chunk ID | Invalid Chunk ID | +| 1002 | Chunk Update Failed | Chunk update failed | + +--- + ## OpenAI-Compatible API --- @@ -317,23 +333,6 @@ dataset = rag_object.list_datasets(name="kb_name") dataset.update({"embedding_model":"BAAI/bge-zh-v1.5", "chunk_method":"manual"}) ``` ---- - -## Error Codes - ---- - -| Code | Message | Description | -|------|---------|-------------| -| 400 | Bad Request | Invalid request parameters | -| 401 | Unauthorized | Unauthorized access | -| 403 | Forbidden | Access denied | -| 404 | Not Found | Resource not found | -| 500 | Internal Server Error | Server internal error | -| 1001 | Invalid Chunk ID | Invalid Chunk ID | -| 
1002 | Chunk Update Failed | Chunk update failed | - - --- ## FILE MANAGEMENT WITHIN DATASET diff --git a/web/src/locales/de.ts b/web/src/locales/de.ts index 2d5f8f6c7..1cc80b301 100644 --- a/web/src/locales/de.ts +++ b/web/src/locales/de.ts @@ -334,7 +334,7 @@ export default { useRaptorTip: 'Rekursive Abstrakte Verarbeitung für Baumorganisierten Abruf, weitere Informationen unter https://huggingface.co/papers/2401.18059.', prompt: 'Prompt', - promptTip: 'LLM-Prompt für die Zusammenfassung.', + promptTip: 'Verwenden Sie den Systemprompt, um die Aufgabe für das LLM zu beschreiben, festzulegen, wie es antworten soll, und andere verschiedene Anforderungen zu skizzieren. Der Systemprompt wird oft in Verbindung mit Schlüsseln (Variablen) verwendet, die als verschiedene Dateninputs für das LLM dienen. Verwenden Sie einen Schrägstrich `/` oder die (x)-Schaltfläche, um die zu verwendenden Schlüssel anzuzeigen.', promptMessage: 'Prompt ist erforderlich', promptText: `Bitte fassen Sie die folgenden Absätze zusammen. Seien Sie vorsichtig mit den Zahlen, erfinden Sie keine Dinge. Absätze wie folgt: {cluster_content} @@ -372,6 +372,7 @@ export default {
  • Sie müssen Tag-Sets in bestimmten Formaten hochladen, bevor Sie die Auto-Tag-Funktion ausführen.
  • Die Auto-Schlüsselwort-Funktion ist vom LLM abhängig und verbraucht eine erhebliche Anzahl an Tokens.
  • +

    Siehe https://ragflow.io/docs/dev/use_tag_sets für Details.

    `, topnTags: 'Top-N Tags', tags: 'Tags', diff --git a/web/src/locales/en.ts b/web/src/locales/en.ts index 19c22fcb3..68ca346f3 100644 --- a/web/src/locales/en.ts +++ b/web/src/locales/en.ts @@ -326,7 +326,7 @@ export default { useRaptorTip: 'Recursive Abstractive Processing for Tree-Organized Retrieval, see https://huggingface.co/papers/2401.18059 for more information.', prompt: 'Prompt', - promptTip: 'LLM prompt used for summarization.', + promptTip: 'Use the system prompt to describe the task for the LLM, specify how it should respond, and outline other miscellaneous requirements. The system prompt is often used in conjunction with keys (variables), which serve as various data inputs for the LLM. Use a forward slash `/` or the (x) button to show the keys to use.', promptMessage: 'Prompt is required', promptText: `Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following: {cluster_content} @@ -353,9 +353,9 @@ The above is the content you need to summarize.`, tagTable: 'Table', tagSet: 'Tag sets', tagSetTip: ` -

    Select one or multiple tag knowledge bases to auto-tag chunks in your knowledge base.

    +

    Select one or multiple tag knowledge bases to auto-tag chunks in your knowledge base. See https://ragflow.io/docs/dev/use_tag_sets for details.

    The user query will also be auto-tagged.

    -This auto-tag feature enhances retrieval by adding another layer of domain-specific knowledge to the existing dataset. +This auto-tagging feature enhances retrieval by adding another layer of domain-specific knowledge to the existing dataset.

    Difference between auto-tag and auto-keyword:

    `, + +

    Consulte https://ragflow.io/docs/dev/use_tag_sets para obter detalhes.

    `, topnTags: 'Top-N Etiquetas', tags: 'Etiquetas', addTag: 'Adicionar etiqueta', diff --git a/web/src/locales/vi.ts b/web/src/locales/vi.ts index 6937a13ae..ca39fd526 100644 --- a/web/src/locales/vi.ts +++ b/web/src/locales/vi.ts @@ -296,7 +296,7 @@ export default { useRaptorTip: 'Recursive Abstractive Processing for Tree-Organized Retrieval, xem https://huggingface.co/papers/2401.18059 để biết thêm thông tin', prompt: 'Nhắc nhở', - promptTip: 'Nhắc nhở LLM được sử dụng để tóm tắt.', + promptTip: 'Sử dụng lời nhắc hệ thống để mô tả nhiệm vụ cho LLM, chỉ định cách nó nên phản hồi và phác thảo các yêu cầu khác nhau. Lời nhắc hệ thống thường được sử dụng kết hợp với các khóa (biến), đóng vai trò là các đầu vào dữ liệu khác nhau cho LLM. Sử dụng dấu gạch chéo `/` hoặc nút (x) để hiển thị các khóa cần sử dụng.', promptMessage: 'Nhắc nhở là bắt buộc', promptText: `Vui lòng tóm tắt các đoạn văn sau. Cẩn thận với các số, đừng bịa ra. Các đoạn văn như sau: {cluster_content} @@ -329,7 +329,7 @@ export default { searchTags: 'Thẻ tìm kiếm', tagTable: 'Bảng', tagSet: 'Thư viện', - tagSetTip: `

    Việc chọn các cơ sở kiến thức 'Tag' giúp gắn thẻ cho từng đoạn.

    Truy vấn đến các đoạn đó cũng sẽ kèm theo thẻ.

    Quy trình này sẽ cải thiện độ chính xác của việc truy xuất bằng cách thêm nhiều thông tin hơn vào bộ dữ liệu, đặc biệt là khi có một tập hợp lớn các đoạn.

    Sự khác biệt giữa thẻ và từ khóa:

    `, + tagSetTip: `

    Việc chọn các cơ sở kiến thức 'Tag' giúp gắn thẻ cho từng đoạn.

    Truy vấn đến các đoạn đó cũng sẽ kèm theo thẻ.

    Quy trình này sẽ cải thiện độ chính xác của việc truy xuất bằng cách thêm nhiều thông tin hơn vào bộ dữ liệu, đặc biệt là khi có một tập hợp lớn các đoạn.

    Sự khác biệt giữa thẻ và từ khóa:

    Xem https://ragflow.io/docs/dev/use_tag_sets để biết thêm chi tiết.

    `, topnTags: 'Thẻ Top-N', tags: 'Thẻ', addTag: 'Thêm thẻ', diff --git a/web/src/locales/zh-traditional.ts b/web/src/locales/zh-traditional.ts index cd9b1b247..aae5c6549 100644 --- a/web/src/locales/zh-traditional.ts +++ b/web/src/locales/zh-traditional.ts @@ -327,7 +327,7 @@ export default { maxClusterMessage: '最大聚類數是必填項', randomSeed: '隨機種子', randomSeedMessage: '隨機種子是必填項', - promptTip: 'LLM提示用於總結。', + promptTip: '系統提示為大型模型提供任務描述、規定回覆方式,以及設定其他各種要求。系統提示通常與 key(變數)合用,透過變數設定大型模型的輸入資料。你可以透過斜線或 (x) 按鈕顯示可用的 key。', maxTokenTip: '用於匯總的最大token數。', thresholdTip: '閾值越大,聚類越少。', maxClusterTip: '最大聚類數。', @@ -352,6 +352,7 @@ export default {
  • 在給你的知識庫文本塊批量打標籤之前,你需要先生成標籤集作為樣本。
  • 自動關鍵詞功能中的關鍵詞由 LLM 生成,此過程相對耗時,並且會產生一定的 Token 消耗。
  • +

    詳情請參閱 https://ragflow.io/docs/dev/use_tag_sets。

    `, tags: '標籤', addTag: '增加標籤', diff --git a/web/src/locales/zh.ts b/web/src/locales/zh.ts index 8faf84daf..ad66445ca 100644 --- a/web/src/locales/zh.ts +++ b/web/src/locales/zh.ts @@ -344,7 +344,7 @@ export default { maxClusterMessage: '最大聚类数是必填项', randomSeed: '随机种子', randomSeedMessage: '随机种子是必填项', - promptTip: 'LLM提示用于总结。', + promptTip: '系统提示为大模型提供任务描述、规定回复方式,以及设置其他各种要求。系统提示通常与 key (变量)合用,通过变量设置大模型的输入数据。你可以通过斜杠或者 (x) 按钮显示可用的 key。', maxTokenTip: '用于汇总的最大token数。', thresholdTip: '阈值越大,聚类越少。', maxClusterTip: '最大聚类数。', @@ -360,7 +360,7 @@ export default { tagSet: '标签集', topnTags: 'Top-N 标签', tagSetTip: ` -

    请选择一个或多个标签集或标签知识库,用于对知识库中的每个文本块进行标记。

    +

    请选择一个或多个标签集或标签知识库,用于对知识库中的每个文本块进行标记。

    对这些文本块的查询也将自动关联相应标签。

    此功能基于文本相似度,能够为数据集的文本块批量添加更多领域知识,从而显著提高检索准确性。该功能还能提升大量文本块的操作效率。

    为了更好地理解标签集的作用,以下是标签集和关键词之间的主要区别: