From 4058715df78b2decbbbf70ad755b41761eb7a2d0 Mon Sep 17 00:00:00 2001 From: writinwaters <93570324+writinwaters@users.noreply.github.com> Date: Thu, 25 Sep 2025 09:45:27 +0800 Subject: [PATCH] Docs: Knowledge base renamed to dataset. (#10269) ### What problem does this PR solve? ### Type of change - [x] Documentation Update --- docs/develop/mcp/_category_.json | 2 +- docs/develop/mcp/launch_mcp_server.md | 4 +- docs/faq.mdx | 2 +- .../agent/agent_component_reference/begin.mdx | 6 +-- .../agent_component_reference/retrieval.mdx | 24 ++++----- docs/guides/agent/agent_introduction.md | 2 +- docs/guides/ai_search.md | 2 +- docs/guides/chat/set_chat_variables.md | 10 ++-- docs/guides/chat/start_chat.md | 14 ++--- docs/guides/dataset/_category_.json | 2 +- .../dataset/autokeyword_autoquestion.mdx | 8 +-- .../dataset/best_practices/_category_.json | 2 +- .../accelerate_doc_indexing.mdx | 6 +-- .../dataset/configure_knowledge_base.md | 54 +++++++++---------- .../dataset/construct_knowledge_graph.md | 20 +++---- docs/guides/dataset/enable_excel2html.md | 8 +-- docs/guides/dataset/enable_raptor.md | 2 +- docs/guides/dataset/run_retrieval_test.md | 8 +-- docs/guides/dataset/select_pdf_parser.md | 2 +- docs/guides/dataset/set_metadata.md | 2 +- docs/guides/dataset/set_page_rank.md | 8 +-- docs/guides/dataset/use_tag_sets.md | 16 +++--- docs/guides/manage_files.md | 14 ++--- docs/guides/models/deploy_local_llm.mdx | 2 +- docs/guides/team/join_or_leave_team.md | 6 +-- docs/guides/team/manage_team_members.md | 4 +- docs/guides/team/share_knowledge_bases.md | 10 ++-- docs/guides/upgrade_ragflow.mdx | 4 +- docs/quickstart.mdx | 30 +++++------ docs/release_notes.md | 30 +++++------ 30 files changed, 152 insertions(+), 152 deletions(-) diff --git a/docs/develop/mcp/_category_.json b/docs/develop/mcp/_category_.json index 35324fdcf..d2f129c23 100644 --- a/docs/develop/mcp/_category_.json +++ b/docs/develop/mcp/_category_.json @@ -3,6 +3,6 @@ "position": 40, "link": { "type": 
"generated-index", - "description": "Guides and references on accessing RAGFlow's knowledge bases via MCP." + "description": "Guides and references on accessing RAGFlow's datasets via MCP." } } diff --git a/docs/develop/mcp/launch_mcp_server.md b/docs/develop/mcp/launch_mcp_server.md index 718aaaf70..ceabc8bd0 100644 --- a/docs/develop/mcp/launch_mcp_server.md +++ b/docs/develop/mcp/launch_mcp_server.md @@ -14,9 +14,9 @@ A RAGFlow Model Context Protocol (MCP) server is designed as an independent comp An MCP server can start up in either self-host mode (default) or host mode: - **Self-host mode**: - When launching an MCP server in self-host mode, you must provide an API key to authenticate the MCP server with the RAGFlow server. In this mode, the MCP server can access *only* the datasets (knowledge bases) of a specified tenant on the RAGFlow server. + When launching an MCP server in self-host mode, you must provide an API key to authenticate the MCP server with the RAGFlow server. In this mode, the MCP server can access *only* the datasets of a specified tenant on the RAGFlow server. - **Host mode**: - In host mode, each MCP client can access their own knowledge bases on the RAGFlow server. However, each client request must include a valid API key to authenticate the client with the RAGFlow server. + In host mode, each MCP client can access their own datasets on the RAGFlow server. However, each client request must include a valid API key to authenticate the client with the RAGFlow server. Once a connection is established, an MCP server communicates with its client in MCP HTTP+SSE (Server-Sent Events) mode, unidirectionally pushing responses from the RAGFlow server to its client in real time. diff --git a/docs/faq.mdx b/docs/faq.mdx index ae2c8ee4a..a9bdf0cb1 100644 --- a/docs/faq.mdx +++ b/docs/faq.mdx @@ -498,7 +498,7 @@ To switch your document engine from Elasticsearch to [Infinity](https://github.c ### Where are my uploaded files stored in RAGFlow's image? 
-All uploaded files are stored in Minio, RAGFlow's object storage solution. For instance, if you upload your file directly to a knowledge base, it is located at `/filename`. +All uploaded files are stored in Minio, RAGFlow's object storage solution. For instance, if you upload your file directly to a dataset, it is located at `/filename`. --- diff --git a/docs/guides/agent/agent_component_reference/begin.mdx b/docs/guides/agent/agent_component_reference/begin.mdx index 74efb14be..597d93905 100644 --- a/docs/guides/agent/agent_component_reference/begin.mdx +++ b/docs/guides/agent/agent_component_reference/begin.mdx @@ -67,14 +67,14 @@ You can tune document parsing and embedding efficiency by setting the environmen ## Frequently asked questions -### Is the uploaded file in a knowledge base? +### Is the uploaded file in a dataset? -No. Files uploaded to an agent as input are not stored in a knowledge base and hence will not be processed using RAGFlow's built-in OCR, DLR or TSR models, or chunked using RAGFlow's built-in chunking methods. +No. Files uploaded to an agent as input are not stored in a dataset and hence will not be processed using RAGFlow's built-in OCR, DLA, or TSR models, or chunked using RAGFlow's built-in chunking methods. ### File size limit for an uploaded file There is no _specific_ file size limit for a file uploaded to an agent. However, note that model providers typically have a default or explicit maximum token setting, which can range from 8196 to 128k: The plain text part of the uploaded file will be passed in as the key value, but if the file's token count exceeds this limit, the string will be truncated and incomplete. :::tip NOTE -The variables `MAX_CONTENT_LENGTH` in `/docker/.env` and `client_max_body_size` in `/docker/nginx/nginx.conf` set the file size limit for each upload to a knowledge base or **File Management**. These settings DO NOT apply in this scenario.
+The variables `MAX_CONTENT_LENGTH` in `/docker/.env` and `client_max_body_size` in `/docker/nginx/nginx.conf` set the file size limit for each upload to a dataset or **File Management**. These settings DO NOT apply in this scenario. ::: diff --git a/docs/guides/agent/agent_component_reference/retrieval.mdx b/docs/guides/agent/agent_component_reference/retrieval.mdx index 0b69c641f..5807eab5c 100644 --- a/docs/guides/agent/agent_component_reference/retrieval.mdx +++ b/docs/guides/agent/agent_component_reference/retrieval.mdx @@ -9,7 +9,7 @@ A component that retrieves information from specified datasets. ## Scenarios -A **Retrieval** component is essential in most RAG scenarios, where information is extracted from designated knowledge bases before being sent to the LLM for content generation. A **Retrieval** component can operate either as a standalone workflow module or as a tool for an **Agent** component. In the latter role, the **Agent** component has autonomous control over when to invoke it for query and retrieval. +A **Retrieval** component is essential in most RAG scenarios, where information is extracted from designated datasets before being sent to the LLM for content generation. A **Retrieval** component can operate either as a standalone workflow module or as a tool for an **Agent** component. In the latter role, the **Agent** component has autonomous control over when to invoke it for query and retrieval. The following screenshot shows a reference design using the **Retrieval** component, where the component serves as a tool for an **Agent** component. You can find it from the **Report Agent Using Knowledge Base** Agent template. @@ -17,7 +17,7 @@ The following screenshot shows a reference design using the **Retrieval** compon ## Prerequisites -Ensure you [have properly configured your target knowledge base(s)](../../dataset/configure_knowledge_base.md). 
+Ensure you [have properly configured your target dataset(s)](../../dataset/configure_knowledge_base.md). ## Quickstart @@ -36,9 +36,9 @@ The **Retrieval** component depends on query variables to specify its queries. By default, you can use `sys.query`, which is the user query and the default output of the **Begin** component. All global variables defined before the **Retrieval** component can also be used as query statements. Use the `(x)` button or type `/` to show all the available query variables. -### 3. Select knowledge base(s) to query +### 3. Select dataset(s) to query -You can specify one or multiple knowledge bases to retrieve data from. If selecting mutiple, ensure they use the same embedding model. +You can specify one or multiple datasets to retrieve data from. If selecting multiple, ensure they use the same embedding model. ### 4. Expand **Advanced Settings** to configure the retrieval method @@ -52,7 +52,7 @@ Using a rerank model will *significantly* increase the system's response time. I ### 5. Enable cross-language search -If your user query is different from the languages of the knowledge bases, you can select the target languages in the **Cross-language search** dropdown menu. The model will then translates queries to ensure accurate matching of semantic meaning across languages. +If your user query is different from the languages of the datasets, you can select the target languages in the **Cross-language search** dropdown menu. The model will then translate queries to ensure accurate matching of semantic meaning across languages. ### 6. Test retrieval results @@ -76,10 +76,10 @@ The **Retrieval** component relies on query variables to specify its queries. Al ### Knowledge bases -Select the knowledge base(s) to retrieve data from. +Select the dataset(s) to retrieve data from.
-- If no knowledge base is selected, meaning conversations with the agent will not be based on any knowledge base, ensure that the **Empty response** field is left blank to avoid an error. -- If you select multiple knowledge bases, you must ensure that the knowledge bases (datasets) you select use the same embedding model; otherwise, an error message would occur. +- If no dataset is selected, meaning conversations with the agent will not be based on any dataset, ensure that the **Empty response** field is left blank to avoid an error. +- If you select multiple datasets, you must ensure that the datasets you select use the same embedding model; otherwise, an error message would occur. ### Similarity threshold @@ -110,11 +110,11 @@ Using a rerank model will *significantly* increase the system's response time. ### Empty response -- Set this as a response if no results are retrieved from the knowledge base(s) for your query, or +- Set this as a response if no results are retrieved from the dataset(s) for your query, or - Leave this field blank to allow the chat model to improvise when nothing is found. :::caution WARNING -If you do not specify a knowledge base, you must leave this field blank; otherwise, an error would occur. +If you do not specify a dataset, you must leave this field blank; otherwise, an error would occur. ::: ### Cross-language search @@ -124,10 +124,10 @@ Select one or more languages for cross‑language search. If no language is sele ### Use knowledge graph :::caution IMPORTANT -Before enabling this feature, ensure you have properly [constructed a knowledge graph from each target knowledge base](../../dataset/construct_knowledge_graph.md). +Before enabling this feature, ensure you have properly [constructed a knowledge graph from each target dataset](../../dataset/construct_knowledge_graph.md). ::: -Whether to use knowledge graph(s) in the specified knowledge base(s) during retrieval for multi-hop question answering. 
When enabled, this would involve iterative searches across entity, relationship, and community report chunks, greatly increasing retrieval time. +Whether to use knowledge graph(s) in the specified dataset(s) during retrieval for multi-hop question answering. When enabled, this would involve iterative searches across entity, relationship, and community report chunks, greatly increasing retrieval time. ### Output diff --git a/docs/guides/agent/agent_introduction.md b/docs/guides/agent/agent_introduction.md index c93bf4c25..fa21a7810 100644 --- a/docs/guides/agent/agent_introduction.md +++ b/docs/guides/agent/agent_introduction.md @@ -27,7 +27,7 @@ Agents and RAG are complementary techniques, each enhancing the other’s capabi Before proceeding, ensure that: 1. You have properly set the LLM to use. See the guides on [Configure your API key](../models/llm_api_key_setup.md) or [Deploy a local LLM](../models/deploy_local_llm.mdx) for more information. -2. You have a knowledge base configured and the corresponding files properly parsed. See the guide on [Configure a knowledge base](../dataset/configure_knowledge_base.md) for more information. +2. You have a dataset configured and the corresponding files properly parsed. See the guide on [Configure a dataset](../dataset/configure_knowledge_base.md) for more information. ::: diff --git a/docs/guides/ai_search.md b/docs/guides/ai_search.md index e5f48793c..6bd533600 100644 --- a/docs/guides/ai_search.md +++ b/docs/guides/ai_search.md @@ -22,7 +22,7 @@ When debugging your chat assistant, you can use AI search as a reference to veri ## Prerequisites - Ensure that you have configured the system's default models on the **Model providers** page. -- Ensure that the intended knowledge bases are properly configured and the intended documents have finished file parsing. +- Ensure that the intended datasets are properly configured and the intended documents have finished file parsing. 
## Frequently asked questions diff --git a/docs/guides/chat/set_chat_variables.md b/docs/guides/chat/set_chat_variables.md index a5676c4f9..89e786262 100644 --- a/docs/guides/chat/set_chat_variables.md +++ b/docs/guides/chat/set_chat_variables.md @@ -25,13 +25,13 @@ In the **Variable** section, you add, remove, or update variables. ### `{knowledge}` - a reserved variable -`{knowledge}` is the system's reserved variable, representing the chunks retrieved from the knowledge base(s) specified by **Knowledge bases** under the **Assistant settings** tab. If your chat assistant is associated with certain knowledge bases, you can keep it as is. +`{knowledge}` is the system's reserved variable, representing the chunks retrieved from the dataset(s) specified by **Knowledge bases** under the **Assistant settings** tab. If your chat assistant is associated with certain datasets, you can keep it as is. :::info NOTE It currently makes no difference whether `{knowledge}` is set as optional or mandatory, but please note this design will be updated in due course. ::: -From v0.17.0 onward, you can start an AI chat without specifying knowledge bases. In this case, we recommend removing the `{knowledge}` variable to prevent unnecessary reference and keeping the **Empty response** field empty to avoid errors. +From v0.17.0 onward, you can start an AI chat without specifying datasets. In this case, we recommend removing the `{knowledge}` variable to prevent unnecessary references and keeping the **Empty response** field empty to avoid errors. ### Custom variables @@ -45,15 +45,15 @@ Besides `{knowledge}`, you can also define your own variables to pair with the s After you add or remove variables in the **Variable** section, ensure your changes are reflected in the system prompt to avoid inconsistencies or errors. Here's an example: ``` -You are an intelligent assistant. Please answer the question by summarizing chunks from the specified knowledge base(s)...
+You are an intelligent assistant. Please answer the question by summarizing chunks from the specified dataset(s)... Your answers should follow a professional and {style} style. ... -Here is the knowledge base: +Here is the dataset: {knowledge} -The above is the knowledge base. +The above is the dataset. ``` :::tip NOTE diff --git a/docs/guides/chat/start_chat.md b/docs/guides/chat/start_chat.md index abe7f8a8f..1ba8c2755 100644 --- a/docs/guides/chat/start_chat.md +++ b/docs/guides/chat/start_chat.md @@ -9,7 +9,7 @@ Initiate an AI-powered chat with a configured chat assistant. --- -Knowledge base, hallucination-free chat, and file management are the three pillars of RAGFlow. Chats in RAGFlow are based on a particular knowledge base or multiple knowledge bases. Once you have created your knowledge base, finished file parsing, and [run a retrieval test](../dataset/run_retrieval_test.md), you can go ahead and start an AI conversation. +Dataset, hallucination-free chat, and file management are the three pillars of RAGFlow. Chats in RAGFlow are based on a particular dataset or multiple datasets. Once you have created your dataset, finished file parsing, and [run a retrieval test](../dataset/run_retrieval_test.md), you can go ahead and start an AI conversation. ## Start an AI chat @@ -21,12 +21,12 @@ You start an AI conversation by creating an assistant. 2. Update **Assistant settings**: - - **Assistant name** is the name of your chat assistant. Each assistant corresponds to a dialogue with a unique combination of knowledge bases, prompts, hybrid search configurations, and large model settings. + - **Assistant name** is the name of your chat assistant. Each assistant corresponds to a dialogue with a unique combination of datasets, prompts, hybrid search configurations, and large model settings. - **Empty response**: - - If you wish to *confine* RAGFlow's answers to your knowledge bases, leave a response here.
Then, when it doesn't retrieve an answer, it *uniformly* responds with what you set here. - - If you wish RAGFlow to *improvise* when it doesn't retrieve an answer from your knowledge bases, leave it blank, which may give rise to hallucinations. + - If you wish to *confine* RAGFlow's answers to your datasets, leave a response here. Then, when it doesn't retrieve an answer, it *uniformly* responds with what you set here. + - If you wish RAGFlow to *improvise* when it doesn't retrieve an answer from your datasets, leave it blank, which may give rise to hallucinations. - **Show quote**: This is a key feature of RAGFlow and enabled by default. RAGFlow does not work like a black box. Instead, it clearly shows the sources of information that its responses are based on. - - Select the corresponding knowledge bases. You can select one or multiple knowledge bases, but ensure that they use the same embedding model, otherwise an error would occur. + - Select the corresponding datasets. You can select one or multiple datasets, but ensure that they use the same embedding model; otherwise, an error would occur. 3. Update **Prompt engine**: @@ -37,14 +37,14 @@ You start an AI conversation by creating an assistant. - If **Rerank model** is selected, the hybrid score system uses keyword similarity and reranker score, and the default weight assigned to the reranker score is 1-0.7=0.3. - **Top N** determines the *maximum* number of chunks to feed to the LLM. In other words, even if more chunks are retrieved, only the top N chunks are provided as input. - **Multi-turn optimization** enhances user queries using existing context in a multi-round conversation. It is enabled by default. When enabled, it will consume additional LLM tokens and significantly increase the time to generate answers. - - **Use knowledge graph** indicates whether to use knowledge graph(s) in the specified knowledge base(s) during retrieval for multi-hop question answering.
When enabled, this would involve iterative searches across entity, relationship, and community report chunks, greatly increasing retrieval time. + - **Use knowledge graph** indicates whether to use knowledge graph(s) in the specified dataset(s) during retrieval for multi-hop question answering. When enabled, this would involve iterative searches across entity, relationship, and community report chunks, greatly increasing retrieval time. - **Reasoning** indicates whether to generate answers through reasoning processes like Deepseek-R1/OpenAI o1. Once enabled, the chat model autonomously integrates Deep Research during question answering when encountering an unknown topic. This involves the chat model dynamically searching external knowledge and generating final answers through reasoning. - **Rerank model** sets the reranker model to use. It is left empty by default. - If **Rerank model** is left empty, the hybrid score system uses keyword similarity and vector similarity, and the default weight assigned to the vector similarity component is 1-0.7=0.3. - If **Rerank model** is selected, the hybrid score system uses keyword similarity and reranker score, and the default weight assigned to the reranker score is 1-0.7=0.3. - [Cross-language search](../../references/glossary.mdx#cross-language-search): Optional Select one or more target languages from the dropdown menu. The system’s default chat model will then translate your query into the selected target language(s). This translation ensures accurate semantic matching across languages, allowing you to retrieve relevant results regardless of language differences. - - When selecting target languages, please ensure that these languages are present in the knowledge base to guarantee an effective search. + - When selecting target languages, please ensure that these languages are present in the dataset to guarantee an effective search. 
- If no target language is selected, the system will search only in the language of your query, which may cause relevant information in other languages to be missed. - **Variable** refers to the variables (keys) to be used in the system prompt. `{knowledge}` is a reserved variable. Click **Add** to add more variables for the system prompt. - If you are uncertain about the logic behind **Variable**, leave it *as-is*. diff --git a/docs/guides/dataset/_category_.json b/docs/guides/dataset/_category_.json index f0d79edfd..4c454f51f 100644 --- a/docs/guides/dataset/_category_.json +++ b/docs/guides/dataset/_category_.json @@ -3,6 +3,6 @@ "position": 0, "link": { "type": "generated-index", - "description": "Guides on configuring a knowledge base." + "description": "Guides on configuring a dataset." } } diff --git a/docs/guides/dataset/autokeyword_autoquestion.mdx b/docs/guides/dataset/autokeyword_autoquestion.mdx index c7a1293af..f61e50317 100644 --- a/docs/guides/dataset/autokeyword_autoquestion.mdx +++ b/docs/guides/dataset/autokeyword_autoquestion.mdx @@ -6,7 +6,7 @@ slug: /autokeyword_autoquestion # Auto-keyword Auto-question import APITable from '@site/src/components/APITable'; -Use a chat model to generate keywords or questions from each chunk in the knowledge base. +Use a chat model to generate keywords or questions from each chunk in the dataset. --- @@ -18,7 +18,7 @@ Enabling this feature increases document indexing time and uses extra tokens, as ## What is Auto-keyword? -Auto-keyword refers to the auto-keyword generation feature of RAGFlow. It uses a chat model to generate a set of keywords or synonyms from each chunk to correct errors and enhance retrieval accuracy. This feature is implemented as a slider under **Page rank** on the **Configuration** page of your knowledge base. +Auto-keyword refers to the auto-keyword generation feature of RAGFlow. 
It uses a chat model to generate a set of keywords or synonyms from each chunk to correct errors and enhance retrieval accuracy. This feature is implemented as a slider under **Page rank** on the **Configuration** page of your dataset. **Values**: @@ -33,7 +33,7 @@ Auto-keyword refers to the auto-keyword generation feature of RAGFlow. It uses a ## What is Auto-question? -Auto-question is a feature of RAGFlow that automatically generates questions from chunks of data using a chat model. These questions (e.g. who, what, and why) also help correct errors and improve the matching of user queries. The feature usually works with FAQ retrieval scenarios involving product manuals or policy documents. And you can find this feature as a slider under **Page rank** on the **Configuration** page of your knowledge base. +Auto-question is a feature of RAGFlow that automatically generates questions from chunks of data using a chat model. These questions (e.g. who, what, and why) also help correct errors and improve the matching of user queries. The feature usually works with FAQ retrieval scenarios involving product manuals or policy documents. You can find this feature as a slider under **Page rank** on the **Configuration** page of your dataset. **Values**: @@ -48,7 +48,7 @@ Auto-question is a feature of RAGFlow that automatically generates questions fro ## Tips from the community -The Auto-keyword or Auto-question values relate closely to the chunking size in your knowledge base. However, if you are new to this feature and unsure which value(s) to start with, the following are some value settings we gathered from our community. While they may not be accurate, they provide a starting point at the very least. +The Auto-keyword or Auto-question values relate closely to the chunking size in your dataset. However, if you are new to this feature and unsure which value(s) to start with, the following are some value settings we gathered from our community.
While they may not be accurate, they provide a starting point at the very least. ```mdx-code-block diff --git a/docs/guides/dataset/best_practices/_category_.json b/docs/guides/dataset/best_practices/_category_.json index 52098b7d8..f55fe009b 100644 --- a/docs/guides/dataset/best_practices/_category_.json +++ b/docs/guides/dataset/best_practices/_category_.json @@ -3,6 +3,6 @@ "position": 11, "link": { "type": "generated-index", - "description": "Best practices on configuring a knowledge base." + "description": "Best practices on configuring a dataset." } } diff --git a/docs/guides/dataset/best_practices/accelerate_doc_indexing.mdx b/docs/guides/dataset/best_practices/accelerate_doc_indexing.mdx index bc0dde11b..d70579769 100644 --- a/docs/guides/dataset/best_practices/accelerate_doc_indexing.mdx +++ b/docs/guides/dataset/best_practices/accelerate_doc_indexing.mdx @@ -13,7 +13,7 @@ A checklist to speed up document parsing and indexing. Please note that some of your settings may consume a significant amount of time. If you often find that document parsing is time-consuming, here is a checklist to consider: - Use GPU to reduce embedding time. -- On the configuration page of your knowledge base, switch off **Use RAPTOR to enhance retrieval**. +- On the configuration page of your dataset, switch off **Use RAPTOR to enhance retrieval**. - Extracting knowledge graph (GraphRAG) is time-consuming. -- Disable **Auto-keyword** and **Auto-question** on the configuration page of your knowledge base, as both depend on the LLM. -- **v0.17.0+:** If all PDFs in your knowledge base are plain text and do not require GPU-intensive processes like OCR (Optical Character Recognition), TSR (Table Structure Recognition), or DLA (Document Layout Analysis), you can choose **Naive** over **DeepDoc** or other time-consuming large model options in the **Document parser** dropdown. This will substantially reduce document parsing time. 
+- Disable **Auto-keyword** and **Auto-question** on the configuration page of your dataset, as both depend on the LLM. +- **v0.17.0+:** If all PDFs in your dataset are plain text and do not require GPU-intensive processes like OCR (Optical Character Recognition), TSR (Table Structure Recognition), or DLA (Document Layout Analysis), you can choose **Naive** over **DeepDoc** or other time-consuming large model options in the **Document parser** dropdown. This will substantially reduce document parsing time. diff --git a/docs/guides/dataset/configure_knowledge_base.md b/docs/guides/dataset/configure_knowledge_base.md index 432ecce1f..487d8b9cd 100644 --- a/docs/guides/dataset/configure_knowledge_base.md +++ b/docs/guides/dataset/configure_knowledge_base.md @@ -3,28 +3,28 @@ sidebar_position: -1 slug: /configure_knowledge_base --- -# Configure knowledge base +# Configure dataset -Knowledge base, hallucination-free chat, and file management are the three pillars of RAGFlow. RAGFlow's AI chats are based on knowledge bases. Each of RAGFlow's knowledge bases serves as a knowledge source, *parsing* files uploaded from your local machine and file references generated in **File Management** into the real 'knowledge' for future AI chats. This guide demonstrates some basic usages of the knowledge base feature, covering the following topics: +Most of RAGFlow's chat assistants and Agents are based on datasets. Each of RAGFlow's datasets serves as a knowledge source, *parsing* files uploaded from your local machine and file references generated in **File Management** into the real 'knowledge' for future AI chats. 
This guide demonstrates some basic usages of the dataset feature, covering the following topics: -- Create a knowledge base -- Configure a knowledge base -- Search for a knowledge base -- Delete a knowledge base +- Create a dataset +- Configure a dataset +- Search for a dataset +- Delete a dataset -## Create knowledge base +## Create dataset -With multiple knowledge bases, you can build more flexible, diversified question answering. To create your first knowledge base: +With multiple datasets, you can build more flexible, diversified question answering. To create your first dataset: -![create knowledge base](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/create_knowledge_base.jpg) +![create dataset](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/create_knowledge_base.jpg) -_Each time a knowledge base is created, a folder with the same name is generated in the **root/.knowledgebase** directory._ +_Each time a dataset is created, a folder with the same name is generated in the **root/.knowledgebase** directory._ -## Configure knowledge base +## Configure dataset -The following screenshot shows the configuration page of a knowledge base. A proper configuration of your knowledge base is crucial for future AI chats. For example, choosing the wrong embedding model or chunking method would cause unexpected semantic loss or mismatched answers in chats. +The following screenshot shows the configuration page of a dataset. A proper configuration of your dataset is crucial for future AI chats. For example, choosing the wrong embedding model or chunking method would cause unexpected semantic loss or mismatched answers in chats. 
-![knowledge base configuration](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/configure_knowledge_base.jpg) +![dataset configuration](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/configure_knowledge_base.jpg) This section covers the following topics: @@ -52,7 +52,7 @@ RAGFlow offers multiple chunking template to facilitate chunking files of differ | Presentation | | PDF, PPTX | | Picture | | JPEG, JPG, PNG, TIF, GIF | | One | Each document is chunked in its entirety (as one). | DOCX, XLSX, XLS (Excel 97-2003), PDF, TXT | -| Tag | The knowledge base functions as a tag set for the others. | XLSX, CSV/TXT | +| Tag | The dataset functions as a tag set for the others. | XLSX, CSV/TXT | You can also change a file's chunking method on the **Datasets** page. @@ -60,7 +60,7 @@ You can also change a file's chunking method on the **Datasets** page. ### Select embedding model -An embedding model converts chunks into embeddings. It cannot be changed once the knowledge base has chunks. To switch to a different embedding model, you must delete all existing chunks in the knowledge base. The obvious reason is that we *must* ensure that files in a specific knowledge base are converted to embeddings using the *same* embedding model (ensure that they are compared in the same embedding space). +An embedding model converts chunks into embeddings. It cannot be changed once the dataset has chunks. To switch to a different embedding model, you must delete all existing chunks in the dataset. The obvious reason is that we *must* ensure that files in a specific dataset are converted to embeddings using the *same* embedding model (ensure that they are compared in the same embedding space). 
The following embedding models can be deployed locally: @@ -73,19 +73,19 @@ These two embedding models are optimized specifically for English and Chinese, s ### Upload file -- RAGFlow's **File Management** allows you to link a file to multiple knowledge bases, in which case each target knowledge base holds a reference to the file. -- In **Knowledge Base**, you are also given the option of uploading a single file or a folder of files (bulk upload) from your local machine to a knowledge base, in which case the knowledge base holds file copies. +- RAGFlow's **File Management** allows you to link a file to multiple datasets, in which case each target dataset holds a reference to the file. +- In a dataset, you are also given the option of uploading a single file or a folder of files (bulk upload) from your local machine, in which case the dataset holds file copies. -While uploading files directly to a knowledge base seems more convenient, we *highly* recommend uploading files to **File Management** and then linking them to the target knowledge bases. This way, you can avoid permanently deleting files uploaded to the knowledge base. +While uploading files directly to a dataset seems more convenient, we *highly* recommend uploading files to **File Management** and then linking them to the target datasets. This way, you can avoid permanently deleting files uploaded to the dataset. ### Parse file -File parsing is a crucial topic in knowledge base configuration. The meaning of file parsing in RAGFlow is twofold: chunking files based on file layout and building embedding and full-text (keyword) indexes on these chunks. After having selected the chunking method and embedding model, you can start parsing a file: +File parsing is a crucial topic in dataset configuration. The meaning of file parsing in RAGFlow is twofold: chunking files based on file layout and building embedding and full-text (keyword) indexes on these chunks. 
After having selected the chunking method and embedding model, you can start parsing a file: ![parse file](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/parse_file.jpg) - As shown above, RAGFlow allows you to use a different chunking method for a particular file, offering flexibility beyond the default method. -- As shown above, RAGFlow allows you to enable or disable individual files, offering finer control over knowledge base-based AI chats. +- As shown above, RAGFlow allows you to enable or disable individual files, offering finer control over dataset-based AI chats. ### Intervene with file parsing results @@ -122,17 +122,17 @@ RAGFlow uses multiple recall of both full-text search and vector search in its c See [Run retrieval test](./run_retrieval_test.md) for details. -## Search for knowledge base +## Search for dataset -As of RAGFlow v0.20.5, the search feature is still in a rudimentary form, supporting only knowledge base search by name. +As of RAGFlow v0.20.5, the search feature is still in a rudimentary form, supporting only dataset search by name. -![search knowledge base](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/search_datasets.jpg) +![search dataset](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/search_datasets.jpg) -## Delete knowledge base +## Delete dataset -You are allowed to delete a knowledge base. Hover your mouse over the three dot of the intended knowledge base card and the **Delete** option appears. Once you delete a knowledge base, the associated folder under **root/.knowledge** directory is AUTOMATICALLY REMOVED. +You are allowed to delete a dataset. Hover your mouse over the three-dot icon on the intended dataset card and the **Delete** option appears. Once you delete a dataset, the associated folder under the **root/.knowledgebase** directory is AUTOMATICALLY REMOVED. 
The consequence is: -- The files uploaded directly to the knowledge base are gone; +- The files uploaded directly to the dataset are gone; - The file references, which you created from within **File Management**, are gone, but the associated files still exist in **File Management**. -![delete knowledge base](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/delete_datasets.jpg) +![delete dataset](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/delete_datasets.jpg) diff --git a/docs/guides/dataset/construct_knowledge_graph.md b/docs/guides/dataset/construct_knowledge_graph.md index edc63d98b..4484526f3 100644 --- a/docs/guides/dataset/construct_knowledge_graph.md +++ b/docs/guides/dataset/construct_knowledge_graph.md @@ -5,7 +5,7 @@ slug: /construct_knowledge_graph # Construct knowledge graph -Generate a knowledge graph for your knowledge base. +Generate a knowledge graph for your dataset. --- @@ -13,7 +13,7 @@ To enhance multi-hop question-answering, RAGFlow adds a knowledge graph construc ![Image](https://github.com/user-attachments/assets/1ec21d8e-f255-4d65-9918-69b72dfa142b) -From v0.16.0 onward, RAGFlow supports constructing a knowledge graph on a knowledge base, allowing you to construct a *unified* graph across multiple files within your knowledge base. When a newly uploaded file starts parsing, the generated graph will automatically update. +From v0.16.0 onward, RAGFlow supports constructing a knowledge graph on a dataset, allowing you to construct a *unified* graph across multiple files within your dataset. When a newly uploaded file starts parsing, the generated graph will automatically update. :::danger WARNING Constructing a knowledge graph requires significant memory, computational resources, and tokens. @@ -37,7 +37,7 @@ The system's default chat model is used to generate knowledge graph. Before proc ### Entity types (*Required*) -The types of the entities to extract from your knowledge base. 
The default types are: **organization**, **person**, **event**, and **category**. Add or remove types to suit your specific knowledge base. +The types of the entities to extract from your dataset. The default types are: **organization**, **person**, **event**, and **category**. Add or remove types to suit your specific dataset. ### Method @@ -62,12 +62,12 @@ In a knowledge graph, a community is a cluster of entities linked by relationshi ## Procedure -1. On the **Configuration** page of your knowledge base, switch on **Extract knowledge graph** or adjust its settings as needed, and click **Save** to confirm your changes. +1. On the **Configuration** page of your dataset, switch on **Extract knowledge graph** or adjust its settings as needed, and click **Save** to confirm your changes. - - *The default knowledge graph configurations for your knowledge base are now set and files uploaded from this point onward will automatically use these settings during parsing.* + - *The default knowledge graph configurations for your dataset are now set and files uploaded from this point onward will automatically use these settings during parsing.* - *Files parsed before this update will retain their original knowledge graph settings.* -2. The knowledge graph of your knowledge base does *not* automatically update *until* a newly uploaded file is parsed. +2. The knowledge graph of your dataset does *not* automatically update *until* a newly uploaded file is parsed. _A **Knowledge graph** entry appears under **Configuration** once a knowledge graph is created._ @@ -75,13 +75,13 @@ In a knowledge graph, a community is a cluster of entities linked by relationshi 4. To use the created knowledge graph, do either of the following: - In the **Chat setting** panel of your chat app, switch on the **Use knowledge graph** toggle. - - If you are using an agent, click the **Retrieval** agent component to specify the knowledge base(s) and switch on the **Use knowledge graph** toggle. 
+ - If you are using an agent, click the **Retrieval** agent component to specify the dataset(s) and switch on the **Use knowledge graph** toggle. ## Frequently asked questions -### Can I have different knowledge graph settings for different files in my knowledge base? +### Can I have different knowledge graph settings for different files in my dataset? -Yes, you can. Just one graph is generated per knowledge base. The smaller graphs of your files will be *combined* into one big, unified graph at the end of the graph extraction process. +Yes, you can. Just one graph is generated per dataset. The smaller graphs of your files will be *combined* into one big, unified graph at the end of the graph extraction process. ### Does the knowledge graph automatically update when I remove a related file? @@ -89,7 +89,7 @@ Nope. The knowledge graph does *not* automatically update *until* a newly upload ### How to remove a generated knowledge graph? -To remove the generated knowledge graph, delete all related files in your knowledge base. Although the **Knowledge graph** entry will still be visible, the graph has actually been deleted. +To remove the generated knowledge graph, delete all related files in your dataset. Although the **Knowledge graph** entry will still be visible, the graph has actually been deleted. ### Where is the created knowledge graph stored? diff --git a/docs/guides/dataset/enable_excel2html.md b/docs/guides/dataset/enable_excel2html.md index 531a673cc..d8090420d 100644 --- a/docs/guides/dataset/enable_excel2html.md +++ b/docs/guides/dataset/enable_excel2html.md @@ -12,7 +12,7 @@ Convert complex Excel spreadsheets into HTML tables. When using the **General** chunking method, you can enable the **Excel to HTML** toggle to convert spreadsheet files into HTML tables. If it is disabled, spreadsheet tables will be represented as key-value pairs. For complex tables that cannot be simply represented this way, you must enable this feature. 
:::caution WARNING -The feature is disabled by default. If your knowledge base contains spreadsheets with complex tables and you do not enable this feature, RAGFlow will not throw an error but your tables are likely to be garbled. +The feature is disabled by default. If your dataset contains spreadsheets with complex tables and you do not enable this feature, RAGFlow will not throw an error but your tables are likely to be garbled. ::: ## Scenarios @@ -27,12 +27,12 @@ Works with complex tables that cannot be represented as key-value pairs. Example ## Procedure -1. On your knowledge base's **Configuration** page, select **General** as the chunking method. +1. On your dataset's **Configuration** page, select **General** as the chunking method. _The **Excel to HTML** toggle appears._ -2. Enable **Excel to HTML** if your knowledge base contains complex spreadsheet tables that cannot be represented as key-value pairs. -3. Leave **Excel to HTML** disabled if your knowledge base has no spreadsheet tables or if its spreadsheet tables can be represented as key-value pairs. +2. Enable **Excel to HTML** if your dataset contains complex spreadsheet tables that cannot be represented as key-value pairs. +3. Leave **Excel to HTML** disabled if your dataset has no spreadsheet tables or if its spreadsheet tables can be represented as key-value pairs. 4. If question-answering regarding complex tables is unsatisfactory, check if **Excel to HTML** is enabled. ## Frequently asked questions diff --git a/docs/guides/dataset/enable_raptor.md b/docs/guides/dataset/enable_raptor.md index ae55faaa7..096bc19ef 100644 --- a/docs/guides/dataset/enable_raptor.md +++ b/docs/guides/dataset/enable_raptor.md @@ -43,7 +43,7 @@ The system's default chat model is used to summarize clustered content. Before p ## Configurations -The RAPTOR feature is disabled by default. To enable it, manually switch on the **Use RAPTOR to enhance retrieval** toggle on your knowledge base's **Configuration** page. 
+The RAPTOR feature is disabled by default. To enable it, manually switch on the **Use RAPTOR to enhance retrieval** toggle on your dataset's **Configuration** page. ### Prompt diff --git a/docs/guides/dataset/run_retrieval_test.md b/docs/guides/dataset/run_retrieval_test.md index a9ca9f192..08ef999cd 100644 --- a/docs/guides/dataset/run_retrieval_test.md +++ b/docs/guides/dataset/run_retrieval_test.md @@ -5,11 +5,11 @@ slug: /run_retrieval_test # Run retrieval test -Conduct a retrieval test on your knowledge base to check whether the intended chunks can be retrieved. +Conduct a retrieval test on your dataset to check whether the intended chunks can be retrieved. --- -After your files are uploaded and parsed, it is recommended that you run a retrieval test before proceeding with the chat assistant configuration. Running a retrieval test is *not* an unnecessary or superfluous step at all! Just like fine-tuning a precision instrument, RAGFlow requires careful tuning to deliver optimal question answering performance. Your knowledge base settings, chat assistant configurations, and the specified large and small models can all significantly impact the final results. Running a retrieval test verifies whether the intended chunks can be recovered, allowing you to quickly identify areas for improvement or pinpoint any issue that needs addressing. For instance, when debugging your question answering system, if you know that the correct chunks can be retrieved, you can focus your efforts elsewhere. For example, in issue [#5627](https://github.com/infiniflow/ragflow/issues/5627), the problem was found to be due to the LLM's limitations. +After your files are uploaded and parsed, it is recommended that you run a retrieval test before proceeding with the chat assistant configuration. Running a retrieval test is *not* an unnecessary or superfluous step at all! 
Just like fine-tuning a precision instrument, RAGFlow requires careful tuning to deliver optimal question answering performance. Your dataset settings, chat assistant configurations, and the specified large and small models can all significantly impact the final results. Running a retrieval test verifies whether the intended chunks can be recovered, allowing you to quickly identify areas for improvement or pinpoint any issue that needs addressing. For instance, when debugging your question answering system, if you know that the correct chunks can be retrieved, you can focus your efforts elsewhere. For example, in issue [#5627](https://github.com/infiniflow/ragflow/issues/5627), the problem was found to be due to the LLM's limitations. During a retrieval test, chunks created from your specified chunking method are retrieved using a hybrid search. This search combines weighted keyword similarity with either weighted vector cosine similarity or a weighted reranking score, depending on your settings: @@ -65,7 +65,7 @@ Using a knowledge graph in a retrieval test will significantly increase the time To perform a [cross-language search](../../references/glossary.mdx#cross-language-search), select one or more target languages from the dropdown menu. The system’s default chat model will then translate your query entered in the Test text field into the selected target language(s). This translation ensures accurate semantic matching across languages, allowing you to retrieve relevant results regardless of language differences. :::tip NOTE -- When selecting target languages, please ensure that these languages are present in the knowledge base to guarantee an effective search. +- When selecting target languages, please ensure that these languages are present in the dataset to guarantee an effective search. - If no target language is selected, the system will search only in the language of your query, which may cause relevant information in other languages to be missed. 
::: @@ -75,7 +75,7 @@ This field is where you put in your testing query. ## Procedure -1. Navigate to the **Retrieval testing** page of your knowledge base, enter your query in **Test text**, and click **Testing** to run the test. +1. Navigate to the **Retrieval testing** page of your dataset, enter your query in **Test text**, and click **Testing** to run the test. 2. If the results are unsatisfactory, tune the options listed in the Configuration section and rerun the test. *The following is a screenshot of a retrieval test conducted without using knowledge graph. It demonstrates a hybrid search combining weighted keyword similarity and weighted vector cosine similarity. The overall hybrid similarity score is 28.56, calculated as 25.17 (term similarity score) x 0.7 + 36.49 (vector similarity score) x 0.3:* diff --git a/docs/guides/dataset/select_pdf_parser.md b/docs/guides/dataset/select_pdf_parser.md index 1bdda5d1d..eabf0b264 100644 --- a/docs/guides/dataset/select_pdf_parser.md +++ b/docs/guides/dataset/select_pdf_parser.md @@ -27,7 +27,7 @@ RAGFlow isn't one-size-fits-all. It is built for flexibility and supports deeper ## Procedure -1. On your knowledge base's **Configuration** page, select a chunking method, say **General**. +1. On your dataset's **Configuration** page, select a chunking method, say **General**. _The **PDF parser** dropdown menu appears._ diff --git a/docs/guides/dataset/set_metadata.md b/docs/guides/dataset/set_metadata.md index e0281da81..904efaa9c 100644 --- a/docs/guides/dataset/set_metadata.md +++ b/docs/guides/dataset/set_metadata.md @@ -9,7 +9,7 @@ Add metadata to an uploaded file --- -On the **Dataset** page of your knowledge base, you can add metadata to any uploaded file. This approach enables you to 'tag' additional information like URL, author, date, and more to an existing file. In an AI-powered chat, such information will be sent to the LLM with the retrieved chunks for content generation. 
+On the **Dataset** page of your dataset, you can add metadata to any uploaded file. This approach enables you to 'tag' additional information like URL, author, date, and more to an existing file. In an AI-powered chat, such information will be sent to the LLM with the retrieved chunks for content generation. For example, if you have a dataset of HTML files and want the LLM to cite the source URL when responding to your query, add a `"url"` parameter to each file's metadata. diff --git a/docs/guides/dataset/set_page_rank.md b/docs/guides/dataset/set_page_rank.md index c0af82308..4b24d9b34 100644 --- a/docs/guides/dataset/set_page_rank.md +++ b/docs/guides/dataset/set_page_rank.md @@ -11,15 +11,15 @@ Create a step-retrieval strategy using page rank. ## Scenario -In an AI-powered chat, you can configure a chat assistant or an agent to respond using knowledge retrieved from multiple specified knowledge bases (datasets), provided that they employ the same embedding model. In situations where you prefer information from certain knowledge base(s) to take precedence or to be retrieved first, you can use RAGFlow's page rank feature to increase the ranking of chunks from these knowledge bases. For example, if you have configured a chat assistant to draw from two knowledge bases, knowledge base A for 2024 news and knowledge base B for 2023 news, but wish to prioritize news from year 2024, this feature is particularly useful. +In an AI-powered chat, you can configure a chat assistant or an agent to respond using knowledge retrieved from multiple specified datasets, provided that they employ the same embedding model. In situations where you prefer information from certain dataset(s) to take precedence or to be retrieved first, you can use RAGFlow's page rank feature to increase the ranking of chunks from these datasets. 
For example, if you have configured a chat assistant to draw from two datasets, dataset A for 2024 news and dataset B for 2023 news, but wish to prioritize news from year 2024, this feature is particularly useful. :::info NOTE -It is important to note that this 'page rank' feature operates at the level of the entire knowledge base rather than on individual files or documents. +It is important to note that this 'page rank' feature operates at the level of the entire dataset rather than on individual files or documents. ::: ## Configuration -On the **Configuration** page of your knowledge base, drag the slider under **Page rank** to set the page rank value for your knowledge base. You are also allowed to input the intended page rank value in the field next to the slider. +On the **Configuration** page of your dataset, drag the slider under **Page rank** to set the page rank value for your dataset. You are also allowed to input the intended page rank value in the field next to the slider. :::info NOTE The page rank value must be an integer. Range: [0,100] @@ -36,4 +36,4 @@ If you set the page rank value to a non-integer, say 1.7, it will be rounded dow If you configure a chat assistant's **similarity threshold** to 0.2, only chunks with a hybrid score greater than 0.2 x 100 = 20 will be retrieved and sent to the chat model for content generation. This initial filtering step is crucial for narrowing down relevant information. -If you have assigned a page rank of 1 to knowledge base A (2024 news) and 0 to knowledge base B (2023 news), the final hybrid scores of the retrieved chunks will be adjusted accordingly. A chunk retrieved from knowledge base A with an initial score of 50 will receive a boost of 1 x 100 = 100 points, resulting in a final score of 50 + 1 x 100 = 150. In this way, chunks retrieved from knowledge base A will always precede chunks from knowledge base B. 
\ No newline at end of file +If you have assigned a page rank of 1 to dataset A (2024 news) and 0 to dataset B (2023 news), the final hybrid scores of the retrieved chunks will be adjusted accordingly. A chunk retrieved from dataset A with an initial score of 50 will receive a boost of 1 x 100 = 100 points, resulting in a final score of 50 + 1 x 100 = 150. In this way, chunks retrieved from dataset A will always precede chunks from dataset B. \ No newline at end of file diff --git a/docs/guides/dataset/use_tag_sets.md b/docs/guides/dataset/use_tag_sets.md index 012c63e14..81dc65838 100644 --- a/docs/guides/dataset/use_tag_sets.md +++ b/docs/guides/dataset/use_tag_sets.md @@ -9,9 +9,9 @@ Use a tag set to auto-tag chunks in your datasets. --- -Retrieval accuracy is the touchstone for a production-ready RAG framework. In addition to retrieval-enhancing approaches like auto-keyword, auto-question, and knowledge graph, RAGFlow introduces an auto-tagging feature to address semantic gaps. The auto-tagging feature automatically maps tags in the user-defined tag sets to relevant chunks within your knowledge base based on similarity with each chunk. This automation mechanism allows you to apply an additional "layer" of domain-specific knowledge to existing datasets, which is particularly useful when dealing with a large number of chunks. +Retrieval accuracy is the touchstone for a production-ready RAG framework. In addition to retrieval-enhancing approaches like auto-keyword, auto-question, and knowledge graph, RAGFlow introduces an auto-tagging feature to address semantic gaps. The auto-tagging feature automatically maps tags in the user-defined tag sets to relevant chunks within your dataset based on similarity with each chunk. This automation mechanism allows you to apply an additional "layer" of domain-specific knowledge to existing datasets, which is particularly useful when dealing with a large number of chunks. 
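The similarity-based tag mapping described above can be sketched as follows. This is a minimal illustration, not RAGFlow's internal API: the embeddings, the `0.8` threshold, and the function names are all assumptions for demonstration purposes.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def auto_tag(chunk_vec: list[float],
             tag_entries: list[tuple[str, list[float]]],
             threshold: float = 0.8) -> list[str]:
    """Attach every tag whose tag-set entry is similar enough to the chunk.

    Each chunk is compared with every entry in the tag set(s), mirroring
    the auto-tagging pass that runs when documents are re-parsed.
    """
    return [tag for tag, vec in tag_entries
            if cosine_similarity(chunk_vec, vec) >= threshold]

# Hypothetical embeddings, for illustration only:
tags = auto_tag([1.0, 0.0], [("finance", [0.9, 0.1]), ("sports", [0.0, 1.0])])
# tags == ["finance"]
```

In a real deployment the chunk and tag embeddings would come from the dataset's embedding model; the point here is only the closed-set mapping: tags are drawn exclusively from the tag set, never invented per chunk.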
-To use this feature, ensure you have at least one properly configured tag set, specify the tag set(s) on the **Configuration** page of your knowledge base (dataset), and then re-parse your documents to initiate the auto-tagging process. During this process, each chunk in your dataset is compared with every entry in the specified tag set(s), and tags are automatically applied based on similarity. +To use this feature, ensure you have at least one properly configured tag set, specify the tag set(s) on the **Configuration** page of your dataset, and then re-parse your documents to initiate the auto-tagging process. During this process, each chunk in your dataset is compared with every entry in the specified tag set(s), and tags are automatically applied based on similarity. ## Scenarios @@ -19,7 +19,7 @@ Auto-tagging applies in situations where chunks are so similar to each other tha ## 1. Create tag set -You can consider a tag set as a closed set, and the tags to attach to the chunks in your dataset (knowledge base) are *exclusively* from the specified tag set. You use a tag set to "inform" RAGFlow which chunks to tag and which tags to apply. +You can consider a tag set as a closed set, and the tags to attach to the chunks in your dataset are *exclusively* from the specified tag set. You use a tag set to "inform" RAGFlow which chunks to tag and which tags to apply. ### Prepare a tag table file @@ -41,8 +41,8 @@ As a rule of thumb, consider including the following entries in your tag table: A tag set is *not* involved in document indexing or retrieval. Do not specify a tag set when configuring your chat assistant or agent. ::: -1. Click **+ Create knowledge base** to create a knowledge base. -2. Navigate to the **Configuration** page of the created knowledge base and choose **Tag** as the default chunking method. +1. Click **+ Create dataset** to create a dataset. +2. 
Navigate to the **Configuration** page of the created dataset and choose **Tag** as the default chunking method. 3. Navigate to the **Dataset** page and upload and parse your table file in XLSX, CSV, or TXT formats. _A tag cloud appears under the **Tag view** section, indicating the tag set is created:_ ![Image](https://github.com/user-attachments/assets/abefbcbf-c130-4abe-95e1-267b0d2a0505) @@ -53,7 +53,7 @@ A tag set is *not* involved in document indexing or retrieval. Do not specify a Once a tag set is created, you can apply it to your dataset: -1. Navigate to the **Configuration** page of your knowledge base (dataset). +1. Navigate to the **Configuration** page of your dataset. 2. Select the tag set from the **Tag sets** dropdown and click **Save** to confirm. :::tip NOTE @@ -94,9 +94,9 @@ If you add new table files to your tag set, it is at your own discretion whether Yes, you can. Usually one tag set suffices. When using multiple tag sets, ensure they are independent of each other; otherwise, consider merging your tag sets. -### Difference between a tag set and a standard knowledge base? +### Difference between a tag set and a standard dataset? -A standard knowledge base is a dataset. It will be searched by RAGFlow's document engine and the retrieved chunks will be fed to the LLM. In contrast, a tag set is used solely to attach tags to chunks within your dataset. It does not directly participate in the retrieval process, and you should not choose a tag set when selecting datasets for your chat assistant or agent. +A standard dataset holds the content to be retrieved: it will be searched by RAGFlow's document engine and the retrieved chunks will be fed to the LLM. In contrast, a tag set is used solely to attach tags to chunks within your dataset. It does not directly participate in the retrieval process, and you should not choose a tag set when selecting datasets for your chat assistant or agent. ### Difference between auto-tag and auto-keyword? 
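The scoring arithmetic documented in the retrieval test and page rank guides above can be sketched as follows. This is an illustrative reconstruction of the formulas stated in those guides (weighted keyword + vector similarity, a page-rank boost of `page_rank x 100`, and a `similarity_threshold x 100` cutoff), not RAGFlow's actual code; the function names are assumptions.

```python
def hybrid_score(term_sim: float, vector_sim: float,
                 keyword_weight: float = 0.7) -> float:
    """Weighted keyword similarity plus weighted vector cosine similarity."""
    return term_sim * keyword_weight + vector_sim * (1.0 - keyword_weight)

def final_score(hybrid: float, page_rank: int) -> float:
    """Boost a chunk's hybrid score by its dataset's page rank (integer in [0, 100])."""
    return hybrid + page_rank * 100

# Worked examples taken from the guides:
retrieval = hybrid_score(25.17, 36.49)  # 25.17 * 0.7 + 36.49 * 0.3 = 28.566
boosted = final_score(50, 1)            # 50 + 1 * 100 = 150
# With a similarity threshold of 0.2, only chunks scoring above 0.2 * 100 = 20 survive:
threshold_pass = retrieval > 0.2 * 100  # True
```

Because the boost is `page_rank x 100` while unboosted hybrid scores live in [0, 100], any chunk from a dataset with page rank 1 outranks every chunk from a page-rank-0 dataset, which is exactly the step-retrieval behavior the guide describes.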
diff --git a/docs/guides/manage_files.md b/docs/guides/manage_files.md index 7f633dd0a..f3e3b31e6 100644 --- a/docs/guides/manage_files.md +++ b/docs/guides/manage_files.md @@ -5,10 +5,10 @@ slug: /manage_files # Files -Knowledge base, hallucination-free chat, and file management are the three pillars of RAGFlow. RAGFlow's file management allows you to upload files individually or in bulk. You can then link an uploaded file to multiple target knowledge bases. This guide showcases some basic usages of the file management feature. +RAGFlow's file management allows you to upload files individually or in bulk. You can then link an uploaded file to multiple target datasets. This guide showcases some basic usages of the file management feature. :::info IMPORTANT -Compared to uploading files directly to various knowledge bases, uploading them to RAGFlow's file management and then linking them to different knowledge bases is *not* an unnecessary step, particularly when you want to delete some parsed files or an entire knowledge base but retain the original files. +Compared to uploading files directly to various datasets, uploading them to RAGFlow's file management and then linking them to different datasets is *not* an unnecessary step, particularly when you want to delete some parsed files or an entire dataset but retain the original files. ::: ## Create folder @@ -18,7 +18,7 @@ RAGFlow's file management allows you to establish your file system with nested f ![create new folder](https://github.com/infiniflow/ragflow/assets/93570324/3a37a5f4-43a6-426d-a62a-e5cd2ff7a533) :::caution NOTE -Each knowledge base in RAGFlow has a corresponding folder under the **root/.knowledgebase** directory. You are not allowed to create a subfolder within it. +Each dataset in RAGFlow has a corresponding folder under the **root/.knowledgebase** directory. You are not allowed to create a subfolder within it. 
::: ## Upload file @@ -39,13 +39,13 @@ RAGFlow's file management supports previewing files in the following formats: ![preview](https://github.com/infiniflow/ragflow/assets/93570324/2e931362-8bbf-482c-ac86-b68b09d331bc) -## Link file to knowledge bases +## Link file to datasets -RAGFlow's file management allows you to *link* an uploaded file to multiple knowledge bases, creating a file reference in each target knowledge base. Therefore, deleting a file in your file management will AUTOMATICALLY REMOVE all related file references across the knowledge bases. +RAGFlow's file management allows you to *link* an uploaded file to multiple datasets, creating a file reference in each target dataset. Therefore, deleting a file in your file management will AUTOMATICALLY REMOVE all related file references across the datasets. ![link knowledgebase](https://github.com/infiniflow/ragflow/assets/93570324/6c6b8db4-3269-4e35-9434-6089887e3e3f) -You can link your file to one knowledge base or multiple knowledge bases at one time: +You can link your file to one dataset or multiple datasets at one time: ![link multiple kb](https://github.com/infiniflow/ragflow/assets/93570324/6c508803-fb1f-435d-b688-683066fd7fff) @@ -79,7 +79,7 @@ To bulk delete files or folders: ![bulk delete](https://github.com/infiniflow/ragflow/assets/93570324/519b99ab-ec7f-4c8a-8cea-e0b6dcb3cb46) > - You are not allowed to delete the **root/.knowledgebase** folder. -> - Deleting files that have been linked to knowledge bases will **AUTOMATICALLY REMOVE** all associated file references across the knowledge bases. +> - Deleting files that have been linked to datasets will **AUTOMATICALLY REMOVE** all associated file references across the datasets. 
## Download uploaded file diff --git a/docs/guides/models/deploy_local_llm.mdx b/docs/guides/models/deploy_local_llm.mdx index 918e9503c..6553e7c53 100644 --- a/docs/guides/models/deploy_local_llm.mdx +++ b/docs/guides/models/deploy_local_llm.mdx @@ -164,7 +164,7 @@ Click on your logo **>** **Model providers** **>** **System Model Settings** to Update your chat model accordingly in **Chat Configuration**: -> If your local model is an embedding model, update it on the configuration page of your knowledge base. +> If your local model is an embedding model, update it on the configuration page of your dataset. ## Deploy a local model using IPEX-LLM diff --git a/docs/guides/team/join_or_leave_team.md b/docs/guides/team/join_or_leave_team.md index 12257306d..93255ef3c 100644 --- a/docs/guides/team/join_or_leave_team.md +++ b/docs/guides/team/join_or_leave_team.md @@ -11,7 +11,7 @@ Accept an invite to join a team, decline an invite, or leave a team. Once you join a team, you can do the following: -- Upload documents to the team owner's shared datasets (knowledge bases). +- Upload documents to the team owner's shared datasets. - Parse documents in the team owner's shared datasets. - Use the team owner's shared Agents. @@ -22,7 +22,7 @@ You cannot invite users to a team unless you are its owner. ## Prerequisites 1. Ensure that your Email address that received the team invitation is associated with a RAGFlow user account. -2. The team owner should share his knowledge bases by setting their **Permission** to **Team**. +2. The team owner should share their datasets by setting **Permission** to **Team**. ## Accept or decline team invite @@ -32,6 +32,6 @@ You cannot invite users to a team unless you are its owner. 
_On the **Team** page, you can view the information about members of your team and the teams you have joined._ -_After accepting the team invite, you should be able to view and update the team owner's knowledge bases whose **Permissions** is set to **Team**._ +_After accepting the team invite, you should be able to view and update the team owner's datasets with **Permissions** set to **Team**._ ## Leave a joined team \ No newline at end of file diff --git a/docs/guides/team/manage_team_members.md b/docs/guides/team/manage_team_members.md index bf8a2eacf..edd8289cd 100644 --- a/docs/guides/team/manage_team_members.md +++ b/docs/guides/team/manage_team_members.md @@ -11,7 +11,7 @@ Invite or remove team members. By default, each RAGFlow user is assigned a single team named after their name. RAGFlow allows you to invite RAGFlow users to your team. Your team members can help you: -- Upload documents to your shared datasets (knowledge bases). +- Upload documents to your shared datasets. - Parse documents in your shared datasets. - Use your shared Agents. @@ -23,7 +23,7 @@ By default, each RAGFlow user is assigned a single team named after their name. ## Prerequisites 1. Ensure that the invited team member is a RAGFlow user and that the Email address used is associated with a RAGFlow user account. -2. To allow your team members to view and update your knowledge base, ensure that you set **Permissions** on its **Configuration** page from **Only me** to **Team**. +2. To allow your team members to view and update your dataset, ensure that you set **Permissions** on its **Configuration** page from **Only me** to **Team**. 
## Invite team members diff --git a/docs/guides/team/share_knowledge_bases.md b/docs/guides/team/share_knowledge_bases.md index ed106b63f..4eeccd264 100644 --- a/docs/guides/team/share_knowledge_bases.md +++ b/docs/guides/team/share_knowledge_bases.md @@ -3,16 +3,16 @@ sidebar_position: 4 slug: /share_datasets --- -# Share knowledge base +# Share dataset -Share a knowledge base with team members. +Share a dataset with team members. --- -When ready, you may share your knowledge bases with your team members so that they can upload and parse files in them. Please note that your knowledge bases are not shared automatically; you must manually enable sharing by selecting the appropriate **Permissions** radio button: +When ready, you may share your datasets with your team members so that they can upload and parse files in them. Please note that your datasets are not shared automatically; you must manually enable sharing by selecting the appropriate **Permissions** radio button: -1. Navigate to the knowledge base's **Configuration** page. +1. Navigate to the dataset's **Configuration** page. 2. Change **Permissions** from **Only me** to **Team**. 3. Click **Save** to apply your changes. -*Once completed, your team members will see your shared knowledge bases.* \ No newline at end of file +*Once completed, your team members will see your shared datasets.* \ No newline at end of file diff --git a/docs/guides/upgrade_ragflow.mdx b/docs/guides/upgrade_ragflow.mdx index 57a9bd7d8..41d9dfd22 100644 --- a/docs/guides/upgrade_ragflow.mdx +++ b/docs/guides/upgrade_ragflow.mdx @@ -105,9 +105,9 @@ RAGFLOW_IMAGE=infiniflow/ragflow:v0.20.5 ## Frequently asked questions -### Do I need to back up my knowledge bases before upgrading RAGFlow? +### Do I need to back up my datasets before upgrading RAGFlow? -No, you do not need to. Upgrading RAGFlow in itself will *not* remove your uploaded data or knowledge base settings. 
However, be aware that `docker compose -f docker/docker-compose.yml down -v` will remove Docker container volumes, resulting in data loss. +No, you do not need to. Upgrading RAGFlow in itself will *not* remove your uploaded data or dataset settings. However, be aware that `docker compose -f docker/docker-compose.yml down -v` will remove Docker container volumes, resulting in data loss. ### Upgrade RAGFlow in an offline environment (without Internet access) diff --git a/docs/quickstart.mdx b/docs/quickstart.mdx index 8be0a8b35..959fb33e5 100644 --- a/docs/quickstart.mdx +++ b/docs/quickstart.mdx @@ -13,7 +13,7 @@ RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on d This quick start guide describes a general process from: - Starting up a local RAGFlow server, -- Creating a knowledge base, +- Creating a dataset, - Intervening with file parsing, to - Establishing an AI chat based on your datasets. @@ -280,29 +280,29 @@ To add and configure an LLM: > Some models, such as the image-to-text model **qwen-vl-max**, are subsidiary to a specific LLM. And you may need to update your API key to access these models. -## Create your first knowledge base +## Create your first dataset -You are allowed to upload files to a knowledge base in RAGFlow and parse them into datasets. A knowledge base is virtually a collection of datasets. Question answering in RAGFlow can be based on a particular knowledge base or multiple knowledge bases. File formats that RAGFlow supports include documents (PDF, DOC, DOCX, TXT, MD, MDX), tables (CSV, XLSX, XLS), pictures (JPEG, JPG, PNG, TIF, GIF), and slides (PPT, PPTX). +You can upload files to a dataset in RAGFlow and parse them into chunks. A dataset is essentially a collection of parsed files. Question answering in RAGFlow can be based on a particular dataset or multiple datasets. 
File formats that RAGFlow supports include documents (PDF, DOC, DOCX, TXT, MD, MDX), tables (CSV, XLSX, XLS), pictures (JPEG, JPG, PNG, TIF, GIF), and slides (PPT, PPTX). -To create your first knowledge base: +To create your first dataset: 1. Click the **Dataset** tab in the top middle of the page **>** **Create dataset**. -2. Input the name of your knowledge base and click **OK** to confirm your changes. +2. Input the name of your dataset and click **OK** to confirm your changes. - _You are taken to the **Configuration** page of your knowledge base._ + _You are taken to the **Configuration** page of your dataset._ - ![knowledge base configuration](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/configure_knowledge_base.jpg) + ![dataset configuration](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/configure_knowledge_base.jpg) -3. RAGFlow offers multiple chunk templates that cater to different document layouts and file formats. Select the embedding model and chunking method (template) for your knowledge base. +3. RAGFlow offers multiple chunk templates that cater to different document layouts and file formats. Select the embedding model and chunking method (template) for your dataset. :::danger IMPORTANT -Once you have selected an embedding model and used it to parse a file, you are no longer allowed to change it. The obvious reason is that we must ensure that all files in a specific knowledge base are parsed using the *same* embedding model (ensure that they are being compared in the same embedding space). +Once you have selected an embedding model and used it to parse a file, you are no longer allowed to change it. The obvious reason is that we must ensure that all files in a specific dataset are parsed using the *same* embedding model (ensure that they are being compared in the same embedding space). 
::: - _You are taken to the **Dataset** page of your dataset._ -4. Click **+ Add file** **>** **Local files** to start uploading a particular file to the knowledge base. +4. Click **+ Add file** **>** **Local files** to start uploading a particular file to the dataset. 5. In the uploaded file entry, click the play button to start file parsing: @@ -341,17 +341,17 @@ You can add keywords or questions to a file chunk to improve its ranking for que ## Set up an AI chat -Conversations in RAGFlow are based on a particular knowledge base or multiple knowledge bases. Once you have created your knowledge base and finished file parsing, you can go ahead and start an AI conversation. +Conversations in RAGFlow are based on a particular dataset or multiple datasets. Once you have created your dataset and finished file parsing, you can go ahead and start an AI conversation. 1. Click the **Chat** tab in the middle top of the page **>** **Create an assistant** to show the **Chat Configuration** dialogue *of your next dialogue*. > RAGFlow offers the flexibility of choosing a different chat model for each dialogue, while allowing you to set the default models in **System Model Settings**. 2. Update **Assistant settings**: - - Name your assistant and specify your knowledge bases. - **Empty response**: - - If you wish to *confine* RAGFlow's answers to your knowledge bases, leave a response here. Then when it doesn't retrieve an answer, it *uniformly* responds with what you set here. - - If you wish RAGFlow to *improvise* when it doesn't retrieve an answer from your knowledge bases, leave it blank, which may give rise to hallucinations. + - Name your assistant and specify your datasets. - **Empty response**: - - If you wish to *confine* RAGFlow's answers to your datasets, leave a response here. Then when it doesn't retrieve an answer, it *uniformly* responds with what you set here. 
+ - If you wish RAGFlow to *improvise* when it doesn't retrieve an answer from your datasets, leave it blank, which may give rise to hallucinations. 3. Update **Prompt engine** or leave it as is for the beginning. diff --git a/docs/release_notes.md b/docs/release_notes.md index 6e983755b..b19f9f720 100644 --- a/docs/release_notes.md +++ b/docs/release_notes.md @@ -79,7 +79,7 @@ ZHIPU GLM-4.5 ### New Agent templates -Ecommerce Customer Service Workflow: A template designed to handle enquiries about product features and multi-product comparisons using the internal knowledge base, as well as to manage installation appointment bookings. +Ecommerce Customer Service Workflow: A template designed to handle enquiries about product features and multi-product comparisons using the internal dataset, as well as to manage installation appointment bookings. ### Fixed issues @@ -131,7 +131,7 @@ Released on August 8, 2025. ### New Features -- The **Retrieval** component now supports the dynamic specification of knowledge base names using variables. +- The **Retrieval** component now supports the dynamic specification of dataset names using variables. - The user interface now includes a French language option. ### Added Models @@ -142,7 +142,7 @@ Released on August 8, 2025. ### New agent templates (both workflow and agentic) - SQL Assistant Workflow: Empowers non-technical teams (e.g., operations, product) to independently query business data. -- Choose Your Knowledge Base Workflow: Lets users select a knowledge base to query during conversations. [#9325](https://github.com/infiniflow/ragflow/pull/9325) +- Choose Your Knowledge Base Workflow: Lets users select a dataset to query during conversations. [#9325](https://github.com/infiniflow/ragflow/pull/9325) - Choose Your Knowledge Base Agent: Delivers higher-quality responses with extended reasoning time, suited for complex queries. 
[#9325](https://github.com/infiniflow/ragflow/pull/9325) ### Fixed Issues @@ -175,14 +175,14 @@ From v0.20.0 onwards, Agents are no longer compatible with earlier versions, and ### New agent templates introduced - Multi-Agent based Deep Research: Collaborative Agent teamwork led by a Lead Agent with multiple Subagents, distinct from traditional workflow orchestration. -- An intelligent Q&A chatbot leveraging internal knowledge bases, designed for customer service and training scenarios. +- An intelligent Q&A chatbot leveraging internal datasets, designed for customer service and training scenarios. - A resume analysis template used by the RAGFlow team to screen, analyze, and record candidate information. - A blog generation workflow that transforms raw ideas into SEO-friendly blog content. - An intelligent customer service workflow. - A user feedback analysis template that directs user feedback to appropriate teams through semantic analysis. - Trip Planner: Uses web search and map MCP servers to assist with travel planning. - Image Lingo: Translates content from uploaded photos. -- An information search assistant that retrieves answers from both internal knowledge bases and the web. +- An information search assistant that retrieves answers from both internal datasets and the web. ## v0.19.1 @@ -195,7 +195,7 @@ Released on June 23, 2025. - A context error occurring when using Sandbox in standalone mode. [#8340](https://github.com/infiniflow/ragflow/pull/8340) - An excessive CPU usage issue caused by Ollama. [#8216](https://github.com/infiniflow/ragflow/pull/8216) - A bug in the Code Component. [#7949](https://github.com/infiniflow/ragflow/pull/7949) -- Added support for models installed via Ollama or VLLM when creating a knowledge base through the API. [#8069](https://github.com/infiniflow/ragflow/pull/8069) +- Added support for models installed via Ollama or VLLM when creating a dataset through the API. 
[#8069](https://github.com/infiniflow/ragflow/pull/8069) - Enabled role-based authentication for S3 bucket access. [#8149](https://github.com/infiniflow/ragflow/pull/8149) ### Added models @@ -209,7 +209,7 @@ Released on May 26, 2025. ### New features -- [Cross-language search](./references/glossary.mdx#cross-language-search) is supported in the Knowledge and Chat modules, enhancing search accuracy and user experience in multilingual environments, such as in Chinese-English knowledge bases. +- [Cross-language search](./references/glossary.mdx#cross-language-search) is supported in the Knowledge and Chat modules, enhancing search accuracy and user experience in multilingual environments, such as in Chinese-English datasets. - Agent component: A new Code component supports Python and JavaScript scripts, enabling developers to handle more complex tasks like dynamic data processing. - Enhanced image display: Images in Chat and Search now render directly within responses, rather than as external references. Knowledge retrieval testing can retrieve images directly, instead of texts extracted from images. - Claude 4 and ChatGPT o3: Developers can now use the newly released, most advanced Claude model and OpenAI’s latest ChatGPT o3 inference model. @@ -238,7 +238,7 @@ From this release onwards, built-in rerank models have been removed because they ### New features -- MCP server: enables access to RAGFlow's knowledge bases via MCP. +- MCP server: enables access to RAGFlow's datasets via MCP. - DeepDoc supports adopting VLM model as a processing pipeline during document layout recognition, enabling in-depth analysis of images in PDF and DOCX files. - OpenAI-compatible APIs: Agents can be called via OpenAI-compatible APIs. - User registration control: administrators can enable or disable user registration through an environment variable. @@ -330,7 +330,7 @@ Released on March 3, 2025. - AI chat: Implements Deep Research for agentic reasoning. 
To activate this, enable the **Reasoning** toggle under the **Prompt engine** tab of your chat assistant dialogue. - AI chat: Leverages Tavily-based web search to enhance contexts in agentic reasoning. To activate this, enter the correct Tavily API key under the **Assistant settings** tab of your chat assistant dialogue. -- AI chat: Supports starting a chat without specifying knowledge bases. +- AI chat: Supports starting a chat without specifying datasets. - AI chat: HTML files can also be previewed and referenced, in addition to PDF files. - Dataset: Adds a **PDF parser**, aka **Document parser**, dropdown menu to dataset configurations. This includes a DeepDoc model option, which is time-consuming, a much faster **naive** option (plain text), which skips DLA (Document Layout Analysis), OCR (Optical Character Recognition), and TSR (Table Structure Recognition) tasks, and several currently *experimental* large model options. See [here](./guides/dataset/select_pdf_parser.md). - Agent component: **(x)** or a forward slash `/` can be used to insert available keys (variables) in the system prompt field of the **Generate** or **Template** component. @@ -369,16 +369,16 @@ Released on February 6, 2025. ### New features - Supports DeepSeek R1 and DeepSeek V3. -- GraphRAG refactor: Knowledge graph is dynamically built on an entire knowledge base (dataset) rather than on an individual file, and automatically updated when a newly uploaded file starts parsing. See [here](https://ragflow.io/docs/dev/construct_knowledge_graph). +- GraphRAG refactor: Knowledge graph is dynamically built on an entire dataset rather than on an individual file, and automatically updated when a newly uploaded file starts parsing. See [here](https://ragflow.io/docs/dev/construct_knowledge_graph). - Adds an **Iteration** agent component and a **Research report generator** agent template. See [here](./guides/agent/agent_component_reference/iteration.mdx). - New UI language: Portuguese. 
-- Allows setting metadata for a specific file in a knowledge base to enhance AI-powered chats. See [here](./guides/dataset/set_metadata.md). +- Allows setting metadata for a specific file in a dataset to enhance AI-powered chats. See [here](./guides/dataset/set_metadata.md). - Upgrades RAGFlow's document engine [Infinity](https://github.com/infiniflow/infinity) to v0.6.0.dev3. - Supports GPU acceleration for DeepDoc (see [docker-compose-gpu.yml](https://github.com/infiniflow/ragflow/blob/main/docker/docker-compose-gpu.yml)). -- Supports creating and referencing a **Tag** knowledge base as a key milestone towards bridging the semantic gap between query and response. +- Supports creating and referencing a **Tag** dataset as a key milestone towards bridging the semantic gap between query and response. :::danger IMPORTANT -The **Tag knowledge base** feature is *unavailable* on the [Infinity](https://github.com/infiniflow/infinity) document engine. +The **Tag dataset** feature is *unavailable* on the [Infinity](https://github.com/infiniflow/infinity) document engine. ::: ### Documentation @@ -415,7 +415,7 @@ Released on December 25, 2024. This release fixes the following issues: - The `SCORE not found` and `position_int` errors returned by [Infinity](https://github.com/infiniflow/infinity). -- Once an embedding model in a specific knowledge base is changed, embedding models in other knowledge bases can no longer be changed. +- Once an embedding model in a specific dataset is changed, embedding models in other datasets can no longer be changed. - Slow response in question-answering and AI search due to repetitive loading of the embedding model. - Fails to parse documents with RAPTOR. - Using the **Table** parsing method results in information loss. @@ -442,7 +442,7 @@ Released on December 18, 2024. ### New features - Introduces additional Agent-specific APIs. 
-- Supports using page rank score to improve retrieval performance when searching across multiple knowledge bases. +- Supports using page rank score to improve retrieval performance when searching across multiple datasets. - Offers an iframe in Chat and Agent to facilitate the integration of RAGFlow into your webpage. - Adds a Helm chart for deploying RAGFlow on Kubernetes. - Supports importing or exporting an agent in JSON format.
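
A note for reviewers of the upgrade FAQ touched in this patch: the data-preserving upgrade path it describes can be sketched as below. The directory under `mktemp -d` is an illustrative stand-in for a real RAGFlow checkout, the version tags are examples taken from the docs, and the `docker compose` commands are shown only as comments; the snippet assumes GNU `sed` and only exercises the `docker/.env` edit.

```shell
#!/bin/sh
# Sketch of a data-preserving upgrade (assumed layout: image tag pinned in docker/.env).
set -eu
demo=$(mktemp -d)          # stand-in for a real RAGFlow checkout
mkdir -p "$demo/docker"
printf 'RAGFLOW_IMAGE=infiniflow/ragflow:v0.20.0\n' > "$demo/docker/.env"

# 1. Point the deployment at the new release tag (GNU sed in-place edit).
sed -i 's|^RAGFLOW_IMAGE=.*|RAGFLOW_IMAGE=infiniflow/ragflow:v0.20.5|' "$demo/docker/.env"

# 2. Recreate the containers WITHOUT -v, so volumes (and your datasets) survive:
#      docker compose -f docker/docker-compose.yml down
#      docker compose -f docker/docker-compose.yml pull
#      docker compose -f docker/docker-compose.yml up -d
grep '^RAGFLOW_IMAGE=' "$demo/docker/.env"   # prints the updated tag
```

The point of the sketch is the FAQ's warning: only `down -v` discards container volumes; a plain `down` followed by `up -d` with a newer `RAGFLOW_IMAGE` keeps uploaded data and dataset settings intact.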