diff --git a/docs/faq.mdx b/docs/faq.mdx
index 63c7edc9c..6c1334d13 100644
--- a/docs/faq.mdx
+++ b/docs/faq.mdx
@@ -493,18 +493,35 @@ See [here](./guides/agent/best_practices/accelerate_agent_question_answering.md)
 
 ### How to use MinerU to parse PDF documents?
 
-MinerU PDF document parsing is available starting from v0.22.0. RAGFlow works only as a remote client to MinerU (>= 2.6.3) and does not install or execute MinerU locally. To use this feature:
+From v0.22.0 onwards, RAGFlow includes MinerU (≥ 2.6.3) as an optional PDF parser with multiple backends. Please note that RAGFlow acts only as a *remote client* for MinerU, calling the MinerU API to parse PDFs and reading the returned files. To use this feature:
 
-1. Prepare a reachable MinerU API service (for example, the FastAPI server provided by MinerU).
-2. Configure RAGFlow with remote MinerU settings (environment variables or UI model provider):
-   - `MINERU_APISERVER`: MinerU API endpoint, for example `http://mineru-host:8886`.
-   - `MINERU_BACKEND`: MinerU backend, defaults to `pipeline` (supports `vlm-http-client`, `vlm-transformers`, `vlm-vllm-engine`, `vlm-mlx-engine`, `vlm-vllm-async-engine`, `vlm-lmdeploy-engine`).
-   - `MINERU_SERVER_URL`: (optional) For `vlm-http-client`, the downstream vLLM HTTP server, for example `http://vllm-host:30000`.
-   - `MINERU_OUTPUT_DIR`: (optional) Local directory to store MinerU API outputs (zip/JSON) before ingestion.
-   - `MINERU_DELETE_OUTPUT`: Whether to delete temporary output when a temp dir is used (`1` deletes temp outputs; set `0` to keep).
-3. In the web UI, navigate to the **Configuration** page of your dataset. Click **Built-in** in the **Ingestion pipeline** section, select a chunking method from the **Built-in** dropdown (which supports PDF parsing), and select **MinerU** in **PDF parser**.
-4. If you use a custom ingestion pipeline instead, provide the same MinerU settings and select **MinerU** in the **Parsing method** section of the **Parser** component.
+1. Prepare a reachable MinerU API service (FastAPI server).
+2. In the **.env** file or from the **Model providers** page in the UI, configure RAGFlow as a remote client to MinerU:
+   - `MINERU_APISERVER`: The MinerU API endpoint (e.g., `http://mineru-host:8886`).
+   - `MINERU_BACKEND`: The MinerU backend:
+      - `"pipeline"` (default)
+      - `"vlm-http-client"`
+      - `"vlm-transformers"`
+      - `"vlm-vllm-engine"`
+      - `"vlm-mlx-engine"`
+      - `"vlm-vllm-async-engine"`
+      - `"vlm-lmdeploy-engine"`
+   - `MINERU_SERVER_URL`: (optional) The downstream vLLM HTTP server (e.g., `http://vllm-host:30000`). Applicable when `MINERU_BACKEND` is set to `"vlm-http-client"`.
+   - `MINERU_OUTPUT_DIR`: (optional) The local directory for holding the outputs of the MinerU API service (zip/JSON) before ingestion.
+   - `MINERU_DELETE_OUTPUT`: Whether to delete temporary output when a temporary directory is used:
+      - `1`: Delete.
+      - `0`: Retain.
+3. In the web UI, navigate to your dataset's **Configuration** page and find the **Ingestion pipeline** section:
+   - If you decide to use a chunking method from the **Built-in** dropdown, ensure it supports PDF parsing, then select **MinerU** from the **PDF parser** dropdown.
+   - If you use a custom ingestion pipeline instead, select **MinerU** in the **PDF parser** section of the **Parser** component.
+:::note
+All MinerU environment variables are optional. When set, these values are used to auto-provision a MinerU OCR model for the tenant on first use. To avoid auto-provisioning, skip the environment variable settings and only configure MinerU from the **Model providers** page in the UI.
+:::
+
+:::caution WARNING
+Third-party visual models are marked **Experimental**, because we have not fully tested these models for the aforementioned data extraction tasks.
+:::
 
 ---
 
 ### How to configure MinerU-specific settings?
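Reviewer note: as an illustration of step 2 in the hunk above, the settings might look like this in a `.env` file. This is a minimal sketch, not part of the patch. The host names and ports are the placeholder examples from the docs text, the output directory path is hypothetical, and whether values need quoting depends on how the `.env` file is loaded.

```bash
# Hypothetical .env fragment for using RAGFlow as a remote MinerU client.
# Hosts and ports below are the placeholder examples from the docs, not real endpoints.
MINERU_APISERVER=http://mineru-host:8886
# One of the documented backends; "pipeline" is the documented default.
MINERU_BACKEND=vlm-http-client
# Only applicable when MINERU_BACKEND is vlm-http-client.
MINERU_SERVER_URL=http://vllm-host:30000
# Optional; the path here is a made-up example, so it is left commented out.
# MINERU_OUTPUT_DIR=/path/to/mineru_output
# 1 deletes temporary output when a temporary directory is used; 0 retains it.
MINERU_DELETE_OUTPUT=1
```

Per the note in the hunk, leaving all of these unset and configuring MinerU only from the **Model providers** page avoids auto-provisioning.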
diff --git a/docs/guides/agent/agent_component_reference/code.mdx b/docs/guides/agent/agent_component_reference/code.mdx
index 047b6bdd6..ea4831581 100644
--- a/docs/guides/agent/agent_component_reference/code.mdx
+++ b/docs/guides/agent/agent_component_reference/code.mdx
@@ -24,7 +24,7 @@ We use gVisor to isolate code execution from the host system. Please follow [the
 RAGFlow Sandbox is a secure, pluggable code execution backend. It serves as the code executor for the **Code** component. Please follow the [instructions here](https://github.com/infiniflow/ragflow/tree/main/sandbox) to install RAGFlow Sandbox.
 
 :::note Docker client version
-The executor manager image now bundles Docker CLI `29.1.0` (API 1.44+). Older images shipped Docker 24.x and will fail against newer Docker daemons with `client version 1.43 is too old`. Pull the latest `infiniflow/sandbox-executor-manager:latest` or rebuild `./sandbox/executor_manager` if you encounter this error.
+The executor manager image now bundles Docker CLI `29.1.0` (API 1.44+). Older images shipped Docker 24.x and will fail against newer Docker daemons with `client version 1.43 is too old`. Pull the latest `infiniflow/sandbox-executor-manager:latest` or rebuild it in `./sandbox/executor_manager` if you encounter this error.
 :::
 
 :::tip NOTE
@@ -134,7 +134,7 @@ Your executor manager image includes Docker CLI 24.x (API 1.43), but the host Do
 
 **Solution**
 
-Pull the latest executor manager image or rebuild it locally to upgrade the built-in Docker client:
+Pull the latest executor manager image or rebuild it in `./sandbox/executor_manager` to upgrade the built-in Docker client:
 
 ```bash
 docker pull infiniflow/sandbox-executor-manager:latest
diff --git a/docs/guides/agent/agent_component_reference/parser.md b/docs/guides/agent/agent_component_reference/parser.md
index ec545ba64..0eb0f6bff 100644
--- a/docs/guides/agent/agent_component_reference/parser.md
+++ b/docs/guides/agent/agent_component_reference/parser.md
@@ -40,21 +40,31 @@ The output of a PDF parser is `json`. In the PDF parser, you select the parsing
 - A third-party visual model from a specific model provider.
 
 :::danger IMPORTANT
-MinerU PDF document parsing is available starting from v0.22.0. RAGFlow supports MinerU (>= 2.6.3) as an optional PDF parser with multiple backends. RAGFlow acts only as a **remote client** for MinerU, calling the MinerU API to parse documents, reading the returned output files, and ingesting the parsed content. To use this feature:
+Starting from v0.22.0, RAGFlow includes MinerU (≥ 2.6.3) as an optional PDF parser with multiple backends. Please note that RAGFlow acts only as a *remote client* for MinerU, calling the MinerU API to parse documents and reading the returned files. To use this feature:
 :::
 
 1. Prepare a reachable MinerU API service (FastAPI server).
-2. Configure RAGFlow with the remote MinerU settings (env or UI model provider):
-   - `MINERU_APISERVER`: MinerU API endpoint, for example `http://mineru-host:8886`.
-   - `MINERU_BACKEND`: MinerU backend, defaults to `pipeline` (supports `vlm-http-client`, `vlm-transformers`, `vlm-vllm-engine`, `vlm-mlx-engine`, `vlm-vllm-async-engine`, `vlm-lmdeploy-engine`).
-   - `MINERU_SERVER_URL`: (optional) For `vlm-http-client`, the downstream vLLM HTTP server, for example `http://vllm-host:30000`.
-   - `MINERU_OUTPUT_DIR`: (optional) Local directory to store MinerU API outputs (zip/JSON) before ingestion.
-   - `MINERU_DELETE_OUTPUT`: Whether to delete temporary output when a temp dir is used (`1` deletes temp outputs; set `0` to keep).
-3. In the web UI, navigate to the **Configuration** page of your dataset. Click **Built-in** in the **Ingestion pipeline** section, select a chunking method from the **Built-in** dropdown, which supports PDF parsing, and select **MinerU** in **PDF parser**.
-4. If you use a custom ingestion pipeline instead, provide the same MinerU settings and select **MinerU** in the **Parsing method** section of the **Parser** component.
+2. In the **.env** file or from the **Model providers** page in the UI, configure RAGFlow as a remote client to MinerU:
+   - `MINERU_APISERVER`: The MinerU API endpoint (e.g., `http://mineru-host:8886`).
+   - `MINERU_BACKEND`: The MinerU backend:
+      - `"pipeline"` (default)
+      - `"vlm-http-client"`
+      - `"vlm-transformers"`
+      - `"vlm-vllm-engine"`
+      - `"vlm-mlx-engine"`
+      - `"vlm-vllm-async-engine"`
+      - `"vlm-lmdeploy-engine"`
+   - `MINERU_SERVER_URL`: (optional) The downstream vLLM HTTP server (e.g., `http://vllm-host:30000`). Applicable when `MINERU_BACKEND` is set to `"vlm-http-client"`.
+   - `MINERU_OUTPUT_DIR`: (optional) The local directory for holding the outputs of the MinerU API service (zip/JSON) before ingestion.
+   - `MINERU_DELETE_OUTPUT`: Whether to delete temporary output when a temporary directory is used:
+      - `1`: Delete.
+      - `0`: Retain.
+3. In the web UI, navigate to your dataset's **Configuration** page and find the **Ingestion pipeline** section:
+   - If you decide to use a chunking method from the **Built-in** dropdown, ensure it supports PDF parsing, then select **MinerU** from the **PDF parser** dropdown.
+   - If you use a custom ingestion pipeline instead, select **MinerU** in the **PDF parser** section of the **Parser** component.
 
 :::note
-All MinerU environment variables are optional. If set, RAGFlow will auto-provision a MinerU OCR model for the tenant on first use with these values. To avoid auto-provisioning, configure MinerU solely through the UI and leave the env vars unset.
+All MinerU environment variables are optional. When set, these values are used to auto-provision a MinerU OCR model for the tenant on first use. To avoid auto-provisioning, skip the environment variable settings and only configure MinerU from the **Model providers** page in the UI.
 :::
 
 :::caution WARNING
diff --git a/docs/guides/agent/sandbox_quickstart.md b/docs/guides/agent/sandbox_quickstart.md
index d3a0457be..5baa935a8 100644
--- a/docs/guides/agent/sandbox_quickstart.md
+++ b/docs/guides/agent/sandbox_quickstart.md
@@ -29,7 +29,7 @@ The architecture consists of isolated Docker base images for each supported lang
 - (Optional) GNU Make for simplified command-line management.
 
 :::tip NOTE
-The error message `client version 1.43 is too old. Minimum supported API version is 1.44` indicates that your executor manager image's built-in Docker CLI version is lower than `29.1.0` required by the Docker daemon in use. To solve this issue, pull the latest `infiniflow/sandbox-executor-manager:latest` from Docker Hub (or rebuild `./sandbox/executor_manager`).
+The error message `client version 1.43 is too old. Minimum supported API version is 1.44` indicates that your executor manager image's built-in Docker CLI version is lower than `29.1.0` required by the Docker daemon in use. To solve this issue, pull the latest `infiniflow/sandbox-executor-manager:latest` from Docker Hub or rebuild it in `./sandbox/executor_manager`.
 :::
 
 ## Build Docker base images
diff --git a/docs/guides/dataset/add_data_source/add_google_drive.md b/docs/guides/dataset/add_data_source/add_google_drive.md
index b4fdf14f4..a1f2d895f 100644
--- a/docs/guides/dataset/add_data_source/add_google_drive.md
+++ b/docs/guides/dataset/add_data_source/add_google_drive.md
@@ -45,7 +45,7 @@ Google Cloud external project.
 http://localhost:9380/v1/connector/google-drive/oauth/web/callback
 ```
 
-### If using Docker deployment:
+- If using Docker deployment:
 
 **Authorized JavaScript origin:**
 ```
@@ -53,15 +53,16 @@ http://localhost:80
 ```
 
 ![placeholder-image](https://github.com/infiniflow/ragflow-docs/blob/040e4acd4c1eac6dc73dc44e934a6518de78d097/images/google_drive/image8.png?raw=true)
-### If running from source:
+
+- If running from source:
 **Authorized JavaScript origin:**
 ```
 http://localhost:9222
 ```
 
 ![placeholder-image](https://github.com/infiniflow/ragflow-docs/blob/040e4acd4c1eac6dc73dc44e934a6518de78d097/images/google_drive/image9.png?raw=true)
-5. After saving, click **Download JSON**. This file will later be
-   uploaded into RAGFlow.
+
+5. After saving, click **Download JSON**. This file will later be uploaded into RAGFlow.
 
 ![placeholder-image](https://github.com/infiniflow/ragflow-docs/blob/040e4acd4c1eac6dc73dc44e934a6518de78d097/images/google_drive/image10.png?raw=true)
 
diff --git a/docs/guides/dataset/select_pdf_parser.md b/docs/guides/dataset/select_pdf_parser.md
index 2df825274..3b9bb6132 100644
--- a/docs/guides/dataset/select_pdf_parser.md
+++ b/docs/guides/dataset/select_pdf_parser.md
@@ -40,21 +40,31 @@ RAGFlow isn't one-size-fits-all. It is built for flexibility and supports deeper
 - A third-party visual model from a specific model provider.
 
 :::danger IMPORTANT
-MinerU PDF document parsing is available starting from v0.22.0. RAGFlow supports MinerU (>= 2.6.3) as an optional PDF parser with multiple backends. RAGFlow acts only as a **remote client** for MinerU, calling the MinerU API to parse documents, reading the returned output files, and ingesting the parsed content. To use this feature:
-
-1. Prepare a reachable MinerU API service (FastAPI server).
-2. Configure RAGFlow with the remote MinerU settings (env or UI model provider):
-   - `MINERU_APISERVER`: MinerU API endpoint, for example `http://mineru-host:8886`.
-   - `MINERU_BACKEND`: MinerU backend, defaults to `pipeline` (supports `vlm-http-client`, `vlm-transformers`, `vlm-vllm-engine`, `vlm-mlx-engine`, `vlm-vllm-async-engine`).
-   - `MINERU_SERVER_URL`: (optional) For `vlm-http-client`, the downstream vLLM HTTP server, for example `http://vllm-host:30000`.
-   - `MINERU_OUTPUT_DIR`: (optional) Local directory to store MinerU API outputs (zip/JSON) before ingestion.
-   - `MINERU_DELETE_OUTPUT`: Whether to delete temporary output when a temp dir is used (`1` deletes temp outputs; set `0` to keep).
-3. In the web UI, navigate to the **Configuration** page of your dataset. Click **Built-in** in the **Ingestion pipeline** section, select a chunking method from the **Built-in** dropdown, which supports PDF parsing, and select **MinerU** in **PDF parser**.
-4. If you use a custom ingestion pipeline instead, provide the same MinerU settings and select **MinerU** in the **Parsing method** section of the **Parser** component.
+Starting from v0.22.0, RAGFlow includes MinerU (≥ 2.6.3) as an optional PDF parser with multiple backends. Please note that RAGFlow acts only as a *remote client* for MinerU, calling the MinerU API to parse documents and reading the returned files. To use this feature:
 :::
 
+1. Prepare a reachable MinerU API service (FastAPI server).
+2. In the **.env** file or from the **Model providers** page in the UI, configure RAGFlow as a remote client to MinerU:
+   - `MINERU_APISERVER`: The MinerU API endpoint (e.g., `http://mineru-host:8886`).
+   - `MINERU_BACKEND`: The MinerU backend:
+      - `"pipeline"` (default)
+      - `"vlm-http-client"`
+      - `"vlm-transformers"`
+      - `"vlm-vllm-engine"`
+      - `"vlm-mlx-engine"`
+      - `"vlm-vllm-async-engine"`
+      - `"vlm-lmdeploy-engine"`
+   - `MINERU_SERVER_URL`: (optional) The downstream vLLM HTTP server (e.g., `http://vllm-host:30000`). Applicable when `MINERU_BACKEND` is set to `"vlm-http-client"`.
+   - `MINERU_OUTPUT_DIR`: (optional) The local directory for holding the outputs of the MinerU API service (zip/JSON) before ingestion.
+   - `MINERU_DELETE_OUTPUT`: Whether to delete temporary output when a temporary directory is used:
+      - `1`: Delete.
+      - `0`: Retain.
+3. In the web UI, navigate to your dataset's **Configuration** page and find the **Ingestion pipeline** section:
+   - If you decide to use a chunking method from the **Built-in** dropdown, ensure it supports PDF parsing, then select **MinerU** from the **PDF parser** dropdown.
+   - If you use a custom ingestion pipeline instead, select **MinerU** in the **PDF parser** section of the **Parser** component.
+
 :::note
-All MinerU environment variables are optional. When they are set, RAGFlow will auto-create a MinerU OCR model for a tenant on first use using these values. If you do not want this auto-provisioning, configure MinerU only through the UI and leave the env vars unset.
+All MinerU environment variables are optional. When set, these values are used to auto-provision a MinerU OCR model for the tenant on first use. To avoid auto-provisioning, skip the environment variable settings and only configure MinerU from the **Model providers** page in the UI.
 :::
 
 :::caution WARNING
diff --git a/sandbox/README.md b/sandbox/README.md
index 62bc343fa..a86361872 100644
--- a/sandbox/README.md
+++ b/sandbox/README.md
@@ -36,7 +36,7 @@ A secure, pluggable code execution backend for RAGFlow and beyond.
 
 > ⚠️ **New Docker CLI requirement**
 >
-> If you see `client version 1.43 is too old. Minimum supported API version is 1.44`, pull the latest `infiniflow/sandbox-executor-manager:latest` (rebuilt with Docker CLI `29.1.0`) or rebuild `./sandbox/executor_manager` locally. Older images shipped Docker 24.x, which cannot talk to newer Docker daemons.
+> If you see `client version 1.43 is too old. Minimum supported API version is 1.44`, pull the latest `infiniflow/sandbox-executor-manager:latest` (rebuilt with Docker CLI `29.1.0`) or rebuild it in `./sandbox/executor_manager`. Older images shipped Docker 24.x, which cannot talk to newer Docker daemons.
 
 ### 🐳 Build Docker Base Images
 
@@ -304,7 +304,7 @@ Follow this checklist to troubleshoot:
 
    **Fix:**
 
-   Pull the refreshed image that bundles Docker CLI `29.1.0`, or rebuild locally:
+   Pull the refreshed image that bundles Docker CLI `29.1.0`, or rebuild it in `./sandbox/executor_manager`:
 
    ```bash
    docker pull infiniflow/sandbox-executor-manager:latest