mirror of
https://github.com/infiniflow/ragflow.git
synced 2025-12-08 12:32:30 +08:00
Docs: refine MinerU part in FAQ (#11111)
### What problem does this PR solve? Refine MinerU part in FAQ. ### Type of change - [x] Documentation Update
This commit is contained in:
94
docs/faq.mdx
94
docs/faq.mdx
@ -512,43 +512,85 @@ See [here](./guides/chat/best_practices/accelerate_question_answering.mdx).
|
||||
|
||||
See [here](./guides/agent/best_practices/accelerate_agent_question_answering.md).
|
||||
|
||||
---
|
||||
|
||||
### How to use MinerU to parse PDF documents?
|
||||
|
||||
MinerU PDF document parsing is available starting from v0.21.1. To use this feature, follow these steps:
|
||||
MinerU PDF document parsing is available starting from v0.21.1. RAGFlow supports MinerU (>= 2.6.3) as an optional PDF parser with multiple backends. RAGFlow itself only acts as a client: it calls MinerU to parse documents, reads the output files, and ingests the parsed content into RAGFlow. To use this feature, follow these steps:
|
||||
|
||||
1. Before deploying ragflow-server, update your **docker/.env** file:
|
||||
- Enable `HF_ENDPOINT=https://hf-mirror.com`
|
||||
- Add a MinerU entry: `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru`
|
||||
1. **Prepare MinerU**
|
||||
|
||||
2. Start the ragflow-server and run the following commands inside the container:
|
||||
- **If you run RAGFlow from source**, install MinerU into an isolated virtual environment (recommended path: `$HOME/uv_tools`):
|
||||
|
||||
```bash
|
||||
mkdir uv_tools
|
||||
cd uv_tools
|
||||
uv venv .venv
|
||||
source .venv/bin/activate
|
||||
uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
|
||||
```
|
||||
```bash
|
||||
mkdir -p "$HOME/uv_tools"
|
||||
cd "$HOME/uv_tools"
|
||||
uv venv .venv
|
||||
source .venv/bin/activate
|
||||
uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
|
||||
# or
|
||||
# uv pip install -U "mineru[all]" -i https://mirrors.aliyun.com/pypi/simple
|
||||
```
|
||||
|
||||
3. Restart the ragflow-server.
|
||||
4. In the web UI, navigate to the **Configuration** page of your dataset. Click **Built-in** in the **Ingestion pipeline** section, select a chunking method from the **Built-in** dropdown, which supports PDF parsing, and slect **MinerU** in **PDF parser**.
|
||||
5. If you use a custom ingestion pipeline instead, you must also complete the first three steps before selecting **MinerU** in the **Parsing method** section of the **Parser** component.
|
||||
- **If you run RAGFlow with Docker**, you usually only need to turn on MinerU support in `docker/.env`:
|
||||
|
||||
```bash
|
||||
# docker/.env
|
||||
...
|
||||
USE_MINERU=true
|
||||
...
|
||||
```
|
||||
|
||||
Enabling `USE_MINERU=true` will internally perform the same setup as the manual configuration (including setting the MinerU executable path and related environment variables). You only need the manual installation above if you are running from source or want full control over the MinerU installation.
|
||||
|
||||
2. **Start RAGFlow with MinerU enabled**
|
||||
|
||||
- **Source deployment** – in the RAGFlow repo, export the key MinerU-related variables and start the backend service:
|
||||
|
||||
```bash
|
||||
# in RAGFlow repo
|
||||
export MINERU_EXECUTABLE="$HOME/uv_tools/.venv/bin/mineru"
|
||||
export MINERU_DELETE_OUTPUT=0 # keep output directory
|
||||
export MINERU_BACKEND=pipeline # or another backend you prefer
|
||||
|
||||
source .venv/bin/activate
|
||||
export PYTHONPATH=$(pwd)
|
||||
bash docker/launch_backend_service.sh
|
||||
```
|
||||
|
||||
- **Docker deployment** – after setting `USE_MINERU=true`, restart the containers so that the new settings take effect:
|
||||
|
||||
```bash
|
||||
# in RAGFlow repo
|
||||
docker compose -f docker/docker-compose.yml restart
|
||||
```
|
||||
|
||||
3. In the web UI, navigate to the **Configuration** page of your dataset. Click **Built-in** in the **Ingestion pipeline** section, select a chunking method from the **Built-in** dropdown (which supports PDF parsing), and select **MinerU** in **PDF parser**.
|
||||
4. If you use a custom ingestion pipeline instead, you must also complete the first two steps before selecting **MinerU** in the **Parsing method** section of the **Parser** component.
|
||||
|
||||
---
|
||||
|
||||
### How to configure MinerU-specific settings?
|
||||
|
||||
1. Set `MINERU_EXECUTABLE` (default: `mineru`) to the path to the MinerU executable.
|
||||
2. Set `MINERU_DELETE_OUTPUT` to `0` to keep MinerU's output. (Default: `1`, which deletes temporary output)
|
||||
3. Set `MINERU_OUTPUT_DIR` to specify the output directory for MinerU.
|
||||
The table below summarizes the most commonly used MinerU-related environment variables:
|
||||
|
||||
| Environment variable | Description | Default | Example |
|
||||
| ---------------------- | ---------------------------------- | ----------------------------------- | ----------------------------------------------------------------------------------------------- |
|
||||
| `MINERU_EXECUTABLE` | Path to the local MinerU executable | `mineru` | `MINERU_EXECUTABLE=/home/ragflow/uv_tools/.venv/bin/mineru` |
|
||||
| `MINERU_DELETE_OUTPUT` | Whether to delete MinerU output directory | `1` (do **not** keep the output directory) | `MINERU_DELETE_OUTPUT=0` |
|
||||
| `MINERU_OUTPUT_DIR` | Directory for MinerU output files | System-defined temporary directory | `MINERU_OUTPUT_DIR=/home/ragflow/mineru/output` |
|
||||
| `MINERU_BACKEND` | MinerU parsing backend | `pipeline` | `MINERU_BACKEND=pipeline\|vlm-transformers\|vlm-vllm-engine\|vlm-http-client` |
|
||||
| `MINERU_SERVER_URL` | URL of remote vLLM server (only for `vlm-http-client` backend) | _unset_ | `MINERU_SERVER_URL=http://your-vllm-server-ip:30000` |
|
||||
| `MINERU_APISERVER` | URL of remote MinerU service used as the parser (instead of local MinerU) | _unset_ | `MINERU_APISERVER=http://your-mineru-server:port` |
|
||||
|
||||
1. Set `MINERU_EXECUTABLE` to the path to the MinerU executable if the default `mineru` is not on `PATH`.
|
||||
2. Set `MINERU_DELETE_OUTPUT` to `0` to keep MinerU's output. (Default: `1`, which deletes temporary output.)
|
||||
3. Set `MINERU_OUTPUT_DIR` to specify the output directory for MinerU (otherwise a system temp directory is used).
|
||||
4. Set `MINERU_BACKEND` to specify a parsing backend:
|
||||
- `"pipeline"` (default): The traditional multimodel pipeline.
|
||||
- `"vlm-transformers"`: A vision-language model using HuggingFace Transformers.
|
||||
- `"vlm-vllm-engine"`: A vision-language model using local vLLM engine (requires a local GPU).
|
||||
- `"vlm-http-client"`: A vision-language model via HTTP client to remote vLLM server (RAGFlow only requires CPU).
|
||||
- `"vlm-vllm-engine"`: A vision-language model using a local vLLM engine (requires a local GPU).
|
||||
- `"vlm-http-client"`: A vision-language model via HTTP client to a remote vLLM server (RAGFlow only requires CPU).
|
||||
5. If using the `"vlm-http-client"` backend, you must also set `MINERU_SERVER_URL` to the URL of your vLLM server.
|
||||
6. If you want RAGFlow to call a **remote MinerU service** (instead of a MinerU process running locally with RAGFlow), set `MINERU_APISERVER` to the URL of the remote MinerU server.
|
||||
|
||||
:::tip NOTE
|
||||
For information about other environment variables natively supported by MinerU, see [here](https://opendatalab.github.io/MinerU/usage/cli_tools/#environment-variables-description).
|
||||
@ -561,16 +603,18 @@ For information about other environment variables natively supported by MinerU,
|
||||
RAGFlow supports MinerU's `vlm-http-client` backend, enabling you to delegate document parsing tasks to a remote vLLM server. With this configuration, RAGFlow will connect to your remote vLLM server as a client and use its powerful GPU resources for document parsing. This significantly improves performance for parsing complex documents while reducing the resources required on your RAGFlow server. To configure MinerU with a vLLM server:
|
||||
|
||||
1. Set up a vLLM server running MinerU:
|
||||
|
||||
```bash
|
||||
mineru-vllm-server --port 30000
|
||||
```
|
||||
|
||||
2. Configure the following environment variables in your **docker/.env** file:
|
||||
- `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru` (or the path to your MinerU executable)
|
||||
2. Configure the following environment variables in your **docker/.env** file (or your shell if running from source):
|
||||
|
||||
- `MINERU_EXECUTABLE=/home/ragflow/uv_tools/.venv/bin/mineru` (or the path to your MinerU executable)
|
||||
- `MINERU_BACKEND="vlm-http-client"`
|
||||
- `MINERU_SERVER_URL="http://your-vllm-server-ip:30000"`
|
||||
|
||||
3. Complete the rest standard MinerU setup steps as described [here](#how-to-configure-mineru-specific-settings).
|
||||
3. Complete the rest of the standard MinerU setup steps as described [here](#how-to-configure-mineru-specific-settings).
|
||||
|
||||
:::tip NOTE
|
||||
When using the `vlm-http-client` backend, the RAGFlow server requires no GPU, only network connectivity. This enables cost-effective distributed deployment with multiple RAGFlow instances sharing one remote vLLM server.
|
||||
|
||||
Reference in New Issue
Block a user