diff --git a/docs/faq.mdx b/docs/faq.mdx
index 48071ba29..cee11799f 100644
--- a/docs/faq.mdx
+++ b/docs/faq.mdx
@@ -510,3 +510,27 @@ See [here](./guides/agent/best_practices/accelerate_agent_question_answering.md)
 
 ---
 
+### How to use MinerU to parse PDF documents?
+
+MinerU PDF document parsing is available starting from v0.21.1. To use this feature, follow these steps:
+
+1. Before deploying ragflow-server, update your **docker/.env** file:
+   - Enable `HF_ENDPOINT=https://hf-mirror.com`
+   - Add a MinerU entry: `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru`
+
+2. Start the ragflow-server and run the following commands inside the container:
+
+```bash
+mkdir uv_tools
+cd uv_tools
+uv venv .venv
+source .venv/bin/activate
+uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
+```
+
+3. Restart the ragflow-server.
+4. In the web UI, navigate to the **Configuration** page of your dataset. Click **Built-in** in the **Ingestion pipeline** section, select a chunking method that supports PDF parsing from the **Built-in** dropdown, and select **MinerU** in **PDF parser**.
+5. If you use a custom ingestion pipeline instead, you must also complete the first three steps before selecting **MinerU** in the **Parsing method** section of the **Parser** component.
+
+
+
diff --git a/docs/guides/dataset/select_pdf_parser.md b/docs/guides/dataset/select_pdf_parser.md
index b9af551c2..1781d7691 100644
--- a/docs/guides/dataset/select_pdf_parser.md
+++ b/docs/guides/dataset/select_pdf_parser.md
@@ -35,8 +35,31 @@ RAGFlow isn't one-size-fits-all. It is built for flexibility and supports deeper
 
    - DeepDoc: (Default) The default visual model performing OCR, TSR, and DLR tasks on PDFs, which can be time-consuming.
    - Naive: Skip OCR, TSR, and DLR tasks if *all* your PDFs are plain text.
+   - MinerU: An experimental feature.
    - A third-party visual model provided by a specific model provider.
 
+:::danger IMPORTANT
+MinerU PDF document parsing is available starting from v0.21.1. To use this feature, follow these steps:
+
+1. Before deploying ragflow-server, update your **docker/.env** file:
+   - Enable `HF_ENDPOINT=https://hf-mirror.com`
+   - Add a MinerU entry: `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru`
+
+2. Start the ragflow-server and run the following commands inside the container:
+
+```bash
+mkdir uv_tools
+cd uv_tools
+uv venv .venv
+source .venv/bin/activate
+uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
+```
+
+3. Restart the ragflow-server.
+4. In the web UI, navigate to the **Configuration** page of your dataset. Click **Built-in** in the **Ingestion pipeline** section, select a chunking method that supports PDF parsing from the **Built-in** dropdown, and select **MinerU** in **PDF parser**.
+5. If you use a custom ingestion pipeline instead, you must also complete the first three steps before selecting **MinerU** in the **Parsing method** section of the **Parser** component.
+:::
+
 :::caution WARNING
 Third-party visual models are marked **Experimental**, because we have not fully tested these models for the aforementioned data extraction tasks.
 :::
diff --git a/docs/release_notes.md b/docs/release_notes.md
index 17688d4eb..38cd81aa2 100644
--- a/docs/release_notes.md
+++ b/docs/release_notes.md
@@ -28,12 +28,12 @@ Released on October 23, 2025.
 
 ### New features
 
-- Experimental: Adds support for PDF document parsing using MinerU.
+- Experimental: Adds support for PDF document parsing using MinerU. See [here](./faq.mdx#how-to-use-mineru-to-parse-pdf-documents).
 
 ### Improvements
 
 - Enhances UI/UX for the dataset and personal center pages.
-- Upgrades RAGFlow's document engine, Infinity, to v0.6.1.
+- Upgrades RAGFlow's document engine, [Infinity](https://github.com/infiniflow/infinity), to v0.6.1.
 
 ### Fixed issues
 
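
Review note: step 1 of the MinerU setup added in this diff depends on two entries being active in **docker/.env**. A minimal sketch of a pre-deployment check follows. The function name `check_mineru_env` is hypothetical and not part of RAGFlow, and it assumes an entry counts as enabled when its line starts with `KEY=` and is not commented out.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RAGFlow): report whether an env file
# contains active, uncommented entries for the two keys named in step 1.
check_mineru_env() {
  local env_file="$1" missing=0 key
  for key in HF_ENDPOINT MINERU_EXECUTABLE; do
    # A line such as "# HF_ENDPOINT=..." does not match, so it is reported.
    if ! grep -Eq "^[[:space:]]*${key}=" "$env_file"; then
      echo "missing or commented out: ${key}"
      missing=1
    fi
  done
  # Exit status 0 means both entries were found; 1 means at least one was not.
  return "$missing"
}
```

One possible use is `check_mineru_env docker/.env && echo "env entries present"` before starting ragflow-server. The check only looks at the env file; it does not confirm that the `mineru` executable has been installed inside the container.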