---
sidebar_position: 30
slug: /parser_component
---

# Parser component

A component that sets the parsing rules for your dataset.


A Parser component is auto-populated on the ingestion pipeline canvas and is required in all ingestion pipeline workflows. Just like the Extract stage in a traditional ETL process, a Parser component in an ingestion pipeline defines how various file types are parsed into structured data. Click the component to open its configuration panel, where you set the parsing rules for various file types.

## Configurations

Within the configuration panel, you can add multiple parsers and set the corresponding parsing rules, or remove unwanted parsers. Ensure your set of parsers covers all required file types; otherwise, an error will occur when you select this ingestion pipeline on your dataset's Files page.

The Parser component supports parsing the following file types:

| File type     | File format              |
| ------------- | ------------------------ |
| PDF           | PDF                      |
| Spreadsheet   | XLSX, XLS, CSV           |
| Image         | PNG, JPG, JPEG, GIF, TIF |
| Email         | EML                      |
| Text & Markup | TXT, MD, MDX, HTML, JSON |
| Word          | DOCX                     |
| PowerPoint    | PPTX, PPT                |
| Audio         | MP3, WAV                 |
| Video         | MP4, AVI, MKV            |

### PDF parser

The output of a PDF parser is json. In the PDF parser, you select the parsing method that works best with your PDFs.

- **DeepDoc**: (Default) A visual model that performs OCR, TSR, and DLR tasks on complex PDFs; accurate, but it can be time-consuming.
- **Naive**: Skips OCR, TSR, and DLR tasks; choose this if all your PDFs are plain text.
- **MinerU**: (Experimental) An open-source tool that converts PDFs into machine-readable formats.
- **Docling**: (Experimental) An open-source document processing tool for gen AI.
- A third-party visual model from a specific model provider.

:::danger IMPORTANT
MinerU PDF document parsing is available starting from v0.22.0. RAGFlow supports MinerU (>= 2.6.3) as an optional PDF parser with multiple backends. RAGFlow acts only as a remote client for MinerU, calling the MinerU API to parse documents, reading the returned output files, and ingesting the parsed content. To use this feature:

1. Prepare a reachable MinerU API service (FastAPI server).
2. Configure RAGFlow with the remote MinerU settings, via environment variables or the UI model provider (see the illustrative snippet following this admonition):
   - `MINERU_APISERVER`: The MinerU API endpoint, for example `http://mineru-host:8886`.
   - `MINERU_BACKEND`: The MinerU backend. Defaults to `pipeline`; also supports `vlm-http-client`, `vlm-transformers`, `vlm-vllm-engine`, `vlm-mlx-engine`, `vlm-vllm-async-engine`, and `vlm-lmdeploy-engine`.
   - `MINERU_SERVER_URL`: (Optional) For `vlm-http-client`, the downstream vLLM HTTP server, for example `http://vllm-host:30000`.
   - `MINERU_OUTPUT_DIR`: (Optional) Local directory to store MinerU API outputs (zip/JSON) before ingestion.
   - `MINERU_DELETE_OUTPUT`: Whether to delete temporary output when a temp dir is used (`1` deletes temp outputs; set `0` to keep them).
3. In the web UI, navigate to the Configuration page of your dataset, click Built-in in the Ingestion pipeline section, select a chunking method that supports PDF parsing from the Built-in dropdown, and select MinerU in PDF parser.
4. If you use a custom ingestion pipeline instead, provide the same MinerU settings and select MinerU in the Parsing method section of the Parser component.
:::
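
For concreteness, the sketch below sets the remote MinerU variables described in step 2. It is only an illustration, written in Python for readability: the variable names come from the list above, while the hosts, ports, and output directory are placeholders. In a real deployment you would normally set these in the service environment (for example, a Docker Compose or `.env` file) before starting RAGFlow.

```python
import os

# Illustrative placeholders only -- substitute your own hosts, ports, and paths.
mineru_settings = {
    "MINERU_APISERVER": "http://mineru-host:8886",  # MinerU FastAPI endpoint
    "MINERU_BACKEND": "pipeline",                   # or one of the vlm-* backends listed above
    "MINERU_SERVER_URL": "http://vllm-host:30000",  # only meaningful for vlm-http-client
    "MINERU_OUTPUT_DIR": "/tmp/mineru_outputs",     # where returned zip/JSON outputs are stored
    "MINERU_DELETE_OUTPUT": "1",                    # 1 = delete temporary outputs, 0 = keep them
}

for name, value in mineru_settings.items():
    os.environ.setdefault(name, value)
```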

:::note
All MinerU environment variables are optional. If set, RAGFlow will auto-provision a MinerU OCR model for the tenant on first use with these values. To avoid auto-provisioning, configure MinerU solely through the UI and leave the environment variables unset.
:::

:::caution WARNING
Third-party visual models are marked Experimental because we have not fully tested these models for the aforementioned data extraction tasks.
:::

### Spreadsheet parser

A spreadsheet parser outputs html, preserving the original layout and table structure. You may remove this parser if your dataset contains no spreadsheets.
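
To make "preserving the original layout and table structure" concrete, here is a minimal, hypothetical sketch of spreadsheet rows rendered as an HTML table. It is not RAGFlow's actual implementation or output markup; it only shows why HTML output can retain row and column structure that plain text would lose.

```python
# Hypothetical illustration: spreadsheet rows rendered as an HTML table.
rows = [
    ["Product", "Q1", "Q2"],
    ["Apples", "120", "135"],
    ["Oranges", "98", "110"],
]

html = "<table>" + "".join(
    "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
    for row in rows
) + "</table>"

print(html)  # one <tr> per row, one <td> per cell
```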

### Image parser

An Image parser uses a native OCR model for text extraction by default. You may select an alternative VLM model, provided that you have properly configured it on the Model provider page.

### Email parser

With the Email parser, you select the fields to parse from emails, such as the subject and body. The parser then extracts text from the specified fields.

### Text & Markup parser

A Text & Markup parser automatically removes all formatting tags (e.g., those from HTML and Markdown files) to output clean, plain text only.
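
As a rough illustration of what "removing formatting tags" means, the sketch below strips HTML tags using Python's standard-library `HTMLParser`. It is not RAGFlow's implementation, just a minimal example of the markup-to-plain-text transformation.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only text content, dropping all markup tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

stripper = TagStripper()
stripper.feed("<h1>Title</h1><p>Some <b>bold</b> text.</p>")
print("".join(stripper.parts))  # -> "TitleSome bold text."
```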

### Word parser

A Word parser outputs json, preserving the original document structure, including titles, paragraphs, tables, headers, and footers.

### PowerPoint (PPT) parser

A PowerPoint parser extracts content from PowerPoint files into json, processing each slide individually and distinguishing between its title, body text, and notes.
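
The exact output schema is not documented on this page, but because each slide is processed individually with its title, body text, and notes kept separate, one parsed slide can be pictured roughly as the hypothetical record below. The field names are assumptions for illustration only, not RAGFlow's documented schema.

```python
# Hypothetical shape of one parsed slide; field names are illustrative only.
slide_record = {
    "slide_index": 1,
    "title": "Quarterly results",
    "body": "Revenue grew 12% quarter over quarter.",
    "notes": "Mention the new APAC accounts when presenting this slide.",
}
print(slide_record)
```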

### Audio parser

An Audio parser transcribes audio files to text. To use this parser, you must first configure an ASR model on the Model provider page.

### Video parser

A Video parser transcribes video files to text. To use this parser, you must first configure a VLM model on the Model provider page.

## Output

The following table lists the global variable names for the Parser component's output, which subsequent components in the ingestion pipeline can reference.

| Variable name | Type            |
| ------------- | --------------- |
| markdown      | string          |
| text          | string          |
| html          | string          |
| json          | `Array<Object>` |