Compare commits

3 Commits

Author SHA1 Message Date
672958a192 Fix: model not authorized (#12001)
### What problem does this PR solve?

Fix model not authorized. #11973.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-17 19:48:24 +08:00
3820de916c Fix: duplicated PDF parser (#12000)
### What problem does this PR solve?

Fix duplicated PDF parser.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-17 19:48:10 +08:00
ef44979b5c Fix table format warning in Markdown file (#12002)
### What problem does this PR solve?

As title

### Type of change

- [x] Documentation Update
- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-17 19:27:47 +08:00
29 changed files with 201 additions and 133 deletions

View File

@ -206,10 +206,10 @@ releases! 🌟
> Note: Prior to `v0.22.0`, we provided both images with embedding models and slim images without embedding models. Details as follows:
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
| ----------------- | --------------- | --------------------- | ------------------------ |
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
> Starting with `v0.22.0`, we ship only the slim edition and no longer append the **-slim** suffix to the image tag.

View File

@ -206,10 +206,10 @@ Coba demo kami di [https://demo.ragflow.io](https://demo.ragflow.io).
> Catatan: Sebelum `v0.22.0`, kami menyediakan image dengan model embedding dan image slim tanpa model embedding. Detailnya sebagai berikut:
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
| ----------------- | --------------- | --------------------- | ------------------------ |
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
> Mulai dari `v0.22.0`, kami hanya menyediakan edisi slim dan tidak lagi menambahkan akhiran **-slim** pada tag image.

View File

@ -186,10 +186,10 @@
> 注意:`v0.22.0` より前のバージョンでは、embedding モデルを含むイメージと、embedding モデルを含まない slim イメージの両方を提供していました。詳細は以下の通りです:
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
| ----------------- | --------------- | --------------------- | ------------------------ |
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
> `v0.22.0` 以降、当プロジェクトでは slim エディションのみを提供し、イメージタグに **-slim** サフィックスを付けなくなりました。

View File

@ -188,10 +188,10 @@
> 참고: `v0.22.0` 이전 버전에서는 embedding 모델이 포함된 이미지와 embedding 모델이 포함되지 않은 slim 이미지를 모두 제공했습니다. 자세한 내용은 다음과 같습니다:
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
| ----------------- | --------------- | --------------------- | ------------------------ |
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
> `v0.22.0`부터는 slim 에디션만 배포하며 이미지 태그에 **-slim** 접미사를 더 이상 붙이지 않습니다.

View File

@ -206,10 +206,10 @@ Experimente nossa demo em [https://demo.ragflow.io](https://demo.ragflow.io).
> Nota: Antes da `v0.22.0`, fornecíamos imagens com modelos de embedding e imagens slim sem modelos de embedding. Detalhes a seguir:
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
| ----------------- | --------------- | --------------------- | ------------------------ |
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
> A partir da `v0.22.0`, distribuímos apenas a edição slim e não adicionamos mais o sufixo **-slim** às tags das imagens.

View File

@ -205,10 +205,10 @@
> 注意:在 `v0.22.0` 之前的版本,我們會同時提供包含 embedding 模型的映像和不含 embedding 模型的 slim 映像。具體如下:
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
| ----------------- | --------------- | --------------------- | ------------------------ |
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
> 從 `v0.22.0` 開始,我們只發佈 slim 版本,並且不再在映像標籤後附加 **-slim** 後綴。

View File

@ -206,10 +206,10 @@
> 注意:在 `v0.22.0` 之前的版本,我们会同时提供包含 embedding 模型的镜像和不含 embedding 模型的 slim 镜像。具体如下:
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
| ----------------- | --------------- | --------------------- | ------------------------ |
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | &approx;9 | ✔️ | Stable release |
| v0.21.1-slim | &approx;2 | ❌ | Stable release |
> 从 `v0.22.0` 开始,我们只发布 slim 版本,并且不再在镜像标签后附加 **-slim** 后缀。

View File

@ -6,8 +6,8 @@ Use this section to tell people about which versions of your project are
currently being supported with security updates.
| Version | Supported |
| ------- | ------------------ |
| <=0.7.0 | :white_check_mark: |
|---------|--------------------|
| <=0.7.0 | :white_check_mark: |
## Reporting a Vulnerability

View File

@ -252,7 +252,6 @@ async def delete_chats(tenant_id):
continue
temp_dict = {"status": StatusEnum.INVALID.value}
success_count += DialogService.update_by_id(id, temp_dict)
print(success_count, "$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$", flush=True)
if errors:
if success_count > 0:

View File

@ -0,0 +1,30 @@
#
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from typing import Any
def normalize_layout_recognizer(layout_recognizer_raw: Any) -> tuple[Any, str | None]:
    parser_model_name: str | None = None
    layout_recognizer = layout_recognizer_raw
    if isinstance(layout_recognizer_raw, str):
        lowered = layout_recognizer_raw.lower()
        if lowered.endswith("@mineru"):
            parser_model_name = layout_recognizer_raw.rsplit("@", 1)[0]
            layout_recognizer = "MinerU"
    return layout_recognizer, parser_model_name
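A short usage sketch of the helper above. The function is restated so the snippet runs on its own (with `typing.Optional`/`Tuple` annotations for older interpreters), and the input values, including the model name `my-model`, are invented for illustration.

```python
from typing import Any, Optional, Tuple


def normalize_layout_recognizer(layout_recognizer_raw: Any) -> Tuple[Any, Optional[str]]:
    """Restatement of the helper above so this snippet is self-contained."""
    parser_model_name: Optional[str] = None
    layout_recognizer = layout_recognizer_raw
    if isinstance(layout_recognizer_raw, str):
        lowered = layout_recognizer_raw.lower()
        # The suffix check is case-insensitive because it runs on the lowered copy.
        if lowered.endswith("@mineru"):
            parser_model_name = layout_recognizer_raw.rsplit("@", 1)[0]
            layout_recognizer = "MinerU"
    return layout_recognizer, parser_model_name


# Hypothetical inputs: a plain recognizer name, two "<model>@MinerU" values, and a bool.
samples = ["DeepDOC", "my-model@MinerU", "my-model@mineru", True]
for raw in samples:
    print(repr(raw), "->", normalize_layout_recognizer(raw))
```

Non-string values such as booleans pass through unchanged, which is why the callers in this diff still run their own `isinstance(layout_recognizer, bool)` check afterwards.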

View File

@ -262,10 +262,8 @@ class MinerUParser(RAGFlowPdfParser):
elif self.mineru_server_url:
data["server_url"] = self.mineru_server_url
print("--------------------------------", flush=True)
print(f"{data=}", flush=True)
print(f"{options=}", flush=True)
print("--------------------------------", flush=True)
self.logger.info(f"[MinerU] request {data=}")
self.logger.info(f"[MinerU] request {options=}")
headers = {"Accept": "application/json"}
try:

View File

@ -13,9 +13,9 @@ The RAGFlow Admin UI is a web-based interface that provides comprehensive system
To access the RAGFlow admin UI, append `/admin` to the web UI's address, e.g. `http://[RAGFLOW_WEB_UI_ADDR]/admin`, replacing `[RAGFLOW_WEB_UI_ADDR]` with the real RAGFlow web UI address.
### Default Credentials
| Username | Password |
|----------|----------|
| `admin@ragflow.io` | `admin` |
| Username | Password |
|--------------------|----------|
| `admin@ragflow.io` | `admin` |
## Admin UI Overview

View File

@ -157,12 +157,12 @@ Optional. Text to display as a diagonal watermark across each page. Useful for m
The **Docs Generator** component provides the following output variables:
| Variable name | Type | Description |
| ------------- | --------- | --------------------------------------------------------------------------- |
| `file_path` | `string` | The server path where the generated document is saved. |
| `pdf_base64` | `string` | The document content encoded in base64 format. |
| `download` | `string` | JSON containing download information for the chat interface. |
| `success` | `boolean` | Indicates whether the document was generated successfully. |
| Variable name | Type | Description |
|---------------|-----------|--------------------------------------------------------------|
| `file_path` | `string` | The server path where the generated document is saved. |
| `pdf_base64` | `string` | The document content encoded in base64 format. |
| `download` | `string` | JSON containing download information for the chat interface. |
| `success` | `boolean` | Indicates whether the document was generated successfully. |
### Displaying the download button
@ -189,15 +189,15 @@ The **Docs Generator** includes intelligent font handling for international cont
### Supported scripts
| Script | Unicode Range | Font Used |
| ------ | ------------- | --------- |
| Chinese (CJK) | U+4E00–U+9FFF | STSong-Light |
| Japanese (Hiragana/Katakana) | U+3040–U+30FF | HeiseiMin-W3 |
| Korean (Hangul) | U+AC00–U+D7AF | HYSMyeongJo-Medium |
| Arabic | U+0600–U+06FF | CID font fallback |
| Hebrew | U+0590–U+05FF | CID font fallback |
| Devanagari (Hindi) | U+0900–U+097F | CID font fallback |
| Thai | U+0E00–U+0E7F | CID font fallback |
| Script | Unicode Range | Font Used |
|------------------------------|---------------|--------------------|
| Chinese (CJK) | U+4E00–U+9FFF | STSong-Light |
| Japanese (Hiragana/Katakana) | U+3040–U+30FF | HeiseiMin-W3 |
| Korean (Hangul) | U+AC00–U+D7AF | HYSMyeongJo-Medium |
| Arabic | U+0600–U+06FF | CID font fallback |
| Hebrew | U+0590–U+05FF | CID font fallback |
| Devanagari (Hindi) | U+0900–U+097F | CID font fallback |
| Thai | U+0E00–U+0E7F | CID font fallback |
### Font installation

View File

@ -18,7 +18,7 @@ Within the configuration panel, you can add multiple parsers and set the corresp
The **Parser** component supports parsing the following file types:
| File type | File format |
| ------------- | ------------------------ |
|---------------|--------------------------|
| PDF | PDF |
| Spreadsheet | XLSX, XLS, CSV |
| Image | PNG, JPG, JPEG, GIF, TIF |
@ -97,9 +97,9 @@ A Video parser transcribes video files to text. To use this parser, you must fir
The global variable names for the output of the **Parser** component, which can be referenced by subsequent components in the ingestion pipeline.
| Variable name | Type |
| ------------- | ------------------------ |
| `markdown` | `string` |
| `text` | `string` |
| `html` | `string` |
| `json` | `Array<Object>` |
| Variable name | Type |
|---------------|-----------------|
| `markdown` | `string` |
| `text` | `string` |
| `html` | `string` |
| `json` | `Array<Object>` |

View File

@ -45,7 +45,7 @@ Click the light bulb icon above the *current* dialogue and scroll down the popup
| Item name | Description |
| ----------------- |-----------------------------------------------------------------------------------------------|
|-------------------|-----------------------------------------------------------------------------------------------|
| Total | Total time spent on this conversation round, including chunk retrieval and answer generation. |
| Check LLM | Time to validate the specified LLM. |
| Create retriever | Time to create a chunk retriever. |

View File

@ -39,20 +39,20 @@ This section covers the following topics:
RAGFlow offers multiple built-in chunking templates to facilitate chunking files of different layouts and ensure semantic integrity. From the **Built-in** chunking method dropdown under **Parse type**, you can choose the default template that suits the layouts and formats of your files. The following table shows the descriptions and the compatible file formats of each supported chunk template:
| **Template** | Description | File format |
|--------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| General | Files are consecutively chunked based on a preset chunk token number. | MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML |
| Q&A | Retrieves relevant information and generates answers to respond to questions. | XLSX, XLS (Excel 97-2003), CSV/TXT |
| Resume | Enterprise edition only. You can also try it out on demo.ragflow.io. | DOCX, PDF, TXT |
| Manual | | PDF |
| Table | The table mode uses TSI technology for efficient data parsing. | XLSX, XLS (Excel 97-2003), CSV/TXT |
| Paper | | PDF |
| Book | | DOCX, PDF, TXT |
| Laws | | DOCX, PDF, TXT |
| Presentation | | PDF, PPTX |
| Picture | | JPEG, JPG, PNG, TIF, GIF |
| One | Each document is chunked in its entirety (as one). | DOCX, XLSX, XLS (Excel 97-2003), PDF, TXT |
| Tag | The dataset functions as a tag set for the others. | XLSX, CSV/TXT |
| **Template** | Description | File format |
|--------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| General | Files are consecutively chunked based on a preset chunk token number. | MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML |
| Q&A | Retrieves relevant information and generates answers to respond to questions. | XLSX, XLS (Excel 97-2003), CSV/TXT |
| Resume | Enterprise edition only. You can also try it out on demo.ragflow.io. | DOCX, PDF, TXT |
| Manual | | PDF |
| Table | The table mode uses TSI technology for efficient data parsing. | XLSX, XLS (Excel 97-2003), CSV/TXT |
| Paper | | PDF |
| Book | | DOCX, PDF, TXT |
| Laws | | DOCX, PDF, TXT |
| Presentation | | PDF, PPTX |
| Picture | | JPEG, JPG, PNG, TIF, GIF |
| One | Each document is chunked in its entirety (as one). | DOCX, XLSX, XLS (Excel 97-2003), PDF, TXT |
| Tag | The dataset functions as a tag set for the others. | XLSX, CSV/TXT |
You can also change a file's chunking method on the **Files** page.

View File

@ -14,7 +14,7 @@ A complete reference for RAGFlow's RESTful API. Before proceeding, please ensure
---
| Code | Message | Description |
| ---- | --------------------- | -------------------------- |
|------|-----------------------|----------------------------|
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |

View File

@ -22,15 +22,15 @@ pip install ragflow-sdk
---
| Code | Message | Description |
|------|----------------------|-----------------------------|
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Resource not found |
| 500 | Internal Server Error| Server internal error |
| 1001 | Invalid Chunk ID | Invalid Chunk ID |
| 1002 | Chunk Update Failed | Chunk update failed |
| Code | Message | Description |
|------|-----------------------|----------------------------|
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Resource not found |
| 500 | Internal Server Error | Server internal error |
| 1001 | Invalid Chunk ID | Invalid Chunk ID |
| 1002 | Chunk Update Failed | Chunk update failed |
---

View File

@ -81,13 +81,13 @@ pip install ragflow-firecrawl-integration
## Configuration Options
| Option | Description | Default | Required |
|--------|-------------|---------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |
| `timeout` | Request timeout (seconds) | 30 | No |
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
| Option | Description | Default | Required |
|--------------------|----------------------------------|-----------------------------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |
| `timeout` | Request timeout (seconds) | 30 | No |
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
## Environment Variables

View File

@ -99,13 +99,13 @@ intergrations/firecrawl/
## 🔧 **Configuration Options**
| Option | Description | Default | Required |
|--------|-------------|---------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |
| `timeout` | Request timeout (seconds) | 30 | No |
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
| Option | Description | Default | Required |
|--------------------|----------------------------------|-----------------------------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |
| `timeout` | Request timeout (seconds) | 30 | No |
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
## 📊 **API Reference**

View File

@ -21,6 +21,7 @@ from io import BytesIO
from deepdoc.parser.utils import get_text
from rag.app import naive
from rag.app.naive import by_plaintext, PARSERS
from common.parser_config_utils import normalize_layout_recognizer
from rag.nlp import bullets_category, is_english,remove_contents_table, \
hierarchical_merge, make_colon_as_title, naive_merge, random_choices, tokenize_table, \
tokenize_chunks, attach_media_context
@ -96,7 +97,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback(0.8, "Finish parsing.")
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
layout_recognizer, parser_model_name = normalize_layout_recognizer(
parser_config.get("layout_recognize", "DeepDOC")
)
if isinstance(layout_recognizer, bool):
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
@ -114,6 +117,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback = callback,
pdf_cls = Pdf,
layout_recognizer = layout_recognizer,
mineru_llm_name=parser_model_name,
**kwargs
)

View File

@ -26,6 +26,7 @@ from rag.nlp import bullets_category, remove_contents_table, \
from rag.nlp import rag_tokenizer, Node
from deepdoc.parser import PdfParser, DocxParser, HtmlParser
from rag.app.naive import by_plaintext, PARSERS
from common.parser_config_utils import normalize_layout_recognizer
@ -155,7 +156,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
return tokenize_chunks(chunks, doc, eng, None)
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
layout_recognizer, parser_model_name = normalize_layout_recognizer(
parser_config.get("layout_recognize", "DeepDOC")
)
if isinstance(layout_recognizer, bool):
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
@ -173,6 +176,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback = callback,
pdf_cls = Pdf,
layout_recognizer = layout_recognizer,
mineru_llm_name=parser_model_name,
**kwargs
)

View File

@ -27,6 +27,7 @@ from deepdoc.parser.figure_parser import vision_figure_parser_pdf_wrapper,vision
from docx import Document
from PIL import Image
from rag.app.naive import by_plaintext, PARSERS
from common.parser_config_utils import normalize_layout_recognizer
class Pdf(PdfParser):
def __init__(self):
@ -196,7 +197,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
# is it English
eng = lang.lower() == "english" # pdf_parser.is_english
if re.search(r"\.pdf$", filename, re.IGNORECASE):
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
layout_recognizer, parser_model_name = normalize_layout_recognizer(
parser_config.get("layout_recognize", "DeepDOC")
)
if isinstance(layout_recognizer, bool):
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
@ -205,6 +208,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
pdf_parser = PARSERS.get(name, by_plaintext)
callback(0.1, "Start to parse.")
kwargs.pop("parse_method", None)
kwargs.pop("mineru_llm_name", None)
sections, tbls, pdf_parser = pdf_parser(
filename = filename,
binary = binary,
@ -214,6 +219,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback = callback,
pdf_cls = Pdf,
layout_recognizer = layout_recognizer,
mineru_llm_name=parser_model_name,
parse_method = "manual",
**kwargs
)
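The `kwargs.pop(...)` calls before this call matter because `parse_method` and `mineru_llm_name` are also passed explicitly. A minimal sketch of the failure they avoid, using a hypothetical stand-in `parse` function rather than the real parser entry points:

```python
def parse(filename, parse_method="raw", mineru_llm_name=None, **kwargs):
    """Hypothetical stand-in for a parser entry point with explicit keyword parameters."""
    return parse_method, mineru_llm_name, kwargs


# Assumed caller state: kwargs already carries a parse_method from upstream.
kwargs = {"parse_method": "raw", "tenant_id": "t1"}

# Passing the same keyword explicitly and through **kwargs raises TypeError.
try:
    parse("a.pdf", parse_method="manual", **kwargs)
    error = None
except TypeError as exc:
    error = str(exc)

# Dropping the duplicates first makes the explicit values the only ones.
kwargs.pop("parse_method", None)
kwargs.pop("mineru_llm_name", None)
result = parse("a.pdf", parse_method="manual", mineru_llm_name="my-model", **kwargs)
print(error)
print(result)
```

Using `pop(key, None)` keeps the call safe when the key is absent, so the same two lines work whether or not upstream code supplied those values.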
@ -232,7 +238,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
poss = pdf_parser.extract_positions(poss)
if poss:
first = poss[0] # tuple: ([pn], x1, x2, y1, y2)
pn = first[0]
pn = first[0]
if isinstance(pn, list) and pn:
pn = pn[0] # [pn] -> pn
poss[0] = (pn, *first[1:])

View File

@ -36,10 +36,11 @@ from deepdoc.parser.figure_parser import VisionFigureParser,vision_figure_parser
from deepdoc.parser.pdf_parser import PlainParser, VisionParser
from deepdoc.parser.docling_parser import DoclingParser
from deepdoc.parser.tcadp_parser import TCADPParser
from common.parser_config_utils import normalize_layout_recognizer
from rag.nlp import concat_img, find_codec, naive_merge, naive_merge_with_images, naive_merge_docx, rag_tokenizer, tokenize_chunks, tokenize_chunks_with_images, tokenize_table, attach_media_context
def by_deepdoc(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
def by_deepdoc(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None, **kwargs):
callback = callback
binary = binary
pdf_parser = pdf_cls() if pdf_cls else Pdf()
@ -56,11 +57,19 @@ def by_deepdoc(filename, binary=None, from_page=0, to_page=100000, lang="Chinese
return sections, tables, pdf_parser
def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
    parse_method = kwargs.get("parse_method", "raw")
    mineru_llm_name = kwargs.get("mineru_llm_name")
    tenant_id = kwargs.get("tenant_id")
def by_mineru(
    filename,
    binary=None,
    from_page=0,
    to_page=100000,
    lang="Chinese",
    callback=None,
    pdf_cls=None,
    parse_method: str = "raw",
    mineru_llm_name: str | None = None,
    tenant_id: str | None = None,
    **kwargs,
):
    pdf_parser = None
    if tenant_id:
        if not mineru_llm_name:
@ -86,7 +95,7 @@ def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese"
callback=callback,
parse_method=parse_method,
lang=lang,
**kwargs
**kwargs,
)
return sections, tables, pdf_parser
except Exception as e:
@ -97,9 +106,7 @@ def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese"
return None, None, None
def by_docling(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
def by_docling(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None, **kwargs):
pdf_parser = DoclingParser()
parse_method = kwargs.get("parse_method", "raw")
@ -118,7 +125,7 @@ def by_docling(filename, binary=None, from_page=0, to_page=100000, lang="Chinese
return sections, tables, pdf_parser
def by_tcadp(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
def by_tcadp(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None, **kwargs):
tcadp_parser = TCADPParser()
if not tcadp_parser.check_installation():
@ -136,10 +143,19 @@ def by_tcadp(filename, binary=None, from_page=0, to_page=100000, lang="Chinese",
def by_plaintext(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs):
    if kwargs.get("layout_recognizer", "") == "Plain Text":
    layout_recognizer = (kwargs.get("layout_recognizer") or "").strip()
    if (not layout_recognizer) or (layout_recognizer == "Plain Text"):
        pdf_parser = PlainParser()
    else:
        vision_model = LLMBundle(kwargs["tenant_id"], LLMType.IMAGE2TEXT, llm_name=kwargs.get("layout_recognizer", ""), lang=kwargs.get("lang", "Chinese"))
        tenant_id = kwargs.get("tenant_id")
        if not tenant_id:
            raise ValueError("tenant_id is required when using vision layout recognizer")
        vision_model = LLMBundle(
            tenant_id,
            LLMType.IMAGE2TEXT,
            llm_name=layout_recognizer,
            lang=kwargs.get("lang", "Chinese"),
        )
        pdf_parser = VisionParser(vision_model=vision_model, **kwargs)
    sections, tables = pdf_parser(
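The selection logic introduced in `by_plaintext` can be sketched in isolation. The function name `choose_pdf_parser` and its string and tuple return values are hypothetical stand-ins for constructing `PlainParser` or `VisionParser`; only the branching mirrors the hunk above.

```python
def choose_pdf_parser(**kwargs):
    """Hypothetical sketch of the branch in by_plaintext; returns labels, not parsers."""
    # None, a missing key, and whitespace-only strings all normalize to "".
    layout_recognizer = (kwargs.get("layout_recognizer") or "").strip()
    if (not layout_recognizer) or (layout_recognizer == "Plain Text"):
        return "plain"
    # Any other value is treated as a vision model name, which needs a tenant.
    tenant_id = kwargs.get("tenant_id")
    if not tenant_id:
        raise ValueError("tenant_id is required when using vision layout recognizer")
    return ("vision", tenant_id, layout_recognizer)


print(choose_pdf_parser())
print(choose_pdf_parser(layout_recognizer="  "))
print(choose_pdf_parser(layout_recognizer="some-vision-model", tenant_id="t1"))
```

Compared with the earlier `kwargs["tenant_id"]` lookup, a missing tenant now surfaces as a `ValueError` with an explicit message instead of a bare `KeyError`.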
@ -716,14 +732,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", ca
return res
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
layout_recognizer_raw = parser_config.get("layout_recognize", "DeepDOC")
parser_model_name = None
layout_recognizer = layout_recognizer_raw
if isinstance(layout_recognizer_raw, str):
lowered = layout_recognizer_raw.lower()
if lowered.endswith("@mineru"):
parser_model_name = layout_recognizer_raw.split("@", 1)[0]
layout_recognizer = "MinerU"
layout_recognizer, parser_model_name = normalize_layout_recognizer(
parser_config.get("layout_recognize", "DeepDOC")
)
if parser_config.get("analyze_hyperlink", False) and is_root:
urls = extract_links_from_pdf(binary)
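The inline code removed in this hunk used `split("@", 1)[0]`, while the shared helper uses `rsplit("@", 1)[0]`. The two differ only when the value contains more than one `@`. A small comparison, with an invented model name (whether such names occur in practice is an assumption):

```python
def name_via_split(raw: str) -> str:
    """Old inline behavior: keep everything before the first '@'."""
    return raw.split("@", 1)[0]


def name_via_rsplit(raw: str) -> str:
    """Helper behavior: keep everything before the last '@'."""
    return raw.rsplit("@", 1)[0]


# Hypothetical value whose model part itself contains an '@'.
raw = "vendor@my-model@MinerU"
print(name_via_split(raw))   # vendor
print(name_via_rsplit(raw))  # vendor@my-model

# With a single '@' both approaches agree.
print(name_via_split("my-model@MinerU") == name_via_rsplit("my-model@MinerU"))  # True
```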

View File

@ -24,6 +24,7 @@ from rag.nlp import rag_tokenizer, tokenize
from deepdoc.parser import PdfParser, ExcelParser, HtmlParser
from deepdoc.parser.figure_parser import vision_figure_parser_docx_wrapper
from rag.app.naive import by_plaintext, PARSERS
from common.parser_config_utils import normalize_layout_recognizer
class Pdf(PdfParser):
def __call__(self, filename, binary=None, from_page=0,
@ -82,7 +83,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback(0.8, "Finish parsing.")
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
layout_recognizer, parser_model_name = normalize_layout_recognizer(
parser_config.get("layout_recognize", "DeepDOC")
)
if isinstance(layout_recognizer, bool):
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
@ -100,6 +103,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback = callback,
pdf_cls = Pdf,
layout_recognizer = layout_recognizer,
mineru_llm_name=parser_model_name,
**kwargs
)

View File

@ -24,6 +24,7 @@ from rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bull
from deepdoc.parser import PdfParser
import numpy as np
from rag.app.naive import by_plaintext, PARSERS
from common.parser_config_utils import normalize_layout_recognizer
class Pdf(PdfParser):
@ -149,7 +150,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
"parser_config", {
"chunk_token_num": 512, "delimiter": "\n!?。;!?", "layout_recognize": "DeepDOC"})
if re.search(r"\.pdf$", filename, re.IGNORECASE):
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
layout_recognizer, parser_model_name = normalize_layout_recognizer(
parser_config.get("layout_recognize", "DeepDOC")
)
if isinstance(layout_recognizer, bool):
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
@ -163,6 +166,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
paper = pdf_parser(filename if not binary else binary,
from_page=from_page, to_page=to_page, callback=callback)
else:
kwargs.pop("parse_method", None)
kwargs.pop("mineru_llm_name", None)
sections, tables, pdf_parser = pdf_parser(
filename=filename,
binary=binary,
@ -171,6 +176,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
lang=lang,
callback=callback,
pdf_cls=Pdf,
layout_recognizer=layout_recognizer,
mineru_llm_name=parser_model_name,
parse_method="paper",
**kwargs
)

View File

@ -24,6 +24,7 @@ from PyPDF2 import PdfReader as pdf2_read
from deepdoc.parser import PdfParser, PptParser, PlainParser
from rag.app.naive import by_plaintext, PARSERS
from common.parser_config_utils import normalize_layout_recognizer
from rag.nlp import rag_tokenizer
from rag.nlp import tokenize, is_english
@ -195,7 +196,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
res.append(d)
return res
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
layout_recognizer, parser_model_name = normalize_layout_recognizer(
parser_config.get("layout_recognize", "DeepDOC")
)
if isinstance(layout_recognizer, bool):
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
@ -213,6 +216,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
callback=callback,
pdf_cls=Pdf,
layout_recognizer=layout_recognizer,
mineru_llm_name=parser_model_name,
**kwargs
)

View File

@ -121,7 +121,7 @@ make logs # With Make
### 🧰 Makefile Toolbox
| Command | Description |
| ----------------- | ------------------------------------------------ |
|-------------------|--------------------------------------------------|
| `make` | Setup, build, launch and test all at once |
| `make setup` | Initialize environment and install uv |
| `make ensure_env` | Auto-create `.env` if missing |
@ -183,7 +183,7 @@ This security model strikes a balance between **robust isolation** and **develop
Currently, the following languages are officially supported:
| Language | Priority |
| -------- | -------- |
|----------|----------|
| Python | High |
| Node.js | Medium |

View File

@ -42,6 +42,7 @@ import { ExcelToHtmlFormField } from '../excel-to-html-form-field';
import { FormContainer } from '../form-container';
import { LayoutRecognizeFormField } from '../layout-recognize-form-field';
import { MaxTokenNumberFormField } from '../max-token-number-from-field';
import { MinerUOptionsFormField } from '../mineru-options-form-field';
import { ButtonLoading } from '../ui/button';
import { Input } from '../ui/input';
import { DynamicPageRange } from './dynamic-page-range';
@ -335,7 +336,10 @@ export function ChunkMethodDialog({
className="space-y-3"
>
{showOne && (
<LayoutRecognizeFormField showMineruOptions={false} />
<>
<LayoutRecognizeFormField showMineruOptions={false} />
{isMineruSelected && <MinerUOptionsFormField />}
</>
)}
{showMaxTokenNumber && (
<>
@ -359,9 +363,6 @@ export function ChunkMethodDialog({
}
className="space-y-3"
>
{isMineruSelected && (
<LayoutRecognizeFormField showMineruOptions />
)}
{selectedTag === DocumentParserType.Naive && (
<EnableTocToggle />
)}