mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-01-04 03:25:30 +08:00
Compare commits
3 Commits
d38f8a1562
...
672958a192
| Author | SHA1 | Date | |
|---|---|---|---|
| 672958a192 | |||
| 3820de916c | |||
| ef44979b5c |
@ -206,10 +206,10 @@ releases! 🌟
|
||||
|
||||
> Note: Prior to `v0.22.0`, we provided both images with embedding models and slim images without embedding models. Details as follows:
|
||||
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
| ----------------- | --------------- | --------------------- | ------------------------ |
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
|-------------------|-----------------|-----------------------|----------------|
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
|
||||
> Starting with `v0.22.0`, we ship only the slim edition and no longer append the **-slim** suffix to the image tag.
|
||||
|
||||
|
||||
@ -206,10 +206,10 @@ Coba demo kami di [https://demo.ragflow.io](https://demo.ragflow.io).
|
||||
|
||||
> Catatan: Sebelum `v0.22.0`, kami menyediakan image dengan model embedding dan image slim tanpa model embedding. Detailnya sebagai berikut:
|
||||
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
| ----------------- | --------------- | --------------------- | ------------------------ |
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
|-------------------|-----------------|-----------------------|----------------|
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
|
||||
> Mulai dari `v0.22.0`, kami hanya menyediakan edisi slim dan tidak lagi menambahkan akhiran **-slim** pada tag image.
|
||||
|
||||
|
||||
@ -186,10 +186,10 @@
|
||||
|
||||
> 注意:`v0.22.0` より前のバージョンでは、embedding モデルを含むイメージと、embedding モデルを含まない slim イメージの両方を提供していました。詳細は以下の通りです:
|
||||
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
| ----------------- | --------------- | --------------------- | ------------------------ |
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
|-------------------|-----------------|-----------------------|----------------|
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
|
||||
> `v0.22.0` 以降、当プロジェクトでは slim エディションのみを提供し、イメージタグに **-slim** サフィックスを付けなくなりました。
|
||||
|
||||
|
||||
@ -188,10 +188,10 @@
|
||||
|
||||
> 참고: `v0.22.0` 이전 버전에서는 embedding 모델이 포함된 이미지와 embedding 모델이 포함되지 않은 slim 이미지를 모두 제공했습니다. 자세한 내용은 다음과 같습니다:
|
||||
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
| ----------------- | --------------- | --------------------- | ------------------------ |
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
|-------------------|-----------------|-----------------------|----------------|
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
|
||||
> `v0.22.0`부터는 slim 에디션만 배포하며 이미지 태그에 **-slim** 접미사를 더 이상 붙이지 않습니다.
|
||||
|
||||
|
||||
@ -206,10 +206,10 @@ Experimente nossa demo em [https://demo.ragflow.io](https://demo.ragflow.io).
|
||||
|
||||
> Nota: Antes da `v0.22.0`, fornecíamos imagens com modelos de embedding e imagens slim sem modelos de embedding. Detalhes a seguir:
|
||||
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
| ----------------- | --------------- | --------------------- | ------------------------ |
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
|-------------------|-----------------|-----------------------|----------------|
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
|
||||
> A partir da `v0.22.0`, distribuímos apenas a edição slim e não adicionamos mais o sufixo **-slim** às tags das imagens.
|
||||
|
||||
|
||||
@ -205,10 +205,10 @@
|
||||
|
||||
> 注意:在 `v0.22.0` 之前的版本,我們會同時提供包含 embedding 模型的映像和不含 embedding 模型的 slim 映像。具體如下:
|
||||
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
| ----------------- | --------------- | --------------------- | ------------------------ |
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
|-------------------|-----------------|-----------------------|----------------|
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
|
||||
> 從 `v0.22.0` 開始,我們只發佈 slim 版本,並且不再在映像標籤後附加 **-slim** 後綴。
|
||||
|
||||
|
||||
@ -206,10 +206,10 @@
|
||||
|
||||
> 注意:在 `v0.22.0` 之前的版本,我们会同时提供包含 embedding 模型的镜像和不含 embedding 模型的 slim 镜像。具体如下:
|
||||
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
| ----------------- | --------------- | --------------------- | ------------------------ |
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
|
||||
|-------------------|-----------------|-----------------------|----------------|
|
||||
| v0.21.1 | ≈9 | ✔️ | Stable release |
|
||||
| v0.21.1-slim | ≈2 | ❌ | Stable release |
|
||||
|
||||
> 从 `v0.22.0` 开始,我们只发布 slim 版本,并且不再在镜像标签后附加 **-slim** 后缀。
|
||||
|
||||
|
||||
@ -6,8 +6,8 @@ Use this section to tell people about which versions of your project are
|
||||
currently being supported with security updates.
|
||||
|
||||
| Version | Supported |
|
||||
| ------- | ------------------ |
|
||||
| <=0.7.0 | :white_check_mark: |
|
||||
|---------|--------------------|
|
||||
| <=0.7.0 | :white_check_mark: |
|
||||
|
||||
## Reporting a Vulnerability
|
||||
|
||||
|
||||
@ -252,7 +252,6 @@ async def delete_chats(tenant_id):
|
||||
continue
|
||||
temp_dict = {"status": StatusEnum.INVALID.value}
|
||||
success_count += DialogService.update_by_id(id, temp_dict)
|
||||
print(success_count, "$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$", flush=True)
|
||||
|
||||
if errors:
|
||||
if success_count > 0:
|
||||
|
||||
30
common/parser_config_utils.py
Normal file
30
common/parser_config_utils.py
Normal file
@ -0,0 +1,30 @@
|
||||
#
|
||||
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
|
||||
from typing import Any
|
||||
|
||||
|
||||
def normalize_layout_recognizer(layout_recognizer_raw: Any) -> tuple[Any, str | None]:
|
||||
parser_model_name: str | None = None
|
||||
layout_recognizer = layout_recognizer_raw
|
||||
|
||||
if isinstance(layout_recognizer_raw, str):
|
||||
lowered = layout_recognizer_raw.lower()
|
||||
if lowered.endswith("@mineru"):
|
||||
parser_model_name = layout_recognizer_raw.rsplit("@", 1)[0]
|
||||
layout_recognizer = "MinerU"
|
||||
|
||||
return layout_recognizer, parser_model_name
|
||||
@ -262,10 +262,8 @@ class MinerUParser(RAGFlowPdfParser):
|
||||
elif self.mineru_server_url:
|
||||
data["server_url"] = self.mineru_server_url
|
||||
|
||||
print("--------------------------------", flush=True)
|
||||
print(f"{data=}", flush=True)
|
||||
print(f"{options=}", flush=True)
|
||||
print("--------------------------------", flush=True)
|
||||
self.logger.info(f"[MinerU] request {data=}")
|
||||
self.logger.info(f"[MinerU] request {options=}")
|
||||
|
||||
headers = {"Accept": "application/json"}
|
||||
try:
|
||||
|
||||
@ -13,9 +13,9 @@ The RAGFlow Admin UI is a web-based interface that provides comprehensive system
|
||||
To access the RAGFlow admin UI, append `/admin` to the web UI's address, e.g. `http://[RAGFLOW_WEB_UI_ADDR]/admin`, replace `[RAGFLOW_WEB_UI_ADDR]` with real RAGFlow web UI address.
|
||||
|
||||
### Default Credentials
|
||||
| Username | Password |
|
||||
|----------|----------|
|
||||
| `admin@ragflow.io` | `admin` |
|
||||
| Username | Password |
|
||||
|--------------------|----------|
|
||||
| `admin@ragflow.io` | `admin` |
|
||||
|
||||
## Admin UI Overview
|
||||
|
||||
|
||||
@ -157,12 +157,12 @@ Optional. Text to display as a diagonal watermark across each page. Useful for m
|
||||
|
||||
The **Docs Generator** component provides the following output variables:
|
||||
|
||||
| Variable name | Type | Description |
|
||||
| ------------- | --------- | --------------------------------------------------------------------------- |
|
||||
| `file_path` | `string` | The server path where the generated document is saved. |
|
||||
| `pdf_base64` | `string` | The document content encoded in base64 format. |
|
||||
| `download` | `string` | JSON containing download information for the chat interface. |
|
||||
| `success` | `boolean` | Indicates whether the document was generated successfully. |
|
||||
| Variable name | Type | Description |
|
||||
|---------------|-----------|--------------------------------------------------------------|
|
||||
| `file_path` | `string` | The server path where the generated document is saved. |
|
||||
| `pdf_base64` | `string` | The document content encoded in base64 format. |
|
||||
| `download` | `string` | JSON containing download information for the chat interface. |
|
||||
| `success` | `boolean` | Indicates whether the document was generated successfully. |
|
||||
|
||||
### Displaying the download button
|
||||
|
||||
@ -189,15 +189,15 @@ The **Docs Generator** includes intelligent font handling for international cont
|
||||
|
||||
### Supported scripts
|
||||
|
||||
| Script | Unicode Range | Font Used |
|
||||
| ------ | ------------- | --------- |
|
||||
| Chinese (CJK) | U+4E00–U+9FFF | STSong-Light |
|
||||
| Japanese (Hiragana/Katakana) | U+3040–U+30FF | HeiseiMin-W3 |
|
||||
| Korean (Hangul) | U+AC00–U+D7AF | HYSMyeongJo-Medium |
|
||||
| Arabic | U+0600–U+06FF | CID font fallback |
|
||||
| Hebrew | U+0590–U+05FF | CID font fallback |
|
||||
| Devanagari (Hindi) | U+0900–U+097F | CID font fallback |
|
||||
| Thai | U+0E00–U+0E7F | CID font fallback |
|
||||
| Script | Unicode Range | Font Used |
|
||||
|------------------------------|---------------|--------------------|
|
||||
| Chinese (CJK) | U+4E00–U+9FFF | STSong-Light |
|
||||
| Japanese (Hiragana/Katakana) | U+3040–U+30FF | HeiseiMin-W3 |
|
||||
| Korean (Hangul) | U+AC00–U+D7AF | HYSMyeongJo-Medium |
|
||||
| Arabic | U+0600–U+06FF | CID font fallback |
|
||||
| Hebrew | U+0590–U+05FF | CID font fallback |
|
||||
| Devanagari (Hindi) | U+0900–U+097F | CID font fallback |
|
||||
| Thai | U+0E00–U+0E7F | CID font fallback |
|
||||
|
||||
### Font installation
|
||||
|
||||
|
||||
@ -18,7 +18,7 @@ Within the configuration panel, you can add multiple parsers and set the corresp
|
||||
The **Parser** component supports parsing the following file types:
|
||||
|
||||
| File type | File format |
|
||||
| ------------- | ------------------------ |
|
||||
|---------------|--------------------------|
|
||||
| PDF | PDF |
|
||||
| Spreadsheet | XLSX, XLS, CSV |
|
||||
| Image | PNG, JPG, JPEG, GIF, TIF |
|
||||
@ -97,9 +97,9 @@ A Video parser transcribes video files to text. To use this parser, you must fir
|
||||
|
||||
The global variable names for the output of the **Parser** component, which can be referenced by subsequent components in the ingestion pipeline.
|
||||
|
||||
| Variable name | Type |
|
||||
| ------------- | ------------------------ |
|
||||
| `markdown` | `string` |
|
||||
| `text` | `string` |
|
||||
| `html` | `string` |
|
||||
| `json` | `Array<Object>` |
|
||||
| Variable name | Type |
|
||||
|---------------|-----------------|
|
||||
| `markdown` | `string` |
|
||||
| `text` | `string` |
|
||||
| `html` | `string` |
|
||||
| `json` | `Array<Object>` |
|
||||
|
||||
@ -45,7 +45,7 @@ Click the light bulb icon above the *current* dialogue and scroll down the popup
|
||||
|
||||
|
||||
| Item name | Description |
|
||||
| ----------------- |-----------------------------------------------------------------------------------------------|
|
||||
|-------------------|-----------------------------------------------------------------------------------------------|
|
||||
| Total | Total time spent on this conversation round, including chunk retrieval and answer generation. |
|
||||
| Check LLM | Time to validate the specified LLM. |
|
||||
| Create retriever | Time to create a chunk retriever. |
|
||||
|
||||
@ -39,20 +39,20 @@ This section covers the following topics:
|
||||
|
||||
RAGFlow offers multiple built-in chunking template to facilitate chunking files of different layouts and ensure semantic integrity. From the **Built-in** chunking method dropdown under **Parse type**, you can choose the default template that suits the layouts and formats of your files. The following table shows the descriptions and the compatible file formats of each supported chunk template:
|
||||
|
||||
| **Template** | Description | File format |
|
||||
|--------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
|
||||
| General | Files are consecutively chunked based on a preset chunk token number. | MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML |
|
||||
| Q&A | Retrieves relevant information and generates answers to respond to questions. | XLSX, XLS (Excel 97-2003), CSV/TXT |
|
||||
| Resume | Enterprise edition only. You can also try it out on demo.ragflow.io. | DOCX, PDF, TXT |
|
||||
| Manual | | PDF |
|
||||
| Table | The table mode uses TSI technology for efficient data parsing. | XLSX, XLS (Excel 97-2003), CSV/TXT |
|
||||
| Paper | | PDF |
|
||||
| Book | | DOCX, PDF, TXT |
|
||||
| Laws | | DOCX, PDF, TXT |
|
||||
| Presentation | | PDF, PPTX |
|
||||
| Picture | | JPEG, JPG, PNG, TIF, GIF |
|
||||
| One | Each document is chunked in its entirety (as one). | DOCX, XLSX, XLS (Excel 97-2003), PDF, TXT |
|
||||
| Tag | The dataset functions as a tag set for the others. | XLSX, CSV/TXT |
|
||||
| **Template** | Description | File format |
|
||||
|--------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
|
||||
| General | Files are consecutively chunked based on a preset chunk token number. | MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML |
|
||||
| Q&A | Retrieves relevant information and generates answers to respond to questions. | XLSX, XLS (Excel 97-2003), CSV/TXT |
|
||||
| Resume | Enterprise edition only. You can also try it out on demo.ragflow.io. | DOCX, PDF, TXT |
|
||||
| Manual | | PDF |
|
||||
| Table | The table mode uses TSI technology for efficient data parsing. | XLSX, XLS (Excel 97-2003), CSV/TXT |
|
||||
| Paper | | PDF |
|
||||
| Book | | DOCX, PDF, TXT |
|
||||
| Laws | | DOCX, PDF, TXT |
|
||||
| Presentation | | PDF, PPTX |
|
||||
| Picture | | JPEG, JPG, PNG, TIF, GIF |
|
||||
| One | Each document is chunked in its entirety (as one). | DOCX, XLSX, XLS (Excel 97-2003), PDF, TXT |
|
||||
| Tag | The dataset functions as a tag set for the others. | XLSX, CSV/TXT |
|
||||
|
||||
You can also change a file's chunking method on the **Files** page.
|
||||
|
||||
|
||||
@ -14,7 +14,7 @@ A complete reference for RAGFlow's RESTful API. Before proceeding, please ensure
|
||||
---
|
||||
|
||||
| Code | Message | Description |
|
||||
| ---- | --------------------- | -------------------------- |
|
||||
|------|-----------------------|----------------------------|
|
||||
| 400 | Bad Request | Invalid request parameters |
|
||||
| 401 | Unauthorized | Unauthorized access |
|
||||
| 403 | Forbidden | Access denied |
|
||||
|
||||
@ -22,15 +22,15 @@ pip install ragflow-sdk
|
||||
|
||||
---
|
||||
|
||||
| Code | Message | Description |
|
||||
|------|----------------------|-----------------------------|
|
||||
| 400 | Bad Request | Invalid request parameters |
|
||||
| 401 | Unauthorized | Unauthorized access |
|
||||
| 403 | Forbidden | Access denied |
|
||||
| 404 | Not Found | Resource not found |
|
||||
| 500 | Internal Server Error| Server internal error |
|
||||
| 1001 | Invalid Chunk ID | Invalid Chunk ID |
|
||||
| 1002 | Chunk Update Failed | Chunk update failed |
|
||||
| Code | Message | Description |
|
||||
|------|-----------------------|----------------------------|
|
||||
| 400 | Bad Request | Invalid request parameters |
|
||||
| 401 | Unauthorized | Unauthorized access |
|
||||
| 403 | Forbidden | Access denied |
|
||||
| 404 | Not Found | Resource not found |
|
||||
| 500 | Internal Server Error | Server internal error |
|
||||
| 1001 | Invalid Chunk ID | Invalid Chunk ID |
|
||||
| 1002 | Chunk Update Failed | Chunk update failed |
|
||||
|
||||
---
|
||||
|
||||
|
||||
@ -81,13 +81,13 @@ pip install ragflow-firecrawl-integration
|
||||
|
||||
## Configuration Options
|
||||
|
||||
| Option | Description | Default | Required |
|
||||
|--------|-------------|---------|----------|
|
||||
| `api_key` | Your Firecrawl API key | - | Yes |
|
||||
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
|
||||
| `max_retries` | Maximum retry attempts | 3 | No |
|
||||
| `timeout` | Request timeout (seconds) | 30 | No |
|
||||
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
|
||||
| Option | Description | Default | Required |
|
||||
|--------------------|----------------------------------|-----------------------------|----------|
|
||||
| `api_key` | Your Firecrawl API key | - | Yes |
|
||||
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
|
||||
| `max_retries` | Maximum retry attempts | 3 | No |
|
||||
| `timeout` | Request timeout (seconds) | 30 | No |
|
||||
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
|
||||
|
||||
## Environment Variables
|
||||
|
||||
|
||||
@ -99,13 +99,13 @@ intergrations/firecrawl/
|
||||
|
||||
## 🔧 **Configuration Options**
|
||||
|
||||
| Option | Description | Default | Required |
|
||||
|--------|-------------|---------|----------|
|
||||
| `api_key` | Your Firecrawl API key | - | Yes |
|
||||
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
|
||||
| `max_retries` | Maximum retry attempts | 3 | No |
|
||||
| `timeout` | Request timeout (seconds) | 30 | No |
|
||||
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
|
||||
| Option | Description | Default | Required |
|
||||
|--------------------|----------------------------------|-----------------------------|----------|
|
||||
| `api_key` | Your Firecrawl API key | - | Yes |
|
||||
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
|
||||
| `max_retries` | Maximum retry attempts | 3 | No |
|
||||
| `timeout` | Request timeout (seconds) | 30 | No |
|
||||
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
|
||||
|
||||
## 📊 **API Reference**
|
||||
|
||||
|
||||
@ -21,6 +21,7 @@ from io import BytesIO
|
||||
from deepdoc.parser.utils import get_text
|
||||
from rag.app import naive
|
||||
from rag.app.naive import by_plaintext, PARSERS
|
||||
from common.parser_config_utils import normalize_layout_recognizer
|
||||
from rag.nlp import bullets_category, is_english,remove_contents_table, \
|
||||
hierarchical_merge, make_colon_as_title, naive_merge, random_choices, tokenize_table, \
|
||||
tokenize_chunks, attach_media_context
|
||||
@ -96,7 +97,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
callback(0.8, "Finish parsing.")
|
||||
|
||||
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
|
||||
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
|
||||
layout_recognizer, parser_model_name = normalize_layout_recognizer(
|
||||
parser_config.get("layout_recognize", "DeepDOC")
|
||||
)
|
||||
|
||||
if isinstance(layout_recognizer, bool):
|
||||
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
|
||||
@ -114,6 +117,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
callback = callback,
|
||||
pdf_cls = Pdf,
|
||||
layout_recognizer = layout_recognizer,
|
||||
mineru_llm_name=parser_model_name,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
|
||||
@ -26,6 +26,7 @@ from rag.nlp import bullets_category, remove_contents_table, \
|
||||
from rag.nlp import rag_tokenizer, Node
|
||||
from deepdoc.parser import PdfParser, DocxParser, HtmlParser
|
||||
from rag.app.naive import by_plaintext, PARSERS
|
||||
from common.parser_config_utils import normalize_layout_recognizer
|
||||
|
||||
|
||||
|
||||
@ -155,7 +156,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
return tokenize_chunks(chunks, doc, eng, None)
|
||||
|
||||
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
|
||||
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
|
||||
layout_recognizer, parser_model_name = normalize_layout_recognizer(
|
||||
parser_config.get("layout_recognize", "DeepDOC")
|
||||
)
|
||||
|
||||
if isinstance(layout_recognizer, bool):
|
||||
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
|
||||
@ -173,6 +176,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
callback = callback,
|
||||
pdf_cls = Pdf,
|
||||
layout_recognizer = layout_recognizer,
|
||||
mineru_llm_name=parser_model_name,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
|
||||
@ -27,6 +27,7 @@ from deepdoc.parser.figure_parser import vision_figure_parser_pdf_wrapper,vision
|
||||
from docx import Document
|
||||
from PIL import Image
|
||||
from rag.app.naive import by_plaintext, PARSERS
|
||||
from common.parser_config_utils import normalize_layout_recognizer
|
||||
|
||||
class Pdf(PdfParser):
|
||||
def __init__(self):
|
||||
@ -196,7 +197,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
# is it English
|
||||
eng = lang.lower() == "english" # pdf_parser.is_english
|
||||
if re.search(r"\.pdf$", filename, re.IGNORECASE):
|
||||
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
|
||||
layout_recognizer, parser_model_name = normalize_layout_recognizer(
|
||||
parser_config.get("layout_recognize", "DeepDOC")
|
||||
)
|
||||
|
||||
if isinstance(layout_recognizer, bool):
|
||||
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
|
||||
@ -205,6 +208,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
pdf_parser = PARSERS.get(name, by_plaintext)
|
||||
callback(0.1, "Start to parse.")
|
||||
|
||||
kwargs.pop("parse_method", None)
|
||||
kwargs.pop("mineru_llm_name", None)
|
||||
sections, tbls, pdf_parser = pdf_parser(
|
||||
filename = filename,
|
||||
binary = binary,
|
||||
@ -214,6 +219,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
callback = callback,
|
||||
pdf_cls = Pdf,
|
||||
layout_recognizer = layout_recognizer,
|
||||
mineru_llm_name=parser_model_name,
|
||||
parse_method = "manual",
|
||||
**kwargs
|
||||
)
|
||||
@ -232,7 +238,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
poss = pdf_parser.extract_positions(poss)
|
||||
if poss:
|
||||
first = poss[0] # tuple: ([pn], x1, x2, y1, y2)
|
||||
pn = first[0]
|
||||
pn = first[0]
|
||||
if isinstance(pn, list) and pn:
|
||||
pn = pn[0] # [pn] -> pn
|
||||
poss[0] = (pn, *first[1:])
|
||||
|
||||
@ -36,10 +36,11 @@ from deepdoc.parser.figure_parser import VisionFigureParser,vision_figure_parser
|
||||
from deepdoc.parser.pdf_parser import PlainParser, VisionParser
|
||||
from deepdoc.parser.docling_parser import DoclingParser
|
||||
from deepdoc.parser.tcadp_parser import TCADPParser
|
||||
from common.parser_config_utils import normalize_layout_recognizer
|
||||
from rag.nlp import concat_img, find_codec, naive_merge, naive_merge_with_images, naive_merge_docx, rag_tokenizer, tokenize_chunks, tokenize_chunks_with_images, tokenize_table, attach_media_context
|
||||
|
||||
|
||||
def by_deepdoc(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
|
||||
def by_deepdoc(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None, **kwargs):
|
||||
callback = callback
|
||||
binary = binary
|
||||
pdf_parser = pdf_cls() if pdf_cls else Pdf()
|
||||
@ -56,11 +57,19 @@ def by_deepdoc(filename, binary=None, from_page=0, to_page=100000, lang="Chinese
|
||||
return sections, tables, pdf_parser
|
||||
|
||||
|
||||
def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
|
||||
parse_method = kwargs.get("parse_method", "raw")
|
||||
mineru_llm_name = kwargs.get("mineru_llm_name")
|
||||
tenant_id = kwargs.get("tenant_id")
|
||||
|
||||
def by_mineru(
|
||||
filename,
|
||||
binary=None,
|
||||
from_page=0,
|
||||
to_page=100000,
|
||||
lang="Chinese",
|
||||
callback=None,
|
||||
pdf_cls=None,
|
||||
parse_method: str = "raw",
|
||||
mineru_llm_name: str | None = None,
|
||||
tenant_id: str | None = None,
|
||||
**kwargs,
|
||||
):
|
||||
pdf_parser = None
|
||||
if tenant_id:
|
||||
if not mineru_llm_name:
|
||||
@ -86,7 +95,7 @@ def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese"
|
||||
callback=callback,
|
||||
parse_method=parse_method,
|
||||
lang=lang,
|
||||
**kwargs
|
||||
**kwargs,
|
||||
)
|
||||
return sections, tables, pdf_parser
|
||||
except Exception as e:
|
||||
@ -97,9 +106,7 @@ def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese"
|
||||
return None, None, None
|
||||
|
||||
|
||||
|
||||
|
||||
def by_docling(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
|
||||
def by_docling(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None, **kwargs):
|
||||
pdf_parser = DoclingParser()
|
||||
parse_method = kwargs.get("parse_method", "raw")
|
||||
|
||||
@ -118,7 +125,7 @@ def by_docling(filename, binary=None, from_page=0, to_page=100000, lang="Chinese
|
||||
return sections, tables, pdf_parser
|
||||
|
||||
|
||||
def by_tcadp(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
|
||||
def by_tcadp(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None, **kwargs):
|
||||
tcadp_parser = TCADPParser()
|
||||
|
||||
if not tcadp_parser.check_installation():
|
||||
@ -136,10 +143,19 @@ def by_tcadp(filename, binary=None, from_page=0, to_page=100000, lang="Chinese",
|
||||
|
||||
|
||||
def by_plaintext(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs):
|
||||
if kwargs.get("layout_recognizer", "") == "Plain Text":
|
||||
layout_recognizer = (kwargs.get("layout_recognizer") or "").strip()
|
||||
if (not layout_recognizer) or (layout_recognizer == "Plain Text"):
|
||||
pdf_parser = PlainParser()
|
||||
else:
|
||||
vision_model = LLMBundle(kwargs["tenant_id"], LLMType.IMAGE2TEXT, llm_name=kwargs.get("layout_recognizer", ""), lang=kwargs.get("lang", "Chinese"))
|
||||
tenant_id = kwargs.get("tenant_id")
|
||||
if not tenant_id:
|
||||
raise ValueError("tenant_id is required when using vision layout recognizer")
|
||||
vision_model = LLMBundle(
|
||||
tenant_id,
|
||||
LLMType.IMAGE2TEXT,
|
||||
llm_name=layout_recognizer,
|
||||
lang=kwargs.get("lang", "Chinese"),
|
||||
)
|
||||
pdf_parser = VisionParser(vision_model=vision_model, **kwargs)
|
||||
|
||||
sections, tables = pdf_parser(
|
||||
@ -716,14 +732,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", ca
|
||||
return res
|
||||
|
||||
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
|
||||
layout_recognizer_raw = parser_config.get("layout_recognize", "DeepDOC")
|
||||
parser_model_name = None
|
||||
layout_recognizer = layout_recognizer_raw
|
||||
if isinstance(layout_recognizer_raw, str):
|
||||
lowered = layout_recognizer_raw.lower()
|
||||
if lowered.endswith("@mineru"):
|
||||
parser_model_name = layout_recognizer_raw.split("@", 1)[0]
|
||||
layout_recognizer = "MinerU"
|
||||
layout_recognizer, parser_model_name = normalize_layout_recognizer(
|
||||
parser_config.get("layout_recognize", "DeepDOC")
|
||||
)
|
||||
|
||||
if parser_config.get("analyze_hyperlink", False) and is_root:
|
||||
urls = extract_links_from_pdf(binary)
|
||||
|
||||
@ -24,6 +24,7 @@ from rag.nlp import rag_tokenizer, tokenize
|
||||
from deepdoc.parser import PdfParser, ExcelParser, HtmlParser
|
||||
from deepdoc.parser.figure_parser import vision_figure_parser_docx_wrapper
|
||||
from rag.app.naive import by_plaintext, PARSERS
|
||||
from common.parser_config_utils import normalize_layout_recognizer
|
||||
|
||||
class Pdf(PdfParser):
|
||||
def __call__(self, filename, binary=None, from_page=0,
|
||||
@ -82,7 +83,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
callback(0.8, "Finish parsing.")
|
||||
|
||||
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
|
||||
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
|
||||
layout_recognizer, parser_model_name = normalize_layout_recognizer(
|
||||
parser_config.get("layout_recognize", "DeepDOC")
|
||||
)
|
||||
|
||||
if isinstance(layout_recognizer, bool):
|
||||
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
|
||||
@ -100,6 +103,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
callback = callback,
|
||||
pdf_cls = Pdf,
|
||||
layout_recognizer = layout_recognizer,
|
||||
mineru_llm_name=parser_model_name,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
|
||||
@ -24,6 +24,7 @@ from rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bull
|
||||
from deepdoc.parser import PdfParser
|
||||
import numpy as np
|
||||
from rag.app.naive import by_plaintext, PARSERS
|
||||
from common.parser_config_utils import normalize_layout_recognizer
|
||||
|
||||
|
||||
class Pdf(PdfParser):
|
||||
@ -149,7 +150,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
"parser_config", {
|
||||
"chunk_token_num": 512, "delimiter": "\n!?。;!?", "layout_recognize": "DeepDOC"})
|
||||
if re.search(r"\.pdf$", filename, re.IGNORECASE):
|
||||
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
|
||||
layout_recognizer, parser_model_name = normalize_layout_recognizer(
|
||||
parser_config.get("layout_recognize", "DeepDOC")
|
||||
)
|
||||
|
||||
if isinstance(layout_recognizer, bool):
|
||||
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
|
||||
@ -163,6 +166,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
paper = pdf_parser(filename if not binary else binary,
|
||||
from_page=from_page, to_page=to_page, callback=callback)
|
||||
else:
|
||||
kwargs.pop("parse_method", None)
|
||||
kwargs.pop("mineru_llm_name", None)
|
||||
sections, tables, pdf_parser = pdf_parser(
|
||||
filename=filename,
|
||||
binary=binary,
|
||||
@ -171,6 +176,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
lang=lang,
|
||||
callback=callback,
|
||||
pdf_cls=Pdf,
|
||||
layout_recognizer=layout_recognizer,
|
||||
mineru_llm_name=parser_model_name,
|
||||
parse_method="paper",
|
||||
**kwargs
|
||||
)
|
||||
|
||||
@ -24,6 +24,7 @@ from PyPDF2 import PdfReader as pdf2_read
|
||||
|
||||
from deepdoc.parser import PdfParser, PptParser, PlainParser
|
||||
from rag.app.naive import by_plaintext, PARSERS
|
||||
from common.parser_config_utils import normalize_layout_recognizer
|
||||
from rag.nlp import rag_tokenizer
|
||||
from rag.nlp import tokenize, is_english
|
||||
|
||||
@ -195,7 +196,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
res.append(d)
|
||||
return res
|
||||
elif re.search(r"\.pdf$", filename, re.IGNORECASE):
|
||||
layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
|
||||
layout_recognizer, parser_model_name = normalize_layout_recognizer(
|
||||
parser_config.get("layout_recognize", "DeepDOC")
|
||||
)
|
||||
|
||||
if isinstance(layout_recognizer, bool):
|
||||
layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"
|
||||
@ -213,6 +216,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
|
||||
callback=callback,
|
||||
pdf_cls=Pdf,
|
||||
layout_recognizer=layout_recognizer,
|
||||
mineru_llm_name=parser_model_name,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
|
||||
@ -121,7 +121,7 @@ make logs # With Make
|
||||
### 🧰 Makefile Toolbox
|
||||
|
||||
| Command | Description |
|
||||
| ----------------- | ------------------------------------------------ |
|
||||
|-------------------|--------------------------------------------------|
|
||||
| `make` | Setup, build, launch and test all at once |
|
||||
| `make setup` | Initialize environment and install uv |
|
||||
| `make ensure_env` | Auto-create `.env` if missing |
|
||||
@ -183,7 +183,7 @@ This security model strikes a balance between **robust isolation** and **develop
|
||||
Currently, the following languages are officially supported:
|
||||
|
||||
| Language | Priority |
|
||||
| -------- | -------- |
|
||||
|----------|----------|
|
||||
| Python | High |
|
||||
| Node.js | Medium |
|
||||
|
||||
|
||||
@ -42,6 +42,7 @@ import { ExcelToHtmlFormField } from '../excel-to-html-form-field';
|
||||
import { FormContainer } from '../form-container';
|
||||
import { LayoutRecognizeFormField } from '../layout-recognize-form-field';
|
||||
import { MaxTokenNumberFormField } from '../max-token-number-from-field';
|
||||
import { MinerUOptionsFormField } from '../mineru-options-form-field';
|
||||
import { ButtonLoading } from '../ui/button';
|
||||
import { Input } from '../ui/input';
|
||||
import { DynamicPageRange } from './dynamic-page-range';
|
||||
@ -335,7 +336,10 @@ export function ChunkMethodDialog({
|
||||
className="space-y-3"
|
||||
>
|
||||
{showOne && (
|
||||
<LayoutRecognizeFormField showMineruOptions={false} />
|
||||
<>
|
||||
<LayoutRecognizeFormField showMineruOptions={false} />
|
||||
{isMineruSelected && <MinerUOptionsFormField />}
|
||||
</>
|
||||
)}
|
||||
{showMaxTokenNumber && (
|
||||
<>
|
||||
@ -359,9 +363,6 @@ export function ChunkMethodDialog({
|
||||
}
|
||||
className="space-y-3"
|
||||
>
|
||||
{isMineruSelected && (
|
||||
<LayoutRecognizeFormField showMineruOptions />
|
||||
)}
|
||||
{selectedTag === DocumentParserType.Naive && (
|
||||
<EnableTocToggle />
|
||||
)}
|
||||
|
||||
Reference in New Issue
Block a user