Mirror of https://github.com/infiniflow/ragflow.git (synced 2026-02-06 02:25:05 +08:00)

Compare commits: d38f8a1562 ... 672958a192 (3 commits: 672958a192, 3820de916c, ef44979b5c)
@@ -207,7 +207,7 @@ releases! 🌟
> Note: Prior to `v0.22.0`, we provided both images with embedding models and slim images without embedding models. Details as follows:

| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
-| ----------------- | --------------- | --------------------- | ------------------------ |
+|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | ≈9 | ✔️ | Stable release |
| v0.21.1-slim | ≈2 | ❌ | Stable release |

@@ -207,7 +207,7 @@ Coba demo kami di [https://demo.ragflow.io](https://demo.ragflow.io).
> Catatan: Sebelum `v0.22.0`, kami menyediakan image dengan model embedding dan image slim tanpa model embedding. Detailnya sebagai berikut:

| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
-| ----------------- | --------------- | --------------------- | ------------------------ |
+|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | ≈9 | ✔️ | Stable release |
| v0.21.1-slim | ≈2 | ❌ | Stable release |

@@ -187,7 +187,7 @@
> 注意:`v0.22.0` より前のバージョンでは、embedding モデルを含むイメージと、embedding モデルを含まない slim イメージの両方を提供していました。詳細は以下の通りです:

| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
-| ----------------- | --------------- | --------------------- | ------------------------ |
+|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | ≈9 | ✔️ | Stable release |
| v0.21.1-slim | ≈2 | ❌ | Stable release |

@@ -189,7 +189,7 @@
> 참고: `v0.22.0` 이전 버전에서는 embedding 모델이 포함된 이미지와 embedding 모델이 포함되지 않은 slim 이미지를 모두 제공했습니다. 자세한 내용은 다음과 같습니다:

| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
-| ----------------- | --------------- | --------------------- | ------------------------ |
+|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | ≈9 | ✔️ | Stable release |
| v0.21.1-slim | ≈2 | ❌ | Stable release |

@@ -207,7 +207,7 @@ Experimente nossa demo em [https://demo.ragflow.io](https://demo.ragflow.io).
> Nota: Antes da `v0.22.0`, fornecíamos imagens com modelos de embedding e imagens slim sem modelos de embedding. Detalhes a seguir:

| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
-| ----------------- | --------------- | --------------------- | ------------------------ |
+|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | ≈9 | ✔️ | Stable release |
| v0.21.1-slim | ≈2 | ❌ | Stable release |

@@ -206,7 +206,7 @@
> 注意:在 `v0.22.0` 之前的版本,我們會同時提供包含 embedding 模型的映像和不含 embedding 模型的 slim 映像。具體如下:

| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
-| ----------------- | --------------- | --------------------- | ------------------------ |
+|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | ≈9 | ✔️ | Stable release |
| v0.21.1-slim | ≈2 | ❌ | Stable release |

@@ -207,7 +207,7 @@
> 注意:在 `v0.22.0` 之前的版本,我们会同时提供包含 embedding 模型的镜像和不含 embedding 模型的 slim 镜像。具体如下:

| RAGFlow image tag | Image size (GB) | Has embedding models? | Stable? |
-| ----------------- | --------------- | --------------------- | ------------------------ |
+|-------------------|-----------------|-----------------------|----------------|
| v0.21.1 | ≈9 | ✔️ | Stable release |
| v0.21.1-slim | ≈2 | ❌ | Stable release |

@@ -6,7 +6,7 @@ Use this section to tell people about which versions of your project are
currently being supported with security updates.

| Version | Supported |
-| ------- | ------------------ |
+|---------|--------------------|
| <=0.7.0 | :white_check_mark: |

## Reporting a Vulnerability

@@ -252,7 +252,6 @@ async def delete_chats(tenant_id):
            continue
        temp_dict = {"status": StatusEnum.INVALID.value}
        success_count += DialogService.update_by_id(id, temp_dict)
-        print(success_count, "$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$", flush=True)

    if errors:
        if success_count > 0:

common/parser_config_utils.py (new file, 30 lines)
@@ -0,0 +1,30 @@
+#
+#  Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+
+from typing import Any
+
+
+def normalize_layout_recognizer(layout_recognizer_raw: Any) -> tuple[Any, str | None]:
+    parser_model_name: str | None = None
+    layout_recognizer = layout_recognizer_raw
+
+    if isinstance(layout_recognizer_raw, str):
+        lowered = layout_recognizer_raw.lower()
+        if lowered.endswith("@mineru"):
+            parser_model_name = layout_recognizer_raw.rsplit("@", 1)[0]
+            layout_recognizer = "MinerU"
+
+    return layout_recognizer, parser_model_name

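
For clarity, here is a minimal usage sketch of the new helper; the model names are hypothetical and the assertions simply restate the logic of the function above:

```python
from common.parser_config_utils import normalize_layout_recognizer

# A "<model>@mineru" value (case-insensitive suffix) selects the MinerU pipeline
# and carries the model name along for the parser.
assert normalize_layout_recognizer("qwen-vl-max@MinerU") == ("MinerU", "qwen-vl-max")

# Any other value, including the booleans the callers normalize later, passes through unchanged.
assert normalize_layout_recognizer("DeepDOC") == ("DeepDOC", None)
assert normalize_layout_recognizer(True) == (True, None)
```

Because the helper uses `rsplit("@", 1)`, a factory-qualified name such as `some-model@SomeProvider@mineru` (hypothetical) keeps its `@SomeProvider` part; the inline logic it replaces in `rag/app/naive.py` below used `split("@", 1)[0]` and would have truncated it.
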
@@ -262,10 +262,8 @@ class MinerUParser(RAGFlowPdfParser):
        elif self.mineru_server_url:
            data["server_url"] = self.mineru_server_url

-        print("--------------------------------", flush=True)
-        print(f"{data=}", flush=True)
-        print(f"{options=}", flush=True)
-        print("--------------------------------", flush=True)
+        self.logger.info(f"[MinerU] request {data=}")
+        self.logger.info(f"[MinerU] request {options=}")

        headers = {"Accept": "application/json"}
        try:

@@ -14,7 +14,7 @@ To access the RAGFlow admin UI, append `/admin` to the web UI's address, e.g. `h

### Default Credentials
| Username | Password |
-|----------|----------|
+|--------------------|----------|
| `admin@ragflow.io` | `admin` |

## Admin UI Overview

@@ -158,7 +158,7 @@ Optional. Text to display as a diagonal watermark across each page. Useful for m
The **Docs Generator** component provides the following output variables:

| Variable name | Type | Description |
-| ------------- | --------- | --------------------------------------------------------------------------- |
+|---------------|-----------|--------------------------------------------------------------|
| `file_path` | `string` | The server path where the generated document is saved. |
| `pdf_base64` | `string` | The document content encoded in base64 format. |
| `download` | `string` | JSON containing download information for the chat interface. |

@@ -190,7 +190,7 @@ The **Docs Generator** includes intelligent font handling for international cont
### Supported scripts

| Script | Unicode Range | Font Used |
-| ------ | ------------- | --------- |
+|------------------------------|---------------|--------------------|
| Chinese (CJK) | U+4E00–U+9FFF | STSong-Light |
| Japanese (Hiragana/Katakana) | U+3040–U+30FF | HeiseiMin-W3 |
| Korean (Hangul) | U+AC00–U+D7AF | HYSMyeongJo-Medium |

@@ -18,7 +18,7 @@ Within the configuration panel, you can add multiple parsers and set the corresp
The **Parser** component supports parsing the following file types:

| File type | File format |
-| ------------- | ------------------------ |
+|---------------|--------------------------|
| PDF | PDF |
| Spreadsheet | XLSX, XLS, CSV |
| Image | PNG, JPG, JPEG, GIF, TIF |

@@ -98,7 +98,7 @@ A Video parser transcribes video files to text. To use this parser, you must fir
The global variable names for the output of the **Parser** component, which can be referenced by subsequent components in the ingestion pipeline.

| Variable name | Type |
-| ------------- | ------------------------ |
+|---------------|-----------------|
| `markdown` | `string` |
| `text` | `string` |
| `html` | `string` |

@@ -45,7 +45,7 @@ Click the light bulb icon above the *current* dialogue and scroll down the popup


| Item name | Description |
-| ----------------- |-----------------------------------------------------------------------------------------------|
+|-------------------|-----------------------------------------------------------------------------------------------|
| Total | Total time spent on this conversation round, including chunk retrieval and answer generation. |
| Check LLM | Time to validate the specified LLM. |
| Create retriever | Time to create a chunk retriever. |

@@ -40,7 +40,7 @@ This section covers the following topics:
RAGFlow offers multiple built-in chunking template to facilitate chunking files of different layouts and ensure semantic integrity. From the **Built-in** chunking method dropdown under **Parse type**, you can choose the default template that suits the layouts and formats of your files. The following table shows the descriptions and the compatible file formats of each supported chunk template:

| **Template** | Description | File format |
-|--------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
+|--------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| General | Files are consecutively chunked based on a preset chunk token number. | MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML |
| Q&A | Retrieves relevant information and generates answers to respond to questions. | XLSX, XLS (Excel 97-2003), CSV/TXT |
| Resume | Enterprise edition only. You can also try it out on demo.ragflow.io. | DOCX, PDF, TXT |

@@ -14,7 +14,7 @@ A complete reference for RAGFlow's RESTful API. Before proceeding, please ensure
---

| Code | Message | Description |
-| ---- | --------------------- | -------------------------- |
+|------|-----------------------|----------------------------|
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |

@@ -23,7 +23,7 @@ pip install ragflow-sdk
---

| Code | Message | Description |
-|------|----------------------|-----------------------------|
+|------|-----------------------|----------------------------|
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |

@@ -82,7 +82,7 @@ pip install ragflow-firecrawl-integration
## Configuration Options

| Option | Description | Default | Required |
-|--------|-------------|---------|----------|
+|--------------------|----------------------------------|-----------------------------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |

@@ -100,7 +100,7 @@ intergrations/firecrawl/
## 🔧 **Configuration Options**

| Option | Description | Default | Required |
-|--------|-------------|---------|----------|
+|--------------------|----------------------------------|-----------------------------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |

@@ -21,6 +21,7 @@ from io import BytesIO
from deepdoc.parser.utils import get_text
from rag.app import naive
from rag.app.naive import by_plaintext, PARSERS
+from common.parser_config_utils import normalize_layout_recognizer
from rag.nlp import bullets_category, is_english,remove_contents_table, \
    hierarchical_merge, make_colon_as_title, naive_merge, random_choices, tokenize_table, \
    tokenize_chunks, attach_media_context

@@ -96,7 +97,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
        callback(0.8, "Finish parsing.")

    elif re.search(r"\.pdf$", filename, re.IGNORECASE):
-        layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
+        layout_recognizer, parser_model_name = normalize_layout_recognizer(
+            parser_config.get("layout_recognize", "DeepDOC")
+        )

        if isinstance(layout_recognizer, bool):
            layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"

@@ -114,6 +117,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
            callback = callback,
            pdf_cls = Pdf,
            layout_recognizer = layout_recognizer,
+            mineru_llm_name=parser_model_name,
            **kwargs
        )

@@ -26,6 +26,7 @@ from rag.nlp import bullets_category, remove_contents_table, \
from rag.nlp import rag_tokenizer, Node
from deepdoc.parser import PdfParser, DocxParser, HtmlParser
from rag.app.naive import by_plaintext, PARSERS
+from common.parser_config_utils import normalize_layout_recognizer



@@ -155,7 +156,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
        return tokenize_chunks(chunks, doc, eng, None)

    elif re.search(r"\.pdf$", filename, re.IGNORECASE):
-        layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
+        layout_recognizer, parser_model_name = normalize_layout_recognizer(
+            parser_config.get("layout_recognize", "DeepDOC")
+        )

        if isinstance(layout_recognizer, bool):
            layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"

@@ -173,6 +176,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
            callback = callback,
            pdf_cls = Pdf,
            layout_recognizer = layout_recognizer,
+            mineru_llm_name=parser_model_name,
            **kwargs
        )

@@ -27,6 +27,7 @@ from deepdoc.parser.figure_parser import vision_figure_parser_pdf_wrapper,vision
from docx import Document
from PIL import Image
from rag.app.naive import by_plaintext, PARSERS
+from common.parser_config_utils import normalize_layout_recognizer

class Pdf(PdfParser):
    def __init__(self):

@@ -196,7 +197,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
    # is it English
    eng = lang.lower() == "english" # pdf_parser.is_english
    if re.search(r"\.pdf$", filename, re.IGNORECASE):
-        layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
+        layout_recognizer, parser_model_name = normalize_layout_recognizer(
+            parser_config.get("layout_recognize", "DeepDOC")
+        )

        if isinstance(layout_recognizer, bool):
            layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"

@@ -205,6 +208,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
        pdf_parser = PARSERS.get(name, by_plaintext)
        callback(0.1, "Start to parse.")

+        kwargs.pop("parse_method", None)
+        kwargs.pop("mineru_llm_name", None)
        sections, tbls, pdf_parser = pdf_parser(
            filename = filename,
            binary = binary,

@@ -214,6 +219,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
            callback = callback,
            pdf_cls = Pdf,
            layout_recognizer = layout_recognizer,
+            mineru_llm_name=parser_model_name,
            parse_method = "manual",
            **kwargs
        )

@@ -36,6 +36,7 @@ from deepdoc.parser.figure_parser import VisionFigureParser,vision_figure_parser
from deepdoc.parser.pdf_parser import PlainParser, VisionParser
from deepdoc.parser.docling_parser import DoclingParser
from deepdoc.parser.tcadp_parser import TCADPParser
+from common.parser_config_utils import normalize_layout_recognizer
from rag.nlp import concat_img, find_codec, naive_merge, naive_merge_with_images, naive_merge_docx, rag_tokenizer, tokenize_chunks, tokenize_chunks_with_images, tokenize_table, attach_media_context


@@ -56,11 +57,19 @@ def by_deepdoc(filename, binary=None, from_page=0, to_page=100000, lang="Chinese
    return sections, tables, pdf_parser


-def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None ,**kwargs):
-    parse_method = kwargs.get("parse_method", "raw")
-    mineru_llm_name = kwargs.get("mineru_llm_name")
-    tenant_id = kwargs.get("tenant_id")
+def by_mineru(
+    filename,
+    binary=None,
+    from_page=0,
+    to_page=100000,
+    lang="Chinese",
+    callback=None,
+    pdf_cls=None,
+    parse_method: str = "raw",
+    mineru_llm_name: str | None = None,
+    tenant_id: str | None = None,
+    **kwargs,
+):
    pdf_parser = None
    if tenant_id:
        if not mineru_llm_name:

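
A quick sketch of how a chunker would now call this function; the concrete values are hypothetical, and in the actual hunks below the arguments come from `normalize_layout_recognizer(...)` and the caller's `kwargs`:

```python
# Hypothetical wiring, mirroring the chunk() hunks in the chunker modules below.
sections, tables, pdf_parser = by_mineru(
    "example.pdf",                       # hypothetical file name
    binary=pdf_bytes,                    # hypothetical raw PDF bytes
    lang="Chinese",
    callback=progress_callback,          # hypothetical progress callback
    parse_method="manual",               # defaults to "raw" when omitted
    mineru_llm_name=parser_model_name,   # from normalize_layout_recognizer(...)
    tenant_id=tenant_id,                 # used by the "if tenant_id:" branch above
)
```

Promoting `parse_method`, `mineru_llm_name`, and `tenant_id` from `kwargs.get(...)` lookups to explicit keyword parameters makes the expected inputs visible in the signature, while `**kwargs` still absorbs extras such as `layout_recognizer`.
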
@@ -86,7 +95,7 @@ def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese"
            callback=callback,
            parse_method=parse_method,
            lang=lang,
-            **kwargs
+            **kwargs,
        )
        return sections, tables, pdf_parser
    except Exception as e:

@@ -97,8 +106,6 @@ def by_mineru(filename, binary=None, from_page=0, to_page=100000, lang="Chinese"
        return None, None, None

-
-

def by_docling(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, pdf_cls = None, **kwargs):
    pdf_parser = DoclingParser()
    parse_method = kwargs.get("parse_method", "raw")

@@ -136,10 +143,19 @@ def by_tcadp(filename, binary=None, from_page=0, to_page=100000, lang="Chinese",


def by_plaintext(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs):
-    if kwargs.get("layout_recognizer", "") == "Plain Text":
+    layout_recognizer = (kwargs.get("layout_recognizer") or "").strip()
+    if (not layout_recognizer) or (layout_recognizer == "Plain Text"):
        pdf_parser = PlainParser()
    else:
-        vision_model = LLMBundle(kwargs["tenant_id"], LLMType.IMAGE2TEXT, llm_name=kwargs.get("layout_recognizer", ""), lang=kwargs.get("lang", "Chinese"))
+        tenant_id = kwargs.get("tenant_id")
+        if not tenant_id:
+            raise ValueError("tenant_id is required when using vision layout recognizer")
+        vision_model = LLMBundle(
+            tenant_id,
+            LLMType.IMAGE2TEXT,
+            llm_name=layout_recognizer,
+            lang=kwargs.get("lang", "Chinese"),
+        )
        pdf_parser = VisionParser(vision_model=vision_model, **kwargs)

    sections, tables = pdf_parser(

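
In short, an empty or missing `layout_recognizer` now falls back to `PlainParser`, and a vision recognizer requires a `tenant_id` (raising a clear `ValueError` rather than a `KeyError` on `kwargs["tenant_id"]`). A minimal sketch of the two call paths, with hypothetical file names, callbacks, and model/tenant values:

```python
# Plain-text path: no tenant needed, PlainParser handles the PDF.
sections, tables, parser = by_plaintext(
    "example.pdf",                        # hypothetical file
    binary=pdf_bytes,                     # hypothetical raw bytes
    callback=lambda prog, msg="": None,   # hypothetical progress callback
    layout_recognizer="",                 # empty -> PlainParser
)

# Vision path: any other recognizer name is treated as an IMAGE2TEXT model,
# so a tenant_id is needed for LLMBundle to resolve it.
sections, tables, parser = by_plaintext(
    "example.pdf",
    binary=pdf_bytes,
    callback=lambda prog, msg="": None,
    layout_recognizer="some-vision-model",  # hypothetical model name
    tenant_id="tenant-123",                 # hypothetical tenant
    lang="English",
)
```
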
@@ -716,14 +732,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", ca
        return res

    elif re.search(r"\.pdf$", filename, re.IGNORECASE):
-        layout_recognizer_raw = parser_config.get("layout_recognize", "DeepDOC")
-        parser_model_name = None
-        layout_recognizer = layout_recognizer_raw
-        if isinstance(layout_recognizer_raw, str):
-            lowered = layout_recognizer_raw.lower()
-            if lowered.endswith("@mineru"):
-                parser_model_name = layout_recognizer_raw.split("@", 1)[0]
-                layout_recognizer = "MinerU"
+        layout_recognizer, parser_model_name = normalize_layout_recognizer(
+            parser_config.get("layout_recognize", "DeepDOC")
+        )

        if parser_config.get("analyze_hyperlink", False) and is_root:
            urls = extract_links_from_pdf(binary)

@@ -24,6 +24,7 @@ from rag.nlp import rag_tokenizer, tokenize
from deepdoc.parser import PdfParser, ExcelParser, HtmlParser
from deepdoc.parser.figure_parser import vision_figure_parser_docx_wrapper
from rag.app.naive import by_plaintext, PARSERS
+from common.parser_config_utils import normalize_layout_recognizer

class Pdf(PdfParser):
    def __call__(self, filename, binary=None, from_page=0,

@@ -82,7 +83,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
        callback(0.8, "Finish parsing.")

    elif re.search(r"\.pdf$", filename, re.IGNORECASE):
-        layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
+        layout_recognizer, parser_model_name = normalize_layout_recognizer(
+            parser_config.get("layout_recognize", "DeepDOC")
+        )

        if isinstance(layout_recognizer, bool):
            layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"

@@ -100,6 +103,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
            callback = callback,
            pdf_cls = Pdf,
            layout_recognizer = layout_recognizer,
+            mineru_llm_name=parser_model_name,
            **kwargs
        )

@@ -24,6 +24,7 @@ from rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bull
from deepdoc.parser import PdfParser
import numpy as np
from rag.app.naive import by_plaintext, PARSERS
+from common.parser_config_utils import normalize_layout_recognizer


class Pdf(PdfParser):

@@ -149,7 +150,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
        "parser_config", {
            "chunk_token_num": 512, "delimiter": "\n!?。;!?", "layout_recognize": "DeepDOC"})
    if re.search(r"\.pdf$", filename, re.IGNORECASE):
-        layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
+        layout_recognizer, parser_model_name = normalize_layout_recognizer(
+            parser_config.get("layout_recognize", "DeepDOC")
+        )

        if isinstance(layout_recognizer, bool):
            layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"

@@ -163,6 +166,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
        paper = pdf_parser(filename if not binary else binary,
                           from_page=from_page, to_page=to_page, callback=callback)
    else:
+        kwargs.pop("parse_method", None)
+        kwargs.pop("mineru_llm_name", None)
        sections, tables, pdf_parser = pdf_parser(
            filename=filename,
            binary=binary,

@@ -171,6 +176,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
            lang=lang,
            callback=callback,
            pdf_cls=Pdf,
+            layout_recognizer=layout_recognizer,
+            mineru_llm_name=parser_model_name,
            parse_method="paper",
            **kwargs
        )

@@ -24,6 +24,7 @@ from PyPDF2 import PdfReader as pdf2_read

from deepdoc.parser import PdfParser, PptParser, PlainParser
from rag.app.naive import by_plaintext, PARSERS
+from common.parser_config_utils import normalize_layout_recognizer
from rag.nlp import rag_tokenizer
from rag.nlp import tokenize, is_english

@@ -195,7 +196,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
            res.append(d)
        return res
    elif re.search(r"\.pdf$", filename, re.IGNORECASE):
-        layout_recognizer = parser_config.get("layout_recognize", "DeepDOC")
+        layout_recognizer, parser_model_name = normalize_layout_recognizer(
+            parser_config.get("layout_recognize", "DeepDOC")
+        )

        if isinstance(layout_recognizer, bool):
            layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"

@@ -213,6 +216,7 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
            callback=callback,
            pdf_cls=Pdf,
            layout_recognizer=layout_recognizer,
+            mineru_llm_name=parser_model_name,
            **kwargs
        )

@@ -121,7 +121,7 @@ make logs # With Make
### 🧰 Makefile Toolbox

| Command | Description |
-| ----------------- | ------------------------------------------------ |
+|-------------------|--------------------------------------------------|
| `make` | Setup, build, launch and test all at once |
| `make setup` | Initialize environment and install uv |
| `make ensure_env` | Auto-create `.env` if missing |

@@ -183,7 +183,7 @@ This security model strikes a balance between **robust isolation** and **develop
Currently, the following languages are officially supported:

| Language | Priority |
-| -------- | -------- |
+|----------|----------|
| Python | High |
| Node.js | Medium |

@@ -42,6 +42,7 @@ import { ExcelToHtmlFormField } from '../excel-to-html-form-field';
import { FormContainer } from '../form-container';
import { LayoutRecognizeFormField } from '../layout-recognize-form-field';
import { MaxTokenNumberFormField } from '../max-token-number-from-field';
+import { MinerUOptionsFormField } from '../mineru-options-form-field';
import { ButtonLoading } from '../ui/button';
import { Input } from '../ui/input';
import { DynamicPageRange } from './dynamic-page-range';

@@ -335,7 +336,10 @@ export function ChunkMethodDialog({
            className="space-y-3"
          >
            {showOne && (
+              <>
                <LayoutRecognizeFormField showMineruOptions={false} />
+                {isMineruSelected && <MinerUOptionsFormField />}
+              </>
            )}
            {showMaxTokenNumber && (
              <>

@@ -359,9 +363,6 @@ export function ChunkMethodDialog({
            }
            className="space-y-3"
          >
-            {isMineruSelected && (
-              <LayoutRecognizeFormField showMineruOptions />
-            )}
            {selectedTag === DocumentParserType.Naive && (
              <EnableTocToggle />
            )}