Commit Graph

266 Commits

Author SHA1 Message Date
b40d639fdb Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541)
### What problem does this PR solve?

1) Create  dataset using table parser for infinity
2) Answer questions in chat using SQL

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-19 19:35:14 +08:00
678a4f959c Fix: skip internal bookmark references in DOCX parsing (#12604) (#12611)
### What problem does this PR solve?

Fixes #12604 - DOCX files containing hyperlinks to internal bookmarks
(e.g., `#_文档目录`) cause a `KeyError` during parsing:

```
KeyError: "There is no item named 'word/#_文档目录' in the archive"
```

This happens because python-docx incorrectly tries to read internal
bookmark references as files from the ZIP archive. Internal bookmarks
are relationship targets starting with `#` and are not actual files.

This PR extends the existing `load_from_xml_v2` workaround (which
already handles `NULL` targets) to also skip relationship targets
starting with `#`.

Related upstream issue:
https://github.com/python-openxml/python-docx/issues/902

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---
Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=94194147
2026-01-14 19:08:46 +08:00
b226e06e2d refactor: remove debug print statements (#12534)
### What problem does this PR solve?

refactor: remove debug print statements

### Type of change

- [x] Refactoring
2026-01-09 19:23:50 +08:00
2e09db02f3 feat: add paddleocr parser (#12513)
### What problem does this PR solve?

Add PaddleOCR as a new PDF parser.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-09 17:48:45 +08:00
011bbe9556 Feat: support context window for docx (#12455)
### What problem does this PR solve?

Feat: support context window for docx

#12303

Done:
- [x] naive.py
- [x] one.py

TODO:
- [ ] book.py
- [ ] manual.py

Fix: incorrect image position
Fix: incorrect chunk type tag

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-01-07 15:08:17 +08:00
4cd4526492 Feat: PDF vision figure parser supports reading context (#12416)
### What problem does this PR solve?

PDF vision figure parser supports reading context.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-05 09:55:43 +08:00
52f91c2388 Refine: image/table context. (#12336)
### What problem does this PR solve?

#12303

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-30 20:24:27 +08:00
f0392e7501 Fix IDE warnings (#12315)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 15:04:09 +08:00
df3cbb9b9e Refactor code (#12305)
### What problem does this PR solve?

as title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 11:09:18 +08:00
37e4485415 feat: add MDX file support (#12261)
Feat: add MDX file support  #12057 
### What problem does this PR solve?

<img width="1055" height="270" alt="image"
src="https://github.com/user-attachments/assets/a0ab49f9-7806-41cd-8a96-f593591ab36b"
/>

The page states that MDX files are supported, but uploading fails with
the error: "x.mdx: This type of file has not been supported yet!"
<img width="381" height="110" alt="image"
src="https://github.com/user-attachments/assets/4bbb7d08-cb47-416a-95fc-bc90b90fcc39"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-29 12:54:31 +08:00
01f0ced1e6 Fix IDE warnings (#12281)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-29 12:01:18 +08:00
bd76b8ff1a Fix: Tika server upgrades. (#12073)
### What problem does this PR solve?

#12037

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-23 09:35:52 +08:00
b49eb6826b Feat: enhance Excel image extraction with vision-based descriptions (#12054)
### What problem does this PR solve?
issue:
[#11618](https://github.com/infiniflow/ragflow/issues/11618)
change:
enhance Excel image extraction with vision-based descriptions

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-22 10:17:44 +08:00
4dd8cdc38b task executor issues (#12006)
### What problem does this PR solve?

**Fixes #8706** - `InfinityException: TOO_MANY_CONNECTIONS` when running
multiple task executor workers

### Problem Description

When running RAGFlow with 8-16 task executor workers, most workers fail
to start properly. Checking logs revealed that workers were
stuck/hanging during Infinity connection initialization - only 1-2
workers would successfully register in Redis while the rest remained
blocked.

### Root Cause

The Infinity SDK `ConnectionPool` pre-allocates all connections in
`__init__`. With the default `max_size=32` and multiple workers (e.g.,
16), this creates 16×32=512 connections immediately on startup,
exceeding Infinity's default 128 connection limit. Workers hang while
waiting for connections that can never be established.

### Changes

1. **Prevent Infinity connection storm** (`rag/utils/infinity_conn.py`,
`rag/svr/task_executor.py`)
- Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since
operations are synchronous)
- Added staggered startup delay (2s per worker) to spread connection
initialization

2. **Handle None children_delimiter** (`rag/app/naive.py`)
   - Use `or ""` to handle explicitly set None values from parser config

3. **MinerU parser robustness** (`deepdoc/parser/mineru_parser.py`)
   - Use `.get()` for optional output fields that may be missing
- Fix DISCARDED block handling: change `pass` to `continue` to skip
discarded blocks entirely

### Why `max_size=4` is sufficient

| Workers | Pool Size | Total Connections | Infinity Limit |
|---------|-----------|-------------------|----------------|
| 16      | 32        | 512               | 128          |
| 16      | 4         | 64                | 128          |
| 32      | 4         | 128               | 128          |

- All RAGFlow operations are synchronous: `get_conn()` → operation →
`release_conn()`
- No parallel `docStoreConn` operations in the codebase
- Maximum 1-2 concurrent connections needed per worker; 4 provides
safety margin

### MinerU DISCARDED block bug

When MinerU returns blocks with `type: "discarded"` (headers, footers,
watermarks, page numbers, artifacts), the previous code used `pass`
which left the `section` variable undefined, causing:

- **UnboundLocalError** if DISCARDED is the first block
- **Duplicate content** if DISCARDED follows another block (stale value
from previous iteration)

**Root cause confirmed via MinerU source code:**

From
[`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14):
```python
class BlockType:
    DISCARDED = 'discarded'
    # VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE
```

Per [MinerU
documentation](https://opendatalab.github.io/MinerU/reference/output_files/),
discarded blocks contain content that should be filtered out for clean
text extraction.

**Fix:** Changed `pass` to `continue` to skip discarded blocks entirely.

### Testing

- Verified all 16 workers now register successfully in Redis
- All workers heartbeating correctly
- Document parsing works as expected
- MinerU parsing with DISCARDED blocks no longer crashes

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: user210 <user210@rt>
2025-12-18 10:03:30 +08:00
672958a192 Fix: model not authorized (#12001)
### What problem does this PR solve?

Fix model not authorized. #11973.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-17 19:48:24 +08:00
8e4d011b15 Fix: parent-children chunking method. (#11997)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-12-17 16:50:36 +08:00
0e8b9588ba Fix error and format issue (#11975)
### What problem does this PR solve?

1. Fix error of book chunking.
2. Fix format issues.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-16 19:29:37 +08:00
49c74d08e8 Feature/mineru improvements (#11938)
我已在下面的评论中用中文重复说明。

### What problem does this PR solve?

## Summary
This PR enhances the MinerU document parser with additional
configuration options, giving users more control over PDF parsing
behavior and improving support for multilingual documents.

## Changes

### Backend (`deepdoc/parser/mineru_parser.py`)
- Added configurable parsing options:
- **Parse Method**: `auto`, `txt`, or `ocr` — allows users to choose the
extraction strategy
- **Formula Recognition**: Toggle for enabling/disabling formula
extraction (useful to disable for Cyrillic documents where it may cause
issues)
- **Table Recognition**: Toggle for enabling/disabling table extraction
- Added language code mapping (`LANGUAGE_TO_MINERU_MAP`) to translate
RAGFlow language settings to MinerU-compatible language codes for better
OCR accuracy
- Improved parser configuration handling to pass these options through
the processing pipeline

### Frontend (`web/`)
- Created new `MinerUOptionsFormField` component that conditionally
renders when MinerU is selected as the layout recognition engine
- Added UI controls for:
  - Parse method selection (dropdown)
  - Formula recognition toggle (switch)
  - Table recognition toggle (switch)
- Added i18n translations for English and Chinese
- Integrated the options into both the dataset creation dialog and
dataset settings page

### Integration
- Updated `rag/app/naive.py` to forward MinerU options to the parser
- Updated task service to handle the new configuration parameters

## Why
MinerU is a powerful document parser, but the default settings don't
work well for all document types. This PR allows users to:
1. Choose the best parsing method for their documents
2. Disable formula recognition for Cyrillic/non-Latin scripts where it
causes issues
3. Control table extraction based on document needs
4. Benefit from automatic language detection for better OCR results

## Testing
- [x] Tested MinerU parsing with different parse methods
- [x] Verified UI renders correctly when MinerU is selected/deselected
- [x] Confirmed settings persist correctly in dataset configuration

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---------

Co-authored-by: user210 <user210@rt>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-16 13:15:25 +08:00
7d23c3aed0 Fix: presentation parsing & Embedding encode exception handling (#11933)
### What problem does this PR solve?

Fix: presentation parsing #11920
Fix: Embeddin encode exception handling
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-13 11:37:42 +08:00
e9710b7aa9 Refa: treat MinerU as an OCR model 2 (#11905)
### What problem does this PR solve?

Treat MinerU as an OCR model 2. #11903

### Type of change

- [x] Refactoring
2025-12-11 17:33:12 +08:00
ab4b62031f Fix:csv parse in Table (#11870)
### What problem does this PR solve?

change:
csv parse in Table

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-10 16:44:06 +08:00
ca2d6f3301 Fix: duplicate output by async_chat_streamly (#11842)
### What problem does this PR solve?

Fix: duplicate output by async_chat_streamly
Refact: revert manual modification

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-09 19:21:52 +08:00
a94b3b9df2 Refa: treat MinerU as an OCR model (#11849)
### What problem does this PR solve?

 Treat MinerU as an OCR model.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2025-12-09 18:54:14 +08:00
c51e6b2a58 Refa: migrate CV model chat to Async (#11828)
### What problem does this PR solve?

Migrate CV model chat to Async. #11750

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2025-12-09 13:08:37 +08:00
09a3854ed8 Fix: chunk method error. (#11807)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-08 14:28:23 +08:00
43f51baa96 Fix errors (#11804)
### What problem does this PR solve?

1. typos
2. grammar errors.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-08 12:21:18 +08:00
b66881a371 Refactor:book parser use with to handle bytesIO (#11800)
### What problem does this PR solve?

book parser use with to handle bytesIO

### Type of change

- [x] Refactoring
2025-12-08 10:18:46 +08:00
7719fd6350 Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702)
### What problem does this PR solve?

This PR addresses **two independent issues** encountered when using the
MinerU engine in Ragflow:

1. **MinerU API output path mismatch for non-ASCII filenames**
MinerU sanitizes the root directory name inside the returned ZIP when
the original filename contains non-ASCII characters (e.g., Chinese).
Ragflow's client-side unzip logic assumed the original filename stem and
therefore failed to locate `_content_list.json`.
   This PR adds:

   * root-directory detection
   * fallback lookup using sanitized names
   * a broadened `_read_output` search with a glob fallback
ensuring output files are consistently located regardless of filename
encoding.

2. **Chunker crash due to tuple-structure mismatch in manual mode**
Some parsers (e.g., MinerU / Docling) return **2-tuple sections**, but
Ragflow’s chunker expects **3-tuple sections**, leading to:
   `ValueError: not enough values to unpack (expected 3, got 2)`
This PR normalizes all sections to a uniform structure `(text, layout,
positions)`:

   * parse position tags when present
   * default to empty positions when missing
     preserving backward compatibility and preventing crashes.

### Type of change

* [x] Bug Fix (non-breaking change which fixes an issue)


[#11136](https://github.com/infiniflow/ragflow/issues/11136)
[#11700](https://github.com/infiniflow/ragflow/issues/11700)
[#11620](https://github.com/infiniflow/ragflow/issues/11620)
[#11701](https://github.com/infiniflow/ragflow/pull/11701)

we need your help [yongtenglei](https://github.com/yongtenglei)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-05 19:25:45 +08:00
257af75ece Fix: relative page_number in boxes (#11712)
page_number in boxes is relative page number,must + from_page

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-04 11:23:34 +08:00
4ba17361e9 feat: improve presentation PdfParser (#11639)
The old presentation PdfParser lost table format after parse

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:35:14 +08:00
14616cf845 Feat: add child parent chunking method in backend. (#11598)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 19:25:32 +08:00
9d8b96c1d0 Feat: add context for figure and table (#11547)
### What problem does this PR solve?

Add context for figure table.



![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e)


`==================()` for demonstrating purpose. 
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 10:21:44 +08:00
74e0b58d89 Fix: excel default optimization. (#11519)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:20 +08:00
7c20c964b4 Fix: incorrect image merging for naive markdown parser (#11520)
### What problem does this PR solve?

Fix incorrect image merging for naive markdown parser. #9349 


[ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:06 +08:00
41665b0865 Refactor: Email parser use with to handle buffer (#11496)
### What problem does this PR solve?
 Email parser use with to handle buffer

### Type of change

- [x] Refactoring
2025-11-25 10:03:37 +08:00
971c1bcba7 Fix: missing parameters in by_plaintext method for PDF naive mode (#11408)
### What problem does this PR solve?

FIx: missing parameters in by_plaintext method for PDF naive mode

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: lih <dev_lih@139.com>
2025-11-21 09:33:36 +08:00
d3d2ccc76c Feat: add more chunking method (#11413)
### What problem does this PR solve?

Feat: add more chunking method #11311

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 19:07:17 +08:00
420c97199a Feat: Add TCADP parser for PPTX and spreadsheet document types. (#11041)
### What problem does this PR solve?

- Added TCADP Parser configuration fields to PDF, PPT, and spreadsheet
parsing forms
- Implemented support for setting table result type (Markdown/HTML) and
Markdown image response type (URL/Text)
- Updated TCADP Parser to handle return format settings from
configuration or parameters
- Enhanced frontend to dynamically show TCADP options based on selected
parsing method
- Modified backend to pass format parameters when calling TCADP API
- Optimized form default value logic for TCADP configuration items
- Updated multilingual resource files for new configuration options

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 10:08:42 +08:00
fea157ba08 Fix: manual parser with mineru (#11336)
### What problem does this PR solve?

Fix: manual parser with mineru #11320
Fix: missing parameter in mineru #11334
Fix: add outlines parameter for pdf parsers

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-18 15:22:52 +08:00
2b9145948f Fix:not enough values to unpack (expected 3, got 2) in general chunk (#11139)
### What problem does this PR solve?
issue:
#11136
change:
not enough values to unpack (expected 3, got 2) in general chunk

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-10 15:08:24 +08:00
d207291217 Fix: add download stats to kb logs. (#11112)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-10 13:28:07 +08:00
f6aeebc608 Fix: cannot write mode RGBA as JPEG (#11102)
### What problem does this PR solve?
Fix #11091 
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-07 15:45:10 +08:00
b137de1def Fix: Plain parser is skipped (#11094)
### What problem does this PR solve?

plain parser skipeed

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-07 13:39:29 +08:00
2cb1046cbf fix: The doc file cannot be parsed(#11092) (#11093)
### What problem does this PR solve?

The doc file cannot be parsed(#11092)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: virgilwong <hyhvirgil@gmail.com>
2025-11-07 11:46:10 +08:00
4b8ce08050 Fix: fix pdf_parser ignored in rag/app/naive.py (#11065)
### What problem does this PR solve?

Fix: fix pdf_parser ignored in rag/app/naive.py #11000

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-06 15:20:35 +08:00
f98b24c9bf Move api.settings to common.settings (#11036)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-06 09:36:38 +08:00
121c51661d Fix: Markdown table extractor (#11018)
### What problem does this PR solve?

Now markdown table extractor supports <table ...>. #10966 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-05 16:10:21 +08:00
1a9215bc6f Move some vars to globals (#11017)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 14:14:38 +08:00
cf9611c96f Feat: Support more chunking methods (#11000)
### What problem does this PR solve?

Feat: Support more chunking methods #10772 

This PR enables multiple chunking methods — including books, laws,
naive, one, and presentation — to be used with all existing PDF parsers
(DeepDOC, MinerU, Docling, TCADP, Plain Text, and Vision modes).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-05 13:00:42 +08:00
bab3fce136 Move some constants to common (#11004)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 08:01:39 +08:00