161 Commits

Author SHA1 Message Date
321a280031 Feat: add image preview to retrieval test. (#7610)
### What problem does this PR solve?

#7608

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-05-13 14:30:36 +08:00
baa108f5cc Fix: markdown table conversion error (#7570)
### What problem does this PR solve?

Since `import markdown.markdown` has been changed to `import markdown`
in `rag/app/naive.py`, the previous code for converting markdown tables
called the `markdown` module instead of a callable function. This causes
an error.
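
The failure mode described here can be reproduced with any module. The following is a minimal sketch that uses the standard-library `json` module as a stand-in for the `markdown` package; the function names are hypothetical and are not taken from `rag/app/naive.py`.

```python
import json


def call_module_by_mistake():
    """Calling a module object raises TypeError, as in the bug described above."""
    try:
        json({"a": 1})  # wrong: `json` is a module, not a function
    except TypeError as exc:
        return str(exc)
    return None


def call_function():
    """Qualifying the call with the function name works."""
    return json.dumps({"a": 1})


print(call_module_by_mistake())
print(call_function())
```

With the `markdown` package the same pattern would presumably apply: after `import markdown`, the callable is `markdown.markdown(text)`.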

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2025-05-12 17:16:55 +08:00
5352bdf4da Error storing tag in Redis (#7541)
### What problem does this PR solve?

The parameter positions were incorrect and have been corrected to use
keyword argument passing.
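
As an illustration of why keyword arguments avoid this class of bug, here is a small sketch. `store_tag` and its parameters are hypothetical stand-ins, not the project's actual Redis helper.

```python
def store_tag(key, value, ttl_seconds=60):
    """Hypothetical stand-in for a cache write; returns what would be stored."""
    return {"key": key, "value": value, "ttl_seconds": ttl_seconds}


# Positional call with swapped arguments: runs without error but stores the wrong data.
wrong = store_tag("tag:doc1", 3600, "finance")

# Keyword call: the argument order no longer matters.
right = store_tag(ttl_seconds=3600, key="tag:doc1", value="finance")

print(wrong)
print(right)
```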

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-09 10:17:09 +08:00
1a5608d0f8 Fix: Add title_tks for Pictures (#7365)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/7362

append title_tks
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2025-04-28 13:35:34 +08:00
1662c7eda3 Feat: Markdown add image (#7124)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/6984

1. The Markdown parser supports extracting pictures
2. For Native, images are handled when processing Markdown
3. improve merge and 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-04-25 18:35:28 +08:00
1b4016317e fix bug chunking:expected string or bytes-like object (#7116)

### What problem does this PR solve?
Fix bug #6990: internal server error while chunking: expected string or
bytes-like object
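
The message "expected string or bytes-like object" is what Python's `re` functions raise when given a value such as `None`. The sketch below shows one defensive pattern; `collapse_whitespace` is a hypothetical helper, and the actual fix in this PR may differ.

```python
import re


def collapse_whitespace(text):
    """Collapse whitespace runs, tolerating None and non-string input."""
    if text is None:
        return ""
    if not isinstance(text, str):
        text = str(text)  # assumption: coercing is acceptable for this input
    return re.sub(r"\s+", " ", text).strip()


print(repr(collapse_whitespace(None)))
print(repr(collapse_whitespace("  a \n b  ")))
```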

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Co-authored-by: unknown <taoshi.ln@chinatelecom.cn>
2025-04-18 14:42:36 +08:00
ed5f81b02e Fix: abnormal cell merging. (#6991)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-04-14 11:00:11 +08:00
5aae73c230 Make error messages during PPT processing clearer. (#6980)
### What problem does this PR solve?

Sometimes a slide may trigger a Proxy error (ArgumentException:
Parameter is not valid) due to issues in the original file, and this
error message can be confusing for users.
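
One way to make such errors clearer is to re-raise them with the slide number attached. This is only a sketch under assumptions: `extract_slide_texts` and the `extract` callback are hypothetical, and the PR's actual change is not shown in this log.

```python
def extract_slide_texts(slides, extract):
    """Apply `extract` to each slide and report which slide failed."""
    texts = []
    for number, slide in enumerate(slides, start=1):
        try:
            texts.append(extract(slide))
        except Exception as exc:
            # Keep the original exception as the cause for debugging.
            raise RuntimeError(f"Failed to process slide {number}: {exc}") from exc
    return texts


print(extract_slide_texts(["intro", "summary"], str.upper))
```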

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [x] Other (please describe):
2025-04-14 10:10:20 +08:00
14a3efd756 Fix: docx image exceptions. (#6839)
### What problem does this PR solve?

Close #6784

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-04-07 12:33:34 +08:00
ee5aa51d43 Fix: point in tag issue. (#6436)
### What problem does this PR solve?

#6414

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-24 10:45:29 +08:00
0e0ebaac5f Feat: Adds hierarchical title path tracking for tables in DOCX documents to improve context association (#6374)
### What problem does this PR solve?

Adds hierarchical title path tracking for tables in DOCX documents to
improve context association. Previously, extracted tables lacked
positional context within document structure.
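
The idea of a hierarchical title path can be sketched with a heading stack. The block representation below (tuples of kind, level, content) is an assumption made for illustration; the real parser works on DOCX objects.

```python
def table_title_paths(blocks):
    """Return (title_path, table) pairs for blocks given in document order.

    `blocks` is a hypothetical list of ("heading", level, text) and
    ("table", None, payload) tuples.
    """
    stack = []  # (level, text) of the currently open headings
    result = []
    for kind, level, content in blocks:
        if kind == "heading":
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, content))
        elif kind == "table":
            path = " > ".join(text for _, text in stack)
            result.append((path, content))
    return result


blocks = [
    ("heading", 1, "Report"),
    ("heading", 2, "Revenue"),
    ("table", None, "T1"),
    ("heading", 2, "Costs"),
    ("heading", 3, "Staff"),
    ("table", None, "T2"),
]
print(table_title_paths(blocks))
```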

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-21 18:42:36 +08:00
95497b4aab Fix: adapt to old configurations. (#6321)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-20 14:50:59 +08:00
9611185eb4 Feat: add VLM-boosted DocX parser (#6307)
### What problem does this PR solve?

Add VLM-boosted DocX parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 11:24:44 +08:00
e4380843c4 Feat: add fallback for PDF figure parser (#6305)
### What problem does this PR solve?

Add fallback for PDF figure parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 10:48:38 +08:00
1d6760dd84 Feat: add VLM-boosted PDF parser (#6278)
### What problem does this PR solve?

Add VLM-boosted PDF parser if VLM is set.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 09:39:32 +08:00
5cf610af40 Feat: add vision LLM PDF parser (#6173)
### What problem does this PR solve?

Add vision LLM PDF parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-03-18 14:52:20 +08:00
1333d3c02a Fix: float transfer exception. (#6197)
### What problem does this PR solve?

#6177

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-18 11:13:44 +08:00
3a99c2b5f4 Refa: PARALLEL_DEVICES is a static parameter. (#6168)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2025-03-17 16:49:54 +08:00
bfa8d342b3 Fix: retrieval debug mode issue. (#6150)
### What problem does this PR solve?

#6139

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-17 13:07:13 +08:00
3e19044dee Feat: add OCR's multi-GPU and parallel processing support (#5972)
### What problem does this PR solve?

Add OCR's multi-GPU and parallel processing support

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

@yuzhichang I've tried to resolve the comments in #5697. OCR jobs can
now be done on both CPU and GPU. ( By the way, I've encountered a
“Generate embedding error” issue #5954 that might be due to my outdated
GPUs? idk. ) Please review it and give me suggestions.

GPU:

![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e)

![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d)

CPU:

![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)
2025-03-17 11:58:40 +08:00
4ff609b6a8 Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027)
### What problem does this PR solve?

Optimize OCR garbage identification to reduce unnecessary filtering.
#5713

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-13 18:48:32 +08:00
7cd37c37cd Feat: add CSV file parsing support (#5989)
### What problem does this PR solve?

Add CSV file parsing support #4552, #5849, #5870

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-12 19:20:50 +08:00
b0c21b00d9 Refactor: Optimize error handling and support parsing of XLS (Excel 97-2003) files. (#5633)
Optimize error handling and support parsing of XLS (Excel 97-2003) files.
2025-03-05 11:55:27 +08:00
b418ce5643 Fix table parser issue. (#5482)
### What problem does this PR solve?

#1475
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-28 16:09:12 +08:00
4f40f685d9 Code refactor (#5371)
### What problem does this PR solve?

#5173

### Type of change

- [x] Refactoring
2025-02-26 15:40:52 +08:00
c28bc41a96 Fix docx table issue. (#5117)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-19 12:40:06 +08:00
c24137bd11 Fix too long integer for Table. (#4651)
### What problem does this PR solve?

#4594

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-26 12:54:58 +08:00
9d717f0b6e Fix csv reader exception. (#4628)
### What problem does this PR solve?

#4552
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-24 14:47:19 +08:00
13f04b7cca Fix pdf applying Q&A issue. (#4599)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-23 12:30:46 +08:00
dd0ebbea35 Light GraphRAG (#4585)
### What problem does this PR solve?

#4543

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-01-22 19:43:14 +08:00
3894de895b Update comments (#4569)
### What problem does this PR solve?

Add license statement.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-01-21 20:52:28 +08:00
f556f0239c Fix dify retrieval issue. (#4473)
### What problem does this PR solve?

#4464
#4469 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-14 13:16:05 +08:00
e098fcf6ad Fix csv for TAG. (#4454)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-13 12:03:18 +08:00
c5da3cdd97 Tagging (#4426)
### What problem does this PR solve?

#4367

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-01-09 17:07:21 +08:00
50f209204e Synchronize with enterprise version (#4325)
### Type of change

- [x] Refactoring
2025-01-02 13:44:44 +08:00
8fb18f37f6 Code refactor. (#4291)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-12-30 18:38:51 +08:00
dd13a5d05c Fix some bugs in text2sql.(#4279)(#4281) (#4280)
Fix some bugs in text2sql.(#4279)(#4281)

### What problem does this PR solve?
- Fix incorrect results when parsing CSV files of the QA knowledge base
in the text2sql scenario: process CSV files using the csv library and
decouple CSV parsing from TXT parsing.
- Most LLMs return results in Markdown format (```sql query ```). Fix the
execution error caused by SQL wrapped in Markdown in the LLM output.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
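
Both fixes described in this PR can be sketched briefly. `extract_sql` and `read_qa_rows` are hypothetical helpers, not the project's code; the fence string is built programmatically only so that this example does not nest code fences.

```python
import csv
import io
import re

FENCE = "`" * 3  # three backticks

_FENCED = re.compile(
    FENCE + r"(?:sql\b)?\s*(.*?)\s*" + FENCE, re.DOTALL | re.IGNORECASE
)


def extract_sql(llm_output):
    """Return the SQL inside a Markdown fence, or the stripped text if unfenced."""
    match = _FENCED.search(llm_output)
    return (match.group(1) if match else llm_output).strip()


def read_qa_rows(text):
    """Parse question/answer pairs with the csv module, which handles quoted commas."""
    rows = csv.reader(io.StringIO(text))
    return [(row[0], row[1]) for row in rows if len(row) >= 2]


print(extract_sql(FENCE + "sql\nSELECT 1;\n" + FENCE))
print(read_qa_rows('"How many, total?",SELECT COUNT(*) FROM t\n'))
```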
2024-12-30 10:32:19 +08:00
101b8ff813 fix chunk method "Table" losing content when the Excel file has multiple sheets (#4123)

### What problem does this PR solve?
discussed in https://github.com/infiniflow/ragflow/pull/4102
- In excel_parser.py, `total` means the total number of rows in the Excel
file, but it was returned on the first iteration, which led to a wrong
`to_page`.
- In table.py, when the Excel file has multiple sheets, it is divided
into multiple parts of size 3000 each; `data` may be empty because its
rows were already recorded in the previous iteration.
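
The two problems listed here can be illustrated without an Excel library. In this sketch, a workbook is modeled as a hypothetical dict of sheet name to row list; it is not the code from `excel_parser.py` or `table.py`.

```python
def total_rows(sheets):
    """Sum row counts over every sheet instead of returning after the first one."""
    total = 0
    for rows in sheets.values():
        total += len(rows)
    return total


def batches(sheets, size=3000):
    """Yield row batches of at most `size` rows across all sheets; none are empty."""
    all_rows = [row for rows in sheets.values() for row in rows]
    for start in range(0, len(all_rows), size):
        yield all_rows[start:start + size]


sheets = {"a": [1, 2, 3], "b": [4, 5]}
print(total_rows(sheets))
print(list(batches(sheets, size=2)))
```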
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-19 17:30:26 +08:00
1d65299791 Fix rerank_model bug in chat and markdown bug (#4061)
### What problem does this PR solve?

Fix rerank_model bug in chat and markdown bug
#4000
#3992
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>
2024-12-17 16:03:37 +08:00
03f00c9e6f Rename page_num_list, top_list, position_list (#3940)
### What problem does this PR solve?

Rename page_num_list, top_list, position_list to page_num_int, top_int,
position_int

### Type of change

- [x] Refactoring
2024-12-10 16:32:58 +08:00
927873bfa6 Fix syn error. (#3953)
### What problem does this PR solve?

Close #3696
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-10 10:54:54 +08:00
0d68a6cd1b Fix errors detected by Ruff (#3918)
### What problem does this PR solve?

Fix errors detected by Ruff

### Type of change

- [x] Refactoring
2024-12-08 14:21:12 +08:00
821fdf02b4 Fix parsing JSON file error (#3829)
### What problem does this PR solve?

Close issue: #3828

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-12-03 19:02:03 +08:00
08c1a5e1e8 Refactor parse progress (#3781)
### What problem does this PR solve?

Refactor parse file progress

### Type of change

- [x] Refactoring

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-12-01 22:28:00 +08:00
e079656473 Update progress info and start welcome info (#3768)
### What problem does this PR solve?


### Type of change

- [x] Refactoring

---------

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-11-30 18:48:06 +08:00
e678819f70 Fix RGBA error (#3707)
### What problem does this PR solve?

**The image passed to cv_mdl.describe() was not converted to RGB**

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-28 13:09:02 +08:00
bc701d7b4c Edit chunk shall update instead of insert it (#3709)
### What problem does this PR solve?

Edit chunk shall update instead of insert it. Close #3679 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-28 13:00:38 +08:00
609236f5c1 Let 'One' applicable for tables in docx (#3619)
### What problem does this PR solve?

#3598

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement
2024-11-25 09:57:54 +08:00
482c1b59c8 Check tika.parser return result (#3564)
### What problem does this PR solve?

Check tika.parser return result. Close #3229
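
A check of this kind might look like the sketch below. It assumes the parser returns a dict whose `"content"` entry can be missing or `None`; `text_from_parse_result` is a hypothetical helper and operates on a plain dict rather than calling Tika.

```python
def text_from_parse_result(result):
    """Return extracted text, or "" when the parser produced nothing usable."""
    if not isinstance(result, dict):
        return ""
    content = result.get("content")
    return content.strip() if isinstance(content, str) else ""


print(repr(text_from_parse_result({"content": None})))
print(repr(text_from_parse_result({"content": "  hello  "})))
```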

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2024-11-22 11:05:06 +08:00
c4f2464935 fix: laws.py added missing import logging (#3501)
### What problem does this PR solve?

_Choosing Laws Chunk Method results in an error when parsing a document.
The error is caused by a missing import in the `laws.py` file._

```
Traceback (most recent call last):
  File "/ragflow/rag/svr/task_executor.py", line 445, in handle_task
    do_handle_task(task)
  File "/ragflow/rag/svr/task_executor.py", line 384, in do_handle_task
    cks = build(r)
          ^^^^^^^^
  File "/ragflow/rag/svr/task_executor.py", line 196, in build
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ragflow/rag/app/laws.py", line 161, in chunk
    for txt, poss in pdf_parser(filename if not binary else binary,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ragflow/rag/app/laws.py", line 124, in __call__
    logging.debug("layouts:".format(
    ^^^^^^^
NameError: name 'logging' is not defined. Did you forget to import 'logging'

```
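
The fix amounts to importing `logging` at module level. The sketch below also notes a second detail visible in the traceback: the string `"layouts:"` has no `{}` placeholder, so `str.format` would silently drop its arguments. `describe_layouts` is a hypothetical function used only for illustration.

```python
import logging  # the import that was missing in the traceback above


def describe_layouts(layouts):
    """Log the layouts and return the formatted message."""
    # Lazy %-style arguments let logging build the message only when needed.
    logging.debug("layouts: %s", layouts)
    return "layouts: {}".format(layouts)


print("layouts:".format([1, 2]))  # the argument is ignored without a placeholder
print(describe_layouts([1, 2]))
```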

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Co-authored-by: Michal Masrna <m.marna1@gmail.com>
2024-11-20 20:52:05 +08:00