Commit Graph

218 Commits

Author SHA1 Message Date
360f5c1179 Move token related functions to common (#10942)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 08:50:05 +08:00
44f2d6f5da Move 'get_project_base_directory' to common directory (#10940)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-02 21:05:28 +08:00
27f0d82102 Fix: opensearch retrieval error (#10891)
### What problem does this PR solve?

Fix: opensearch retrieval error #10828

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-30 17:30:54 +08:00
766d900a41 Refactor: rename rmSpace to remove_redundant_spaces (#10796)
### What problem does this PR solve?

- rename rmSpace to remove_redundant_spaces
- move clean_markdown_block to common module
- add unit tests for remove_redundant_spaces and clean_markdown_block

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-10-28 09:46:32 +08:00
e59458c36b Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819)
### What problem does this PR solve?

Fix: parsing excel with chartsheet #10815

Fix: Clamp begin to a minimum of 0 to prevent negative indexing #10804
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-28 09:40:37 +08:00
501b7d4d01 Fix: prio synonym match than wordnet for english (#10762)
### What problem does this PR solve?

Fix: prio synonym match than wordnet for english

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-27 09:32:55 +08:00
1d57801c0c Fix:ERROR 20 Method rag.nlp.search.Dealer.search() parameter highlight="None" violates type hint bool | list, as <class "builtins.NoneType"> "None" not list or bool. (#10743)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/10733

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-27 09:29:39 +08:00
ea73f13ebf Fix: infinity rerank error. (#10760)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-23 17:38:54 +08:00
863c3e3d9c Fix: tree merge (#10691)
### What problem does this PR solve?

Fix: Fix tree merge, solved #10636

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-21 13:02:01 +08:00
43ea312144 Fix: search highlight. (#10616)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-16 18:45:43 +08:00
e48bec1cbf Don't rerank for infinity (#10579)
### What problem does this PR solve?

Don't need rerank for infinity since Infinity normalizes each way score
before fusion.

### Type of change

- [x] Refactoring
2025-10-15 20:15:49 +08:00
7d2f65671f Feat: debugging toc part. (#10486)
### What problem does this PR solve?

#10436

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-11 18:45:21 +08:00
0d8791936e Feat: TOC retrieval (#10456)
### What problem does this PR solve?

#10436

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-10 17:07:55 +08:00
cbf04ee470 Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423)
### What problem does this PR solve?

#9869

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: jinhai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: chanx <1243304602@qq.com>
Co-authored-by: balibabu <cike8899@users.noreply.github.com>
Co-authored-by: Lynn <lynn_inf@hotmail.com>
Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com>
Co-authored-by: huangzl <huangzl@shinemo.com>
Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com>
Co-authored-by: Wilmer <33392318@qq.com>
Co-authored-by: Adrian Weidig <adrianweidig@gmx.net>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yongteng Lei <yongtengrey@outlook.com>
Co-authored-by: Liu An <asiro@qq.com>
Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com>
Co-authored-by: BadwomanCraZY <511528396@qq.com>
Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com>
Co-authored-by: Russell Valentine <russ@coldstonelabs.org>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Billy Bao <newyorkupperbay@gmail.com>
Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com>
Co-authored-by: TensorNull <tensor.null@gmail.com>
Co-authored-by: TeslaZY <TeslaZY@outlook.com>
Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com>
Co-authored-by: AB <aj@Ajays-MacBook-Air.local>
Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com>
Co-authored-by: He Wang <wanghechn@qq.com>
Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Mohamed Mathari <155896313+melmathari@users.noreply.github.com>
Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box>
Co-authored-by: Stephen Hu <stephenhu@seismic.com>
Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com>
Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com>
Co-authored-by: mxc <mxc@example.com>
Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com>
Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com>
Co-authored-by: mcoder6425 <mcoder64@gmail.com>
Co-authored-by: lemsn <lemsn@msn.com>
Co-authored-by: lemsn <lemsn@126.com>
Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com>
Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com>
Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>
2025-10-09 12:36:19 +08:00
9e323a9351 Feat(nlp): add "怎么办" pattern to question word removal (#10284)
### What problem does this PR solve?

Added "怎么办" to the regex pattern in rmWWW method to improve query
cleaning by removing this common question phrase along with other
question words.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-25 16:47:56 +08:00
ca9f30e1a1 Add tree_merge for law parsers, significantly outperforming hierarchical_merge (#10202)
### What problem does this PR solve?
Add tree_merge for law parsers, significantly outperforming
hierarchical_merge, solved: #8637
1. Add tree_merge for law parsers, include build_tree and get_tree by
dfs.
2. add Copyright statement for helath_utils
### Type of change

- [x] Documentation Update
- [x] Performance Improvement
2025-09-22 16:33:21 +08:00
0d9c1f1c3c Feat: dataflow supports Spreadsheet and Word processor document (#9996)
### What problem does this PR solve?

Dataflow supports Spreadsheet and Word processor document

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-10 13:02:53 +08:00
e9ee9269f5 Feat: user defined prompt. (#9972)
### What problem does this PR solve?


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-08 14:05:01 +08:00
4e16936fa4 Refactor: Use re compile for weight method (#9929)
### What problem does this PR solve?

Use re compile for the weight method

### Type of change

- [x] Refactoring
- [x] Performance Improvement
2025-09-05 12:29:44 +08:00
c27172b3bc Feat: init dataflow. (#9791)
### What problem does this PR solve?

#9790

Close #9782

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-08-28 18:40:32 +08:00
5abd0bbac1 Fix typo (#9766)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-08-27 18:56:40 +08:00
b5b8032a56 Feat: Support metadata auto filer for Search. (#9524)
### What problem does this PR solve?

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-08-19 10:27:24 +08:00
153e430b00 Feat: add meta data filter. (#9405)
### What problem does this PR solve?

#8531 
#7417 
#6761 
#6573
#6477

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-08-12 14:12:56 +08:00
0a0bfc02a0 Refactor:naive_merge_with_images close useless images (#9296)
### What problem does this PR solve?

naive_merge_with_images close useless images

### Type of change

- [x] Refactoring
2025-08-07 11:07:29 +08:00
7efeaf6548 Fix:remove a img close which can not operate (#9267)
### What problem does this PR solve?


https://github.com/infiniflow/ragflow/issues/9149#issuecomment-3157129587

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-08-06 10:59:49 +08:00
667c5812d0 Fix:Repeated images when parsing markdown files with images (#9196)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/9149

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-08-04 13:35:58 +08:00
d9fe279dde Feat: Redesign and refactor agent module (#9113)
### What problem does this PR solve?

#9082 #6365

<u> **WARNING: it's not compatible with the older version of `Agent`
module, which means that `Agent` from older versions can not work
anymore.**</u>

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-07-30 19:41:09 +08:00
342a04ec8a Added infinity rank_feature support (#9044)
### What problem does this PR solve?

Added infinity rank_feature support

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-07-29 09:14:23 +08:00
dbc2a8689a Fix: no chunks parsed out for Law (#8842)
### What problem does this PR solve?

Fixes no chunks parsed out for Law. #5113 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-07-15 13:01:56 +08:00
f569401398 Fix: better_handle_different_types (#8775)
### What problem does this PR solve?


https://github.com/infiniflow/ragflow/issues/8719#issuecomment-3055883271

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-07-11 18:21:39 +08:00
00c954755e Fix:use the same logic to handle pos in tokenize_chunks_with_images (#8732)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/8719

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-07-09 09:31:40 +08:00
8af0d04ad0 Refactor:Improve the logic in search.py (#8716)
### What problem does this PR solve?

1. Remove the useless pop logic due to already been checked at the if
logic
2. merge log logic

### Type of change

- [x] Refactoring
2025-07-08 12:32:01 +08:00
b705ff08fe Refa: improve GraphRAG similarity sensitivity to numeric differences (#8479)
### What problem does this PR solve?

Improve GraphRAG similarity sensitivity to numeric differences. #8444.

### Type of change

- [x] Refactoring
2025-06-25 16:20:59 +08:00
d4e6e2bd21 Fix: doc_aggs issue. (#8418)
### What problem does this PR solve?

#8406

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-06-23 14:54:01 +08:00
3d0b440e9f fix(search.py):remove hard page_size (#8242)
### What problem does this PR solve?

Fix the restriction of forcing similarity_threshold=0 and page_size=30
when doc_ids is not empty

#8228

---------

Co-authored-by: shiqing.wusq <shiqing.wusq@dtzhejiang.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-06-13 14:56:25 +08:00
93f5df716f Fix: order chunks from docx by positions. (#7979)
### What problem does this PR solve?

#7934

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-30 17:20:53 +08:00
bd4678bca6 Fix: Unnecessary truncation in markdown parser (#7972)
### What problem does this PR solve?

Fix unnecessary truncation in markdown parser. So that markdown can work
perfectly like
[this](https://github.com/infiniflow/ragflow/issues/7824#issuecomment-2921312576)
in #7824, supporting multiple special delimiters.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-30 15:04:21 +08:00
46963ab1ca Fix: add advanced delimiter detection for naive merge (#7941)
### What problem does this PR solve?

Add advanced delimiter detection for naive merge. #7824

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-05-29 16:17:22 +08:00
0c562f0a9f Refa: change citation mark as [ID:n] (#7923)
### What problem does this PR solve?

Change citation mark as [ID:n], it's easier for LLMs to follow the
instruction :) #7904

### Type of change

- [x] Refactoring
2025-05-29 10:03:51 +08:00
Sol
0d7cfce6e1 Update rag/nlp/query.py (#7816)
### What problem does this PR solve?
Fix tokenizer resulting in low recall

![37743d3a495f734aa69f1e173fa77457](https://github.com/user-attachments/assets/1394757e-8fcb-4f87-96af-a92716144884)

![4aba633a17f34269a4e17e84fafb34c4](https://github.com/user-attachments/assets/a1828e32-3e17-4394-a633-ba3f09bd506d)

![image](https://github.com/user-attachments/assets/61308f32-2a4f-44d5-a034-d65bbec554ef)



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-05-23 17:13:37 +08:00
db4371c745 Fix: Improve First Chunk Size (#7806)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/7790

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-23 14:30:19 +08:00
d4a123d6dd Fix: resolve regex library warnings (#7782)
### What problem does this PR solve?
This small PR resolves the regex library warnings showing in Python3.11:
```python
DeprecationWarning: 'count' is passed as positional argument
```

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-22 10:06:28 +08:00
321a280031 Feat: add image preview to retrieval test. (#7610)
### What problem does this PR solve?

#7608

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-05-13 14:30:36 +08:00
573d46a4ef FIX:ZeroDivisionError when using large page_size in client.retrieve() (#7595)
### What problem does this PR solve?

Close #7592

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-13 10:46:31 +08:00
a14865e6bb Fix: empty query issue. (#7551)
### What problem does this PR solve?

#5214

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-09 12:20:19 +08:00
c7310f7fb2 Refa: similarity calculations. (#7381)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2025-04-28 19:17:11 +08:00
1662c7eda3 Feat: Markdown add image (#7124)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/6984

1. Markdown parser supports get pictures
2. For Native, when handling Markdown, it will handle images
3. improve merge and 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-04-25 18:35:28 +08:00
67dee2d74e Fix: fix retrieval tesing wrong pagination (#7174)
### What problem does this PR solve?

Fix retrieval testing wrong pagination. #7171 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-04-22 15:16:04 +08:00
d9266ed65a Fix: incorrect total chunks count in retrieval function after similarity filtering (#6741) (#6932)
### Related Issue:
https://github.com/infiniflow/ragflow/issues/6741

### Environment:
Using nightly version
Commit version:
[[6051abb](6051abb4a3)]

### Bug Description:
The retrieval function in rag/nlp/search.py returns the original total
chunks number
even after chunks are filtered by similarity_threshold. This creates
inconsistency
between the actual returned chunks and the reported total.

### Changes Made:
Added code to count how many search results actually meet or exceed the
configured similarity threshold
Positioned the calculation after the doc_ids conditional logic to ensure
special cases are handled correctly
Updated the ranks["total"] value to store this filtered count instead of
using the raw search result count
Using NumPy leverages optimized C-level batch operations to optimize
speed
2025-04-11 12:31:36 +08:00
ead5f7aba9 Fix infinite recursion in RagTokenizer when processing repetitive characters (#6109)
### What problem does this PR solve?
fix #6085 
RagTokenizer's dfs_() function falls into infinite recursion when
processing text with repetitive Chinese characters (e.g.,
"一一一一一十一十一十一..." or "一一一一一一十十十十十十十二十二十二..."), causing memory leaks.
### Type of change
Implemented three optimizations to the dfs_() function:
1.Added memoization with _memo dictionary to cache computed results
2.Added recursion depth limiting with _depth parameter (max 10 levels)
3.Implemented special handling for repetitive character sequences
- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-04-01 13:59:52 +08:00