Commit Graph

61 Commits

Author SHA1 Message Date
1254ecf445 Added static check at PR CI (#3921)
### What problem does this PR solve?

Added static check at PR CI

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2024-12-08 21:23:51 +08:00
0d68a6cd1b Fix errors detected by Ruff (#3918)
### What problem does this PR solve?

Fix errors detected by Ruff

### Type of change

- [x] Refactoring
2024-12-08 14:21:12 +08:00
7b6a5ffaff Fix: page_chars attribute does not exist in some formats of PDF (#3796)
### What problem does this PR solve?

In #3335 someone suggested to upgrade pdfplumber==0.11.1, but that
didn't solve it.
It's actually the special formatting in some of the pdfs that triggers
the problem.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-03 11:08:06 +08:00
7058ac0041 Fix out of boundary. (#3786)
### What problem does this PR solve?

#3769
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-02 11:38:53 +08:00
bc701d7b4c Edit chunk shall update instead of insert it (#3709)
### What problem does this PR solve?

Edit chunk shall update instead of insert it. Close #3679 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-28 13:00:38 +08:00
cad341e794 Added kb_id filter to knn. Fix #3458 (#3513)
### What problem does this PR solve?

Added kb_id filter to knn. Fix #3458

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-20 20:53:30 +08:00
1e90a1bf36 Move settings initialization after module init phase (#3438)
### What problem does this PR solve?

1. Module init won't connect database any more.
2. Config in settings need to be used with settings.CONFIG_NAME

### Type of change

- [x] Refactoring

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-11-15 17:30:56 +08:00
30f6421760 Use consistent log file names, introduced initLogger (#3403)
### What problem does this PR solve?

Use consistent log file names, introduced initLogger

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-11-14 17:13:48 +08:00
a2a5631da4 Rework logging (#3358)
Unified all log files into one.

### What problem does this PR solve?

Unified all log files into one.

### Type of change

- [x] Refactoring
2024-11-12 17:35:13 +08:00
bfc07fe4f9 bigger resolution for OCR (#2919)
### What problem does this PR solve?



### Type of change

- [x] Performance Improvement
2024-10-21 16:25:42 +08:00
66172cef3e fix: torch dependency start error (#2777)
### What problem does this PR solve?

when use slim image, remove ```torch``` denpendency.

### Type of change

- [✓] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: chongchuanbing <chongchuanbing@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-10-10 10:06:03 +08:00
b68d349bd6 Fix: renrank_model and pdf_parser bugs | Update: session API (#2601)
### What problem does this PR solve?

Fix: renrank_model and pdf_parser bugs | Update: session API
#2575
#2559
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>
2024-09-26 16:05:25 +08:00
7bb28ca2bd add lighten control (#2567)
### What problem does this PR solve?

#2295

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-09-24 19:22:01 +08:00
7e75b9d778 fix parsing spaces in russian language PDFs (#1987) (#2427)
### What problem does this PR solve?

[#1987](https://github.com/infiniflow/ragflow/issues/1987)

When scanning PDF files character by character, the parser excluded
spaces if the string did not match regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs spaces, but it does not match the regex because it uses different
alphabet. That's why PDFs were parsed incorrectly and were almost
unusable as source. Fixed that by adding Russian alphabet to regex.

There might be problems with other languages that use different
alphabets. I additionally tested [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816)
and old [a-zA-Z...] regex parses it correctly with spaces.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-09-14 13:14:39 +08:00
H
0cb588f7bf Fix docx parser line bug (#1715)
### What problem does this PR solve?
#1704 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-07-29 10:06:02 +08:00
ebdd71ce68 fix: When parsing the bold content in PDF, the result is duplicated. (#1729)
### What problem does this PR solve?

_fix: When parsing the bold content in PDF, the result is duplicated._

the detail: [When using OCR to recognize Chinese titles, the structure
appears to be
duplicated](https://github.com/infiniflow/ragflow/issues/1718)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-29 09:43:05 +08:00
H
b24abee364 Fix pdfparser content confusion (#1700)
### What problem does this PR solve?

#1407 #1656 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-25 11:40:23 +08:00
100b3165d8 pypdf2 to pypdf (#1684)
### What problem does this PR solve?

pypdf and PyPDF2 possible Infinite Loop when a comment isn't followed by
a character #59

### Type of change

- [x] Refactoring
2024-07-24 12:38:48 +08:00
d29fd52e14 fix bug about divided by zero (#1482)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-12 12:59:56 +08:00
7f4c63d102 fix: Delete hardcode (#1464)
### What problem does this PR solve?

After checking the language of the pdf, the line will hardcode the
language into Chinese

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 15:41:31 +08:00
H
2290c2a2f0 fix pdf_paser char content confusion (#1462)
### What problem does this PR solve?

#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 14:37:55 +08:00
H
dbb8f7b77b fix pdf_parser content confusion (#1458)
### What problem does this PR solve?

#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 12:36:55 +08:00
45853505bb Fix occasional errors in pdf table recognition (#1277)
### What problem does this PR solve?

Fix occasional errors in pdf table recognition

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-06-27 14:37:58 +08:00
4454ba7a1e add self-rag (#1070)
### What problem does this PR solve?

#1069 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-06-06 11:13:39 +08:00
cdea1d0a85 Update readme and add license (#1018)
### What problem does this PR solve?

- Update readme
- Add license

### Type of change

- [x] Documentation Update

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-06-01 16:24:10 +08:00
843720f958 fix bug in pdf parser (#986)
### What problem does this PR solve?

#963 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-30 11:47:36 +08:00
7eee193956 fix #917 #915 (#946)
### What problem does this PR solve?

#917 
#915

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-28 11:13:02 +08:00
3bbdf3b770 fixbug for computing 'not concating feature' (#896)
### What problem does this PR solve?

When pdfparser call `_naive_vertical_merge` method,there is a "not
concating feature " value by computing difference between `b` and `b_`'s
layoutno ,but actually is `b` and `b`. I think it's a bug, so fix it.
Please check again.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-23 14:29:42 +08:00
99be226c7c fix coordinate error (#686)
### What problem does this PR solve?

#683 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-08 20:00:14 +08:00
cab274f560 remove PyMuPDF (#618)
### What problem does this PR solve?
#613 

### Type of change


- [x] Other (please describe):
2024-04-30 12:38:09 +08:00
8c07992b6c refine code (#595)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 19:13:33 +08:00
d589b0f568 fix exception in pdf parser (#584)
### What problem does this PR solve?
#451 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 14:23:53 +08:00
9d60a84958 refactor code (#583)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 13:19:54 +08:00
66f8d35632 Refactor (#537)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-25 14:14:28 +08:00
0dfc8ddc0f enlarge docker memory usage (#501)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-23 14:41:10 +08:00
962c66714e fix divide by zero bug (#447)
### What problem does this PR solve?

#445 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-19 11:26:38 +08:00
39f1feaccb Bug fix pdf parse index out of range (#440)
### What problem does this PR solve?

fix a bug comes when parse some pdf file #436 

### Type of change

- [☑️ ] Bug Fix (non-breaking change which fixes an issue)
2024-04-19 08:44:51 +08:00
0499a3f621 rm page number exception for pdf parser (#424)
### What problem does this PR solve?

#423 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-18 12:09:56 +08:00
453c29170f make sure the models will not be load twice (#422)
### What problem does this PR solve?

#381 
### Type of change

- [x] Refactoring
2024-04-18 09:37:23 +08:00
a5384446e3 let's load model from local (#163) 2024-03-28 16:10:47 +08:00
fd7fcb5baf apply pep8 formalize (#155) 2024-03-27 11:33:46 +08:00
979b3a5b4b support snapshot download from local (#153)
* support snapshot download from local

* let snapshot download from local
2024-03-27 09:53:42 +08:00
da21320b88 fix plainPdf bugs (#152) 2024-03-26 15:11:07 +08:00
71fe314955 refine page ranges (#147) 2024-03-25 13:11:57 +08:00
f6aee7f230 add use layout or not option (#145)
* add use layout or not option

* trival
2024-03-22 19:21:09 +08:00
6c6b144de2 refine manual parser (#140) 2024-03-21 18:17:32 +08:00
6999598101 refine for English corpus (#135) 2024-03-20 16:56:16 +08:00
9a843667b3 fix github account login issue (#132) 2024-03-19 15:31:47 +08:00
9da671b951 refine manul parser (#131) 2024-03-19 12:26:04 +08:00
675a9f8d9a add dockerfile for cuda envirement. Refine table search strategy, (#123) 2024-03-14 19:45:29 +08:00