Commit Graph

26 Commits

Author SHA1 Message Date
0d9c1f1c3c Feat: dataflow supports Spreadsheet and Word processor document (#9996)
### What problem does this PR solve?

Dataflow supports Spreadsheet and Word processor document

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-10 13:02:53 +08:00
1ee9c0b8d9 fix xss in excel_parser (#9909)
### What problem does this PR solve?



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Performance Improvement

Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-09-05 09:58:03 +08:00
c27172b3bc Feat: init dataflow. (#9791)
### What problem does this PR solve?

#9790

Close #9782

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-08-28 18:40:32 +08:00
eef43fa25c Fix: unexpected truncated Excel files (#9500)
### What problem does this PR solve?

Handle unexpected truncated Excel files.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-08-15 17:00:34 +08:00
79e2edc835 Fix "File contains no valid workbook part" (#9360)
### What problem does this PR solve?

fix "File contains no valid workbook part"

stacktrace:
```
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/excel_parser.py", line 54, in _load_excel_to_workbook
    return RAGFlowExcelParser._dataframe_to_workbook(df)
  File "/ragflow/deepdoc/parser/excel_parser.py", line 69, in _dataframe_to_workbook
    ws.cell(row=row_num, column=col_num, value=value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/worksheet/worksheet.py", line 246, in cell
    cell.value = value
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 218, in value
    self._bind_value(value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 197, in _bind_value
    value = self.check_string(value)
  File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 165, in check_string
    raise IllegalCharacterError(f"{value} cannot be used in worksheets.")
```

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2025-08-12 14:58:36 +08:00
569ab011c4 Add fallback to use 'calamine' parse engine in excel_parser.py (#9374)
### What problem does this PR solve?

add fallback to `calamine` engine when parse error raised using the
default `openpyxl` / `xlrd` engine.
e.g. the following error can be fixed:
```
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/excel_parser.py", line 53, in _load_excel_to_workbook
    df = pd.read_excel(file_like_object)
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 495, in read_excel
    io = ExcelFile(
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1567, in __init__
    self._reader = self._engines[engine](
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 46, in __init__
    super().__init__(
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 573, in __init__
    self.book = self.load_workbook(self.handles.handle, engine_kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 63, in load_workbook
    return open_workbook(file_contents=data, **engine_kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/__init__.py", line 172, in open_workbook
    bk = open_workbook_xls(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 68, in open_workbook_xls
    bk.biff2_8_load(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 641, in biff2_8_load
    cd.locate_named_stream(UNICODE_LITERAL(qname))
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 398, in locate_named_stream
    result = self._locate_stream(
  File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 429, in _locate_stream
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4
```

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-08-12 12:41:33 +08:00
03daf4618c Refactor parser code (#9042)
### What problem does this PR solve?

Refactor code

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-07-25 12:04:07 +08:00
0b48a2e0d1 Fix: When Excel is a formula, the parsed result is a formula, but cannot be correctly parsed as a value type (#6613)
### What problem does this PR solve?

Fix: When Excel is a formula, the parsed result is a formula, but cannot
be correctly parsed as a value type

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: tangyu <1@1.com>
2025-03-28 09:33:49 +08:00
7cd37c37cd Feat: add CSV file parsing support (#5989)
### What problem does this PR solve?

Add CSV file parsing support #4552, #5849, #5870

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-12 19:20:50 +08:00
b0c21b00d9 Refactor: Optimize error handling and support parsing of XLS(EXCEL97—2003) files. (#5633)
Optimize error handling and support parsing of XLS(EXCEL97—2003) files.
2025-03-05 11:55:27 +08:00
8fcca1b958 fix: big xls file error (#4859)
### What problem does this PR solve?

if *.xls file is too large, .eg >50M, I get error.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-12 12:39:25 +08:00
3894de895b Update comments (#4569)
### What problem does this PR solve?

Add license statement.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-01-21 20:52:28 +08:00
101b8ff813 fix chunk method "Table" losing content when the Excel file has multi… (#4123)
…ple sheets

### What problem does this PR solve?
discussed in https://github.com/infiniflow/ragflow/pull/4102
- In excel_parser.py, `total` means the total number of rows in Excel,
but it return in the first iterate, that lead to the wrong `to_page`
- In table.py, it when Excel file has multiple sheets, it will be
divided into multiple parts, every part size is 3000, `data` may be
empty, because it has recorded in the last iterate.
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-19 17:30:26 +08:00
0d68a6cd1b Fix errors detected by Ruff (#3918)
### What problem does this PR solve?

Fix errors detected by Ruff

### Type of change

- [x] Refactoring
2024-12-08 14:21:12 +08:00
cdea1d0a85 Update readme and add license (#1018)
### What problem does this PR solve?

- Update readme
- Add license

### Type of change

- [x] Documentation Update

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-06-01 16:24:10 +08:00
a12fcf9156 fix minio helth bug (#850)
### What problem does this PR solve?

#643 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-20 19:35:30 +08:00
GYH
c27c02ea67 Split Excel file into different chunks (#847)
### What problem does this PR solve?


Split Excel into different chunk
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-05-20 18:35:15 +08:00
7013d7f620 refine text decode (#657)
### What problem does this PR solve?
#651 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 12:25:47 +08:00
9d60a84958 refactor code (#583)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 13:19:54 +08:00
ed6081845a Fit a lot of encodings for text file. (#458)
### What problem does this PR solve?

#384

### Type of change

- [x] Performance Improvement
2024-04-19 18:02:53 +08:00
36f2d7b797 To avoid assertion while no rows in excel (#197)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

Issue link:#[[Link the issue
here](https://github.com/infiniflow/ragflow/issues/196)]

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Breaking Change (fix or feature that could cause existing
functionality not to work as expected)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Test cases
- [ ] Python SDK impacted, Need to update PyPI
- [ ] Other (please describe):
2024-04-02 10:51:21 +08:00
fd7fcb5baf apply pep8 formalize (#155) 2024-03-27 11:33:46 +08:00
6999598101 refine for English corpus (#135) 2024-03-20 16:56:16 +08:00
675a9f8d9a add dockerfile for cuda envirement. Refine table search strategy, (#123) 2024-03-14 19:45:29 +08:00
f1f09df901 add local llm implementation (#119) 2024-03-12 11:57:08 +08:00
cacd36c5e1 use onnx models, new deepdoc (#68) 2024-02-21 16:32:38 +08:00