mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-02-07 11:05:05 +08:00
fix: optimize Excel row counting for files with abnormal max_row (#13018)
### What problem does this PR solve?

Some Excel files have abnormal `max_row` metadata (e.g., `max_row=1,048,534` with only 300 actual data rows). This causes:

- `row_number()` returns incorrect count, creating 350+ tasks instead of 1
- `list(ws.rows)` iterates through millions of empty rows, causing system hang

This PR uses binary search to find the actual last row with data.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -44,7 +44,7 @@ class Excel(ExcelParser):
 wb = Excel._load_excel_to_workbook(BytesIO(binary))
 total = 0
 for sheet_name in wb.sheetnames:
-    total += len(list(wb[sheet_name].rows))
+    total += Excel._get_actual_row_count(wb[sheet_name])
 res, fails, done = [], [], 0
 rn = 0
 flow_images = []
@@ -66,7 +66,7 @@ class Excel(ExcelParser):
 flow_images.append(img)
 
 try:
-    rows = list(ws.rows)
+    rows = Excel._get_rows_limited(ws)
 except Exception as e:
     logging.warning(f"Skip sheet '{sheet_name}' due to rows access error: {e}")
     continue
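The bodies of `_get_actual_row_count` and `_get_rows_limited` are not part of the hunks above, so the following is only a minimal sketch of the binary-search idea from the commit message, not the PR's actual implementation. The names `find_last_data_row`, `row_has_data`, and `gap` are hypothetical. The sketch assumes that any run of blank rows between data rows is shorter than `gap`; under that assumption, "the window of `gap` rows starting at row `r` contains data" is true for every `r` up to the last data row and false after it, which is what makes binary search valid.

```python
from typing import Callable


def find_last_data_row(
    row_has_data: Callable[[int], bool],  # hypothetical predicate: True if the 1-based row holds any value
    max_row: int,                         # the (possibly inflated) max_row reported by the sheet
    gap: int = 10,                        # assumed upper bound on consecutive blank rows plus one
) -> int:
    """Return the 1-based index of the last row with data, or 0 if none is found."""

    def window_has_data(start: int) -> bool:
        # Probe a small window so isolated blank rows do not end the search early.
        end = min(start + gap - 1, max_row)
        return any(row_has_data(r) for r in range(start, end + 1))

    lo, hi, last = 1, max_row, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if window_has_data(mid):
            last = mid       # data exists at or shortly after mid; look further down
            lo = mid + 1
        else:
            hi = mid - 1     # nothing near mid; the last data row is above it
    return last
```

With an openpyxl worksheet `ws`, the predicate could plausibly be built from `ws.cell(row=r, column=c).value` over the first few columns, and the rows could then be read with `ws.iter_rows(max_row=last)` instead of `list(ws.rows)`. For a sheet reporting `max_row=1,048,534`, the search needs about 20 probes of at most `gap` rows each rather than a scan of every row.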