fix: optimize Excel row counting for files with abnormal max_row (#13018)

### What problem does this PR solve?

Some Excel files carry abnormal `max_row` metadata (e.g., `max_row=1,048,534` when only 300 rows actually contain data). This causes two problems:
- `row_number()` returns an inflated count, creating 350+ tasks instead of 1
- `list(ws.rows)` iterates through millions of empty rows, hanging the system

This PR uses binary search to find the actual last row with data.
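
The new helpers referenced in the diff below (`Excel._get_actual_row_count`, `Excel._get_rows_limited`) are defined in a file not shown in this excerpt. As a rough illustration of the binary-search idea only, here is a minimal sketch assuming openpyxl and assuming the data rows form one contiguous block starting at row 1; the function names are hypothetical and this is not the PR's actual implementation:

```python
def _row_has_data(ws, row_idx):
    """Return True if any cell in the given row holds a non-empty value."""
    for row in ws.iter_rows(min_row=row_idx, max_row=row_idx, values_only=True):
        return any(cell is not None and str(cell).strip() != "" for cell in row)
    return False


def find_actual_last_row(ws):
    """Binary-search the last row that actually contains data.

    Assumes data rows are contiguous from row 1, so "row i has data" is
    True up to the last real row and False afterwards.
    """
    lo, hi = 1, ws.max_row  # ws.max_row may be wildly inflated (e.g. 1,048,534)
    if not _row_has_data(ws, lo):
        return 0  # sheet is effectively empty
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if _row_has_data(ws, mid):
            lo = mid  # data extends at least to mid
        else:
            hi = mid - 1  # mid is already past the data block
    return lo
```

Under these assumptions, the example from the description (`max_row=1,048,534` with 300 data rows) probes roughly log2(1,048,534) ≈ 20 rows instead of iterating through about a million.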

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement

Co-authored-by: Cursor <cursoragent@cursor.com>

```diff
@@ -44,7 +44,7 @@ class Excel(ExcelParser):
             wb = Excel._load_excel_to_workbook(BytesIO(binary))
         total = 0
         for sheet_name in wb.sheetnames:
-            total += len(list(wb[sheet_name].rows))
+            total += Excel._get_actual_row_count(wb[sheet_name])
         res, fails, done = [], [], 0
         rn = 0
         flow_images = []
@@ -66,7 +66,7 @@ class Excel(ExcelParser):
                 flow_images.append(img)
             try:
-                rows = list(ws.rows)
+                rows = Excel._get_rows_limited(ws)
             except Exception as e:
                 logging.warning(f"Skip sheet '{sheet_name}' due to rows access error: {e}")
                 continue
```
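
For the second call site, a small usage sketch of the bounded read that `Excel._get_rows_limited` stands in for (hypothetical file name and hard-coded boundary, purely illustrative): once the real last data row is known, passing an explicit `max_row` to `iter_rows` keeps openpyxl from walking the inflated tail that `list(ws.rows)` would traverse.

```python
from openpyxl import load_workbook

# Hypothetical workbook whose max_row metadata is inflated.
wb = load_workbook("example.xlsx")
ws = wb.active

actual_last_row = 300  # e.g. the boundary found by the binary search sketched above
# list(ws.rows) would iterate up to ws.max_row (potentially ~1M rows);
# iter_rows with an explicit max_row stops at the real data boundary.
rows = list(ws.iter_rows(min_row=1, max_row=actual_last_row))
print(f"metadata max_row={ws.max_row}, rows actually read={len(rows)}")
```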