fix: optimize Excel row counting for files with abnormal max_row (#13018)

### What problem does this PR solve?

Some Excel files carry abnormal `max_row` metadata (e.g., `max_row=1,048,534` when only 300 rows actually contain data). This causes two problems:
- `row_number()` returns an inflated count, creating 350+ tasks instead of 1
- `list(ws.rows)` iterates through millions of empty rows, hanging the system

This PR uses binary search to find the actual last row with data.
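
The new helpers referenced in the diff below (`Excel._get_actual_row_count`, `Excel._get_rows_limited`) are defined in a file not shown in this excerpt. As a rough illustration of the binary-search idea only, here is a minimal sketch assuming openpyxl and assuming the data rows form one contiguous block starting at row 1; the function names are hypothetical and this is not the PR's actual implementation:

```python
def _row_has_data(ws, row_idx):
    """Return True if any cell in the given row holds a non-empty value."""
    for row in ws.iter_rows(min_row=row_idx, max_row=row_idx, values_only=True):
        return any(cell is not None and str(cell).strip() != "" for cell in row)
    return False


def find_actual_last_row(ws):
    """Binary-search the last row that actually contains data.

    Assumes data rows are contiguous from row 1, so "row i has data" is
    True up to the last real row and False afterwards.
    """
    lo, hi = 1, ws.max_row  # ws.max_row may be wildly inflated (e.g. 1,048,534)
    if not _row_has_data(ws, lo):
        return 0  # sheet is effectively empty
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if _row_has_data(ws, mid):
            lo = mid  # data extends at least to mid
        else:
            hi = mid - 1  # mid is already past the data block
    return lo
```

Under these assumptions, the example from the description (`max_row=1,048,534` with 300 data rows) probes roughly log2(1,048,534) ≈ 20 rows instead of iterating through about a million.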

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement

Co-authored-by: Cursor <cursoragent@cursor.com>

```diff
@@ -44,7 +44,7 @@ class Excel(ExcelParser):
             wb = Excel._load_excel_to_workbook(BytesIO(binary))
         total = 0
         for sheet_name in wb.sheetnames:
-            total += len(list(wb[sheet_name].rows))
+            total += Excel._get_actual_row_count(wb[sheet_name])
         res, fails, done = [], [], 0
         rn = 0
         flow_images = []
@@ -66,7 +66,7 @@ class Excel(ExcelParser):
                 flow_images.append(img)
             try:
-                rows = list(ws.rows)
+                rows = Excel._get_rows_limited(ws)
             except Exception as e:
                 logging.warning(f"Skip sheet '{sheet_name}' due to rows access error: {e}")
                 continue
```
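
For the second call site, a small usage sketch of the bounded read that `Excel._get_rows_limited` stands in for (hypothetical file name and hard-coded boundary, purely illustrative): once the real last data row is known, passing an explicit `max_row` to `iter_rows` keeps openpyxl from walking the inflated tail that `list(ws.rows)` would traverse.

```python
from openpyxl import load_workbook

# Hypothetical workbook whose max_row metadata is inflated.
wb = load_workbook("example.xlsx")
ws = wb.active

actual_last_row = 300  # e.g. the boundary found by the binary search sketched above
# list(ws.rows) would iterate up to ws.max_row (potentially ~1M rows);
# iter_rows with an explicit max_row stops at the real data boundary.
rows = list(ws.iter_rows(min_row=1, max_row=actual_last_row))
print(f"metadata max_row={ws.max_row}, rows actually read={len(rows)}")
```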