Refa: improve image table context (#12244)

### What problem does this PR solve?

Improve image table context.

Current strategy in attach_media_context:

- Order by position when possible: if any chunk has page/position info, sort by (page, top, left), otherwise keep original order.
- Apply only to media chunks: images use image_context_size, tables use table_context_size.
- Primary matching: on the same page, choose a text chunk whose vertical span overlaps the media, then pick the one with the closest vertical midpoint.
- Fallback matching: if no overlap on that page, choose the nearest text chunk on the same page (page-head uses the next text; page-tail uses the previous text).
- Context extraction: inside the chosen text chunk, find a mid-sentence boundary near the text midpoint, then take context_size tokens split before/after (total budget).
- No multi-chunk stitching: context comes from a single text chunk to avoid mixing unrelated segments.

### Type of change

- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
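
The matching and extraction rules listed in the strategy can be sketched roughly as below. This is an illustration only, not the code changed by this PR: the `Chunk` shape, the helper names `pick_text_chunk` and `extract_context`, whitespace tokenization, and the use of `.`, `!`, `?` as sentence boundaries are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical chunk shape; the real chunk fields may differ."""
    kind: str          # "text", "image", or "table"
    page: int
    top: float
    bottom: float
    text: str = ""


def _midpoint(c: Chunk) -> float:
    return (c.top + c.bottom) / 2


def pick_text_chunk(media: Chunk, chunks: List[Chunk]) -> Optional[Chunk]:
    """Choose a single text chunk on the media's page, or None if there is none."""
    same_page = [c for c in chunks if c.kind == "text" and c.page == media.page]
    if not same_page:
        return None

    # Primary matching: vertical spans overlap, closest vertical midpoint wins.
    overlapping = [
        c for c in same_page if c.top < media.bottom and c.bottom > media.top
    ]
    if overlapping:
        return min(overlapping, key=lambda c: abs(_midpoint(c) - _midpoint(media)))

    # Fallback matching: nearest text chunk by vertical gap. For media at the
    # page head every text chunk lies below it, so this is the next text; at
    # the page tail every text chunk lies above it, so this is the previous one.
    def gap(c: Chunk) -> float:
        return max(c.top - media.bottom, media.top - c.bottom)

    return min(same_page, key=gap)


def extract_context(text: str, context_size: int) -> str:
    """Take about context_size tokens around a sentence boundary near the middle."""
    tokens = text.split()  # assumption: whitespace tokens stand in for real tokens
    if context_size <= 0 or not tokens:
        return ""

    mid = len(tokens) // 2
    # A boundary index b means "a sentence ends just before tokens[b]".
    boundaries = [
        i + 1 for i, tok in enumerate(tokens) if tok.endswith((".", "!", "?"))
    ]
    split = min(boundaries, key=lambda b: abs(b - mid)) if boundaries else mid

    # context_size is the total budget, split between the two sides.
    before = context_size // 2
    after = context_size - before
    return " ".join(tokens[max(0, split - before): split + after])
```

Because the context is always taken from the one chunk returned by `pick_text_chunk`, the sketch never stitches text from several chunks, which matches the last rule in the list.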
2025-12-31 01:01:30 +08:00 · 2025-12-26 17:55:32 +08:00
parent 9de3ecc4a8
commit 51bc41b2e8
4 changed files with 165 additions and 43 deletions
--- a/web/src/interfaces/database/document.ts
+++ b/web/src/interfaces/database/document.ts
@@ -44,6 +44,9 @@ export interface IParserConfig {
  raptor?: Raptor;
  graphrag?: GraphRag;
  image_context_window?: number;
+  image_table_context_window?: number;
+  image_context_size?: number;
+  table_context_size?: number;
  mineru_parse_method?: 'auto' | 'txt' | 'ocr';
  mineru_formula_enable?: boolean;
  mineru_table_enable?: boolean;
--- a/web/src/interfaces/request/document.ts
+++ b/web/src/interfaces/request/document.ts
@@ -8,6 +8,9 @@ export interface IChangeParserConfigRequestBody {
  auto_questions?: number;
  html4excel?: boolean;
  toc_extraction?: boolean;
+  image_table_context_window?: number;
+  image_context_size?: number;
+  table_context_size?: number;
 }

 export interface IChangeParserRequestBody {