Fix: improve PDF text type detection by expanding regex content (#11432)

mirror of https://github.com/infiniflow/ragflow.git synced 2026-02-06 18:45:08 +08:00

- Allow whitespace in the regex that checks whether PDF text is English
- Reduce false negatives in English PDF content recognition

### What problem does this PR solve?

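The regex used for English text detection did not accept spaces, so
valid English PDF content could fail the check (a false negative).
This PR **expands the regex used for English text detection** so that
it accepts more of the characters commonly found in English PDFs. The
modifications include: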

- Adding support for **space** in the regex.
- Ensuring the update does not reduce existing detection accuracy.
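
To illustrate why a missing space can cause false negatives, here is a minimal sketch. The two patterns, the 30-character minimum run length, and the `looks_english` helper are all hypothetical and chosen only for illustration; the actual regex in the changed file is not shown in this description and may differ.

```python
import re

# Hypothetical patterns for illustration only; not the project's real regex.
# STRICT omits the space character, RELAXED includes it at the start of the class.
STRICT = re.compile(r"[a-zA-Z0-9,.;:!?()'-]{30,}")
RELAXED = re.compile(r"[ a-zA-Z0-9,.;:!?()'-]{30,}")


def looks_english(text: str, pattern: re.Pattern) -> bool:
    """Return True if `text` contains a long enough run of accepted characters."""
    return pattern.search(text) is not None


english = "The quick brown fox jumps over the lazy dog."
chinese = "这是一个中文文档的示例文本,用于对比检测结果。"

# Without the space, every run is broken at word boundaries, so no run
# reaches the minimum length and the English sample is rejected.
print(looks_english(english, STRICT))
# With the space accepted, the whole sentence forms a single run.
print(looks_english(english, RELAXED))
# Non-English text is still rejected by the relaxed pattern.
print(looks_english(chinese, RELAXED))
```

In this sketch the relaxed pattern only widens what counts as part of an English run, so text made of other scripts is still rejected, which is consistent with the goal of not reducing existing detection accuracy.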

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

FallingSnowFlake, 2025-11-21 14:33:29 +08:00, committed by GitHub

parent 1845daf41f

commit 1033a3ae26

1 changed file with 2 additions and 2 deletions