mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-02-03 00:55:10 +08:00
## What problem does this PR solve?

This PR implements parsing support for legacy PowerPoint files (`.ppt`, 97-2003 format). Currently, parsing these files fails because `python-pptx` **does not support** the legacy OLE2 binary format.

## **Context:**

I originally used `aspose-slides` for this purpose. However, since `aspose-slides` is **no longer a project dependency**, I implemented a fallback mechanism using the existing `tika-server` to ensure compatibility and stability.

## **Key Changes:**

- **Fallback Logic**: Modified `rag/app/presentation.py` to catch `python-pptx` failures and automatically fall back to Tika parsing.
- **No New Dependencies**: Uses the `tika` service that is already part of the RAGFlow stack.
- **Note**: Since Tika focuses on text extraction, this implementation extracts text content but does not generate slide thumbnails.

## 🧪 Test / Verification Results

### 1. Before (The Issue)

I verified the fix using a legacy `.ppt` file (`math(1).ppt`, ~8 MB).

<img width="963" height="970" alt="image" src="https://github.com/user-attachments/assets/468c4ba8-f90b-4d7b-b969-9c5f5e42c474" />

### 2. After (The Fix)

With this PR, the system detects the failure in `python-pptx` and successfully falls back to Tika. The text is extracted correctly.

<img width="1467" height="1121" alt="image" src="https://github.com/user-attachments/assets/fa0fed3b-b923-4c86-ba2c-24b3ce6ee7a6" />

**Type of change**

- [x] New Feature (non-breaking change which adds functionality)

Signed-off-by: evilhero <2278596667@qq.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
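
The fallback logic described under Key Changes can be sketched roughly as follows. This is a minimal illustration, not the actual code in `rag/app/presentation.py`: the function name `extract_slide_text` and the two injected callables are hypothetical stand-ins for the `python-pptx` parser and the Tika text extractor, and splitting the Tika output into non-empty lines is an assumption about how the text might be chunked.

```python
from typing import Callable, List


def extract_slide_text(
    binary: bytes,
    primary: Callable[[bytes], List[str]],
    fallback: Callable[[bytes], str],
) -> List[str]:
    """Try the primary parser first; on any failure, fall back to plain-text extraction.

    `primary` stands in for a python-pptx based parser (hypothetical wrapper).
    `fallback` stands in for a Tika based text extractor (hypothetical wrapper).
    """
    try:
        return primary(binary)
    except Exception:
        # A legacy OLE2 .ppt file is not a valid .pptx package, so the
        # primary parser is expected to raise here.
        text = fallback(binary) or ""
        # The fallback yields a single text blob; keep the non-empty lines.
        # No slide thumbnails are produced on this path.
        return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Stub parsers so the sketch runs without python-pptx or a Tika server.
    def failing_pptx_parser(binary: bytes) -> List[str]:
        raise ValueError("file is not a .pptx package")

    def stub_tika(binary: bytes) -> str:
        return "Title\n\n  Body text \n"

    print(extract_slide_text(b"legacy-ppt-bytes", failing_pptx_parser, stub_tika))
```

Passing the two parsers in as callables keeps the sketch independent of how the real module wires up `python-pptx` and Tika, and makes the fallback path easy to exercise with stubs.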