ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-01-30 23:26:36 +08:00

Author	SHA1	Message	Date
Philipp Heyken Soares	6305c7e411	Fix metadata filter (#12861) ### What problem does this PR solve? ##### Summary This PR fixes a bug in the metadata filtering logic where the contains and not contains operators were behaving identically to the in and not in operators. It also standardizes the syntax for string-based operators. ##### The Issue On the main branch, the contains operator was implemented as: `matched = input in value if not isinstance(input, list) else all(i in value for i in input)` This logic is identical to the `in` operator. It checks if the metadata (`input`) exists within the filter (`value`). For a "contains" search, the logic should be reversed: _we want to check if the filter value exists within the metadata input_. ##### Solution Presented Here The operators have been rewritten using str.find(): Contains: `str(input).find(value) >= 0` Not Contains: `str(input).find(value) == -1` ##### Advantage This approach places the metadata (input) on the left side of the expression. This maintains stylistic consistency with the existing start with and end with operators in the same file, which also place the input on the left (e.g., str(input).lower().startswith(...)). ##### Considered Alternative In a previous PR, we considered using the standard Python `in` operator: `value in str(input)`. The `in` operator is approximately 15% faster because it uses optimized Python bytecode (CONTAINS_OP) and avoids an attribute lookup. However, following the rejection of that PR, we now propose the change presented here. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>	2026-01-29 09:59:48 +08:00
buua436	af1344033d	Delete: remove unused tests (#11749) ### What problem does this PR solve? change: remove unused tests ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-04 18:49:32 +08:00
hsparks-codes	4870d42949	feat: Auto-disable Raptor for structured data (Issue #11653) (#11676) ### What problem does this PR solve? Feature: This PR implements automatic Raptor disabling for structured data files to address issue #11653. Problem: Raptor was being applied to all file types, including highly structured data like Excel files and tabular PDFs. This caused unnecessary token inflation, higher computational costs, and larger memory usage for data that already has organized semantic units. Solution: Automatically skip Raptor processing for: - Excel files (.xls, .xlsx, .xlsm, .xlsb) - CSV files (.csv, .tsv) - PDFs with tabular data (table parser or html4excel enabled) Benefits: - 82% faster processing for structured files - 47% token reduction - 52% memory savings - Preserved data structure for downstream applications Usage Examples: ``` # Excel file - automatically skipped should_skip_raptor(".xlsx") # True # CSV file - automatically skipped should_skip_raptor(".csv") # True # Tabular PDF - automatically skipped should_skip_raptor(".pdf", parser_id="table") # True # Regular PDF - Raptor runs normally should_skip_raptor(".pdf", parser_id="naive") # False # Override for special cases should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False}) # False ``` Configuration: Includes `auto_disable_for_structured_data` toggle (default: true) to allow override for special use cases. Testing: 44 comprehensive tests, 100% passing ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 17:02:29 +08:00
hsparks-codes	237a66913b	Feat: RAG evaluation (#11674) ### What problem does this PR solve? Feature: This PR implements a comprehensive RAG evaluation framework to address issue #11656. Problem: Developers using RAGFlow lack systematic ways to measure RAG accuracy and quality. They cannot objectively answer: 1. Are RAG results truly accurate? 2. How should configurations be adjusted to improve quality? 3. How to maintain and improve RAG performance over time? Solution: This PR adds a complete evaluation system with: - Dataset & test case management - Create ground truth datasets with questions and expected answers - Automated evaluation - Run RAG pipeline on test cases and compute metrics - Comprehensive metrics - Precision, recall, F1 score, MRR, hit rate for retrieval quality - Smart recommendations - Analyze results and suggest specific configuration improvements (e.g., "increase top_k", "enable reranking") - 20+ REST API endpoints - Full CRUD operations for datasets, test cases, and evaluation runs Impact: Enables developers to objectively measure RAG quality, identify issues, and systematically improve their RAG systems through data-driven configuration tuning. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 17:00:58 +08:00
Jin Hai	256b0fb19c	Remove redundant ut (#10955) ### What problem does this PR solve? Remove redundant ut cases. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 13:04:20 +08:00
Jin Hai	78631a3fd3	Move some functions out of 'api/utils/common.py' (#10948) ### What problem does this PR solve? as title. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 12:34:47 +08:00
Jin Hai	360f5c1179	Move token related functions to common (#10942) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-03 08:50:05 +08:00
Jin Hai	44f2d6f5da	Move 'get_project_base_directory' to common directory (#10940) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-02 21:05:28 +08:00
Jin Hai	6447b737ab	Move singleton to common directory (#10935) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-02 12:24:08 +08:00
Jin Hai	f52e56c2d6	Remove 'get_lan_ip' and add common misc_utils.py (#10880) ### What problem does this PR solve? Add get_uuid, download_img and hash_str2int into misc_utils.py ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-10-31 16:42:01 +08:00
Jin Hai	5a200f7652	Add time utils (#10849) ### What problem does this PR solve? - Add time utilities and unit tests ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-10-28 19:09:14 +08:00
Jin Hai	766d900a41	Refactor: rename rmSpace to remove_redundant_spaces (#10796) ### What problem does this PR solve? - rename rmSpace to remove_redundant_spaces - move clean_markdown_block to common module - add unit tests for remove_redundant_spaces and clean_markdown_block ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-10-28 09:46:32 +08:00
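
The operand-order bug described in the "Fix metadata filter" commit (6305c7e411) may be easier to see in a small sketch. The function and parameter names below are hypothetical and are not taken from the RAGFlow source; only the expressions quoted in the commit message are reproduced, with `input` renamed to `metadata` and `value` renamed to `filter_value` to avoid shadowing the Python builtin.

```python
# Hypothetical helpers illustrating the commit message; not RAGFlow code.

def contains_before_fix(metadata, filter_value):
    # Expression quoted for the main branch: it asks whether the metadata
    # occurs inside the filter, which is the behaviour of the `in` operator.
    if not isinstance(metadata, list):
        return metadata in filter_value
    return all(item in filter_value for item in metadata)


def contains_after_fix(metadata, filter_value):
    # Expression proposed in the commit: the filter must occur inside the metadata.
    return str(metadata).find(filter_value) >= 0


def not_contains_after_fix(metadata, filter_value):
    # str.find returns -1 when the substring is absent.
    return str(metadata).find(filter_value) == -1


metadata = "annual financial report"
filter_value = "financial"

print(contains_before_fix(metadata, filter_value))     # False: operands reversed
print(contains_after_fix(metadata, filter_value))      # True
print(not_contains_after_fix(metadata, filter_value))  # False
```

With the pre-fix expression, a "contains" filter only matches when the whole metadata string happens to be a substring of the filter, which is why the commit describes it as indistinguishable from `in`.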

12 Commits
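
The retrieval metrics named in the "Feat: RAG evaluation" commit (237a66913b) can be sketched with their standard textbook definitions. This is an illustrative assumption, not the implementation added by that PR: the function names, the dictionary keys, and the sample document ids are all hypothetical, and retrieved ids are assumed to be unique.

```python
# Standard definitions of the metrics listed in the commit message.
# Hypothetical sketch; not the RAGFlow evaluation code.

def retrieval_metrics(retrieved, relevant):
    """Score one query: `retrieved` is a ranked list of unique ids,
    `relevant` is the collection of ground-truth ids."""
    relevant = set(relevant)
    hits = [doc for doc in retrieved if doc in relevant]

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)

    # Reciprocal rank of the first relevant result (0.0 if none is found).
    reciprocal_rank = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
        "hit": 1.0 if hits else 0.0,
    }


def mean_metric(per_query, key):
    """Average one metric over queries; MRR is the mean of `reciprocal_rank`
    and hit rate is the mean of `hit`."""
    return sum(m[key] for m in per_query) / len(per_query) if per_query else 0.0


scores = [
    retrieval_metrics(["d3", "d1", "d7", "d2"], {"d1", "d2", "d5"}),
    retrieval_metrics(["d9", "d8"], {"d4"}),
]
print(scores[0]["precision"])                  # 0.5
print(mean_metric(scores, "reciprocal_rank"))  # 0.25 (MRR over two queries)
print(mean_metric(scores, "hit"))              # 0.5 (hit rate)
```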