### What problem does this PR solve?
This PR adds **Seafile** as a new data source connector for RAGFlow.
[Seafile](https://www.seafile.com/) is an open-source, self-hosted file
sync and share platform widely used by enterprises, universities, and
organizations that require data sovereignty and privacy. Users who store
documents in Seafile currently have no way to index and search their
content through RAGFlow.
This connector enables RAGFlow users to:
- Connect to self-hosted Seafile servers via API token
- Index documents from personal and shared libraries
- Support incremental polling for updated files
- Seamlessly integrate Seafile-stored documents into their RAG pipelines
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### Changes included
- `SeaFileConnector` implementing `LoadConnector` and `PollConnector`
interfaces
- Support for API token
- Recursive file traversal across libraries
- Time-based filtering for incremental updates
- Seafile logo (sourced from Simple Icons, CC0)
- Connector configuration and registration
### Testing
- Tested against self-hosted Seafile Community Edition
- Verified authentication (token)
- Verified document ingestion from personal and shared libraries
- Verified incremental polling with time filters
### What problem does this PR solve?
Fix:Optimize metadata and optimize the empty state style of the agent
page.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add memory status indicator and detail message dialog
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
#### Summary
This PR enhances the Semi-automatic metadata filtering mode by allowing
users to explicitly pre-define operators (e.g., contains, =, >, etc.)
for selected metadata keys. While the LLM still dynamically extracts the
filter value from the user's query, it is now strictly constrained to
use the operator specified in the UI configuration.
Using this feature is optional. By default the operator selection is set
to "automatic" resulting in the LLM choosing the operator (as
presently).
#### Rationale & Use Case
This enhancement was driven by a concrete challenge I encountered while
working with technical documentation.
In my specific use case, I was trying to filter for software versions
within a technical manual. In this dataset, a single document chunk
often applies to multiple software versions. These versions are stored
as a combined string within the metadata for each chunk.
When using the standard semi-automatic filter, the LLM would
inconsistently choose between the contains and equals operators. When it
chose equals, it would exclude every chunk that applied to more than one
version, even if the version I was searching for was clearly included in
that metadata string. This led to incomplete and frustrating retrieval
results.
By extending the semi-automatic filter to allow pre-defining the
operator for a specific key, I was able to force the use of contains for
the version field. This change immediately led to significantly improved
and more reliable results in my case.
I believe this functionality will be equally useful for others dealing
with "tagged" or multi-value metadata where the relationship between the
query and the field is known, but the specific value needs to remain
dynamic.
#### Key Changes
##### Backend & Core Logic
- `common/metadata_utils.py`: Updated apply_meta_data_filter to support
a mixed data structure for semi_auto (handling both legacy string arrays
and the new object-based format {"key": "...", "op": "..."}).
- `rag/prompts/generator.py`: Extended gen_meta_filter to accept and
pass operator constraints to the LLM.
- `rag/prompts/meta_filter.md`: Updated the system prompt to instruct
the LLM to strictly respect provided operator constraints.
##### Frontend
- `web/src/components/metadata-filter/metadata-semi-auto-fields.tsx`:
Enhanced the UI to include an operator dropdown for each selected
metadata key, utilizing existing operator constants.
- `web/src/components/metadata-filter/index.tsx`: Updated the validation
schema to accommodate the new state structure.
#### Test Plan
- Backward Compatibility: Verified that existing semi-auto filters
stored as simple strings still function correctly.
- Prompt Verification: Confirmed that constraints are correctly rendered
in the LLM system prompt when specified.
- Added unit tests as
`test/unit_test/common/test_apply_semi_auto_meta_data_filter.py`
- Manual End-to-End:
- Configured a "Semi-automatic" filter for a "Version" key with the
"contains" operator.
- Asked a version-specific query.
- Result
<img width="1173" height="704" alt="Screenshot 2026-02-02 145359"
src="https://github.com/user-attachments/assets/510a6a61-a231-4dc2-a7fe-cdfc07219132"
/>
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
---------
Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>
### What problem does this PR solve?
As title.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: Liu An <asiro@qq.com>
### What problem does this PR solve?
Fixed 12787 with syntax error in generated MySql json path expression
https://github.com/infiniflow/ragflow/issues/12787
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
#### What was fixed:
- Changed line 237 in ob_conn.py from value_str = get_value_str(value)
if value else "" to value_str = get_value_str(value)
- This fixes the bug where falsy but valid values (0, False, "", [], {})
were being converted to empty strings, causing invalid SQL syntax
#### What was tested:
- Comprehensive unit tests covering all edge cases
- Regression tests specifically for the bug scenario
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Proofread the Sandbox Specification document and moved it to a dedicated
folder outside of the original docs.
### Type of change
- [x] Documentation Update
## Description
This PR focuses on API performance optimization and refining the model
capability detection logic in the Agent/Canvas module.
### 1. Performance Optimization (Backend)
- **Changes**: Removed `cls.model.dsl` from query fields in
`UserCanvasService.get_by_tenant_ids`.
- **Reasoning**: The `dsl` object is large and unnecessary for the Agent
list view. Excluding it reduces the payload size of the
`/v1/canvas/list` API, leading to faster serialization and reduced
network latency.
- **Consistency**: Full DSL data remains accessible via the individual
`/v1/canvas/get/<id>` endpoint used in the detail view.
### 2. Multimodal Detection Refinement (Frontend)
- **Changes**: Replaced `model_type === LlmModelType.Image2text` with
`tags?.includes('IMAGE2TEXT')`.
- **Reasoning**: In RAGFlow, `model_type` defines the primary role of a
model (e.g., `chat`). However, many advanced Chat models are also
vision-capable. Since `model_type` is a single-value field, it cannot
represent these multiple capabilities.
- **Solution**: Utilizing the `tags` field (which supports multiple
attributes) to check for `IMAGE2TEXT` ensures that models like
`gpt-5.2-pro` correctly display multimodal input options.
## Type of Change
- [x] Bug fix (logic correction for multimodal detection)
- [x] Optimization (performance improvement for list API)
## Main Changes
- `api/db/services/canvas_service.py`: Optimized DB query by excluding
heavy DSL fields.
- `web/src/pages/agent/form/agent-form/index.tsx`: Enhanced capability
detection using the tags system.
## Verification
- [x] Verified Agent list loads faster with reduced response payload.
- [x] Confirmed that `chat` models with the `IMAGE2TEXT` tag now
correctly enable the multimodal input UI.
### What problem does this PR solve?
Fix Gitee AI links and update the reranker model configuration
### Type of change
- [X] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: franco <1787003204@q.comq>
## What problem does this PR solve?
This PR implements parsing support for legacy PowerPoint files (`.ppt`,
97-2003 format).
Currently, parsing these files fails because `python-pptx` **natively
lacks support** for the legacy OLE2 binary format.
## **Context:**
I originally using `aspose-slides` for this purpose. However, since
`aspose-slides` is **no longer a project dependency**, I implemented a
fallback mechanism using the existing `tika-server` to ensure
compatibility and stability.
## **Key Changes:**
- **Fallback Logic**: Modified `rag/app/presentation.py` to catch
`python-pptx` failures and automatically fall back to Tika parsing.
- **No New Dependencies**: Utilizes the `tika` service that is already
part of the RAGFlow stack.
- **Note**: Since Tika focuses on text extraction, this implementation
extracts text content but does not generate slide thumbnails .
## 🧪 Test / Verification Results
### 1. Before (The Issue)
I have verified the fix using a legacy `.ppt` file (`math(1).ppt`,
~8MB).
<img width="963" height="970" alt="image"
src="https://github.com/user-attachments/assets/468c4ba8-f90b-4d7b-b969-9c5f5e42c474"
/>
### 2. After (The Fix)
With this PR, the system detects the failure in python-pptx and
successfully falls back to Tika. The text is extracted correctly.
<img width="1467" height="1121" alt="image"
src="https://github.com/user-attachments/assets/fa0fed3b-b923-4c86-ba2c-24b3ce6ee7a6"
/>
**Type of change**
- [x] New Feature (non-breaking change which adds functionality)
Signed-off-by: evilhero <2278596667@qq.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
Fixes a duplicate POSTGRES entry in the TextFieldType enum that triggers
TypeError: 'POSTGRES' already defined as 'TEXT' on
import, preventing the backend from starting and resulting in 502 errors
on the frontend.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fixed the regression issue that unable to start the server as below
details, which was related to this pr
https://github.com/infiniflow/ragflow/pull/12926 looks like.
``` Error Trace
Traceback (most recent call last):
File "/mnt/c/Workspace/ragflow/api/ragflow_server.py", line 33, in <module>
from api.apps import app
File "/mnt/c/Workspace/ragflow/api/apps/__init__.py", line 26, in <module>
Traceback (most recent call last):
File "/mnt/c/Workspace/ragflow/rag/svr/task_executor.py", line 34, in <module>
from api.db.db_models import close_connection, APIToken
File "/mnt/c/Workspace/ragflow/api/db/db_models.py", line 49, in <module>
from api.db.services.knowledgebase_service import KnowledgebaseService
File "/mnt/c/Workspace/ragflow/api/db/services/__init__.py", line 19, in <module>
class TextFieldType(Enum):
File "/mnt/c/Workspace/ragflow/api/db/db_models.py", line 53, in TextFieldType
from .user_service import UserService as UserService
File "/mnt/c/Workspace/ragflow/api/db/services/user_service.py", line 24, in <module>
POSTGRES = "TEXT"
^^^^^^^^
File "/usr/lib/python3.12/enum.py", line 443, in __setitem__
raise TypeError('%r already defined as %r' % (key, self[key]))
TypeError: 'POSTGRES' already defined as 'TEXT'
from api.db.db_models import DB, UserTenant
File "/mnt/c/Workspace/ragflow/api/db/db_models.py", line 49, in <module>
class TextFieldType(Enum):
File "/mnt/c/Workspace/ragflow/api/db/db_models.py", line 53, in TextFieldType
POSTGRES = "TEXT"
^^^^^^^^
File "/usr/lib/python3.12/enum.py", line 443, in __setitem__
raise TypeError('%r already defined as %r' % (key, self[key]))
TypeError: 'POSTGRES' already defined as 'TEXT'
```
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
close#12930
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Changes
only delete duplicate definition in `api/db/db_models.py`
### What problem does this PR solve?
- Remove unused imports (Mock, patch, MagicMock, json, os,
RAGFLOW_COLUMNS, VECTOR_FIELD_PATTERN) from multiple files
- Replace f-string formatting with regular strings for console output
messages in cli.py
- Clean up unnecessary imports that were no longer being used in the
codebase
### Type of change
- [x] Refactoring
## Summary
This PR adds Peewee ORM support for OceanBase as the primary database in
RAGFlow, as requested in issue #12769.
## Changes
### Core Implementation
1. **RetryingPooledOceanBaseDatabase Class**
- Inherits from `PooledMySQLDatabase` (OceanBase is MySQL-compatible)
- Implements retry mechanism for connection issues
- Handles MySQL-specific error codes (2013, 2006 for connection loss)
- Provides connection pool management
2. **PooledDatabase Enum**
- Added `OCEANBASE = RetryingPooledOceanBaseDatabase`
3. **DatabaseLock Enum**
- Added `OCEANBASE = MysqlDatabaseLock`
- OceanBase uses MySQL-style locking
4. **TextFieldType Enum**
- Added `OCEANBASE = "LONGTEXT"`
- OceanBase uses same text field type as MySQL
5. **DatabaseMigrator Enum**
- Added `OCEANBASE = MySQLMigrator`
- OceanBase uses MySQL migration tools
### Usage
```bash
# Set environment variable to use OceanBase
export DB_TYPE=oceanbase
# Configure connection (in docker/.env or environment)
OCEANBASE_HOST=localhost
OCEANBASE_PORT=2881
OCEANBASE_USER=root
OCEANBASE_PASSWORD=password
OCEANBASE_DATABASE=ragflow
```
### Technical Details
- **Location**: `api/db/db_models.py`
- **Dependencies**: No new dependencies (uses existing Peewee MySQL
support)
- **Code Size**: ~90 lines
- **Difficulty**: Simple
### Testing
- Added comprehensive unit tests in
`tests/unit/test_oceanbase_peewee.py`
- Tests cover:
- OceanBase database class existence and inheritance
- Enum values for PooledDatabase, DatabaseLock, TextFieldType
- Initialization with custom retry settings
- Environment variable configuration
### Acceptance Criteria
✅ Can switch to OceanBase database via `DB_TYPE=oceanbase` environment
variable
✅ All database operations work normally in OceanBase environment
✅ OceanBase uses MySQL compatibility mode (no additional dependencies)
### Background
This is part of the RAGFlow + OceanBase Hackathon to allow users to
choose OceanBase as RAGFlow's primary database, leveraging OceanBase's
high availability and scalability.
---
## Related Issues
- **Primary**: https://github.com/infiniflow/ragflow/issues/12769
- **Context**: https://github.com/oceanbase/seekdb/issues/123 (OceanBase
Developer Challenge)
---
Closesinfiniflow/ragflow#12769
### What problem does this PR solve?
close#12770
This PR adds OceanBase as a storage backend for the Table Parser. It
enables dynamic table schema storage via JSON and implements OceanBase
SQL execution for text-to-SQL retrieval.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### Changes
- Table Parser stores row data into `chunk_data` when doc engine is
OceanBase. (table.py)
- OceanBase table schema adds `chunk_data` JSON column and migrates if
needed.
- Implemented OceanBase `sql()` to execute text-to-SQL results.
(ob_conn.py)
- Add `DOC_ENGINE_OCEANBASE` flag for engine detection (setting.py)
### Test
1. Set `DOC_ENGINE=oceanbase` (e.g. in `docker/.env`)
<img width="1290" height="783" alt="doc_engine_ob"
src="https://github.com/user-attachments/assets/7d1c609f-7bf2-4b2e-b4cc-4243e72ad4f1"
/>
2. Upload an Excel file to Knowledge Base.(for test, we use as below)
<img width="786" height="930" alt="excel"
src="https://github.com/user-attachments/assets/bedf82f2-cd00-426b-8f4d-6978a151231a"
/>
3. Choose **Table** as parsing method.
<img width="2550" height="1134" alt="parse_excel"
src="https://github.com/user-attachments/assets/aba11769-02be-4905-97e1-e24485e24cd0"
/>
4.Ask a natural language query in chat.
<img width="2550" height="1134" alt="query"
src="https://github.com/user-attachments/assets/26a910a6-e503-4ac7-b66a-f5754bbb0e91"
/>
### What problem does this PR solve?
Close#12768.
This PR adds OceanBase support to RAGFlow’s Text-to-SQL (ExeSQL)
component.
OceanBase is integrated via MySQL compatibility mode, and the UI
`db_type` options are updated accordingly.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### Changes
**Backend**
- Add `oceanbase` `db_type` validation and connection logic in
`exesql.py` and reuse existing MySQL compatibility mode
**Frontend**
- Add OceanBase option to the ExeSQL `db_type` selector
### How to test
1. Configure OceanBase connection in ExeSQL node
(host/port/user/password/database)
2. Input: “Show 10 rows from test table”
3. Generated SQL: `SELECT * FROM test LIMIT 10;`
4. Query executes successfully and results are returned
### Screenshots
- ExeSQL db_type includes OceanBase
<img width="649" height="1015" alt="2"
src="https://github.com/user-attachments/assets/e0a5f7b9-e282-402a-8639-64c1aef8fce6"
/>
- ExeSQL test OceanBase connection
<img width="2247" height="1140" alt="test_ob"
src="https://github.com/user-attachments/assets/f16ebd93-b48e-4d18-b53f-8496581e755d"
/>
- Query results from OceanBase shown in UI
<img width="2550" height="1351" alt="1"
src="https://github.com/user-attachments/assets/b44163dc-baab-420d-b31e-b644bdcb77a9"
/>
### What problem does this PR solve?
Changed test priorities in multiple test files, downgrading from p1 to
p2 and p2 to p3.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add code coverage reporting to CI
### Type of change
- [x] Test (please describe): coverage report
---------
Co-authored-by: Liu An <asiro@qq.com>
**What problem does this PR solve?**
When loading JSON mapping/schema files, the code used
json.load(open(path)) without closing the file. The file handle stayed
open until garbage collection, which can leak file descriptors under
load (e.g. repeated reconnects or migrations).
**Type of change**
[x] Bug Fix (non-breaking change which fixes an issue)
**Change**
Replaced json.load(open(...)) with a context manager so the file is
closed after loading:
with open(fp_mapping, "r") as f: ... = json.load(f)
**Files updated**
rag/utils/opensearch_conn.py – mapping load (1 place)
common/doc_store/es_conn_base.py – mapping load + doc_meta_mapping load
(2 places)
common/doc_store/infinity_conn_base.py – schema loads in _migrate_db,
doc metadata table creation, and SQL field mapping (4 places)
Behavior is unchanged; only resource handling is fixed.
Co-authored-by: Gittensor Miner <miner@gittensor.io>
### What problem does this PR solve?
test_update_document.py failed as metadata is not included in the
response of get_list(), fix the issue.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Closes#12762
### What problem does this PR solve?
**Line break issue in Agent prompt editor:**
- Text with blank lines in `system_prompt` or `user_prompt` would have
extra/fewer blank lines after save/reload or paste
- Root cause: Mismatch between Lexical editor's paragraph nodes (`\n\n`
separator) and line break nodes (`\n` separator)
**Auto-save issue:**
- Changes were only saved after 20-second debounce, causing data loss on
page refresh before timer completed
### Solution
1. **Line break fix**: Use `LineBreakNode` consistently for all line
breaks (typing Enter, paste, load)
2. **Auto-save**: Save immediately when prompt editor loses focus
[1.webm](https://github.com/user-attachments/assets/eb2c2428-54a3-4d4e-8037-6cc34a859b83)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This commit updates test cases for create, delete, and update dataset
endpoints to expect consistent error messages when an unsupported
content type is provided.
### Type of change
- [x] Bug Fix (test)
## Description
This PR implements comprehensive OceanBase performance monitoring and
health check functionality as requested in issue #12772. The
implementation follows the existing ES/Infinity health check patterns
and provides detailed metrics for operations teams.
## Problem
Currently, RAGFlow lacks detailed health monitoring for OceanBase when
used as the document engine. Operations teams need visibility into:
- Connection status and latency
- Storage space usage
- Query throughput (QPS)
- Slow query statistics
- Connection pool utilization
## Solution
### 1. Enhanced OBConnection Class (`rag/utils/ob_conn.py`)
Added comprehensive performance monitoring methods:
- `get_performance_metrics()` - Main method returning all performance
metrics
- `_get_storage_info()` - Retrieves database storage usage
- `_get_connection_pool_stats()` - Gets connection pool statistics
- `_get_slow_query_count()` - Counts queries exceeding threshold
- `_estimate_qps()` - Estimates queries per second
- Enhanced `health()` method with connection status
### 2. Health Check Utilities (`api/utils/health_utils.py`)
Added two new functions following ES/Infinity patterns:
- `get_oceanbase_status()` - Returns OceanBase status with health and
performance metrics
- `check_oceanbase_health()` - Comprehensive health check with detailed
metrics
### 3. API Endpoint (`api/apps/system_app.py`)
Added new endpoint:
- `GET /v1/system/oceanbase/status` - Returns OceanBase health status
and performance metrics
### 4. Comprehensive Unit Tests
(`test/unit_test/utils/test_oceanbase_health.py`)
Added 340+ lines of unit tests covering:
- Health check success/failure scenarios
- Performance metrics retrieval
- Error handling and edge cases
- Connection pool statistics
- Storage information retrieval
- QPS estimation
- Slow query detection
## Metrics Provided
- **Connection Status**: connected/disconnected
- **Latency**: Query latency in milliseconds
- **Storage**: Used and total storage space
- **QPS**: Estimated queries per second
- **Slow Queries**: Count of queries exceeding threshold
- **Connection Pool**: Active connections, max connections, pool size
## Testing
- All unit tests pass
- Error handling tested for connection failures
- Edge cases covered (missing tables, connection errors)
- Follows existing code patterns and conventions
## Code Statistics
- **Total Lines Changed**: 665+ lines
- **New Code**: ~600 lines
- **Test Coverage**: 340+ lines of comprehensive tests
- **Files Modified**: 3
- **Files Created**: 1 (test file)
## Acceptance Criteria Met
✅ `/system/oceanbase/status` API returns OceanBase health status
✅ Monitoring metrics accurately reflect OceanBase running status
✅ Clear error messages when health checks fail
✅ Response time optimized (metrics cached where possible)
✅ Follows existing ES/Infinity health check patterns
✅ Comprehensive test coverage
## Related Files
- `rag/utils/ob_conn.py` - OceanBase connection class
- `api/utils/health_utils.py` - Health check utilities
- `api/apps/system_app.py` - System API endpoints
- `test/unit_test/utils/test_oceanbase_health.py` - Unit tests
Fixes#12772
---------
Co-authored-by: Daniel <daniel@example.com>
### What problem does this PR solve?
Fixed thread pool workers and improve retrieval component
### Type of change
- [x] Refactoring
- [x] Performance Improvement
## What problem does this PR solve?
This PR addresses three specific issues to improve agent reliability and
model support:
1. **`codeExec` Output Limitation**: Previously, the `codeExec` tool was
strictly limited to returning `string` types. I updated the output
constraint to `object` to support structured data (Dicts, Lists, etc.)
required for complex downstream tasks.
2. **`codeExec` Error Handling**: Improved the execution logic so that
when runtime errors occur, the tool captures the exception and returns
the error message as the output instead of causing the process to abort
or fail silently.
3. **Spark Model Configuration**:
- Added support for the `MAX-32k` model variant.
- Fixed the `Spark-Lite` mapping from `general` to `lite` to match the
latest API specifications.
## Type of change
- [x] Bug Fix (fixes execution logic and model mapping)
- [x] New Feature / Enhancement (adds model support and improves tool
flexibility)
## Key Changes
### `agent/tools/code_exec.py`
- Changed the output type definition from `string` to `object`.
- Refactored the execution flow to gracefully catch exceptions and
return error messages as part of the tool output.
### `rag/llm/chat_model.py`
- Added `"Spark-Max-32K": "max-32k"` to the model list.
- Updated `"Spark-Lite"` value from `"general"` to `"lite"`.
## Checklist
- [x] My code follows the style guidelines of this project.
- [x] I have performed a self-review of my own code.
Signed-off-by: evilhero <2278596667@qq.com>
### What problem does this PR solve?
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Adds a CLI-based retrieval test to CI after the Elasticsearch HTTP API
tests to validate end-to-end admin/user flows and dataset retrieval via
ragflow_cli.py. This helps catch regressions in the CLI path that aren’t
covered by existing API tests.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fix wrong data rendered in task executor bar chart
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
##### Summary
This PR fixes a bug in the metadata filtering logic where the contains
and not contains operators were behaving identically to the in and not
in operators. It also standardizes the syntax for string-based
operators.
##### The Issue
On the main branch, the contains operator was implemented as:
`matched = input in value if not isinstance(input, list) else all(i in
value for i in input)`
This logic is identical to the `in` operator. It checks if the metadata
(`input`) exists within the filter (`value`). For a "contains" search,
the logic should be reversed: _we want to check if the filter value
exists within the metadata input_.
##### Solution Presented Here
The operators have been rewritten using str.find():
Contains: `str(input).find(value) >= 0`
Not Contains: `str(input).find(value) == -1`
##### Advantage
This approach places the metadata (input) on the left side of the
expression. This maintains stylistic consistency with the existing start
with and end with operators in the same file, which also place the input
on the left (e.g., str(input).lower().startswith(...)).
##### Considered Alternative
In a previous PR we considered using the standard Python `in` operator:
`value in str(input)`.
The `in` operator is approximately 15% faster because it uses optimized
Python bytecode (CONTAINS_OP) and avoids an attribute lookup. However
following rejection of this PR we now propose the change presented here.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
---------
Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.4.0 to 2.6.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/urllib3/urllib3/releases">urllib3's
releases</a>.</em></p>
<blockquote>
<h2>2.6.3</h2>
<h2>🚀 urllib3 is fundraising for HTTP/2 support</h2>
<p><a
href="https://sethmlarson.dev/urllib3-is-fundraising-for-http2-support">urllib3
is raising ~$40,000 USD</a> to release HTTP/2 support and ensure
long-term sustainable maintenance of the project after a sharp decline
in financial support. If your company or organization uses Python and
would benefit from HTTP/2 support in Requests, pip, cloud SDKs, and
thousands of other projects <a
href="https://opencollective.com/urllib3">please consider contributing
financially</a> to ensure HTTP/2 support is developed sustainably and
maintained for the long-haul.</p>
<p>Thank you for your support.</p>
<h2>Changes</h2>
<ul>
<li>Fixed a security issue where decompression-bomb safeguards of the
streaming API were bypassed when HTTP redirects were followed.
(CVE-2026-21441 reported by <a
href="https://github.com/D47A"><code>@D47A</code></a>, 8.9 High,
GHSA-38jv-5279-wg99)</li>
<li>Started treating <code>Retry-After</code> times greater than 6 hours
as 6 hours by default. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3743">urllib3/urllib3#3743</a>)</li>
<li>Fixed <code>urllib3.connection.VerifiedHTTPSConnection</code> on
Emscripten. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3752">urllib3/urllib3#3752</a>)</li>
</ul>
<h2>2.6.2</h2>
<h2>🚀 urllib3 is fundraising for HTTP/2 support</h2>
<p><a
href="https://sethmlarson.dev/urllib3-is-fundraising-for-http2-support">urllib3
is raising ~$40,000 USD</a> to release HTTP/2 support and ensure
long-term sustainable maintenance of the project after a sharp decline
in financial support. If your company or organization uses Python and
would benefit from HTTP/2 support in Requests, pip, cloud SDKs, and
thousands of other projects <a
href="https://opencollective.com/urllib3">please consider contributing
financially</a> to ensure HTTP/2 support is developed sustainably and
maintained for the long-haul.</p>
<p>Thank you for your support.</p>
<h2>Changes</h2>
<ul>
<li>Fixed <code>HTTPResponse.read_chunked()</code> to properly handle
leftover data in the decoder's buffer when reading compressed chunked
responses. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3734">urllib3/urllib3#3734</a>)</li>
</ul>
<h2>2.6.1</h2>
<h2>🚀 urllib3 is fundraising for HTTP/2 support</h2>
<p><a
href="https://sethmlarson.dev/urllib3-is-fundraising-for-http2-support">urllib3
is raising ~$40,000 USD</a> to release HTTP/2 support and ensure
long-term sustainable maintenance of the project after a sharp decline
in financial support. If your company or organization uses Python and
would benefit from HTTP/2 support in Requests, pip, cloud SDKs, and
thousands of other projects <a
href="https://opencollective.com/urllib3">please consider contributing
financially</a> to ensure HTTP/2 support is developed sustainably and
maintained for the long-haul.</p>
<p>Thank you for your support.</p>
<h2>Changes</h2>
<ul>
<li>Restore previously removed <code>HTTPResponse.getheaders()</code>
and <code>HTTPResponse.getheader()</code> methods. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3731">#3731</a>)</li>
</ul>
<h2>2.6.0</h2>
<h2>🚀 urllib3 is fundraising for HTTP/2 support</h2>
<p><a
href="https://sethmlarson.dev/urllib3-is-fundraising-for-http2-support">urllib3
is raising ~$40,000 USD</a> to release HTTP/2 support and ensure
long-term sustainable maintenance of the project after a sharp decline
in financial support. If your company or organization uses Python and
would benefit from HTTP/2 support in Requests, pip, cloud SDKs, and
thousands of other projects <a
href="https://opencollective.com/urllib3">please consider contributing
financially</a> to ensure HTTP/2 support is developed sustainably and
maintained for the long-haul.</p>
<p>Thank you for your support.</p>
<h2>Security</h2>
<ul>
<li>Fixed a security issue where streaming API could improperly handle
highly compressed HTTP content ("decompression bombs") leading
to excessive resource consumption even when a small amount of data was
requested. Reading small chunks of compressed data is safer and much
more efficient now. (CVE-2025-66471 reported by <a
href="https://github.com/Cycloctane"><code>@Cycloctane</code></a>, 8.9
High, GHSA-2xpw-w6gg-jr37)</li>
<li>Fixed a security issue where an attacker could compose an HTTP
response with virtually unlimited links in the
<code>Content-Encoding</code> header, potentially leading to a denial of
service (DoS) attack by exhausting system resources during decoding. The
number of allowed chained encodings is now limited to 5. (CVE-2025-66418
reported by <a
href="https://github.com/illia-v"><code>@illia-v</code></a>, 8.9 High,
GHSA-gm62-xv2j-4w53)</li>
</ul>
<blockquote>
<p>[!IMPORTANT]</p>
<ul>
<li>If urllib3 is not installed with the optional
<code>urllib3[brotli]</code> extra, but your environment contains a
Brotli/brotlicffi/brotlipy package anyway, make sure to upgrade it to at
least Brotli 1.2.0 or brotlicffi 1.2.0.0 to benefit from the security
fixes and avoid warnings. Prefer using <code>urllib3[brotli]</code> to
install a compatible Brotli package automatically.</li>
</ul>
</blockquote>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/urllib3/urllib3/blob/main/CHANGES.rst">urllib3's
changelog</a>.</em></p>
<blockquote>
<h1>2.6.3 (2026-01-07)</h1>
<ul>
<li>Fixed a high-severity security issue where decompression-bomb
safeguards of
the streaming API were bypassed when HTTP redirects were followed.
(<code>GHSA-38jv-5279-wg99
<https://github.com/urllib3/urllib3/security/advisories/GHSA-38jv-5279-wg99></code>__)</li>
<li>Started treating <code>Retry-After</code> times greater than 6 hours
as 6 hours by
default. (<code>[#3743](https://github.com/urllib3/urllib3/issues/3743)
<https://github.com/urllib3/urllib3/issues/3743></code>__)</li>
<li>Fixed <code>urllib3.connection.VerifiedHTTPSConnection</code> on
Emscripten.
(<code>[#3752](https://github.com/urllib3/urllib3/issues/3752)
<https://github.com/urllib3/urllib3/issues/3752></code>__)</li>
</ul>
<h1>2.6.2 (2025-12-11)</h1>
<ul>
<li>Fixed <code>HTTPResponse.read_chunked()</code> to properly handle
leftover data in
the decoder's buffer when reading compressed chunked responses.
(<code>[#3734](https://github.com/urllib3/urllib3/issues/3734)
<https://github.com/urllib3/urllib3/issues/3734></code>__)</li>
</ul>
<h1>2.6.1 (2025-12-08)</h1>
<ul>
<li>Restore previously removed <code>HTTPResponse.getheaders()</code>
and
<code>HTTPResponse.getheader()</code> methods.
(<code>[#3731](https://github.com/urllib3/urllib3/issues/3731)
<https://github.com/urllib3/urllib3/issues/3731></code>__)</li>
</ul>
<h1>2.6.0 (2025-12-05)</h1>
<h2>Security</h2>
<ul>
<li>Fixed a security issue where streaming API could improperly handle
highly
compressed HTTP content ("decompression bombs") leading to
excessive resource
consumption even when a small amount of data was requested. Reading
small
chunks of compressed data is safer and much more efficient now.
(<code>GHSA-2xpw-w6gg-jr37
<https://github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37></code>__)</li>
<li>Fixed a security issue where an attacker could compose an HTTP
response with
virtually unlimited links in the <code>Content-Encoding</code> header,
potentially
leading to a denial of service (DoS) attack by exhausting system
resources
during decoding. The number of allowed chained encodings is now limited
to 5.
(<code>GHSA-gm62-xv2j-4w53
<https://github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53></code>__)</li>
</ul>
<p>.. caution::</p>
<ul>
<li>If urllib3 is not installed with the optional
<code>urllib3[brotli]</code> extra, but
your environment contains a Brotli/brotlicffi/brotlipy package anyway,
make
sure to upgrade it to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 to
benefit from the security fixes and avoid warnings. Prefer using</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="0248277dd7"><code>0248277</code></a>
Release 2.6.3</li>
<li><a
href="8864ac407b"><code>8864ac4</code></a>
Merge commit from fork</li>
<li><a
href="70cecb27ca"><code>70cecb2</code></a>
Fix Scorecard issues related to vulnerable dev dependencies (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3755">#3755</a>)</li>
<li><a
href="41f249abe1"><code>41f249a</code></a>
Move "v2.0 Migration Guide" to the end of the table of
contents (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3747">#3747</a>)</li>
<li><a
href="fd4dffd2fc"><code>fd4dffd</code></a>
Patch <code>VerifiedHTTPSConnection</code> for Emscripten (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3752">#3752</a>)</li>
<li><a
href="13f0bfd55e"><code>13f0bfd</code></a>
Handle massive values in Retry-After when calculating time to sleep for
(<a
href="https://redirect.github.com/urllib3/urllib3/issues/3743">#3743</a>)</li>
<li><a
href="8c480bf87b"><code>8c480bf</code></a>
Bump actions/upload-artifact from 5.0.0 to 6.0.0 (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3748">#3748</a>)</li>
<li><a
href="4b40616e95"><code>4b40616</code></a>
Bump actions/cache from 4.3.0 to 5.0.1 (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3750">#3750</a>)</li>
<li><a
href="82b8479663"><code>82b8479</code></a>
Bump actions/download-artifact from 6.0.0 to 7.0.0 (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3749">#3749</a>)</li>
<li><a
href="34284cb017"><code>34284cb</code></a>
Mention experimental features in the security policy (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3746">#3746</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/urllib3/urllib3/compare/2.4.0...2.6.3">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/infiniflow/ragflow/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### What problem does this PR solve?
During figure enhancement, some cropped figure images are extremely
small. Sending these to the Image2Text/VLM provider fails with a 400
invalid_parameter_error because the image width/height must
be >10px. This aborts the enhancement step. This PR adds a minimal size
guard to skip tiny crops and continue processing.
<img width="1084" height="494" alt="image"
src="https://github.com/user-attachments/assets/ad074270-94e6-4571-91c8-37df85212639"
/>
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix: key error "content" #12844
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
When all fulltext_search_columns use explicit weight 0 (e.g. "col^0"),
weight_sum is 0 and dividing by it raises ZeroDivisionError. Use equal
weights 1/n when weight_sum <= 0 and n > 0; otherwise normalize as
before.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [x] Refactoring
### What problem does this PR solve?
Put document metadata in ES/Infinity.
Index name of meta data: ragflow_doc_meta_{tenant_id}
### Type of change
- [x] Refactoring