**What problem does this PR solve?**
When loading JSON mapping/schema files, the code used
`json.load(open(path))` without closing the file. The file handle stayed
open until garbage collection, which can leak file descriptors under
load (e.g. repeated reconnects or migrations).
**Type of change**
- [x] Bug Fix (non-breaking change which fixes an issue)
**Change**
Replaced `json.load(open(...))` with a context manager so the file is
closed deterministically after loading.
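A minimal sketch of the pattern (the path and the assignment target are
illustrative placeholders):
```python
import json

fp_mapping = "conf/mapping.json"  # placeholder path

# Before (the handle stays open until garbage collection):
# mapping = json.load(open(fp_mapping, "r"))

# After: the context manager closes the file as soon as loading finishes.
with open(fp_mapping, "r") as f:
    mapping = json.load(f)
```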
**Files updated**
- `rag/utils/opensearch_conn.py` – mapping load (1 place)
- `common/doc_store/es_conn_base.py` – mapping load + doc_meta_mapping load
(2 places)
- `common/doc_store/infinity_conn_base.py` – schema loads in `_migrate_db`,
doc metadata table creation, and SQL field mapping (4 places)
Behavior is unchanged; only resource handling is fixed.
Co-authored-by: Gittensor Miner <miner@gittensor.io>
## Description
This PR implements comprehensive OceanBase performance monitoring and
health check functionality as requested in issue #12772. The
implementation follows the existing ES/Infinity health check patterns
and provides detailed metrics for operations teams.
## Problem
Currently, RAGFlow lacks detailed health monitoring for OceanBase when
used as the document engine. Operations teams need visibility into:
- Connection status and latency
- Storage space usage
- Query throughput (QPS)
- Slow query statistics
- Connection pool utilization
## Solution
### 1. Enhanced OBConnection Class (`rag/utils/ob_conn.py`)
Added comprehensive performance monitoring methods:
- `get_performance_metrics()` - Main method returning all performance
metrics
- `_get_storage_info()` - Retrieves database storage usage
- `_get_connection_pool_stats()` - Gets connection pool statistics
- `_get_slow_query_count()` - Counts queries exceeding threshold
- `_estimate_qps()` - Estimates queries per second
- Enhanced `health()` method with connection status
### 2. Health Check Utilities (`api/utils/health_utils.py`)
Added two new functions following ES/Infinity patterns:
- `get_oceanbase_status()` - Returns OceanBase status with health and
performance metrics
- `check_oceanbase_health()` - Comprehensive health check with detailed
metrics
### 3. API Endpoint (`api/apps/system_app.py`)
Added new endpoint:
- `GET /v1/system/oceanbase/status` - Returns OceanBase health status
and performance metrics (a hypothetical wiring sketch follows)
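A hypothetical sketch of the wiring, assuming the blueprint and helper
conventions (`manager`, `login_required`, `get_json_result`,
`server_error_response`) already used in `system_app.py`; this is not the
PR's exact code:
```python
@manager.route("/oceanbase/status", methods=["GET"])  # assumed route decorator
@login_required
def oceanbase_status():
    try:
        # check_oceanbase_health() is the new helper described above.
        return get_json_result(data=check_oceanbase_health())
    except Exception as e:
        return server_error_response(e)
```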
### 4. Comprehensive Unit Tests
(`test/unit_test/utils/test_oceanbase_health.py`)
Added 340+ lines of unit tests covering:
- Health check success/failure scenarios
- Performance metrics retrieval
- Error handling and edge cases
- Connection pool statistics
- Storage information retrieval
- QPS estimation
- Slow query detection
## Metrics Provided
- **Connection Status**: connected/disconnected
- **Latency**: Query latency in milliseconds
- **Storage**: Used and total storage space
- **QPS**: Estimated queries per second
- **Slow Queries**: Count of queries exceeding threshold
- **Connection Pool**: Active connections, max connections, pool size (an
illustrative payload shape is sketched below)
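An illustrative payload shape covering the metrics above (field names are
assumptions, not the PR's exact keys):
```python
metrics = {
    "status": "connected",          # connection status
    "latency_ms": 3.2,              # query latency in milliseconds
    "storage": {"used_mb": 512, "total_mb": 10240},
    "qps": 42.0,                    # estimated queries per second
    "slow_queries": 1,              # queries exceeding the threshold
    "connection_pool": {"active": 4, "max": 32, "pool_size": 8},
}
```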
## Testing
- All unit tests pass
- Error handling tested for connection failures
- Edge cases covered (missing tables, connection errors)
- Follows existing code patterns and conventions
## Code Statistics
- **Total Lines Changed**: 665+ lines
- **New Code**: ~600 lines
- **Test Coverage**: 340+ lines of comprehensive tests
- **Files Modified**: 3
- **Files Created**: 1 (test file)
## Acceptance Criteria Met
✅ `/system/oceanbase/status` API returns OceanBase health status
✅ Monitoring metrics accurately reflect OceanBase running status
✅ Clear error messages when health checks fail
✅ Response time optimized (metrics cached where possible)
✅ Follows existing ES/Infinity health check patterns
✅ Comprehensive test coverage
## Related Files
- `rag/utils/ob_conn.py` - OceanBase connection class
- `api/utils/health_utils.py` - Health check utilities
- `api/apps/system_app.py` - System API endpoints
- `test/unit_test/utils/test_oceanbase_health.py` - Unit tests
Fixes #12772
---------
Co-authored-by: Daniel <daniel@example.com>
## What problem does this PR solve?
This PR addresses three specific issues to improve agent reliability and
model support:
1. **`codeExec` Output Limitation**: Previously, the `codeExec` tool was
strictly limited to returning `string` types. I updated the output
constraint to `object` to support structured data (Dicts, Lists, etc.)
required for complex downstream tasks.
2. **`codeExec` Error Handling**: Improved the execution logic so that
when runtime errors occur, the tool captures the exception and returns
the error message as the output instead of causing the process to abort
or fail silently.
3. **Spark Model Configuration**:
- Added support for the `MAX-32k` model variant.
- Fixed the `Spark-Lite` mapping from `general` to `lite` to match the
latest API specifications.
## Type of change
- [x] Bug Fix (fixes execution logic and model mapping)
- [x] New Feature / Enhancement (adds model support and improves tool
flexibility)
## Key Changes
### `agent/tools/code_exec.py`
- Changed the output type definition from `string` to `object`.
- Refactored the execution flow to gracefully catch exceptions and
return error messages as part of the tool output (see the sketch below).
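A minimal sketch of the described behavior, with illustrative names (not
the tool's actual code):
```python
def execute(run_user_code, payload):
    """Run user code; on failure, return the error text as the output."""
    try:
        result = run_user_code(payload)  # may now be a dict/list, not only str
        return {"output": result}
    except Exception as e:
        # Surface the error as the tool output instead of aborting the flow.
        return {"output": f"{type(e).__name__}: {e}"}
```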
### `rag/llm/chat_model.py`
- Added `"Spark-Max-32K": "max-32k"` to the model list.
- Updated `"Spark-Lite"` value from `"general"` to `"lite"`.
## Checklist
- [x] My code follows the style guidelines of this project.
- [x] I have performed a self-review of my own code.
Signed-off-by: evilhero <2278596667@qq.com>
### What problem does this PR solve?
During figure enhancement, some cropped figure images are extremely
small. Sending these to the Image2Text/VLM provider fails with a 400
invalid_parameter_error because the image width/height must
be >10px. This aborts the enhancement step. This PR adds a minimal size
guard to skip tiny crops and continue processing.
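A minimal sketch of the size guard, assuming PIL-style images (the
constant and function names are illustrative):
```python
MIN_SIDE_PX = 10  # the VLM provider rejects images with width/height <= 10px

def is_too_small(image) -> bool:
    """True when a cropped figure is too small to send to the Image2Text model."""
    width, height = image.size  # PIL.Image.Image exposes (width, height)
    return width <= MIN_SIDE_PX or height <= MIN_SIDE_PX
```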
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
When all fulltext_search_columns use explicit weight 0 (e.g. "col^0"),
weight_sum is 0 and dividing by it raises ZeroDivisionError. Use equal
weights 1/n when weight_sum <= 0 and n > 0; otherwise normalize as
before.
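A minimal sketch of the guard, assuming `weights` holds the per-column
weights parsed from `fulltext_search_columns`:
```python
def normalize_weights(weights: list[float]) -> list[float]:
    n = len(weights)
    if n == 0:
        return []
    weight_sum = sum(weights)
    if weight_sum <= 0:
        # All explicit weights were 0 (e.g. "col^0"): fall back to equal weights.
        return [1.0 / n] * n
    return [w / weight_sum for w in weights]  # normalize as before
```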
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [x] Refactoring
### What problem does this PR solve?
Put document metadata in ES/Infinity. The metadata is stored in a
dedicated index named `ragflow_doc_meta_{tenant_id}`.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Added Redis port calculation and environment variable export to support
the Redis service in the test environment. The port is dynamically
assigned based on the runner number to prevent conflicts during parallel
test execution. (This logic had been removed by #12685.)
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fixes parent chunking failures on DOCX files.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Introduced a helper method _check_task_canceled to centralize and
simplify task cancellation checks throughout
RecursiveAbstractiveProcessing4TreeOrganizedRetrieval. This reduces code
duplication and improves maintainability.
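A hypothetical sketch of the helper (the cancellation lookup and names
are assumptions, not the PR's exact code):
```python
import logging

def _check_task_canceled(self, task_id: str) -> bool:
    """Single place to test for cancellation instead of repeating the check."""
    if has_canceled(task_id):  # assumed existing cancellation lookup
        logging.info("Task %s canceled; stopping tree construction.", task_id)
        return True
    return False
```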
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Aliyun OSS does not support boto's s3v4 signature version, which leads to
an error:
```
botocore.exceptions.ClientError: An error occurred (InvalidArgument) when calling the PutObject operation: aws-chunked encoding is not supported with the specified x-amz-content-sha256 value
```
According to the Aliyun OSS docs, oss_conn needs to use the s3
signature version.
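A minimal sketch of the client configuration (endpoint and credentials
are placeholders); `signature_version="s3"` is a real botocore option:
```python
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://oss-cn-hangzhou.aliyuncs.com",  # placeholder region
    aws_access_key_id="<access_key_id>",
    aws_secret_access_key="<access_key_secret>",
    config=Config(signature_version="s3"),  # legacy signer accepted by OSS
)
```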
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
This PR fixes a `KeyError` crash when running RAPTOR tasks on documents
that don't have the expected vector field.
## Related Issue
Fixes https://github.com/infiniflow/ragflow/issues/12675
## Problem
When running RAPTOR tasks, the code assumes all chunks have the vector
field `q_<size>_vec` (e.g., `q_1024_vec`). However, chunks may not have
this field if:
1. They were indexed with a **different embedding model** (different
vector size)
2. The embedding step **failed silently** during initial parsing
3. The document was parsed before the current embedding model was
configured
This caused a crash:
```
KeyError: 'q_1024_vec'
```
## Solution
Added defensive validation in `run_raptor_for_kb()`:
1. **Check for vector field existence** before accessing it
2. **Skip chunks** that don't have the required vector field instead of
crashing
3. **Log warnings** for skipped chunks with actionable guidance
4. **Provide informative error messages** suggesting users re-parse
documents with the current embedding model
5. **Handle both scopes** (`file` and `kb` modes); a minimal sketch of
the guard follows
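A minimal sketch of the guard (the field name comes from the issue; the
loop and variable names are illustrative):
```python
import logging

vector_size = 1024  # from the current embedding model
vctr_field = f"q_{vector_size}_vec"  # e.g. "q_1024_vec"

usable_chunks = []
for chunk in chunks:
    if vctr_field not in chunk:
        logging.warning(
            "Skipping chunk %s: missing %s. Re-parse the document with the "
            "current embedding model to regenerate vectors.",
            chunk.get("id"), vctr_field,
        )
        continue
    usable_chunks.append(chunk)
```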
## Changes
- `rag/svr/task_executor.py`: Added validation and error handling in
`run_raptor_for_kb()`
## Testing
1. Create a knowledge base with an embedding model
2. Parse documents
3. Change the embedding model to one with a different vector size
4. Run RAPTOR task
5. **Before**: Crashes with `KeyError`
6. **After**: Gracefully skips incompatible chunks with informative
warnings
---
Co-authored-by: GlobalStar117 <GlobalStar117@users.noreply.github.com>
### What problem does this PR solve?
1) Create dataset using table parser for infinity
2) Answer questions in chat using SQL
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
In paragraph() of class FulltextQueryer, "len(keywords) / 10" should be
rounded to an integer before being set as minimum_should_match.
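A minimal sketch, assuming `keywords` is the token list used by
`paragraph()`:
```python
minimum_should_match = int(len(keywords) / 10)  # e.g. 27 keywords -> 2
```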
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Skip duplicate errors to avoid 'create_idx' failures caused by slow
metadata refresh or external modifications.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fixes Infinity-specific API regressions: preserves `important_kwd`
round-trip for `[""]`, restores the required highlight key in retrieval
responses, and enforces Infinity guards for unsupported
`parser_id=tag` and pagerank in `/v1/kb/update`. Also removes a
slow/buggy pandas row-wise apply that was throwing `ValueError` and
causing flakiness.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Problem
The `important_kwd` field in the Infinity connector was using mismatched
separators:
- **Storage**: `list2str(v)` uses space as the default separator
- **Reading**: `v.split()` splits on all whitespace
This causes multi-word keywords like `"Senior Fund Manager"` to be
incorrectly split into `["Senior", "Fund", "Manager"]`.
## Solution
Use comma `,` as the separator for both storing and reading, consistent
with:
1. The LLM output format in `keyword_prompt.md` ("delimited by
ENGLISH COMMA")
2. The `cached.split(",")` in `task_executor.py`
## Changes
- `insert()`: `list2str(v)` → `list2str(v, ",")`
- `update()`: `list2str(v)` → `list2str(v, ",")`
- `get_fields()`: `v.split()` → `v.split(",") if v else []` (round-trip
sketched below)
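A small round-trip check of the new separator handling, assuming
`list2str(v, ",")` joins with the given separator:
```python
kwds = ["Senior Fund Manager", "Equity"]
stored = ",".join(kwds)                       # what list2str(v, ",") produces
restored = stored.split(",") if stored else []
assert restored == kwds                       # multi-word keywords survive intact
```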
## Impact
This bug affects:
- Python-level reranking weight calculation (`important_kwd * 5`)
- API response keyword display
- Search precision due to fragmented keywords
## Summary
Fixes #12520 - Deleted chunks should not appear in retrieval/reference
results.
## Changes
### Core Fix
- **api/apps/chunk_app.py**: Include `doc_id` in the delete condition to
properly scope the delete operation (a hypothetical sketch follows)
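A hypothetical sketch of the scoped condition (variable and argument
names are illustrative, not the doc-store layer's exact signature):
```python
# Pin the parent document so the delete cannot touch chunks of other docs.
condition = {"id": chunk_ids, "doc_id": doc_id}
doc_store_conn.delete(condition, index_name, kb_id)
```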
### Improved Error Handling
- **api/db/services/document_service.py**: Better separation of concerns
with individual try-catch blocks and proper logging for each cleanup
operation
### Doc Store Updates
- **rag/utils/es_conn.py**: Updated delete query construction to support
compound conditions
- **rag/utils/opensearch_conn.py**: Same updates for OpenSearch
compatibility
### Tests
- **test/testcases/.../test_retrieval_chunks.py**: Added
`TestDeletedChunksNotRetrievable` class with regression tests
- **test/unit/test_delete_query_construction.py**: Unit tests for delete
query construction
## Testing
- Added regression tests that verify deleted chunks are not returned by
retrieval API
- Tests cover single chunk deletion and batch deletion scenarios
### What problem does this PR solve?
Fix regex pattern validation in split_with_pattern (#12605)
- Add try-except block to validate user-provided regex patterns before
use
- Gracefully fall back to a single chunk when an invalid regex is provided
- Prevent server crash during DOCX parsing with malformed delimiters
## Problem
Parsing DOCX files with custom regex delimiters crashes with `re.error:
nothing to repeat at position 9` when users provide invalid regex
patterns.
Closes #12605
## Solution
Validate and compile regex pattern before use. On invalid pattern, log
warning and return content as single chunk instead of crashing.
## Changes
- `rag/nlp/__init__.py`: Add regex validation in the
`split_with_pattern()` function (sketched below)
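A minimal sketch of the validation, assuming `pattern` is the
user-provided delimiter:
```python
import logging
import re

def split_with_pattern(content: str, pattern: str) -> list[str]:
    try:
        compiled = re.compile(pattern)
    except re.error as e:
        logging.warning("Invalid regex delimiter %r (%s); returning a single chunk.",
                        pattern, e)
        return [content]  # fall back instead of crashing the parser
    return compiled.split(content)
```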
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: Hash doc id to avoid duplicate name.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fixes #12604 - DOCX files containing hyperlinks to internal bookmarks
(e.g., `#_文档目录`) cause a `KeyError` during parsing:
```
KeyError: "There is no item named 'word/#_文档目录' in the archive"
```
This happens because python-docx incorrectly tries to read internal
bookmark references as files from the ZIP archive. Internal bookmarks
are relationship targets starting with `#` and are not actual files.
This PR extends the existing `load_from_xml_v2` workaround (which
already handles `NULL` targets) to also skip relationship targets
starting with `#`.
Related upstream issue:
https://github.com/python-openxml/python-docx/issues/902
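An illustrative predicate for the filter that `load_from_xml_v2` applies
(the function name is hypothetical):
```python
def _is_loadable_target(target: str) -> bool:
    """Keep only relationship targets that are real parts in the ZIP archive."""
    # "NULL" targets were already skipped; internal bookmarks start with "#".
    return bool(target) and target != "NULL" and not target.startswith("#")
```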
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---
### Issue
When using Qwen3 models (`qwen3-32b`, `qwen3-max`) through the
Tongyi-Qianwen provider for non-streaming calls (e.g., knowledge graph
generation), the API fails with the following error (Closes #12424):
```
parameter.enable_thinking must be set to false for non-streaming calls
```
### Root Cause
In `LiteLLMBase.async_chat()`, the `extra_body={"enable_thinking":
False}` was set in `kwargs` but never forwarded to
`_construct_completion_args()`.
### What problem does this PR solve?
Pass merged kwargs to `_construct_completion_args()` using
`**{**gen_conf, **kwargs}` to safely handle potential duplicate
parameters.
### Changes
- `rag/llm/chat_model.py`: Forward kwargs containing `extra_body` to
`_construct_completion_args()` in `async_chat()` (merge semantics
sketched below)
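A minimal demonstration of the merge semantics: later keys win, so values
in `kwargs` (e.g. `extra_body`) override duplicates from `gen_conf`:
```python
gen_conf = {"temperature": 0.7}
kwargs = {"extra_body": {"enable_thinking": False}}
merged = {**gen_conf, **kwargs}  # what **{**gen_conf, **kwargs} expands from
assert merged == {"temperature": 0.7, "extra_body": {"enable_thinking": False}}
```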
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR eliminates unnecessary debug print statements that were left in
hot paths of the codebase.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
When there are multiple users, parsing a document for a new user can
trigger the reuse of column objects, leading to the error
`sqlalchemy.exc.ArgumentError: Column object 'id' already assigned to
Table xxx`.
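A minimal reproduction of the error class and the fix pattern, not
RAGFlow's actual code: a `Column` instance can belong to exactly one
`Table`, so each table must get fresh `Column` objects:
```python
from sqlalchemy import Column, Integer, MetaData, Table

def make_columns():
    # Build fresh Column objects per table instead of sharing instances.
    return [Column("id", Integer, primary_key=True)]

meta = MetaData()
t1 = Table("user_a_docs", meta, *make_columns())
t2 = Table("user_b_docs", meta, *make_columns())  # no ArgumentError
```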
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix image not displaying thumbnails when using pipeline.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. PaddleOCR PDF parser supports thumbnails and positions.
2. Add FAQ documentation for PaddleOCR PDF parser.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Refactors the following servers:
- API server
- Ingestion server
- Data sync server
- Admin server
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Add PaddleOCR as a new PDF parser.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
If we delete the password from kwargs, the function `init_db_config`
will fail, so we need to keep this field.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: support context window for docx
#12303
Done:
- [x] naive.py
- [x] one.py
TODO:
- [ ] book.py
- [ ] manual.py
Fix: incorrect image position
Fix: incorrect chunk type tag
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)