### What problem does this PR solve?
When there are multiple users, parsing a document for a new user can
trigger the reuse of column objects, leading to the error
`sqlalchemy.exc.ArgumentError: Column object 'id' already assigned to
Table xxx`.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix image not displaying thumbnails when using pipeline.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. PaddleOCR PDF parser supports thumnails and positions.
2. Add FAQ documentation for PaddleOCR PDF parser.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- API server
- Ingestion server
- Data sync server
- Admin server
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Add PaddleOCR as a new PDF parser.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
If we delete the password in kwargs, func 'init_db_config' will fail, so
we need to keep this field.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: support context window for docx
#12303
Done:
- [x] naive.py
- [x] one.py
TODO:
- [ ] book.py
- [ ] manual.py
Fix: incorrect image position
Fix: incorrect chunk type tag
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- Fixes the health check failure in multi-bucket MinIO environments.
Previously, health checks would fail because the default
"ragflow-bucket" did not exist. This caused false negatives for system
health.
- Also removes the _health_check write in single-bucket mode to avoid
side effects (minor optimization).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Refactor TOC building logic to use enumerate instead of while loop, add
comprehensive error handling for missing/invalid chunk_id values, and
improve logging with more specific error messages. The changes make the
code more robust against malformed TOC data while maintaining the same
functionality for valid inputs.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
PDF vision figure parser supports reading context.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Improve task executor heartbeat handling and cleanup.
### What problem does this PR solve?
- **Reduce lock contention during executor cleanup**: The cleanup lock
is acquired only when removing expired executors, not during regular
heartbeat reporting, reducing potential lock contention.
- **Optimize own heartbeat cleanup**: Each executor removes its own
expired heartbeat using `zremrangebyscore` instead of `zcount` +
`zpopmin`, reducing Redis operations and improving efficiency.
- **Improve cleanup of other executors' heartbeats**: Expired executors
are detected by checking their latest heartbeat, and stale entries are
removed safely.
- **Other improvements**: IP address and PID are captured once at
startup, and unnecessary global declarations are removed.
### Type of change
- [x] Performance Improvement
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Eliminates SQL injection vectors in the OpenDAL MySQL initialization
logic by implementing strict input validation and explicit type casting.
**Modifications:**
1. **`init_db_config`**: Enforced integer casting for
`max_allowed_packet` before formatting it into the SQL string.
2. **`init_opendal_mysql_table`**: Implemented regex-based validation
for `table_name` to ensure only alphanumeric characters and underscores
are permitted, preventing arbitrary SQL command injection through
configuration parameters.
These changes ensure that even if configuration values are sourced from
untrusted environments, the database initialization remains secure.
### What problem does this PR solve?
Feat: Bitbucket connector NOT READY TO MERGE
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
issue:
#12313
change:
add Zendesk data source integration with configuration and sync
capabilities
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- Simplified and consolidated extraction rules
- Emphasized strict evidence-based extraction only
- Strengthened enum handling and hallucination prevention
- Clarified output requirements for empty results
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
issue:
#12217 [#12313](https://github.com/infiniflow/ragflow/issues/12313)
change:
add IMAP data source integration with configuration and sync
capabilities
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Use async task to save memory.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Feat: Gitlab connector
Fix: submit button in darkmode
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
change: Add Asana data source integration and configuration options
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Improve image table context.
Current strategy in attach_media_context:
- Order by position when possible: if any chunk has page/position info,
sort by (page, top, left), otherwise keep original order.
- Apply only to media chunks: images use image_context_size, tables use
table_context_size.
- Primary matching: on the same page, choose a text chunk whose vertical
span overlaps the media, then pick the one with the closest vertical
midpoint.
- Fallback matching: if no overlap on that page, choose the nearest text
chunk on the same page (page-head uses the next text; page-tail uses the
previous text).
- Context extraction: inside the chosen text chunk, find a mid-sentence
boundary near the text midpoint, then take context_size tokens split
before/after (total budget).
- No multi-chunk stitching: context comes from a single text chunk to
avoid mixing unrelated segments.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Manage message and use in agent.
Issue #4213
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
change:
add Airtable connector and integration for data synchronization
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add image table context to pipeline splitter.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fix LLM tool does not exist in multiple retrieval case
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
pr:#12117
change:remove duplicate tool_meta
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)