### What problem does this PR solve?
Feat: optimize aws s3 connector #12008
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Guard Dashscope response attribute access in token/log utils, since
`dashscope_response` returns dict like object.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Potential fix for
[https://github.com/infiniflow/ragflow/security/code-scanning/59](https://github.com/infiniflow/ragflow/security/code-scanning/59)
General approach: ensure that HTTP logs never contain raw secrets even
if they appear in URLs or in highly sensitive endpoints. There are two
complementary strategies: (1) for clearly sensitive endpoints (e.g.,
OAuth token URLs), completely suppress URL logging; and (2) ensure that
any URL that is logged is strongly redacted for any parameter name that
might carry a secret, and in a way that static analysis can see is a
dedicated sanitization step.
Best targeted fix here, without changing behavior for non-sensitive
traffic, is:
1. Strengthen the `_SENSITIVE_QUERY_KEYS` set to include any likely
secret-bearing keys (e.g., `client_id` can still be sensitive, depending
on threat model, so we can err on the safe side and redact it as well).
2. Ensure `_is_sensitive_url` (in `common/http_client.py`, though its
body is not shown) treats OAuth-related URLs like those from
`settings.GITHUB_OAUTH` and `settings.FEISHU_OAUTH` as sensitive and
thus disables URL logging. Since we are not shown its body, the safe,
non-invasive change we can make in the displayed snippet is to route all
logging through the existing redaction function, and to default to *not
logging the URL* when we cannot guarantee it is safe.
3. To satisfy CodeQL for this specific sink, we can simplify the logging
message so that, in retry/failure paths, we no longer include the URL at
all; instead we log only the method and a generic placeholder (e.g.,
`"async_request attempt ... failed; retrying..."`). This fully removes
the tainted URL from the sink and addresses all alert variants for that
logging statement, while preserving useful operational information
(method, attempt index, delay).
Concretely, in `common/http_client.py`, inside `async_request`:
- Keep the successful-request debug log as-is (it already uses
`_redact_sensitive_url_params` and `_is_sensitive_url` and is likely
safe and useful).
- In the `except httpx.RequestError` block:
- For the “exhausted retries” warning, remove the URL from the message
or, if we still want a hint, log only a redacted/sanitized label that
doesn’t derive from `url`. The simplest is to omit the URL entirely.
- For the per-attempt failure warning (line 162), similarly remove
`log_url` (and thus any use of `url`) from the formatted message so that
the sink no longer contains tainted data.
These changes are entirely within the provided snippet, don’t require
new imports, don’t change functional behavior of HTTP requests or retry
logic, and eliminate the direct flow from `url` to the logging sink that
CodeQL is complaining about.
---
_Suggested fixes powered by Copilot Autofix. Review carefully before
merging._
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Potential fix for
[https://github.com/infiniflow/ragflow/security/code-scanning/58](https://github.com/infiniflow/ragflow/security/code-scanning/58)
General approach: avoid logging potentially sensitive URLs (especially
at warning level) or ensure they are fully and robustly redacted before
logging. Since this client is shared and used with OAuth endpoints, the
safest minimal-change fix is to stop including the URL in warning logs
(retries exhausted and retry attempts) and only log the HTTP method and
a generic message. Debug logs can continue using the existing redaction
helper for non-sensitive URLs if desired.
Best concrete fix without changing functionality: in
`common/http_client.py`, in `async_request`, change the retry-exhausted
and retry-attempt warning log statements so that they no longer
interpolate `log_url` (and thus the tainted `url`). We can still compute
`log_url` if needed elsewhere, but the log string itself should not
contain `log_url`. This directly removes the tainted data from the sink
while preserving information about errors and retry behavior. No changes
are required in `common/settings.py` or `api/apps/user_app.py`, and we
do not need new imports or helpers.
Specifically:
- In `common/http_client.py`, around line 152–163, replace the two
warning logs:
- `logger.warning(f"async_request exhausted retries for {method}
{log_url}")`
- `logger.warning(f"async_request attempt {attempt + 1}/{retries + 1}
failed for {method} {log_url}; retrying in {delay:.2f}s")`
with versions that omit `{log_url}`, such as:
- `logger.warning(f"async_request exhausted retries for {method}")`
- `logger.warning(f"async_request attempt {attempt + 1}/{retries + 1}
failed for {method}; retrying in {delay:.2f}s")`
This ensures no URL-derived data flows into these warning logs,
addressing all variants of the alert, since they all trace to the same
sink.
---
_Suggested fixes powered by Copilot Autofix. Review carefully before
merging._
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Potential fix for
[https://github.com/infiniflow/ragflow/security/code-scanning/57](https://github.com/infiniflow/ragflow/security/code-scanning/57)
In general, the safest fix is to ensure that any logging of request URLs
from `async_request` (and similar helpers) cannot include secrets. This
can be done by (a) suppressing logging entirely for URLs considered
sensitive, or (b) logging only a non-sensitive subset (e.g., scheme +
host + path) and never query strings or credentials.
The minimal, backward-compatible change here is to strengthen
`_redact_sensitive_url_params` and `_is_sensitive_url` / the logging
call so that we never log query parameters at all. Instead of logging
the full URL (with redacted query), we can log only
`scheme://netloc/path` and optionally strip userinfo. This retains
useful observability (which endpoint, which method, response code,
timing) while guaranteeing that no secrets in query strings or path
segments appear in logs. Concretely:
- Update `_redact_sensitive_url_params` to *not* include the query
string in the returned value, and to drop any embedded userinfo
(`username:password@host`).
- Continue to wrap logging in a “sensitive URL” guard, but now the
redaction routine itself ensures no secrets from query are present.
- Leave callers (e.g., `github_callback`, `feishu_callback`) unchanged,
since they only pass URLs and do not control the logging behavior
directly.
All changes are confined to `common/http_client.py` inside the provided
snippet. No new imports are necessary.
_Suggested fixes powered by Copilot Autofix. Review carefully before
merging._
---------
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
### What problem does this PR solve?
When there are multiple files with the same name the file would just
duplicate, making it hard to distinguish between the different files.
Now if there are multiple files with the same name, they will be named
after their folder path in the storage unit.
This was done for the webdav connector and with this PR also for Notion,
Confluence and S3 Storage.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Contribution by RAGcon GmbH, visit us [here](https://www.ragcon.ai/)
### What problem does this PR solve?
- Add license
- Fix IDE warnings
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Only support MinerU-API now, still need to complete frontend for
pipeline to allow the configuration of MinerU options.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Support uploading encrypted files to object storage.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: virgilwong <hyhvirgil@gmail.com>
### What problem does this PR solve?
Refactor metadata filter.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
change:
async issue and sensitive logging
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Manage and display memory datasets.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Treat MinerU as an OCR model.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
## What's changed
fix: unify embedding model fallback logic for both TEI and non-TEI
Docker deployments
> This fix targets **Docker / `docker-compose` deployments**, ensuring a
valid default embedding model is always set—regardless of the compose
profile used.
## Changes
| Scenario | New Behavior |
|--------|--------------|
| **Non-`tei-` profile** (e.g., default deployment) | `EMBEDDING_MDL` is
now correctly initialized from `EMBEDDING_CFG` (derived from
`user_default_llm`), ensuring custom defaults like `bge-m3@Ollama` are
properly applied to new tenants. |
| **`tei-` profile** (`COMPOSE_PROFILES` contains `tei-`) | Still
respects the `TEI_MODEL` environment variable. If unset, falls back to
`EMBEDDING_CFG`. Only when both are empty does it use the built-in
default (`BAAI/bge-small-en-v1.5`), preventing an empty embedding model.
|
## Why This Change?
- **In non-TEI mode**: The previous logic would reset `EMBEDDING_MDL` to
an empty string, causing pre-configured defaults (e.g., `bge-m3@Ollama`
in the Docker image) to be ignored—leading to tenant initialization
failures or silent misconfigurations.
- **In TEI mode**: Users need the ability to override the model via
`TEI_MODEL`, but without a safe fallback, missing configuration could
break the system. The new logic adopts a **“config-first,
env-var-override”** strategy for robustness in containerized
environments.
## Implementation
- Updated the assignment logic for `EMBEDDING_MDL` in
`rag/common/settings.py` to follow a unified fallback chain:
EMBEDDING_CFG → TEI_MODEL (if tei- profile active) → built-in default
## Testing
Verified in Docker deployments:
1. **`COMPOSE_PROFILES=`** (no TEI)
→ New tenants get `bge-m3@Ollama` as the default embedding model
2. **`COMPOSE_PROFILES=tei-gpu` with no `TEI_MODEL` set**
→ Falls back to `BAAI/bge-small-en-v1.5`
3. **`COMPOSE_PROFILES=tei-gpu` with `TEI_MODEL=my-model`**
→ New tenants use `my-model` as the embedding model
Closes#8916fix#11522fix#11306
### What problem does this PR solve?
Our HTTP wrapper still passed proxies to httpx.Client/AsyncClient, which
expect proxy. As a result, configured proxies were ignored and calls
could fail with ValueError("Failed to fetch OIDC metadata:
Client.__init__() got an unexpected keyword argument 'proxies'"). This
PR switches to the correct proxy kwarg so proxies are honored and the
runtime error is resolved.
### Type of change
- [X] Bug Fix (non-breaking change which fixes an issue)
---
Contribution during my time at RAGcon GmbH.
### What problem does this PR solve?
- typos
- IDE warnings
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
When there are multiple files with the same name the file would just
duplicate, making it hard to distinguish between the different files.
Now if there are multiple files with the same name, they will be named
after their folder path in the webdav storage unit.
The same could be done for the other connectors, too, since most of them
will have similars issues, when iterating through the folder paths.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Contribution by RAGcon GmbH, visit us [here](https://www.ragcon.ai/)
### What problem does this PR solve?
This Pull Request introduces native support for Google Cloud Storage
(GCS) as an optional object storage backend.
Currently, RAGFlow relies on a limited set of storage options. This
feature addresses the need for seamless integration with GCP
environments, allowing users to leverage a fully managed, highly
durable, and scalable storage service (GCS) instead of needing to deploy
and maintain third-party object storage solutions. This simplifies
deployment, especially for users running on GCP infrastructure like GKE
or Cloud Run.
The implementation uses a single GCS bucket defined via configuration,
mapping RAGFlow's internal logical storage units (or "buckets") to
folder prefixes within that GCS container to maintain data separation.
This architectural choice avoids the operational complexities associated
with dynamically creating and managing unique GCS buckets for every
logical unit.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Rename function and refactor log message
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Add metadata from moodle data source.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: add mineru auto installer
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Try to make this more asynchronous. Verified in chat and agent
scenarios, reducing blocking behavior. #11551, #11579.
However, the impact of these changes still requires further
investigation to ensure everything works as expected.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
To solve the problem of error reporting caused by type errors when
various types of exception returns are triggered
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
- Update sync data source to handle metadata properly
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
This PR adds webdav storage as data source for data sync service.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### Type of change
* [x] New Feature (non-breaking change which adds functionality)
Add support for Virtual Hosted Style and Path Style URL addressing in
S3_COMPATIBLE storage connector. Default to Virtual Hosted Style for
better compatibility with COS and other S3-compatible services.
- Add addressing_style field to credentials (virtual/path)
- Update frontend form with selection dropdown
- Add validation and tooltips for S3 Compatible endpoint URL
<img width="703" height="875" alt="image"
src="https://github.com/user-attachments/assets/af5ba7ca-f160-47fa-8ba1-32eace8f5fdf"
/>
<img width="1620" height="788" alt="image"
src="https://github.com/user-attachments/assets/6012b5ce-8bcb-478e-a9cb-425f886d5046"
/>
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
This PR adds a native Moodle connector to sync content (courses,
resources, forums, assignments, pages, books) into RAGFlow.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Enriches rich text (links, mentions, equations), flags to-do blocks with
[x]/[ ], captures block-level equations, builds table HTML, downloads
attachments.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fixes#10933
This PR fixes a `TypeError` in the Gemini model provider where the
`total_token_count_from_response()` function could receive a `None`
response object, causing the error:
TypeError: argument of type 'NoneType' is not iterable
**Root Cause:**
The function attempted to use the `in` operator to check dictionary keys
(lines 48, 54, 60) without first validating that `resp` was not `None`.
When Gemini's `chat_streamly()` method returns `None`, this triggers the
error.
**Solution:**
1. Added a null check at the beginning of the function to return `0` if
`resp is None`
2. Added `isinstance(resp, dict)` checks before all `in` operations to
ensure type safety
3. This defensive programming approach prevents the TypeError while
maintaining backward compatibility
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Changes Made
**File:** `rag/utils/__init__.py`
- Line 36-38: Added `if resp is None: return 0` check
- Line 52: Added `isinstance(resp, dict)` before `'usage' in resp`
- Line 58: Added `isinstance(resp, dict)` before `'usage' in resp`
- Line 64: Added `isinstance(resp, dict)` before `'meta' in resp`
### Testing
- [x] Code compiles without errors
- [x] Follows existing code style and conventions
- [x] Change is minimal and focused on the specific issue
### Additional Notes
This fix ensures robust handling of various response types from LLM
providers, particularly Gemini, w
---------
Signed-off-by: Zhang Zhefang <zhangzhefang@example.com>
### What problem does this PR solve?
Add OceanBase doc engine. Close#5350
### Type of change
- [x] New Feature (non-breaking change which adds functionality)