mirror of
https://github.com/infiniflow/ragflow.git
synced 2025-12-08 12:32:30 +08:00
Docs: Added token chunker and title chunker components (#10711)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
40
docs/guides/agent/agent_component_reference/chunker_title.md
Normal file
@@ -0,0 +1,40 @@
---
sidebar_position: 31
slug: /chunker_title_component
---

# Title chunker component

A component that splits text into chunks by heading level.

---

A **Title chunker** component is a text splitter that uses the specified heading levels as delimiters to define chunk boundaries and create chunks.

## Scenario

A **Title chunker** component is optional and is usually placed immediately after **Parser**.

:::caution WARNING
Placing a **Title chunker** after a **Token chunker** is invalid and will cause an error. This restriction is not currently enforced by the system, so you must check the component order yourself.
:::
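
Because the ordering rule above is not enforced by the system, a small pre-flight check can help catch it. The following is a minimal sketch that assumes the pipeline can be written down as an ordered list of component names; the function `title_after_token` and the list representation are hypothetical and are not part of RAGFlow.

```python
from typing import List


def title_after_token(components: List[str]) -> List[int]:
    """Return the positions of every Title chunker that appears
    anywhere downstream of a Token chunker (hypothetical helper)."""
    seen_token = False
    invalid_positions = []
    for index, name in enumerate(components):
        if name == "Token chunker":
            seen_token = True
        elif name == "Title chunker" and seen_token:
            invalid_positions.append(index)
    return invalid_positions


# A valid order yields no findings; an invalid one reports the offending index.
valid = ["Parser", "Title chunker", "Token chunker"]
invalid = ["Parser", "Token chunker", "Title chunker"]
print(title_after_token(valid))
print(title_after_token(invalid))
```

An empty result means no **Title chunker** follows a **Token chunker** in the listed order.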

## Configurations

### Hierarchy

Specifies the heading level to define chunk boundaries:

- H1
- H2
- H3 (Default)
- H4

Click **+ Add** to add heading levels here or update the corresponding **Regular Expressions** fields for custom heading patterns.
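
To illustrate how regular expressions can act as heading delimiters, here is a minimal sketch rather than the component's actual implementation. It assumes Markdown-style `#` headings; the patterns in `HEADING_PATTERNS` and the function `split_by_headings` are hypothetical. Any line that matches one of the patterns starts a new chunk.

```python
import re
from typing import List, Sequence

# Hypothetical patterns for Markdown-style headings, H1 through H3.
HEADING_PATTERNS = (r"^#\s+\S", r"^##\s+\S", r"^###\s+\S")


def split_by_headings(text: str, patterns: Sequence[str] = HEADING_PATTERNS) -> List[str]:
    """Start a new chunk at every line that matches one of the heading patterns."""
    compiled = [re.compile(pattern) for pattern in patterns]
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        starts_chunk = any(regex.match(line) for regex in compiled)
        if starts_chunk and current:
            # Close the chunk collected so far before opening a new one.
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "# A\nintro\n## B\nbody\n#### D\nmore"
for chunk in split_by_headings(sample):
    print(repr(chunk))
```

In this sketch the `#### D` line does not open a new chunk because no H4 pattern is listed, which mirrors the idea that only the configured heading levels define boundaries.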

### Output

The global variable name for the output of the **Title chunker** component, which can be referenced by subsequent components in the ingestion pipeline.

- Default: `chunks`
- Type: `Array<Object>`

@@ -3,15 +3,41 @@ sidebar_position: 32
slug: /chunker_token_component
---

# Parser component
# Token chunker component

A component that sets the parsing rules for your dataset.
A component that splits text into chunks, respecting a maximum token limit and using delimiters to find optimal breakpoints.

---

A **Parser** component defines how various file types should be parsed, including parsing methods for PDFs, fields to parse for Emails, and OCR methods for images.
A **Token chunker** component is a text splitter that creates chunks within a recommended maximum token length, using delimiters to place chunk breakpoints at logical positions. It splits long text into appropriately sized, semantically related chunks.

## Scenario

A **Parser** component is auto-populated on the ingestion pipeline canvas and required in all ingestion pipeline workflows.
A **Token chunker** component is optional and is usually placed immediately after **Parser** or **Title chunker**.

## Configurations

### Recommended chunk size

The recommended maximum token limit for each created chunk. The **Token chunker** component creates chunks at specified delimiters. If this token limit is reached before a delimiter, a chunk is created at that point.
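
The interplay between the token limit and the delimiters can be sketched as follows. This is one possible interpretation, not the component's actual implementation: it assumes whitespace-separated words stand in for tokens, packs delimiter-separated segments greedily up to the limit, and hard-splits any segment that exceeds the limit on its own. The names `count_tokens` and `chunk_by_tokens` are hypothetical.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (an assumption): one token per whitespace-separated word.
    return len(text.split())


def chunk_by_tokens(text: str, chunk_size: int, delimiter: str = "\n") -> List[str]:
    """Pack delimiter-separated segments into chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for segment in text.split(delimiter):
        words = segment.split()
        if not words:
            continue  # skip empty segments
        # A segment longer than the limit is cut at the limit itself.
        pieces = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
        for piece in pieces:
            piece_tokens = count_tokens(piece)
            if current and current_tokens + piece_tokens > chunk_size:
                chunks.append(delimiter.join(current))
                current, current_tokens = [], 0
            current.append(piece)
            current_tokens += piece_tokens
    if current:
        chunks.append(delimiter.join(current))
    return chunks


print(chunk_by_tokens("a b c\nd e\nf g h i j k\nl", chunk_size=4))
```

With a limit of 4, the six-word line is cut at the limit because no delimiter occurs earlier, while shorter lines are only joined when they fit together within the limit.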

### Overlapped percent (%)

Defines the percentage of overlap between adjacent chunks. An appropriate degree of overlap preserves semantic coherence across chunk boundaries without creating excessive redundant tokens for the LLM.

- Default: 0
- Maximum: 30%
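
To make the effect of the overlap setting concrete, here is a minimal sketch that assumes the overlap is measured as a percentage of the chunk size and that the text is already tokenized into a list. The function `sliding_chunks` is hypothetical and only illustrates the idea of a sliding window; it ignores delimiters.

```python
from typing import List


def sliding_chunks(tokens: List[str], chunk_size: int, overlap_percent: float = 0) -> List[List[str]]:
    """Cut tokens into windows of chunk_size that share overlap_percent of their tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_percent <= 30:
        raise ValueError("overlap_percent must be between 0 and 30")
    overlap = int(chunk_size * overlap_percent / 100)  # tokens shared with the previous chunk
    step = chunk_size - overlap                        # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


tokens = list("abcdefghij")
print(sliding_chunks(tokens, chunk_size=5, overlap_percent=20))
print(sliding_chunks(tokens, chunk_size=5))
```

With a chunk size of 5 and 20% overlap, each chunk repeats the last token of the previous one; with the default of 0, chunks do not share tokens.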

### Delimiters

Defaults to `\n`. Click the right-hand **Recycle bin** button to remove it, or click **+ Add** to add a delimiter.

### Output

The global variable name for the output of the **Token chunker** component, which can be referenced by subsequent components in the ingestion pipeline.

- Default: `chunks`
- Type: `Array<Object>`