Mirror of https://github.com/infiniflow/ragflow.git, synced 2025-12-08 20:42:30 +08:00
Docs: Added token chunker and title chunker components (#10711)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
@@ -3,15 +3,41 @@ sidebar_position: 32
slug: /chunker_token_component
---
# Parser component
# Token chunker component
A component that sets the parsing rules for your dataset.
A component that splits texts into chunks, respecting a maximum token limit and using delimiters to find optimal breakpoints.

---
A **Parser** component defines how various file types should be parsed, including parsing methods for PDFs, fields to parse for Emails, and OCR methods for images.
A **Token chunker** component is a text splitter that creates chunks by respecting a recommended maximum token length, using delimiters to ensure logical chunk breakpoints. It splits long texts into appropriately sized, semantically related chunks.
## Scenario
A **Parser** component is auto-populated on the ingestion pipeline canvas and required in all ingestion pipeline workflows.
A **Token chunker** component is optional, usually placed immediately after **Parser** or **Title chunker**.
## Configurations
### Recommended chunk size
The recommended maximum token limit for each created chunk. The **Token chunker** component creates chunks at specified delimiters. If this token limit is reached before a delimiter, a chunk is created at that point.
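The behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the component's actual implementation: `token_chunk` and `count_tokens` are hypothetical names, tokens are approximated as whitespace-separated words, and pieces between delimiters are merged greedily up to the limit, which is one plausible reading of the description.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def token_chunk(text: str, max_tokens: int = 512,
                delimiters: tuple[str, ...] = ("\n",)) -> list[str]:
    """Hypothetical sketch: break at delimiters, hard-split when the limit comes first.

    `delimiters` must contain at least one non-empty string.
    """
    # Split at delimiters, keeping each delimiter attached to the piece before it.
    pattern = "(" + "|".join(re.escape(d) for d in delimiters) + ")"
    parts = re.split(pattern, text)
    pieces = ["".join(parts[i:i + 2]) for i in range(0, len(parts), 2)]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if not piece.strip():
            continue
        words = piece.split()
        if len(words) > max_tokens:
            # The limit is reached before the next delimiter: flush what we have,
            # then cut the oversized piece every max_tokens words.
            if current:
                chunks.append(current.strip())
                current = ""
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        if current and count_tokens(current) + len(words) > max_tokens:
            # Adding this piece would exceed the limit: close the chunk at the delimiter.
            chunks.append(current.strip())
            current = ""
        current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

For example, `token_chunk("a b\nc d\ne", max_tokens=3)` closes the first chunk at the newline after `a b`, because adding `c d` would exceed three tokens.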
### Overlapped percent (%)
This defines the overlap percentage between chunks. An appropriate degree of overlap ensures semantic coherence without creating excessive, redundant tokens for the LLM.
- Default: 0
- Maximum: 30%
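To make the overlap setting concrete, here is a hedged Python sketch. It assumes the percentage is measured against the chunk size and that the tail of one chunk is repeated at the head of the next; the component's exact rule may differ. `overlapped_windows` is a hypothetical helper, not part of RAGFlow.

```python
def overlapped_windows(tokens: list[str], chunk_size: int,
                       overlap_percent: int = 0) -> list[list[str]]:
    """Hypothetical sketch: fixed-size token windows sharing a percentage of tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_percent <= 30:
        # Mirrors the documented range: default 0, maximum 30.
        raise ValueError("overlap_percent must be between 0 and 30")

    overlap = chunk_size * overlap_percent // 100  # tokens shared by neighbours
    step = chunk_size - overlap                    # always >= 1 for percent <= 30

    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows
```

With ten tokens, a chunk size of 5, and 20% overlap, each window starts four tokens after the previous one, so consecutive windows share exactly one token. A higher percentage preserves more context across chunk boundaries at the cost of more repeated tokens.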
### Delimiters
Defaults to `\n`. Click the right-hand **Recycle bin** button to remove it, or click **+ Add** to add a delimiter.
### Output
The global variable name for the output of the **Token chunker** component, which can be referenced by subsequent components in the ingestion pipeline.
- Default: `chunks`
- Type: `Array<Object>`