Docs: Added token chunker and title chunker components (#10711)

### What problem does this PR solve?

### Type of change

- [x] Documentation Update
This commit is contained in:
writinwaters
2025-10-21 20:11:23 +08:00
committed by GitHub
parent 1694f32e8e
commit 9a4cd81891
2 changed files with 70 additions and 4 deletions

View File

@ -0,0 +1,40 @@
---
sidebar_position: 31
slug: /chunker_title_component
---
# Title chunker component
A component that splits texts into chunks by heading level.
---
A **Token chunker** component is a text splitter that uses specified heading level as delimiter to define chunk boundaries and create chunks.
## Scenario
A **Title chunker** component is optional, usually placed immediately after **Parser**.
:::caution WARNING
Placing a **Title chunker** after a **Token chunker** is invalid and will cause an error. Please note that this restriction is not currently system-enforced and requires your attention.
:::
## Configurations
### Hierarchy
Specifies the heading level to define chunk boundaries:
- H1
- H2
- H3 (Default)
- H4
Click **+ Add** to add heading levels here or update the corresponding **Regular Expressions** fields for custom heading patterns.
### Output
The global variable name for the output of the **Title chunkder** component, which can be referenced by subsequent components in the ingestion pipeline.
- Default: `chunks`
- Type: `Array<Object>`

View File

@ -3,15 +3,41 @@ sidebar_position: 32
slug: /chunker_token_component
---
# Parser component
# Token chunker component
A component that sets the parsing rules for your dataset.
A component that splits texts into chunks, respecting a maximum token limit and using delimiters to find optimal breakpoints.
---
A **Parser** component defines how various file types should be parsed, including parsing methods for PDFs , fields to parse for Emails, and OCR methods for images.
A **Token chunker** component is a text splitter that creates chunks by respecting a recommended maximum token length, using delimiters to ensure logical chunk breakpoints. It splits long texts into appropriately-sized, semantically related chunks.
## Scenario
A **Parser** component is auto-populated on the ingestion pipeline canvas and required in all ingestion pipeline workflows.
A **Token chunker** component is optional, usually placed immediately after **Parser** or **Title chunker**.
## Configurations
### Recommended chunk size
The recommended maximum token limit for each created chunk. The **Token chunker** component creates chunks at specified delimiters. If this token limit is reached before a delimiter, a chunk is created at that point.
### Overlapped percent (%)
This defines the overlap percentage between chunks. An appropriate degree of overlap ensures semantic coherence without creating excessive, redundant tokens for the LLM.
- Default: 0
- Maximum: 30%
### Delimiters
Defaults to `\n`. Click the right-hand **Recycle bin** button to remove it, or click **+ Add** to add a delimiter.
### Output
The global variable name for the output of the **Token chunkder** component, which can be referenced by subsequent components in the ingestion pipeline.
- Default: `chunks`
- Type: `Array<Object>`