From 9a4cd818918be272f2665e5e0ab1b0a56d8efaee Mon Sep 17 00:00:00 2001
From: writinwaters <93570324+writinwaters@users.noreply.github.com>
Date: Tue, 21 Oct 2025 20:11:23 +0800
Subject: [PATCH] Docs: Added token chunker and title chunker components (#10711)

### What problem does this PR solve?

### Type of change

- [x] Documentation Update
---
 .../chunker_title.md | 40 +++++++++++++++++++
 .../chunker_token.md | 34 ++++++++++++++--
 2 files changed, 70 insertions(+), 4 deletions(-)
 create mode 100644 docs/guides/agent/agent_component_reference/chunker_title.md

diff --git a/docs/guides/agent/agent_component_reference/chunker_title.md b/docs/guides/agent/agent_component_reference/chunker_title.md
new file mode 100644
index 000000000..9ec692db5
--- /dev/null
+++ b/docs/guides/agent/agent_component_reference/chunker_title.md
@@ -0,0 +1,40 @@
+---
+sidebar_position: 31
+slug: /chunker_title_component
+---
+
+# Title chunker component
+
+A component that splits texts into chunks by heading level.
+
+---
+
+A **Title chunker** component is a text splitter that uses the specified heading level as a delimiter to define chunk boundaries and create chunks.
+
+## Scenario
+
+A **Title chunker** component is optional, usually placed immediately after **Parser**.
+
+:::caution WARNING
+Placing a **Title chunker** after a **Token chunker** is invalid and will cause an error. Please note that this restriction is not currently system-enforced and requires your attention.
+:::
+
+## Configurations
+
+### Hierarchy
+
+Specifies the heading level to define chunk boundaries:
+
+- H1
+- H2
+- H3 (Default)
+- H4
+
+Click **+ Add** to add heading levels here or update the corresponding **Regular Expressions** fields for custom heading patterns.
+
+### Output
+
+The global variable name for the output of the **Title chunker** component, which can be referenced by subsequent components in the ingestion pipeline.
+
+- Default: `chunks`
+- Type: `Array`
\ No newline at end of file
diff --git a/docs/guides/agent/agent_component_reference/chunker_token.md b/docs/guides/agent/agent_component_reference/chunker_token.md
index 8d29d4fa6..bcdc272df 100644
--- a/docs/guides/agent/agent_component_reference/chunker_token.md
+++ b/docs/guides/agent/agent_component_reference/chunker_token.md
@@ -3,15 +3,41 @@ sidebar_position: 32
 slug: /chunker_token_component
 ---
 
-# Parser component
+# Token chunker component
 
-A component that sets the parsing rules for your dataset.
+A component that splits texts into chunks, respecting a maximum token limit and using delimiters to find optimal breakpoints.
 
 ---
 
-A **Parser** component defines how various file types should be parsed, including parsing methods for PDFs , fields to parse for Emails, and OCR methods for images.
+A **Token chunker** component is a text splitter that creates chunks by respecting a recommended maximum token length, using delimiters to ensure logical chunk breakpoints. It splits long texts into appropriately sized, semantically related chunks.
 
 ## Scenario
 
-A **Parser** component is auto-populated on the ingestion pipeline canvas and required in all ingestion pipeline workflows.
\ No newline at end of file
+A **Token chunker** component is optional, usually placed immediately after **Parser** or **Title chunker**.
+
+## Configurations
+
+### Recommended chunk size
+
+The recommended maximum token limit for each created chunk. The **Token chunker** component creates chunks at specified delimiters. If this token limit is reached before a delimiter, a chunk is created at that point.
+
+### Overlapped percent (%)
+
+This defines the overlap percentage between chunks. An appropriate degree of overlap ensures semantic coherence without creating excessive, redundant tokens for the LLM.
+
+- Default: 0
+- Maximum: 30%
+
+
+### Delimiters
+
+Defaults to `\n`. Click the right-hand **Recycle bin** button to remove it, or click **+ Add** to add a delimiter.
+
+
+### Output
+
+The global variable name for the output of the **Token chunker** component, which can be referenced by subsequent components in the ingestion pipeline.
+
+- Default: `chunks`
+- Type: `Array`
\ No newline at end of file
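
Review note on the **Token chunker** settings described in this patch: the interaction between recommended chunk size, delimiters, and overlap may be easier to follow with a small sketch. The code below is only an illustration under stated assumptions, not the project's implementation. The function name `chunk_by_tokens`, its parameter names, and the whitespace-based `count_tokens` stand-in are all hypothetical; a real pipeline would use a model tokenizer.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def chunk_by_tokens(text, chunk_size=8, overlap_percent=0, delimiters=("\n",)):
    """Hypothetical sketch of delimiter-aware chunking with a token limit."""
    # The documented maximum for the overlap setting is 30%.
    if not 0 <= overlap_percent <= 30:
        raise ValueError("overlap_percent must be between 0 and 30")

    # Split the text at any of the delimiters and drop empty segments.
    if delimiters:
        pattern = "|".join(re.escape(d) for d in delimiters)
        segments = [s.strip() for s in re.split(pattern, text) if s.strip()]
    else:
        segments = [text.strip()] if text.strip() else []

    # If the limit is reached before a delimiter, cut the segment at the limit.
    pieces = []
    for segment in segments:
        words = segment.split()
        for i in range(0, len(words), chunk_size):
            pieces.append(" ".join(words[i:i + chunk_size]))

    # Greedily pack pieces into a chunk until the next piece would exceed the limit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and count_tokens(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Prepend the tail of the previous chunk to each later chunk as overlap.
    overlap = chunk_size * overlap_percent // 100
    if overlap and len(chunks) > 1:
        with_overlap = [chunks[0]]
        for prev, cur in zip(chunks, chunks[1:]):
            tail = " ".join(prev.split()[-overlap:])
            with_overlap.append(f"{tail} {cur}")
        chunks = with_overlap
    return chunks


sample = "one two three\nfour five\nsix seven eight nine ten eleven"
print(chunk_by_tokens(sample, chunk_size=5))
print(chunk_by_tokens(sample, chunk_size=5, overlap_percent=20))
```

In this sketch the overlap is added on top of the packed chunk, so a chunk with overlap can exceed `chunk_size` slightly. That reading treats the limit as recommended rather than strict; whether the actual component behaves the same way is not stated in the patch.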