Updated parser_config description (#3104)

### What problem does this PR solve?



### Type of change


- [x] Documentation Update

Author: writinwaters
Date: 2024-10-30 15:33:36 +08:00
Committed by: GitHub
Parent: 3fb2bc7613
Commit: 86b546f657
2 changed files with 94 additions and 30 deletions


@@ -78,7 +78,7 @@ curl --request POST \
 - `"chunk_method"`: (*Body parameter*), `enum<string>`
   The chunking method of the dataset to create. Available options:
   - `"naive"`: General (default)
-  - `"manual`: Manual
+  - `"manual"`: Manual
   - `"qa"`: Q&A
   - `"table"`: Table
   - `"paper"`: Paper
@@ -88,16 +88,23 @@ curl --request POST \
   - `"picture"`: Picture
   - `"one"`: One
   - `"knowledge_graph"`: Knowledge Graph
-  - `"email"`: Email
 - `"parser_config"`: (*Body parameter*), `object`
-  The configuration settings for the dataset parser, a JSON object containing the following attributes:
-  - `"chunk_token_count"`: Defaults to `128`.
-  - `"layout_recognize"`: Defaults to `true`.
-  - `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
-  - `"delimiter"`: Defaults to `"\n!?。;!?"`.
-  - `"task_page_size"`: Defaults to `12`. For PDF only.
-  - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
+  The configuration settings for the dataset parser. The attributes in this JSON object vary with the selected `"chunk_method"`:
+  - If `"chunk_method"` is `"naive"`, the `"parser_config"` object contains the following attributes:
+    - `"chunk_token_count"`: Defaults to `128`.
+    - `"layout_recognize"`: Defaults to `true`.
+    - `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
+    - `"delimiter"`: Defaults to `"\n!?。;!?"`.
+    - `"task_page_size"`: Defaults to `12`. For PDF only.
+    - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
+  - If `"chunk_method"` is `"qa"`, `"manual"`, `"paper"`, `"book"`, `"laws"`, or `"presentation"`, the `"parser_config"` object contains the following attribute:
+    - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
+  - If `"chunk_method"` is `"table"` or `"one"`, `"parser_config"` is an empty JSON object.
+  - If `"chunk_method"` is `"knowledge_graph"`, the `"parser_config"` object contains the following attributes:
+    - `"chunk_token_count"`: Defaults to `128`.
+    - `"delimiter"`: Defaults to `"\n!?。;!?"`.
+    - `"entity_types"`: Defaults to `["organization","person","location","event","time"]`.
 ### Response
@@ -256,7 +263,6 @@ curl --request PUT \
   - `"picture"`: Picture
   - `"one"`: One
   - `"knowledge_graph"`: Knowledge Graph
-  - `"email"`: Email
 ### Response
@@ -511,13 +517,22 @@ curl --request PUT \
   - `"picture"`: Picture
   - `"one"`: One
   - `"knowledge_graph"`: Knowledge Graph
-  - `"email"`: Email
 - `"parser_config"`: (*Body parameter*), `object`
-  The parsing configuration for the document:
-  - `"chunk_token_count"`: Defaults to `128`.
-  - `"layout_recognize"`: Defaults to `true`.
-  - `"delimiter"`: Defaults to `"\n!?。;!?"`.
-  - `"task_page_size"`: Defaults to `12`. For PDF only.
+  The configuration settings for the dataset parser. The attributes in this JSON object vary with the selected `"chunk_method"`:
+  - If `"chunk_method"` is `"naive"`, the `"parser_config"` object contains the following attributes:
+    - `"chunk_token_count"`: Defaults to `128`.
+    - `"layout_recognize"`: Defaults to `true`.
+    - `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
+    - `"delimiter"`: Defaults to `"\n!?。;!?"`.
+    - `"task_page_size"`: Defaults to `12`. For PDF only.
+    - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
+  - If `"chunk_method"` is `"qa"`, `"manual"`, `"paper"`, `"book"`, `"laws"`, or `"presentation"`, the `"parser_config"` object contains the following attribute:
+    - `"raptor"`: Raptor-specific settings. Defaults to: `{"use_raptor": false}`.
+  - If `"chunk_method"` is `"table"` or `"one"`, `"parser_config"` is an empty JSON object.
+  - If `"chunk_method"` is `"knowledge_graph"`, the `"parser_config"` object contains the following attributes:
+    - `"chunk_token_count"`: Defaults to `128`.
+    - `"delimiter"`: Defaults to `"\n!?。;!?"`.
+    - `"entity_types"`: Defaults to `["organization","person","location","event","time"]`.
 ### Response
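
Reviewer note: the per-method `"parser_config"` defaults described in this change can be summarized as a small lookup. The following is a minimal client-side sketch, not part of the commit or of any official SDK. The helper name `default_parser_config` is hypothetical, the method name `"manual"` is assumed to be the intended spelling from the `"chunk_method"` enum, and methods for which the documentation lists no `"parser_config"` attributes (for example `"picture"`) are rejected rather than guessed.

```python
import copy
import json

# Defaults copied from the documentation text in this diff.
# Assumption: the delimiter is written here as a Python string; how the
# server interprets the leading "\n" is not specified in the diff.
_NAIVE = {
    "chunk_token_count": 128,
    "layout_recognize": True,
    "html4excel": False,
    "delimiter": "\n!?。;!?",
    "task_page_size": 12,  # PDF only, per the docs
    "raptor": {"use_raptor": False},
}
_KNOWLEDGE_GRAPH = {
    "chunk_token_count": 128,
    "delimiter": "\n!?。;!?",
    "entity_types": ["organization", "person", "location", "event", "time"],
}
_RAPTOR_ONLY = {"qa", "manual", "paper", "book", "laws", "presentation"}
_EMPTY = {"table", "one"}


def default_parser_config(chunk_method: str) -> dict:
    """Return the documented default parser_config for a chunk method.

    Hypothetical helper for illustration; raises ValueError for methods
    whose parser_config is not described in the documentation.
    """
    if chunk_method == "naive":
        return copy.deepcopy(_NAIVE)
    if chunk_method in _RAPTOR_ONLY:
        return {"raptor": {"use_raptor": False}}
    if chunk_method in _EMPTY:
        return {}
    if chunk_method == "knowledge_graph":
        return copy.deepcopy(_KNOWLEDGE_GRAPH)
    raise ValueError(f"no documented parser_config for {chunk_method!r}")


if __name__ == "__main__":
    # Build only the two body parameters discussed in this diff.
    body = {
        "chunk_method": "knowledge_graph",
        "parser_config": default_parser_config("knowledge_graph"),
    }
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Returning a deep copy keeps callers from mutating the shared defaults when they override a single attribute such as `"chunk_token_count"`.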