Refa: http API create dataset and test cases (#7393)

### What problem does this PR solve?

This PR introduces Pydantic-based validation for the create dataset HTTP
API, improving code clarity and robustness. Key changes include:
1. Pydantic Validation (see the sketch after this list)
2. Error Handling
3. Test Updates
4. Documentation
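
As a rough illustration of item 1, a Pydantic model enforcing the limits documented in this reference might look like the sketch below. The class and field names are assumptions for illustration only, not the exact model introduced by this PR.

```python
# Hypothetical sketch of request validation for the create-dataset endpoint;
# names and bounds mirror the documented limits, not the PR's actual code.
from typing import Literal, Optional

from pydantic import BaseModel, Field

ChunkMethod = Literal[
    "naive", "book", "email", "laws", "manual", "one",
    "paper", "picture", "presentation", "qa", "table", "tag",
]

class CreateDatasetRequest(BaseModel):
    name: str = Field(..., min_length=1, max_length=128)
    avatar: Optional[str] = Field(default=None, max_length=65535)
    description: Optional[str] = Field(default=None, max_length=65535)
    embedding_model: Optional[str] = Field(default=None, max_length=255)
    permission: Literal["me", "team"] = "me"
    chunk_method: ChunkMethod = "naive"
    pagerank: int = Field(default=0, ge=0, le=100)
    parser_config: Optional[dict] = None

# An out-of-range payload raises pydantic.ValidationError, which the API
# layer can translate into the error envelope shown later in this diff.
```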

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Documentation Update
- [x] Refactoring
Author: liu an
Date: 2025-04-29 16:53:57 +08:00 (committed by GitHub)
Commit: 78380fa181 (parent: c88e4b3fc0)
11 changed files with 1239 additions and 812 deletions


@@ -341,6 +341,7 @@ Creates a dataset.
- `"embedding_model"`: `string`
- `"permission"`: `string`
- `"chunk_method"`: `string`
- `"pagerank"`: `int`
- `"parser_config"`: `object`
##### Request example
@@ -359,53 +360,83 @@ curl --request POST \
- `"name"`: (*Body parameter*), `string`, *Required*
The unique name of the dataset to create. It must adhere to the following requirements:
- Permitted characters include:
- English letters (a-z, A-Z)
- Digits (0-9)
- "_" (underscore)
- Must begin with an English letter or underscore.
- Maximum 65,535 characters.
- Case-insensitive.
- Basic Multilingual Plane (BMP) only
- Maximum 128 characters
- Case-insensitive
- `"avatar"`: (*Body parameter*), `string`
Base64 encoding of the avatar.
- Maximum 65535 characters
- `"description"`: (*Body parameter*), `string`
A brief description of the dataset to create.
- Maximum 65535 characters
- `"embedding_model"`: (*Body parameter*), `string`
The name of the embedding model to use. For example: `"BAAI/bge-zh-v1.5"`
The name of the embedding model to use. For example: `"BAAI/bge-large-zh-v1.5@BAAI"`
- Maximum 255 characters
- Must follow `model_name@model_factory` format
- `"permission"`: (*Body parameter*), `string`
Specifies who can access the dataset to create. Available options:
- `"me"`: (Default) Only you can manage the dataset.
- `"team"`: All team members can manage the dataset.
- `"pagerank"`: (*Body parameter*), `int`
Sets the page rank of the dataset. Refer to [Set page rank](https://ragflow.io/docs/dev/set_page_rank).
- Default: `0`
- Minimum: `0`
- Maximum: `100`
- `"chunk_method"`: (*Body parameter*), `enum<string>`
The chunking method of the dataset to create. Available options:
- `"naive"`: General (default)
- `"book"`: Book
- `"email"`: Email
- `"laws"`: Laws
- `"manual"`: Manual
- `"one"`: One
- `"paper"`: Paper
- `"picture"`: Picture
- `"presentation"`: Presentation
- `"qa"`: Q&A
- `"table"`: Table
- `"paper"`: Paper
- `"book"`: Book
- `"laws"`: Laws
- `"presentation"`: Presentation
- `"picture"`: Picture
- `"one"`: One
- `"email"`: Email
- `"tag"`: Tag
- `"parser_config"`: (*Body parameter*), `object`
The configuration settings for the dataset parser. The attributes in this JSON object vary with the selected `"chunk_method"`:
- If `"chunk_method"` is `"naive"`, the `"parser_config"` object contains the following attributes:
- `"chunk_token_count"`: Defaults to `128`.
- `"layout_recognize"`: Defaults to `true`.
- `"html4excel"`: Indicates whether to convert Excel documents into HTML format. Defaults to `false`.
- `"delimiter"`: Defaults to `"\n"`.
- `"task_page_size"`: Defaults to `12`. For PDF only.
- `"raptor"`: RAPTOR-specific settings. Defaults to: `{"use_raptor": false}`.
- `"auto_keywords"`: `int`
- Defaults to `0`
- Minimum: `0`
- Maximum: `32`
- `"auto_questions"`: `int`
- Defaults to `0`
- Minimum: `0`
- Maximum: `10`
- `"chunk_token_num"`: `int`
- Defaults to `128`
- Minimum: `1`
- Maximum: `2048`
- `"delimiter"`: `string`
- Defaults to `"\n"`.
- `"html4excel"`: `bool` Indicates whether to convert Excel documents into HTML format.
- Defaults to `false`
- `"layout_recognize"`: `string`
- Defaults to `DeepDOC`
- `"tag_kb_ids"`: `array<string>` refer to [Use tag set](https://ragflow.io/docs/dev/use_tag_sets)
- Must include a list of dataset IDs, where each dataset is parsed using the Tag Chunk Method
- `"task_page_size"`: `int` For PDF only.
- Defaults to `12`
- Minimum: `1`
- Maximum: `10000`
- `"raptor"`: `object` RAPTOR-specific settings.
- Defaults to: `{"use_raptor": false}`
- `"graphrag"`: `object` GRAPHRAG-specific settings.
- Defaults to: `{"use_graphrag": false}`
- If `"chunk_method"` is `"qa"`, `"manuel"`, `"paper"`, `"book"`, `"laws"`, or `"presentation"`, the `"parser_config"` object contains the following attribute:
- `"raptor"`: RAPTOR-specific settings. Defaults to: `{"use_raptor": false}`.
- `"raptor"`: `object` RAPTOR-specific settings.
- Defaults to: `{"use_raptor": false}`.
- If `"chunk_method"` is `"table"`, `"picture"`, `"one"`, or `"email"`, `"parser_config"` is an empty JSON object.
#### Response
@@ -419,33 +450,34 @@ Success:
"avatar": null,
"chunk_count": 0,
"chunk_method": "naive",
"create_date": "Thu, 24 Oct 2024 09:14:07 GMT",
"create_time": 1729761247434,
"created_by": "69736c5e723611efb51b0242ac120007",
"create_date": "Mon, 28 Apr 2025 18:40:41 GMT",
"create_time": 1745836841611,
"created_by": "3af81804241d11f0a6a79f24fc270c7f",
"description": null,
"document_count": 0,
"embedding_model": "BAAI/bge-large-zh-v1.5",
"id": "527fa74891e811ef9c650242ac120006",
"embedding_model": "BAAI/bge-large-zh-v1.5@BAAI",
"id": "3b4de7d4241d11f0a6a79f24fc270c7f",
"language": "English",
"name": "test_1",
"name": "RAGFlow example",
"pagerank": 0,
"parser_config": {
"chunk_token_num": 128,
"delimiter": "\\n",
"html4excel": false,
"layout_recognize": true,
"chunk_token_num": 128,
"delimiter": "\\n!?;。;!?",
"html4excel": false,
"layout_recognize": "DeepDOC",
"raptor": {
"use_raptor": false
}
},
}
},
"permission": "me",
"similarity_threshold": 0.2,
"status": "1",
"tenant_id": "69736c5e723611efb51b0242ac120007",
"tenant_id": "3af81804241d11f0a6a79f24fc270c7f",
"token_num": 0,
"update_date": "Thu, 24 Oct 2024 09:14:07 GMT",
"update_time": 1729761247434,
"vector_similarity_weight": 0.3
}
"update_date": "Mon, 28 Apr 2025 18:40:41 GMT",
"update_time": 1745836841611,
"vector_similarity_weight": 0.3,
},
}
```
@@ -453,8 +485,8 @@ Failure:
```json
{
"code": 102,
"message": "Duplicated knowledgebase name in creating dataset."
"code": 101,
"message": "Dataset name 'RAGFlow example' already exists"
}
```
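
Both success and failure responses share the same `code`/`message` envelope, so a caller can branch on `code`. Continuing the request sketch above (assuming `response` holds the HTTP response):

```python
# Sketch of handling the response envelope: code 0 means success; any other
# code carries an error message such as the duplicate-name error shown above.
result = response.json()
if result.get("code") == 0:
    dataset = result["data"]
    print("Created dataset:", dataset["id"], dataset["name"])
else:
    raise RuntimeError(f"Create dataset failed: {result.get('message')}")
```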


@@ -95,11 +95,12 @@ else:
```python
RAGFlow.create_dataset(
name: str,
avatar: str = "",
description: str = "",
embedding_model: str = "BAAI/bge-large-zh-v1.5",
avatar: Optional[str] = None,
description: Optional[str] = None,
embedding_model: Optional[str] = "BAAI/bge-large-zh-v1.5@BAAI",
permission: str = "me",
chunk_method: str = "naive",
pagerank: int = 0,
parser_config: DataSet.ParserConfig = None
) -> DataSet
```
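
A minimal usage sketch of this signature, assuming a locally running RAGFlow server and a valid API key (both placeholders):

```python
from ragflow_sdk import RAGFlow

# Placeholders: point these at your own server and key.
rag_object = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://127.0.0.1:9380")

dataset = rag_object.create_dataset(
    name="RAGFlow example",
    chunk_method="naive",
    pagerank=0,
)
print(dataset.id, dataset.name)
```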
@@ -112,16 +113,16 @@ Creates a dataset.
The unique name of the dataset to create. It must adhere to the following requirements:
- Maximum 65,535 characters.
- Maximum 128 characters.
- Case-insensitive.
##### avatar: `str`
Base64 encoding of the avatar. Defaults to `""`
Base64 encoding of the avatar. Defaults to `None`
##### description: `str`
A brief description of the dataset to create. Defaults to `""`.
A brief description of the dataset to create. Defaults to `None`.
##### permission
@@ -147,6 +148,10 @@ The chunking method of the dataset to create. Available options:
- `"one"`: One
- `"email"`: Email
##### pagerank: `int`
The pagerank of the dataset to create. Defaults to `0`.
##### parser_config
The parser configuration of the dataset. A `ParserConfig` object's attributes vary based on the selected `chunk_method`: