Refactor Dataset API (#2783)

### What problem does this PR solve?

Refactor Dataset API

### Type of change

- [x] Refactoring

---------

Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>
2024-10-11 09:55:27 +08:00
parent a2f9c03a95
commit cbd7cd7c4d
11 changed files with 449 additions and 393 deletions
--- a/api/http_api.md
+++ b/api/http_api.md
@@ -5,63 +5,134 @@

 **POST** `/api/v1/dataset`

-Creates a dataset with a name. If dataset of the same name already exists, the new dataset will be renamed by RAGFlow automatically.
+Creates a dataset.

 ### Request

 - Method: POST
- URL: `/api/v1/dataset`
+- URL: `http://{address}/api/v1/dataset`
 - Headers:
  - `content-Type: application/json`
  - 'Authorization: Bearer {YOUR_ACCESS_TOKEN}'
 - Body:
-  - `"dataset_name"`: `string`
+  - `"id"`: `string`
+  - `"name"`: `string`
+  - `"avatar"`: `string`
  - `"tenant_id"`: `string`
+  - `"description"`: `string`
+  - `"language"`: `string`
  - `"embedding_model"`: `string`
-  - `"chunk_count"`: `integer`
+  - `"permission"`: `string`
  - `"document_count"`: `integer`
+  - `"chunk_count"`: `integer`
  - `"parse_method"`: `string`
+  - `"parser_config"`: `Dataset.ParserConfig`

 #### Request example

-```shell
+```bash
+# "id": must not be provided when creating a dataset.
+# "name": required and must be unique.
+# "tenant_id": must not be provided when creating a dataset.
+# "embedding_model": must not be provided when creating a dataset.
+# "parse_method": "naive" means general.
 curl --request POST \
-     --url http://{address}/api/v1/dataset \
-     --header 'Content-Type: application/json' \
-     --header 'Authorization: Bearer {YOUR_ACCESS_TOKEN}' \
-     --data-binary '{
-     "dataset_name": "test",
-     "tenant_id": "4fb0cd625f9311efba4a0242ac120006",
-     "embedding_model": "BAAI/bge--zh-v1.5",
-     "chunk_count": 0,
-     "document_count": 0,
-     "parse_method": "general"
+  --url http://{address}/api/v1/dataset \
+  --header 'Content-Type: application/json' \
+  --header 'Authorization: Bearer {YOUR_ACCESS_TOKEN}' \
+  --data '{
+  "name": "test",
+  "chunk_count": 0,
+  "document_count": 0,
+  "parse_method": "naive"
 }'
 ```

 #### Request parameters

- `"dataset_name"`: (*Body parameter*)
+- `"id"`: (*Body parameter*)  
+    The ID of the created dataset used to uniquely identify different datasets.  
+    - If creating a dataset, `id` must not be provided.
+
+- `"name"`: (*Body parameter*)  
    The name of the dataset, which must adhere to the following requirements:  
-    - Maximum 65,535 characters.
+    - Required when creating a dataset and must be unique.
+    - If updating a dataset, `name` must still be unique.
+
+- `"avatar"`: (*Body parameter*)  
+    Base64 encoding of the avatar.
+
 - `"tenant_id"`: (*Body parameter*)  
-    The ID of the tenant.
+    The ID of the tenant associated with the dataset, used to link it with specific users.  
+    - If creating a dataset, `tenant_id` must not be provided.
+    - If updating a dataset, `tenant_id` cannot be changed.
+
+- `"description"`: (*Body parameter*)  
+    The description of the dataset.
+
+- `"language"`: (*Body parameter*)  
+    The language setting for the dataset.
+
 - `"embedding_model"`: (*Body parameter*)  
-    Embedding model used in the dataset.
- `"chunk_count"`: (*Body parameter*)  
-    Chunk count of the dataset.
+    Embedding model used in the dataset to generate vector embeddings.  
+    - If creating a dataset, `embedding_model` must not be provided.
+    - If updating a dataset, `embedding_model` cannot be changed.
+
+- `"permission"`: (*Body parameter*)  
+    Specifies who can manipulate the dataset.
+
 - `"document_count"`: (*Body parameter*)  
-    Document count of the dataset.
- `"parse_mehtod"`: (*Body parameter*)  
-    Parsing method of the dataset.
+    Document count of the dataset.  
+    - If updating a dataset, `document_count` cannot be changed.
+
+- `"chunk_count"`: (*Body parameter*)  
+    Chunk count of the dataset.  
+    - If updating a dataset, `chunk_count` cannot be changed.
+
+- `"parse_method"`: (*Body parameter*)  
+    Parsing method of the dataset.  
+    - If updating `parse_method`, `chunk_count` must be 0.
+
+- `"parser_config"`: (*Body parameter*)  
+    The configuration settings for the dataset parser.

 ### Response

 The successful response includes a JSON object like the following:

-```shell
+```json
 {
-    "code": 0 
+    "code": 0,
+    "data": {
+        "avatar": null,
+        "chunk_count": 0,
+        "create_date": "Thu, 10 Oct 2024 05:57:37 GMT",
+        "create_time": 1728539857641,
+        "created_by": "69736c5e723611efb51b0242ac120007",
+        "description": null,
+        "document_count": 0,
+        "embedding_model": "BAAI/bge-large-zh-v1.5",
+        "id": "8d73076886cc11ef8c270242ac120006",
+        "language": "English",
+        "name": "test_1",
+        "parse_method": "naive",
+        "parser_config": {
+            "pages": [
+                [
+                    1,
+                    1000000
+                ]
+            ]
+        },
+        "permission": "me",
+        "similarity_threshold": 0.2,
+        "status": "1",
+        "tenant_id": "69736c5e723611efb51b0242ac120007",
+        "token_num": 0,
+        "update_date": "Thu, 10 Oct 2024 05:57:37 GMT",
+        "update_time": 1728539857641,
+        "vector_similarity_weight": 0.3
+    }
 }
 ```
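
As an illustration of the creation rules documented above (not part of this commit), here is a minimal Python sketch that builds the create request without sending it. The function name `build_create_dataset_request` and the constant `SERVER_ASSIGNED_FIELDS` are hypothetical; only the URL, headers, and field rules come from the documentation.

```python
import json
import urllib.request

# Fields the documentation says must not be supplied when creating a dataset.
SERVER_ASSIGNED_FIELDS = {"id", "tenant_id", "embedding_model"}


def build_create_dataset_request(address, access_token, name, **optional_fields):
    """Build (but do not send) a POST /api/v1/dataset request."""
    if not name:
        raise ValueError("name is required when creating a dataset")
    forbidden = SERVER_ASSIGNED_FIELDS.intersection(optional_fields)
    if forbidden:
        raise ValueError(f"must not be provided on creation: {sorted(forbidden)}")
    body = {"name": name, **optional_fields}
    return urllib.request.Request(
        url=f"http://{address}/api/v1/dataset",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Name uniqueness can only be checked by the server, so this sketch does not attempt it.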

@@ -71,10 +142,10 @@ The successful response includes a JSON object like the following:
  
 The error response includes a JSON object like the following:

-```shell
+```json
 {
-    "code": 3016,
-    "message": "Can't connect database"
+    "code": 102,
+    "message": "Duplicated knowledgebase name in creating dataset."
 }
 ```

@@ -82,27 +153,31 @@ The error response includes a JSON object like the following:

 **DELETE** `/api/v1/dataset`

-Deletes a dataset by its id or name.
+Deletes datasets by their IDs or names.

 ### Request

 - Method: DELETE
- URL: `/api/v1/dataset/{dataset_id}`
+- URL: `http://{address}/api/v1/dataset`
 - Headers:
  - `content-Type: application/json`
  - 'Authorization: Bearer {YOUR_ACCESS_TOKEN}'
+- Body:
+  - `"names"`: `List[string]`
+  - `"ids"`: `List[string]`


 #### Request example

-```shell
+```bash
+# Either "ids" or "names" must be provided, but not both.
 curl --request DELETE \
-     --url http://{address}/api/v1/dataset/0 \
-     --header 'Content-Type: application/json' \
-     --header 'Authorization: Bearer {YOUR_ACCESS_TOKEN}'
-     --data ' {
-        "names": ["ds1", "ds2"]
-     }'
+  --url http://{address}/api/v1/dataset \
+  --header 'Content-Type: application/json' \
+  --header 'Authorization: Bearer {YOUR_ACCESS_TOKEN}' \
+  --data '{
+  "names": ["test_1", "test_2"]
+  }'
 ```

 #### Request parameters
@@ -118,7 +193,7 @@ curl --request DELETE \

 The successful response includes a JSON object like the following:

-```shell
+```json
 {
    "code": 0 
 }
@@ -130,10 +205,10 @@ The successful response includes a JSON object like the following:
  
 The error response includes a JSON object like the following:

-```shell
+```json
 {
-    "code": 3016,
-    "message": "Try to delete non-existent dataset."
+    "code": 102,
+    "message": "You don't own the dataset."
 }
 ```
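
The "either `ids` or `names`, but not both" rule from the delete example can be enforced on the client before any request is made. The following Python sketch is illustrative only and not part of this commit. The helper name `build_delete_datasets_body` is hypothetical.

```python
import json


def build_delete_datasets_body(ids=None, names=None):
    """Return the JSON body for DELETE /api/v1/dataset.

    Exactly one of `ids` or `names` must be a non-empty list.
    """
    if bool(ids) == bool(names):
        raise ValueError("provide either ids or names, but not both")
    key, values = ("ids", ids) if ids else ("names", names)
    return json.dumps({key: list(values)})
```

For example, `build_delete_datasets_body(names=["test_1", "test_2"])` produces the same body as the curl example above.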

@@ -146,50 +221,47 @@ Updates a dataset by its id.
 ### Request

 - Method: PUT
- URL: `/api/v1/dataset/{dataset_id}`
+- URL: `http://{address}/api/v1/dataset/{dataset_id}`
 - Headers:
  - `content-Type: application/json`
  - 'Authorization: Bearer {YOUR_ACCESS_TOKEN}'
+- Body: (Refer to the "Create Dataset" section for the complete structure of the request body.)


 #### Request example

-```shell
+```bash
+# "id": required (passed as {dataset_id} in the URL).
+# "name": if updated, the new name must be unique.
+# "tenant_id": cannot be changed.
+# "embedding_model": cannot be changed.
+# "chunk_count": cannot be changed.
+# "document_count": cannot be changed.
+# "parse_method": can be updated only if chunk_count is 0.
+# "naive" means general.
 curl --request PUT \
-     --url http://{address}/api/v1/dataset/0 \
-     --header 'Content-Type: application/json' \
-     --header 'Authorization: Bearer {YOUR_ACCESS_TOKEN}'
-     --data-binary '{
-     "dataset_name": "test",
-     "tenant_id": "4fb0cd625f9311efba4a0242ac120006",
-     "embedding_model": "BAAI/bge--zh-v1.5",
-     "chunk_count": 0,
-     "document_count": 0,
-     "parse_method": "general"
+  --url http://{address}/api/v1/dataset/{dataset_id} \
+  --header 'Content-Type: application/json' \
+  --header 'Authorization: Bearer {YOUR_ACCESS_TOKEN}' \
+  --data '{
+  "name": "test",
+  "tenant_id": "4fb0cd625f9311efba4a0242ac120006",
+  "embedding_model": "BAAI/bge-zh-v1.5",
+  "chunk_count": 0,
+  "document_count": 0,
+  "parse_method": "naive"
 }'
 ```

 #### Request parameters
+(Refer to the "Create Dataset" section for the complete list of request parameters.)

- `"dataset_name"`: (*Body parameter*)
-    The name of the dataset, which must adhere to the following requirements:  
-    - Maximum 65,535 characters.
- `"tenant_id"`: (*Body parameter*)  
-    The ID of the tenant.
- `"embedding_model"`: (*Body parameter*)  
-    Embedding model used in the dataset.
- `"chunk_count"`: (*Body parameter*)  
-    Chunk count of the dataset.
- `"document_count"`: (*Body parameter*)  
-    Document count of the dataset.
- `"parse_mehtod"`: (*Body parameter*)  
-    Parsing method of the dataset.

 ### Response

 The successful response includes a JSON object like the following:

-```shell
+```json
 {
    "code": 0 
 }
@@ -201,35 +273,37 @@ The successful response includes a JSON object like the following:
  
 The error response includes a JSON object like the following:

-```shell
+```json
 {
-    "code": 3016,
-    "message": "Can't change embedding model since some files already use it."
+    "code": 102,
+    "message": "Can't change tenant_id."
 }
 ```
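
Every response shown in this document carries a `code` field, with `0` indicating success and a `message` present on failure. A client could centralize that check as in the Python sketch below, which is illustrative and not part of this commit. `DatasetApiError` and `unwrap_response` are hypothetical names, and the sketch assumes that any non-zero `code` is an error.

```python
import json


class DatasetApiError(Exception):
    """Raised when the API returns a non-zero code (assumed to mean failure)."""

    def __init__(self, code, message):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.message = message


def unwrap_response(raw_body):
    """Parse a JSON response body and return its "data" field, if any."""
    payload = json.loads(raw_body)
    if payload.get("code") != 0:
        raise DatasetApiError(payload.get("code"), payload.get("message", ""))
    # Delete and update responses carry no "data", so this may be None.
    return payload.get("data")
```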

 ## List datasets

-**GET** `/api/v1/dataset?name={name}&page={page}&page_size={page_size}&orderby={orderby}&desc={desc}`
+**GET** `/api/v1/dataset?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={dataset_name}&id={dataset_id}`

 List all datasets

 ### Request

 - Method: GET
- URL: `/api/v1/dataset?name={name}&page={page}&page_size={page_size}&orderby={orderby}&desc={desc}`
+- URL: `http://{address}/api/v1/dataset?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={dataset_name}&id={dataset_id}`
 - Headers:
-  - `content-Type: application/json`
  - 'Authorization: Bearer {YOUR_ACCESS_TOKEN}'


 #### Request example

-```shell
+```bash
+# If no page parameter is passed, the default is 1.
+# If no page_size parameter is passed, the default is 1024.
+# If no orderby parameter is passed, the default is "create_time".
+# If no desc parameter is passed, the default is True.
 curl --request GET \
-     --url http://{address}/api/v1/dataset?page=0&page_size=50&orderby=create_time&desc=false \
-     --header 'Content-Type: application/json' \
-     --header 'Authorization: Bearer {YOUR_ACCESS_TOKEN}'
+  --url 'http://{address}/api/v1/dataset?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={dataset_name}&id={dataset_id}' \
+  --header 'Authorization: Bearer {YOUR_ACCESS_TOKEN}'
 ```

 #### Request parameters
@@ -244,54 +318,63 @@ curl --request GET \
    A boolean flag indicating whether the sorting should be in descending order.
 - `name`: (*Path parameter*)
    Dataset name
+- `"id"`: (*Path parameter*)  
+    The ID of the dataset to be retrieved.
+- `"name"`: (*Path parameter*)  
+    The name of the dataset to be retrieved.

 ### Response

 The successful response includes a JSON object like the following:

-```shell
+```json
 {
    "code": 0,
    "data": [
        {
-          "avatar": "",
-          "chunk_count": 0,
-          "create_date": "Thu, 29 Aug 2024 03:13:07 GMT",
-          "create_time": 1724901187843,
-          "created_by": "4fb0cd625f9311efba4a0242ac120006",
-          "description": "",
-          "document_count": 0,
-          "embedding_model": "BAAI/bge-large-zh-v1.5",
-          "id": "9d3d906665b411ef87d10242ac120006",
-          "language": "English",
-          "name": "Test",
-          "parser_config": {
-              "chunk_token_count": 128,
-              "delimiter": "\n!?。；！？",
-              "layout_recognize": true,
-              "task_page_size": 12
-          },
-          "parse_method": "naive",
-          "permission": "me",
-          "similarity_threshold": 0.2,
-          "status": "1",
-          "tenant_id": "4fb0cd625f9311efba4a0242ac120006",
-          "token_count": 0,
-          "update_date": "Thu, 29 Aug 2024 03:13:07 GMT",
-          "update_time": 1724901187843,
-          "vector_similarity_weight": 0.3
+            "avatar": "",
+            "chunk_count": 59,
+            "create_date": "Sat, 14 Sep 2024 01:12:37 GMT",
+            "create_time": 1726276357324,
+            "created_by": "69736c5e723611efb51b0242ac120007",
+            "description": null,
+            "document_count": 1,
+            "embedding_model": "BAAI/bge-large-zh-v1.5",
+            "id": "6e211ee0723611efa10a0242ac120007",
+            "language": "English",
+            "name": "mysql",
+            "parse_method": "knowledge_graph",
+            "parser_config": {
+                "chunk_token_num": 8192,
+                "delimiter": "\\n!?;。；！？",
+                "entity_types": [
+                    "organization",
+                    "person",
+                    "location",
+                    "event",
+                    "time"
+                ]
+            },
+            "permission": "me",
+            "similarity_threshold": 0.2,
+            "status": "1",
+            "tenant_id": "69736c5e723611efb51b0242ac120007",
+            "token_num": 12744,
+            "update_date": "Thu, 10 Oct 2024 04:07:23 GMT",
+            "update_time": 1728533243536,
+            "vector_similarity_weight": 0.3
        }
-    ],
+    ]
 }
 ```

  
 The error response includes a JSON object like the following:

-```shell
+```json
 {
-    "code": 3016,
-    "message": "Can't access database to get the dataset list."
+    "code": 102,
+    "message": "The dataset doesn't exist"
 }
 ```
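
The list endpoint's query string, including the defaults stated in the request example (`page` 1, `page_size` 1024, `orderby` "create_time", `desc` true), can be assembled as in the Python sketch below. It is illustrative and not part of this commit. The function name is hypothetical, and two choices are assumptions rather than documented behavior: the boolean is sent as lowercase `true`/`false`, and the `name` and `id` filters are omitted when unset.

```python
from urllib.parse import urlencode


def build_list_datasets_url(address, page=1, page_size=1024,
                            orderby="create_time", desc=True,
                            name=None, dataset_id=None):
    """Build the GET /api/v1/dataset URL with its query parameters."""
    params = {
        "page": page,
        "page_size": page_size,
        "orderby": orderby,
        # Assumption: the server accepts lowercase boolean strings.
        "desc": "true" if desc else "false",
    }
    # Assumption: unset filters are left out of the query string.
    if name is not None:
        params["name"] = name
    if dataset_id is not None:
        params["id"] = dataset_id
    # urlencode percent-encodes values, so the result is safe to pass as a URL.
    return f"http://{address}/api/v1/dataset?{urlencode(params)}"
```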