# Create Dataset

> Create datasets for data enrichment, data extraction, map extraction, and custom blank tables, or generate a dataset with AI from a natural language prompt.

<Note>
  All operations require authentication using Bearer tokens. Make sure you have
  your API credentials ready.
</Note>
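
As a minimal sketch of the authentication requirement, the helper below builds the headers used by the examples on this page. The function name `auth_headers` and the environment variable `CLARO_API_KEY` are assumptions chosen for illustration, not official names.

```python
import os


def auth_headers(api_key=None, json_body=True):
    """Build request headers carrying a Bearer token.

    CLARO_API_KEY is a hypothetical environment variable name.
    """
    key = api_key or os.environ.get("CLARO_API_KEY")
    if not key:
        raise ValueError("no API key provided")
    headers = {"Authorization": f"Bearer {key}"}
    if json_body:
        # The JSON examples on this page send an explicit Content-Type.
        headers["Content-Type"] = "application/json"
    return headers
```

For multipart uploads, pass `json_body=False` so that the HTTP client can set the multipart `Content-Type` and boundary itself, as the upload examples on this page do.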

## Standard Dataset Creation

### Create Dataset from Data Sources

Create a new dataset by sending its configuration as JSON. Reference an existing data source with `datasourceId`; to upload files directly, use the upload endpoint instead.

<CodeGroup>
  ```bash cURL theme={null}
  # The '"$DATASOURCE_ID"' splice leaves the single quotes so the shell expands the variable
  curl -X POST "https://secure-api.getclaro.ai/api/v2/datasets" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "type": "data_enrichment",
      "name": "Product Enrichment Dataset",
      "description": "Enrich product data with additional attributes and classifications",
      "datasourceId": "'"$DATASOURCE_ID"'"
    }'
  ```

  ```python Python theme={null}
  import requests

  headers = {
      "Authorization": "Bearer YOUR_API_KEY",
      "Content-Type": "application/json"
  }

  data = {
      "type": "data_enrichment",
      "name": "Product Enrichment Dataset",
      "description": "Enrich product data with additional attributes and classifications",
      "datasourceId": "your-datasource-id"  # Replace with your datasource ID
  }

  response = requests.post(
      "https://secure-api.getclaro.ai/api/v2/datasets",
      headers=headers,
      json=data
  )
  ```

  ```javascript JavaScript theme={null}
  const datasetConfig = {
    type: "data_enrichment",
    name: "Product Enrichment Dataset",
    description:
      "Enrich product data with additional attributes and classifications",
    datasourceId: "your-datasource-id", // Replace with your datasource ID
  };

  const response = await fetch("https://secure-api.getclaro.ai/api/v2/datasets", {
    method: "POST",
    headers: {
      Authorization: "Bearer YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(datasetConfig),
  });
  ```

  ```json Success Response theme={null}
  {
    "datasetId": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Product Enrichment Dataset",
    "type": "data_enrichment",
    "status": "created",
    "description": "Enrich product data with additional attributes and classifications",
    "datasourceId": "6ba7b810-9dad-11d1-80b4-00c04fd430c8",
    "createdAt": "2024-03-14T15:30:00Z",
    "rowCount": 0,
    "columnCount": 0
  }
  ```

  ```json Validation Error theme={null}
  {
    "error": "Invalid dataset configuration",
    "code": "VALIDATION_ERROR",
    "details": {
      "field": "type",
      "message": "Dataset type must be one of: data_enrichment, data_extraction, map_extraction, custom_dataset"
    }
  }
  ```
</CodeGroup>

### Create Dataset with File Upload

Create a dataset by uploading files directly instead of using existing data sources.

<Note>
  Files uploaded during dataset creation are automatically processed and saved
  as data sources, making them available for creating additional datasets in the
  future.
</Note>

<CodeGroup>
  ```bash cURL theme={null}
  # For CSV files (single file)
  curl -X POST "https://secure-api.getclaro.ai/api/v2/datasets/upload" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -F "type=data_enrichment" \
    -F "name=Product Enrichment Dataset" \
    -F "description=Enrich product data with additional attributes" \
    -F "file=@products.csv"

  # For PDF files (multiple files)
  curl -X POST "https://secure-api.getclaro.ai/api/v2/datasets/upload" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -F "type=data_extraction" \
    -F "name=Invoice Extraction Dataset" \
    -F "description=Extract key fields from invoice documents" \
    -F "files[]=@invoice1.pdf" \
    -F "files[]=@invoice2.pdf"
  ```

  ```python Python theme={null}
  import requests

  # For CSV files (single file)
  headers = {"Authorization": "Bearer YOUR_API_KEY"}
  data = {
      "type": "data_enrichment",
      "name": "Product Enrichment Dataset",
      "description": "Enrich product data with additional attributes"
  }
  files = {"file": open("products.csv", "rb")}

  response = requests.post(
      "https://secure-api.getclaro.ai/api/v2/datasets/upload",
      headers=headers,
      data=data,
      files=files
  )

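  # For PDF files (multiple files): repeat the "files[]" field once per file
  data = {
      "type": "data_extraction",
      "name": "Invoice Extraction Dataset",
      "description": "Extract key fields from invoice documents"
  }
  files = [
      ("files[]", open("invoice1.pdf", "rb")),
      ("files[]", open("invoice2.pdf", "rb"))
  ]

  response = requests.post(
      "https://secure-api.getclaro.ai/api/v2/datasets/upload",
      headers=headers,
      data=data,
      files=files
  )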
  ```

  ```javascript JavaScript theme={null}
  // For CSV files (single file)
  const formData = new FormData();
  formData.append("type", "data_enrichment");
  formData.append("name", "Product Enrichment Dataset");
  formData.append(
    "description",
    "Enrich product data with additional attributes"
  );
  formData.append("file", csvFile); // File object, e.g. from an <input type="file"> element

  const response = await fetch(
    "https://secure-api.getclaro.ai/api/v2/datasets/upload",
    {
      method: "POST",
      headers: {
        Authorization: "Bearer YOUR_API_KEY",
      },
      body: formData,
    }
  );

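  // For PDF files (multiple files)
  const pdfFormData = new FormData();
  pdfFormData.append("type", "data_extraction");
  pdfFormData.append("name", "Invoice Extraction Dataset");
  pdfFormData.append("description", "Extract key fields from invoice documents");
  pdfFiles.forEach((file) => pdfFormData.append("files[]", file));

  const pdfResponse = await fetch(
    "https://secure-api.getclaro.ai/api/v2/datasets/upload",
    {
      method: "POST",
      headers: {
        Authorization: "Bearer YOUR_API_KEY",
      },
      body: pdfFormData,
    }
  );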
  ```

  ```json Success Response theme={null}
  {
    "datasetId": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Product Enrichment Dataset",
    "type": "data_enrichment",
    "status": "created",
    "description": "Enrich product data with additional attributes",
    "datasourceId": "6ba7b810-9dad-11d1-80b4-00c04fd430c8",
    "createdAt": "2024-03-14T15:30:00Z",
    "rowCount": 0,
    "columnCount": 0
  }
  ```

  ```json File Upload Error theme={null}
  {
    "error": "File upload failed",
    "code": "FILE_UPLOAD_ERROR",
    "details": {
      "message": "Invalid file format or file too large"
    }
  }
  ```
</CodeGroup>
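
The upload examples above use the `file` field for a single CSV and repeated `files[]` fields for PDFs. The sketch below encodes that convention and closes file handles once the request finishes. The helper names are hypothetical, the `post` argument is expected to be a callable such as `requests.post`, and accepting a single PDF under `files[]` is an assumption.

```python
from contextlib import ExitStack
from pathlib import Path


def upload_field_names(paths):
    """Pair each path with the multipart field name used in the examples above."""
    paths = [Path(p) for p in paths]
    suffixes = {p.suffix.lower() for p in paths}
    if len(paths) == 1 and suffixes == {".csv"}:
        return [("file", paths[0])]
    if paths and suffixes == {".pdf"}:
        return [("files[]", p) for p in paths]
    raise ValueError("expected one CSV file or one or more PDF files")


def open_for_upload(stack, paths):
    """Open each file in binary mode; the ExitStack closes them on exit."""
    return [
        (field, stack.enter_context(open(path, "rb")))
        for field, path in upload_field_names(paths)
    ]


def upload_dataset(post, url, headers, data, paths):
    """Send the multipart request with `post`, e.g. requests.post."""
    with ExitStack() as stack:
        files = open_for_upload(stack, paths)
        return post(url, headers=headers, data=data, files=files)
```

With `requests` installed, a call might look like `upload_dataset(requests.post, url, headers, data, ["invoice1.pdf", "invoice2.pdf"])`, where `headers` contains only the `Authorization` entry.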

## Dataset Types and Configuration

### Data Enrichment

Enhance existing data with additional attributes and classifications. Requires a data source.

```json theme={null}
{
  "type": "data_enrichment",
  "name": "Product Classification Dataset",
  "description": "Classify products and add missing attributes for e-commerce catalog",
  "datasourceId": "your-datasource-id"
}
```

### Data Extraction

Extract structured data from unstructured sources such as PDFs. Requires a data source.

```json theme={null}
{
  "type": "data_extraction",
  "name": "Invoice Data Extraction",
  "description": "Extract key fields from invoice documents for automated processing",
  "datasourceId": "your-datasource-id"
}
```

### Map Extraction

Extract location-based data within specified geographic boundaries. No data sources required.

```json theme={null}
{
  "type": "map_extraction",
  "name": "Restaurant Location Data",
  "description": "Find restaurants in downtown area for market analysis",
  "mapDetails": {
    "latitude": 40.7128,
    "longitude": -74.006,
    "radiusMeters": 5000
  }
}
```
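
Checking `mapDetails` on the client before sending can catch obvious mistakes early. The following is a sketch under stated assumptions: the 50000 meter cap comes from the request parameters table on this page, the latitude and longitude ranges are the standard geographic bounds, and the requirement that the radius be positive is an assumption. The function name is hypothetical.

```python
MAX_RADIUS_METERS = 50_000  # limit listed in the request parameters table


def build_map_details(latitude, longitude, radius_meters):
    """Return a mapDetails object after basic range checks."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    # Assumption: a radius of zero or less is not meaningful.
    if not 0 < radius_meters <= MAX_RADIUS_METERS:
        raise ValueError(f"radiusMeters must be in (0, {MAX_RADIUS_METERS}]")
    return {
        "latitude": latitude,
        "longitude": longitude,
        "radiusMeters": radius_meters,
    }
```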

### Custom Dataset (Blank Table)

Create a blank structured table with custom column definitions. No data sources required.

```json theme={null}
{
  "type": "custom_dataset",
  "name": "Customer Survey Responses",
  "description": "Collect and organize customer feedback for satisfaction analysis",
  "columnDefinitions": [
    {
      "name": "customer_id",
      "type": "string",
      "description": "Unique customer identifier"
    },
    {
      "name": "satisfaction_score",
      "type": "number",
      "description": "Rating from 1-10"
    },
    {
      "name": "feedback_text",
      "type": "text",
      "description": "Open-ended feedback"
    },
    {
      "name": "survey_date",
      "type": "date",
      "description": "Date survey was completed"
    }
  ]
}
```

## AI-Powered Dataset Generation

### Generate Sample Dataset with AI

Generate a sample dataset using AI with a prompt-based approach. This returns a preview for user confirmation before creating the actual dataset.

<CodeGroup>
  ```bash cURL theme={null}
  curl -X POST "https://secure-api.getclaro.ai/api/v2/datasets/generate" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "Create a dataset of tech startup companies with funding information",
      "sampleSize": 10
    }'
  ```

  ```python Python theme={null}
  import requests

  headers = {
      "Authorization": "Bearer YOUR_API_KEY",
      "Content-Type": "application/json"
  }

  data = {
      "prompt": "Create a dataset of tech startup companies with funding information",
      "sampleSize": 10
  }

  response = requests.post(
      "https://secure-api.getclaro.ai/api/v2/datasets/generate",
      headers=headers,
      json=data
  )
  ```

  ```javascript JavaScript theme={null}
  const sampleRequest = {
    prompt: "Create a dataset of tech startup companies with funding information",
    sampleSize: 10,
  };

  const response = await fetch(
    "https://secure-api.getclaro.ai/api/v2/datasets/generate",
    {
      method: "POST",
      headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(sampleRequest),
    }
  );
  ```

  ```json Success Response theme={null}
  {
    "datasetRequestId": "req_550e8400-e29b-41d4-a716-446655440000",
    "name": "Tech Startup Funding Dataset",
    "description": "Track and analyze funding rounds and investment data for technology startups",
    "type": "ai_generated",
    "status": "sample_ready",
    "rowCount": 10,
    "columnCount": 7,
    "sampleData": [
      {
        "company_name": "DataFlow AI",
        "industry": "Artificial Intelligence",
        "funding_stage": "Series A",
        "funding_amount": 15000000,
        "investors": "Accel Partners, Index Ventures",
        "founded_year": 2021,
        "location": "San Francisco, CA"
      },
      {
        "company_name": "CloudSecure",
        "industry": "Cybersecurity",
        "funding_stage": "Seed",
        "funding_amount": 3500000,
        "investors": "Andreessen Horowitz",
        "founded_year": 2022,
        "location": "Austin, TX"
      }
    ],
    "columns": [
      {
        "name": "company_name",
        "type": "string",
        "description": "Name of the startup company"
      },
      {
        "name": "industry",
        "type": "string",
        "description": "Primary industry or sector"
      },
      {
        "name": "funding_stage",
        "type": "string",
        "description": "Current funding round stage"
      },
      {
        "name": "funding_amount",
        "type": "number",
        "description": "Total funding amount in USD"
      },
      {
        "name": "investors",
        "type": "string",
        "description": "List of primary investors"
      },
      {
        "name": "founded_year",
        "type": "number",
        "description": "Year the company was founded"
      },
      {
        "name": "location",
        "type": "string",
        "description": "Company headquarters location"
      }
    ],
    "expiresAt": "2024-03-14T16:30:00Z"
  }
  ```

  ```json Generation Error theme={null}
  {
    "error": "Unable to generate dataset",
    "code": "GENERATION_FAILED",
    "details": {
      "message": "Prompt too vague or contains unsupported data types"
    }
  }
  ```
</CodeGroup>
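
Before showing a preview to a user, it can help to confirm that the sample is internally consistent. The sketch below assumes the response shape shown in the success example above (`columns`, `columnCount`, `sampleData`); `check_sample` is a hypothetical helper, not part of any SDK.

```python
def check_sample(response):
    """Return a list of inconsistencies found in a generate response."""
    problems = []
    column_names = [col["name"] for col in response["columns"]]
    if response["columnCount"] != len(column_names):
        problems.append(
            f"columnCount is {response['columnCount']} "
            f"but {len(column_names)} columns are declared"
        )
    for index, row in enumerate(response["sampleData"]):
        # Every field in a sample row should be a declared column.
        unknown = sorted(set(row) - set(column_names))
        if unknown:
            problems.append(f"row {index} has undeclared fields: {unknown}")
    return problems
```

An empty list means the declared columns and the sample rows agree. `rowCount` is deliberately not checked here, since the example response above lists fewer rows than its `rowCount`.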

### Refine AI Dataset Generation (Optional)

After reviewing the sample dataset, you can optionally request corrections or additions before confirming the final dataset.

<CodeGroup>
  ```bash cURL theme={null}
  curl -X POST "https://secure-api.getclaro.ai/api/v2/datasets/generate/$DATASET_REQUEST_ID/refine" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "Add a column for company valuation and remove the location column"
    }'
  ```

  ```python Python theme={null}
  import requests

  headers = {
      "Authorization": "Bearer YOUR_API_KEY",
      "Content-Type": "application/json"
  }

  dataset_request_id = "your-dataset-request-id"  # Replace with your request ID
  data = {
      "prompt": "Add a column for company valuation and remove the location column"
  }

  response = requests.post(
      f"https://secure-api.getclaro.ai/api/v2/datasets/generate/{dataset_request_id}/refine",
      headers=headers,
      json=data
  )
  ```

  ```javascript JavaScript theme={null}
  const datasetRequestId = "your-dataset-request-id"; // Replace with your request ID
  const refineRequest = {
    prompt: "Add a column for company valuation and remove the location column",
  };

  const response = await fetch(
    `https://secure-api.getclaro.ai/api/v2/datasets/generate/${datasetRequestId}/refine`,
    {
      method: "POST",
      headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(refineRequest),
    }
  );
  ```

  ```json Success Response theme={null}
  {
    "datasetRequestId": "req_550e8400-e29b-41d4-a716-446655440000",
    "name": "Tech Startup Funding Dataset",
    "description": "Track and analyze funding rounds and investment data for technology startups",
    "type": "ai_generated",
    "status": "sample_refined",
    "rowCount": 10,
    "columnCount": 7,
    "sampleData": [
      {
        "company_name": "DataFlow AI",
        "industry": "Artificial Intelligence",
        "funding_stage": "Series A",
        "funding_amount": 15000000,
        "investors": "Accel Partners, Index Ventures",
        "founded_year": 2021,
        "company_valuation": 75000000
      },
      {
        "company_name": "CloudSecure",
        "industry": "Cybersecurity",
        "funding_stage": "Seed",
        "funding_amount": 3500000,
        "investors": "Andreessen Horowitz",
        "founded_year": 2022,
        "company_valuation": 20000000
      }
    ],
    "columns": [
      {
        "name": "company_name",
        "type": "string",
        "description": "Name of the startup company"
      },
      {
        "name": "industry",
        "type": "string",
        "description": "Primary industry or sector"
      },
      {
        "name": "funding_stage",
        "type": "string",
        "description": "Current funding round stage"
      },
      {
        "name": "funding_amount",
        "type": "number",
        "description": "Total funding amount in USD"
      },
      {
        "name": "investors",
        "type": "string",
        "description": "List of primary investors"
      },
      {
        "name": "founded_year",
        "type": "number",
        "description": "Year the company was founded"
      },
      {
        "name": "company_valuation",
        "type": "number",
        "description": "Current company valuation in USD"
      }
    ],
    "expiresAt": "2024-03-14T16:30:00Z"
  }
  ```

  ```json Refinement Error theme={null}
  {
    "error": "Unable to apply refinements",
    "code": "REFINEMENT_FAILED",
    "details": {
      "message": "Requested changes conflict with existing data structure"
    }
  }
  ```
</CodeGroup>
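
To confirm that a refinement did what was asked, the column lists of the original and refined responses can be compared. This is a small sketch assuming both responses carry a `columns` array as in the examples above; `diff_columns` is a hypothetical helper.

```python
def diff_columns(before, after):
    """Return (added, removed) column names between two sample responses."""
    before_names = [col["name"] for col in before["columns"]]
    after_names = [col["name"] for col in after["columns"]]
    added = [name for name in after_names if name not in before_names]
    removed = [name for name in before_names if name not in after_names]
    return added, removed
```

Applied to the generate and refine responses shown on this page, the result would be `(["company_valuation"], ["location"])`.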

### Confirm AI Dataset Generation

After reviewing the sample dataset (and optionally refining it), confirm creation of the full AI-generated dataset using the dataset request ID.

<CodeGroup>
  ```bash cURL theme={null}
  # The '"$DATASET_REQUEST_ID"' splice leaves the single quotes so the shell expands the variable
  curl -X POST "https://secure-api.getclaro.ai/api/v2/datasets/ai-generate-confirm" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "datasetRequestId": "'"$DATASET_REQUEST_ID"'",
      "fullSize": 1000
    }'
  ```

  ```python Python theme={null}
  import requests

  headers = {
      "Authorization": "Bearer YOUR_API_KEY",
      "Content-Type": "application/json"
  }

  data = {
      "datasetRequestId": "your-dataset-request-id",  # Replace with your request ID
      "fullSize": 1000
  }

  response = requests.post(
      "https://secure-api.getclaro.ai/api/v2/datasets/ai-generate-confirm",
      headers=headers,
      json=data
  )
  ```

  ```javascript JavaScript theme={null}
  const confirmRequest = {
    datasetRequestId: "your-dataset-request-id", // Replace with your request ID
    fullSize: 1000,
  };

  const response = await fetch(
    "https://secure-api.getclaro.ai/api/v2/datasets/ai-generate-confirm",
    {
      method: "POST",
      headers: {
        Authorization: "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(confirmRequest),
    }
  );
  ```

  ```json Success Response theme={null}
  {
    "datasetId": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Tech Startup Funding Dataset",
    "type": "ai_generated",
    "status": "generating",
    "estimatedCompletion": "2024-03-14T16:45:00Z",
    "fullSize": 1000,
    "createdAt": "2024-03-14T15:30:00Z"
  }
  ```

  ```json Request Expired theme={null}
  {
    "error": "Dataset request expired",
    "code": "REQUEST_EXPIRED",
    "details": {
      "datasetRequestId": "req_550e8400-e29b-41d4-a716-446655440000",
      "expiredAt": "2024-03-14T16:30:00Z"
    }
  }
  ```
</CodeGroup>
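
Because a dataset request carries an `expiresAt` timestamp, a client can check it before confirming and avoid a `REQUEST_EXPIRED` error. The sketch below assumes the ISO 8601 UTC format shown in the examples; treating the exact expiry instant as already expired is an assumption, and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse timestamps such as 2024-03-14T16:30:00Z into aware datetimes."""
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_expired(expires_at, now=None):
    """Return True if the request's expiresAt time has been reached."""
    now = now or datetime.now(timezone.utc)
    return now >= parse_timestamp(expires_at)
```

If `is_expired(sample["expiresAt"])` returns `True`, the likely recovery is to generate a new sample rather than call the confirm endpoint.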

## Request Parameters

### Create Dataset

| Parameter                 | Type   | Required    | Description                                                                                      |
| ------------------------- | ------ | ----------- | ------------------------------------------------------------------------------------------------ |
| `type`                    | string | Yes         | Dataset type: `data_enrichment`, `data_extraction`, `map_extraction`, `custom_dataset`           |
| `name`                    | string | Yes         | Dataset name (max 100 characters)                                                                |
| `description`             | string | Yes         | Purpose and use case description. Used as prompt for enrichment/extraction, search text for maps |
| `datasourceId`            | string | Conditional | Datasource ID. Required for `data_enrichment` and `data_extraction`. Not used for `map_extraction` or `custom_dataset` |
| `mapDetails`              | object | Conditional | Required for `map_extraction` type                                                               |
| `mapDetails.latitude`     | number | Conditional | Center latitude for map extraction                                                               |
| `mapDetails.longitude`    | number | Conditional | Center longitude for map extraction                                                              |
| `mapDetails.radiusMeters` | number | Conditional | Extraction radius in meters (max 50000)                                                          |
| `columnDefinitions`       | array  | Conditional | Required for `custom_dataset` type                                                               |
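
The conditional requirements in the table above can be enforced on the client before a request is sent. This is a minimal sketch, not an official validator: the function name is hypothetical, and the server may apply additional rules that are not listed here.

```python
DATASOURCE_TYPES = {"data_enrichment", "data_extraction"}
ALL_TYPES = DATASOURCE_TYPES | {"map_extraction", "custom_dataset"}


def build_create_payload(dataset_type, name, description, *,
                         datasource_id=None, map_details=None,
                         column_definitions=None):
    """Assemble a create-dataset body, checking type-specific fields."""
    if dataset_type not in ALL_TYPES:
        raise ValueError(f"type must be one of {sorted(ALL_TYPES)}")
    if len(name) > 100:
        raise ValueError("name must be at most 100 characters")
    payload = {"type": dataset_type, "name": name, "description": description}
    if dataset_type in DATASOURCE_TYPES:
        if not datasource_id:
            raise ValueError(f"datasourceId is required for {dataset_type}")
        payload["datasourceId"] = datasource_id
    elif dataset_type == "map_extraction":
        if map_details is None:
            raise ValueError("mapDetails is required for map_extraction")
        payload["mapDetails"] = map_details
    else:  # custom_dataset
        if not column_definitions:
            raise ValueError("columnDefinitions is required for custom_dataset")
        payload["columnDefinitions"] = column_definitions
    return payload
```

The returned dictionary can be passed as the `json` argument of `requests.post`, as in the Python examples on this page.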

### Create Dataset with File Upload

| Parameter           | Type   | Required    | Description                        |
| ------------------- | ------ | ----------- | ---------------------------------- |
| `type`              | string | Yes         | Dataset type (same as above)       |
| `name`              | string | Yes         | Dataset name                       |
| `description`       | string | Yes         | Purpose and use case description   |
| `file`              | file   | Conditional | Single CSV file upload             |
| `files[]`           | files  | Conditional | Multiple PDF files upload          |
| `mapDetails`        | object | Conditional | Required for `map_extraction` type |
| `columnDefinitions` | array  | Conditional | Required for `custom_dataset` type |

### AI Generate Sample Dataset

| Parameter    | Type   | Required | Description                                                                                         |
| ------------ | ------ | -------- | --------------------------------------------------------------------------------------------------- |
| `prompt`     | string | Yes      | Natural language description of desired dataset                                                     |
| `sampleSize` | number | No       | Number of sample rows (default: 10, max: 50). Cannot be changed in refinement or confirmation steps |

### Refine AI Dataset Generation

| Parameter | Type   | Required | Description                                                     |
| --------- | ------ | -------- | --------------------------------------------------------------- |
| `id`      | string | Yes      | Dataset request ID from generate response (in URL path)         |
| `prompt`  | string | Yes      | Natural language description of corrections or additions needed |

### Confirm AI Dataset Generation

| Parameter          | Type   | Required | Description                                               |
| ------------------ | ------ | -------- | --------------------------------------------------------- |
| `datasetRequestId` | string | Yes      | Request ID from generate or generate/{id}/refine response |
| `fullSize`         | number | No       | Desired full dataset size (default: 1000, max: 100000)    |
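
The size limits in the tables above lend themselves to a quick client-side check. The sketch below builds the generate and confirm request bodies with the documented defaults and maximums; the lower bound of 1 is an assumption, and both function names are hypothetical.

```python
MAX_SAMPLE_SIZE = 50
MAX_FULL_SIZE = 100_000


def generate_body(prompt, sample_size=10):
    """Body for the generate endpoint (sampleSize default 10, max 50)."""
    if not 1 <= sample_size <= MAX_SAMPLE_SIZE:
        raise ValueError(f"sampleSize must be between 1 and {MAX_SAMPLE_SIZE}")
    return {"prompt": prompt, "sampleSize": sample_size}


def confirm_body(dataset_request_id, full_size=1000):
    """Body for the confirm endpoint (fullSize default 1000, max 100000)."""
    if not 1 <= full_size <= MAX_FULL_SIZE:
        raise ValueError(f"fullSize must be between 1 and {MAX_FULL_SIZE}")
    return {"datasetRequestId": dataset_request_id, "fullSize": full_size}
```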

## Column Definition Schema

For custom datasets, define columns with the following structure:

```json theme={null}
{
  "name": "column_name",
  "type": "string|number|date|boolean|text",
  "description": "Column purpose and content description",
  "required": true,
  "defaultValue": "optional_default"
}
```

### Column Types

| Type      | Description                                     | Example Use Cases                |
| --------- | ----------------------------------------------- | -------------------------------- |
| `string`  | Short text values (typically \< 255 characters) | Names, IDs, categories, status   |
| `text`    | Long text content (unlimited length)            | Descriptions, comments, articles |
| `number`  | Numeric values (integers and decimals)          | Prices, quantities, scores       |
| `date`    | Date and timestamp values                       | Created dates, deadlines         |
| `boolean` | True/false values                               | Active status, feature flags     |
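
A small builder can keep column definitions consistent with the schema and the type table above. This sketch uses a hypothetical `column` helper. It omits `required` and `defaultValue` when they are not given, which assumes the API treats those two fields as optional.

```python
COLUMN_TYPES = ("string", "text", "number", "date", "boolean")


def column(name, column_type, description, required=None, default=None):
    """Return one column definition for a custom_dataset request."""
    if column_type not in COLUMN_TYPES:
        raise ValueError(f"type must be one of {COLUMN_TYPES}")
    definition = {"name": name, "type": column_type, "description": description}
    # Assumption: optional fields can be left out entirely.
    if required is not None:
        definition["required"] = required
    if default is not None:
        definition["defaultValue"] = default
    return definition


column_definitions = [
    column("customer_id", "string", "Unique customer identifier", required=True),
    column("satisfaction_score", "number", "Rating from 1-10"),
]
```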

## Error Codes

| Code                   | Description                          |
| ---------------------- | ------------------------------------ |
| `VALIDATION_ERROR`     | Invalid request parameters           |
| `DATASOURCE_NOT_FOUND` | Referenced datasource doesn't exist  |
| `GENERATION_FAILED`    | AI dataset generation failed         |
| `REFINEMENT_FAILED`    | AI dataset refinement failed         |
| `REQUEST_EXPIRED`      | Dataset request ID expired           |
| `QUOTA_EXCEEDED`       | Dataset creation limit reached       |
| `INVALID_MAP_BOUNDS`   | Map extraction coordinates invalid   |
| `FILE_UPLOAD_ERROR`    | File upload failed or invalid format |
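
The error examples on this page share one shape: `error`, `code`, and an optional `details` object. The sketch below turns such a body into a Python exception so callers can branch on the code. `ClaroApiError` and `raise_for_error` are hypothetical names, and detecting errors by the presence of the `error` key is an assumption, since HTTP status codes are not documented here.

```python
class ClaroApiError(Exception):
    """Error raised for response bodies that follow the documented error shape."""

    def __init__(self, code, message, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details or {}


def raise_for_error(body):
    """Return the decoded JSON body unchanged, or raise if it is an error."""
    if isinstance(body, dict) and "error" in body:
        raise ClaroApiError(
            body.get("code", "UNKNOWN"),
            body["error"],
            body.get("details"),
        )
    return body
```

A caller might wrap `raise_for_error(response.json())` in a `try` block and, for example, request a fresh sample when `exc.code == "REQUEST_EXPIRED"`.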

## Next Steps

<CardGroup cols={2}>
  <Card title="Manage Datasets" icon="database" href="/api-reference/manage-dataset">
    View, update, and delete your created datasets
  </Card>

  <Card title="Dataset Tasks" icon="gear" href="/api-reference/tasks">
    Run AI processing and enrichment tasks on your datasets
  </Card>
</CardGroup>
