# Data Import & Ingestion

> Every way data gets into a catalogue or a Research Agents dataset.

Data flows into Claro through inbound connectors. For catalogues, inbound is configured per catalogue on the **Data Source** tab. For Research Agents, inbound is configured per agent run.

This page covers every supported source, when to use it, and the limits and trade-offs to know about.

***

## Inbound for Catalogues

Each catalogue's **Data Source** tab is where you connect feeds. Each source has its own attribute mapping, schedule, and conflict policy.

### File upload (CSV, XLSX)

* **Use for** — initial loads, supplier files, periodic dumps from systems without an API.
* **Mapping** — column-to-attribute mapping is saved on first upload and reused on subsequent uploads of the same shape.
* **Limits** — large files are split into chunks server-side. For millions of rows, use a database connector or S3 instead.

### Supplier Portal

* **Use for** — suppliers without API access who need a self-serve way to send updates.
* **Behavior** — submissions land as Data Sources with the supplier's identity attached. If the supplier has uploaded before, the saved mapping is applied automatically.
* **Detail** — see [Onboard → Supplier Portal](/modules/onboard).

### Scheduled scrape

* **Use for** — public catalogue pages, marketplace listings, competitor sources.
* **Configuration** — URL templates, target schema, cadence, throttling.
* **Outputs** — scrape runs land as a Data Source and feed any chained operations.

### HTTPS pull

* **Use for** — partner APIs, internal systems with REST endpoints.
* **Configuration** — endpoint, auth, request schedule, response parsing (JSON / CSV).
* **Auth** — supported modes are bearer token, basic, OAuth, and signed requests.
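
To make the response-parsing step concrete, here is a minimal sketch of turning a JSON or CSV response body into rows. It is not Claro's parser: the function name `parse_rows` and the assumption that a JSON response is a top-level array of objects are both hypothetical.

```python
import csv
import io
import json


def parse_rows(body: str, fmt: str) -> list[dict]:
    """Turn a response body into a list of row dicts (illustrative only)."""
    if fmt == "json":
        data = json.loads(body)
        # Assumption: the endpoint returns a top-level array of objects.
        if not isinstance(data, list):
            raise ValueError("expected a top-level JSON array of objects")
        return data
    if fmt == "csv":
        # DictReader treats the first line as the header row.
        return [dict(r) for r in csv.DictReader(io.StringIO(body))]
    raise ValueError(f"unsupported format: {fmt}")


json_rows = parse_rows('[{"sku": "A1", "price": 9.5}]', "json")
csv_rows = parse_rows("sku,price\nA1,9.5\n", "csv")
```

Note that CSV values arrive as strings while JSON values keep their types, which is one reason type coercion in the mapping step matters.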

### Database connectors

| Source       | Modes                                                      |
| ------------ | ---------------------------------------------------------- |
| **BigQuery** | Read tables or query results on a schedule.                |
| **Postgres** | Read tables or query results; per-row CDC where available. |
| **Supabase** | Read tables and views via the managed REST API.            |

### Cloud storage

| Source           | Modes                                       |
| ---------------- | ------------------------------------------- |
| **S3**           | Pull files matching a prefix on a schedule. |
| **Google Drive** | Watch a folder for new files.               |

### Email-as-source

* **Use for** — suppliers who only send updates by email.
* **Behavior** — emails sent to a workspace address are parsed. Attachments become Data Source uploads, and body content can populate a target schema.

***

## Inbound for Research Agents

Research Agents accept inputs per agent type. See [Research Agents](/research_agents) for details.

| Agent                               | Inputs                                         |
| ----------------------------------- | ---------------------------------------------- |
| Find your perfect list              | Natural-language brief and seed criteria.      |
| Turn documents into structured data | PDFs, scanned docs, datasheets, target schema. |
| Analyze & enrich spreadsheets       | CSV / XLSX file, enrichment goal.              |
| Scrape data from URLs               | List of URLs or base URL plus crawl rules.     |

Outputs land in **Generated Datasets** and can be promoted into a catalogue at any time.

***

## Mapping and conflict resolution

For every inbound source, you configure how it interacts with existing records.

### Mapping

Map source columns to catalogue attributes. Mappings include:

* **Type coercion** — convert strings to numbers, normalize dates and currencies.
* **Computed fields** — derive an attribute from one or more source columns.
* **Constants** — fill an attribute with a fixed value (e.g. *source = supplier\_x*).
* **Lookups** — translate enum-like source values to your canonical enum.
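
The four mapping kinds above can be sketched in plain Python. This is an illustration of the concepts, not Claro's mapping engine: the column names (`Price`, `Avail`, `Brand`, `Model`, `Colour`), the attribute names, and the `COLOR_LOOKUP` table are all hypothetical.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical lookup table: supplier-specific values -> canonical enum.
COLOR_LOOKUP = {"blk": "black", "wht": "white", "navy blue": "navy"}


def map_row(source: dict) -> dict:
    """Apply the four mapping kinds to one source row (illustrative only)."""
    return {
        # Type coercion: string -> Decimal, and DD/MM/YYYY -> ISO 8601 date.
        "price": Decimal(source["Price"].replace(",", "")),
        "available_from": datetime.strptime(source["Avail"], "%d/%m/%Y")
        .date()
        .isoformat(),
        # Computed field: derived from two source columns.
        "title": f'{source["Brand"].strip()} {source["Model"].strip()}',
        # Constant: the same value for every row from this source.
        "source": "supplier_x",
        # Lookup: unknown values become None so they can be reviewed.
        "color": COLOR_LOOKUP.get(source["Colour"].strip().lower()),
    }


row = map_row(
    {"Price": "1,299.50", "Avail": "03/02/2025", "Brand": " Acme ",
     "Model": "X1", "Colour": "BLK"}
)
```

Returning `None` for an unrecognized lookup value, rather than passing the raw string through, is a design choice in this sketch that keeps non-canonical values out of the enum attribute.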

### Conflict policy

When an inbound row matches an existing record:

* **Overwrite** — replace existing values (default for trusted sources).
* **Append** — add incoming values to the existing ones; available for multi-value attributes only.
* **Write-if-empty** — only fill blanks, never replace.
* **Custom rule** — most-recent, highest-confidence, or attribute-specific policy.
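
As a rough model of how these policies differ, the sketch below resolves a single attribute value under each one. The function `resolve`, the policy strings, and the timestamp arguments are hypothetical names chosen for illustration; "most recent" stands in for one possible custom rule.

```python
def resolve(policy, existing, incoming, *, existing_ts=None, incoming_ts=None):
    """Return the value to store for one attribute (illustrative only)."""
    if policy == "overwrite":
        return incoming
    if policy == "append":
        # Multi-value attributes only: keep order, skip duplicates.
        merged = list(existing or [])
        for value in incoming or []:
            if value not in merged:
                merged.append(value)
        return merged
    if policy == "write_if_empty":
        # Only fill blanks; never replace a populated value.
        return incoming if existing in (None, "", []) else existing
    if policy == "most_recent":
        if existing_ts is None or incoming_ts is None:
            raise ValueError("most_recent needs both timestamps")
        return incoming if incoming_ts >= existing_ts else existing
    raise ValueError(f"unknown policy: {policy}")


kept = resolve("write_if_empty", "Acme X1", "ACME-X1")
tags = resolve("append", ["outdoor"], ["outdoor", "waterproof"])
```

Here `kept` stays at the existing value because the attribute was already populated, and `tags` gains only the value it did not already contain.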

### Identity matching

Every inbound row needs to resolve to a record. **Data Source Mapping** runs first, using the similarity graph and configured key fields. Unmatched rows are flagged for review before they land in the catalogue.
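
The key-field half of this step can be pictured as an exact-match lookup, with everything that fails to match set aside for review. The sketch below covers only that part and does not model the similarity graph; `match_rows` and the sample field names are hypothetical.

```python
def match_rows(rows, records, key_fields):
    """Split rows into (record_id, row) matches and rows needing review."""
    # Index existing records by their key-field values.
    index = {
        tuple(rec.get(k) for k in key_fields): rec_id
        for rec_id, rec in records.items()
    }
    matched, needs_review = [], []
    for row in rows:
        key = tuple(row.get(k) for k in key_fields)
        # A row missing any key field cannot be matched exactly.
        rec_id = None if None in key else index.get(key)
        if rec_id is None:
            needs_review.append(row)
        else:
            matched.append((rec_id, row))
    return matched, needs_review


records = {"rec_1": {"brand": "Acme", "mpn": "X1"}}
rows = [
    {"brand": "Acme", "mpn": "X1", "price": "10"},
    {"brand": "Acme", "mpn": "X2", "price": "12"},
    {"brand": "Acme", "price": "9"},
]
matched, needs_review = match_rows(rows, records, ["brand", "mpn"])
```

In this example the first row resolves to `rec_1`, while the other two end up in `needs_review`: one has an unknown key and one is missing a key field.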

***

## Limits and best practices

* Test with a small sample first — 50 to 200 rows — before connecting a recurring source.
* Validate the schema upstream when possible (Supplier Portal does this for you).
* Prefer column-level mappings to ad-hoc fixes; mappings are reused, ad-hoc fixes are not.
* For very high volume, prefer database or S3 connectors over file upload.
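
One way to produce the small test sample suggested above is to copy the header and the first N data rows of a CSV into a new file. This is a generic sketch using Python's standard `csv` module; the function name `sample_csv` and the default of 100 rows are arbitrary choices within the 50 to 200 range.

```python
import csv
import io
from itertools import islice


def sample_csv(src, dst, n=100):
    """Copy the header plus the first n data rows from src to dst.

    src and dst are text file objects; returns the number of data rows copied.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:
        return 0  # empty input: nothing to sample
    writer.writerow(header)
    count = 0
    for row in islice(reader, n):
        writer.writerow(row)
        count += 1
    return count


# In-memory demo; with real files, open both with newline="".
full = io.StringIO("sku,price\nA1,9.5\nA2,10\nA3,11\nA4,12\n")
sample = io.StringIO()
copied = sample_csv(full, sample, n=3)
```

Taking the first rows is the simplest approach; if the file is sorted in a way that hides variety (for example by supplier), a random sample may exercise the mapping better.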

***

## Outbound

Distributing data downstream is handled by [Distribute](/modules/distribute) — Unified Catalog and Sync & Export.
