Analyse existing data

Sometimes you don’t want to build a template — you have data, and you want a template that produces more like it. DataMaker can analyse:

  • CSV files (one or many).
  • JSON documents (objects or arrays of objects).
  • YAML / XML (best-effort).
  • OpenAPI schemas (components.schemas).
  • Live databases — point at a connection and a table, we read the schema + sample rows.
  • Live SAP OData — $metadata + sample rows from $top=100.

The output is a real DataMaker template you can edit, version, and reuse.

CSV / JSON upload

In any chat (Agent mode), drag a file onto the prompt area. The agent runs:

create_template_from_csv (or _json, _openapi)
- infer the column / property types
- sample distributions for numeric / date columns
- detect IDs, emails, phones, IBANs by pattern
- mark obvious PII fields as sensitive (name, email, tax_id, dob…)
preview_template
- show 3 rows for a sanity check

You see what it inferred and can correct anything. Saving creates a template.
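
Under the hood, the pattern-detection step is a per-column match rate against known formats. A minimal sketch of the idea in Python (the regexes, threshold, and sensitive set here are illustrative, not DataMaker's actual rules):

import re

# Illustrative patterns; the real rule set is larger (phones, per-country IBANs, ...).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uuid": re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I),
    "iban": re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$"),
}
SENSITIVE = {"email", "iban"}  # names, tax IDs, DOBs would be caught via headers, too

def classify_column(values, min_match=0.9):
    """Return (type, sensitive) when >=90% of non-empty values match a pattern."""
    values = [str(v) for v in values if v]
    for name, rx in PATTERNS.items():
        if values and sum(bool(rx.match(v)) for v in values) / len(values) >= min_match:
            return name, name in SENSITIVE
    return "string", False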

OpenAPI schema

For services with an OpenAPI spec, the agent can build a template from a named schema:

agent: create a DataMaker template from the Customer schema in petstore.yaml
→ create_template_from_openapi(file="petstore.yaml", schema="Customer")
- 12 properties found
- inferred types: 9 strings, 2 numbers, 1 enum
- 2 properties have format hints we'll honour: email, date-time
→ template "Customer (from OpenAPI)" created
agent: previewing 3 rows...

Required vs optional fields in the schema map to nullable rules in the template. Format-tagged strings (email, uuid, uri, ipv4, …) become the matching DataMaker types.
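
In other words, the conversion is mostly a lookup from the JSON Schema format hint to a generator type, plus a nullable flag taken from the required list. A sketch, assuming hypothetical target type names:

# Hypothetical format -> generator type mapping; the real table is longer.
FORMAT_TO_TYPE = {
    "email": "email", "uuid": "uuid", "uri": "url",
    "ipv4": "ip_address", "date-time": "datetime", "date": "date",
}

def field_rule(name, prop, required):
    """Turn one OpenAPI property into a template field."""
    if "enum" in prop:
        ftype = "enum"
    else:
        ftype = FORMAT_TO_TYPE.get(prop.get("format"), prop.get("type", "string"))
    # Optional in the schema -> nullable in the template.
    return {"name": name, "type": ftype, "nullable": name not in required}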

Live database analysis

For a connected database, the agent can sample a table and infer:

agent: build a template from the customers table in conn_prod_postgres_readonly
→ fetch_endpoint_fields → 14 columns
→ execute("SELECT * FROM customers LIMIT 1000")
- 14 columns inferred
- id: UUID
- email: pattern-matched as email; sensitive
- balance: distribution=long-tail (Pareto), alpha~1.4
- country: enum from observed distribution {DE: 71%, AT: 14%, CH: 8%, others}
- created_at: distribution=recent_bias, halflife≈45 days
→ template "Customer (from prod)" created (sensitive flags applied)

Distributions are inferred from the sampled rows, not guessed. The more rows you give it, the better the fit.
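
Two of the fits shown above have simple closed forms, which is why a modest sample already works. A sketch, assuming a Pareto fit for the long tail and an exponential decay for the recent bias (the estimators DataMaker actually uses are not documented here):

import math
from datetime import datetime, timezone

def pareto_alpha(samples):
    """Maximum-likelihood shape parameter for a Pareto fit over positive samples."""
    x_min = min(samples)
    return len(samples) / sum(math.log(x / x_min) for x in samples)

def recency_halflife_days(timestamps):
    """Half-life of an exponential recent bias: ln(2) times the mean age.

    Assumes timezone-aware datetimes, e.g. a created_at column.
    """
    now = datetime.now(timezone.utc)
    mean_age = sum((now - ts).total_seconds() for ts in timestamps) / len(timestamps)
    return math.log(2) * mean_age / 86400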

Live SAP OData

agent: build a template from A_BusinessPartner in conn_s4_sandbox
→ fetch_sap_metadata → 28 properties
→ fetch_sap_records_filtered (top=100)
- 28 properties inferred
- BusinessPartner: ID
- Country: enum {DE: 88%, AT: 9%, CH: 3%}
- TaxNumber1: pattern-matched as DE USt-ID; sensitive
- …
→ template "A_BusinessPartner" created

This is the foundation of the SAP regression workflow — the auto-generated template lets you generate more records that look like the real ones, without ever touching production again.
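
For reference, the two SAP calls map onto plain OData requests; a sketch using the requests library (host and credentials are placeholders):

import requests

BASE = "https://s4-sandbox.example.com/sap/opu/odata/sap/API_BUSINESS_PARTNER"
AUTH = ("user", "password")  # placeholder

# 1. $metadata: property names, EDM types, nullability.
metadata_xml = requests.get(f"{BASE}/$metadata", auth=AUTH).text

# 2. 100 sample records to infer enums, patterns, and distributions from.
records = requests.get(
    f"{BASE}/A_BusinessPartner",
    params={"$top": 100, "$format": "json"},
    auth=AUTH,
).json()["d"]["results"]  # OData v2 JSON envelope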

What gets inferred well, what doesn’t

Reliable:

  • Type — string vs number vs date vs boolean.
  • Format-tagged strings — email, UUID, URL, IBAN.
  • Categorical fields — when there are <30 unique values, we infer an enum (sketched after this list).
  • Numeric distributions — uniform / gaussian / long-tail.
  • Sensitive fields — names, emails, national IDs, IBANs are flagged automatically.
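
The enum check is worth seeing in miniature, because it also captures the observed weights (the output format is an assumption):

from collections import Counter

def infer_enum(values, max_unique=30):
    """Return weighted enum options when a column has <30 distinct values."""
    counts = Counter(v for v in values if v is not None)
    if not counts or len(counts) >= max_unique:
        return None  # too many distinct values: not categorical
    total = sum(counts.values())
    return [{"value": v, "weight": c / total} for v, c in counts.most_common()]

This weighting is why country came back as {DE: 71%, AT: 14%, CH: 8%, others} in the database example above.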

Worth reviewing:

  • Free-text fields — we’ll usually pick paragraph or sentence; you may want something more structured.
  • Cross-field relationships — email = first_name@company.com patterns aren’t always caught; consider a derived field (see the sketch after this list).
  • Domain-specific codes — industry codes, internal IDs. We’ll fall back to regex; refine to enum if you have the catalogue.
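
A derived field is just a post-generation transformation. Expressed as plain Python rather than DataMaker's own syntax (the helper and field names are hypothetical, taken from the bullet above):

import re

def derive_email(row):
    """Hypothetical post-processing: rebuild email from other generated fields."""
    first = row["first_name"].lower()
    domain = re.sub(r"[^a-z0-9]", "", row["company"].lower())
    return f"{first}@{domain}.com"

row = {"first_name": "Ada", "company": "Example GmbH"}
row["email"] = derive_email(row)  # -> "ada@examplegmbh.com"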

Tip

After analysis, run preview_template 5-10 times before saving. Each preview generates 3 sample rows; if every set looks plausible, the template is well-shaped. If you spot oddities, ask the agent to refine a specific field.