Analyse existing data

Sometimes you don’t want to build a template — you have data, and you want a template that produces more like it. DataMaker can analyse:

  • CSV files (one or many).
  • JSON documents (objects or arrays of objects).
  • YAML / XML (best-effort).
  • OpenAPI schemas (components.schemas).
  • Live databases — point at a connection and a table, we read the schema + sample rows.
  • Live SAP OData — $metadata + sample rows from $top=100.

The output is a real DataMaker template you can edit, version, and reuse.

CSV / JSON upload

In any chat (Agent mode), drag a file onto the prompt area. The agent runs:

create_template_from_csv (or _json, _openapi)
- infer the column / property types
- sample distributions for numeric / date columns
- detect IDs, emails, phones, IBANs by pattern
- mark obvious PII fields as sensitive (name, email, tax_id, dob…)
preview_template
- show 3 rows for a sanity check

You see what it inferred and can correct anything. Saving creates a template.
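
Under the hood, the pattern-detection step is a per-column match rate against known formats. A minimal sketch of the idea in Python (the regexes, threshold, and sensitive set here are illustrative, not DataMaker's actual rules):

import re

# Illustrative patterns; the real rule set is larger (phones, per-country IBANs, ...).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uuid": re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I),
    "iban": re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$"),
}
SENSITIVE = {"email", "iban"}  # names, tax IDs, DOBs would be caught via headers, too

def classify_column(values, min_match=0.9):
    """Return (type, sensitive) when >=90% of non-empty values match a pattern."""
    values = [str(v) for v in values if v]
    for name, rx in PATTERNS.items():
        if values and sum(bool(rx.match(v)) for v in values) / len(values) >= min_match:
            return name, name in SENSITIVE
    return "string", False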

OpenAPI schema

For services with an OpenAPI spec, the agent can build a template from a named schema:

agent: create a DataMaker template from the Customer schema in petstore.yaml
→ create_template_from_openapi(file="petstore.yaml", schema="Customer")
- 12 properties found
- inferred types: 9 strings, 2 numbers, 1 enum
- 2 properties have format hints we'll honour: email, date-time
→ template "Customer (from OpenAPI)" created
agent: previewing 3 rows...

Required vs optional fields in the schema map to nullable rules in the template. Format-tagged strings (email, uuid, uri, ipv4, …) become the matching DataMaker types.
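
In other words, the conversion is mostly a lookup from the JSON Schema format hint to a generator type, plus a nullable flag taken from the required list. A sketch, assuming hypothetical target type names:

# Hypothetical format -> generator type mapping; the real table is longer.
FORMAT_TO_TYPE = {
    "email": "email", "uuid": "uuid", "uri": "url",
    "ipv4": "ip_address", "date-time": "datetime", "date": "date",
}

def field_rule(name, prop, required):
    """Turn one OpenAPI property into a template field."""
    if "enum" in prop:
        ftype = "enum"
    else:
        ftype = FORMAT_TO_TYPE.get(prop.get("format"), prop.get("type", "string"))
    # Optional in the schema -> nullable in the template.
    return {"name": name, "type": ftype, "nullable": name not in required}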

Live database analysis

For a connected database, the agent can sample a table and infer:

agent: build a template from the customers table in conn_prod_postgres_readonly
→ fetch_endpoint_fields → 14 columns
→ execute("SELECT * FROM customers LIMIT 1000")
- 14 columns inferred
- id: UUID
- email: pattern-matched as email; sensitive
- balance: distribution=long-tail (Pareto), alpha~1.4
- country: enum from observed distribution {DE: 71%, AT: 14%, CH: 8%, others}
- created_at: distribution=recent_bias, halflife≈45 days
→ template "Customer (from prod)" created (sensitive flags applied)

Distributions are inferred from the sampled rows, not guessed. The more rows you give it, the better the fit.
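
Two of the fits shown above have simple closed forms, which is why a modest sample already works. A sketch, assuming a Pareto fit for the long tail and an exponential decay for the recent bias (the estimators DataMaker actually uses are not documented here):

import math
from datetime import datetime, timezone

def pareto_alpha(samples):
    """Maximum-likelihood shape parameter for a Pareto fit over positive samples."""
    x_min = min(samples)
    return len(samples) / sum(math.log(x / x_min) for x in samples)

def recency_halflife_days(timestamps):
    """Half-life of an exponential recent bias: ln(2) times the mean age.

    Assumes timezone-aware datetimes, e.g. a created_at column.
    """
    now = datetime.now(timezone.utc)
    mean_age = sum((now - ts).total_seconds() for ts in timestamps) / len(timestamps)
    return math.log(2) * mean_age / 86400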

Live SAP OData

agent: build a template from A_BusinessPartner in conn_s4_sandbox
→ fetch_sap_metadata → 28 properties
→ fetch_sap_records_filtered (top=100)
- 28 properties inferred
- BusinessPartner: ID
- Country: enum {DE: 88%, AT: 9%, CH: 3%}
- TaxNumber1: pattern-matched as DE USt-ID; sensitive
- …
→ template "A_BusinessPartner" created

This is the foundation of the SAP regression workflow — the auto-generated template lets you generate more records that look like the real ones, without ever touching production again.
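
For reference, the two SAP calls map onto plain OData requests; a sketch using the requests library (host and credentials are placeholders):

import requests

BASE = "https://s4-sandbox.example.com/sap/opu/odata/sap/API_BUSINESS_PARTNER"
AUTH = ("user", "password")  # placeholder

# 1. $metadata: property names, EDM types, nullability.
metadata_xml = requests.get(f"{BASE}/$metadata", auth=AUTH).text

# 2. 100 sample records to infer enums, patterns, and distributions from.
records = requests.get(
    f"{BASE}/A_BusinessPartner",
    params={"$top": 100, "$format": "json"},
    auth=AUTH,
).json()["d"]["results"]  # OData v2 JSON envelope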

What gets inferred well, what doesn’t

Reliable:

  • Type — string vs number vs date vs boolean.
  • Format-tagged strings — email, UUID, URL, IBAN.
  • Categorical fields — when there are <30 unique values, we infer an enum (sketched after this list).
  • Numeric distributions — uniform / gaussian / long-tail.
  • Sensitive fields — names, emails, national IDs, IBANs are flagged automatically.
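
The enum check is worth seeing in miniature, because it also captures the observed weights (the output format is an assumption):

from collections import Counter

def infer_enum(values, max_unique=30):
    """Return weighted enum options when a column has <30 distinct values."""
    counts = Counter(v for v in values if v is not None)
    if not counts or len(counts) >= max_unique:
        return None  # too many distinct values: not categorical
    total = sum(counts.values())
    return [{"value": v, "weight": c / total} for v, c in counts.most_common()]

This weighting is why country came back as {DE: 71%, AT: 14%, CH: 8%, others} in the database example above.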

Worth reviewing:

  • Free-text fields — we’ll usually pick paragraph or sentence; you may want something more structured.
  • Cross-field relationships — email = first_name@company.com patterns aren’t always caught; consider a derived field (see the sketch after this list).
  • Domain-specific codes — industry codes, internal IDs. We’ll fall back to regex; refine to enum if you have the catalogue.
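
A derived field is just a post-generation transformation. Expressed as plain Python rather than DataMaker's own syntax (the helper and field names are hypothetical, taken from the bullet above):

import re

def derive_email(row):
    """Hypothetical post-processing: rebuild email from other generated fields."""
    first = row["first_name"].lower()
    domain = re.sub(r"[^a-z0-9]", "", row["company"].lower())
    return f"{first}@{domain}.com"

row = {"first_name": "Ada", "company": "Example GmbH"}
row["email"] = derive_email(row)  # -> "ada@examplegmbh.com"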

Tip

After analysis, run preview_template 5-10 times before saving. Each preview generates 3 sample rows; if every set looks plausible, the template is well-shaped. If you spot oddities, ask the agent to refine a specific field.