# Mask PII / GDPR
The point of masking in DataMaker is that the AI agent and the platform refuse, by default, to move PII anywhere it shouldn’t go. You declare which fields are sensitive, pick a strategy per field, and the rest of the pipeline behaves accordingly.
## The model
- Mark fields sensitive on the template (or on a fetched set). See Templates → Sensitive fields.
- Pick a masking strategy per field: `replace`, `format-preserve`, or `redact`.
- Authorise exports explicitly — sensitive fields don’t leave DataMaker without a deliberate choice.
- Audit every export, every run.
The whole flow is auditable per template per project; your DPO can sign off once and not re-review every change.
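As a minimal sketch of the model, the declaration boils down to one explicit mapping of sensitive field → strategy. The field names follow the `customers` example later on this page, and the dict shape mirrors the `strategies=` argument of `dm.mask()`; the constant names here are made up for illustration:

```python
# Hypothetical declaration: which fields are sensitive, and how each is masked.
SENSITIVE_STRATEGIES = {
    "name": "replace",
    "email": "format-preserve",
    "tax_id": "replace",
    "dob": "shift",
}

# The strategies this page documents.
KNOWN_STRATEGIES = {"replace", "format-preserve", "redact", "shift", "bucket", "keep"}

# Every declared strategy must be one DataMaker knows about.
assert set(SENSITIVE_STRATEGIES.values()) <= KNOWN_STRATEGIES
```

Keeping the mapping in one place is what makes the DPO sign-off workable: the review target is this declaration, not every downstream query.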
## Practical pipeline
For an existing customer table you want to use for training / staging / debugging, the typical scenario looks like this:
```python
from datamaker import DataMaker

dm = DataMaker()
src = dm.connection("conn_prod_postgres_readonly")
dst = dm.connection("conn_staging_postgres")

# 1. Fetch real records from prod (read-only).
real = src.execute(
    "SELECT id, name, email, tax_id, dob, balance "
    "FROM customers WHERE created_at > NOW() - INTERVAL '90 days' "
    "ORDER BY random() LIMIT 5000"
)

# 2. Mask sensitive fields.
masked = dm.mask(real, strategies={
    "name": "replace",           # pick a fresh fake name per row
    "email": "format-preserve",  # keep domain, scramble local-part
    "tax_id": "replace",         # substitute a country-correct USt-ID
    "dob": "shift",              # ±90 days from the real date
})

# 3. Push to staging.
dst.insert(table="customers", rows=masked, on_conflict="update", key="id")
print(f"✓ pushed {len(masked)} masked rows")
```

The `dm.mask()` step:
- Honours each field’s strategy.
- Logs what was masked (without the values).
- Refuses to run if any sensitive field doesn’t have a strategy you’ve explicitly named — fails loud rather than leaking.
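The fail-loud rule in the last bullet can be sketched in plain Python. This is a standalone illustration of the behaviour, not DataMaker’s internal code; `MaskingError` and `check_strategies` are made-up names:

```python
class MaskingError(Exception):
    """Raised instead of silently passing unmasked values through."""

def check_strategies(sensitive_fields, strategies):
    # Every sensitive field must have an explicitly named strategy;
    # otherwise refuse to run rather than risk leaking a value.
    missing = set(sensitive_fields) - set(strategies)
    if missing:
        raise MaskingError(f"no strategy for sensitive fields: {sorted(missing)}")

check_strategies(["name", "email"], {"name": "replace", "email": "redact"})  # ok
try:
    check_strategies(["name", "tax_id"], {"name": "replace"})
except MaskingError as e:
    print(e)  # no strategy for sensitive fields: ['tax_id']
```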
## Strategies in practice
| Strategy | When to use |
|---|---|
| `replace` | Default. Fresh fake of the right type. Most flexible. |
| `format-preserve` | When downstream regex / format validators must still pass. |
| `redact` | When the field’s presence matters but the value doesn’t. |
| `shift` (date) | Preserve “this happened 30 days after that” relationships. |
| `bucket` (number) | Round to the nearest band — 123.45 → 100, 1234 → 1000. |
| `keep` | Explicitly keep the real value. Requires owner-level confirmation. |
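What `shift` and `bucket` conceptually do can be shown in a few lines of plain Python. These are illustrations of the table’s semantics, not DataMaker’s implementation; edge-case behaviour (zero or negative numbers, offset ranges) is assumed:

```python
import math
from datetime import date, timedelta

def bucket(value: float) -> int:
    # Round down to the value's order of magnitude: 123.45 -> 100, 1234 -> 1000.
    # (Assumes value > 0; behaviour for zero/negative values is not documented.)
    return 10 ** math.floor(math.log10(value))

def shift(d: date, offset_days: int) -> date:
    # Shift a date by a fixed offset. Applying the same offset to every date
    # of one record preserves intervals between them.
    return d + timedelta(days=offset_days)

assert bucket(123.45) == 100
assert bucket(1234) == 1000

dob, signup = date(1990, 5, 1), date(1990, 5, 31)
off = 42  # per the table above, DataMaker draws this within ±90 days
assert (shift(signup, off) - shift(dob, off)).days == 30  # interval survives
```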
## Reproducibility
`format-preserve` and `replace` (when seeded) are deterministic — the same real value always maps to the same fake. Useful when you need to join staging tables across masked datasets without re-introducing the real ID.
```python
masked = dm.mask(real, strategies={...}, seed="2026q2-staging")
```

Same seed → same mappings. A new seed produces a fresh, unrelated set.
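The property itself is easy to demonstrate with a hash-based pseudonym. This illustrates why seeded masking keeps joins intact; it is not DataMaker’s actual algorithm, and `fake_id` is a made-up helper:

```python
import hashlib

def fake_id(real_value: str, seed: str) -> str:
    # Deterministic pseudonym: the same (value, seed) pair always yields the
    # same fake, so rows masked separately still join on the fake ID.
    digest = hashlib.sha256(f"{seed}:{real_value}".encode()).hexdigest()
    return f"cust_{digest[:12]}"

a = fake_id("4711", seed="2026q2-staging")
b = fake_id("4711", seed="2026q2-staging")
c = fake_id("4711", seed="2026q3-staging")
assert a == b  # same seed: stable mapping, cross-table joins survive
assert a != c  # new seed: fresh, unrelated set
```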
## Audit log
Every `dm.mask()` and every export of a set with sensitive fields is logged. From
Settings → Audit log → Filter: `sensitive_export = true`, you get:
- Timestamp.
- Actor (user or agent session ID).
- Source (template / set / connection fetch).
- Target (download / connection / chat).
- Field count + strategies.
- Outcome (success / partial / blocked).
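For the DPO hand-off, the CSV export can be filtered with the standard library. The column names below are assumptions modelled on the fields listed above, not DataMaker’s documented export schema:

```python
import csv
import io

# Hypothetical two-row export; real column names may differ.
export = io.StringIO(
    "timestamp,actor,source,target,field_count,outcome\n"
    "2026-04-01T09:12:00Z,agent:sess_123,conn_prod_postgres_readonly,connection,4,success\n"
    "2026-04-02T14:03:00Z,user:alice,template,download,2,blocked\n"
)

# Pull out anything that was blocked for closer review.
blocked = [row for row in csv.DictReader(export) if row["outcome"] == "blocked"]
print(len(blocked), blocked[0]["actor"])  # 1 user:alice
```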
Export as CSV for your DPO. Retention: 14 days (Free), 90 days (Pro), per-contract (Enterprise).
## What not to do
- Don’t `print()` real values before masking. Logs are retained per plan, and a raw `print(row)` will land in them. `dm.log` redacts sensitive keys; a raw `print` doesn’t.
- Don’t write real values to a workspace file. Workspace files are not subject to per-row masking — they’re plain blobs. Mask first, then write.
- Don’t re-import into a non-DataMaker store and consider yourself done. The whole “masked at the template level” guarantee only holds while the data flows through DataMaker. A copy in another DB is your problem to govern.
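The difference between `dm.log` and a raw `print` in the first bullet comes down to redaction. A sketch of what a redacting logger does (the `redacted` helper is made up; only the behaviour it demonstrates comes from this page):

```python
SENSITIVE_KEYS = {"name", "email", "tax_id", "dob"}

def redacted(row: dict) -> dict:
    # Keep the row's shape for debugging, but drop the sensitive values.
    return {k: ("<redacted>" if k in SENSITIVE_KEYS else v) for k, v in row.items()}

row = {"id": 7, "email": "anna@example.com", "balance": 120.5}
print(row)            # raw print: the real email lands in retained logs
print(redacted(row))  # {'id': 7, 'email': '<redacted>', 'balance': 120.5}
```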
## Related
- Templates → Sensitive fields for the model.
- Workflows → SAP regression for SAP-specific masking.