Python script tips
A short list of things that surprise people the first time they hit them.
Per-field Python is sandboxed and capped at 2 seconds
Per-field Python generators run inside a sandbox and have a 2-second per-row timeout. They are not full scenarios.
- No pip install. Standard library only.
- No subprocess, no filesystem (except dm.workspace_file).
- No long-running loops. If you need state across rows, use dm.counter().
If you’re reaching for any of those, lift the work into a scenario.
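As a sketch of the dm.counter() escape hatch — the FakeDM stub below and the counter semantics (per-name, monotonically increasing from 1) are assumptions for illustration, not the SDK's actual implementation:

```python
import itertools
import random

class FakeDM:
    """Stand-in for the real dm object, for illustration only."""
    def __init__(self):
        self._counters = {}

    def counter(self, name):
        # assumed semantics: monotonically increasing int per counter name
        c = self._counters.setdefault(name, itertools.count(1))
        return next(c)

def value(rng, row, dm):
    # State across rows lives in dm.counter(), not in a module-level global.
    return f"INV-{dm.counter('invoice_no'):06d}"

rng = random.Random(7)
dm = FakeDM()
print([value(rng, {}, dm) for _ in range(3)])
# ['INV-000001', 'INV-000002', 'INV-000003']
```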
row only contains previously generated fields
In a per-field Python generator, row[name] only sees fields that come before the
current one in the template. Drag your Python field below any siblings it reads.
For a value that depends on later fields (or on the full row), use a derived field. Derived runs in a second pass and sees everything.
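For instance — the exact derived-field signature is an assumption here; the point is that by the second pass the row is complete, regardless of field order:

```python
# Hypothetical derived-field body: every field is populated by the time
# it runs, even ones declared after it in the template.
def full_name(row):
    return f"{row['first_name']} {row['last_name']}"

row = {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com"}
print(full_name(row))  # Ada Lovelace
```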
Use the rng argument, not random.random()
Per-field generators get a seeded rng — use it. random.random() (module-level)
isn’t seeded by DataMaker, so:
- Your output won’t be reproducible when the template has a seed set.
- Two preview runs with the same seed will diverge.
```python
def value(rng, row, dm):
    return rng.choice(["a", "b", "c"])  # ✓ seeded
    # NOT: return random.choice([...])  # ✗ unseeded
```

Stream large generations
dm.generate(template_id=..., count=1_000_000) returns a list of one million dicts.
That’s a few hundred MB in RAM. Use stream instead:
```python
for row in dm.template("Customer").stream(count=1_000_000):
    pg.insert_one("customers", row)  # batched internally
```

For database connections, the SDK accepts a generator directly:

```python
pg.insert(table="customers",
          rows=dm.template("Customer").stream(count=1_000_000))
```

dm.connection() is cached per scenario
Resolving a connection by name does an API call. Cache the result if you call it more than once:
```python
# bad — one extra API call per loop iteration
for chunk in batches:
    dm.connection("pg").insert(table="t", rows=chunk)

# good
pg = dm.connection("pg")
for chunk in batches:
    pg.insert(table="t", rows=chunk)
```

(In practice, the SDK caches resolutions for the lifetime of the DataMaker instance, so this is a hot-path optimisation, not a correctness one.)
print() is captured but lossy
print() works inside a scenario and lands in the live log. But:
- It uses Python’s default flush behaviour. Wrap in print(..., flush=True) if you want the log to update line-by-line during a long step.
- The structured-log fields aren’t populated (no actor / source metadata). Prefer dm.log.info(...) for anything you’d grep later.
Idempotency is your problem
Scenarios don’t have built-in checkpointing. A failed run that’s retried starts from the top. If the script POSTs to a non-idempotent endpoint, you can double-create. Two patterns:
```python
# 1. Pre-check
if pg.execute("SELECT 1 FROM customers WHERE id = %s", [cid]):
    return

# 2. Idempotency keys
sap.post(
    entity="A_BusinessPartner",
    rows=batch,
    idempotency_key=f"{dm.run_id}:{batch_index}",
)
```

params are always strings
Run params come in over the wire as JSON, but the SDK normalises every value to a string. Cast explicitly:
```python
size = int(dm.params.get("size", "100"))
debug = dm.params.get("debug", "false").lower() == "true"
```

This avoids surprises when CI passes "100" and your code does count=size.
dm.log redacts Authorization
dm.log.info("got %s", response.headers) is safe — Authorization, X-API-Key, Set-Cookie
are redacted. print(response.headers) is not. Default to dm.log for anything that
might capture a header.
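Roughly what that redaction looks like — this is an illustrative re-implementation, not the SDK's actual code; the header list is taken from the tip above:

```python
SENSITIVE = {"authorization", "x-api-key", "set-cookie"}

def redact(headers):
    """Illustrative only: approximates what dm.log does before writing."""
    return {k: "<redacted>" if k.lower() in SENSITIVE else v
            for k, v in headers.items()}

print(redact({"Authorization": "Bearer abc123", "Accept": "application/json"}))
# {'Authorization': '<redacted>', 'Accept': 'application/json'}
```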
requirements: is parsed only at the top
```python
# requirements: arrow~=1.3
import arrow
```

The # requirements: comment is parsed once at script load. Don’t put it in the middle of the file or inside a function.
Workers cold start, then stay warm
The first run after deploy / config change takes ~600ms longer than subsequent runs. For latency-sensitive flows (CI smoke tests in <2s), warm the worker:
```shell
curl -X POST .../scenarios/$ID/run \
  -d '{"params": {"warm": "true"}, "wait": "true"}'
```

Have the scenario branch on dm.params.get("warm") and return early.
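Scenario-side, that early-return branch might look like this — the run() shape and the FakeDM stub are assumptions for illustration:

```python
class FakeDM:
    """Stand-in for the real dm object, for illustration only."""
    def __init__(self, params):
        self.params = params

def run(dm):
    # params arrive as strings (see above), so compare against "true"
    if dm.params.get("warm") == "true":
        return "warmed"  # warm-up call: exit before any real work
    # ... real scenario body ...
    return "ran"

print(run(FakeDM({"warm": "true"})))  # warmed
print(run(FakeDM({})))                # ran
```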