
Python script tips

A short list of things that surprise people the first time they hit them.

Per-field Python is sandboxed and capped at 2 seconds

Per-field Python generators run inside a sandbox and have a 2-second per-row timeout. They are not full scenarios.

  • No pip install. Standard library only.
  • No subprocess, no filesystem (except dm.workspace_file).
  • No long-running loops. If you need state across rows, use dm.counter().

If you’re reaching for any of those, lift the work into a scenario.
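For the "state across rows" case, the counter pattern looks roughly like this. dm.counter() is stubbed with itertools.count here, since its real signature isn't shown above; treat the stub as illustrative only:

```python
import itertools

# Hypothetical stand-in for dm.counter(): a monotonic per-run counter.
_counter = itertools.count(1)

def value(rng, row, dm=None):
    # State lives in the counter, not in a long-running loop inside the
    # generator, so each call stays well under the 2-second cap.
    return f"INV-{next(_counter):06d}"
```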

row only contains previously generated fields

In a per-field Python generator, row[name] only sees fields that come before the current one in the template. Drag your Python field below any siblings it reads.

For a value that depends on later fields (or on the full row), use a derived field. Derived runs in a second pass and sees everything.
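The ordering rule is easy to picture as a tiny two-pass loop (plain Python, not the real engine):

```python
def build_row(field_gens, derived):
    row = {}
    for name, gen in field_gens:   # pass 1: per-field, top to bottom;
        row[name] = gen(row)       # each gen sees only earlier fields
    for name, fn in derived:       # pass 2: derived fields see everything
        row[name] = fn(row)
    return row

row = build_row(
    [("first", lambda r: "Ada"), ("last", lambda r: "Lovelace")],
    [("full_name", lambda r: f"{r['first']} {r['last']}")],
)
```

A per-field generator for full_name would fail (or see a missing key) if it were placed above first or last; the derived pass has no such constraint.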

Use the rng argument, not random.random()

Per-field generators get a seeded rng — use it. random.random() (module-level) isn’t seeded by DataMaker, so:

  • Your output won’t be reproducible when the template has a seed set.
  • Two preview runs with the same seed will diverge.
def value(rng, row, dm):
    return rng.choice(["a", "b", "c"])  # ✓ seeded

# NOT: return random.choice([...])  # ✗ unseeded
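The reproducibility point is visible with the stdlib alone: two identically seeded random.Random instances track each other exactly, while the module-level functions share one global state that DataMaker never seeds.

```python
import random

# Two generators seeded the same way produce the same stream; this is
# what a seeded rng argument buys you.
a = random.Random(42)
b = random.Random(42)
picks_a = [a.choice(["a", "b", "c"]) for _ in range(5)]
picks_b = [b.choice(["a", "b", "c"]) for _ in range(5)]
```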

Stream large generations

dm.generate(template_id=..., count=1_000_000) returns a list of one million dicts. That’s a few hundred MB in RAM. Use stream instead:

for row in dm.template("Customer").stream(count=1_000_000):
    pg.insert_one("customers", row)  # batched internally

For database connections, the SDK accepts a generator directly:

pg.insert(table="customers", rows=dm.template("Customer").stream(count=1_000_000))
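If your sink wants chunks rather than single rows, a generic batching helper keeps the stream lazy. This is plain itertools, not part of the DataMaker SDK:

```python
from itertools import islice

def batches(rows, size):
    # Yield lists of up to `size` items from any iterable without
    # materialising the whole stream in memory.
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk
```

Then feed it the stream: for chunk in batches(dm.template("Customer").stream(count=1_000_000), 5_000): ...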

dm.connection() is cached per scenario

Resolving a connection by name makes an API call. Cache the result if you call it more than once:

# bad — one extra API call per loop iteration
for chunk in batches:
    dm.connection("pg").insert(table="t", rows=chunk)

# good
pg = dm.connection("pg")
for chunk in batches:
    pg.insert(table="t", rows=chunk)

(In practice, the SDK caches resolutions for the lifetime of the DataMaker instance, so this is a hot-path optimisation, not a correctness one.)

print() goes to the live log, with caveats

print() works inside a scenario and lands in the live log. But:

  • It uses Python’s default flush behaviour. Wrap in print(..., flush=True) if you want the log to update line-by-line during a long step.
  • The structured-log fields aren’t populated (no actor / source metadata). Prefer dm.log.info(...) for anything you’d grep later.

Idempotency is your problem

Scenarios don’t have built-in checkpointing. A failed run that’s retried starts from the top. If the script POSTs to a non-idempotent endpoint, you can double-create. Two patterns:

# 1. Pre-check
if pg.execute("SELECT 1 FROM customers WHERE id = %s", [cid]):
    return

# 2. Idempotency keys
sap.post(
    entity="A_BusinessPartner",
    rows=batch,
    idempotency_key=f"{dm.run_id}:{batch_index}",
)
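The key pattern only works because the receiving side drops repeats. Conceptually (a toy sink, not the real SAP client):

```python
class IdempotentSink:
    # Toy model of a server that deduplicates on idempotency key.
    def __init__(self):
        self.seen = set()
        self.rows = []

    def post(self, rows, idempotency_key):
        if idempotency_key in self.seen:
            return "duplicate"  # retried run: no double-create
        self.seen.add(idempotency_key)
        self.rows.extend(rows)
        return "created"
```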

params are always strings

Run params come in over the wire as JSON, but the SDK normalises every value to a string. Cast explicitly:

size = int(dm.params.get("size", "100"))
debug = dm.params.get("debug", "false").lower() == "true"

This avoids surprises when CI passes "100" and your code does count=size.
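One way to keep the casts in one place is a small helper at the top of the script. The names and defaults here are illustrative, not SDK API:

```python
def parse_params(raw):
    # All values arrive as strings; cast once, up front.
    return {
        "size": int(raw.get("size", "100")),
        "debug": raw.get("debug", "false").lower() == "true",
    }

cfg = parse_params({"size": "250", "debug": "TRUE"})
```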

dm.log redacts Authorization

dm.log.info("got %s", response.headers) is safe — Authorization, X-API-Key, Set-Cookie are redacted. print(response.headers) is not. Default to dm.log for anything that might capture a header.
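If you must print headers, redact first. A sketch of what dm.log does conceptually (the real redaction list may differ):

```python
SENSITIVE = {"authorization", "x-api-key", "set-cookie"}

def redact(headers):
    # Replace sensitive values before anything reaches an unredacted sink.
    return {k: "<redacted>" if k.lower() in SENSITIVE else v
            for k, v in headers.items()}

safe = redact({"Authorization": "Bearer abc123", "Accept": "application/json"})
```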

requirements: is parsed only at the top

# requirements: arrow~=1.3
import arrow

The # requirements: comment is parsed once at script load. Don’t put it in the middle of the file or inside a function.

Workers cold start, then stay warm

The first run after deploy / config change takes ~600ms longer than subsequent runs. For latency-sensitive flows (CI smoke tests in <2s), warm the worker:

curl -X POST .../scenarios/$ID/run -d '{"params": {"warm": "true"}, "wait": "true"}'

Have the scenario branch on dm.params.get("warm") and return early.
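The early return can be as small as this (dm is stubbed for illustration; the real object comes from the runtime):

```python
class _FakeDM:
    # Stand-in for the runtime-provided dm object.
    params = {"warm": "true"}

def run(dm):
    # Warm requests pay the cold start and do nothing else.
    if dm.params.get("warm") == "true":
        return "warmed"
    return "ran"

result = run(_FakeDM())
```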