Python script tips
A short list of things that surprise people the first time they hit them.
Per-field Python is sandboxed and capped at 2 seconds
Per-field Python generators run inside a sandbox and have a 2-second per-row timeout. They are not full scenarios.
- No pip install. Standard library only.
- No subprocess, no filesystem (except dm.workspace_file).
- No long-running loops. If you need state across rows, use dm.counter().
If you’re reaching for any of those, lift the work into a scenario.
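As a sketch of the dm.counter() escape hatch — the FakeDM stub below and the counter semantics (per-name, monotonically increasing from 1) are assumptions for illustration, not the SDK's actual implementation:

```python
import itertools
import random

class FakeDM:
    """Stand-in for the real dm object, for illustration only."""
    def __init__(self):
        self._counters = {}

    def counter(self, name):
        # assumed semantics: monotonically increasing int per counter name
        c = self._counters.setdefault(name, itertools.count(1))
        return next(c)

def value(rng, row, dm):
    # State across rows lives in dm.counter(), not in a module-level global.
    return f"INV-{dm.counter('invoice_no'):06d}"

rng = random.Random(7)
dm = FakeDM()
print([value(rng, {}, dm) for _ in range(3)])
# ['INV-000001', 'INV-000002', 'INV-000003']
```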
row only contains previously generated fields
In a per-field Python generator, row[name] only sees fields that come before the
current one in the template. Drag your Python field below any siblings it reads.
For a value that depends on later fields (or on the full row), use a derived field. Derived runs in a second pass and sees everything.
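For instance — the exact derived-field signature is an assumption here; the point is that by the second pass the row is complete, regardless of field order:

```python
# Hypothetical derived-field body: every field is populated by the time
# it runs, even ones declared after it in the template.
def full_name(row):
    return f"{row['first_name']} {row['last_name']}"

row = {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com"}
print(full_name(row))  # Ada Lovelace
```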
Use the rng argument, not random.random()
Per-field generators get a seeded rng — use it. random.random() (module-level)
isn’t seeded by DataMaker, so:
- Your output won’t be reproducible when the template has a seed set.
- Two preview runs with the same seed will diverge.
```python
def value(rng, row, dm):
    return rng.choice(["a", "b", "c"])  # ✓ seeded
    # NOT: return random.choice([...])  # ✗ unseeded
```

Stream large generations
dm.generate(template_id=..., count=1_000_000) returns a list of one million dicts.
That’s a few hundred MB in RAM. Use stream instead:
```python
for row in dm.template("Customer").stream(count=1_000_000):
    pg.insert_one("customers", row)  # batched internally
```

For database connections, the SDK accepts a generator directly:

```python
pg.insert(table="customers",
          rows=dm.template("Customer").stream(count=1_000_000))
```

dm.connection() is cached per scenario
Resolving a connection by name does an API call. Cache the result if you call it more than once:
```python
# bad — one extra API call per loop iteration
for chunk in batches:
    dm.connection("pg").insert(table="t", rows=chunk)

# good
pg = dm.connection("pg")
for chunk in batches:
    pg.insert(table="t", rows=chunk)
```

(In practice, the SDK caches resolutions for the lifetime of the DataMaker instance, so this is a hot-path optimisation, not a correctness one.)
print() is captured but lossy
print() works inside a scenario and lands in the live log. But:
- It uses Python’s default flush behaviour. Wrap in print(..., flush=True) if you want the log to update line-by-line during a long step.
- The structured-log fields aren’t populated (no actor / source metadata). Prefer dm.log.info(...) for anything you’d grep later.
Idempotency is your problem
Scenarios don’t have built-in checkpointing. A failed run that’s retried starts from the top. If the script POSTs to a non-idempotent endpoint, you can double-create. Two patterns:
```python
# 1. Pre-check
if pg.execute("SELECT 1 FROM customers WHERE id = %s", [cid]):
    return

# 2. Idempotency keys
sap.post(
    entity="A_BusinessPartner",
    rows=batch,
    idempotency_key=f"{dm.run_id}:{batch_index}",
)
```

params are always strings
Run params come in over the wire as JSON, but the SDK normalises every value to a string. Cast explicitly:
```python
size = int(dm.params.get("size", "100"))
debug = dm.params.get("debug", "false").lower() == "true"
```

This avoids surprises when CI passes "100" and your code does count=size.
dm.log redacts Authorization
dm.log.info("got %s", response.headers) is safe — Authorization, X-API-Key, Set-Cookie
are redacted. print(response.headers) is not. Default to dm.log for anything that
might capture a header.
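Roughly what that redaction looks like — this is an illustrative re-implementation, not the SDK's actual code; the header list is taken from the tip above:

```python
SENSITIVE = {"authorization", "x-api-key", "set-cookie"}

def redact(headers):
    """Illustrative only: approximates what dm.log does before writing."""
    return {k: "<redacted>" if k.lower() in SENSITIVE else v
            for k, v in headers.items()}

print(redact({"Authorization": "Bearer abc123", "Accept": "application/json"}))
# {'Authorization': '<redacted>', 'Accept': 'application/json'}
```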
requirements: is parsed only at the top
```python
# requirements: arrow~=1.3
import arrow
```

The # requirements: comment is parsed once at script load. Don’t put it in the middle of the file or inside a function.
Workers cold start, then stay warm
The first run after deploy / config change takes ~600ms longer than subsequent runs. For latency-sensitive flows (CI smoke tests in <2s), warm the worker:
```shell
curl -X POST .../scenarios/$ID/run \
  -d '{"params": {"warm": "true"}, "wait": "true"}'
```

Have the scenario branch on dm.params.get("warm") and return early.
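Scenario-side, that early-return branch might look like this — the run() shape and the FakeDM stub are assumptions for illustration:

```python
class FakeDM:
    """Stand-in for the real dm object, for illustration only."""
    def __init__(self, params):
        self.params = params

def run(dm):
    # params arrive as strings (see above), so compare against "true"
    if dm.params.get("warm") == "true":
        return "warmed"  # warm-up call: exit before any real work
    # ... real scenario body ...
    return "ran"

print(run(FakeDM({"warm": "true"})))  # warmed
print(run(FakeDM({})))                # ran
```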