
Executions

An Execution is what happens when you press Run: a recorded, auditable trip through stages, with logs, gates, and artifacts.

The mental model. An App is the recipe. An Execution is the time you actually cooked it. The recipe doesn't change between Tuesday's dinner and Wednesday's; the executions are different: different ingredients on the form, different sandbox, different output, different timestamps. The platform keeps every production execution indefinitely (test runs are purged after 30 days unless pinned) so you can compare, debug, and replay.

The execution lifecycle

Every run is in exactly one of five states at any moment, and it ends in one of three terminal states: completed, failed, or cancelled. The state machine:

pending  →  running  →  completed
            running  →  failed
            running  →  cancelled

| Status | Meaning | What you can do |
|---|---|---|
| pending | The run was queued but hasn't actually started. Usually <1 second. | Wait. Cancel if needed. |
| running | At least one stage is in progress. Stage logs are streaming. | Watch live. Cancel. Respond to human gates. |
| completed | All stages finished successfully. All artifacts saved. | Download artifacts. Share. Re-run with new inputs. Resume from a later stage to try changes. |
| failed | A stage hit an error or timeout. Subsequent stages did not run. | Read the error. Debug in chat. Retry. Or re-run after fixing the App. |
| cancelled | You (or someone) explicitly stopped the run. | Re-run from scratch, or resume from a later stage if some completed. |
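
The lifecycle can be read as a small transition map. The sketch below is one way to encode it, assuming (from the "Cancel if needed" note on pending) that a queued run can go straight to cancelled. The names TRANSITIONS, isTerminal, and canTransition are illustrative and not part of the platform API.

```typescript
// Sketch of the execution lifecycle as a transition table.
// The five states come from the status table; helper names are hypothetical.
type ExecutionStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

const TRANSITIONS: Record<ExecutionStatus, readonly ExecutionStatus[]> = {
  pending: ["running", "cancelled"], // assumption: a queued run can be cancelled
  running: ["completed", "failed", "cancelled"],
  completed: [], // terminal
  failed: [], // terminal
  cancelled: [], // terminal
};

// A state is terminal when nothing can follow it.
function isTerminal(status: ExecutionStatus): boolean {
  return TRANSITIONS[status].length === 0;
}

// True when the lifecycle allows moving directly from one state to another.
function canTransition(from: ExecutionStatus, to: ExecutionStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

A client can use isTerminal to decide when to stop listening or polling, rather than hard-coding the three terminal names in several places.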

Anatomy of an execution record

The platform persists every execution in app_execution with these fields:

{
  "id": "exec_8f4c2e1b",
  "app_id": "app_...",
  "user_id": "usr_...",
  "status": "completed",
  "input": { "company_name": "Stripe", ... },
  "is_test": false,
  "shared_session": false,
  "start_from_stage_index": 0,
  "prior_execution_id": null,
  "duration_ms": 43210,
  "error": null,
  "created_at": "2026-05-31T09:14:22Z"
}

Each stage gets its own app_stage_log entry:

{
  "execution_id": "exec_8f4c2e1b",
  "stage_index": 0,
  "stage_type": "agent",
  "status": "completed",
  "goal_expanded": "Research Stripe and write a brief...",
  "session_id": "sess_...",     // linked chat session if you "Debug in chat"
  "duration_ms": 38421,
  "error": null
}

The goal_expanded field is the goal after input substitution: what the agent actually saw. If your form had company_name: "Stripe", this is where {{company_name}} in the original goal got resolved.
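
To make the substitution concrete, here is a minimal sketch of how a {{placeholder}} goal could be expanded from form input. The function expandGoal is hypothetical; the platform's real substitution rules (whitespace, escaping, missing values) may differ.

```typescript
// Hypothetical helper: replaces {{name}} placeholders with form values.
// Placeholders with no matching input are left untouched so that a missing
// value stays visible in the expanded goal.
function expandGoal(goal: string, input: Record<string, string>): string {
  return goal.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(input, key) ? input[key] : undefined;
    return value !== undefined ? value : match;
  });
}

console.log(expandGoal("Research {{company_name}} and write a brief...", { company_name: "Stripe" }));
// prints "Research Stripe and write a brief..."
```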

Watching a run live: the SSE stream

During running state, the App page subscribes to a Server-Sent Events stream keyed on the execution ID. The same stream is the public API surface for anyone wanting to mirror runs into another system. The endpoint:

GET /api/app-executions/:execId/stream
Authorization: Bearer <token>

// Response headers
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

The five execution-level events

Each event is framed as event: <name>\ndata: <JSON>\n\n. On connect, the server first replays the current state so a late subscriber always sees a coherent sequence: current execution status, every stage log so far, any pending human gate, and every saved artifact. Only then does it switch to live broadcast.

| Event | Payload shape | When it fires |
|---|---|---|
| status | { status }, one of 'pending', 'running', 'completed', 'failed', 'cancelled' | On subscribe (replay) and on every execution-level state change. |
| stage_log | A full AppStageLog row; see Anatomy above. | On subscribe (one per existing stage log) and on every stage transition. |
| human_gate | A full AppHumanGate row. | When a human stage opens a gate and again when the gate is resolved. |
| artifact | A full AppArtifact row including content if present. | Each time a stage's outputs are persisted. |
| done | { status, duration_ms } | Once, at the end. The server then closes the stream. |

If you subscribe to an execution that is already in a terminal state, the server replays the stored stage logs and artifacts, sends a done event, and disconnects; no live socket is created. This makes replay and live monitoring use the same endpoint and the same parser.
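
If you consume the stream outside a browser, you need to split the raw text into frames yourself. The sketch below parses the event/data framing described in this section. It is deliberately not a complete SSE parser (no id or retry fields, no multi-line data), and parseFrames and StreamEvent are illustrative names.

```typescript
// Minimal parser for "event: <name>\ndata: <JSON>\n\n" frames.
// Assumes each frame carries a single data line holding valid JSON.
interface StreamEvent {
  event: string;
  data: unknown;
}

function parseFrames(chunk: string): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const frame of chunk.split("\n\n")) {
    let eventName = "message"; // SSE default when no event field is present
    let payload: string | null = null;
    for (const line of frame.split("\n")) {
      if (line.indexOf("event:") === 0) eventName = line.slice(6).trim();
      else if (line.indexOf("data:") === 0) payload = line.slice(5).trim();
    }
    // Frames without a data line (including the trailing empty split) are skipped.
    if (payload !== null) events.push({ event: eventName, data: JSON.parse(payload) });
  }
  return events;
}
```

Because a terminal execution replays its history and then sends done, the same parser handles both live runs and after-the-fact replay.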

The fine-grained runner events

The five execution events above are the summary stream. Inside an agent stage, the underlying run-agent binary emits a much finer JSONL trace: token-by-token text, tool calls, sub-tasks, thinking. Those events live on the session-level stream, not the execution-level one, but they're what powers the right-hand reasoning trace pane. See the runner protocol for the full event catalogue.

Minimal SSE consumer

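// Browser EventSource cannot set an Authorization header, so this relies on
// cookie credentials (withCredentials) rather than a Bearer token.
const es = new EventSource('/api/app-executions/' + execId + '/stream', {
  withCredentials: true,
});

// One listener per execution-level event; every payload is JSON.
es.addEventListener('status', (e) => console.log('status', JSON.parse(e.data)));
es.addEventListener('stage_log', (e) => console.log('stage', JSON.parse(e.data)));
es.addEventListener('artifact', (e) => console.log('artifact', JSON.parse(e.data)));
es.addEventListener('human_gate', (e) => console.log('gate', JSON.parse(e.data)));
// The server closes the stream after `done`; close our side too so the
// browser does not try to reconnect.
es.addEventListener('done', (e) => { console.log('done', JSON.parse(e.data)); es.close(); });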
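
Logging each event is enough for a demo; a mirror into another system usually wants one snapshot per run. The sketch below folds the five events into a single view object. The RunView shape is an assumption made for illustration, and only the row fields shown elsewhere on this page (stage_index, status, id) are used.

```typescript
// Folds execution-level events into one snapshot of the run.
// RunView and RunEvent are hypothetical shapes, not platform types.
interface RunView {
  status: string;
  stages: Record<number, string>; // stage_index -> latest stage status
  gates: Record<string, string>; // gate id -> latest gate status
  artifactCount: number;
  done: boolean;
}

type RunEvent =
  | { event: "status"; data: { status: string } }
  | { event: "stage_log"; data: { stage_index: number; status: string } }
  | { event: "human_gate"; data: { id: string; status: string } }
  | { event: "artifact"; data: object }
  | { event: "done"; data: { status: string; duration_ms: number } };

function emptyView(): RunView {
  return { status: "pending", stages: {}, gates: {}, artifactCount: 0, done: false };
}

// Pure reducer: returns a new view and never mutates the previous one.
function applyEvent(view: RunView, ev: RunEvent): RunView {
  switch (ev.event) {
    case "status":
      return { ...view, status: ev.data.status };
    case "stage_log":
      return { ...view, stages: { ...view.stages, [ev.data.stage_index]: ev.data.status } };
    case "human_gate":
      return { ...view, gates: { ...view.gates, [ev.data.id]: ev.data.status } };
    case "artifact":
      return { ...view, artifactCount: view.artifactCount + 1 };
    case "done":
      return { ...view, status: ev.data.status, done: true };
  }
}
```

Since the server replays current state on connect, starting from emptyView() on every new connection keeps the snapshot consistent without any extra bookkeeping.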

Test runs vs real runs

Set is_test: true on a run and it's flagged as ephemeral. Test runs:

  • Don't count against your usage quotas in the same way as production runs.
  • Are filtered out of the App's "Runs" tab by default (toggle "Show test runs" to see them).
  • Use a separate shared_session namespace so a teammate's test runs don't pollute yours.
  • Are auto-purged after 30 days unless pinned.

The App editor uses test runs when you click Try it, which is exactly what you want while iterating on prompts.
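
If you mirror runs into your own tooling, you may want to apply the same defaults. The sketch below shows the "hide test runs unless toggled" filter and the 30-day purge rule. The is_test and created_at fields come from the execution record; pinned is a hypothetical field name, since this page does not say how pinning is stored.

```typescript
// Sketch of the default Runs-tab filter and the test-run purge rule.
// `pinned` is an assumed field name.
interface RunRow {
  id: string;
  is_test: boolean;
  created_at: string; // ISO 8601 timestamp, as on the execution record
  pinned?: boolean;
}

const PURGE_AFTER_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Test runs are hidden unless the "Show test runs" toggle is on.
function visibleRuns(runs: RunRow[], showTestRuns: boolean = false): RunRow[] {
  return runs.filter((r) => showTestRuns || !r.is_test);
}

// Only unpinned test runs older than 30 days are eligible for purging.
function isPurgeable(run: RunRow, now: Date): boolean {
  if (!run.is_test || run.pinned) return false;
  return now.getTime() - Date.parse(run.created_at) > PURGE_AFTER_MS;
}
```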

Human gates: pausing for approval

A human stage doesn't run code. It pauses the execution, displays the goal text and any prior artifacts, and waits for you (or another permitted user) to act. The pause is stored in app_human_gate:

{
  "id": "gate_...",
  "execution_id": "exec_...",
  "stage_index": 2,
  "message": "Review these 47 enriched leads. Remove rows you don't want to email.",
  "status": "pending", // pending | approved | rejected
  "response": null,
  "responded_at": null
}

What a gate looks like in the UI

┌─────────────────────────────── Run #exec_8f4c2 ──────────────────┐
│ Stage 3 of 4 · Review  paused                                    │
│                                                                  │
│ Review these 47 enriched leads. Remove rows you don't want       │
│ to email.                                                        │
│                                                                  │
│ 📎 enriched_leads.csv (47 rows) [ View ]                         │
│                                                                  │
│ [ Approve & continue ]  [ Reject ]  [ Edit input first ]         │
└──────────────────────────────────────────────────────────────────┘

The three buttons

  • Approve & continue: gate becomes approved, the next stage starts immediately.
  • Reject: gate becomes rejected, the run transitions to failed.
  • Edit input first: opens the artifact for editing (e.g. remove rows from the CSV). When you save, the edited version is what stage N+1 reads.

Finding gates that need your attention

Open Pending gates from the top nav, or hit GET /api/app-executions/pending-gates. Gates can be set up to ping you via email or Slack when they go pending: useful for long-running pipelines where you might not be watching.

The gate API

Approve or reject a gate by POSTing to its resolution endpoint. The body carries the verdict and an optional free-text response stored on the gate row:

POST /api/app-executions/:execId/gates/:gateId
Content-Type: application/json
Authorization: Bearer <token>

{
  "action": "approved", // or "rejected"
  "response": "Removed 4 lookalikes; list looks tighter now."
}
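
For scripted approvals, it can help to build the request in one place. The sketch below assembles the URL, headers, and body for the endpoint above without sending anything; buildGateRequest and GateRequest are illustrative names, and the base URL is passed in by the caller.

```typescript
// Builds (but does not send) the gate-resolution request described above.
interface GateRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildGateRequest(
  baseUrl: string,
  execId: string,
  gateId: string,
  action: "approved" | "rejected",
  token: string,
  response?: string
): GateRequest {
  return {
    url: `${baseUrl}/api/app-executions/${encodeURIComponent(execId)}/gates/${encodeURIComponent(gateId)}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${token}` },
      // JSON.stringify drops `response` when it is undefined, keeping it optional.
      body: JSON.stringify({ action, response }),
    },
  };
}
```

The result can be passed straight to fetch(req.url, req.init) in any runtime that provides fetch.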

Mechanically: the server sets app_human_gate.status and stamps responded_at, then resolves an in-memory Promise that the executor has been awaiting. The waiting stage immediately leaves the waiting state, and the pipeline either proceeds to stage N+1 (on approve) or short-circuits to failed (on reject). A human_gate SSE event with the updated row is fanned out to every subscriber.

The execution stays running the whole time a gate is pending. Only the stage log goes to waiting. The sandbox is held alive (via keepAlive) so resuming on approve doesn't pay a cold-start cost. The trade-off is that an indefinitely-pending gate keeps a sandbox warm: set a reasonable expectation with your reviewers, or build the gate around an auto-timeout in the App's design.
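
The "resolve an in-memory Promise" mechanic is easier to picture in code. The sketch below is an illustration of the pattern only, not the platform's executor: GateRegistry and runHumanStage are hypothetical names, and persistence, SSE fan-out, and sandbox keep-alive are left out.

```typescript
// Pattern sketch: a human stage awaits a Promise that an HTTP handler
// resolves later. All names here are hypothetical.
type Verdict = "approved" | "rejected";

class GateRegistry {
  private waiters = new Map<string, (verdict: Verdict) => void>();

  // Called by the executor when a human stage opens a gate.
  wait(gateId: string): Promise<Verdict> {
    return new Promise<Verdict>((resolve) => {
      this.waiters.set(gateId, resolve);
    });
  }

  // Called by the request handler after the gate row has been updated.
  // Returns false for an unknown or already-resolved gate.
  resolve(gateId: string, verdict: Verdict): boolean {
    const waiter = this.waiters.get(gateId);
    if (!waiter) return false;
    this.waiters.delete(gateId);
    waiter(verdict);
    return true;
  }

  pendingCount(): number {
    return this.waiters.size;
  }
}

// The stage stays parked on `await` until someone resolves the gate.
async function runHumanStage(registry: GateRegistry, gateId: string): Promise<"next_stage" | "failed"> {
  const verdict = await registry.wait(gateId);
  return verdict === "approved" ? "next_stage" : "failed";
}
```

An in-memory waiter like this is lost if the process restarts, which is one reason the gate row itself is persisted in app_human_gate.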

Cancelling a run

Click Cancel on a running execution, or send:

POST /api/app-executions/:execId/cancel
Authorization: Bearer <your_token>

Cancellation:

  • Stops the currently-running stage. Already-completed stages stay completed; their artifacts are preserved.
  • Closes the sandbox.
  • Transitions the execution to cancelled.
  • Refunds any unspent budget back to your balance.

You can later Resume from the next stage; see below.

Resuming from a specific stage

A real superpower: re-run only the stages that need re-running, reusing artifacts from the ones that worked. Pass start_from_stage_index and prior_execution_id on the run request:

POST /api/apps/:appId/run
Content-Type: application/json
Authorization: Bearer <token>

{
  "input": { ... same shape as the original run ... },
  "start_from_stage_index": 2,
  "prior_execution_id": "exec_8f4c2e1b"
}

What the executor actually does

  1. Creates a fresh app_execution row pointing at the same App version. Stage logs for indexes 0 .. start_from_stage_index - 1 are written as status: 'skipped' immediately; the run history makes it clear which stages weren't re-executed.
  2. Loads the skipped stages' artifacts via dao.listArtifactsByStages(priorExecutionId, [stageIds]). For each artifact that doesn't carry inline content, the executor downloads the body from S3 by its s3_key and decodes UTF-8, so the next stage's system prompt sees the prior content even if the original row was offloaded.
  3. Acquires a brand-new sandbox. Files Stage 0-1 wrote on the original execution's sandbox are gone; only the persisted artifacts cross the boundary. If a later stage depended on intermediate scratch files rather than declared artifacts, resume won't reproduce them.
  4. Builds stage start_from_stage_index's system prompt with the loaded prior artifacts injected as ## Previous Stage Outputs, then proceeds from there through the end of the pipeline.
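
The steps above can be sketched as two small functions: one that lists the skipped stage indexes, and one that injects prior artifacts under the ## Previous Stage Outputs heading. This is an approximation for illustration. The Artifact shape and the per-artifact "### Stage N: name" sub-heading are assumptions; only the ## Previous Stage Outputs heading comes from the description above.

```typescript
// Illustrative sketch of resume planning. Not the executor's actual code.
interface Artifact {
  stage_index: number;
  name: string;
  content: string; // already loaded inline (or downloaded from storage)
}

// Stages 0 .. startFrom - 1 are logged as skipped rather than re-executed.
function skippedStageIndexes(startFrom: number): number[] {
  const indexes: number[] = [];
  for (let i = 0; i < startFrom; i++) indexes.push(i);
  return indexes;
}

// Prepends nothing and appends prior outputs; artifacts from the resumed
// stage onward are ignored because those stages will run again.
function buildSystemPrompt(goalExpanded: string, prior: Artifact[], startFrom: number): string {
  const carried = prior
    .filter((a) => a.stage_index < startFrom)
    .sort((a, b) => a.stage_index - b.stage_index);
  if (carried.length === 0) return goalExpanded;
  const sections = carried.map((a) => `### Stage ${a.stage_index}: ${a.name}\n${a.content}`);
  return `${goalExpanded}\n\n## Previous Stage Outputs\n\n${sections.join("\n\n")}`;
}
```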

Resume + shared_session

When the original run used shared_session: true and the resume also asks for it, the executor goes one step further: it locates a stage log on prior_execution_id that already has a session_id, reuses that session, and continues the same Claude conversation. Visible result: Debug in chat shows one continuous thread spanning the original and the resumed run. Don't enable this blindly; a shared Claude session means the model's context still contains earlier turns, which is sometimes the point and sometimes not.

When you'd use this

  • Last stage's draft was off. Stages 0-2 took 4 minutes and worked fine; stage 3's email draft missed the tone. Tweak the App, then re-run from stage 3; saves 4 minutes.
  • Stage failed for an external reason (Connect token expired mid-run). Fix the Connect, re-run from the failed stage with the same inputs.
  • Human rejected a gate. Adjust the prior stage's prompt, re-run from there.

Resume doesn't roll back partial side-effects. If the resumed stage wrote rows to your CRM, sent a Slack message, or pushed a git commit, that work has already shipped. The platform doesn't track outbound side-effects; resume only reuses persisted artifacts. For idempotency, make Connect-touching stages check "have we already done this?" first.
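
One common way to implement the "have we already done this?" check is an idempotency key per side-effect. The sketch below uses an in-memory ledger as a stand-in for a durable store (a database table, or a lookup in the target system). SideEffectLedger and effectKey are hypothetical names. Note that the key deliberately excludes the execution ID, because a resumed run gets a new one.

```typescript
// In-memory stand-in for a durable record of completed side-effects.
class SideEffectLedger {
  private done: Record<string, true> = {};

  // Runs `effect` only if `key` has not been recorded yet.
  // Returns true when the effect ran, false when it was skipped.
  runOnce(key: string, effect: () => void): boolean {
    if (Object.prototype.hasOwnProperty.call(this.done, key)) return false;
    effect();
    this.done[key] = true; // recorded only after the effect succeeds
    return true;
  }
}

// Key built from things that stay stable across a resume: the App, the
// stage, and the business target (for example, the recipient).
function effectKey(appId: string, stageIndex: number, target: string): string {
  return [appId, String(stageIndex), target].join(":");
}
```

If the effect throws, the key is not recorded, so a later retry or resume will attempt it again.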

In the UI: open the run, click Re-run from stage, pick the stage.

Run history

On the App page, the Runs tab lists every execution, most recent first. Each row shows:

| Column | What it shows |
|---|---|
| Status | One of the 5 pills (pending / running / completed / failed / cancelled). |
| Triggered by | You, a teammate, a Schedule, or the API. |
| Started | Timestamp in your workspace's time zone. |
| Duration | Total wall-clock time across all stages. |
| Cost | Sandbox seconds + LLM tokens converted to USD. |
| Inputs | The form values used. Click to expand. |
| Artifacts | Count, with quick-download for each. |

Filters at the top let you narrow by status, date range, who triggered, and tag. The list is also available at GET /api/app-executions?app_id=....

Debugging failed runs

Step 1: Read the error

Open the failed run. The first thing on the page is the error message and the stage where it happened. Common shapes:

| Error | Almost always means |
|---|---|
| stage_timeout | The stage took longer than timeout_ms. Increase it. |
| connect_unauthorized | A Connect was revoked or expired. Re-authorize. |
| required_input_missing | A required form field was blank. Supply it, or make the field optional if it isn't truly required. |
| artifact_format_mismatch | The stage produced output in the wrong format (e.g. markdown when CSV was declared). |
| sandbox_oom | Out of memory inside the sandbox, usually from heavy data processing. Split the work into stages or use a script stage. |
| agent_gave_up | The agent decided the task couldn't be done. Read its reasoning; usually a missing input. |

Step 2: Debug in chat

Click Debug in chat. A chat opens with the failed run's full context: inputs, partial artifacts, agent reasoning trace, error. Talk to the agent:

> Why did this fail? What should I change in the App?

The agent diagnoses, suggests a fix, and (often) directly edits the App to apply it. You can then re-run from the same point.

Step 3: Retry or re-run

  • Retry: same inputs, same App version, fresh sandbox. Sometimes flaky external APIs just need a second try.
  • Re-run from stage N: keeps prior artifacts. Use when only the late stages are broken.
  • Edit App, then run: for systematic bugs (vague prompt, wrong format). Bumps the App to a new version.

Sharing an execution

You can hand a teammate a read-only view of an execution: useful for "look what this App produced yesterday" without giving them write access to the App itself.

  1. Open the run.
  2. Click Share.
  3. Configure: artifacts to include, expiry, download allowed, password gate.
  4. Get a public URL: https://app.aitroop.net/s/<token>.

Shares are persisted in the share table with a token, an unlock policy, and a download handler. Revoke any time from Settings → Shares.

Triggering a run from the API

Useful when you want to run an App from a webhook, a CI pipeline, or another agent. The minimal call:

curl -X POST https://app.aitroop.net/api/apps/<appId>/run \
  -H "Authorization: Bearer $AT_USER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "company_name": "Stripe" } }'

The response is the new app_execution with status: "pending". To watch it:

curl https://app.aitroop.net/api/app-executions/<execId>/stream \
  -H "Authorization: Bearer $AT_USER_TOKEN" \
  -N

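The -N flag turns off curl's output buffering so events print as they arrive. The stream stays open until the run reaches a terminal state. Alternatively, poll GET /api/app-executions/:execId.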
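
If you choose polling, a bounded loop avoids waiting forever on a stuck run. The sketch below takes the status lookup as a function so it stays independent of any HTTP client. pollUntilTerminal is a hypothetical helper, and the interval and attempt limits are arbitrary defaults rather than platform recommendations.

```typescript
// Polls a status source until the execution reaches a terminal state.
type ExecStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

async function pollUntilTerminal(
  getStatus: () => Promise<ExecStatus>,
  intervalMs: number = 1000,
  maxAttempts: number = 60
): Promise<ExecStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const current = await getStatus();
    if (current === "completed" || current === "failed" || current === "cancelled") {
      return current;
    }
    // Wait before the next poll.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("execution did not reach a terminal state in time");
}

// getStatus would typically wrap GET /api/app-executions/:execId and return
// the `status` field of the execution record.
```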

FAQ & troubleshooting

My run is stuck in pending for minutes. What's wrong?

Possible causes:

  • Sandbox provider cold start (rare; usually <1 s).
  • You hit your plan's concurrent-run quota; the run is queued behind others.
  • The agent worker queue is backed up (status banner at Settings → Workspace).

Fix: check the status banner first. If everything's green and the run is still pending, it is queued behind other runs; wait, or cancel and re-run once it has been more than 2 minutes.

Why does my completed run show "0 artifacts"?

The App's stages didn't declare any artifact_defs, or the agent produced output that didn't conform to the declared format and the platform discarded it. Open the run, scroll to the affected stage, and look for a "format mismatch" warning. Fix the goal to be explicit about what to produce, or change the declared format to file as an escape hatch.

I cancelled a run. Can I "un-cancel" it?

No, but you can re-run from the stage after the last completed one. The prior execution's artifacts are preserved, so you don't lose work. Open the cancelled run, click Re-run from stage, and pick the stage after the last one that completed.

A human gate has been pending for 3 days. Did the run die?

No. Gates wait indefinitely by default. The execution is parked, not failed. To clean up, either respond to the gate or cancel the run.

If you want gates to auto-expire, add a timeout in the stage definition: "gate_timeout_ms": 86400000 for 24 hours. After the timeout, the gate transitions to rejected and the run transitions to failed.
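
The timeout rule amounts to a small status calculation. The sketch below shows it as a pure function. gate_timeout_ms comes from the stage definition above; opened_at_ms is an assumed field, since this page does not show when a gate records its open time.

```typescript
// Computes the status a gate should have at `nowMs`, given an optional timeout.
// `opened_at_ms` is a hypothetical field (epoch milliseconds).
interface GateRow {
  status: "pending" | "approved" | "rejected";
  opened_at_ms: number;
}

function effectiveGateStatus(
  gate: GateRow,
  nowMs: number,
  gateTimeoutMs?: number
): "pending" | "approved" | "rejected" {
  // Resolved gates never change; gates without a timeout wait indefinitely.
  if (gate.status !== "pending" || gateTimeoutMs === undefined) return gate.status;
  return nowMs - gate.opened_at_ms >= gateTimeoutMs ? "rejected" : "pending";
}
```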

Can I see what the agent was thinking?

Yes. Click any stage in the run timeline to expand its reasoning trace. The trace shows the agent's plan, every tool call with arguments and results, and its final commentary. For deeper detail, toggle "Show thinking"; this exposes the model's internal reasoning blocks (when available).

Why is the same App taking 2× longer this week than last week?

Likely cause: the agent is making more tool calls. Either the prompt changed (check Versions), the data being processed grew, or an external API got slower. Open both runs side by side: the run log shows the count and duration of every tool call. Compare to find the regression.
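
If you export the tool-call data from both runs, the comparison can be automated. The sketch below assumes each call is reduced to a tool name and a duration; that ToolCall shape and the compareRuns helper are illustrative, not an export format the platform defines.

```typescript
// Compares per-tool call counts and total durations between two runs.
interface ToolCall {
  tool: string;
  duration_ms: number;
}

interface ToolDelta {
  tool: string;
  countDelta: number; // after minus before
  durationDeltaMs: number; // after minus before
}

type Totals = Record<string, { count: number; totalMs: number }>;

function summarize(calls: ToolCall[]): Totals {
  const totals: Totals = {};
  for (const call of calls) {
    const entry = totals[call.tool] ?? { count: 0, totalMs: 0 };
    totals[call.tool] = { count: entry.count + 1, totalMs: entry.totalMs + call.duration_ms };
  }
  return totals;
}

// Returns one row per tool, largest slowdown first.
function compareRuns(before: ToolCall[], after: ToolCall[]): ToolDelta[] {
  const a = summarize(before);
  const b = summarize(after);
  return Object.keys({ ...a, ...b })
    .map((tool) => {
      const x = a[tool] ?? { count: 0, totalMs: 0 };
      const y = b[tool] ?? { count: 0, totalMs: 0 };
      return { tool, countDelta: y.count - x.count, durationDeltaMs: y.totalMs - x.totalMs };
    })
    .sort((p, q) => q.durationDeltaMs - p.durationDeltaMs);
}
```

A large countDelta with a similar per-call duration points at the prompt or the data; a similar count with a large durationDeltaMs points at a slower external API.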

I want to A/B test two versions of the same App. How?

Fork the App (creates a private copy at v1), edit the copy, then run both Apps with the same inputs. Compare the runs side by side from each App's Runs tab. For statistical depth, write a Schedule that runs both versions every day on the same inputs and pipes results to a sheet.

Can I trigger a run from Zapier / n8n / a curl command in CI?

Yes. POST to /api/apps/:appId/run with your bearer token; see the curl example under "Triggering a run from the API". Most users put the call in a webhook handler and poll GET /api/app-executions/:execId to detect completion. Outbound webhooks on completion are also supported via Schedule delivery destinations.