Executions

An Execution is what happens when you press Run: a recorded, auditable trip through stages, with logs, gates, and artifacts.

The mental model. An App is the recipe. An Execution is the time you actually cooked it. The recipe doesn't change between Tuesday's dinner and Wednesday's, but the executions do: different ingredients on the form, a different sandbox, different output, different timestamps. The platform keeps execution records indefinitely so you can compare, debug, and replay; the one exception is test runs, which are auto-purged after 30 days unless pinned.

The execution lifecycle

Every run is in exactly one of five states at any time. Three of them (completed, failed, cancelled) are terminal. The state machine:

pending  →  running  →  completed

                     running  →  failed

                     running  →  cancelled

Status	Meaning	What you can do
`pending`	The run was queued but hasn't actually started. Usually <1 second.	Wait. Cancel if needed.
`running`	At least one stage is in progress. Stage logs are streaming.	Watch live. Cancel. Respond to human gates.
`completed`	All stages finished successfully. All artifacts saved.	Download artifacts. Share. Re-run with new inputs. Resume from a later stage to try changes.
`failed`	A stage hit an error or timeout. Subsequent stages did not run.	Read the error. Debug in chat. Retry. Or re-run after fixing the App.
`cancelled`	You (or someone) explicitly stopped the run.	Re-run from scratch, or resume from a later stage if some completed.
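
The lifecycle above can be written down as a small transition table, which is handy when mirroring runs into another system. This is a minimal sketch, not the platform's implementation: the names `TRANSITIONS`, `isTerminal`, and `canTransition` are hypothetical, and the `pending` to `cancelled` edge is an assumption based on the table's "Cancel if needed" note rather than on the diagram.

```javascript
// Allowed transitions, read off the lifecycle diagram and status table.
// Assumption: pending -> cancelled is legal because pending runs can be cancelled.
const TRANSITIONS = Object.freeze({
  pending: ['running', 'cancelled'],
  running: ['completed', 'failed', 'cancelled'],
  completed: [],
  failed: [],
  cancelled: [],
});

// A status is terminal when no other status can follow it.
function isTerminal(status) {
  const next = TRANSITIONS[status];
  if (!Array.isArray(next)) throw new Error(`unknown status: ${status}`);
  return next.length === 0;
}

// True when moving from `from` to `to` is a legal single step.
function canTransition(from, to) {
  const next = TRANSITIONS[from];
  if (!Array.isArray(next)) throw new Error(`unknown status: ${from}`);
  return next.includes(to);
}
```

A consumer can use `isTerminal` to decide when to stop polling or close a stream, and `canTransition` to flag out-of-order updates.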

Anatomy of an execution record

The platform persists every execution in app_execution with these fields:

{
  "id": "exec_8f4c2e1b",
  "app_id": "app_...",
  "user_id": "usr_...",
  "status": "completed",
  "input": { "company_name": "Stripe", ... },
  "is_test": false,
  "shared_session": false,
  "start_from_stage_index": 0,
  "prior_execution_id": null,
  "duration_ms": 43210,
  "error": null,
  "created_at": "2026-05-31T09:14:22Z"
}

Each stage gets its own app_stage_log entry:

{
  "execution_id": "exec_8f4c2e1b",
  "stage_index": 0,
  "stage_type": "agent",
  "status": "completed",
  "goal_expanded": "Research Stripe and write a brief...",
  "session_id": "sess_...",     // linked chat session if you "Debug in chat"
  "duration_ms": 38421,
  "error": null
}

The goal_expanded field is the goal after input substitution: what the agent actually saw. If your form had company_name: "Stripe", this is where {{company_name}} in the original goal got resolved.
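
To make the substitution concrete, here is a rough sketch of how a `{{name}}` placeholder could be resolved against the form input. The function name `expandGoal` is hypothetical, and leaving unknown placeholders untouched is an assumption; the platform's actual behaviour for missing values is not described here.

```javascript
// Replace every {{name}} placeholder with the matching input value.
// Assumption: placeholders with no matching input key are left as-is.
function expandGoal(goalTemplate, input) {
  return goalTemplate.replace(
    /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (match, name) =>
      Object.prototype.hasOwnProperty.call(input, name) ? String(input[name]) : match
  );
}

const goalTemplate = 'Research {{company_name}} and write a brief...';
console.log(expandGoal(goalTemplate, { company_name: 'Stripe' }));
// prints "Research Stripe and write a brief..."
```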

Watching a run live: the SSE stream

During running state, the App page subscribes to a Server-Sent Events stream keyed on the execution ID. The same stream is the public API surface for anyone wanting to mirror runs into another system. The endpoint:

GET /api/app-executions/:execId/stream
Authorization: Bearer <token>

// Response headers
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

The five execution-level events

Each event is framed as event: <name>\ndata: <JSON>\n\n. On connect, the server first replays the current state so that a late subscriber always sees a coherent sequence: the current execution status, every stage log so far, any pending human gate, and every saved artifact. Only then does it switch to live broadcast.

`event`	Payload shape	When it fires
`status`	`{ status: 'pending' | 'running' | 'completed' | 'failed' | 'cancelled' }`	On subscribe (replay) and on every execution-level state change.
`stage_log`	A full `AppStageLog` row; see Anatomy above.	On subscribe (one per existing stage log) and on every stage transition.
`human_gate`	A full `AppHumanGate` row.	When a `human` stage opens a gate and again when the gate is resolved.
`artifact`	A full `AppArtifact` row including `content` if present.	Each time a stage's outputs are persisted.
`done`	`{ status, duration_ms }`	Once, at the end. The server then closes the stream.

If you subscribe to an execution that is already in a terminal state, the server replays the stored stage logs and artifacts, sends a done event, and disconnects; no live socket is created. This makes replay and live monitoring use the same endpoint and the same parser.
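
For consumers outside the browser (where `EventSource` may not be available), the framing described above is simple enough to parse by hand. The following is a minimal sketch under the assumption that the whole text is already buffered; a streaming consumer would need to hold back any trailing partial frame until its blank-line terminator arrives. The function name `parseSseFrames` is hypothetical.

```javascript
// Split buffered SSE text into { event, data } objects.
// Frames end with a blank line; each frame has an `event:` line and one or
// more `data:` lines, which are joined with "\n" before JSON parsing.
function parseSseFrames(text) {
  const events = [];
  for (const frame of text.split(/\r?\n\r?\n/)) {
    let name = 'message'; // SSE default when a frame has no event: line
    const dataLines = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        name = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        dataLines.push(line.slice(5).replace(/^ /, '')); // drop one leading space
      }
    }
    if (dataLines.length > 0) {
      events.push({ event: name, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}
```

Because a terminal execution replays its stored events and then sends `done`, the same parser handles both a finished run and a live one.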
The fine-grained runner events
The five execution events above are the summary stream. Inside an agent stage, the underlying run-agent binary emits a much finer JSONL trace: token-by-token text, tool calls, sub-tasks, thinking. Those events live on the session-level stream, not the execution-level one, but they're what powers the right-hand reasoning trace pane. See the runner protocol for the full event catalogue.
Minimal SSE consumer
const es = new EventSource('/api/app-executions/' + execId + '/stream', {
  withCredentials: true,
});

es.addEventListener('status', (e) => console.log('status', JSON.parse(e.data)));
es.addEventListener('stage_log', (e) => console.log('stage', JSON.parse(e.data)));
es.addEventListener('artifact', (e) => console.log('artifact', JSON.parse(e.data)));
es.addEventListener('human_gate', (e) => console.log('gate', JSON.parse(e.data)));
es.addEventListener('done', (e) => { console.log('done', JSON.parse(e.data)); es.close(); });
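
Logging is enough for a demo, but most consumers want a single up-to-date picture of the run. One possible approach is to fold every event into a view object. This is a sketch, not the App page's actual code: `createRunView` and `applyEvent` are hypothetical names, and keying artifacts by an `id` field is an assumption (the artifact row shape is not shown above).

```javascript
// An empty view of a run, filled in as events arrive.
function createRunView() {
  return { status: null, stages: {}, gates: {}, artifacts: {}, done: false };
}

// Fold one execution-level event into the view.
// Writes are keyed, so applying a replayed event twice leaves the view unchanged.
function applyEvent(view, event, data) {
  switch (event) {
    case 'status':
      view.status = data.status;
      break;
    case 'stage_log':
      view.stages[data.stage_index] = data;
      break;
    case 'human_gate':
      view.gates[data.id] = data;
      break;
    case 'artifact':
      view.artifacts[data.id] = data; // assumption: artifacts carry an `id`
      break;
    case 'done':
      view.status = data.status;
      view.done = true;
      break;
    default:
      break; // ignore events this sketch does not know about
  }
  return view;
}

// Wiring it to the consumer above would look like:
//   const view = createRunView();
//   for (const name of ['status', 'stage_log', 'human_gate', 'artifact', 'done']) {
//     es.addEventListener(name, (e) => applyEvent(view, name, JSON.parse(e.data)));
//   }
```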
Test runs vs real runs
Set is_test: true on a run and it's flagged as ephemeral. Test runs:
Don't count against your usage quotas in the same way as production runs.
Are filtered out of the App's "Runs" tab by default (toggle "Show test runs" to see them).
Use a separate shared_session namespace so a teammate's test runs don't pollute yours.
Are auto-purged after 30 days unless pinned.
The App editor uses test runs when you click Try it; exactly what you want while iterating on prompts.
Human gates: pausing for approval
A human stage doesn't run code. It pauses the execution, displays the goal text and any prior artifacts, and waits for you (or another permitted user) to act. The pause is stored in app_human_gate:
{
  "id": "gate_...",
  "execution_id": "exec_...",
  "stage_index": 2,
  "message": "Review these 47 enriched leads. Remove rows you don't want to email.",
  "status": "pending", // pending | approved | rejected
  "response": null,
  "responded_at": null
}
What a gate looks like in the UI
┌─────────────────────────────── Run #exec_8f4c2 ──────────────────┐
│ Stage 3 of 4 · Review                                     paused │
│                                                                  │
│ Review these 47 enriched leads. Remove rows you don't want       │
│ to email.                                                        │
│                                                                  │
│ 📎 enriched_leads.csv (47 rows)                         [ View ] │
│                                                                  │
│ [ Approve & continue ]  [ Reject ]  [ Edit input first ]         │
└──────────────────────────────────────────────────────────────────┘
The three buttons
Approve & continue: gate becomes approved, the next stage starts immediately.
Reject: gate becomes rejected, the run transitions to failed.
Edit input first: opens the artifact for editing (e.g. remove rows from the CSV). On save, the edited version is what stage N+1 reads.
Finding gates that need your attention
Open Pending gates from the top nav, or hit GET /api/app-executions/pending-gates. Gates can be set up to ping you via email or Slack when they go pending: useful for long-running pipelines where you might not be watching.
The gate API
Approve or reject a gate by POSTing to its resolution endpoint. The body carries the verdict and an optional free-text response stored on the gate row:
POST /api/app-executions/:execId/gates/:gateId
Content-Type: application/json
Authorization: Bearer <token>

{
  "action": "approved", // or "rejected"
  "response": "Removed 4 lookalikes; list looks tighter now."
}
Mechanically: the server updates app_human_gate.status and stamps responded_at, then resolves an in-memory Promise that the executor has been awaiting. The waiting stage immediately leaves the waiting state, and the pipeline either proceeds to stage N+1 (on approve) or short-circuits to failed (on reject). A human_gate SSE event with the updated row is fanned out to every subscriber.
The execution stays running the whole time a gate is pending. Only the stage log goes to waiting. The sandbox is held alive (via keepAlive) so resuming on approve doesn't pay a cold-start cost. The trade-off is that an indefinitely-pending gate keeps a sandbox warm: set a reasonable expectation with your reviewers, or build the gate around an auto-timeout in the App's design.
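
The pause-and-resolve mechanics described above can be illustrated with a small in-memory registry. This is a hypothetical sketch of the pattern, not the executor's real code: `createGateRegistry`, `open`, `resolve`, and `pendingCount` are invented names, and persistence and SSE fan-out are left out.

```javascript
// In-memory registry of pending gates: gateId -> { row, resolve }.
function createGateRegistry(now = () => new Date().toISOString()) {
  const pending = new Map();
  return {
    // Called when a human stage starts; the stage awaits the returned promise.
    open(row) {
      return new Promise((resolve) => pending.set(row.id, { row, resolve }));
    },
    // Called by the handler for POST .../gates/:gateId.
    resolve(gateId, action, response = null) {
      const entry = pending.get(gateId);
      if (!entry) throw new Error(`gate ${gateId} is not pending`);
      if (action !== 'approved' && action !== 'rejected') {
        throw new Error(`invalid action: ${action}`);
      }
      pending.delete(gateId);
      const updated = { ...entry.row, status: action, response, responded_at: now() };
      entry.resolve(updated); // wakes the awaiting stage
      return updated; // the row that would be broadcast as a human_gate event
    },
    pendingCount: () => pending.size,
  };
}
```

Because the promise lives only in memory, a design like this implies the gate must be resolved by the same process that opened it; how the platform handles restarts is not covered here.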
Cancelling a run
Click Cancel on a running execution, or send:
POST /api/app-executions/:execId/cancel
Authorization: Bearer <token>
Cancellation:
Stops the currently-running stage. Already-completed stages stay completed; their artifacts are preserved.
Closes the sandbox.
Transitions the execution to cancelled.
Refunds any unspent budget back to your balance.
You can later Resume from the next stage; see below.
Resuming from a specific stage
A real superpower: re-run only the stages that need re-running, reusing artifacts from the ones that worked. Pass start_from_stage_index and prior_execution_id on the run request:
POST /api/apps/:appId/run
Content-Type: application/json

{
  "input": { ... same shape as the original run ... },
  "start_from_stage_index": 2,
  "prior_execution_id": "exec_8f4c2e1b"
}
What the executor actually does
Creates a fresh app_execution row pointing at the same App version. Stage logs for indexes 0 .. start_from_stage_index - 1 are written as status: 'skipped' immediately; the run history makes it clear which stages weren't re-executed.
Loads the skipped stages' artifacts via dao.listArtifactsByStages(priorExecutionId, [stageIds]). For each artifact that doesn't carry inline content, the executor downloads the body from S3 by its s3_key and decodes UTF-8, so the next stage's system prompt sees the prior content even if the original row was offloaded.
Acquires a brand-new sandbox. Files Stage 0-1 wrote on the original execution's sandbox are gone; only the persisted artifacts cross the boundary. If a later stage depended on intermediate scratch files rather than declared artifacts, resume won't reproduce them.
Builds stage start_from_stage_index's system prompt with the loaded prior artifacts injected as ## Previous Stage Outputs, then proceeds from there through the end of the pipeline.
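
Two of the steps above are pure data shaping and can be sketched directly: writing the skipped stage logs, and rendering prior artifacts into the prompt. This is an illustration under stated assumptions, not the executor's code. The function names are hypothetical, and so are the `name` and `stage_index` artifact fields and the per-artifact layout; only the "## Previous Stage Outputs" heading comes from the description.

```javascript
// Stage logs written up front for the stages a resume does not re-execute
// (indexes 0 .. startFromStageIndex - 1).
function skippedStageLogs(executionId, startFromStageIndex) {
  return Array.from({ length: startFromStageIndex }, (_, stageIndex) => ({
    execution_id: executionId,
    stage_index: stageIndex,
    status: 'skipped',
  }));
}

// Render prior artifacts as the prompt section injected into the resumed stage.
// Assumption: each artifact has `stage_index`, `name`, and already-loaded `content`.
function previousOutputsSection(artifacts) {
  const parts = artifacts.map(
    (a) => `### Stage ${a.stage_index}: ${a.name}\n\n${a.content}`
  );
  return ['## Previous Stage Outputs', ...parts].join('\n\n');
}

console.log(skippedStageLogs('exec_new', 2).map((log) => log.stage_index));
// prints [ 0, 1 ]
```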
Resume + shared_session
When the original run used shared_session: true and the resume also asks for it, the executor goes one step further: it locates a stage log on prior_execution_id that already has a session_id, reuses that session, and continues the same Claude conversation. Visible result: Debug in chat shows one continuous thread spanning the original and the resumed run. Don't enable this blindly; a shared Claude session means the model's context still contains earlier turns, which is sometimes the point and sometimes not.
When you'd use this
Last stage's draft was off. Stages 0-2 took 4 minutes and worked fine; stage 3's email draft missed the tone. Tweak the App, then re-run from stage 3; saves 4 minutes.
Stage failed for an external reason (Connect token expired mid-run). Fix the Connect, re-run from the failed stage with the same inputs.
Human rejected a gate. Adjust the prior stage's prompt, re-run from there.
Resume doesn't roll back partial side-effects. If the stage you are re-running already wrote rows to your CRM, sent a Slack message, or pushed a git commit during the original run, that work has already shipped. The platform doesn't track outbound side-effects; resume only reuses persisted artifacts. For idempotency, make Connect-touching stages check "have we already done this?" first.
In the UI: open the run, click Re-run from stage, pick the stage.
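
The "have we already done this?" check mentioned above usually comes down to an idempotency key. The sketch below shows one way to structure it; everything in it is hypothetical (`idempotencyKey`, `runOnce`, the key format), and the in-memory `Set` stands in for durable storage such as a CRM field or a database table. Note that the key deliberately avoids the execution ID, because a resume creates a new app_execution row.

```javascript
// Build a key that is identical for the original run and any resume of it.
// Assumption: App ID + stage index + a logical target identify the side-effect.
function idempotencyKey(appId, stageIndex, target) {
  return [appId, stageIndex, target].join(':');
}

// Run `effect` only if `key` has not been recorded in `store` yet.
// `store` is any object with has() and add(), e.g. a Set.
function runOnce(store, key, effect) {
  if (store.has(key)) return { skipped: true };
  const result = effect();
  store.add(key); // record only after the effect succeeded
  return { skipped: false, result };
}

const seen = new Set();
const key = idempotencyKey('app_123', 3, 'lead:jane@example.com');
runOnce(seen, key, () => 'email sent');            // performs the effect
console.log(runOnce(seen, key, () => 'email sent')); // prints { skipped: true }
```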
Run history
On the App page, the Runs tab lists every execution, most recent first. Each row shows:
Column	What it shows
Status	One of the 5 pills (pending / running / completed / failed / cancelled).
Triggered by	You, a teammate, a Schedule, or the API.
Started	Timestamp in your workspace's time zone.
Duration	Total wall-clock time across all stages.
Cost	Sandbox seconds + LLM tokens converted to USD.
Inputs	The form values used. Click to expand.
Artifacts	Count, with quick-download for each.
Filters at the top let you narrow by status, date range, who triggered, and tag. The list is also available at GET /api/app-executions?app_id=....
Debugging failed runs
Step 1: Read the error
Open the failed run. The first thing on the page is the error message and the stage where it happened. Common shapes:
Error	Almost always means
`stage_timeout`	The stage took longer than `timeout_ms`. Increase it.
`connect_unauthorized`	A Connect was revoked or expired. Re-authorize.
`required_input_missing`	A required form field was blank. Supply a value, or mark fewer fields as required.
`artifact_format_mismatch`	The stage produced output in the wrong format (e.g. markdown when CSV was declared).
`sandbox_oom`	Out of memory inside the sandbox, typically from heavy data processing. Split the work into stages or use a script stage.
`agent_gave_up`	The agent decided the task couldn't be done. Read its reasoning; usually a missing input.
Step 2: Debug in chat
Click Debug in chat. A chat opens with the failed run's full context: inputs, partial artifacts, agent reasoning trace, error. Talk to the agent:
> Why did this fail? What should I change in the App?
The agent diagnoses, suggests a fix, and (often) directly edits the App to apply it. You can then re-run from the same point.
Step 3: Retry or re-run
Retry: same inputs, same App version, fresh sandbox. Sometimes flaky external APIs just need a second try.
Re-run from stage N: keeps prior artifacts. Use when only the late stages are broken.
Edit App, then run: for systematic bugs (vague prompt, wrong format). Bumps the App to a new version.
Sharing an execution
You can hand a teammate a read-only view of an execution: useful for "look what this App produced yesterday" without giving them write access to the App itself.
Open the run.
Click Share.
Configure: artifacts to include, expiry, download allowed, password gate.
Get a public URL: https://app.aitroop.net/s/<token>.
Shares are persisted in the share table with a token, an unlock policy, and a download handler. Revoke any time from Settings → Shares.
Triggering a run from the API
Useful when you want to run an App from a webhook, a CI pipeline, or another agent. The minimal call:
curl -X POST https://app.aitroop.net/api/apps/<appId>/run \
  -H "Authorization: Bearer $AT_USER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "company_name": "Stripe" } }'
The response is the new app_execution with status: "pending". To watch it:
curl https://app.aitroop.net/api/app-executions/<execId>/stream \
  -H "Authorization: Bearer $AT_USER_TOKEN" \
  -N
The SSE stream stays open until the run reaches a terminal state. Alternatively, poll GET /api/app-executions/:execId.
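
For callers that prefer polling over SSE, the two calls above can be combined into a trigger-and-wait helper. This is a sketch under assumptions: the helper names, the polling interval, and the poll limit are all hypothetical, and it assumes both endpoints return the app_execution record with `id` and `status` fields. It relies on the global `fetch` available in recent Node.js versions and browsers.

```javascript
const BASE_URL = 'https://app.aitroop.net';
const TERMINAL = new Set(['completed', 'failed', 'cancelled']);

// Build the fetch arguments for POST /api/apps/:appId/run.
function buildRunRequest(appId, input, token) {
  return {
    url: `${BASE_URL}/api/apps/${encodeURIComponent(appId)}/run`,
    options: {
      method: 'POST',
      headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ input }),
    },
  };
}

// Start a run, then poll GET /api/app-executions/:execId until it is terminal.
async function runAndWait(appId, input, token, { intervalMs = 2000, maxPolls = 150 } = {}) {
  const { url, options } = buildRunRequest(appId, input, token);
  const started = await fetch(url, options);
  if (!started.ok) throw new Error(`run request failed: HTTP ${started.status}`);
  let execution = await started.json();
  for (let i = 0; i < maxPolls && !TERMINAL.has(execution.status); i++) {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const res = await fetch(
      `${BASE_URL}/api/app-executions/${encodeURIComponent(execution.id)}`,
      { headers: { Authorization: `Bearer ${token}` } }
    );
    if (!res.ok) throw new Error(`poll failed: HTTP ${res.status}`);
    execution = await res.json();
  }
  return execution; // may still be non-terminal if maxPolls was exhausted
}
```

Callers should check the returned `status` themselves, since a run paused on a human gate can outlast any reasonable poll limit.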
FAQ & troubleshooting
My run is stuck in pending for minutes. What's wrong?
Possible causes:
Sandbox provider cold start (rare; usually <1 s).
You hit your plan's concurrent-run quota; the run is queued behind others.
The agent worker queue is backed up (status banner at Settings → Workspace).
Fix: if it's been >2 minutes, cancel and re-run. Check the status banner. If everything's green and it's still stuck, the run is queued; wait or cancel.
Why does my completed run show "0 artifacts"?
The App's stages didn't declare any artifact_defs, or the agent produced output that didn't conform to the declared format and the platform discarded it. Open the run, scroll to the affected stage, and look for a "format mismatch" warning. Fix the goal to be explicit about what to produce, or change the declared format to file as an escape hatch.
I cancelled a run, can I "un-cancel" it?
No, but you can re-run from the last completed stage. The prior execution's artifacts are preserved, so you don't lose work. Open the cancelled run, click Re-run from stage, pick the stage after the last one that completed.
A human gate has been pending for 3 days. Did the run die?
No. Gates wait indefinitely by default. The execution is parked, not failed. To clean up, either respond to the gate or cancel the run.
If you want gates to auto-expire, add a timeout in the stage definition: "gate_timeout_ms": 86400000 for 24 hours. After the timeout, the gate transitions to rejected and the run transitions to failed.
Can I see what the agent was thinking?
Yes. Click any stage in the run timeline to expand its reasoning trace. The trace shows the agent's plan, every tool call with arguments and results, and its final commentary. For deeper detail, toggle "Show thinking"; this exposes the model's internal reasoning blocks (when available).
Why is the same App taking 2× longer this week than last week?
Likely cause: the agent is making more tool calls. Either the prompt changed (check Versions), the data being processed grew, or an external API got slower. Open both runs side by side: the run log shows the count and duration of every tool call. Compare to find the regression.
I want to A/B test two versions of the same App. How?
Fork the App (creates a private copy at v1), edit the copy, then run both Apps with the same inputs. Compare the runs side by side from each App's Runs tab. For statistical depth, write a Schedule that runs both versions every day on the same inputs and pipes results to a sheet.
Can I trigger a run from Zapier / n8n / a curl command in CI?
Yes. POST /api/apps/:appId/run with your bearer token; see the curl example above. Most users put the call in a webhook handler and poll GET /api/app-executions/:execId to detect completion. Outbound webhooks on completion are also supported via Schedule delivery destinations.