Building Crash Recovery Hooks for Claude Code with PID-Based Detection

When you’re deep in a coding session and your CLI crashes, the context you’ve built up vanishes. I built crash recovery hooks that detect when a previous Claude Code session died unexpectedly, parse the transcript of what was happening, and create a structured handoff document so the next session can pick up where you left off — with no manual intervention and no waiting period.

The Problem: Crashed Sessions Leave No Trace

Claude Code has a hook system that fires events at session lifecycle points: SessionStart, SessionEnd, PreToolUse, PostToolUse, and others. When a session exits cleanly (Ctrl+C, /exit), the SessionEnd hook fires. But when the CLI process crashes or hangs and gets killed, SessionEnd never fires. The session just disappears.

I already had a cross-session coordination system using PostgreSQL — sessions register on startup and record heartbeats. The idea was straightforward: if a session has no exited_at timestamp and its heartbeat is stale, it probably crashed. But the implementation took several iterations.

First Attempt: The 5-Minute Wait Problem

The initial implementation queried for sessions where exited_at IS NULL AND last_heartbeat < NOW() - 5 minutes. This works for detecting sessions that died between work sessions — like when you close your laptop and come back the next day. But it completely fails for the most common case: you kill a stuck session and immediately restart.

In the immediate restart scenario, the crashed session’s heartbeat is only seconds old. The 5-minute threshold doesn’t see it as stale. You’d have to wait before restarting to get crash detection — which defeats the purpose.

Second Detour: Getting the Hook Output Format Wrong

Before even getting to the timing problem, I spent considerable time debugging why the hook’s message wasn’t showing up in the CLI. The hook was correctly detecting crashes, creating handoff files, and updating the database — but the user never saw the notification.

I initially used this output format, modeled after what another working hook appeared to use:

// WRONG - this format doesn't exist in the spec
{
  "result": "continue",
  "message": "Previous session crashed..."
}

After investigating the JSONL transcripts and finding zero trace of the message being injected, I researched the actual Claude Code hook output schema. The correct format for SessionStart hooks is:

// CORRECT - the actual hook output schema
{
  "systemMessage": "Text shown to the USER in the CLI",
  "hookSpecificOutput": {
    "hookEventName": "SessionStart",
    "additionalContext": "Text injected as context for the MODEL"
  }
}

Two completely different channels: systemMessage renders in the CLI banner that the user sees, while hookSpecificOutput.additionalContext gets injected into the model’s context so Claude knows about the crash. The {"result":"continue","message":"..."} pattern I initially tried simply doesn’t exist in the specification.

How the Crash Recovery Hooks Work

The system uses three hooks coordinating through PostgreSQL:

1. Session Registration (SessionStart)

When Claude Code starts, session-register.ts fires and stores the session’s metadata in PostgreSQL:

// session-register.ts
const registerResult = registerSession(
  sessionId,
  project,
  '',
  input.session_id,       // Claude's session UUID
  input.transcript_path,  // Path to the JSONL transcript
  process.ppid            // PID of the Claude CLI process
);

The key addition is process.ppid — the parent process ID. Since hooks run as child processes of the Claude CLI, process.ppid gives us the PID of the Claude instance itself. This gets stored in the sessions table alongside the transcript path and Claude’s internal session UUID.

2. Clean Exit Marker (SessionEnd)

When a session exits normally, session-clean-exit.ts sets exited_at = NOW() in the database. If this hook doesn’t fire (crash, hang, kill -9), the column stays NULL — our crash signal.

3. Crash Detection (SessionStart)

On the next startup, session-crash-recovery.ts queries for all sessions on this project where exited_at IS NULL. Then it determines if each session actually crashed using a two-tier check:

function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);  // Signal 0 = existence check
    return true;
  } catch {
    return false;
  }
}

function isSessionCrashed(session: CrashedSession): boolean {
  if (session.pid) {
    // PID stored: check if process is dead
    return !isProcessAlive(session.pid);
  }
  // No PID: fall back to 5-minute stale heartbeat
  if (!session.last_heartbeat) return true;
  const heartbeat = new Date(session.last_heartbeat).getTime();
  const staleThreshold = 5 * 60 * 1000;
  return Date.now() - heartbeat > staleThreshold;
}

process.kill(pid, 0) sends signal 0 to a process — it doesn’t actually signal anything, just checks whether the process exists. If the PID is dead and there’s no clean exit marker, the session crashed. No waiting required.

For sessions registered before PID tracking was added (or edge cases where the PID isn’t available), the old 5-minute heartbeat threshold kicks in as a fallback.

Transcript Parsing and Handoff Generation

When a crash is confirmed, the hook reads the crashed session’s JSONL transcript file and extracts high-signal data: the last todo state, recent tool calls, files modified, and errors encountered. This gets formatted into a YAML handoff document following the same format used by the manual /create_handoff command:

---
session: crash-test
date: 2026-02-12
status: crashed
outcome: PARTIAL_MINUS
---

goal: Session auto-compacted
now: Continue from auto-compact

decisions:
  - crash_recovery: "Previous session ended unexpectedly (CLI crash/hang)"

worked:
  - "WebFetch completed successfully"

next:
  - "Review session state and continue"

The handoff is written to the project’s thoughts/shared/handoffs/ directory where the resume_handoff skill can pick it up.

The Testing Process

Testing crash recovery requires actually crashing things, which creates its own set of challenges. The test procedure:

Clean the database of old test entries
Start Claude in a test directory
Run at least one command (empty transcripts are deliberately skipped)
Find the PID and kill -9 it from another terminal
Start Claude again and check for the recovery notification

One gotcha: the hook silently skips sessions where the transcript has no tool calls and no file modifications. This makes sense for production (don’t create recovery handoffs for sessions that did nothing), but during testing I forgot to run a command before killing the process and spent time debugging why the crash “wasn’t detected” — when it actually was detected and acknowledged, just without generating a handoff.

What I Learned

Claude Code hook output schema is specific. SessionStart hooks use systemMessage for user-visible text and hookSpecificOutput.additionalContext for model context. There’s no generic message field. Getting this wrong means your hook runs fine but nobody sees the output.
PID checking beats heartbeat timeouts. For crash detection, checking if a process is alive is both faster and more accurate than waiting for a heartbeat to go stale. process.kill(pid, 0) is a clean, zero-cost existence check.
Hooks run as child processes. process.ppid inside a hook gives you the Claude CLI’s PID. This is the right value to store for later liveness checks.
Test your empty case. A hook that works perfectly on rich transcripts but silently no-ops on empty ones will make you think crash detection is broken when it’s actually the handoff generation that’s (correctly) skipping.

Results

The crash recovery system now detects crashed sessions immediately on restart — no 5-minute wait. The user sees a notification in the CLI banner, the model gets context about what happened, and a structured handoff file is ready for the resume_handoff skill. The PID-based detection handles the common case (immediate restart after crash), while the heartbeat fallback covers edge cases like sessions from before PID tracking was added or sessions where PID reuse makes the check unreliable.

Creating a Self-Documenting Claude Code Skill for Technical Blog Posts — More on the Claude Code skill/hook ecosystem
Releasing an Alexa Skill Beta: Security Audits, Tests, and an Automated Release Workflow — Similar pattern of building resilient automation