docs: chat resilience + connection-pill probes state.db

Alan Wizemann
2026-04-27 22:00:41 +02:00
parent 07283d17bc
commit 703d5d6ccc
5 changed files with 29 additions and 4 deletions
+10
@@ -25,6 +25,16 @@ Streams tokens, thoughts, and tool calls live via the [ACP subprocess](ACP-Subpr
- Resume an old conversation: pick from the session picker; Scarf calls `session/load` or `session/resume` depending on the state.
- Auto-reconnect — if the subprocess dies, Scarf attempts a `session/resume` to pick up where it left off.
**Resilience** _(v2.5.2+, iOS-at-parity)_. Both Mac and iOS recover automatically when the SSH socket drops, the phone sleeps, or the network changes — Hermes keeps writing to `state.db` on the remote during the outage, and Scarf reattaches via `session/resume` (with `session/load` fallback) on a 5-attempt 1→2→4→8→16 s exponential backoff. After a successful reconnect, `reconcileWithDB` merges any messages the agent persisted while you were offline and a "Resynced N new messages" toast surfaces what changed.
- iOS specifically gains a yellow **Reconnecting (n/5)…** banner during recovery and a red **No network** banner while reachability is unsatisfied (driven by `NWPathMonitor`).
- iOS observes scene-phase transitions through `ScarfGoCoordinator` so a chat tab that was unmounted while you were on Dashboard still picks up the background → active edge and verifies channel health on resume.
- Composer drafts persist across force-quit, keyed by `(serverID, sessionID)` in `UserDefaults`; a 7-day janitor at app launch prunes stale slots.
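
The reconnect schedule described above can be sketched roughly as follows. This is a minimal illustration, not Scarf's actual code: `backoffDelays`, `reconnect`, and `ReconnectError` are hypothetical names, and both the choice to wait *before* each attempt and to try the `session/load` fallback within the same attempt are assumptions.

```swift
import Foundation

/// Delay schedule for the 1, 2, 4, 8, 16 s backoff (hypothetical helper name).
func backoffDelays(attempts: Int = 5, base: TimeInterval = 1) -> [TimeInterval] {
    (0..<attempts).map { base * pow(2, Double($0)) }
}

enum ReconnectError: Error { case exhausted }

/// Waits out each delay, then tries `resume` and falls back to `load`.
/// Returns the 1-based attempt number that succeeded.
/// Assumption: the fallback is tried inside the same attempt, not as a separate one.
@discardableResult
func reconnect(
    delays: [TimeInterval] = backoffDelays(),
    sleep: (TimeInterval) async -> Void,
    resume: () async throws -> Void,
    load: () async throws -> Void
) async throws -> Int {
    for (index, delay) in delays.enumerated() {
        await sleep(delay)
        do { try await resume(); return index + 1 } catch {}
        do { try await load(); return index + 1 } catch {}
    }
    throw ReconnectError.exhausted
}
```

Injecting `sleep` as a closure keeps the loop testable without real waiting; a caller would presumably pass something built on `Task.sleep`.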
**Pagination** _(v2.5.2+)_. Initial load fetches the most recent 200 messages for a session (500 on the Mac Sessions detail view). Sessions with more on-disk history surface a "Load earlier messages" button at the top of the transcript. Pagination is keyed by message id (monotonic) so streaming-chunk timestamps that collide on the same millisecond never split a page.
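
To illustrate why keying on the id matters, here is a small in-memory sketch. `Message` and `fetchPage` are hypothetical stand-ins (the real `fetchMessages` runs against SQLite), and the field names are assumptions.

```swift
struct Message: Equatable {
    let id: Int64           // assumed monotonic row id
    let timestampMs: Int64  // may collide between streaming chunks
    let text: String
}

/// Returns up to `limit` messages with id below `beforeID` (or the newest
/// ones when `beforeID` is nil), in chronological order.
func fetchPage(from all: [Message], beforeID: Int64? = nil, limit: Int) -> [Message] {
    let older = all.filter { message in
        beforeID.map { message.id < $0 } ?? true
    }
    // Order by id descending, take one page, then flip back for display.
    let newestFirst = older.sorted { $0.id > $1.id }.prefix(limit)
    return Array(newestFirst.reversed())
}
```

A "Load earlier messages" tap would then pass the smallest id currently on screen as `beforeID`. Because the cursor is the id rather than the timestamp, two chunks stamped in the same millisecond can never straddle a page boundary ambiguously.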
**Offline-tolerant snapshots** _(v2.5.2+)_. When a fresh remote `state.db` snapshot pull fails, Scarf falls back to the last cached copy at `~/Library/Caches/scarf/snapshots/<server-id>/state.db` so Dashboard and Sessions stay viewable. The chat history reload path explicitly opts out of this fallback (`forceFresh: true`) — falling back there would silently hide messages the agent streamed during the outage.
**Voice mode controls:** PTT (push-to-talk), TTS playback, STT transcription preferences live in **Settings → Voice**. The chat toolbar exposes the basic toggles.
## Chat density preferences _(v2.5.1+, Mac)_
+8 -2
@@ -7,7 +7,7 @@ In v2.5 most service code moved out of the Mac target into the shared **ScarfCor
| Service | Isolation | Lines | Purpose |
|---|---|---|---|
| [`ACPClient`](https://github.com/awizemann/scarf/blob/main/scarf/scarf/Core/Services/ACPClient.swift) | `actor` | ~605 | Spawns `hermes acp` subprocess; JSON-RPC over stdio; async event stream for chat. |
| [`HermesDataService`](https://github.com/awizemann/scarf/blob/main/scarf/scarf/Core/Services/HermesDataService.swift) | `actor` | ~658 | Read-only SQLite queries against `state.db`; pulls atomic snapshots for remote; dedupes concurrent snapshot calls via a nested `SnapshotCoordinator` actor. |
| [`HermesDataService`](https://github.com/awizemann/scarf/blob/main/scarf/scarf/Core/Services/HermesDataService.swift) | `actor` | ~750 | Read-only SQLite queries against `state.db`; pulls atomic snapshots for remote; dedupes concurrent snapshot calls via a nested `SnapshotCoordinator` actor. _v2.5.2:_ falls back to the cached snapshot when a fresh pull fails (`isUsingStaleSnapshot` / `lastSnapshotMtime`); `fetchMessages` paginates by id desc with `HistoryPageSize` budgets; `refresh(forceFresh:)` lets chat-history reloads opt out of the fallback. |
| [`HermesEnvService`](https://github.com/awizemann/scarf/blob/main/scarf/scarf/Core/Services/HermesEnvService.swift) | `Sendable struct` | ~217 | Non-destructive `~/.hermes/.env` I/O — preserves comments and blanks; `unset` comments out instead of deleting. |
| [`HermesFileService`](https://github.com/awizemann/scarf/blob/main/scarf/scarf/Core/Services/HermesFileService.swift) | `Sendable struct` | ~620 | Parses `config.yaml` into typed nested structs (now including `platform_toolsets`); resolves `hermes` binary; enriches `$PATH` for spawned tools (brew/nvm/asdf). |
| [`HermesFileWatcher`](https://github.com/awizemann/scarf/blob/main/scarf/scarf/Core/Services/HermesFileWatcher.swift) | `@Observable` | ~122 | Local: FSEvents via `DispatchSourceFileSystemObject`. Remote: mtime polling over the SSH ControlMaster. Updates `lastChangeDate`; views observe and refresh. |
@@ -44,7 +44,13 @@ In v2.5 most service code moved out of the Mac target into the shared **ScarfCor
|---|---|
| `KeychainSSHKeyStore` | Per-server Ed25519 keypair persistence in the iOS Keychain (`com.scarf.ssh-key` service). Default accessibility `kSecAttrAccessibleAfterFirstUnlockThisDeviceOnly` + `kSecAttrSynchronizable=false`; in v2.5.1 a `SSHKeyICloudPreference` opt-in (System → Security toggle) flips writes to `kSecAttrAccessibleAfterFirstUnlock` + `kSecAttrSynchronizable=true` so iCloud Keychain syncs the key across the user's Apple devices ([#52](https://github.com/awizemann/scarf/issues/52)). Read / list / delete queries unconditionally pass `kSecAttrSynchronizable=kSecAttrSynchronizableAny` so they match items regardless of sync state. v2 multi-server format: account `"server-key:<UUID>"`. Auto-migrates v1 (`"primary"` account) on first `listAll`. Public `migrateAllItems(toICloudSync:)` re-saves every stored bundle with target attributes — idempotent. |
| `CitadelSSHService` | Pure-Swift Ed25519 keypair generation + connection probes via Citadel. Used by Onboarding's "Generate Key" + "Test Connection" steps. |
| `CitadelServerTransport` | Citadel-backed implementation of `ServerTransport` — drives `executeCommandStream` for resilient stdout capture (preserves output on non-zero exit) and prepends `PATH=$HOME/.local/bin:/opt/homebrew/bin:/usr/local/bin:$PATH` so non-interactive sessions resolve `hermes` and its sub-tools without sourcing user shell rc files. |
| `CitadelServerTransport` | Citadel-backed implementation of `ServerTransport` — drives `executeCommandStream` for resilient stdout capture (preserves output on non-zero exit) and prepends `PATH=$HOME/.local/bin:/opt/homebrew/bin:/usr/local/bin:$PATH` so non-interactive sessions resolve `hermes` and its sub-tools without sourcing user shell rc files. _v2.5.2:_ exposes `cachedSnapshotPath` so `HermesDataService` can fall back to the on-disk snapshot when an SSH pull fails. |
## v2.5.2 additions (iOS-only — in ScarfIOS)
| Service | Purpose |
|---|---|
| `NetworkReachabilityService` | Process-wide `NWPathMonitor` singleton. Publishes `isSatisfied` / `isExpensive` / `transitionTick` on the main actor (the path-update handler bounces back through `Task { @MainActor in ... }`). `ChatController.handleReachabilityChange` observes `transitionTick` to suspend in-flight reconnect attempts on link-down (every retry would burn a budget slot against a guaranteed failure) and kick off a fresh cycle on link-up, so airplane-mode toggles and WiFi↔cellular handoffs recover automatically. Lives in ScarfIOS rather than ScarfCore because `Network.framework` doesn't ship on Linux. |
See [ScarfCore Package](ScarfCore-Package) for the package architecture and how to add a new shared service.
+3
@@ -43,6 +43,9 @@ Onboarding details:
| **Skills** | Browse the skills tree from `~/.hermes/skills/`. Read-only. |
| **Settings** | Read view of full `config.yaml` plus a **Quick Edits** section that flips 7 commonly-changed keys (`model.default`, `model.provider`, `agent.approval_mode`, `agent.max_turns`, `display.show_cost`, `display.show_reasoning`, `display.streaming`) via `hermes config set` on the remote. Other keys remain read-only — edit from the Mac app or a remote shell. |
| **Slash commands** _(v2.5)_ | Read-only browser of project-scoped slash commands shipped via `<project>/.scarf/slash-commands/`. Tap a row to see the expanded prompt with a sample-argument field. Authoring is Mac-only in v1. See [Slash Commands](Slash-Commands). |
| **Auto-reconnect** _(v2.5.2)_ | Lock the phone, switch from WiFi to cellular, or just lose signal mid-prompt — when the SSH socket dies, ScarfGo reattaches via `session/resume` (with `session/load` fallback) on a 5-attempt 1→2→4→8→16 s exponential backoff. Hermes keeps writing to `state.db` on the remote during the outage; on success a "Resynced N new messages" toast surfaces what the agent did while you were offline. A yellow **Reconnecting (n/5)…** banner shows the recovery in progress; a red **No network** banner shows when reachability is unsatisfied. See [Chat](Chat) for the full resilience model. |
| **Draft persistence** _(v2.5.2)_ | A half-typed message survives force-quit — drafts are persisted to `UserDefaults` keyed by `(serverID, sessionID)` and restored when the session resumes. A 7-day janitor at app launch prunes stale slots. |
| **Load earlier messages** _(v2.5.2)_ | Long sessions (200+ messages) page chronologically — the initial fetch loads the most recent 200, with a "Load earlier messages" button at the top of the transcript for the rest. Pagination is keyed by message id (monotonic) so streaming-chunk timestamps that collide on the same millisecond never split a page. |
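
The draft-persistence row above can be pictured with a sketch like the one below. It models the stored drafts as a plain dictionary and leaves out the `UserDefaults` read/write; `Draft`, `draftKey`, `pruneStaleDrafts`, the key format, and the `String` session id are all assumptions made for illustration.

```swift
import Foundation

struct Draft: Codable, Equatable {
    var text: String
    var updatedAt: Date
}

/// Builds one storage key per (serverID, sessionID) pair. Key format is hypothetical.
func draftKey(serverID: UUID, sessionID: String) -> String {
    "draft:\(serverID.uuidString):\(sessionID)"
}

/// Janitor pass: keeps only drafts touched within `maxAge` (7 days by default).
func pruneStaleDrafts(
    _ drafts: [String: Draft],
    now: Date = Date(),
    maxAge: TimeInterval = 7 * 24 * 60 * 60
) -> [String: Draft] {
    drafts.filter { now.timeIntervalSince($0.value.updatedAt) <= maxAge }
}
```

In an app, the dictionary would be decoded from `UserDefaults` at launch, pruned, and written back.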
## Project-scoped chat
+4 -2
@@ -41,12 +41,14 @@ If the connection pill is green but the Dashboard shows "Stopped", "unknown", or
**The pill itself diagnoses common cases inline** _(v2.5.1+)._ Clicking the yellow "Can't read Hermes state" pill opens a popover with:
- The specific reason (`config.yaml is missing`, `permission denied on config.yaml`, `~/.hermes` doesn't exist, `Hermes profile <name> is active`, etc.)
- An actionable hint paragraph (`run hermes setup on the remote`, `chmod a+r ~/.hermes/config.yaml`, etc.)
- The specific reason (`Hermes hasn't been run yet`, `permission denied on state.db`, `~/.hermes` doesn't exist, `Hermes profile <name> is active`, etc.)
- An actionable hint paragraph (`run any hermes session on the remote to create state.db`, `chmod a+r ~/.hermes/state.db`, etc.)
- A Run Diagnostics button (opens the heavy 14-check sheet) and a Retry button
For the "profile is active" case the popover includes a copy-paste `hermes profile use default` command. See [Projects & Profiles](Projects-and-Profiles) for the full Hermes v0.11 profile model.
**Pill probes `state.db`, not `config.yaml`** _(v2.5.2+)._ The tier-2 readability check now targets `~/.hermes/state.db` because that's the file Scarf actively reads on every Dashboard / Sessions / Chat tick. Hermes v0.11+ doesn't materialize `config.yaml` until the user explicitly changes a setting — a freshly-installed working Hermes would otherwise be marked "degraded — config missing" indefinitely. `state.db` is created on the first agent run and is the actual surface Scarf depends on.
**Manage Servers → 🩺 Run Diagnostics** runs **fourteen** checks in one SSH session: connectivity, `sqlite3` presence, read access to `config.yaml` and `state.db`, the effective non-login `$PATH`, etc. Each failure explains itself with a remediation hint. **Copy Full Report** dumps the whole output for bug reports.
**Pill probe and diagnostics now use the same plumbing** _(v2.5.1+)._ Both go through the shared [`SSHScriptRunner`](Core-Services) (raw `/usr/bin/ssh ... -- /bin/sh -s`, script piped via stdin) instead of the prior split where the pill went through `runProcess`'s argument quoting and the diagnostics view used a local workaround. They no longer disagree about what the remote sees — issue [#44](https://github.com/awizemann/scarf/issues/44).
+4
@@ -116,6 +116,10 @@ sqlite3 '/tmp/scarf-snapshot-XYZ.db' "PRAGMA journal_mode=DELETE;"
`.backup` is WAL-safe: it captures a consistent snapshot without blocking writers. The `PRAGMA journal_mode=DELETE` strips WAL mode so the snapshot is self-contained (no `-wal`/`-shm` sidecars). `scp` pulls it to `~/Library/Caches/scarf/snapshots/<id>/state.db`, and the remote temp file is then removed.
#### Snapshot fallback _(v2.5.2+)_
`ServerTransport.cachedSnapshotPath` exposes that local cache path even when the remote is unreachable. `HermesDataService.open()` uses it as a fallback when a fresh `snapshotSQLite` call throws — the data layer surfaces `isUsingStaleSnapshot = true` + `lastSnapshotMtime` so views can render a "Last updated X ago" affordance instead of blanking. The chat-history reload path explicitly opts out via `refresh(forceFresh: true)` because falling back there would silently hide messages the agent streamed during the outage. `LocalTransport.cachedSnapshotPath` returns `nil` (the live DB has no separate cache).
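
The decision logic in this paragraph might look roughly like the sketch below. Only `cachedSnapshotPath`, `isUsingStaleSnapshot`, and `forceFresh` come from the text; `resolveSnapshot`, `SnapshotResult`, and `SnapshotError` are hypothetical, and the fresh pull is reduced to a closure.

```swift
import Foundation

struct SnapshotResult: Equatable {
    let path: String
    let isUsingStaleSnapshot: Bool
}

enum SnapshotError: Error { case pullFailed }

/// Tries a fresh pull; on failure, falls back to the cached copy unless the
/// caller asked for `forceFresh` or no cache path exists (the local case).
func resolveSnapshot(
    forceFresh: Bool,
    pullFresh: () throws -> String,
    cachedSnapshotPath: String?
) throws -> SnapshotResult {
    do {
        return SnapshotResult(path: try pullFresh(), isUsingStaleSnapshot: false)
    } catch {
        // Chat-history reloads pass forceFresh: true and see the original error.
        guard !forceFresh, let cached = cachedSnapshotPath else { throw error }
        return SnapshotResult(path: cached, isUsingStaleSnapshot: true)
    }
}
```

Rethrowing the original error in the `forceFresh` case lets the chat view show a real failure instead of a quietly outdated transcript.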
### Remote watching
Every 3 seconds, the remote runs a one-liner that concatenates the mtimes of the watched paths and hashes them into a signature. When the signature changes, the stream yields `.anyChanged`. Transient connection drops are tolerated.
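
As a rough sketch of the signature comparison on the client side: the hash below is FNV-1a, chosen only as a stand-in because the text does not say which hash is used, and `signature(ofMtimes:)` and `ChangeDetector` are hypothetical names.

```swift
/// Hashes the concatenated mtimes into one value.
/// FNV-1a is an assumption; the actual hash is not specified here.
func signature(ofMtimes mtimes: [String]) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325
    for byte in mtimes.joined(separator: "\n").utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3  // wrapping multiply by the FNV prime
    }
    return hash
}

struct ChangeDetector {
    private var last: UInt64?

    /// Returns true when this poll's signature differs from the previous one.
    /// The first poll only establishes a baseline and reports no change.
    mutating func observe(_ mtimes: [String]) -> Bool {
        let current = signature(ofMtimes: mtimes)
        defer { last = current }
        guard let previous = last else { return false }
        return previous != current
    }
}
```

A poll that fails because of a transient drop would simply skip `observe`, leaving the baseline untouched until the next successful poll.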