fix(connection-pill): unify pill probe with diagnostics over raw ssh (#44)

Issue #44: pill stuck on "Connected — can't read Hermes state" while
Run Diagnostics shows 14/14 passing. Both code paths probe the same
question (`[ -r ~/.hermes/config.yaml ]`) yet disagreed.

Root cause: the pill called `transport.runProcess(executable:
"/bin/sh", args: ["-c", script])` which routes through
SSHTransport.remotePathArg quoting. That quoting double-quotes every
argument to rewrite `~/` → `$HOME/`, mangling multi-line shell
scripts containing `"$VAR"` references and nested quotes — the
remote received a scrambled `if`-test and `$H/config.yaml` evaluated
to `"/config.yaml"` (or worse), so tier-2 always read as failed.
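
As a rough illustration of the failure mode (a sketch only: `naiveDoubleQuote`
is a hypothetical stand-in, not the real `remotePathArg`, whose implementation
is not shown in this commit), a quoting helper that suits a single path stops
being safe once the argument is a whole script with its own quotes:

```swift
// Hypothetical stand-in for a path-oriented quoting helper. This is NOT
// SSHTransport.remotePathArg; it only mimics the behaviour described above.
func naiveDoubleQuote(_ arg: String) -> String {
    // Rewrite a leading "~/" to "$HOME/" and wrap in double quotes so the
    // remote shell expands $HOME.
    let rewritten = arg.hasPrefix("~/") ? "$HOME/" + String(arg.dropFirst(2)) : arg
    return "\"" + rewritten + "\""
}

// Fine for a path: one pair of wrapper quotes, nothing inside to clash.
let path = naiveDoubleQuote("~/.hermes/config.yaml")

// Not fine for a script: its own double quotes close the wrapper early,
// so the remote shell no longer sees one intact argument.
let script = "H=\"$HOME/.hermes\"\nif [ -r \"$H/config.yaml\" ]; then echo TIER2:0; fi"
let quoted = naiveDoubleQuote(script)
let quoteCount = quoted.filter { $0 == "\"" }.count
print(path)
print("double quotes in wrapped script: \(quoteCount)")
```

The wrapped script carries six unescaped double quotes instead of two, which
is the kind of input a remote shell re-parses into something other than the
original `if`-test.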

`RemoteDiagnosticsViewModel` already documented this exact bug and
worked around it locally: invoke `/usr/bin/ssh ... -- /bin/sh -s`
directly and pipe the script via stdin so it travels as opaque
bytes. The pill never got the same treatment, hence the silent
disagreement. The #53 granular-cause script I added a few commits
back made the mangling worse — more $VARs, more `[ ! -e ]` tests,
more nested quoting, all things that increase the runProcess
quoting attack surface.
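
The stdin approach can be sketched locally, without ssh, under the assumption
that `/bin/sh` exists on the machine. `runViaStdin` is a hypothetical name used
only for this illustration; it blocks rather than polling against a timeout,
so it is simpler than the shipped helper:

```swift
import Foundation

/// Pipes `script` to `/bin/sh -s` on stdin so no argv quoting is involved.
/// Hypothetical helper for illustration; local only, no timeout handling.
func runViaStdin(_ script: String) throws -> (stdout: String, exitCode: Int32) {
    let proc = Process()
    proc.executableURL = URL(fileURLWithPath: "/bin/sh")
    proc.arguments = ["-s"] // read the script from stdin
    let stdinPipe = Pipe()
    let stdoutPipe = Pipe()
    proc.standardInput = stdinPipe
    proc.standardOutput = stdoutPipe
    try proc.run()
    // Write the script, then close the write end so sh sees EOF.
    try stdinPipe.fileHandleForWriting.write(contentsOf: Data((script + "\n").utf8))
    try stdinPipe.fileHandleForWriting.close()
    // Drain stdout before waiting so a chatty script cannot fill the pipe.
    let data = stdoutPipe.fileHandleForReading.readDataToEndOfFile()
    proc.waitUntilExit()
    return (String(decoding: data, as: UTF8.self), proc.terminationStatus)
}

// A script with quoted variable references and a test, the shapes that
// were being mangled when sent as a quoted argument.
let stdinScript = """
V="a b"
if [ "$V" = "a b" ]; then echo MATCH; else echo MISMATCH; fi
"""
let stdinResult = try? runViaStdin(stdinScript)
print(stdinResult?.stdout ?? "failed to launch /bin/sh")
```

Because the script never passes through an argument-quoting layer, the shell
parses exactly the bytes that were written.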

Move the diagnostics workaround into shared ScarfCore code as
`SSHScriptRunner.run(script:context:timeout:)`. Both the pill probe
and the diagnostics view now use it, so they always see the same
remote shell state. macOS-only via `#if os(macOS)` (Foundation.Process
isn't on iOS); iOS callers never reach this surface anyway —
ScarfGo uses Citadel-based SSH transports for its own flows.
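
A caller of the shared runner ends up switching over its two-case outcome.
The sketch below mirrors that enum locally so it compiles on its own;
`ProbeStatus` and `classify` are hypothetical simplifications of the pill's
logic, and only the `TIER1:0` / `TIER2:0` markers are taken from the diff:

```swift
// Local mirror of SSHScriptRunner.Outcome so the sketch is self-contained.
enum Outcome {
    case connectFailure(String)
    case completed(stdout: String, stderr: String, exitCode: Int32)
}

// Hypothetical, simplified stand-in for the pill's result type.
enum ProbeStatus: Equatable {
    case connected
    case degraded
    case failed(String)
}

/// Maps a runner outcome to a coarse status using the tier markers.
func classify(_ outcome: Outcome) -> ProbeStatus {
    switch outcome {
    case .connectFailure(let message):
        return .failed(message)
    case .completed(let stdout, let stderr, let exitCode):
        guard exitCode == 0 else { return .failed(stderr) }
        // Tier 1 missing means the script never really ran.
        guard stdout.contains("TIER1:0") else { return .failed(stdout) }
        // Tier 1 passed; tier 2 decides between connected and degraded.
        return stdout.contains("TIER2:0") ? .connected : .degraded
    }
}

let healthy = classify(.completed(stdout: "TIER1:0\nTIER2:0\n", stderr: "", exitCode: 0))
let unreadable = classify(.completed(stdout: "TIER1:0\nTIER2:1\n", stderr: "", exitCode: 0))
let unreachable = classify(.connectFailure("Script timed out after 10s"))
print(healthy, unreadable, unreachable)
```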

Other tidy-ups:
- `ConnectionStatusViewModel` no longer holds a `transport` instance
  — the field was only used by the now-replaced runProcess path.
- `RemoteDiagnosticsViewModel` loses ~120 lines of duplicated
  `runOverSSH` / `runLocally` / `controlDirPath` helpers; calls into
  `SSHScriptRunner.run` directly.

Risk: low. The SSH path is the same shape that's been shipping in
the diagnostics view since #19. The pill's 15s heartbeat gains a
small forking-an-ssh-process overhead vs the ControlMaster-
multiplexed runProcess, which is invisible at that cadence and
amortized by ssh's own ControlMaster (the `-o ControlMaster=auto`
options match SSHTransport's, so the multiplex socket is shared).
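
For reference, the argv shape that lets the runner share the multiplex socket
can be sketched as follows. `SketchSSHConfig` and `sshArguments` are
hypothetical names, and `controlDir` is an assumed parameter (the real code
asks `SSHTransport.controlDirPath()`); only a subset of the options in the
diff is reproduced:

```swift
// Hypothetical value type standing in for the project's SSHConfig.
struct SketchSSHConfig {
    var host: String
    var user: String?
    var port: Int?
    var identityFile: String?
}

/// Builds arguments for /usr/bin/ssh so the remote runs `/bin/sh -s`.
func sshArguments(for config: SketchSSHConfig, controlDir: String) -> [String] {
    var argv = [
        "-o", "ControlMaster=auto",              // reuse or create the master
        "-o", "ControlPath=\(controlDir)/%C",    // same socket dir => shared mux
        "-o", "ControlPersist=600",
        "-o", "BatchMode=yes",                   // never prompt
        "-T",                                    // no pty
    ]
    if let port = config.port { argv += ["-p", String(port)] }
    if let id = config.identityFile, !id.isEmpty { argv += ["-i", id] }
    if let user = config.user, !user.isEmpty {
        argv.append("\(user)@\(config.host)")
    } else {
        argv.append(config.host)
    }
    argv += ["--", "/bin/sh", "-s"]              // script arrives on stdin
    return argv
}

let sketchArgv = sshArguments(
    for: SketchSSHConfig(host: "example.com", user: "alice", port: 2222),
    controlDir: "/tmp/ctl"
)
print(sketchArgv.joined(separator: " "))
```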

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alan Wizemann
2026-04-27 14:08:25 +02:00
parent 0bfae1227a
commit f72bf6e30b
3 changed files with 208 additions and 182 deletions
@@ -0,0 +1,183 @@
import Foundation
/// Runs multi-line shell scripts on a server (local or SSH) without
/// going through `ServerTransport.runProcess`.
///
/// **Why this exists.** `SSHTransport.runProcess` quotes every argument
/// via `remotePathArg` (it rewrites `~/` → `$HOME/`), which is correct
/// for path arguments but mangles a multi-line script containing
/// `"$VAR"` references, nested quotes, and control structures. The
/// remote receives a scrambled string and the script silently
/// produces no useful output.
///
/// `RemoteDiagnosticsViewModel` originally documented this and worked
/// around it locally. Issue #44 surfaced the same bug for the
/// connection-status pill: a multi-line probe script sent through
/// `runProcess` made tier 2 always read as failed even when the file
/// was readable, while diagnostics, which used the workaround,
/// reported 14/14 passing. This helper centralises the workaround so
/// any future caller running a script gets it for free.
///
/// **Approach.** We invoke `/usr/bin/ssh ... -- /bin/sh -s` directly
/// and pipe the script via stdin, so the script travels as a single
/// opaque byte stream that the remote shell parses unchanged. Local
/// contexts skip ssh and run the script with `/bin/sh -c`; the
/// outcome has the same shape, so callers can treat both uniformly.
public enum SSHScriptRunner {
public enum Outcome: Sendable {
/// Couldn't even reach the remote (process spawn failed,
/// timeout before any output, network refused). Carries the
/// human-readable reason.
case connectFailure(String)
/// Script ran to completion (or until timeout cut it short
/// after producing partial output). Exit code, stdout, stderr
/// are reported as captured.
case completed(stdout: String, stderr: String, exitCode: Int32)
}
/// Run `script` against the given context. Times out after
/// `timeout` seconds, killing the subprocess if it overruns.
///
/// **Platforms.** The real implementation is macOS-only: it relies on
/// `Foundation.Process`, which iOS doesn't ship. iOS callers
/// (ScarfGo) use Citadel-backed SSH transports for their own
/// flows; they never reach this entry point. To keep ScarfCore
/// cross-platform we return a connect failure on non-macOS so
/// the file compiles everywhere.
public static func run(script: String, context: ServerContext, timeout: TimeInterval = 30) async -> Outcome {
#if os(macOS)
switch context.kind {
case .local:
return await runLocally(script: script, timeout: timeout)
case .ssh(let config):
return await runOverSSH(script: script, config: config, timeout: timeout)
}
#else
return .connectFailure("SSHScriptRunner is only available on macOS")
#endif
}
// MARK: - SSH path
#if os(macOS)
private static func runOverSSH(script: String, config: SSHConfig, timeout: TimeInterval) async -> Outcome {
var sshArgv: [String] = [
"-o", "ControlMaster=auto",
"-o", "ControlPath=\(SSHTransport.controlDirPath())/%C",
"-o", "ControlPersist=600",
"-o", "ServerAliveInterval=30",
"-o", "ConnectTimeout=10",
"-o", "StrictHostKeyChecking=accept-new",
"-o", "LogLevel=QUIET",
"-o", "BatchMode=yes",
"-T", // no pty: keep stdin/stdout a clean byte stream
]
if let port = config.port { sshArgv += ["-p", String(port)] }
if let id = config.identityFile, !id.isEmpty {
sshArgv += ["-i", id]
}
let hostSpec: String
if let user = config.user, !user.isEmpty { hostSpec = "\(user)@\(config.host)" }
else { hostSpec = config.host }
sshArgv.append(hostSpec)
sshArgv.append("--")
sshArgv.append("/bin/sh")
sshArgv.append("-s") // read script from stdin
return await Task.detached { () -> Outcome in
let proc = Process()
proc.executableURL = URL(fileURLWithPath: "/usr/bin/ssh")
proc.arguments = sshArgv
// Inherit the shell-derived SSH_AUTH_SOCK so ssh can reach ssh-agent.
// Same path SSHTransport uses internally; see
// `environmentEnricher`, set at app boot.
var env = ProcessInfo.processInfo.environment
if let enricher = SSHTransport.environmentEnricher {
let shellEnv = enricher()
for key in ["SSH_AUTH_SOCK", "SSH_AGENT_PID"] {
if env[key] == nil, let v = shellEnv[key], !v.isEmpty {
env[key] = v
}
}
}
proc.environment = env
let stdinPipe = Pipe()
let stdoutPipe = Pipe()
let stderrPipe = Pipe()
proc.standardInput = stdinPipe
proc.standardOutput = stdoutPipe
proc.standardError = stderrPipe
do {
try proc.run()
} catch {
return .connectFailure("Failed to launch ssh: \(error.localizedDescription)")
}
if let data = script.data(using: .utf8) {
try? stdinPipe.fileHandleForWriting.write(contentsOf: data)
}
try? stdinPipe.fileHandleForWriting.close()
let deadline = Date().addingTimeInterval(timeout)
while proc.isRunning && Date() < deadline {
try? await Task.sleep(nanoseconds: 100_000_000)
}
if proc.isRunning {
proc.terminate()
return .connectFailure("Script timed out after \(Int(timeout))s")
}
let out = (try? stdoutPipe.fileHandleForReading.readToEnd()) ?? Data()
let err = (try? stderrPipe.fileHandleForReading.readToEnd()) ?? Data()
// Best-effort fd close; Pipe leaks fds otherwise.
try? stdoutPipe.fileHandleForReading.close()
try? stderrPipe.fileHandleForReading.close()
return .completed(
stdout: String(data: out, encoding: .utf8) ?? "",
stderr: String(data: err, encoding: .utf8) ?? "",
exitCode: proc.terminationStatus
)
}.value
}
// MARK: - Local path
private static func runLocally(script: String, timeout: TimeInterval) async -> Outcome {
return await Task.detached { () -> Outcome in
let proc = Process()
proc.executableURL = URL(fileURLWithPath: "/bin/sh")
proc.arguments = ["-c", script]
let stdoutPipe = Pipe()
let stderrPipe = Pipe()
proc.standardOutput = stdoutPipe
proc.standardError = stderrPipe
do {
try proc.run()
} catch {
return .connectFailure("Failed to launch /bin/sh: \(error.localizedDescription)")
}
let deadline = Date().addingTimeInterval(timeout)
while proc.isRunning && Date() < deadline {
try? await Task.sleep(nanoseconds: 100_000_000)
}
if proc.isRunning {
proc.terminate()
return .connectFailure("Script timed out after \(Int(timeout))s")
}
let out = (try? stdoutPipe.fileHandleForReading.readToEnd()) ?? Data()
let err = (try? stderrPipe.fileHandleForReading.readToEnd()) ?? Data()
try? stdoutPipe.fileHandleForReading.close()
try? stderrPipe.fileHandleForReading.close()
return .completed(
stdout: String(data: out, encoding: .utf8) ?? "",
stderr: String(data: err, encoding: .utf8) ?? "",
exitCode: proc.terminationStatus
)
}.value
}
#endif // os(macOS)
}
@@ -70,12 +70,10 @@ public final class ConnectionStatusViewModel {
     private let consecutiveFailureThreshold = 2
     public let context: ServerContext
-    private let transport: any ServerTransport
     private var probeTask: Task<Void, Never>?
     public init(context: ServerContext) {
         self.context = context
-        self.transport = context.makeTransport()
         if !context.isRemote {
             // Local contexts are always considered connected: no network
             // or auth can fail.
@@ -108,7 +106,7 @@ public final class ConnectionStatusViewModel {
     }
     private func probeOnce() async {
-        let snapshot = transport
+        let snapshot = context
         let hermesHome = context.paths.home
         // Two-tier probe in one SSH round-trip:
         // tier 1: `true` (raw connectivity / auth / ControlMaster path)
@@ -162,39 +160,38 @@ public final class ConnectionStatusViewModel {
             case failure(TransportError)
         }
-        let outcome: ProbeOutcome = await Task.detached {
-            do {
-                let probe = try snapshot.runProcess(
-                    executable: "/bin/sh",
-                    args: ["-c", script],
-                    stdin: nil,
-                    timeout: 10
-                )
-                guard probe.exitCode == 0 else {
-                    return .failure(.commandFailed(exitCode: probe.exitCode, stderr: probe.stderrString))
+        // Issue #44: previously this used `transport.runProcess(executable:
+        // "/bin/sh", args: ["-c", script])`, which goes through
+        // SSHTransport's `remotePathArg` quoting. That mangles multi-line
+        // shell scripts containing `"$VAR"` references and nested
+        // quotes: the remote received a scrambled string and the if-test
+        // for config.yaml readability silently failed even when the file
+        // was readable. Result: 14/14 diagnostics passing AND a stuck
+        // "Connected — can't read Hermes state" pill, simultaneously,
+        // because diagnostics had its own runOverSSH workaround. Now
+        // both paths use SSHScriptRunner so they always agree.
+        let outcome: ProbeOutcome = await {
+            let result = await SSHScriptRunner.run(script: script, context: snapshot, timeout: 10)
+            switch result {
+            case .connectFailure(let msg):
+                return .failure(.other(message: msg))
+            case .completed(let out, let stderr, let exitCode):
+                guard exitCode == 0 else {
+                    return .failure(.commandFailed(exitCode: exitCode, stderr: stderr))
                 }
-                let out = probe.stdoutString
                 let tier1 = out.contains("TIER1:0")
                 let tier2 = out.contains("TIER2:0")
                 if !tier1 {
-                    // The script itself didn't reach tier 1; treat as connection failure.
                     return .failure(.commandFailed(exitCode: 1, stderr: out))
                 }
                 if tier2 {
                     return .connected
                 }
-                // Connected but tier 2 failed. Parse the granular cause
-                // code; older remotes that don't emit a tag fall through
-                // to `.unknown` with a generic hint (issue #53).
                 let cause = Self.parseDegradedCause(stdout: out)
                 let (reason, hint) = Self.describe(cause: cause, hermesHome: hermesHome)
                 return .degraded(reason: reason, hint: hint, cause: cause)
-            } catch let e as TransportError {
-                return .failure(e)
-            } catch {
-                return .failure(.other(message: error.localizedDescription))
             }
-        }.value
+        }()
         switch outcome {
         case .connected:
@@ -123,7 +123,11 @@ final class RemoteDiagnosticsViewModel {
         finishedAt = nil
         let script = Self.buildScript(hermesHome: context.paths.home)
-        let captured = await Self.execute(script: script, context: context)
+        // Use the shared SSHScriptRunner so this view model and the
+        // ConnectionStatusViewModel pill always agree on what the
+        // remote sees (issue #44: the prior local copies of the
+        // workaround drifted from each other).
+        let captured = await SSHScriptRunner.run(script: script, context: context, timeout: 30)
         switch captured {
         case .connectFailure(let msg):
@@ -282,164 +286,6 @@ final class RemoteDiagnosticsViewModel {
"""#
}
enum Captured {
case connectFailure(String)
case completed(stdout: String, stderr: String, exitCode: Int32)
}
private static func execute(script: String, context: ServerContext) async -> Captured {
// Can't use `transport.runProcess(executable: "/bin/sh", args: ["-c", script])`
// here: SSHTransport.runProcess pipes every argument through
// `remotePathArg` (which double-quotes to rewrite `~/` → `$HOME/`),
// which mangles a multi-line shell script containing `"$1"`,
// nested quotes, and `printf` escape sequences. The result on the
// remote is a scrambled string and every probe fails to emit.
//
// Mirror TestConnectionProbe's approach: build the ssh argv
// directly so the script travels as a single opaque argv entry
// that ssh forwards to the remote shell unchanged.
switch context.kind {
case .local:
return await runLocally(script: script)
case .ssh(let config):
return await runOverSSH(script: script, config: config)
}
}
/// Direct ssh invocation. Pipes the script into `sh` on stdin rather
/// than passing it as `sh -c <script>` argv because ssh concatenates
/// argv with spaces and sends that as a single command string to the
/// remote's LOGIN shell, which then parses newlines as command
/// separators. A multi-line `sh -c <script>` would run only the first
/// line inside the `sh` subprocess (any variables set there die when
/// `sh` exits), and the rest would run in the login shell with no
/// access to those variables. Symptom: `$H=""` everywhere downstream.
///
/// Feeding the script via stdin avoids the split entirely: `sh -s`
/// consumes the whole stream in one process, so variable scope is
/// preserved and the script runs exactly the same way it would from
/// a local `cat script.sh | sh`.
private static func runOverSSH(script: String, config: SSHConfig) async -> Captured {
var sshArgv: [String] = [
"-o", "ControlMaster=auto",
"-o", "ControlPath=\(controlDirPath())/%C",
"-o", "ControlPersist=600",
"-o", "ServerAliveInterval=30",
"-o", "ConnectTimeout=10",
"-o", "StrictHostKeyChecking=accept-new",
"-o", "LogLevel=QUIET",
"-o", "BatchMode=yes",
"-T" // no pty: keep stdin/stdout a clean byte stream
]
if let port = config.port { sshArgv += ["-p", String(port)] }
if let id = config.identityFile, !id.isEmpty {
sshArgv += ["-i", id]
}
let hostSpec: String
if let user = config.user, !user.isEmpty { hostSpec = "\(user)@\(config.host)" }
else { hostSpec = config.host }
sshArgv.append(hostSpec)
sshArgv.append("--")
sshArgv.append("/bin/sh")
sshArgv.append("-s") // read script from stdin
return await Task.detached { () -> Captured in
let proc = Process()
proc.executableURL = URL(fileURLWithPath: "/usr/bin/ssh")
proc.arguments = sshArgv
// Inherit the shell's SSH_AUTH_SOCK so ssh can reach the
// agent; same pattern as SSHTransport + TestConnectionProbe.
var env = ProcessInfo.processInfo.environment
let shellEnv = HermesFileService.enrichedEnvironment()
for key in ["SSH_AUTH_SOCK", "SSH_AGENT_PID"] {
if env[key] == nil, let v = shellEnv[key], !v.isEmpty {
env[key] = v
}
}
proc.environment = env
let stdinPipe = Pipe()
let stdoutPipe = Pipe()
let stderrPipe = Pipe()
proc.standardInput = stdinPipe
proc.standardOutput = stdoutPipe
proc.standardError = stderrPipe
do {
try proc.run()
} catch {
return .connectFailure("Failed to launch ssh: \(error.localizedDescription)")
}
// Write the script to ssh's stdin, then close the write end so
// remote sh sees EOF and exits after executing the whole script.
if let data = script.data(using: .utf8) {
try? stdinPipe.fileHandleForWriting.write(contentsOf: data)
}
try? stdinPipe.fileHandleForWriting.close()
let deadline = Date().addingTimeInterval(30)
while proc.isRunning && Date() < deadline {
try? await Task.sleep(nanoseconds: 100_000_000)
}
if proc.isRunning {
proc.terminate()
return .connectFailure("Diagnostics timed out after 30s")
}
let out = (try? stdoutPipe.fileHandleForReading.readToEnd()) ?? Data()
let err = (try? stderrPipe.fileHandleForReading.readToEnd()) ?? Data()
return .completed(
stdout: String(data: out, encoding: .utf8) ?? "",
stderr: String(data: err, encoding: .utf8) ?? "",
exitCode: proc.terminationStatus
)
}.value
}
/// Local shell invocation: runs the diagnostic script against the
/// user's own Mac. Less useful than the remote form (most checks will
/// trivially pass), but lets the same UI work for both contexts.
private static func runLocally(script: String) async -> Captured {
return await Task.detached { () -> Captured in
let proc = Process()
proc.executableURL = URL(fileURLWithPath: "/bin/sh")
proc.arguments = ["-c", script]
let stdoutPipe = Pipe()
let stderrPipe = Pipe()
proc.standardOutput = stdoutPipe
proc.standardError = stderrPipe
do {
try proc.run()
} catch {
return .connectFailure("Failed to launch /bin/sh: \(error.localizedDescription)")
}
let deadline = Date().addingTimeInterval(10)
while proc.isRunning && Date() < deadline {
try? await Task.sleep(nanoseconds: 100_000_000)
}
if proc.isRunning {
proc.terminate()
return .connectFailure("Local diagnostics timed out (should be <1s)")
}
let out = (try? stdoutPipe.fileHandleForReading.readToEnd()) ?? Data()
let err = (try? stderrPipe.fileHandleForReading.readToEnd()) ?? Data()
return .completed(
stdout: String(data: out, encoding: .utf8) ?? "",
stderr: String(data: err, encoding: .utf8) ?? "",
exitCode: proc.terminationStatus
)
}.value
}
/// Same cache directory used by SSHTransport; shared so the diagnostic
/// probe reuses the connection's ControlMaster socket when it already
/// exists (no second TCP handshake, no second auth).
private static func controlDirPath() -> String {
SSHTransport.controlDirPath()
}
private static func parse(stdout: String, stderr: String, exitCode: Int32) -> [Probe] {
var results: [ProbeID: Probe] = [:]
for line in stdout.split(whereSeparator: { $0 == "\n" || $0 == "\r" }) {