How we built real-time interpreter matching on mobile

June 18, 2024 · 10 min read · 2 comments

The core feature of Reach is connecting hospital staff with interpreters. A nurse opens the app, selects a language, and within seconds is connected to an interpreter who can facilitate communication with a patient. This happens at all hours, across time zones, on devices ranging from the latest iPhone to three-year-old Android phones on slow networks.

The matching system is the most complex piece of our architecture. Here's how it works.

The WebSocket layer

Every active user maintains a WebSocket connection to the matching service. We use raw WebSocket (ws library on the server, React Native's built-in WebSocket on the client) rather than Socket.io, because we need precise control over reconnection behaviour and message timing.

The connection lifecycle on the client:

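class MatchingConnection {
  private ws: WebSocket | null = null;
  private reconnectAttempt = 0;
  // ReturnType<typeof setInterval> resolves to the right timer type
  // whether the typings come from Node or React Native
  private heartbeatInterval: ReturnType<typeof setInterval> | null = null;

  connect() {
    this.ws = new WebSocket(MATCHING_WS_URL);

    this.ws.onopen = () => {
      // A successful connection resets the backoff
      this.reconnectAttempt = 0;
      this.startHeartbeat();
      this.sendAvailabilityStatus();
    };

    this.ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      this.handleMessage(message);
    };

    this.ws.onclose = () => {
      this.stopHeartbeat();
      this.scheduleReconnect();
    };
  }

  private scheduleReconnect() {
    // Exponential backoff: 1s, 2s, 4s, then capped
    const delay = Math.min(
      1000 * Math.pow(2, this.reconnectAttempt),
      5000 // Cap at 5 seconds for time-sensitive matching
    );
    this.reconnectAttempt++;
    setTimeout(() => this.connect(), delay);
  }

  private startHeartbeat() {
    this.stopHeartbeat(); // Never run two heartbeat timers at once
    this.heartbeatInterval = setInterval(() => {
      if (this.ws?.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({ type: 'heartbeat' }));
      }
    }, 15000);
  }

  private stopHeartbeat() {
    if (this.heartbeatInterval !== null) {
      clearInterval(this.heartbeatInterval);
      this.heartbeatInterval = null;
    }
  }

  // handleMessage() and sendAvailabilityStatus() omitted for brevity
}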

The reconnection cap is 5 seconds, not the typical 30 seconds used in non-critical applications. An interpreter who disconnects and reconnects needs to be back in the available pool quickly.
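To make the effect of the cap concrete, the delay calculation can be pulled out into a pure function. This is a minimal sketch: `reconnectDelay` and its parameter names are illustrative and not part of the class above.

```typescript
// Illustrative helper (hypothetical name): the same capped exponential
// backoff used by scheduleReconnect, expressed as a pure function.
function reconnectDelay(attempt: number, baseMs = 1000, capMs = 5000): number {
  return Math.min(baseMs * Math.pow(2, attempt), capMs);
}

// Delays for the first five reconnect attempts with the 5-second cap.
const schedule = [0, 1, 2, 3, 4].map((attempt) => reconnectDelay(attempt));
console.log(schedule); // [1000, 2000, 4000, 5000, 5000]
```

With a 30-second cap, the fourth, fifth, and sixth attempts would instead wait 8, 16, and 30 seconds, which is a long time for an interpreter to be missing from the pool.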

The availability state machine

Each interpreter has an availability status that follows a strict state machine:

OFFLINE → AVAILABLE → MATCHING → IN_SESSION → AVAILABLE
                      MATCHING → DECLINED   → AVAILABLE
                      MATCHING → TIMEOUT    → AVAILABLE
AVAILABLE → OFFLINE
IN_SESSION → OFFLINE (app killed during session)
MATCHING, DECLINED, TIMEOUT → OFFLINE

States:

OFFLINE: WebSocket disconnected or app in background for more than 5 minutes
AVAILABLE: connected, app in foreground, ready to accept sessions
MATCHING: the system is attempting to connect this interpreter with a request
IN_SESSION: actively in a session
DECLINED: interpreter declined the match request
TIMEOUT: interpreter didn't respond within the timeout window

The state machine is enforced server-side. The client sends status updates, but the server validates every transition before applying it. An interpreter can't go from OFFLINE to IN_SESSION without passing through AVAILABLE and MATCHING first.

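// Status union inferred from the states listed above
type Status =
  | 'OFFLINE'
  | 'AVAILABLE'
  | 'MATCHING'
  | 'IN_SESSION'
  | 'DECLINED'
  | 'TIMEOUT';

const validTransitions: Record<Status, Status[]> = {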
  OFFLINE: ['AVAILABLE'],
  AVAILABLE: ['MATCHING', 'OFFLINE'],
  MATCHING: ['IN_SESSION', 'DECLINED', 'TIMEOUT', 'OFFLINE'],
  IN_SESSION: ['AVAILABLE', 'OFFLINE'],
  DECLINED: ['AVAILABLE', 'OFFLINE'],
  TIMEOUT: ['AVAILABLE', 'OFFLINE'],
};

function validateTransition(current: Status, next: Status): boolean {
  return validTransitions[current]?.includes(next) ?? false;
}
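As a sketch of how the server might use this check when a status update arrives, the helper below rejects an illegal update and leaves the stored status unchanged. `applyStatusUpdate` and its return shape are assumptions for illustration. The `Status` type is inferred from the states listed above, and the table is repeated so the example runs on its own.

```typescript
type Status =
  | 'OFFLINE' | 'AVAILABLE' | 'MATCHING'
  | 'IN_SESSION' | 'DECLINED' | 'TIMEOUT';

const validTransitions: Record<Status, Status[]> = {
  OFFLINE: ['AVAILABLE'],
  AVAILABLE: ['MATCHING', 'OFFLINE'],
  MATCHING: ['IN_SESSION', 'DECLINED', 'TIMEOUT', 'OFFLINE'],
  IN_SESSION: ['AVAILABLE', 'OFFLINE'],
  DECLINED: ['AVAILABLE', 'OFFLINE'],
  TIMEOUT: ['AVAILABLE', 'OFFLINE'],
};

function validateTransition(current: Status, next: Status): boolean {
  return validTransitions[current]?.includes(next) ?? false;
}

// Hypothetical handler: returns the status to store and whether the
// requested update was accepted.
function applyStatusUpdate(
  current: Status,
  requested: Status
): { status: Status; accepted: boolean } {
  if (!validateTransition(current, requested)) {
    return { status: current, accepted: false }; // keep the old status
  }
  return { status: requested, accepted: true };
}

// A client claiming to be in a session while the server has it OFFLINE is rejected.
const rejected = applyStatusUpdate('OFFLINE', 'IN_SESSION');
// An accept during MATCHING is a legal transition.
const accepted = applyStatusUpdate('MATCHING', 'IN_SESSION');
console.log(rejected, accepted);
```

Keeping the stored status on rejection means a buggy or stale client cannot push an interpreter into a state the matcher does not expect.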

The matching sequence

When a hospital staff member requests an interpreter:

Request received: the server creates a match request with the required language and priority level
Pool query: the server queries available interpreters who speak the requested language, ordered by: language proficiency rating, response time history, and time since last session
Match attempt: the server sends a match request to the top-ranked interpreter via WebSocket
Response window: the interpreter has 20 seconds to accept or decline
Accept: the server creates a session and connects both parties
Decline or timeout: the server moves to the next interpreter in the pool
Exhaustion: if no interpreters accept, the request enters a queue and interpreters are notified as they become available

The entire sequence targets completion in under 30 seconds for the common case where an interpreter is available.

async function matchRequest(request: MatchRequest): Promise<Session | null> {
  const candidates = await getAvailableInterpreters(request.language);

  for (const interpreter of candidates) {
    const accepted = await offerMatch(interpreter.id, request, 20000);

    if (accepted) {
      return createSession(request, interpreter);
    }
    // If declined or timed out, try the next candidate
  }

  // No immediate match available
  await enqueueRequest(request);
  return null;
}
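The ordering applied in the pool query step can be sketched as a comparator. The field names (`proficiency`, `avgResponseMs`, `lastSessionEndedAt`) and the direction of each criterion (higher proficiency first, then faster average response, then longest idle) are assumptions: the three criteria come from the list above, but how they are stored or weighted is not specified.

```typescript
// Hypothetical candidate shape; field names are illustrative.
interface Candidate {
  id: string;
  proficiency: number;        // higher is better (assumed scale)
  avgResponseMs: number;      // lower is better
  lastSessionEndedAt: number; // epoch ms; smaller means idle for longer
}

// Sort by each criterion in turn; later criteria only break ties.
function rankCandidates(pool: Candidate[]): Candidate[] {
  return [...pool].sort(
    (a, b) =>
      b.proficiency - a.proficiency ||
      a.avgResponseMs - b.avgResponseMs ||
      a.lastSessionEndedAt - b.lastSessionEndedAt
  );
}

const ranked = rankCandidates([
  { id: 'a', proficiency: 4, avgResponseMs: 6000, lastSessionEndedAt: 1000 },
  { id: 'b', proficiency: 5, avgResponseMs: 9000, lastSessionEndedAt: 3000 },
  { id: 'c', proficiency: 5, avgResponseMs: 4000, lastSessionEndedAt: 2000 },
]);
console.log(ranked.map((c) => c.id)); // ['c', 'b', 'a']
```

A strict lexicographic order like this is only one option; a weighted score would let a much faster responder outrank a slightly more proficient one.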

Push notification fallback

WebSocket connections aren't reliable on mobile. The app might be in the background. The operating system might have killed the WebSocket connection to save battery. The network might have switched from Wi-Fi to cellular.

When the WebSocket delivery fails (no pong response within 3 seconds), the server immediately sends a push notification:

async function offerMatch(
  interpreterId: string,
  request: MatchRequest,
  timeoutMs: number
): Promise<boolean> {
  // Try WebSocket first
  const wsDelivered = await sendViaWebSocket(interpreterId, {
    type: 'match_offer',
    requestId: request.id,
    language: request.language,
    priority: request.priority,
  });

  if (!wsDelivered) {
    // No pong over WebSocket within 3 seconds: fall back to a push notification
    await sendPushNotification(interpreterId, {
      title: 'Session Request',
      body: `${request.language} interpreter needed`,
      data: { requestId: request.id },
    });
  }

  // Wait for response regardless of delivery method
  return waitForResponse(interpreterId, request.id, timeoutMs);
}

The push notification opens the app and establishes a WebSocket connection. The interpreter can then accept the match through the normal flow.

This fallback path adds 2-3 seconds to the matching time. We track the ratio of WebSocket vs push notification deliveries to monitor connection health.
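`offerMatch` relies on `waitForResponse` resolving the same way whether the reply arrives over the original WebSocket or after a push notification reopens the app. One way to sketch that is a map of pending offers, each settled by either a reply or a timer. The `pendingOffers` map, the key format, and `recordResponse` are assumptions for illustration, not the production implementation.

```typescript
// Pending offers keyed by "interpreterId:requestId" (assumed key format).
const pendingOffers = new Map<string, (accepted: boolean) => void>();

function waitForResponse(
  interpreterId: string,
  requestId: string,
  timeoutMs: number
): Promise<boolean> {
  const key = `${interpreterId}:${requestId}`;
  return new Promise<boolean>((resolve) => {
    // No reply inside the window counts as "not accepted".
    const timer = setTimeout(() => {
      pendingOffers.delete(key);
      resolve(false);
    }, timeoutMs);

    pendingOffers.set(key, (accepted) => {
      clearTimeout(timer);
      pendingOffers.delete(key);
      resolve(accepted);
    });
  });
}

// Hypothetical hook for the message handler: call it when an accept or
// decline arrives. Returns false for late or unknown replies.
function recordResponse(
  interpreterId: string,
  requestId: string,
  accepted: boolean
): boolean {
  const settle = pendingOffers.get(`${interpreterId}:${requestId}`);
  if (!settle) return false;
  settle(accepted);
  return true;
}
```

Because the pending entry is removed on both paths, a reply that arrives after the window has closed is ignored rather than creating a session for a request that has already moved on to the next interpreter.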

Monitoring

We track:

Matching latency: time from request to session creation (p50, p95, p99)
Match rate: percentage of requests that find an interpreter within 60 seconds
WebSocket delivery rate: percentage of match offers delivered via WebSocket vs push notification
Interpreter response time: how long interpreters take to accept or decline
Queue depth: number of unmatched requests waiting

Alerts fire when:

p95 matching latency exceeds 45 seconds
Match rate drops below 85%
WebSocket delivery rate drops below 70% (indicates a systemic connection issue)
Queue depth exceeds 10 for more than 5 minutes
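The alert conditions above can be written as a small pure function over a metrics snapshot. This is a sketch under assumptions: the `MatchingMetrics` shape and field names are hypothetical, and in practice these rules would more likely live in the alerting tool's own configuration than in application code.

```typescript
// Hypothetical snapshot of the metrics listed above.
interface MatchingMetrics {
  p95LatencySec: number;        // p95 matching latency, in seconds
  matchRatePct: number;         // % of requests matched within 60 seconds
  wsDeliveryRatePct: number;    // % of offers delivered via WebSocket
  minutesQueueAboveTen: number; // how long queue depth has stayed above 10
}

// Returns one human-readable entry per threshold that is breached.
function evaluateAlerts(m: MatchingMetrics): string[] {
  const alerts: string[] = [];
  if (m.p95LatencySec > 45) alerts.push('p95 matching latency above 45s');
  if (m.matchRatePct < 85) alerts.push('match rate below 85%');
  if (m.wsDeliveryRatePct < 70) alerts.push('WebSocket delivery rate below 70%');
  if (m.minutesQueueAboveTen > 5) {
    alerts.push('queue depth above 10 for more than 5 minutes');
  }
  return alerts;
}

const fired = evaluateAlerts({
  p95LatencySec: 52,
  matchRatePct: 91,
  wsDeliveryRatePct: 64,
  minutesQueueAboveTen: 0,
});
console.log(fired);
// ['p95 matching latency above 45s', 'WebSocket delivery rate below 70%']
```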

These metrics are on a real-time dashboard that the engineering team monitors. When matching latency degrades, we can usually identify the cause within minutes: a region with no available interpreters, a spike in requests from a specific hospital, or a WebSocket connectivity issue.

What we learned

Real-time matching on mobile is a systems problem, not a feature. The WebSocket connection management, the state machine, the fallback mechanisms, the monitoring: these aren't optional additions to a matching algorithm. They're the matching system.

The matching algorithm itself (rank interpreters by language, proficiency, and response history) is the simplest part. The complexity is in making it work reliably across unreliable networks, diverse devices, and all hours of the day.

RESPONSES

Elena Popova · Jul 2, 2024

The state machine diagram for interpreter availability is something I've been trying to figure out how to model for a similar matching system. This is the most concrete treatment of the problem I've found. Thank you.

Fatima Al-Rashid · Jul 15, 2024

The push notification fallback to WebSocket pattern is clever. We've been debating whether to implement something similar. Useful to see it described in a production context rather than as a hypothetical.