
WebSockets in Node.js: getting it right for production

January 14, 2023 · 8 min read

Socket.io is the default answer when someone asks how to add WebSockets to a Node.js application. It works. For a demo, a hackathon project, or a small internal tool, it's the right choice because it handles fallbacks, reconnection, and room management out of the box.

For production, you need to understand what is happening underneath and make deliberate choices about reconnection, scaling, and failure handling. Socket.io doesn't solve these for you automatically, and the abstractions it provides can hide problems that surface at scale.

ws vs Socket.io

The ws library is a pure WebSocket implementation. It does one thing: WebSocket connections. No fallbacks, no automatic reconnection, no rooms, no namespaces.

Socket.io is a layer on top that adds: transport fallback (long-polling if WebSocket fails), automatic reconnection, room/namespace management, and a custom packet protocol.

The actual trade-off isn't "Socket.io has more features." It's about control and overhead.

Socket.io's custom protocol adds overhead to every message. The packet format includes metadata for event names, acknowledgements, and namespace routing. For a chat application sending occasional messages, this overhead is negligible. For a real-time system sending hundreds of messages per second per connection, it adds up.
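To see the overhead concretely, compare what lands on the wire for a single event. The payload here is hypothetical; the `42` prefix is the Engine.IO message type (`4`) followed by the Socket.IO EVENT type (`2`):

```javascript
// Hypothetical payload for a single chat event.
const payload = JSON.stringify({ text: 'hi' });

// Raw ws sends the payload as-is.
const rawFrame = payload;

// Socket.io wraps it in its packet format: "4" (Engine.IO message)
// + "2" (Socket.IO EVENT) + a JSON array of event name and args.
const socketIoFrame = `42${JSON.stringify(['message', { text: 'hi' }])}`;

console.log(rawFrame);      // {"text":"hi"}
console.log(socketIoFrame); // 42["message",{"text":"hi"}]
```

Per message the difference is only a handful of bytes, but at hundreds of messages per second per connection it compounds.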

By default, Socket.io's transport negotiation starts with an HTTP long-poll request, then upgrades to WebSocket. This adds latency to the initial connection. With raw ws, the connection starts as WebSocket immediately.
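If you stay on Socket.io but want to skip the long-poll handshake, the client can be told to open a WebSocket directly via its `transports` connection option; the trade-off is losing the fallback. A sketch (the URL would be your own server):

```javascript
// 'transports' is a socket.io-client connection option; listing
// only 'websocket' skips the initial HTTP long-poll entirely.
const options = { transports: ['websocket'] };

// import { io } from 'socket.io-client';
// const socket = io('https://example.com', options);
```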

For the healthcare application I work on, we use ws directly because connection latency matters (we are matching interpreters to hospital staff in real-time) and we need precise control over the reconnection behaviour.

Redis pub/sub for horizontal scaling

A single Node.js process can handle thousands of WebSocket connections. When you need more capacity, you run multiple processes. The problem: a client connected to Process A needs to receive a message sent by a client connected to Process B.

Redis pub/sub solves this:

import { createClient } from 'redis';
import { WebSocketServer } from 'ws';

const pub = createClient({ url: process.env.REDIS_URL });
const sub = createClient({ url: process.env.REDIS_URL });
await pub.connect();
await sub.connect();

const wss = new WebSocketServer({ port: 8080 });

const clientsByChannel = new Map();

wss.on('connection', (ws) => {
  ws.on('message', async (data) => {
    let msg;
    try {
      msg = JSON.parse(data.toString());
    } catch {
      return; // ignore frames that aren't valid JSON
    }

    if (msg.type === 'subscribe') {
      if (!clientsByChannel.has(msg.channel)) {
        clientsByChannel.set(msg.channel, new Set());
        await sub.subscribe(msg.channel, (message) => {
          const clients = clientsByChannel.get(msg.channel);
          if (clients) {
            for (const client of clients) {
              if (client.readyState === 1) {
                client.send(message);
              }
            }
          }
        });
      }
      clientsByChannel.get(msg.channel).add(ws);
    }

    if (msg.type === 'publish') {
      await pub.publish(msg.channel, JSON.stringify(msg.payload));
    }
  });

  // Remove the socket from every channel on disconnect so dead
  // clients don't accumulate in the Sets.
  ws.on('close', () => {
    for (const clients of clientsByChannel.values()) {
      clients.delete(ws);
    }
  });
});

Every Node.js process subscribes to the same Redis channels. When a message is published, Redis delivers it to all subscribers regardless of which process they're running on. Each process then delivers the message to its locally connected clients.

This is the same pattern Socket.io's Redis adapter uses internally. Using it with raw ws gives you the same broadcast capability without the Socket.io overhead.

Connection state and heartbeats

WebSocket connections can silently die. The client's network drops, the connection enters a half-open state, and neither side knows the other is gone. The TCP keepalive timeout is typically 2 hours, which is far too long.

The fix is application-level heartbeats:

const HEARTBEAT_INTERVAL = 30_000;

wss.on('connection', (ws) => {
  ws.isAlive = true;

  ws.on('pong', () => {
    ws.isAlive = true;
  });
});

setInterval(() => {
  wss.clients.forEach((ws) => {
    if (!ws.isAlive) {
      return ws.terminate();
    }
    ws.isAlive = false;
    ws.ping();
  });
}, HEARTBEAT_INTERVAL);

Every 30 seconds, the server sends a ping to each client. The client's WebSocket implementation automatically responds with a pong (this is part of the WebSocket protocol, not application code). If the server doesn't receive a pong before the next interval, it terminates the connection.

On the client side, the browser WebSocket API doesn't expose protocol-level ping/pong, so send an application-level heartbeat instead:

let reconnectAttempt = 0;

function connect() {
  const ws = new WebSocket(url);
  let heartbeatTimer;

  ws.onopen = () => {
    reconnectAttempt = 0; // reset the backoff once connected
    heartbeatTimer = setInterval(() => {
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(JSON.stringify({ type: 'ping' }));
      }
    }, 30_000);
  };

  ws.onclose = () => {
    clearInterval(heartbeatTimer);
    setTimeout(connect, getReconnectDelay(reconnectAttempt++));
  };
}

Client-side reconnection

When a WebSocket connection drops, the client should reconnect automatically. The reconnection strategy matters:

function getReconnectDelay(attempt) {
  const base = 1000;
  const max = 30_000;
  const delay = Math.min(base * Math.pow(2, attempt), max);
  const jitter = delay * 0.2 * Math.random();
  return delay + jitter;
}

Exponential backoff prevents a thundering herd problem. If the server restarts and 10,000 clients all try to reconnect at the same time, the connection attempts alone can overwhelm it. Exponential backoff with jitter spreads the reconnection attempts over time.

The maximum delay caps the backoff at 30 seconds. Without a cap, clients would wait exponentially longer and the user experience would degrade.
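The schedule is easier to see with the jitter stripped out; this helper mirrors the deterministic part of getReconnectDelay above:

```javascript
// Base delay per attempt, before jitter: doubles each attempt,
// capped at 30 seconds.
function baseDelay(attempt) {
  return Math.min(1000 * Math.pow(2, attempt), 30_000);
}

for (let attempt = 0; attempt <= 6; attempt++) {
  console.log(`attempt ${attempt}: ${baseDelay(attempt)} ms`);
}
// attempts 0..4 give 1000, 2000, 4000, 8000, 16000 ms;
// attempt 5 onward is capped at 30000 ms
```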

Graceful shutdown

When you deploy a new version of the server, the running process needs to shut down. If it terminates immediately, all WebSocket connections drop mid-message. Clients reconnect to the new process, but any in-flight data is lost.

Graceful shutdown:

process.on('SIGTERM', () => {
  // Stop accepting new connections
  wss.close();

  // Give existing connections time to finish
  const drainTimeout = setTimeout(() => {
    process.exit(0);
  }, 10_000);

  // Notify connected clients
  wss.clients.forEach((ws) => {
    if (ws.readyState === 1) {
      ws.send(JSON.stringify({ type: 'server_shutdown' }));
      ws.close(1001, 'Server shutting down');
    }
  });

  // If all connections close before the timeout, exit early
  const checkInterval = setInterval(() => {
    if (wss.clients.size === 0) {
      clearTimeout(drainTimeout);
      clearInterval(checkInterval);
      process.exit(0);
    }
  }, 500);
});

The server stops accepting new connections, notifies existing clients (so they can reconnect to a different instance), and waits up to 10 seconds for connections to close cleanly. If any connections remain after 10 seconds, it exits anyway.
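On the client side, the server_shutdown notice allows a proactive reconnect rather than waiting for the close event and the full backoff. A sketch, where the 3-second jitter window is an assumption:

```javascript
// Decide how long to wait before proactively closing (and thus
// reconnecting) when the server announces a shutdown. Returns
// null when the message needs no reconnect action.
function shutdownReconnectDelay(msg, random = Math.random()) {
  if (msg.type !== 'server_shutdown') return null;
  // Spread reconnects over 3 seconds so thousands of clients
  // don't all hit the replacement instance at once.
  return random * 3000;
}

// In the client's message handler:
// const delay = shutdownReconnectDelay(JSON.parse(event.data));
// if (delay !== null) setTimeout(() => ws.close(), delay);
```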

Load balancer configuration

AWS ALB supports WebSocket connections natively. Once the HTTP upgrade completes, all subsequent frames travel over the same underlying TCP connection, which stays pinned to the backend instance that accepted it, so raw WebSocket traffic doesn't need sticky sessions. They matter when you rely on an HTTP long-polling fallback (as Socket.io does by default), where successive polling requests must all reach the same instance.

If you use such a fallback, enable sticky sessions on the target group:

Stickiness type: Load balancer generated cookie (duration-based)
Cookie name: AWSALB (generated by the ALB)
Duration: 86400 seconds

The idle timeout on the ALB also matters. By default, ALB closes idle connections after 60 seconds. If your heartbeat interval is longer than 60 seconds, the ALB will close the connection between heartbeats. Set the ALB idle timeout to at least twice your heartbeat interval.
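With the 30-second heartbeat used above, an idle timeout of 120 seconds leaves comfortable headroom. Using the AWS CLI (the load balancer ARN is a placeholder), the attribute is `idle_timeout.timeout_seconds`:

```shell
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-lb/abc123 \
  --attributes Key=idle_timeout.timeout_seconds,Value=120
```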

In a healthcare application, dropped connections have real consequences. An interpreter might miss a session request. A hospital staff member might not receive a critical notification. The infrastructure choices around WebSockets aren't performance optimizations. They're reliability requirements.
