AWS has hundreds of services. Tutorials enthusiastically show you dozens of them. After running Node.js backends on AWS for several years, I use about eight services consistently. Everything else is either unnecessary for my use cases or replaceable by something simpler.
This is the subset of AWS that runs the backend for a mobile app with 50,000+ active users.
Compute: ECS over EC2
We run containerized Node.js services on ECS (Elastic Container Service) with Fargate. No EC2 instances to manage. No AMIs to maintain. No OS security patches to apply.
The services are defined as ECS task definitions:
{
  "family": "api-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
      "portMappings": [{ "containerPort": 3000 }],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-service",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
ECS handles scaling. We use target tracking scaling based on CPU utilization (target: 60%) and request count per target (target: 1000). During peak hours, ECS adds tasks. During quiet hours, it removes them. We went from running 4 tasks 24/7 to running 2-6 tasks depending on load, which reduced compute costs by about 40%.
EC2 makes sense when you need GPU instances, specific kernel configurations, or workloads that don't containerize well. For standard Node.js API services, ECS Fargate is simpler and cheaper.
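The two target-tracking policies described above can be sketched as plain configuration objects. This is a sketch, not our exact Terraform: the metric type names are the real Application Auto Scaling predefined metrics, but the cooldown values are illustrative.

```typescript
// Target-tracking policies matching the text: 60% average CPU and
// 1000 requests per target. Cooldowns are illustrative, not our settings.
interface TargetTrackingPolicy {
  policyName: string;
  predefinedMetricType: string;
  targetValue: number;
  scaleOutCooldownSeconds: number;
  scaleInCooldownSeconds: number;
}

const policies: TargetTrackingPolicy[] = [
  {
    policyName: 'cpu-tracking',
    predefinedMetricType: 'ECSServiceAverageCPUUtilization',
    targetValue: 60, // add tasks when average CPU exceeds 60%
    scaleOutCooldownSeconds: 60,
    scaleInCooldownSeconds: 300, // scale in slowly to avoid flapping
  },
  {
    policyName: 'request-count-tracking',
    predefinedMetricType: 'ALBRequestCountPerTarget',
    targetValue: 1000, // add tasks when requests per target exceed 1000
    scaleOutCooldownSeconds: 60,
    scaleInCooldownSeconds: 300,
  },
];

console.log(policies.map((p) => `${p.policyName}: target ${p.targetValue}`).join('\n'));
```

Whichever policy demands more capacity wins: ECS scales out if either metric is above its target, and scales in only when both are below.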
Database: RDS PostgreSQL
Amazon RDS for PostgreSQL. Multi-AZ deployment for automatic failover. Read replicas for query-heavy workloads.
The configuration that matters:
- Instance class: db.r6g.large for production (2 vCPU, 16GB RAM). This handles our query load with room for spikes.
- Storage: gp3 with 3000 IOPS baseline. gp3 is cheaper than gp2 for the same IOPS.
- Backup retention: 7 days with point-in-time recovery. This has saved us twice.
- Parameter group: max_connections = 200, increased from the default of 83 for our instance class. With connection pooling, 200 connections support thousands of concurrent API requests.
We use PgBouncer for connection pooling, running as a sidecar container in the same ECS task. Without PgBouncer, each Node.js process opens its own connection pool, and with multiple tasks running, we quickly exhaust the database's connection limit.
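A quick back-of-the-envelope check shows why. The task, worker, and pool counts below are illustrative, not our exact settings:

```typescript
// Without a pooler, every Node.js worker process opens its own pool, and
// the totals multiply across ECS tasks. With PgBouncer in transaction
// mode, server-side connections are capped per task regardless of how
// many clients connect. All counts here are illustrative.
const maxConnections = 200; // Postgres max_connections from the parameter group
const tasks = 6;            // peak ECS task count
const workersPerTask = 4;   // e.g. Node cluster workers
const poolPerWorker = 10;   // typical node-postgres Pool size

const direct = tasks * workersPerTask * poolPerWorker; // 240: over the limit
const pgbouncerPoolSize = 20; // server connections PgBouncer opens per task
const pooled = tasks * pgbouncerPoolSize;              // 120: comfortably under

console.log({ direct, pooled, maxConnections });
```

The headroom matters during deployments, when old and new tasks briefly run side by side and the connection count doubles.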
Caching: ElastiCache Redis
Redis handles two functions:
- Session state: user sessions, WebSocket connection state, interpreter availability cache. Data that needs to be shared across API instances and accessed with low latency.
- Pub/sub: broadcasting events across instances. When an interpreter's availability changes, the update is published to Redis and all API instances receive it.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

// Cache interpreter availability
await redis.set(
  `interpreter:${id}:status`,
  JSON.stringify({ status: 'available', languages: ['es', 'fr'] }),
  'EX', 300, // 5 minute TTL
);

// Pub/sub for cross-instance events. A connection in subscribe mode can't
// run other commands, so the subscriber gets its own connection.
const sub = new Redis(process.env.REDIS_URL);
sub.subscribe('interpreter-status');
sub.on('message', (channel, message) => {
  const update = JSON.parse(message);
  broadcastToLocalClients(update);
});
ElastiCache offers cluster mode for automatic failover and read scaling, but for our workload a single cache.r6g.large node handles all caching and pub/sub needs.
Load balancing: ALB
Application Load Balancer in front of the ECS services. The key configuration for a mobile backend:
- WebSocket support: ALB supports WebSocket connections natively. Set the idle timeout comfortably above your heartbeat interval so healthy but quiet connections are never dropped (we use 120 seconds).
- Sticky sessions: a WebSocket rides a single long-lived connection, so frames always reach the same target without stickiness. We enable it anyway so that a client reconnecting after a drop lands on the task that still holds its connection state.
- Health checks: HTTP health check on the /health endpoint with a 10-second interval and 3 consecutive failures before a target is marked unhealthy.
- TLS termination: ACM certificate on the ALB. The connection between the ALB and ECS tasks is plain HTTP (internal VPC traffic).
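The health endpoint itself can be small. Here's a sketch using Node's built-in http module; the stubbed checks stand in for our real Postgres and Redis pings:

```typescript
import http from 'node:http';

type HealthCheck = () => Promise<boolean>;

// Returns 200 only when every dependency check passes. A thrown check
// counts as a failure, so the ALB sees 503 and stops routing to this task.
async function healthStatus(checks: HealthCheck[]): Promise<number> {
  const results = await Promise.all(checks.map((c) => c().catch(() => false)));
  return results.every(Boolean) ? 200 : 503;
}

const checks: HealthCheck[] = [
  async () => true, // stand-in for a SELECT 1 against Postgres
  async () => true, // stand-in for a Redis PING
];

const server = http.createServer(async (req, res) => {
  if (req.url === '/health') {
    const code = await healthStatus(checks);
    res.writeHead(code).end(code === 200 ? 'ok' : 'unhealthy');
    return;
  }
  res.writeHead(404).end();
});
// server.listen(3000);
```

Keep the checks cheap: the ALB hits every task every 10 seconds, and a slow health check can mark a busy but healthy task as dead.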
Storage: S3
S3 for user-uploaded content (profile images, session documents), application assets (translation files, configuration), and log archives.
Access pattern: the mobile app uploads to S3 via pre-signed URLs generated by the API. This avoids routing large files through the API server:
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3Client = new S3Client({});

async function getUploadUrl(key: string): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: process.env.S3_BUCKET,
    Key: key,
    ContentType: 'image/jpeg',
  });
  return getSignedUrl(s3Client, command, { expiresIn: 300 });
}
The client uploads directly to S3 using the pre-signed URL. The API never handles the file data.
Monitoring: CloudWatch
CloudWatch for logs and metrics. Structured JSON logging from the application:
const log = (level: string, message: string, meta: Record<string, unknown> = {}) => {
  console.log(JSON.stringify({
    level,
    message,
    timestamp: new Date().toISOString(),
    service: process.env.SERVICE_NAME,
    ...meta,
  }));
};
log('info', 'Session created', { sessionId: '123', language: 'es', matchTime: 4200 });
CloudWatch Logs Insights queries the structured logs:
fields @timestamp, message, sessionId, matchTime
| filter level = "info" and message = "Session created"
| stats avg(matchTime) as avgMatch, pct(matchTime, 95) as p95Match by bin(1h)
CloudWatch Alarms for operational alerts: API error rate > 5%, matching latency p95 > 45 seconds, ECS task count at maximum capacity. Alerts go to a Slack channel via SNS.
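The SNS-to-Slack bridge is a small Lambda, and its core is just parsing the CloudWatch alarm payload out of the SNS envelope. A sketch: the handler wiring and the webhook call are omitted, and `SnsRecord` is a pared-down shape, not the full event type.

```typescript
// CloudWatch alarm notifications arrive via SNS with the alarm JSON in
// Sns.Message. AlarmName, NewStateValue, and NewStateReason are standard
// fields of that payload; the record type here is trimmed to what we use.
interface SnsRecord {
  Sns: { Message: string };
}

function formatAlarm(record: SnsRecord): string {
  const alarm = JSON.parse(record.Sns.Message);
  return `[${alarm.NewStateValue}] ${alarm.AlarmName}: ${alarm.NewStateReason}`;
}
```

Posting the formatted string as the `text` field of a Slack incoming-webhook payload is all the rest of the Lambda does.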
Infrastructure as code: Terraform
All infrastructure is defined in Terraform. No manual configuration in the AWS console.
The value of Terraform isn't just reproducibility. It's code review for infrastructure changes. When someone adds a new security group rule or changes a database parameter, the change goes through a pull request with the same review process as application code.
What we tried and replaced
- Lambda for API endpoints: cold starts added 500ms-2s of latency to requests. For a real-time matching system, this was unacceptable. Replaced with ECS.
- DynamoDB for session data: the access patterns evolved beyond DynamoDB's key-value model. Complex queries required secondary indexes and scan operations that negated DynamoDB's performance advantages. Replaced with PostgreSQL.
- SQS for job queues: worked fine but we needed exactly-once processing and dead letter queue visibility that SQS didn't provide cleanly. Replaced with Redis-backed queues using BullMQ.
Cost optimization
The changes that reduced our AWS bill:
- ECS auto-scaling (reduced compute costs 40%)
- gp3 storage instead of gp2 (reduced database storage costs 20%)
- Reserved instances for the database (reduced RDS costs 35%)
- S3 lifecycle policies to move old logs to Glacier after 30 days
- Deleting unused EBS volumes and snapshots (surprisingly common accumulation)
Total monthly cost for the described setup at 50,000 active users: roughly $2,800. AWS is expensive, but the cost is predictable and the managed services reduce the operational burden enough that a small engineering team can run the infrastructure without a dedicated DevOps role.