
aws

5 posts with the tag “aws”

🚀 v2.8.2 is out, with EFS, Ephemeral Registry support, and YOLO mode (tmpfs)!

Check out the new documentation pages for EFS, the ephemeral registry, and tmpfs.

Now for the full release notes:

πŸ“ v2.8.2

Support for EFS, tmpfs, and an ephemeral ECR registry for fast Docker builds. Also some bug fixes.

What's changed

EFS

  • Embedded networking stack can now create an Elastic File System (EFS), and runners will auto-mount it at /mnt/efs if the extras label includes efs. Useful for sharing artefacts across job runs with classic filesystem primitives.
jobs:
  with-efs:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=efs
    steps:
      - run: df -ah /mnt/efs
      # 127.0.0.1:/      8.0E   35G  8.0E   1% /mnt/efs
πŸ“ Example use case for maintaining mirrors For instance this can be used to maintain local mirrors of very large github repositories and avoid long checkout times for every job:
env:
  MIRRORS: "https://github.com/PostHog/posthog.git"
  # can be ${{ github.ref }} if same repo as the workflow
  REF: main

jobs:
  with-efs:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=efs
    steps:
      - name: Setup / Refresh mirrors
        run: |
          for MIRROR in ${{ env.MIRRORS }}; do
            full_repo_name=$(echo $MIRROR | cut -d/ -f4-)
            MIRROR_DIR=/mnt/efs/mirrors/$full_repo_name
            mkdir -p "$(dirname "$MIRROR_DIR")"
            test -d "${MIRROR_DIR}" || git clone --mirror ${MIRROR/https:\/\//https:\/\/x-access-token:${{ secrets.GITHUB_TOKEN }}@} "${MIRROR_DIR}"
            ( cd "$MIRROR_DIR" && \
              git remote set-url origin ${MIRROR/https:\/\//https:\/\/x-access-token:${{ secrets.GITHUB_TOKEN }}@} && \
              git fetch origin ${{ env.REF }} )
          done
      - name: Checkout from mirror
        run: |
          git clone file:///mnt/efs/mirrors/PostHog/posthog.git --branch ${{ env.REF }} --single-branch --depth 1 upstream

Ephemeral registry

  • Support for an ephemeral ECR registry: RunsOn can now automatically create an ECR repository that acts as an ephemeral registry for pulling/pushing images and cache layers from your runners. Especially useful with the type=registry BuildKit cache instruction. If the extras label includes ecr-cache, the runners will automatically set up Docker credentials for that registry at the start of the job.
jobs:
  ecr-cache:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=ecr-cache
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v4
        env:
          TAG: ${{ env.RUNS_ON_ECR_CACHE }}:my-app-latest
        with:
          context: .
          push: true
          tags: ${{ env.TAG }}
          cache-from: type=registry,ref=${{ env.TAG }}
          cache-to: type=registry,ref=${{ env.TAG }},mode=max,compression=zstd,compression-level=22
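
If you prefer to drive the build directly with the Docker CLI, here is a minimal sketch of the same setup. It assumes the RUNS_ON_ECR_CACHE environment variable has been populated by the runner (as in the workflow above), and my-app-latest is just a placeholder tag:

# Build, push, and cache against the ephemeral registry (sketch).
TAG="${RUNS_ON_ECR_CACHE}:my-app-latest"
docker buildx build . \
  --push \
  --tag "$TAG" \
  --cache-from "type=registry,ref=$TAG" \
  --cache-to "type=registry,ref=$TAG,mode=max,compression=zstd,compression-level=22"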

Tmpfs

Support for setting up a tmpfs volume (size: 100% of available RAM, so only to be used on high-memory instances) and mounting the /tmp, /home/runner, and /var/lib/docker folders onto it. /tmp and /home/runner are mounted as overlays, preserving their existing content.

Can speed up some IO-intensive workflows. Note that if tmpfs is active, instances with ephemeral disks won't have those mounted, since they would conflict with the tmpfs volume.

jobs:
  with-tmpfs:
    runs-on: runs-on=${{ github.run_id }},family=r7,ram=16,extras=tmpfs
    steps:
      - run: df -ah /mnt/tmpfs
      # tmpfs            16G  724K   16G   1% /mnt/tmpfs
      - run: df -ah /home/runner
      # overlay          16G  724K   16G   1% /home/runner
      - run: df -ah /tmp
      # overlay          16G  724K   16G   1% /tmp
      - run: df -ah /var/lib/docker
      # tmpfs            16G  724K   16G   1% /var/lib/docker

You can obviously combine options, e.g. extras=efs+tmpfs+ecr-cache+s3-cache is a valid label 😄

Instance-storage mounting changes

Until now, when an instance had locally attached NVMe SSDs available, they were automatically formatted and mounted so that the /var/lib/docker and /home/runner/_work directories ended up on the local disks. Since a lot of content (caches, etc.) ends up directly within the /home/runner folder itself, the agent now uses the same strategy as for the new tmpfs mounts above: the whole /home/runner folder is mounted as an overlay on the local disk volume, as is the /tmp folder, while /var/lib/docker remains mounted as a normal filesystem on the local disk volume. Fixes #284.
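
For illustration, here is roughly what the overlay strategy looks like at the mount level. This is a hand-written sketch, not the agent's actual commands; /mnt/local stands in for the formatted local-disk volume:

# Hypothetical layout: /mnt/local is the formatted local NVMe volume.
mkdir -p /mnt/local/home-runner/{upper,work}
# Remount /home/runner as an overlay: existing content stays visible
# (lowerdir), while all writes land on the fast local disk (upperdir).
mount -t overlay overlay \
  -o lowerdir=/home/runner,upperdir=/mnt/local/home-runner/upper,workdir=/mnt/local/home-runner/work \
  /home/runner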

Misc

  • Move all RunsOn-specific config files into the /runs-on folder on Linux. More consistent with Windows (C:\runs-on), and avoids polluting the /opt folder.
  • Fix app_version in logs (it was previously an empty string due to an incorrect env variable being used in v2.8.1).
  • Fix "Require any Amazon EC2 launch template not to auto-assign public IP addresses to network interfaces" from AWS Control Tower. When the Private mode is set to only, no longer enable public ip auto-assignment in the launch templates. Thanks @temap!


v2.6.5 - Optimized GPU images, VpcEndpoints stack parameter, tags for custom runners

👋 v2.6.4 and v2.6.5 have been released over the last few weeks, with the following changes.

Note: v2.6.6 has been released to fix an issue with the VpcEndpoints stack parameter.

πŸ“ v2.6.5

Optimized GPU images, new VpcEndpoints stack parameter, ability to specify custom instance tags for custom runners.

Note: there appear to be some issues with the new VPC endpoints. I'm on it! If you need that feature, please hold on to your current version of RunsOn.

What's Changed

  • New GPU images ubuntu22-gpu-x64 and ubuntu24-gpu-x64: 1:1 compatibility with GitHub base images, plus NVIDIA GPU drivers, CUDA toolkit, and container toolkit.
  • Add new VpcEndpoints stack parameter (fixes #213), and reorganize template params. Note that the EC2 VPC endpoint was previously created automatically when Private mode was enabled. This is no longer the case, so make sure you select the VPC endpoints you need when you update your CloudFormation stack.
  • Suspend versioning for the cache bucket (fixes #191).
  • Allow specifying instance tags for runners (fixes #205). Tag keys can't start with the runs-on- prefix, and keys and values will be sanitized according to AWS rules.


πŸ“ v2.6.4

CLI 0.0.1 released, fix for Magic Cache, fleet objects deletion.

What's changed

  • CLI released: https://github.com/runs-on/cli. Lets you easily view logs (both server logs and cloud-init logs) for a workflow job by simply pasting its GitHub URL or ID, and easily connect to a runner through SSM.
  • Fix race-condition in Magic Cache (fixes #209).
  • Delete the fleet instead of just the instance (fixes #217).


How to verify that VPC traffic to S3 is going through your S3 gateway?

Gateway endpoints for Amazon S3 are a must-have whenever your EC2 instances exchange traffic with S3: they keep the traffic within the AWS network, which means better security, higher bandwidth and throughput, and lower costs. They are easy to create and add to your VPC route tables.
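
For reference, creating one takes a single CLI call. A minimal sketch, with placeholder VPC and route-table IDs:

# Create an S3 gateway endpoint and attach it to a route table
# (replace the VPC and route table IDs with your own).
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0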

But how do you verify that traffic is indeed going through the S3 gateway, and not crossing the open internet?

Using traceroute, you can probe the routes and see whether you are directly hitting the S3 servers (i.e. no intermediate gateway). In this example, the instance is running from a VPC located in us-east-1:

$ traceroute -n -T -p 443 s3.us-east-1.amazonaws.com
traceroute to s3.us-east-1.amazonaws.com (52.216.215.72), 30 hops max, 60 byte packets
1 * * *
2 * * *
3 * * *
4 * * *
5 * * *
6 52.216.215.72 0.890 ms 0.916 ms 0.892 ms
$ traceroute -n -T -p 443 s3.amazonaws.com
traceroute to s3.amazonaws.com (52.217.139.232), 30 hops max, 60 byte packets
1 * * *
2 * * *
3 * * *
4 * * *
5 * * *
6 52.217.139.232 0.268 ms 0.275 ms 0.252 ms

Both outputs show the expected result, i.e. no intermediate gateway. This is what happens when you access a bucket located in the us-east-1 region.

Let’s see what happens if we try to access an S3 endpoint located in another region:

$ traceroute -n -T -p 443 s3.eu-west-1.amazonaws.com
traceroute to s3.eu-west-1.amazonaws.com (52.218.25.211), 30 hops max, 60 byte packets
1 * * *
2 240.4.88.37 0.275 ms 240.0.52.64 0.265 ms 240.4.88.39 0.215 ms
3 240.4.88.49 0.205 ms 240.4.88.53 0.231 ms 240.4.88.51 0.206 ms
4 100.100.8.118 1.369 ms 100.100.6.96 0.648 ms 240.0.52.57 0.233 ms
5 240.0.228.5 0.326 ms * *
6 240.0.32.16 0.371 ms 240.0.48.30 0.362 ms *
7 * 240.0.228.31 0.251 ms *
8 * * *
9 * * 240.0.32.27 0.392 ms
10 * * *
11 * 242.0.154.49 1.321 ms *
12 * * 52.93.28.131 1.491 ms
13 * * 100.100.6.108 1.286 ms
14 100.92.212.7 67.909 ms 52.218.25.211 67.356 ms 67.929 ms

As you can see, the route is completely different and, as expected, does not go straight to the S3 endpoint.

TL;DR: make sure your route tables are correct, and only point to S3 buckets located in the same region.
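
To double-check the route-table associations without running traceroute, you can also list your gateway endpoints. A sketch, assuming the us-east-1 S3 service:

# Lists S3 gateway endpoints, their state, and associated route tables.
aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.s3" \
  --query 'VpcEndpoints[].{Id:VpcEndpointId,State:State,RouteTables:RouteTableIds}'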

GitHub Action runner images (AMI) for AWS EC2

As part of the RunsOn service, we automatically maintain and publish replicas of the official GitHub runner images as AWS-formatted images (AMIs) in this repository: https://github.com/runs-on/runner-images-for-aws.

New images are automatically released every 2 weeks, slightly trimmed to remove outdated software and (mostly useless) caches.

Supported images

  • ubuntu22-full-x64
  • ubuntu22-full-arm64
  • ubuntu24-full-x64
  • ubuntu24-full-arm64

Supported regions

  • North Virginia (us-east-1)
  • Ohio (us-east-2)
  • Oregon (us-west-2)
  • Ireland (eu-west-1)
  • London (eu-west-2)
  • Paris (eu-west-3)
  • Frankfurt (eu-central-1)
  • Mumbai (ap-south-1)
  • Tokyo (ap-northeast-1)
  • Singapore (ap-southeast-1)
  • Sydney (ap-southeast-2)

Find the AMI

For a given image, search for:

  • name: runs-on-v2.2-<IMAGE_ID>-*
  • owner: 135269210855

For instance, for the ubuntu22-full-x64 image (see the CLI example after this list), search for:

  • name: runs-on-v2.2-ubuntu22-full-x64-*
  • owner: 135269210855
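
For example, the following AWS CLI query (one possible approach, using the name pattern and owner ID listed above) resolves the latest matching AMI in the current region:

# Returns the most recently created matching AMI.
aws ec2 describe-images \
  --owners 135269210855 \
  --filters "Name=name,Values=runs-on-v2.2-ubuntu22-full-x64-*" \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
  --output text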

Notes

  • The SSH daemon is disabled by default, so be sure to enable it in a user-data script if needed (see the sketch below).
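
A minimal user-data sketch, assuming an Ubuntu-based image where the daemon runs as the ssh systemd unit:

#!/bin/bash
# Hypothetical user-data: re-enable and start the SSH daemon.
# The service name ("ssh" on Ubuntu) may vary with the image.
systemctl enable --now ssh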

You can find more details at https://github.com/runs-on/runner-images-for-aws.

Automatically cleanup outdated AMIs in all AWS regions

Here is a script you can use to automatically clean up AMIs older than 60 days (configurable), skipping any region that has fewer than 2 matching AMIs. This helps remove outdated images and reduces the storage costs of your AMIs.

Particularly useful in the case of runs-on.com, where we regularly rebuild base images whenever GitHub releases a new version of the runner image.

The bin/cleanup script (simply adjust the filters as needed):

#!/bin/bash
# Deregisters old AMIs and deletes their associated snapshots, in all regions
set -e
set -o pipefail

APPLICATION="RunsOn"
REGIONS="$(aws ec2 describe-regions --query "Regions[].RegionName" --output text)"

# Number of days to keep AMIs (override with DAYS_TO_KEEP=N)
DAYS_TO_KEEP=${DAYS_TO_KEEP:=60}
# Age threshold in seconds
AGE_THRESHOLD=$((DAYS_TO_KEEP * 24 * 3600))
# Current timestamp in seconds since epoch
CURRENT_TIMESTAMP=$(date +%s)

for region in $REGIONS; do
  echo "---- Region: ${region} ---"
  # Count the AMIs tagged for this application in the region
  image_count=$(aws ec2 describe-images --owners self --filters "Name=tag:application,Values=${APPLICATION}" --query 'length(Images)' --region "$region" --output text)
  echo "  Total AMIs in this region: ${image_count}"
  if [ "$image_count" -lt 2 ]; then
    echo "  Less than 2 AMIs found, skipping"
    continue
  fi
  aws ec2 describe-images --owners self --region "${region}" --filters "Name=tag:application,Values=${APPLICATION}" --query 'Images[*].[Name,ImageId,CreationDate]' --output text | \
    while read -r name image_id creation_date; do
      # Parse the creation date into seconds since epoch
      image_timestamp=$(date -d "$creation_date" +%s)
      # Compute the age of the AMI in seconds
      age=$((CURRENT_TIMESTAMP - image_timestamp))
      # Only touch AMIs older than the threshold
      if [ "$age" -gt "$AGE_THRESHOLD" ]; then
        echo "  ! Deregistering AMI: ${image_id} (${name}) created on $creation_date"
        # Assumes a single EBS snapshot per AMI
        snapshot_id=$(aws ec2 describe-images --image-ids "$image_id" --query "Images[].BlockDeviceMappings[].Ebs.SnapshotId" --region "${region}" --output text)
        if [ "$DRY_RUN" = "true" ]; then
          echo "  DRY_RUN is set to true, skipping deregistering AMI ${image_id} and deleting snapshot ${snapshot_id}"
          continue
        fi
        aws ec2 deregister-image --image-id "$image_id" --region "${region}"
        echo "  ! Deleting snapshot ${snapshot_id} for AMI ${image_id}"
        aws ec2 delete-snapshot --snapshot-id "${snapshot_id}" --region "${region}"
      fi
    done
done
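
The script honors two environment variables (DRY_RUN and DAYS_TO_KEEP), so you can preview the deletions before running it for real:

# Preview what would be removed, then clean up with a 90-day window:
DRY_RUN=true bin/cleanup
DAYS_TO_KEEP=90 bin/cleanup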

Example output:

---- Region: ap-southeast-2 ---
  Total AMIs in this region: 0
---- Region: eu-central-1 ---
  Total AMIs in this region: 5
  ! Deregistering AMI: ami-0576e83d0a0f89fbe (runner-ubuntu2204-1699888130) created on 2023-11-13T16:53:48.000Z
  ! Deleting snapshot snap-07db95a7f230d3f76 for AMI ami-0576e83d0a0f89fbe
  ! Deregistering AMI: ami-004d4d18e6db2f812 (runner-ubuntu-22-1699873337) created on 2023-11-13T12:40:48.000Z
  ! Deleting snapshot snap-0500b0e3fb95ab36a for AMI ami-004d4d18e6db2f812
  ! Deregistering AMI: ami-0e6239eae649effcd (runner-ubuntu22-20231115.7-1700233930) created on 2023-11-17T17:01:58.000Z
  ! Deleting snapshot snap-05e795e4c6fe9e66f for AMI ami-0e6239eae649effcd
  ! Deregistering AMI: ami-0dd7f6b263a3ce28c (runner-ubuntu22-20231115-1700156105) created on 2023-11-16T19:24:38.000Z
  ! Deleting snapshot snap-02c1aef800c429b76 for AMI ami-0dd7f6b263a3ce28c
---- Region: us-east-1 ---
  Total AMIs in this region: 4
  ! Deregistering AMI: ami-0b56f2d6af0d58ce0 (runner-ubuntu2204-1699888130) created on 2023-11-13T15:54:22.000Z
  ! Deleting snapshot snap-0f2e8759bea8f3937 for AMI ami-0b56f2d6af0d58ce0
  ! Deregistering AMI: ami-04266841492472a95 (runner-ubuntu22-20231115.7-1700233930) created on 2023-11-17T16:02:34.000Z
  ! Deleting snapshot snap-0f0fcf9c6406c3ad9 for AMI ami-04266841492472a95
  ! Deregistering AMI: ami-0738c7108915044fe (runner-ubuntu22-20231115-1700156105) created on 2023-11-16T18:21:40.000Z
  ! Deleting snapshot snap-03f16588f59ed7cea for AMI ami-0738c7108915044fe
...

Example GitHub Actions workflow to schedule a cleanup every night:

name: Cleanup
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
on:
  workflow_dispatch:
  schedule:
    - cron: '0 2 * * *'
jobs:
  check:
    timeout-minutes: 30
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - run: bin/cleanup