Published 2020-10-25.
Last modified 2022-01-27.
Time to read: 7 minutes.
The upgrade from Ubuntu 20.04 to 20.10 has been especially problematic for each of the half-dozen Xubuntu systems that I manage. One important server that I run on Scaleway became unresponsive shortly after the upgrade started and would not boot, and another important server on AWS ran fine but did not allow logins.
This article details what I did to recover the AWS server using a standard *nix procedure that any competent system administrator would be comfortable with: chroot.
Before Linux had cgroups, we used chroot and its close cousin, jail. I used chroot for the technical basis of Zamples back in 2001.
Because the chroot environment will be set up in a way that shares the rescue system’s /var/run directory, the rescue system should have all upgrades in place and should be rebooted if /var/run/reboot-required exists.
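On an Ubuntu rescue system that check amounts to something like the following; this is a minimal sketch, and nothing in it is specific to my setup:
$ sudo apt update && sudo apt --yes upgrade
$ [ -f /var/run/reboot-required ] && sudo reboot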
AWS also provides a tool called EC2Rescue, which does a complicated series of actions to accomplish something similar. Here is additional documentation.
I find the AWS documentation is frequently obtuse, and the approach taken by most AWS products and tools is extremely general. Consequently I often find myself wasting a lot of time trying to get things to work. I don’t subscribe to AWS support; if I had subscribed to expensive enterprise-level support, complete with an AWS expert to hold my hand while I attempted to resurrect the server, I might have tried using EC2Rescue. On the other hand, when pressed with an emergency, I prefer to lean on tried-and-true methods like chroot.
This article concludes with two Bash scripts to automate the details. I wrote the second script in late January 2022, approximately 2 years after this article was first published. The second script is less user-friendly, but more fine-grained. That is a reasonable tradeoff, and both approaches have merit.
Setup
AWS CLI
I prefer to use the AWS CLI instead of the web console. Installation instructions are here. This article uses the AWS CLI exclusively, rather than the AWS web console.
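A quick way to confirm that the CLI is installed and that its credentials work is to ask AWS who you are:
$ aws --version
$ aws sts get-caller-identity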
jq
I also use jq for parsing JSON in the bash shell. Install it on Debian-style Linux distros such as Ubuntu like this:
$ yes | sudo apt install jq
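A one-liner is enough to confirm that jq is on the path and parsing JSON:
$ echo '{"Name":"production"}' | jq -r .Name
production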
Discover information about the Problem EC2 instance
Getting the AWS EC2 Instance Information
Because my problem EC2 instance has a tag called Name with the Value production, I was able to easily obtain a JSON representation of all the information about it. I stored the JSON in an environment variable called AWS_EC2_PRODUCTION.
The results are shown in unselectable text so you can easily use this sample code yourself. To copy the code into your clipboard, just click the little copy icon at the top right-hand corner of the scrolling code display area. Because the prompt and the results are unselectable, your clipboard will only pick up the code you need to paste in order to run the example yourself.
$ AWS_EC2_PRODUCTION="$( aws ec2 describe-instances | \ jq '.Reservations[].Instances[] | select((.Tags[]?.Key=="Name") and (.Tags[]?.Value=="production"))' )" $ echo "$AWS_EC2_PRODUCTION" { "AmiLaunchIndex": 0, "ImageId": "ami-e29b9388", "InstanceId": "i-825eb905", "InstanceType": "t2.small", "KeyName": "sslTest", "LaunchTime": "2017-10-12T16:24:14.000Z", "Monitoring": { "State": "disabled" }, "Placement": { "AvailabilityZone": "us-east-1c", "GroupName": "", "Tenancy": "default" }, "PrivateDnsName": "ip-10-0-0-201.ec2.internal", "PrivateIpAddress": "10.0.0.201", "ProductCodes": [], "PublicDnsName": "", "PublicIpAddress": "52.207.225.143", "State": { "Code": 16, "Name": "running" }, "StateTransitionReason": "", "SubnetId": "subnet-49de033f", "VpcId": "vpc-f16a0895", "Architecture": "x86_64", "BlockDeviceMappings": [ { "DeviceName": "/dev/sda1", "Ebs": { "AttachTime": "2016-04-05T19:07:17.000Z", "DeleteOnTermination": true, "Status": "attached", "VolumeId": "vol-1c8903b4" } } ], "ClientToken": "GykZz1459883236367", "EbsOptimized": false, "Hypervisor": "xen", "NetworkInterfaces": [ { "Association": { "IpOwnerId": "amazon", "PublicDnsName": "", "PublicIp": "52.207.225.143" }, "Attachment": { "AttachTime": "2016-04-05T19:07:16.000Z", "AttachmentId": "eni-attach-a58bd15f", "DeleteOnTermination": true, "DeviceIndex": 0, "Status": "attached" }, "Description": "Primary network interface", "Groups": [ { "GroupName": "testSG", "GroupId": "sg-4cbc6f35" } ], "Ipv6Addresses": [], "MacAddress": "0a:a4:be:1b:8e:eb", "NetworkInterfaceId": "eni-fa4f65bb", "OwnerId": "031372724784", "PrivateIpAddress": "10.0.0.201", "PrivateIpAddresses": [ { "Association": { "IpOwnerId": "amazon", "PublicDnsName": "", "PublicIp": "52.207.225.143" }, "Primary": true, "PrivateIpAddress": "10.0.0.201" } ], "SourceDestCheck": true, "Status": "in-use", "SubnetId": "subnet-49de033f", "VpcId": "vpc-f16a0895", "InterfaceType": "interface" } ], "RootDeviceName": "/dev/sda1", "RootDeviceType": "ebs", "SecurityGroups": [ { "GroupName": "testSG", "GroupId": "sg-4cbc6f35" } ], "SourceDestCheck": true, "Tags": [ { "Key": "Name", "Value": "production" } ], "VirtualizationType": "hvm", "CpuOptions": { "CoreCount": 1, "ThreadsPerCore": 1 }, "CapacityReservationSpecification": { "CapacityReservationPreference": "open" }, "HibernationOptions": { "Configured": false }, "MetadataOptions": { "State": "applied", "HttpTokens": "optional", "HttpPutResponseHopLimit": 1, "HttpEndpoint": "enabled" } }
Getting the AWS EC2 Problem Instance Id
The instance ID of the problem EC2 instance can easily be extracted from the JSON obtained above:
$ AWS_PROBLEM_INSTANCE_ID="$( jq -r .InstanceId <<< "$AWS_EC2_PRODUCTION" )"
$ echo "$AWS_PROBLEM_INSTANCE_ID"
i-825eb905
Getting the AWS EC2 Problem Instance IP Address
The IP address of the problem EC2 instance can easily be extracted from the JSON obtained above:
$ AWS_PROBLEM_IP="$( jq -r .PublicIpAddress <<< "$AWS_EC2_PRODUCTION" )"
$ echo "$AWS_PROBLEM_IP"
52.207.225.143
Getting the AWS EC2 Problem Availability Zone
The AWS availability zone of the problem EC2 instance can easily be extracted from the JSON obtained above:
$ AWS_AVAILABILITY_ZONE="$( jq -r .Placement.AvailabilityZone <<< "$AWS_EC2_PRODUCTION" )"
$ echo "$AWS_AVAILABILITY_ZONE"
us-east-1c
Getting the AWS EC2 Problem Volume ID
The following command line extracts the volume ID of the problem server’s system drive into an environment variable called $AWS_PROBLEM_VOLUME_ID:
$ AWS_PROBLEM_VOLUME_ID="$( jq -r '.BlockDeviceMappings[].Ebs.VolumeId' <<< "$AWS_EC2_PRODUCTION" )"
$ echo "$AWS_PROBLEM_VOLUME_ID"
vol-1c8903b4
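If you prefer, the four values above can be extracted in a single pass. This is only a convenience sketch built on the same jq filters; the rest of the article keeps using the individual variables:
$ read -r AWS_PROBLEM_INSTANCE_ID AWS_PROBLEM_IP AWS_AVAILABILITY_ZONE AWS_PROBLEM_VOLUME_ID \
    < <( jq -r '[.InstanceId, .PublicIpAddress, .Placement.AvailabilityZone, .BlockDeviceMappings[0].Ebs.VolumeId] | @tsv' <<< "$AWS_EC2_PRODUCTION" )
$ echo "$AWS_PROBLEM_INSTANCE_ID $AWS_PROBLEM_IP $AWS_AVAILABILITY_ZONE $AWS_PROBLEM_VOLUME_ID"
i-825eb905 52.207.225.143 us-east-1c vol-1c8903b4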
Make a Snapshot of the Problem Server
One approach, which would be living dangerously, would be to mount the system volume of the problem server on another server, set up chroot, attempt to repair the drive image, remount the repaired drive on the problem server, and reboot the server. I am never that optimistic. Things invariably go wrong.
Instead, we will take a snapshot of the problem drive, turn the snapshot into a volume, repair the volume, swap in the repaired volume on the problem system, and reboot that system.
It is better to shut down the EC2 instance before making a snapshot; however, a snapshot can be taken while the server is idling. We will need to shut down the server anyway, so that can be done now or at the last minute.
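If you decide to stop it now, the commands are the same ones used again later in this article:
$ aws ec2 stop-instances --instance-ids "$AWS_PROBLEM_INSTANCE_ID"
$ aws ec2 wait instance-stopped --instance-ids "$AWS_PROBLEM_INSTANCE_ID"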
I made a snapshot with a description like production 2020-10-25, plus Name and Created tags, and saved the snapshot ID in an environment variable called AWS_PROBLEM_SNAPSHOT_ID:
$ AWS_PROBLEM_SNAPSHOT_ID="$( aws ec2 create-snapshot --volume-id "$AWS_PROBLEM_VOLUME_ID" \
    --description "production `date '+%Y-%m-%d'`" \
    --tag-specifications "ResourceType=snapshot,Tags=[{Key=Created, Value=`date '+%Y-%m-%d'`},{Key=Name, Value=\"Broken do-release-upgrade 20.04 to 20.10\"}]" | \
    jq -r .SnapshotId )"
$ echo "$AWS_PROBLEM_SNAPSHOT_ID"
snap-0a856be1f58b8a856
$ aws ec2 wait snapshot-completed --snapshot-ids "$AWS_PROBLEM_SNAPSHOT_ID"
Snapshots only take a few minutes to complete.
The aws ec2 wait command blocks until the specified operation finishes.
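If you are curious, the snapshot’s progress can also be polled from another terminal while you wait:
$ aws ec2 describe-snapshots --snapshot-ids "$AWS_PROBLEM_SNAPSHOT_ID" | jq -r '.Snapshots[0].Progress'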
Create Rescue Volume From Snapshot
Once the snapshot process has completed, create a new volume from the snapshot. The default volume type is gp2. We’ll refer to this volume as $AWS_RESCUE_VOLUME_ID.
It is important to create the volume in the same availability zone as the problem EC2 instance so that it can easily be attached. This command applies a tag called Name, with the value rescue, for easy identification.
$ AWS_RESCUE_VOLUME_ID="$( aws ec2 create-volume \
    --availability-zone $AWS_AVAILABILITY_ZONE \
    --snapshot-id $AWS_PROBLEM_SNAPSHOT_ID \
    --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=rescue}]' | \
    jq -r .VolumeId )"
$ echo "$AWS_RESCUE_VOLUME_ID"
vol-0e20fd22d2dc5a933
$ aws ec2 wait volume-available --volume-id "$AWS_RESCUE_VOLUME_ID"
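Before attaching it, you can optionally double-check that the new volume landed in the right availability zone and is the expected size:
$ aws ec2 describe-volumes --volume-ids "$AWS_RESCUE_VOLUME_ID" | \
    jq '.Volumes[0] | {State, Size, AvailabilityZone, SnapshotId}'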
Use an EC2 Spot Instance For the Rescue Server
Now that the rescue volume is available, we need to mount it on a server, which I’ll call the rescue server. We’ll refer to the server where the rescue volume is prepared via its instance ID, saved as AWS_EC2_RESCUE_ID.
You can either create a new EC2 instance for this purpose, or use an existing EC2 instance.
The rescue server does not need to be anything special; a tiny virtual machine of any description will do fine.
However, some rescue operations will be much easier if the type of operating system is the same as that on the problem drive.
Yesterday I blogged about how to find a suitable AMI and determine its image-id.
$ AWS_AMI="$( aws ec2 describe-images \ --owners 099720109477 \ --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-━━━━━???-━━━━━???-amd64-server-━━━━━???" \ "Name=state,Values=available" \ --query "reverse(sort_by(Images, &CreationDate))[:1]" | \ jq -r '.[0]' )" $ echo "$AWS_AMI" { "Architecture": "x86_64", "CreationDate": "2020-10-30T14:07:42.000Z", "ImageId": "ami-0c71ec98278087e60", "ImageLocation": "099720109477/ubuntu/images/hvm-ssd/ubuntu-groovy-20.10-amd64-server-20201030", "ImageType": "machine", "Public": true, "OwnerId": "099720109477", "PlatformDetails": "Linux/UNIX", "UsageOperation": "RunInstances", "State": "available", "BlockDeviceMappings": [ { "DeviceName": "/dev/sda1", "Ebs": { "DeleteOnTermination": true, "SnapshotId": "snap-00bf581086dd686e5", "VolumeSize": 8, "VolumeType": "gp2", "Encrypted": false } }, { "DeviceName": "/dev/sdb", "VirtualName": "ephemeral0" }, { "DeviceName": "/dev/sdc", "VirtualName": "ephemeral1" } ], "Description": "Canonical, Ubuntu, 20.10, amd64 groovy image build on 2020-10-30", "EnaSupport": true, "Hypervisor": "xen", "Name": "ubuntu/images/hvm-ssd/ubuntu-groovy-20.10-amd64-server-20201030", "RootDeviceName": "/dev/sda1", "RootDeviceType": "ebs", "SriovNetSupport": "simple", "VirtualizationType": "hvm" }
Now let's extract the ID of the AMI image and save it as AWS_AMI_ID.
$ AWS_AMI_ID="$( jq -r '.ImageId' <<< "$AWS_AMI" )"
$ echo "$AWS_AMI_ID"
ami-0c71ec98278087e60
Volumes can be attached to running and stopped server instances.
The load on the rescue server will likely be light and short-lived.
An EC2 spot instance is ideal, and only costs two cents per hour!
The spot instance will likely only be needed for 15 minutes.
I specified the SubnetId within my VPC, the security group sg-4cbc6f35, and the AvailabilityZone.
$ AWS_EC2_RESCUE="$( aws ec2 run-instances \ --image-id "$AWS_AMI_ID" \ --instance-market-options '{ "MarketType": "spot" }' \ --instance-type t2.medium \ --key-name rsa-2020-11-03.pub \ --network-interfaces '[ { "DeviceIndex": 0, "Groups": ["sg-4cbc6f35"], "SubnetId": "subnet-49de033f", "DeleteOnTermination": true, "AssociatePublicIpAddress": true } ]' \ --placement '{ "AvailabilityZone": "us-east-1c" }' )" $ echo "$AWS_EC2_RESCUE" { "Groups": [], "Instances": [ { "AmiLaunchIndex": 0, "ImageId": "ami-0dba2cb6798deb6d8", "InstanceId": "i-012a54aefcd333de9", "InstanceType": "t2.small", "KeyName": "rsa-2020-11-03.pub", "LaunchTime": "2020-11-03T23:19:50.000Z", "Monitoring": { "State": "disabled" }, "Placement": { "AvailabilityZone": "us-east-1c", "GroupName": "", "Tenancy": "default" }, "PrivateDnsName": "ip-10-0-0-210.ec2.internal", "PrivateIpAddress": "10.0.0.210", "ProductCodes": [], "PublicDnsName": "", "State": { "Code": 0, "Name": "pending" }, "StateTransitionReason": "", "SubnetId": "subnet-49de033f", "VpcId": "vpc-f16a0895", "Architecture": "x86_64", "BlockDeviceMappings": [], "ClientToken": "026583fb-c94e-4bca-bdd2-8dcdcaa3aae9", "EbsOptimized": false, "EnaSupport": true, "Hypervisor": "xen", "InstanceLifecycle": "spot", "NetworkInterfaces": [ { "Attachment": { "AttachTime": "2020-11-03T23:19:50.000Z", "AttachmentId": "eni-attach-04feb4d36cf5c6792", "DeleteOnTermination": true, "DeviceIndex": 0, "Status": "attaching" }, "Description": "", "Groups": [ { "GroupName": "testSG", "GroupId": "sg-4cbc6f35" } ], "Ipv6Addresses": [], "MacAddress": "0a:6d:ba:c5:65:4b", "NetworkInterfaceId": "eni-09ef90920cfb29dd9", "OwnerId": "031372724784", "PrivateIpAddress": "10.0.0.210", "PrivateIpAddresses": [ { "Primary": true, "PrivateIpAddress": "10.0.0.210" } ], "SourceDestCheck": true, "Status": "in-use", "SubnetId": "subnet-49de033f", "VpcId": "vpc-f16a0895", "InterfaceType": "interface" } ], "RootDeviceName": "/dev/sda1", "RootDeviceType": "ebs", "SecurityGroups": [ { "GroupName": "testSG", "GroupId": "sg-4cbc6f35" } ], "SourceDestCheck": true, "SpotInstanceRequestId": "sir-rrs9gm3j", "StateReason": { "Code": "pending", "Message": "pending" }, "VirtualizationType": "hvm", "CpuOptions": { "CoreCount": 1, "ThreadsPerCore": 1 }, "CapacityReservationSpecification": { "CapacityReservationPreference": "open" }, "MetadataOptions": { "State": "pending", "HttpTokens": "optional", "HttpPutResponseHopLimit": 1, "HttpEndpoint": "enabled" } } ], "OwnerId": "031372724784", "ReservationId": "r-0d45e1919e7bad5c9" }
We can use jq to extract the EC2 InstanceId of the spot instance:
$ AWS_SPOT_INSTANCE_ID="$( jq -r '.Instances[].InstanceId' <<< "$AWS_EC2_RESCUE" )"
$ echo "$AWS_SPOT_INSTANCE_ID"
i-012a54aefcd333de9
We need to retrieve the IP address of the newly created EC2 spot instance. This instance will disappear (terminate) once it shuts down, so do not shut it down or reboot it until the rescue work is complete.
$ aws ec2 describe-instances --instance-ids "$AWS_SPOT_INSTANCE_ID" { "Reservations": [ { "Groups": [], "Instances": [ { "AmiLaunchIndex": 0, "ImageId": "ami-0dba2cb6798deb6d8", "InstanceId": "i-012a54aefcd333de9", "InstanceType": "t2.small", "KeyName": "rsa-2020-11-03.pub", "LaunchTime": "2020-11-03T23:19:50.000Z", "Monitoring": { "State": "disabled" }, "Placement": { "AvailabilityZone": "us-east-1c", "GroupName": "", "Tenancy": "default" }, "PrivateDnsName": "ip-10-0-0-210.ec2.internal", "PrivateIpAddress": "10.0.0.210", "ProductCodes": [], "PublicDnsName": "", "PublicIpAddress": "54.242.88.254", "State": { "Code": 16, "Name": "running" }, "StateTransitionReason": "", "SubnetId": "subnet-49de033f", "VpcId": "vpc-f16a0895", "Architecture": "x86_64", "BlockDeviceMappings": [ { "DeviceName": "/dev/sda1", "Ebs": { "AttachTime": "2020-11-03T23:19:51.000Z", "DeleteOnTermination": true, "Status": "attached", "VolumeId": "vol-0c44c8c009d1fafda" } } ], "ClientToken": "026583fb-c94e-4bca-bdd2-8dcdcaa3aae9", "EbsOptimized": false, "EnaSupport": true, "Hypervisor": "xen", "InstanceLifecycle": "spot", "NetworkInterfaces": [ { "Association": { "IpOwnerId": "amazon", "PublicDnsName": "", "PublicIp": "54.242.88.254" }, "Attachment": { "AttachTime": "2020-11-03T23:19:50.000Z", "AttachmentId": "eni-attach-04feb4d36cf5c6792", "DeleteOnTermination": true, "DeviceIndex": 0, "Status": "attached" }, "Description": "", "Groups": [ { "GroupName": "testSG", "GroupId": "sg-4cbc6f35" } ], "Ipv6Addresses": [], "MacAddress": "0a:6d:ba:c5:65:4b", "NetworkInterfaceId": "eni-09ef90920cfb29dd9", "OwnerId": "031372724784", "PrivateIpAddress": "10.0.0.210", "PrivateIpAddresses": [ { "Association": { "IpOwnerId": "amazon", "PublicDnsName": "", "PublicIp": "54.242.88.254" }, "Primary": true, "PrivateIpAddress": "10.0.0.210" } ], "SourceDestCheck": true, "Status": "in-use", "SubnetId": "subnet-49de033f", "VpcId": "vpc-f16a0895", "InterfaceType": "interface" } ], "RootDeviceName": "/dev/sda1", "RootDeviceType": "ebs", "SecurityGroups": [ { "GroupName": "testSG", "GroupId": "sg-4cbc6f35" } ], "SourceDestCheck": true, "SpotInstanceRequestId": "sir-rrs9gm3j", "VirtualizationType": "hvm", "CpuOptions": { "CoreCount": 1, "ThreadsPerCore": 1 }, "CapacityReservationSpecification": { "CapacityReservationPreference": "open" }, "HibernationOptions": { "Configured": false }, "MetadataOptions": { "State": "applied", "HttpTokens": "optional", "HttpPutResponseHopLimit": 1, "HttpEndpoint": "enabled" } } ], "OwnerId": "031372724784", "ReservationId": "r-0d45e1919e7bad5c9" } ] }
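Only the public IP address is actually needed in order to ssh into the rescue server; a jq one-liner extracts it. (AWS_RESCUE_IP is simply a name I am introducing here.)
$ AWS_RESCUE_IP="$( aws ec2 describe-instances --instance-ids "$AWS_SPOT_INSTANCE_ID" | \
    jq -r '.Reservations[].Instances[].PublicIpAddress' )"
$ echo "$AWS_RESCUE_IP"
54.242.88.254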
Mount the Rescue Volume On the Rescue Server
We need to select a device name to be assigned to the rescue disk once it is attached to an EC2 instance.
The available names depend on what names are already in use on the rescue server.
After logging into the rescue server, I ran the lsblk Linux command to see the available disk devices and their mount points.
$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop1     7:1    0 53.1M  1 loop /snap/lxd/10984
loop2     7:2    0 88.4M  1 loop /snap/core/7169
loop3     7:3    0 97.8M  1 loop /snap/core/10185
loop4     7:4    0 53.1M  1 loop /snap/lxd/11348
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
The lsblk output does not show full device paths; the /dev/ prefix is omitted. With that in mind, we can see that the only disk device on the rescue server is /dev/xvda, and its only partition, /dev/xvda1, is mounted on the root directory. Because Linux drives are normally named sequentially, we should name the rescue disk /dev/xvdb.
Let’s define an environment variable called AWS_RESCUE_DRIVE to memorialize that decision.
$ AWS_RESCUE_DRIVE=/dev/xvdb
The aws ec2 attach-volume command will attach the rescue volume to the rescue server, using the device name we just selected, which in the following example is /dev/xvdb:
$ AWS_ATTACH_VOLUME="$( aws ec2 attach-volume \
    --device $AWS_RESCUE_DRIVE \
    --instance-id $AWS_EC2_RESCUE_ID \
    --volume-id $AWS_RESCUE_VOLUME_ID )"
$ echo "$AWS_ATTACH_VOLUME"
{
    "AttachTime": "2020-10-26T14:34:55.222Z",
    "InstanceId": "i-d3b03954",
    "VolumeId": "vol-0e20fd22d2dc5a933",
    "State": "attaching",
    "Device": "/dev/xvdb"
}
$ aws ec2 wait volume-in-use --volume-id "$AWS_RESCUE_VOLUME_ID"
The details of the attached rescue drive are provided by fdisk -l:
$ sudo fdisk -l | sed -n -e '/xvdb/,$p'
Disk /dev/xvdb: 12 GiB, 12884901888 bytes, 25165824 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot Start      End  Sectors Size Id Type
/dev/xvdb1 *    16065 25165790 25149726  12G 83 Linux
Now it is time to mount the rescue drive on the rescue server.
Ubuntu has a directory called /mnt whose purpose is to act as a mount point:
$ sudo mount /dev/xvdb1 /mnt
Let’s confirm that the drive is mounted:
$ df -h | grep '^/dev/' | grep -v '^/dev/loop'
/dev/xvda1      7.8G  6.3G  1.1G  86% /
/dev/xvdb1       12G  9.0G  2.2G  82% /mnt
The last line shows that this drive is mounted on /mnt and is 82% full.
Set Up a chroot to Establish an Environment for Making Repairs
We need to mount some more file systems before we perform the chroot. The following mounts the rescue server’s /dev, /dev/shm, /sys, and /run to the same paths within the rescue volume. Because programs like do-release-upgrade need a tty, I also mount devpts and proc. These mounts only last until the next server reboot. After all the mounts, the chroot is issued.
Warning: mounting /run and then updating the system on the rescue disk from within a chroot may change the host system’s /run contents; if the package managers (apt and dpkg) get out of sync with the actual state on the host system, you won’t be able to update the host system until you restore the host system’s image from the snapshot that we made earlier.
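If you would rather sidestep that risk entirely, one alternative (which I did not use here) is to give the chroot its own empty tmpfs instead of bind-mounting the host’s /run:
$ sudo mount -t tmpfs tmpfs /mnt/run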
$ sudo mount -o bind /dev /mnt/dev
$ sudo mount -o bind /dev/shm /mnt/dev/shm
$ sudo mount -o bind /sys /mnt/sys
$ sudo mount -o bind /run /mnt/run
$ sudo mount -t proc proc /mnt/proc
$ sudo mount -t devpts devpts /mnt/dev/pts
$ sudo chroot /mnt
root@ip-10-0-0-189:/#
Notice how the prompt changed after the chroot. That is your clue that it is active.
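On Debian-family systems you can also ask directly: the ischroot command from the debianutils package reports, via its exit status, whether the current shell is running inside a chroot.
# ischroot && echo 'inside a chroot' || echo 'not in a chroot'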
Correct the Problem
This step depends on whatever is wrong. I won’t bore you with the problem I had.
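For the record, when the trigger is a broken do-release-upgrade, the usual package-repair steps inside the chroot are a reasonable place to start; this is a generic sketch, not necessarily what I ran:
# dpkg --configure -a
# apt --fix-broken install
# apt update && apt full-upgrade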
Unmount the New Volume
Exit the chroot and unmount the rescue volume from the rescue server.
# exit
$ sudo umount /mnt/dev/pts
$ sudo umount /mnt/dev/shm
$ sudo umount /mnt/dev
$ sudo umount /mnt/proc
$ sudo umount /mnt/run
$ sudo umount /mnt/sys
$ sudo umount /mnt
Detach the rescue volume from the rescue server.
This can be done from any machine on which the AWS CLI is configured with your account credentials.
$ aws ec2 detach-volume --volume-id $AWS_RESCUE_VOLUME_ID
$ aws ec2 wait volume-available --volume-id $AWS_RESCUE_VOLUME_ID
Unmount the Problem Volume
The problem server must be shut down for this to work.
Detach the problem volume from the problem server.
This can be done from any machine on which the AWS CLI is configured with your account credentials.
$ aws ec2 stop-instances --instance-ids $AWS_PROBLEM_INSTANCE_ID
$ aws ec2 wait instance-stopped --instance-ids $AWS_PROBLEM_INSTANCE_ID
$ aws ec2 detach-volume --volume-id $AWS_PROBLEM_VOLUME_ID
$ aws ec2 wait volume-available --volume-id $AWS_PROBLEM_VOLUME_ID
Replace the Problem Volume
Now it is time to replace the problem volume, containing the problem boot drive, on the problem system with the newly repaired rescue volume. BTW, AWS EC2 always refers to boot drives as /dev/sda1, even when the device has a different name, such as /dev/xvdb1.
replaceSystemVolume Bash function
This Bash function detaches the volume containing the current boot drive of an EC2 instance and replaces it with another volume. If the EC2 instance is running then it is first stopped.
#!/bin/bash

function replaceSystemVolume {
  # $1 - EC2 instance id
  # $2 - new volume to mount as system boot drive

  export EC2_INSTANCE="$( aws ec2 describe-instances --instance-ids "$1" | \
    jq -r ".Reservations[].Instances[0]" )"

  export EC2_NAME="$( jq -r ".Tags[] | select(.Key==\"Name\") | .Value" <<< "$EC2_INSTANCE" )"

  export ATTACHED_VOLUME_ID="$( jq -r ".BlockDeviceMappings[].Ebs.VolumeId" <<< "$EC2_INSTANCE" )"
  if [[ "$ATTACHED_VOLUME_ID" == "$2" ]]; then
    >&2 echo "VolumeId $2 is already attached to EC2 instance $1"
    exit 1
  fi

  export EC2_STATE="$( jq -r ".State.Name" <<< "$EC2_INSTANCE" )"
  if [ "$EC2_STATE" == running ]; then
    echo "Stopping EC2 instance $1"
    aws ec2 stop-instances --instance-ids "$1"
    aws ec2 wait instance-stopped --instance-ids "$1"
  fi

  aws ec2 detach-volume --volume-id "$ATTACHED_VOLUME_ID"
  aws ec2 wait volume-available --volume-id "$ATTACHED_VOLUME_ID"

  aws ec2 attach-volume \
    --device /dev/sda1 \
    --instance-id "$1" \
    --volume-id "$2"
  aws ec2 wait volume-in-use --volume-id "$2"

  aws ec2 start-instances --instance-ids "$1"
  aws ec2 wait instance-running --instance-ids "$1"
}

set -e
replaceSystemVolume "$@"
Here is how to use it:
$ replaceSystemVolume "$AWS_PROBLEM_INSTANCE_ID" "$AWS_RESCUE_VOLUME_ID"
The rescue server (which I call preview) has its instance ID stored in AWS_EC2_RESCUE_ID. To replace the rescue volume on preview with preview's original volume:
$ replaceSystemVolume "$AWS_EC2_RESCUE_ID" "$AWS_PREVIEW_VOLUME_ID"
Boot the problem system
Boot the problem system and verify the problem is solved.
$ aws ec2 start-instances --instance-ids $AWS_PROBLEM_INSTANCE_ID
$ aws ec2 wait instance-running --instance-ids $AWS_PROBLEM_INSTANCE_ID
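Note that an instance without an Elastic IP usually comes back from a stop/start cycle with a new public address, so re-query it before logging in. The key path and user name below are placeholders for whatever you normally use:
$ AWS_PROBLEM_IP="$( aws ec2 describe-instances --instance-ids "$AWS_PROBLEM_INSTANCE_ID" | \
    jq -r '.Reservations[].Instances[].PublicIpAddress' )"
$ ssh -i ~/.ssh/my-key.pem ubuntu@"$AWS_PROBLEM_IP" 'uptime && lsb_release -ds'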
Acknowledgements
This article was inspired by this excellent article, which uses the AWS web console to achieve similar results.