AWS Lab: Automating Operations with Playbooks & Runbooks

This lab deploys an application: an API running inside a Docker container on Amazon ECS. The application is accessed through an Application Load Balancer.

To keep the API secure, the entire application (Docker container, ECS) is hosted in an Amazon VPC. The API can only be reached through routes within the VPC subnets. The API exposes two actions: encrypt and decrypt.

The encrypt action allows me to pass a secret message along with the ‘Name’ key as the identifier and it will return a ‘Secret Key ID’ that I can use to decrypt my message.

The decrypt action allows me to decrypt the secret message by passing along the ‘Name’ key and the ‘Secret Key ID’ I obtained earlier; it returns the secret message.

Both the encrypt and decrypt actions make write and read calls to the application database hosted in Amazon RDS.

Two items automate the building of the entire environment:

CloudFormation template: walab-ops-base-resources

This provides the resources needed for the application to run.

This builds the entire virtual environment: App Container Repo, IDE, IGW, NAT Gateway, VPC, VPC Subnets

Build_application.sh

This uses the resources built by the walab-ops-base-resources template to launch the application. The script specifies the region, repo name, Docker image, etc.

Once the stack is created and the application is built, it’s time to test the application.

Testing

How the application works:

The application takes a JSON payload with the Name key as the identifier and the Text key as the value of the secret message. It encrypts the value under the Text key with a designated KMS key and stores the encrypted text in the RDS database with Name as the primary key.

In the Cloud9 terminal I will run the following commands to test the application. They use curl to send a POST request with the secret message payload {"Name": "Bob", "Text": "Run your operations as code"} to the API.

# ALBEndpoint="internal-walab-op-ALB-mKI3YIIQb6J9-2061713640.us-west-2.elb.amazonaws.com"

# curl --header "Content-Type: application/json" --request POST --data '{"Name": "Bob", "Text": "Run your operations as code"}' $ALBEndpoint/encrypt

Once these commands run correctly, I receive a message stating: {"Message":"Data encrypted and stored, keep your key save","Key":"EncryptKey"}

I then need to record the encrypt key value: bd29fa7b-7d62-407b-bae3-43b7ccca404b
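
Copying the key by hand is easy to get wrong, so here is a minimal sketch of capturing it in a shell variable instead. It assumes the response has the flat shape shown above. The extract_key helper is hypothetical (it is not part of the lab) and uses sed rather than a real JSON parser, so it only suits this simple response.

```shell
# Hypothetical helper (not part of the lab): print the "Key" value from the
# flat JSON response returned by the /encrypt endpoint.
# sed keeps this dependency-free; a real JSON parser would be more robust.
extract_key() {
  printf '%s\n' "$1" |
    sed -n 's/.*"Key"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (assumes ALBEndpoint is set as in the commands above):
#   response=$(curl --silent --header "Content-Type: application/json" \
#     --request POST \
#     --data '{"Name": "Bob", "Text": "Run your operations as code"}' \
#     "$ALBEndpoint/encrypt")
#   key=$(extract_key "$response")
```

If the response contains no "Key" field, the helper prints nothing, which makes a failed request easy to detect before calling decrypt.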

I then paste the encrypt key value into the following command in order to test the decrypt API:

# curl --header "Content-Type: application/json" --request GET --data '{"Name":"Bob","Key":"bd29fa7b-7d62-407b-bae3-43b7ccca404b"}' $ALBEndpoint/decrypt

If the command runs correctly I should get the following output:

{"Text":"Run your operations as code"}

Simulate Application Issue

This section will assist in my understanding of workload health and its importance to Operational Excellence. By combining metrics, thresholds and alerts I can ensure that issues are identified and resolved as quickly as possible in order to reduce interruptions.

The simulated problem will cause the API to take longer than 6 seconds to respond. When this happens, an alert is raised, triggering an email notification.
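
To get a feel for the 6 second threshold from the terminal, I could time a single request myself. This is only a local sketch, not the CloudWatch alarm the lab configures. The helper names are hypothetical, and it assumes the decrypt endpoint and payload used earlier.

```shell
# Threshold from the text: responses slower than 6 seconds count as unhealthy.
THRESHOLD_SECONDS=6

# Succeeds (exit 0) when the response time in $1 exceeds the threshold.
# An optional $2 overrides the default threshold.
# awk is used because POSIX shell arithmetic has no floating point.
exceeds_threshold() {
  awk -v t="$1" -v limit="${2:-$THRESHOLD_SECONDS}" \
    'BEGIN { exit !(t > limit) }'
}

# Hypothetical helper: time one decrypt request.
# $1 is the endpoint, $2 is the JSON payload.
# --write-out '%{time_total}' makes curl print the total time in seconds.
response_time() {
  curl --silent --output /dev/null --write-out '%{time_total}' \
    --header "Content-Type: application/json" --request GET \
    --data "$2" "$1/decrypt"
}

# Usage:
#   t=$(response_time "$ALBEndpoint" '{"Name":"Bob","Key":"<key>"}')
#   if exceeds_threshold "$t"; then echo "slow response: ${t}s"; fi
```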

Process:

1.) Create and run a script that will send a large amount of traffic to the API.

2.) Observe and confirm the issue with AWS monitoring tools.

Sending Traffic to the Application
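
The lab ships its own traffic script, which is not reproduced here. As a rough sketch of the idea only, a loop like the one below sends batches of concurrent requests. The function name, payload and batch size are all assumptions of mine, not the lab's values.

```shell
# Hypothetical load generator (a sketch, not the lab's script).
# $1 = endpoint, $2 = total requests, $3 = requests per batch (default 10).
# Note: the variables below are global, which is acceptable for a short sketch.
send_traffic() {
  endpoint=$1
  total=$2
  batch=${3:-10}
  sent=0
  while [ "$sent" -lt "$total" ]; do
    i=0
    while [ "$i" -lt "$batch" ] && [ "$sent" -lt "$total" ]; do
      # Each request runs in the background so a batch is sent concurrently.
      curl --silent --output /dev/null \
        --header "Content-Type: application/json" --request POST \
        --data '{"Name": "Test", "Text": "load test message"}' \
        "$endpoint/encrypt" &
      i=$((i + 1))
      sent=$((sent + 1))
    done
    wait   # let the current batch finish before starting the next one
  done
  echo "sent $sent requests"
}

# Usage:
#   send_traffic "$ALBEndpoint" 1000 50
```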

Build and Run an Investigative Playbook

A playbook allows team members to combine everyone’s knowledge into a document of predefined steps that are run to identify an issue. This saves precious time and resources when resolving issues. Automating playbooks adds another layer, shaving even more time off resolution. There are numerous tools that can be used to build a playbook within AWS, Systems Manager being a main one. Systems Manager offers an automation document capability, Runbooks, that allows for the creation of a series of executable steps to orchestrate my investigation and remediation. AWS Systems Manager Automation documents allow me to run custom scripts, call AWS service APIs, or even run remote commands on cloud or on-premises compute instances.

Process to Create Investigative Playbook

1.) Prepare Automation Document IAM Role

I’ll first move to the folder that contains the lab templates:

# cd ~/environment/aws-well-architected-labs/static/2_prepare/Automating_operations_with_playbooks_and_runbooks/Code/templates

Once this is done, I create the CloudFormation stack for the automation role from the template:

# aws cloudformation create-stack --stack-name waopslab-automation-role \
  --capabilities CAPABILITY_NAMED_IAM \
  --template-body file://automation_role.yml

I’ll then need to confirm that the stack was created:

# aws cloudformation describe-stacks --stack-name waopslab-automation-role

Once the StackStatus in the output reads “CREATE_COMPLETE” I can proceed.
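
Rather than re-running describe-stacks by hand, the check can be polled. The wait_for_stack function below is a sketch of my own, not part of the lab. The AWS CLI also has a built-in waiter, aws cloudformation wait stack-create-complete --stack-name <name>, that does the same job.

```shell
# Poll a stack until it reaches CREATE_COMPLETE.
# $1 = stack name, $2 = seconds between polls (default 10).
# Returns non-zero if the CLI call fails or the stack enters a
# failed or rollback status.
wait_for_stack() {
  while :; do
    status=$(aws cloudformation describe-stacks --stack-name "$1" \
      --query 'Stacks[0].StackStatus' --output text) || return 1
    case $status in
      CREATE_COMPLETE) echo "$1: $status"; return 0 ;;
      *FAILED*|*ROLLBACK*) echo "$1: $status" >&2; return 1 ;;
    esac
    sleep "${2:-10}"
  done
}

# Usage:
#   wait_for_stack waopslab-automation-role
```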

Building the “Gather-Resources” Playbook

In order to conduct an effective investigation I need to know all of the services and resources associated with the issue. The email notifications do not list any resource information. To gather this information I will need to build a playbook that acquires all related resources using my CloudWatch alarm ARN as a reference.

Codifying my playbook with AWS Systems Manager maximizes code reusability, which reduces the overhead of rewriting code that has identical objectives.

Process

Move to folder with playbook template

# cd ~/environment/aws-well-architected-labs/static/2_prepare/Automating_operations_with_playbooks_and_runbooks/Code/templates

Run the following command to create the playbook stack

# aws cloudformation create-stack --stack-name waopslab-playbook-gather-resources \
  --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \
  --template-body file://playbook_gather_resources.yml

Confirm stack was created and completed.

# aws cloudformation describe-stacks --stack-name waopslab-playbook-gather-resources
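
The role ARN passed as PlaybookIAMRole is account-specific. As a sketch, I could look it up instead of typing it. The role name AutomationRole is taken from the ARN in the command above; both helper names are hypothetical.

```shell
# Print the ARN of an IAM role.
# $1 = role name; defaults to "AutomationRole", the name seen in the ARN above.
role_arn() {
  aws iam get-role --role-name "${1:-AutomationRole}" \
    --query 'Role.Arn' --output text
}

# Print the --parameters value used by the create-stack commands.
playbook_role_parameter() {
  printf 'ParameterKey=PlaybookIAMRole,ParameterValue=%s\n' "$(role_arn "$1")"
}

# Usage:
#   aws cloudformation create-stack \
#     --stack-name waopslab-playbook-gather-resources \
#     --parameters "$(playbook_role_parameter)" \
#     --template-body file://playbook_gather_resources.yml
```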

Once the automation document is created I can now test it by going to Systems Manager and running the “Playbook-Gather-Resources” document. By placing the CloudWatch alarm ARN in the required field, this playbook will gather all resources pertaining to this alert. I then need to copy the list of resources so I can pass it to the next playbook.
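
The same run can be started from the terminal instead of the console. The sketch below uses the real start-automation-execution and get-automation-execution commands, but the parameter name AlarmARN is an assumption of mine; the actual name is listed on the document's parameters page in Systems Manager.

```shell
# Start the Gather-Resources playbook and print the execution ID.
# $1 = CloudWatch alarm ARN.
# ASSUMPTION: the document accepts the ARN in a parameter named "AlarmARN".
start_gather_resources() {
  aws ssm start-automation-execution \
    --document-name "Playbook-Gather-Resources" \
    --parameters "AlarmARN=$1" \
    --query 'AutomationExecutionId' --output text
}

# Print the status of an execution, for example InProgress, Success or Failed.
# $1 = execution ID returned by start_gather_resources.
execution_status() {
  aws ssm get-automation-execution --automation-execution-id "$1" \
    --query 'AutomationExecution.AutomationExecutionStatus' --output text
}

# Usage:
#   id=$(start_gather_resources "<alarm ARN>")
#   execution_status "$id"
```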

Building the “Investigate-Application-Resources” Playbook

This playbook will interrogate resources and capture recent metrics and logs to look for insights and fully understand the root cause of the issue. It will analyze the data and highlight the metrics and logs that identify resources operating outside their normal operational thresholds.

Process to Build Investigative Playbook

Run the command to create the playbook

# aws cloudformation create-stack --stack-name waopslab-playbook-investigate-resources \
  --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
  --template-body file://playbook_investigate_application_res..

Confirm that the playbook stack was created

# aws cloudformation describe-stacks --stack-name waopslab-playbook-investigate-resources

Once “CREATE_COMPLETE” is shown, the playbook is ready to be executed. I can then paste in the resources that were gathered from the “Gather-Resources” playbook and run it.

The output shows that the playbook went through each resource to assess its current health and log its output. I now need to create a Parent playbook, which will be triggered by the CloudWatch alarm and will run both the “Gather-Resources” and “Investigate-Application-Resources” playbooks. This makes issue identification and remediation almost fully automated.

Once the playbook and the SNS topic are created, running this playbook runs all the necessary playbooks. The notification sent from this playbook outlines a summary of the errors: a large number of ELB504Count errors and a high TargetResponseTime from the Load Balancer. This is the cause of the delay in API requests. I can also see a low Elastic Container Service TaskRunningCount along with a relatively high CPUUtilization average. The most likely cause of the latency issue is a resource constraint at the application API level running in ECS.

Build and Run Remediation Runbook

Runbooks are a bit different from Playbooks in that they are procedures that accomplish a specific task to achieve an outcome. Having discovered that the issue is CPU utilization, I can remediate it through the use of auto-scaling. This runbook will notify me and give me the option to intercept the scale-up action should I choose not to proceed.

Process of Building “Approval-Gate” Runbook

This runbook will provide me with the ability to deny or approve remediation within a specified waiting time. If a decision isn’t made the runbook will automatically approve the action.

Using the Systems Manager Automation document I will build the following items:

Approval-Gate runbook executes a separate document called the Approve-Timer
Approve-Timer runbook will then wait for a preconfigured amount of time and send an approve signal to the Approval-Gate runbook
Approval-Gate runbook then sends an approval request to the workload owner via a designated SNS topic

If I choose to approve, the Approval-Gate runbook will continue to the next step
If I decline the approval, the runbook will fail, blocking further steps
If I don’t respond within the designated time, the Approve-Timer runbook will automatically approve the request
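
The timer behaviour described above can be sketched with the send-automation-signal command, which accepts Approve and Reject signals for an execution waiting on an approval step. This is a stand-in of my own to illustrate the flow, not the lab's Approve-Timer document, and the function names are hypothetical.

```shell
# Send an approval decision to a waiting automation execution.
# $1 = execution ID of the Approval-Gate run, $2 = Approve or Reject.
send_decision() {
  case $2 in
    Approve|Reject) ;;
    *) echo "decision must be Approve or Reject" >&2; return 2 ;;
  esac
  aws ssm send-automation-signal \
    --automation-execution-id "$1" --signal-type "$2"
}

# Stand-in for the Approve-Timer: wait, then approve automatically.
# $1 = execution ID, $2 = seconds to wait before approving.
approve_after_timeout() {
  sleep "$2"
  send_decision "$1" Approve
}

# Usage:
#   approve_after_timeout "<execution ID>" 600 &    # auto-approve in 10 minutes
#   send_decision "<execution ID>" Reject           # or block the scale-up now
```

Whichever signal arrives first decides the outcome, which matches the three cases listed above.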

Create Runbook

# aws cloudformation create-stack --stack-name waopslab-runbook-approval-gate \
  --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
  --template-body file://runbook_approval_gate.yml

Confirm creation of Runbook

# aws cloudformation describe-stacks --stack-name waopslab-runbook-approval-gate

Building the “ECS-Scale-Up” runbook

The ECS-Scale-Up runbook will complete the following tasks:

Run the Approval-Gate runbook
Wait for the Approval-Gate runbook to complete
Increase the number of ECS tasks in the cluster once the Approval-Gate runbook completes successfully
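
The last of these steps, increasing the task count, comes down to one ECS API call. A minimal sketch of the equivalent CLI command follows; the function name is hypothetical, and the cluster and service names come from my own stack outputs rather than from this snippet.

```shell
# Set the desired task count of an ECS service and print the new value.
# $1 = cluster name, $2 = service name, $3 = desired number of tasks.
scale_ecs_service() {
  case $3 in
    ''|*[!0-9]*)
      echo "desired count must be a non-negative integer" >&2
      return 2 ;;
  esac
  aws ecs update-service --cluster "$1" --service "$2" \
    --desired-count "$3" \
    --query 'service.desiredCount' --output text
}

# Usage (only after the approval step has succeeded):
#   scale_ecs_service "<cluster>" "<service>" 100
```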

Process for building ECS-Scale-Up runbook

Create and add parameters to runbook:
# aws cloudformation create-stack --stack-name waopslab-runbook-scale-ecs-service \
  --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
  --template-body file://runbook_scale_ecs_service.yml

Confirm creation of runbook
# aws cloudformation describe-stacks --stack-name waopslab-runbook-scale-ecs-service

Executing remediation Runbook

To execute the runbook, I take the Output values from the walab-ops-sample-application stack and paste them into the respective fields of Runbook-ECS-Scale-Up. An email notification is then sent to me to approve or deny the auto-scaling. When I approve the request, the ECS task count increases to the value specified. After this, the API response time returns to normal and the CloudWatch alarm goes back to an OK state.
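
To confirm the recovery from the terminal rather than the console, the alarm state can be queried directly. This is a small sketch of my own; the alarm name is whatever the lab created in my account, so it is passed in as an argument.

```shell
# Print the state of a CloudWatch alarm: OK, ALARM or INSUFFICIENT_DATA.
# $1 = alarm name.
alarm_state() {
  aws cloudwatch describe-alarms --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' --output text
}

# Succeeds (exit 0) only when the alarm is back in the OK state.
alarm_is_ok() {
  [ "$(alarm_state "$1")" = "OK" ]
}

# Usage:
#   if alarm_is_ok "<alarm name>"; then echo "alarm cleared"; fi
```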

Mr. Robot
