**Work Completed By: Sukhpal Shergill & Tim West**
In order to ensure an application (in this case a series of web services) continues to operate at an agreed service level, the underlying infrastructure must be capable of adapting in line with demand. This raises several questions:
- What is the service level for the application?
- What are the key metrics within the infrastructure?
- How are these metrics monitored and used as triggers?
This project will look at the application that provides a series of web services for the iPhone application, in order to understand what approach must be taken, where the issues are and how they can be overcome.
DETAILS/LOCATION OF PRACTICAL DEMONSTRABLE OUTPUT
We started by taking the current AMI used to provide the iPhone application API and creating a custom AMI we can use with the Elastic Load Balancer.
Logged on to the template AMI as root:
```shell
# ec2-bundle-vol -c cert......pem -k pk.....pem -u ACCOUNT-ID
# ec2-upload-bundle -b wcc-api-server-v01 -a ACCESS-KEY -s SECRET-KEY -m /tmp/image.manifest.xml
# ec2-register -C cert......pem -K pk.....pem --region eu-west-1 wcc-api-server-v01/image.manifest.xml
```
Now that we have our AMI ready for use, create a new instance of it and make sure all is running as expected. In our case the application didn't start because a .pid file was still on disk: we had created the AMI with the application still running. We deleted the .pid file and repeated the steps above to create a new AMI based on the fixed instance.
Once we had a working AMI we created the load balancer and the auto-scaling setup as follows:

```shell
elb-create-lb ApiLoadBalancer --listener "lb-port=80, instance-port=80, protocol=HTTP" --availability-zones eu-west-1a
elb-configure-healthcheck ApiLoadBalancer --target "HTTP:80/" --interval 30 --timeout 3 --unhealthy-threshold 2 --healthy-threshold 2
as-create-launch-config ApiLaunchConfig --image-id ami-ee765dXX --instance-type m1.small --key wcc-test-key1
as-create-auto-scaling-group ApiAutoScalingGroup --launch-configuration ApiLaunchConfig --availability-zones eu-west-1a --min-size 1 --max-size 1 --load-balancers ApiLoadBalancer
as-create-or-replace-trigger ApiTrigger --auto-scaling-group ApiAutoScalingGroup --namespace "AWS/EC2" --measure CPUUtilization --statistic Average --dimensions "AutoScalingGroup=ApiAutoScalingGroup" --period 60 --lower-threshold 40 --upper-threshold 80 --lower-breach-increment=-1 --upper-breach-increment=1 --breach-duration=600
```
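Once the group is created, the same legacy CLI tools can be used to confirm that the instance has launched and is passing the ELB health check; a sketch, using the group and balancer names created above:

```shell
# Confirm the Auto Scaling group has launched its single instance
as-describe-auto-scaling-groups ApiAutoScalingGroup

# Confirm the instance is registered with the balancer and reported InService
elb-describe-instance-health ApiLoadBalancer
```

Both commands require valid AWS credentials, so they are shown here as a checklist rather than a runnable example.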
We looked at the original iPhone application web site (http://ec2-79-125-14-15.eu-west-1.compute.amazonaws.com) and the load-balanced site (http://apiloadbalancer-318684055.eu-west-1.elb.amazonaws.com), using the Java-based load-testing application JMeter to stress test the web sites and see what load they could handle.
The key measurements available to us through Amazon's AWS Console were CPU load, Disk Reads, Disk Writes, Network In and Network Out, plus the response time in milliseconds from JMeter.
JMeter is configured by creating a thread group in which you specify the number of threads (users), the ramp-up period (the time in seconds over which the threads start) and the loop count (the number of attempts by each user). Once the threads have been configured you define the HTTP requests and choose the outputs, which can be a graph, a results tree or other listeners.
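The same thread-group settings can also be driven from the command line in JMeter's non-GUI mode, which is better suited to sustained load tests than the graphical listeners; a sketch, assuming a saved test plan called api-test.jmx:

```shell
# Run the saved test plan headless (-n) and log every sample to a .jtl file
# (the plan and log file names are hypothetical)
jmeter -n -t api-test.jmx -l results.jtl
```

The .jtl log can then be opened in a listener afterwards to produce the graphs and results tree described above.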
We initially started off with a high thread count and found the CPU load on the web server was at its maximum of 100% for the duration of the test. We then lowered the thread count and found the CPU still stayed high. To prove it was not just the JMeter application causing the issue, I manually tested the website by hitting each page while we monitored the CPU load, and I managed to push the CPU to 20% on my own. This indicates the application itself is CPU-intensive in the way it renders the web page, so an alternative measure of performance was needed.
The measure we decided upon was latency, which is the time between sending a request and receiving a response. This time includes all the processing needed to assemble the request as well as the response, so it resembles the experience of a browser. The latency figure we agreed upon was 1 second (1,000 milliseconds). To keep the test as simple as possible we looked at the number of users that would need to hit the same web page simultaneously within a second to produce an average latency of approximately 1 second.
The figure we came to was 28 simultaneous hits: as the screenshot above shows, this leads to an average latency of just over 1 second, with a median (middle sample) of just under 1 second (929 ms) and a last sample of 2,898 ms. This is the result of the CPU load hitting 100% as it tries to render the web page for each of the threads, rather than a limit of the proxy server (BorderManager), as I got similar results while testing from home on a broadband connection with a speed of 6 MB/s.
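Outside JMeter, the same average-latency figure can be approximated with curl, which reports the total time taken per request; a minimal sketch (the averaging step is shown with three illustrative sample times rather than live measurements):

```shell
# Each request's total time in seconds would be collected like this:
# curl -o /dev/null -s -w '%{time_total}\n' "$URL" >> latencies.txt

# Illustrative sample times (seconds), standing in for real measurements
printf '0.93\n1.05\n2.90\n' > latencies.txt

# Average them, mirroring JMeter's average-latency column
awk '{ sum += $1; n++ } END { printf "avg %.2f s over %d requests\n", sum/n, n }' latencies.txt
# → avg 1.63 s over 3 requests
```

Run in a loop against the load-balanced URL, this gives a quick sanity check on the JMeter figures without a full test plan.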
The tests indicate that more simultaneous hits will NOT lead to request timeouts; they will, however, increase the latency to load a web page, which may or may not be acceptable. Amazon's recommendation is that you test your application on the different instance types (small, large, extra large, high-CPU medium and high-CPU extra large) and make a decision on which instance you will deploy. The tests were done on the low-specification instance type (m1.small) running a 1 GHz CPU.
We also looked at using the check_http plug-in from the Nagios project to test the latency from both within our network and from within Amazon's cloud. From within our network the average latency was 300 ms; from within the cloud it was 80 ms. These times were for retrieving a single page.
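check_http reports the response time directly and can alert on it; a sketch against the load-balanced site, with warning and critical thresholds (0.3 s and 1 s) chosen here to match the figures above:

```shell
# Warn above the ~300 ms in-network figure, go critical above the
# agreed 1-second latency target
check_http -H apiloadbalancer-318684055.eu-west-1.elb.amazonaws.com -u / -w 0.3 -c 1.0
```

Because the plug-in exits with Nagios's standard OK/WARNING/CRITICAL codes, the same check can drive both alerting and the event handling discussed below.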
SHORT TERM BENEFITS
From these results we decided that a single instance of the application within Amazon's cloud would be sufficient to run the current implementation of the iPhone application, using the auto-scaling feature of Amazon's cloud to make sure there is always one instance running.
As the latency of the application cannot be monitored directly using Amazon CloudWatch, we could use an application like Nagios to monitor the availability of the web site and measure its latency, then use Nagios event handlers to issue commands that change the number of running instances, increasing capacity when needed.
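A hedged sketch of such an event handler: Nagios passes the service state and state type to the handler, and on a hard CRITICAL from the latency check it would raise the group's desired capacity using the legacy Auto Scaling CLI. The scaling command is echoed here rather than executed, and the group name matches the one created earlier:

```shell
#!/bin/sh
# Decide on a scaling action from the Nagios check state.
# state: OK / WARNING / CRITICAL; state_type: SOFT / HARD
handle_latency_event() {
    state="$1"
    state_type="$2"
    # Only act on HARD states, i.e. after Nagios has re-tested the service
    if [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]; then
        # A real handler would execute this command rather than echo it
        echo "as-set-desired-capacity ApiAutoScalingGroup --desired-capacity 2"
    elif [ "$state" = "OK" ] && [ "$state_type" = "HARD" ]; then
        echo "as-set-desired-capacity ApiAutoScalingGroup --desired-capacity 1"
    fi
}

# A hard CRITICAL from the latency check triggers a scale-up
handle_latency_event CRITICAL HARD
```

Scaling back down on OK keeps us at the single instance the results above showed was sufficient, while still covering spikes.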