8/15/2023 0 Comments

Sqs queue metrics

One of the chief promises of the cloud is fast scalability, but what good is snappy scalability without load prediction to match? How many teams out there are still manually switching group sizes when load spikes? If you would like to make your Amazon EC2 scaling more predictive, less reactive, and hopefully less expensive, it is my intention to help you with this article.

Problem 1: AWS EC2 Autoscaling Groups can only scale in response to metrics in CloudWatch, and most of the default metrics are not sufficient for predictive scaling.

For instance, by looking at the CloudWatch Namespaces reference page we can see that Amazon SQS queues, EC2 instances, and many other Amazon services post metrics to CloudWatch by default. From SQS you get things like NumberOfMessagesSent and SentMessageSize. EC2 instances post metrics like CPUUtilization and DiskReadOps. These metrics are helpful for monitoring, and you could also use them to reactively scale your service. The downside is that by the time you notice that you are using too much CPU or sending too few messages, you are often too late. EC2 instances take time to start up and are billed by the hour, so you are either building up a backlog of work while instances start, or you shut down too late to take advantage of an approaching hour boundary and get charged for a mostly unused instance hour. More predictive scaling would start up instances before the load became business critical, and it would shut down instances as soon as it becomes clear they are not going to be needed, instead of when their workload drops to zero.

Problem 2: AWS CloudWatch default metrics are only published every 5 minutes.

A lot can happen in five minutes; with more granular metrics you could learn about your scaling needs quite a bit faster. Our team has instances that take about 10 minutes to come online, so 5 minutes can make a lot of difference to our responsiveness to changing load.

Solution 1 & 2: Publish your own CloudWatch metrics

Custom metrics can overcome both of these limitations: you can publish metrics related to your service's needs, and you can publish them much more often.

For example, one of our services runs on EC2 instances and processes messages off an SQS queue. The load profile can vary over time; some messages can be handled very quickly, and some take significantly more time. It is not sufficient to simply look at the number of messages in the queue, as the average processing speed can vary between 2 and 60 messages per second depending on the data. We prefer that all our messages be handled within 2 hours of being received. With this in mind, I will describe the metric we publish to easily scale our EC2 instances.

The metric we publish is called ApproximateSecondsToCompleteQueue:

ApproximateSecondsToCompleteQueue = MessagesInQueue / AverageMessageProcessRate

A scheduled executor on our primary instance runs every 15 seconds to calculate and publish it:

    private AmazonCloudWatchClient _cloudWatchClient = new AmazonCloudWatchClient();

    _cloudWatchClient.setRegion(RegionUtils.getRegion("us-east-1"));

    // Note: the namespace value, the MetricDatum wrapper, and the final
    // putMetricData call are illustrative assumptions; only the surrounding
    // fragments survive in the source.
    PutMetricDataRequest request = new PutMetricDataRequest()
            .withNamespace("MyService")
            .withMetricData(new MetricDatum()
                    .withMetricName("ApproximateSecondsToCompleteQueue")
                    .withValue(approximateSecondsToCompleteQueue));
    _cloudWatchClient.putMetricData(request);

The Scoop is a series covering insights, patterns, and trends I observe and hear about within Big Tech and at high growth startups. Have a scoop to share? Send me a message! I treat all such messages as anonymous.

AWS's us-east-1 outage: a deep dive. Amazon's most important region went down for 3 hours, and the whole of the web felt it. Which services and companies were impacted, and what really caused this incident? I spoke with engineers at AWS to get answers.

Why Meta is reducing its number of managers. On a recent podcast, Meta's founder and CEO shared his reasoning for why the tech giant now has fewer managers. I talked with current Meta engineers for their reaction – and give my two cents as well.

An explosion in software engineers using AI coding tools? GitHub surveyed 500 developers in the US for a sense of how they use AI coding tools. I examine the results and add context on how the survey was conducted.

Exclusive. The infrastructure provider cut 8% of staff, and seems to have 'optimized' this process from a business perspective. How did the company go about communicating redundancies, and what would a more humane process have been?

The Scoop sometimes delivers first hand, original reporting. I'm adding an 'Exclusive' label to news that features original reporting direct from my sources, as distinct from analysis, opinion, and reaction to events. Of course, I also analyze what's happening in the tech industry, citing other media sources and quoting them as I dive into trends I observe. These sections do not carry the 'Exclusive' mark.
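To make the queue-backlog arithmetic from the scaling write-up above concrete (ApproximateSecondsToCompleteQueue = MessagesInQueue / AverageMessageProcessRate), here is a minimal sketch in plain Java. This is an illustration, not the service's actual code; the class name, the rolling-average window, and the method names are all assumptions:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper that tracks a rolling average of the processing rate
// and estimates how long the queue would take to drain at that rate.
class QueueBacklogMetric {
    private final Deque<Double> recentRates = new ArrayDeque<>();
    private final int windowSize;

    QueueBacklogMetric(int windowSize) {
        this.windowSize = windowSize;
    }

    // Record the most recently observed processing rate (messages/second).
    void recordProcessRate(double messagesPerSecond) {
        recentRates.addLast(messagesPerSecond);
        if (recentRates.size() > windowSize) {
            recentRates.removeFirst();
        }
    }

    // Average rate over the rolling window; 0.0 if nothing recorded yet.
    double averageProcessRate() {
        if (recentRates.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double rate : recentRates) {
            sum += rate;
        }
        return sum / recentRates.size();
    }

    // Estimated seconds to drain the queue at the current average rate.
    double approximateSecondsToCompleteQueue(long messagesInQueue) {
        double rate = averageProcessRate();
        if (rate <= 0.0) {
            return Double.MAX_VALUE; // no throughput observed yet
        }
        return messagesInQueue / rate;
    }
}
```

For example, 7,200 queued messages at an average of 20 messages per second gives an estimate of 360 seconds; against the 2-hour target mentioned above, a CloudWatch alarm on this metric crossing 7,200 seconds would be the natural scale-out trigger.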
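The scaling write-up above also mentions a scheduled executor that recomputes and publishes the metric every 15 seconds. A hypothetical skeleton of that loop is below; queueDepth and publish are plain-interface stand-ins for the SQS queue-depth read and the CloudWatch publish call, not real SDK calls, and the fixed processing rate is a simplification:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleConsumer;
import java.util.function.LongSupplier;

// Hypothetical publishing loop: every tick, compute the backlog estimate
// from the current queue depth and hand it to the publish callback.
class BacklogMetricPublisher {
    private final LongSupplier queueDepth;   // stand-in for reading queue size from SQS
    private final double averageProcessRate; // messages/second (fixed for illustration)
    private final DoubleConsumer publish;    // stand-in for publishing to CloudWatch

    BacklogMetricPublisher(LongSupplier queueDepth, double averageProcessRate,
                           DoubleConsumer publish) {
        this.queueDepth = queueDepth;
        this.averageProcessRate = averageProcessRate;
        this.publish = publish;
    }

    // One tick of the loop: compute the estimate and publish it.
    void runOnce() {
        double seconds = queueDepth.getAsLong() / averageProcessRate;
        publish.accept(seconds);
    }

    // Run the tick every 15 seconds, as the article's scheduled executor does.
    ScheduledExecutorService start() {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(this::runOnce, 0, 15, TimeUnit.SECONDS);
        return exec;
    }
}
```

Separating the computation (runOnce) from the scheduling makes the tick easy to exercise directly, and swapping the stand-in callbacks for real SQS and CloudWatch clients is then a wiring change rather than a logic change.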