Prometheus + Victoria Metrics on AWS ECS
Supporting our HA / fault-tolerant Prometheus infrastructure required a long-term storage and data de-duplication solution.
We looked into Cortex, Thanos and Victoria Metrics.
Cortex is complex to set up, especially for our small environment. It also requires a handful of third-party services:
Postgres
Memcache
Consul
Bigtable / Cassandra / DynamoDB
S3 / GCS
Thanos only needs S3 as a third-party service. It is slightly less complex to set up and maintain than Cortex, but it pulls its metrics from Prometheus and has reliability and availability challenges:
https://medium.com/faun/comparing-thanos-to-victoriametrics-cluster-b193bea1683
Another factor was cost. Between the additional third-party services and the storage and compute demanded by their more complex architectures, Thanos and Cortex are typically more expensive to run than other solutions.
We stumbled upon Victoria Metrics.
Victoria Metrics is open source and has a community edition, an active community, and excellent customer service / technical support.
They have a clustered solution with only three components:
vmstorage (stores data)
vmselect (selects data)
vminsert (writes data into storage)
A much shorter and simpler list than what Thanos or Cortex require.
It also supports PromQL out of the box, making integration with Grafana easy.
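For reference, here is a rough sketch of how those three components wire together; the hostname and data path below are placeholders and the ports are the upstream defaults, not our configuration:
# vmstorage persists the data and exposes ports for vminsert (8400) and vmselect (8401)
/vmstorage-prod -storageDataPath=/vm-data
# vminsert accepts writes and routes them to the vmstorage nodes
/vminsert-prod -storageNode=vmstorage-host:8400
# vmselect serves PromQL queries against the vmstorage nodes
/vmselect-prod -storageNode=vmstorage-host:8401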
We wanted to keep cost, complexity and setup time down, so we decided to use their single-node solution, which combines the services above into a single container.
For reasons we will discuss in a future blog post, we decided to deploy this solution into our AWS ECS infrastructure.
Since Victoria Metrics uses block storage (commodity HDD, EBS in our case) rather than object storage or network storage, configuring the environment required a few extra steps.
We tested JuiceFS and ObjectiveFS as ways to back the storage with S3, but both showed poor random-read performance.
So ST1 EBS it is!
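If you want to reproduce that kind of comparison yourself, a quick random-read check with fio (the mount path is a placeholder for whichever filesystem you are evaluating) looks roughly like this:
fio --name=randread --directory=/mnt/candidate-fs --rw=randread --bs=4k --size=1g --numjobs=4 --time_based --runtime=60 --group_reporting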
In addition to the usual AWS ECS setup, we had to add the following to our Terraform:
docker plugin install rexray/ebs REXRAY_PREEMPT=true EBS_REGION=${region} --grant-all-permissions
stop ecs
start ecs
Why? Well, we need to provision a persistent EBS volume for the container. To do so, we used the REX-Ray storage driver plugin, which has to be installed on the EC2 hosts, and the ECS agent has to be restarted so it picks up the new volume driver.
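In Terraform, that boils down to running those commands in the container instance user data. A sketch, with illustrative resource and variable names rather than our exact code:
resource "aws_launch_configuration" "ecs" {
  # ... AMI, instance type, IAM instance profile, etc. ...
  user_data = <<-EOF
    #!/bin/bash
    # usual ECS agent config
    echo ECS_CLUSTER=${var.ecs_cluster_name} >> /etc/ecs/ecs.config
    # install the REX-Ray EBS plugin, then restart the ECS agent so it sees the new volume driver
    docker plugin install rexray/ebs REXRAY_PREEMPT=true EBS_REGION=${var.region} --grant-all-permissions
    stop ecs
    start ecs
  EOF
}
In our task definition, we then had to define this new volume and mount it into the container: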
"mountPoints": [
  {
    "sourceVolume": "victoria-vol",
    "containerPath": "/victoria",
    "readOnly": false
  }
],
"volumes": [
  {
    "name": "victoria-vol",
    "dockerVolumeConfiguration": {
      "driver": "rexray/ebs",
      "scope": "shared",
      "autoprovision": true,
      "driverOpts": {
        "size": "10",
        "volumetype": "st1"
      }
    }
  }
]
This auto-provisions a 10 GB ST1 EBS volume and mounts it read-write at /victoria in the container.
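If you want to sanity-check the plugin from an instance before wiring up the task definition, the equivalent manual commands would be something along these lines:
docker volume create --driver rexray/ebs --name victoria-vol --opt size=10 --opt volumetype=st1
docker volume ls   # the new volume should show up with the rexray/ebs driver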
The final challenge: since an EBS volume is tied to a single Availability Zone, we needed to make sure the container would always come up in that AZ. We use a placement constraint for this:
"placement_constraints": [
  {
    "type": "memberOf",
    "expression": "attribute:ecs.availability-zone in [us-east-1c]"
  }
]
You may be thinking: you said this was part of an HA/FT solution, so now you are relying on a single container tied to a single AZ? No… :) We will run a container in each AZ, and Prometheus will push metrics to each AZ.
We will also use their built-in backup tooling to back up to S3 for disaster recovery.
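On the Prometheus side, that just means one remote_write entry per AZ. A sketch of the prometheus.yml snippet, with placeholder hostnames (the /victoria prefix matches the http.pathPrefix flag used further down):
remote_write:
  - url: http://victoria-us-east-1c.internal:8428/victoria/api/v1/write
  - url: http://victoria-us-east-1d.internal:8428/victoria/api/v1/write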
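The backup flow is sketched below; the bucket name is a placeholder, and vmbackup reads the snapshot straight from the data directory, so it needs the same volume mounted:
# ask the running instance for a consistent snapshot (note the /victoria path prefix)
curl http://localhost:8428/victoria/snapshot/create
# upload that snapshot to S3
vmbackup -storageDataPath=/victoria -snapshotName=<snapshot name returned above> -dst=s3://our-vm-backups/victoria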
We used the single-node solution, which is a single self-contained binary without any dependencies. It is configured with a few command-line flags, while the rest of the configuration has sane defaults and shouldn't need to be touched in most cases.
Here is our Dockerfile:
FROM victoriametrics/victoria-metrics
EXPOSE 8428
ENTRYPOINT ["/victoria-metrics-prod"]
CMD ["--http.pathPrefix", "/victoria", "--storageDataPath", "/victoria"]
Once built and deployed, it came up very quickly and we were able to pull metrics from http://host/victoria/metrics.
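A couple of quick smoke tests against the endpoint (the hostname is a placeholder):
curl http://host/victoria/metrics                     # Victoria Metrics' own internal metrics
curl 'http://host/victoria/api/v1/query?query=up'     # PromQL over the ingested data, the same API Grafana uses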
For a small and simple Prometheus setup, I highly recommend Victoria Metrics. It was simple to set up, and I hope this article helps you deploy this solution in your own AWS ECS cluster.
Follow-up, 3/7/21
Support for NVMe requires a udev rule to alias the NVMe device to the path REX-Ray expects as a mount point. A similar udev rule is built into the Amazon Linux AMI already, and is trivial to add to other Linux distributions.
The following is an example of the udev rule that must be in place:
# /etc/udev/rules.d/999-aws-ebs-nvme.rules
# ebs nvme devices
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/local/bin/ebs-nvme-mapping /dev/%k", SYMLINK+="%c"
We had to reload the udev rules on the EC2 instances without rebooting. We did that by running udevadm control --reload-rules && udevadm trigger.
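Putting the follow-up together, the NVMe fix can be baked into the instance bootstrap roughly like this (it assumes the ebs-nvme-mapping helper referenced by the rule is already installed at /usr/local/bin):
# write the udev rule shown above
cat > /etc/udev/rules.d/999-aws-ebs-nvme.rules <<'EOF'
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/local/bin/ebs-nvme-mapping /dev/%k", SYMLINK+="%c"
EOF
# apply it without a reboot
udevadm control --reload-rules && udevadm trigger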
You may have stumbled upon warnings about NVMe support with this setup, but the above fixes worked for us on both C4 and C5 class EC2 instances.