+++
title = "Infrastructure monitoring and testing"
short-description = "Requirements and plans for monitoring and testing the Apertis infrastructure"
weight = 100
outputs = [ "html", "pdf-in",]
date = "2019-08-19"
+++
The Apertis infrastructure is itself a fundamental component of what Apertis
delivers: its goal is to enable developers and product teams to work and
collaborate efficiently, focusing on their value-add rather than starting
from scratch.
This document focuses on the components of the current infrastructure and their
monitoring and testing requirements.
## The Apertis infrastructure
The Apertis infrastructure is composed of a few high-level components:
* GitLab
* OBS
* APT repository
* Artifacts hosting
* LAVA
![](/images/apertis-infrastructure-components.svg)
From the point of view of developers and product teams, GitLab is the main
interface to Apertis. All the source code is hosted there and all the workflows
that tie everything together run as GitLab CI/CD pipelines, which means that
its runners interact with every other service.
The Open Build Service (OBS) manages the build of every package, dealing with
dependency resolution, pristine environments and multiple architectures. For
each package, GitLab CI/CD pipelines take the source code hosted in Git and
push it to OBS, which then produces binary packages.
The binary packages built by OBS are then published in a repository for APT, to
be consumed by other GitLab CI/CD pipelines.
These pipelines produce the final artifacts, which are then stored and published
by the artifacts hosting service.
At the end of the workflow, LAVA is responsible for executing integration tests
on actual hardware devices for all the artifacts produced.
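As a rough illustration of the source-to-binary step, the snippet below sketches
what a packaging pipeline does using the OBS command-line client `osc`; the OBS
project and package names are purely illustrative and do not match the real
Apertis setup.

```sh
set -e

# Build a Debian source package from the Git checkout of the packaging repository.
dpkg-buildpackage -S -us -uc -d

# Check out the corresponding OBS package, replace its sources with the freshly
# built ones and commit, triggering builds for every configured architecture.
osc checkout apertis:example:target dash
cp ../dash_*.dsc ../dash_*.tar.* apertis:example:target/dash/
cd apertis:example:target/dash
osc addremove
osc commit -m "Update from the GitLab CI/CD pipeline"
```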
## Deployment types
The high-level services often involve multiple components that need to be
deployed and managed. This section describes the kind of deployments that
can be expected.
### Traditional package-based deployments
The simplest services can be deployed using traditional methods: for instance
in basic setups the APT repository and artifacts hosting services only involve
a plain webserver and access via SSH, which can be easily managed by installing
the required packages on a standard virtual machine.
Non-autoscaling GitLab Runners and the autoscaling GitLab Runners Manager using
Docker Machine are another example of components that can be set up using
traditional packages.
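As a sketch, such a setup could be provisioned on a stock Debian virtual machine
along these lines, assuming the upstream `gitlab-runner` package repository has
already been configured; the URL and token are placeholders.

```sh
# Plain webserver and SSH access for the APT repository and artifacts hosting.
sudo apt update
sudo apt install -y nginx openssh-server gitlab-runner

# Register a non-autoscaling runner against the GitLab instance; the
# registration token comes from the GitLab project or admin settings.
sudo gitlab-runner register \
  --non-interactive \
  --url https://gitlab.example.org/ \
  --registration-token "$RUNNER_TOKEN" \
  --executor docker \
  --docker-image debian:stable
```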
### Docker containers
An alternative to setting up a dedicated virtual machine is to use services
packaged as single Docker containers.
An example of that is the
[GitLab Omnibus Docker container](https://docs.gitlab.com/omnibus/docker/)
which ships all the components needed to run GitLab in a single Docker image.
The GitLab Runners Manager using Docker Machine may also be deployed as a
Docker container rather than setting up a dedicated VM for it.
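A minimal deployment, adapted from the upstream documentation, looks roughly
like the following; the hostname and host paths are examples.

```sh
# Run the all-in-one GitLab Omnibus image, persisting configuration, logs and
# data on the host so the container can be replaced without losing state.
sudo docker run --detach \
  --hostname gitlab.example.org \
  --publish 443:443 --publish 80:80 --publish 2222:22 \
  --name gitlab \
  --restart always \
  --volume /srv/gitlab/config:/etc/gitlab \
  --volume /srv/gitlab/logs:/var/log/gitlab \
  --volume /srv/gitlab/data:/var/opt/gitlab \
  gitlab/gitlab-ce:latest
```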
### Docker Compose
More complex services may be available as a set of interconnected Docker
containers to be set up with
[Docker Compose](https://docs.docker.com/compose/).
In particular OBS and LAVA can be deployed with this approach.
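The day-to-day lifecycle of such a deployment is then handled entirely through
Compose; the service name in the last command is illustrative.

```sh
docker-compose pull        # fetch the latest published images
docker-compose up -d       # (re)create and start the containers
docker-compose ps          # check that every component is running
docker-compose logs -f lava-server   # follow the logs of a single component
```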
### Kubernetes Helm charts
As a further abstraction over virtual machines and hand-curated containers,
most cloud providers now offer Kubernetes clusters where multiple components
and services can be deployed as Docker containers with enhanced scaling and
availability capabilities.
The [GitLab cloud native Helm chart](https://docs.gitlab.com/charts/) is the
main example of this approach.
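A deployment on an existing Kubernetes cluster follows the pattern below, based
on the upstream chart documentation; the domain and e-mail address are
placeholders.

```sh
helm repo add gitlab https://charts.gitlab.io/
helm repo update
helm upgrade --install gitlab gitlab/gitlab \
  --set global.hosts.domain=example.org \
  --set certmanager-issuer.email=admin@example.org
```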
## Maintenance, monitoring and testing
These are the goals that drive the infrastructure maintenance:
* ensuring all components are up-to-date, shipping the latest security fixes
and features
* minimizing downtime to avoid blocking users
* reacting to regressions
* keeping the users' data safe
* checking that data across services is coherent
* providing fast recovery after unplanned outages
* verifying functionality
* preventing performance degradations that may affect the user experience
* optimizing costs
* testing changes
### Ensuring all components are up-to-date
Users care about services that behave as expected and about being able to use
new features that can lessen their burden.
Deploying updates in a timely manner is a fundamental step to address this need.
Traditional setups can use tools like
[`unattended-upgrades`](https://wiki.debian.org/UnattendedUpgrades) to
automatically deploy updates as soon as they become available without any
manual intervention.
For Docker-based deployments the `pull` command needs to be executed to ensure
that the latest images are available and then the services need to be
restarted. Tools like [`watchtower`](https://github.com/containrrr/watchtower)
can help to automate the process.
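The snippet below sketches both approaches; the watchtower schedule is an
example and the image names come from the respective upstream projects.

```sh
# Traditional hosts: install and enable unattended-upgrades so security fixes
# are applied automatically.
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# Docker hosts: refresh the images and restart the services manually...
docker-compose pull && docker-compose up -d

# ...or let watchtower handle it on a schedule (here: every night at 04:00).
docker run --detach --name watchtower \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  containrrr/watchtower --schedule "0 0 4 * * *"
```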
However, this kind of automation can be problematic for services where high
availability is required, like GitLab: in case anything goes wrong there may be
a considerable delay before a sysadmin becomes available to investigate and fix
the issue, so explicitly scheduling manual updates is recommended.
### Minimizing downtimes
To minimize the impact of update-related downtime on users, it is recommended
to schedule updates during a window when most users are inactive, for instance
during the weekend.
### Reacting to regressions
Some updates may fail or introduce regressions that impact users. In those
cases it may be necessary to roll back a component or an entire service to a
previous version.
Rollbacks are usually problematic with traditional package managers, so this
kind of deployment is acceptable only for services where the risk of regressions
is very low, as it is for standard web servers.
Docker-based deployments make this much easier as each image has a unique digest
that can be used to control exactly what gets run.
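For instance, pinning and rolling back an image could be done along these lines;
the image name and digest are placeholders.

```sh
# Record the digest of the image currently known to work, so a later update can
# be reverted to this exact state.
docker image inspect --format '{{index .RepoDigests 0}}' gitlab/gitlab-ce:latest

# Roll back by pulling and running the pinned digest instead of the moving tag.
docker pull gitlab/gitlab-ce@sha256:<known-good-digest>
```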
### Keeping the users' data safe
In cloud deployments, object storage services are a common target of attacks.
Care must be taken to ensure all the object storage buckets/accounts have
strict access policies and are not public to prevent data leaks.
Deleting unused buckets/accounts should also be done with care if other
resources point to them: for instance, in some cases it can lead to
[subdomain takeovers](https://www.we45.com/blog/how-an-unclaimed-aws-s3-bucket-escalates-to-subdomain-takeover).
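On AWS, for example, the public access status of a bucket can be audited and
locked down with the standard CLI; the bucket name below is a placeholder and
other cloud providers offer equivalent controls.

```sh
# Check whether the bucket policy makes the bucket public.
aws s3api get-bucket-policy-status --bucket apertis-example-artifacts

# Block any form of public access at the bucket level.
aws s3api put-public-access-block \
  --bucket apertis-example-artifacts \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```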
### Checking that data across services is coherent
With large amounts of data being stored across different interconnected services
it's likely that discrepancies will creep in due to bugs in the automation or
due to human mistakes.
It is thus important to cross-correlate data from different sources to detect
issues and act on them promptly. The
[Apertis infrastructure dashboard](https://infrastructure.pages.apertis.org/dashboard/)
currently provides such an overview, ensuring that the packaging data is consistent
across GitLab, OBS, the APT repository and the upstream sources.
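A very reduced sketch of such a cross-check is shown below, comparing the source
package lists published by the APT repository and by OBS; the URLs, distribution
and project names are illustrative and the real dashboard performs much richer
comparisons.

```sh
# Source packages published in the APT repository.
curl -s https://repositories.example.org/apertis/dists/v2021/target/source/Sources.gz \
  | zcat | awk '/^Package:/ { print $2 }' | sort -u > apt-packages.txt

# Source packages present in the corresponding OBS project.
osc list apertis:example:target | sort -u > obs-packages.txt

# Any difference points to a discrepancy worth investigating.
diff -u obs-packages.txt apt-packages.txt
```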
### Providing fast recovery after unplanned outages
Unplanned outages may happen for a multitude of causes:
* hardware failures
* human mistakes
* ransomware attacks
To mitigate their unavoidable impact, a good backup and restore strategy has to
be devised.
All the service data should be backed up to separate locations to make it
available even in case of infrastructure-wide outages.
For services it is important to be able to re-deploy them quickly: for this
reason it is strongly recommended to follow a
["cattle not pets"](http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/)
approach and be able to deploy new service instances with minimal human intervention.
Docker-based deployment types are strongly recommended since the recovery
procedure only involves the re-download of pre-assembled container images once data
volumes have been restored from backups.
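A recovery could then look roughly like this, with volume and backup names that
are purely illustrative:

```sh
# Re-create the data volume and restore its contents from an off-site backup.
docker volume create lava-server-data
docker run --rm \
  --volume lava-server-data:/data \
  --volume /mnt/backups:/backups:ro \
  debian:stable tar -xf /backups/lava-server-data.tar -C /data

# Pull the pre-assembled images again and restart the services.
docker-compose pull && docker-compose up -d
```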
Traditional approaches instead involve a lengthy reinstallation process even if
automation tools such as Ansible are used, with a significant chance that the
re-provisioned system differs significantly from the original one, requiring a
more intensive revalidation process.
On cloud-based setups it is strongly recommended to use automation tools like
Terraform to be able to quickly re-deploy full services from scratch,
potentially on different cloud accounts or even on different cloud providers.
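With the service definitions kept under version control, re-provisioning then
boils down to a handful of commands; the repository URL is a placeholder.

```sh
git clone https://gitlab.example.org/infrastructure/terraform-definitions.git
cd terraform-definitions
terraform init                     # fetch providers and set up the state backend
terraform plan -out recovery.plan  # preview the resources to be created
terraform apply recovery.plan      # re-create the whole deployment
```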
### Verifying functionality
TODO: pipelines are somewhat self-testing, pipeline testing pipeline
### Monitoring and communicating availability
Detecting unplanned outages in a timely manner is as important as properly
communicating planned downtimes.
A common approach is to set up a global status page that reports the
availability of each service and provides information to users about incidents
being addressed and planned downtimes.
The status page can be [self-hosted](http://cachethq.io/) or a hosted service
can be used.
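Whatever solution is chosen, it needs to be fed with availability data; a
minimal probe could look like the sketch below, with example endpoints, even
though real deployments would rather rely on the monitoring hooks exposed by
each service.

```sh
for url in https://gitlab.example.org/ https://obs.example.org/ https://lava.example.org/; do
  if curl --silent --fail --max-time 10 --output /dev/null "$url"; then
    echo "$url OK"
  else
    echo "$url DOWN"
  fi
done
```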
### Preventing performance degradations that may affect the user experience
TODO: grafana
### Optimizing costs
TODO: use cheaper VM, move to container services, host multiple apps on the same K8s cluster
### Testing changes
TODO: test environment, ansible, terraform