+++
title = "Infrastructure monitoring and testing"
short-description = "Requirements and plans for monitoring and testing the Apertis infrastructure"
weight = 100
outputs = [ "html", "pdf-in",]
date = "2019-08-19"
+++
The Apertis infrastructure is itself a fundamental component of what Apertis
delivers: its goal is to enable developers and product teams to work and
collaborate efficiently, focusing on their value-add rather than starting
from scratch.
This document focuses on the components of the current infrastructure and their
monitoring and testing requirements.
## The Apertis infrastructure
The Apertis infrastructure is composed of a few high-level components:
* GitLab
* OBS
* APT repository
* Artifacts hosting
* LAVA
For the developers, GitLab is the main interface to Apertis. All the source
code is hosted there and all the workflows that tie everything together run as
GitLab CI/CD pipelines, which means that its runners interact with every
other service.
The Open Build Service (OBS) manages the build of every package, dealing with
dependency resolution, pristine environments and multiple architectures. For
each package, GitLab CI/CD pipelines take the source code hosted in Git and
push it to OBS, which then produces binary packages.
The binary packages built by OBS are then published in a repository for APT, to
be consumed by other GitLab CI/CD pipelines.
These pipelines produce the final artifacts, which are then stored and published
by the artifacts hosting service.
At the end of the workflow, LAVA is responsible for executing integration tests
on actual hardware devices for all the artifacts produced.
![](/images/apertis-infrastructure-components.svg)
## Deployment types
The high-level services often involve multiple components that need to be
deployed and managed. This section describes the kinds of deployments that
can be expected.
### Traditional package-based deployments
The simplest services can be deployed using traditional methods: for instance
in basic setups the APT repository and artifacts hosting services only involve
a plain webserver and access via SSH, which can be easily managed by installing
the required packages on a standard virtual machine.
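As an illustration, provisioning such a virtual machine amounts to little more
than installing the packages with the standard tools (the paths below are
purely illustrative):

```sh
# Install a plain webserver and SSH access on a standard Debian/Ubuntu VM
sudo apt update
sudo apt install --yes nginx openssh-server

# Serve the APT repository and the build artifacts from the webserver root;
# the actual locations used by the Apertis infrastructure may differ
sudo mkdir -p /var/www/html/apt /var/www/html/artifacts
```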
Non-autoscaling GitLab Runners and the autoscaling GitLab Runners Manager using
Docker Machine are another example of components that can be set up using
traditional packages.
### Docker containers
An alternative to setting up a dedicated virtual machine is to use services
packaged as single Docker containers.
An example of that is the
[GitLab Omnibus Docker container](https://docs.gitlab.com/omnibus/docker/)
which ships all the components needed to run GitLab in a single Docker image.
The GitLab Runners Manager using Docker Machine may also be deployed as a
Docker container rather than setting up a dedicated VM for it.
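As an example, the upstream documentation describes how to run the all-in-one
image with a single `docker run` invocation along these lines (hostname, ports
and volume paths are placeholders to be adapted):

```sh
# Run the GitLab Omnibus image, persisting configuration, logs and data on the host
sudo docker run --detach \
  --hostname gitlab.example.com \
  --publish 443:443 --publish 80:80 --publish 2222:22 \
  --name gitlab \
  --restart always \
  --volume /srv/gitlab/config:/etc/gitlab \
  --volume /srv/gitlab/logs:/var/log/gitlab \
  --volume /srv/gitlab/data:/var/opt/gitlab \
  gitlab/gitlab-ce:latest
```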
### Docker Compose
More complex services may be available as a set of interconnected Docker
containers to be set up with
[Docker Compose](https://docs.docker.com/compose/).
In particular OBS and LAVA can be deployed with this approach.
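In that case deploying or updating the stack boils down to fetching the
upstream Compose definitions and bringing the services up, roughly as follows
(the checkout location is illustrative):

```sh
# From a checkout of the upstream docker-compose definitions,
# pull the referenced images and (re)start the whole service stack
cd /srv/lava-docker-compose
docker-compose pull
docker-compose up --detach
```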
### Kubernetes Helm charts
As a further abstraction over virtual machines and hand-curated containers,
most cloud providers now offer Kubernetes clusters where multiple components
and services can be deployed as Docker containers with enhanced scaling and
availability capabilities.
The [GitLab cloud native Helm chart](https://docs.gitlab.com/charts/) is the
main example of this approach.
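As a sketch, deploying it on an existing Kubernetes cluster amounts to adding
the chart repository and installing a release; the domain and e-mail values
below are placeholders:

```sh
# Add the official GitLab chart repository and deploy a release on the cluster
helm repo add gitlab https://charts.gitlab.io/
helm repo update
helm upgrade --install gitlab gitlab/gitlab \
  --timeout 600s \
  --set global.hosts.domain=example.com \
  --set certmanager-issuer.email=admin@example.com
```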
## Maintenance, monitoring and testing
These are the goals that drive the infrastructure maintenance:
* ensuring all components are up-to-date, shipping the latest security fixes
and features
* minimizing downtime to avoid blocking users
* reacting on regressions
* keeping the users' data safe
* checking that data across services is coherent
* providing fast recovery after unplanned outages
* verify functionality
* preventing performance degradations that may affect the user experience
* optimizing costs
* testing changes
### Ensuring all components are up-to-date
Users care about services that behave as expected and about being able to use
new features that can lessen their burden.
Deploying updates in a timely fashion is a fundamental step to address this need.
Traditional setups can use tools like
[`unattended-upgrades`](https://wiki.debian.org/UnattendedUpgrades) to
automatically deploy updates as soon as they become available without any
manual intervention.
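On Debian-based systems enabling this typically amounts to installing the
package and turning on the periodic runs, for instance:

```sh
# Install unattended-upgrades and enable the daily automatic runs
sudo apt install --yes unattended-upgrades
sudo dpkg-reconfigure --priority=low unattended-upgrades

# The suites considered for automatic upgrades can then be tuned in
# /etc/apt/apt.conf.d/50unattended-upgrades
```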
For Docker-based deployments the `pull` command needs to be executed to ensure
that the latest images are available and then the services need to be
restarted. Tools like [`watchtower`](https://github.com/containrrr/watchtower)
can help to automate the process.
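Done by hand, the update cycle for a single-container service looks roughly
like the sketch below; watchtower automates the same steps by polling the
registry for new images:

```sh
# Fetch the latest image, then recreate the container on top of it
docker pull gitlab/gitlab-ce:latest
docker stop gitlab && docker rm gitlab
# ...re-run the original `docker run` command to start the updated container...

# Alternatively, let watchtower watch the Docker socket and update
# running containers automatically when new images are published
docker run --detach --name watchtower \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  containrrr/watchtower
```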
However, this kind of automation can be problematic for services where high
availability is required, like GitLab: in case anything goes wrong there may be
a considerable delay before a sysadmin becomes available to investigate and fix
the issue, so explicitly scheduling manual updates is recommended.
### Minimizing downtimes
To minimize the impact of update-related downtime on users, it is recommended
to schedule updates during a window when most users are inactive, for instance
during the weekend.
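For low-risk components where unattended updates are acceptable, the update job
can simply be confined to such a window, for example with a cron entry along
these lines (the update script is hypothetical):

```sh
# /etc/cron.d/weekly-updates: run the update script on Saturdays at 03:00,
# when few users are expected to be active
0 3 * * 6  root  /usr/local/sbin/update-services.sh
```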
### Reacting on regressions
Some updates may fail or introduce regressions that impact users. In those
cases it may be necessary to roll back a component or an entire service to a
previous version.
Rollbacks are usually problematic with traditional package managers, so this
kind of deployment is acceptable only for services where the risk of regressions
is very low, as it is for standard web servers.
Docker-based deployments make this much easier as each image has a unique digest
that can be used to control exactly what gets run.
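For instance, recording the digest of a known-good image makes it possible to
redeploy exactly that version if a later update regresses (the digest below is
a placeholder):

```sh
# Record the digest of the currently running, known-good image
docker image inspect --format '{{index .RepoDigests 0}}' gitlab/gitlab-ce:latest

# Roll back by recreating the container from the pinned digest,
# adding back the usual volume and port options
docker run --detach --name gitlab \
  gitlab/gitlab-ce@sha256:<known-good-digest>
```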
### Keeping the users' data safe
In cloud deployments, object storage services are a common target of attacks.
Care must be taken to ensure all the object storage buckets/accounts have
strict access policies and are not public to prevent data leaks.
Deleting unused buckets/accounts should also be done with care if other
resources point to them: for instance, in some cases it can lead to
[subdomain takeovers](https://www.we45.com/blog/how-an-unclaimed-aws-s3-bucket-escalates-to-subdomain-takeover).
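On AWS S3, for example, public access can be blocked outright at the bucket
level; other cloud providers offer equivalent controls (the bucket name is a
placeholder):

```sh
# Block every form of public access to a bucket holding infrastructure data
aws s3api put-public-access-block \
  --bucket example-apertis-artifacts \
  --public-access-block-configuration \
      BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```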
### Checking that data across services is coherent
TODO: dashboard
### Providing fast recovery after unplanned outages
TODO: hw failures, user error, ransomware
backup and restore
### Verify functionality
TODO: pipelines are somewhat self-testing, pipeline testing pipeline
### Monitoring and communicating availability
Detecting unplanned outages in a timely manner is as important as properly
communicating planned downtimes.
A common approach is to set up a global status page that reports the
availability of each service and provides information to users about incidents
being addressed and planned downtimes.
The status page can be [self-hosted](http://cachethq.io/) or a hosted service
can be used.
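Whatever the choice, the probes behind the status page can be kept very simple:
periodically checking that each public endpoint answers over HTTP already
catches most outages, as in the sketch below (the URLs are illustrative):

```sh
# Probe each public service endpoint and report the ones that fail
for url in https://gitlab.example.org https://obs.example.org https://lava.example.org; do
    if ! curl --silent --fail --max-time 10 --output /dev/null "$url"; then
        echo "DOWN: $url"
    fi
done
```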
### Preventing performance degradations that may affect the user experience
TODO: grafana
### Optimizing costs
TODO: use cheaper VM, move to container services, host multiple apps on the same K8s cluster
### Testing changes
TODO: test environment, ansible, terraform