+++
title = "Infrastructure monitoring and testing"
short-description = "Requirements and plans for monitoring and testing the Apertis infrastructure"
weight = 100
outputs = [ "html", "pdf-in",]
date = "2019-08-19"
+++
The Apertis infrastructure is itself a fundamental component of what Apertis
delivers: its goal is to enable developers and product teams to work and
collaborate efficiently, focusing on their value-add rather than starting
from scratch.
This document focuses on the components of the current infrastructure and their
monitoring and testing requirements.
## The Apertis infrastructure
The Apertis infrastructure is composed of a few high-level components:
* GitLab
* OBS
* APT repository
* Artifacts hosting
* LAVA
![](/images/apertis-infrastructure-components.svg)
From the point of view of developers and product teams, GitLab is the main
interface to Apertis. All the source code is hosted there and all the workflows
that tie everything together run as GitLab CI/CD pipelines, which means that
its runners interact with every other service.
The Open Build Service (OBS) manages the build of every package, dealing with
dependency resolution, pristine environments and multiple architectures. For
each package, GitLab CI/CD pipelines take the source code hosted in Git and
push it to OBS, which then produces binary packages.
The binary packages built by OBS are then published in a repository for APT, to
be consumed by other GitLab CI/CD pipelines.
These pipelines produce the final artifacts, which are then stored and published
by the artifacts hosting service.
At the end of the workflow, LAVA is responsible for executing integration tests
on actual hardware devices for all the artifacts produced.
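As a rough illustration of the source-to-binary step, the snippet below sketches
what a packaging pipeline does using the OBS command-line client `osc`; the OBS
project and package names are purely illustrative and do not match the real
Apertis setup.

```sh
set -e

# Build a Debian source package from the Git checkout of the packaging repository.
dpkg-buildpackage -S -us -uc -d

# Check out the corresponding OBS package, replace its sources with the freshly
# built ones and commit, triggering builds for every configured architecture.
osc checkout apertis:example:target dash
cp ../dash_*.dsc ../dash_*.tar.* apertis:example:target/dash/
cd apertis:example:target/dash
osc addremove
osc commit -m "Update from the GitLab CI/CD pipeline"
```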
## Deployment types
The high-level services often involve multiple components that need to be
deployed and managed. This section describes the kind of deployments that
can be expected.
### Traditional package-based deployments
The simplest services can be deployed using traditional methods: for instance
in basic setups the APT repository and artifacts hosting services only involve
a plain webserver and access via SSH, which can be easily managed by installing
the required packages on a standard virtual machine.
Non-autoscaling GitLab Runners and the autoscaling GitLab Runners Manager using
Docker Machine are another example of components that can be set up using
traditional packages.
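As a sketch, such a setup could be provisioned on a stock Debian virtual machine
along these lines, assuming the upstream `gitlab-runner` package repository has
already been configured; the URL and token are placeholders.

```sh
# Plain webserver and SSH access for the APT repository and artifacts hosting.
sudo apt update
sudo apt install -y nginx openssh-server gitlab-runner

# Register a non-autoscaling runner against the GitLab instance; the
# registration token comes from the GitLab project or admin settings.
sudo gitlab-runner register \
  --non-interactive \
  --url https://gitlab.example.org/ \
  --registration-token "$RUNNER_TOKEN" \
  --executor docker \
  --docker-image debian:stable
```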
### Docker containers
An alternative to setting up a dedicated virtual machine is to use services
packaged as single Docker containers.
An example of that is the
[GitLab Omnibus Docker container](https://docs.gitlab.com/omnibus/docker/)
which ships all the components needed to run GitLab in a single Docker image.
The GitLab Runners Manager using Docker Machine may also be deployed as a
Docker container rather than setting up a dedicated VM for it.
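A minimal deployment, adapted from the upstream documentation, looks roughly
like the following; the hostname and host paths are examples.

```sh
# Run the all-in-one GitLab Omnibus image, persisting configuration, logs and
# data on the host so the container can be replaced without losing state.
sudo docker run --detach \
  --hostname gitlab.example.org \
  --publish 443:443 --publish 80:80 --publish 2222:22 \
  --name gitlab \
  --restart always \
  --volume /srv/gitlab/config:/etc/gitlab \
  --volume /srv/gitlab/logs:/var/log/gitlab \
  --volume /srv/gitlab/data:/var/opt/gitlab \
  gitlab/gitlab-ce:latest
```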
### Docker Compose
More complex services may be available as a set of interconnected Docker
containers to be set up with
[Docker Compose](https://docs.docker.com/compose/).
In particular OBS and LAVA can be deployed with this approach.
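The day-to-day lifecycle of such a deployment is then handled entirely through
Compose; the service name in the last command is illustrative.

```sh
docker-compose pull        # fetch the latest published images
docker-compose up -d       # (re)create and start the containers
docker-compose ps          # check that every component is running
docker-compose logs -f lava-server   # follow the logs of a single component
```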
### Kubernetes Helm charts
As a further abstraction over virtual machines and hand-curated containers,
most cloud providers now offer Kubernetes clusters where multiple components
and services can be deployed as Docker containers with enhanced scaling and
availability capabilities.
The [GitLab cloud native Helm chart](https://docs.gitlab.com/charts/) is the
main example of this approach.
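A deployment on an existing Kubernetes cluster follows the pattern below, based
on the upstream chart documentation; the domain and e-mail address are
placeholders.

```sh
helm repo add gitlab https://charts.gitlab.io/
helm repo update
helm upgrade --install gitlab gitlab/gitlab \
  --set global.hosts.domain=example.org \
  --set certmanager-issuer.email=admin@example.org
```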
## Maintenance, monitoring and testing
These are the goals that drive the infrastructure maintenance:
* ensuring all components are up-to-date, shipping the latest security fixes
and features
* minimizing downtime to avoid blocking users
* reacting to regressions
* keeping the users' data safe
* checking that data across services is coherent
* providing fast recovery after unplanned outages
* verifying functionality
* preventing performance degradations that may affect the user experience
* optimizing costs
* testing changes
### Ensuring all components are up-to-date
Users care about services that behave as expected and about being able to use
new features that can lessen their burden.
Deploying updates in a timely manner is a fundamental step to address this need.
Traditional setups can use tools like
[`unattended-upgrades`](https://wiki.debian.org/UnattendedUpgrades) to
automatically deploy updates as soon as they become available without any
manual intervention.
For Docker-based deployments the `pull` command needs to be executed to ensure
that the latest images are available and then the services need to be
restarted. Tools like [`watchtower`](https://github.com/containrrr/watchtower)
can help to automate the process.
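The snippet below sketches both approaches; the watchtower schedule is an
example and the image names come from the respective upstream projects.

```sh
# Traditional hosts: install and enable unattended-upgrades so security fixes
# are applied automatically.
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# Docker hosts: refresh the images and restart the services manually...
docker-compose pull && docker-compose up -d

# ...or let watchtower handle it on a schedule (here: every night at 04:00).
docker run --detach --name watchtower \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  containrrr/watchtower --schedule "0 0 4 * * *"
```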
However, this kind of automation can be problematic for services where high
availability is required, like GitLab: in case anything goes wrong there may be
a considerable delay before a sysadmin becomes available to investigate and fix
the issue, so explicitly scheduling manual updates is recommended.
### Minimizing downtimes
To minimize the impact of update-related downtime on users, it is recommended
to schedule updates during a window when most users are inactive, for instance
during the weekend.
### Reacting to regressions
Some updates may fail or introduce regressions that impact users. In those
cases it may be necessary to roll back a component or an entire service to a
previous version.
Rollbacks are usually problematic with traditional package managers, so this
kind of deployment is acceptable only for services where the risk of regressions
is very low, as it is for standard web servers.
Docker-based deployments make this much easier as each image has a unique digest
that can be used to control exactly what gets run.
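For instance, pinning and rolling back an image could be done along these lines;
the image name and digest are placeholders.

```sh
# Record the digest of the image currently known to work, so a later update can
# be reverted to this exact state.
docker image inspect --format '{{index .RepoDigests 0}}' gitlab/gitlab-ce:latest

# Roll back by pulling and running the pinned digest instead of the moving tag.
docker pull gitlab/gitlab-ce@sha256:<known-good-digest>
```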
### Keeping the users' data safe
In cloud deployments, object storage services are a common target of attacks.
Care must be taken to ensure all the object storage buckets/accounts have
strict access policies and are not public to prevent data leaks.
Deleting unused buckets/accounts should also be done with care if other
resources point to them: for instance, in some cases it can lead to
[subdomain takeovers](https://www.we45.com/blog/how-an-unclaimed-aws-s3-bucket-escalates-to-subdomain-takeover).
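On AWS, for example, the public access status of a bucket can be audited and
locked down with the standard CLI; the bucket name below is a placeholder and
other cloud providers offer equivalent controls.

```sh
# Check whether the bucket policy makes the bucket public.
aws s3api get-bucket-policy-status --bucket apertis-example-artifacts

# Block any form of public access at the bucket level.
aws s3api put-public-access-block \
  --bucket apertis-example-artifacts \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```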
### Checking that data across services is coherent
With large amounts of data being stored across different interconnected services
it's likely that discrepancies will creep in due to bugs in the automation or
due to human mistakes.
It is thus important to cross-correlate data from different sources to detect
issues and act on them promptly. The
[Apertis infrastructure dashboard](https://infrastructure.pages.apertis.org/dashboard/)
currently provides such an overview, ensuring that the packaging data is consistent
across GitLab, OBS, the APT repository and the upstream sources.
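A very reduced sketch of such a cross-check is shown below, comparing the source
package lists published by the APT repository and by OBS; the URLs, distribution
and project names are illustrative and the real dashboard performs much richer
comparisons.

```sh
# Source packages published in the APT repository.
curl -s https://repositories.example.org/apertis/dists/v2021/target/source/Sources.gz \
  | zcat | awk '/^Package:/ { print $2 }' | sort -u > apt-packages.txt

# Source packages present in the corresponding OBS project.
osc list apertis:example:target | sort -u > obs-packages.txt

# Any difference points to a discrepancy worth investigating.
diff -u obs-packages.txt apt-packages.txt
```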
### Providing fast recovery after unplanned outages
Unplanned outages may happen for a multitude of causes:
* hardware failures
* human mistakes
* ransomware attacks
To mitigate their unavoidable impact, a good backup and restore strategy has to
be devised.
All the service data should be backed up to separate locations to make it
available even in case of infrastructure-wide outages.
For services it is important to be able to re-deploy them quickly: for this
reason it is strongly recommended to follow a
["cattle not pets"](http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/)
approach and be able to deploy new service instances with minimal human intervention.
Docker-based deployment types are strongly recommended since the recovery
procedure only involves the re-download of pre-assembled container images once data
volumes have been restored from backups.
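A recovery could then look roughly like this, with volume and backup names that
are purely illustrative:

```sh
# Re-create the data volume and restore its contents from an off-site backup.
docker volume create lava-server-data
docker run --rm \
  --volume lava-server-data:/data \
  --volume /mnt/backups:/backups:ro \
  debian:stable tar -xf /backups/lava-server-data.tar -C /data

# Pull the pre-assembled images again and restart the services.
docker-compose pull && docker-compose up -d
```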
Traditional approaches instead involve a lengthy reinstallation process even if
automation tools such as Ansible are used, with a significant chance that the
re-provisioned system differs significantly from the original one, requiring a
more intensive revalidation process.
On cloud-based setups it is strongly recommended to use automation tools like
Terraform to be able to quickly re-deploy full services from scratch,
potentially on different cloud accounts or even on different cloud providers.
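With the service definitions kept under version control, re-provisioning then
boils down to a handful of commands; the repository URL is a placeholder.

```sh
git clone https://gitlab.example.org/infrastructure/terraform-definitions.git
cd terraform-definitions
terraform init                     # fetch providers and set up the state backend
terraform plan -out recovery.plan  # preview the resources to be created
terraform apply recovery.plan      # re-create the whole deployment
```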
### Verifying functionality
TODO: pipelines are somewhat self-testing, pipeline testing pipeline
### Monitoring and communicating availability
Detecting unplanned outages in a timely manner is as important as properly
communicating planned downtimes.
A common approach is to set up a global status page that reports the
availability of each service and provides information to users about incidents
being addressed and planned downtimes.
The status page can be [self-hosted](http://cachethq.io/) or a hosted service
can be used.
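Whatever solution is chosen, it needs to be fed with availability data; a
minimal probe could look like the sketch below, with example endpoints, even
though real deployments would rather rely on the monitoring hooks exposed by
each service.

```sh
for url in https://gitlab.example.org/ https://obs.example.org/ https://lava.example.org/; do
  if curl --silent --fail --max-time 10 --output /dev/null "$url"; then
    echo "$url OK"
  else
    echo "$url DOWN"
  fi
done
```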
### Preventing performance degradations that may affect the user experience
TODO: grafana
### Optimizing costs
TODO: use cheaper VM, move to container services, host multiple apps on the same K8s cluster
### Testing changes
TODO: test environment, ansible, terraform