Commit 9e4d463c authored by Martyn Welch, committed by Martyn Welch

Remove closing CI loop document


This concept document describes a possible approach to closing the CI loop
using Jenkins. Apertis is moving away from the use of Jenkins and so this
document is now defunct. Remove it.

Signed-off-by: Martyn Welch <martyn.welch@collabora.com>
parent 25f64346
1 merge request: !66 Evaluate designs
+++
title = "Closing the Automated Continuous Integration Loop"
short-description = "Close the automated CI loop using the existing infrastructure."
weight = 100
aliases = [
"/old-designs/latest/closing-ci-loop.html",
"/old-designs/v2019/closing-ci-loop.html",
"/old-designs/v2020/closing-ci-loop.html",
"/old-designs/v2021dev3/closing-ci-loop.html",
]
outputs = [ "html", "pdf-in",]
date = "2019-09-27"
+++
# Background
The last phase in the current CI workflow is running the automated tests in LAVA,
and there is no mechanism in place to properly process and report these test
results, which leaves the CI loop incomplete.
# Current Issues
The biggest issues are:
- Test results need to be checked manually from LAVA logs and the dashboard.
- Bugs need to be reported manually for test issues.
- Weekly test reports need to be created manually.
This point might only be partially addressed by this concept document, since a
proper data storage for test cases and test results has not been defined.
- There is no mechanism in place to conveniently notify about test issues, so
critical issues can easily be overlooked.
# Proposal
This document proposes a design around the available infrastructure to implement
a solution to close the CI loop.
The document only covers automated tests, and it leaves manual tests for a later
proposal with a more complete solution.
# Benefits of closing the CI loop
Closing the loop will save time and resources by automating the manual
tasks of checking automated test results and reporting the issues they reveal. It
will also provide the infrastructure foundation for further improvements in
tracking the overall project health.
From a design perspective, it will also help to keep a more complete workflow
in place for the whole infrastructure.
Some of the most important benefits:
- Checking automated test results will need minimal or no manual intervention.
- Automated test failures will be reported automatically and promptly.
- It will provide a more consistent and accurate way to track issues found
by automated tests.
- It will help to keep test reports up to date.
- It will provide the infrastructure components that will help to implement
further improvements in tracking the overall project health.
Though the project as a whole will benefit from the above points, some benefits
will be more relevant depending on the project roles and areas. The following
subsections list these benefits for each role.
## Developers and Testers Benefits
- It will save time for developers and testers since they won't need to check
automated test logs and test results manually in order to report issues.
- Developers will be able to notice and work on critical issues much faster,
since failures will be visible sooner.
## Managers Benefits
- Since automated test issues will be reported promptly and more consistently,
managers will be able to make better informed decisions during planning.
## Product Teams Benefits
- The whole CI workflow for automated tests is properly implemented, so it
offers a more complete solution to other teams and projects.
- Closing the CI loop offers a more coherent infrastructure design that other
product teams can adapt to their own needs.
- Product teams will have a better view of the bugs opened in a given period, and
thus a better idea of the overall project health.
# Overview of steps to close the CI loop
This is an overview of the phases required to close the current CI loop:
- Test results should be fetched from LAVA.
- Test results should be processed and optionally saved somewhere.
- Test results should be analyzed.
- Tasks for test issues should be created using the analyzed test results.
# Current Infrastructure
This section explores the different services available in our infrastructure
that are proposed to implement the remaining phases to close the CI loop.
## LAVA User Notifications
As of LAVA V2, there is a new feature called **Notification Callback** which allows
sending a GET or POST request to a specified URL to trigger some action remotely.
When using a POST request, test job information and results can be attached and
sent.
This can be used to send the test results back to Jenkins from LAVA for further
processing in new pipeline phases.
## Jenkins Webhook Plugin
This plugin provides an easy way to block a build pipeline in Jenkins until an
external system posts to a webhook.
This can be used to wait for the automated test results sent by LAVA from a new
Jenkins job responsible for triggering the automated tests.
## Phabricator API
Conduit is the developer API for Phabricator, which can be used to implement
task management.
This API can be used (either with tools or language bindings) to manage Phabricator
tasks from a Jenkins phase in the main pipeline.
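As a rough illustration, the sketch below queries Conduit over HTTP for open tasks
tagged `test-failure`. It is only a sketch under assumptions: the API token is a
placeholder, and the PHP-style form encoding of the `constraints` parameter should
be verified against the Conduit method documentation.

```python
import requests

PHABRICATOR_URL = "https://phabricator.apertis.org/api/maniphest.search"
API_TOKEN = "api-xxxxxxxxxxxxxxxx"  # placeholder: token of the automation user


def find_open_test_failure_tasks():
    """Query Conduit for open tasks tagged `test-failure`.

    The PHP-style form encoding of `constraints` is an assumption and should
    be double-checked against the Conduit method documentation.
    """
    response = requests.post(PHABRICATOR_URL, data={
        "api.token": API_TOKEN,
        "constraints[statuses][0]": "open",
        # The project slug is assumed to be `test-failure`.
        "constraints[projects][0]": "test-failure",
    })
    response.raise_for_status()
    payload = response.json()
    if payload.get("error_code"):
        raise RuntimeError(payload["error_info"])
    return payload["result"]["data"]
```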
## Mattermost
Mattermost is the chat system used by the Apertis project.
Jenkins already offers a plugin to send messages to Mattermost.
This can be used to send notification messages to the chat channels,
for example, to notify the team once a critical test starts failing, or when a bug
has been updated.
# CI Workflow Overview
The main workflow would consist of combining the above-mentioned technologies to
implement the different phases of the main CI pipeline.
A general overview of the steps involved would be:
- Jenkins builds images and triggers LAVA jobs.
- Use the `webHook` pipeline plugin to wait for LAVA test results from Jenkins.
- LAVA executes the automated test jobs and the results are saved in its database.
- LAVA triggers a user notification callback attaching test job information
and results to send to the Jenkins webHook.
- Test results are received by Jenkins through the webHook.
- Test information is sent to a new `pipeline` to process and analyze test
results.
- Once test results are processed and analyzed, they are sent to a new
`pipeline` to manage Phabricator tasks.
- Optionally, a new Jenkins phase could report results to Mattermost or via email.
This complete loop will be executed every time new images are built.
# Fetching Tests Results
The initial and most important phase to close the loop is fetching and processing
the automated test results from LAVA.
The proposed solution in this document is to use the webHook plugin to fetch the
LAVA test results from Jenkins once the automated test job is finished.
Currently, LAVA tests are submitted in the last stage of the Jenkins pipeline job
that creates and publishes the images.
Automated tests are organized in groups, which are submitted all at once using the
`lqa` tool for each image type once the images are published.
A webhook should be registered for each `test job` rather than for a group of
tests, so a change in the way LAVA jobs are submitted is required.
## Jenkins and LAVA Interaction
The proposed solution is to separate the LAVA job submission stage from the main
Jenkins pipeline job building images, and instead have a single Jenkins job that
will take care of triggering the automated tests in LAVA once the images are
published.
The only required fields for the stage submitting the LAVA test jobs are the
`image_name`, `profile_name`, and `version` of the image. A single Jenkins job
could receive these values as arguments and trigger the automated tests for each
of the respective images.
The way LAVA jobs are submitted from Jenkins will also require some changes. The
`lqa` tool currently submits several `groups` of test jobs at once, but since each
test job requires a unique webhook, they will need to be submitted
independently.
One simple solution is to have `lqa` process the job templates first and then
submit each processed job file with a unique webhook.
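A minimal sketch of that idea is shown below, assuming the processed job
definitions are plain YAML files and using LAVA's XML-RPC `scheduler.submit_job`
call. The `register_webhook` helper and the exact layout of the `notify` block are
illustrative only and must follow the LAVA job schema in a real setup.

```python
import xmlrpc.client

import yaml

LAVA_API = "https://lava.collabora.co.uk/RPC2"  # XML-RPC endpoint (authentication omitted)


def submit_with_webhooks(job_files, register_webhook):
    """Submit each processed LAVA job with its own notification callback.

    `register_webhook` is a hypothetical callable returning a fresh webhook
    URL (for example, one registered through the Jenkins webhook plugin).
    """
    server = xmlrpc.client.ServerProxy(LAVA_API)
    submitted = []
    for path in job_files:
        with open(path) as f:
            job = yaml.safe_load(f)
        hook_url = register_webhook()
        # Attach a notification callback to this job only; the structure of
        # the `notify` block here is a sketch of the LAVA job schema.
        job.setdefault("notify", {})["callback"] = {
            "url": hook_url,
            "method": "POST",
            "dataset": "results",
        }
        job_id = server.scheduler.submit_job(yaml.safe_dump(job))
        submitted.append((job_id, hook_url))
    return submitted
```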
Once all test jobs are submitted for a specific image type, the Jenkins executor
will wait for all of their webhooks. This will block the executor, but since the
webhook returns immediately for those jobs that have already posted their results
to the webhook callback, it is fair to say that the executor will only block until
the last completed test job sends its results back to Jenkins.
After all results are received in Jenkins, they can be processed by the remaining
stages required for task management.
## Jenkins Jobs
Since images are built from a single Jenkins job, the most sensible approach for
the final implementation is to have a new Jenkins job receiving the information
for all image types and triggering tests for all of them, then a different job for
processing test results, and possibly another one handling the task management
phases.
# Tasks Management
One of the most important phases in closing the loop is reporting test issues in
Phabricator.
Test issues will be reported automatically in Phabricator as one task per test case
instead of one task per issue. This has an important consequence, explained in the [considerations]( {{< ref "#considerations" >}} ) section.
This section gives an overview of the behaviour of this phase.
## Workflow Overview
Management of Phabricator tasks can be as follows (a minimal sketch of this logic
appears after the list):
1) Query Phabricator to find all open tasks with the tag `test-failure`.
2) Filter the list of received tasks to make sure only the relevant tasks are
processed. For this, checking further specific fields in the task can be
helpful, for example, keeping only tasks whose name follows a specific format.
3) Fetch the analyzed test results.
4) For each test, based on its results and checking the task list, do the
following:
a) Task exists: Add a comment to the task.
b) Task does not exist:
- If the test has status `failed`: Create a new task.
- If the test has status `passed`: Do nothing.
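The sketch below illustrates this per-test decision logic. The `conduit` helpers
(query, comment, create) are hypothetical wrappers around Conduit calls such as the
one sketched in the Phabricator API section, and the result dictionary keys are
assumptions about the processed test data.

```python
def manage_tasks(test_results, conduit):
    """Apply the per-test decision logic described above.

    `test_results` is assumed to be a list of dicts with `name`, `status`
    and `log_url` keys; `conduit` wraps hypothetical Conduit helpers.
    """
    open_tasks = {
        task["fields"]["name"]: task
        for task in conduit.find_open_test_failure_tasks()
        # Step 2: keep only tasks following the agreed title convention.
        if " failed: " in task["fields"]["name"]
    }
    for result in test_results:
        title_prefix = f"{result['name']} failed: "
        task = next((t for name, t in open_tasks.items()
                     if name.startswith(title_prefix)), None)
        if task is not None:
            # Step 4a: an open task already exists, add a comment to it.
            conduit.add_comment(task["id"],
                                f"New result: {result['status']}, "
                                f"see {result['log_url']}")
        elif result["status"] == "failed":
            # Step 4b: no task yet and the test failed, file a new one.
            conduit.create_task(f"{result['name']} failed: automated test failure",
                                tags=["test-failure"],
                                description=result["log_url"])
        # A passing test with no open task needs no action.
```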
## Considerations
- The comment added to the task will contain general information about the failure
with a link to the LAVA job logs.
- Tasks won't be reported per platform but per test case. Once a task for a test
case failure is reported, all platform failures for that test case should be
added as comments to that single task.
- Closing and verifying tasks will still require manual intervention. This will
help to avoid the following corner cases:
- Flaky tests that would otherwise end up in a series of new tasks that get
automatically closed.
- Tests failing on one image that succeed on a different image.
- If a test starts failing again for a previously closed task, a new task will be
created automatically for it, and manual verification is required to check if
it is the same previously reported issue, in which case it is recommended to add
a reference to the old task.
- If, after fixing an issue for a reported task, a new issue arises for the
same test case, the same old task will be updated with this new issue. This
is an effect of reporting tasks per test case instead of per issue. In such a
case, manual verification can be used to confirm whether it is the same issue,
and a new subtask can be manually created by the developer if deemed necessary.
## Phabricator Conventions
To automate the management of Phabricator tasks, certain conventions will need to
be established in Phabricator. This will require minimal manual
intervention.
First of all, a specific user should be created in Phabricator to manage these
tasks automatically.
This user could be named `apertis-qa` or `apertis-tests`, and its only purpose
will be to manage tasks automatically at this stage of the loop.
A special tag and a specific task name format will also be used in tasks
reported by this special user:
- The tag `test-failure` is the special tag for automated test failures.
- The task name will have the format: "{testcasename} failed: <Task title>".
- A `{testcasename}` tag can also be used if it is available for the test.
# Design and Implementation
This section gives a brief overview of the design for the main components to close
the loop.
Each of these components can be developed as independent modules, or as a single
unit containing all the logic. The details and final design of these components
depend on the most convenient approach chosen during implementation.
## Design
### Tests Processor
This will take care of processing the test results as they are received from LAVA.
LAVA test results are sent to Jenkins in a `raw` format, so work
at this level could involve cleaning data or even converting test results to
a new format so they can be more easily processed by the rest of the tools.
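For illustration, assuming the callback delivers results as a list of per-test
dictionaries, a processor could reduce them to the few fields the later stages
need. The key names used here are assumptions about the callback payload and
should be confirmed against the LAVA notification documentation.

```python
def process_lava_results(raw_results, job_url):
    """Reduce raw LAVA result entries to the fields later stages need.

    The `suite`, `name` and `result` keys mirror what the LAVA callback is
    assumed to send; the exact payload layout should be verified.
    """
    processed = []
    for entry in raw_results:
        processed.append({
            "name": f"{entry.get('suite', 'default')}/{entry['name']}",
            "status": "passed" if entry.get("result") == "pass" else "failed",
            "log_url": job_url,
        })
    return processed
```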
### Tests Analyzer
This will make sure that the test results data is in a consistent and convenient
format to be used by the next module (`task manager`).
This can be a new tool or just be part of the `test result processor`, running
in the same Jenkins phase for convenience.
### Tasks Manager
This will receive the full analyzed test results data, and ideally it shouldn't
deal with any test data manipulation.
It will take care of comparing the status of test results against the
Phabricator tasks, deciding the next steps to take and managing those tasks
accordingly.
### Notifier
This can be considered an `optional` component and can involve sending further
forms of notifications to different services, for example, sending messages to
`Mattermost` channels or emails notifying about new critical bugs.
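Although the proposal relies on the Jenkins Mattermost plugin, a standalone
notifier could also post directly to a Mattermost incoming webhook, as in this
sketch; the webhook URL is a placeholder.

```python
import requests

MATTERMOST_HOOK = "https://chat.example.apertis.org/hooks/xxxxxxxx"  # placeholder


def notify_failures(failures):
    """Post a short summary of new critical failures to a channel."""
    if not failures:
        return
    text = "New automated test failures:\n" + "\n".join(
        f"- {f['name']}: {f['log_url']}" for f in failures)
    # Mattermost incoming webhooks accept a JSON body with a `text` field.
    requests.post(MATTERMOST_HOOK, json={"text": text}).raise_for_status()
```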
## Implementation
As originally envisioned, each of the design components could be written using a
scripting language, preferably one that already offers a good integration with
our infrastructure.
The `Python` language is highly recommended, as it already offers plugins for
all the related infrastructure, so it would require minimal effort to integrate
a solution written in this language.
As a suggested environment, Jenkins could be used as the main place to
execute and orchestrate each of the components. They could be executed using a
different pipeline for each phase, or just a single pipeline executing all the
functionality.
For example, once the LAVA results are fetched in Jenkins, a new pipeline phase
receiving test results can be started to execute the `test processor` and
`test analyzer`, which in turn will send the output to a new pipeline phase to
execute the `task manager` and later (if available) to the `notifier`.
## Diagram
This is a diagram explaining the different infrastructure processes involved in
the proposed design to close the CI loop.
![](/images/closing_automated_loop.svg)
# Security Improvement
The Jenkins webhook URL will be visible from the public LAVA test definitions,
which might raise security concerns. For example, another process posting to the
webhook before LAVA does will break the Jenkins job waiting for the test results.
After researching several options to solve this issue, one solution has been found
which consists of Jenkins checking for a protected authorization header sent by
LAVA when posting to the webhook.
This solution requires changes both in the Jenkins plugin and the LAVA code, and
they need to be implemented as part of the solution for closing the CI loop.
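As a rough sketch of the idea (the real change lives in the LAVA and Jenkins
plugin code), the sender attaches a shared secret in a header and the receiver
verifies it before accepting results; the header name and token format here are
illustrative only.

```python
import hmac

import requests

SHARED_SECRET = "change-me"  # placeholder, provisioned on both sides


def post_results(webhook_url, payload):
    # Sender side: attach the shared secret so the receiver can verify it.
    return requests.post(webhook_url, json=payload,
                         headers={"Authorization": f"Token {SHARED_SECRET}"})


def is_authorized(request_headers):
    # Receiver side: constant-time comparison against the expected value.
    supplied = request_headers.get("Authorization", "")
    return hmac.compare_digest(supplied, f"Token {SHARED_SECRET}")
```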
# Implementation
The final implementation for the solution proposed in this document will mainly
involve developing tools that need to be executed in Jenkins and will interact
with the rest of the existing infra services: LAVA, Phabricator and optionally
Mattermost.
All tools and programs will be available from the project git repositories with
their respective documentation, including how to set up and use them.
In addition to this, the final implementation will also include documentation about
how to integrate, use and maintain this solution using the currently available
infrastructure services, so other teams and projects can also make use of it.
# Constraints or Limitations
- Some errors might not be trivially detected for automated tests, since
LAVA can fail in several ways, for example, infra errors sometimes might be
difficult to analyze and will still require manual intervention.
- The `webHook` plugin blocks the Jenkins pipeline. This might be an issue
in the long term and it should be an open point for further research
in later versions of this document or during implementation.
- This document deals with the existing infra, so a proper data storage has
not been defined for test cases and test results. Creation of weekly test
reports will continue to require manual intervention.
- The test definitions for public LAVA jobs are publicly visible. The Jenkins
webhook URL will also be visible in these test definitions, which can be a
security concern. A solution for this issue is proposed in
[security improvement]( {{< ref "#security-improvement" >}} ).
- Closing and verifying tasks will still require manual intervention due to
the points explained in the [considerations]( {{< ref "#considerations" >}} ) section.
# New CI Infrastructure and Workflow
The main changes in the new infrastructure are that test results and test cases
will be stored in SQUAD and Git respectively, and there will be mechanisms in
place to visualise test results and send test issue notifications. The new
infrastructure is defined in the [test data storage document][TestDataStorage].
Manual tests are processed by the new infrastructure, so the new workflow will
also cover closing the CI loop for manual tests.
## Components and Workflow
A new web service can be set up to receive the callback triggered by LAVA at a
specific URL in order to fetch the automated test results, instead of using the
Jenkins webhook plugin. This covers the case where the Jenkins webhook turns
out not to be a suitable solution during implementation, either for the current
CI loop infrastructure or for the new one.
Therefore the following steps will use the term `Tests Processor System` to refer
to the infrastructure in charge of receiving and processing these test results,
which can be set up either in Jenkins or as a new infrastructure service.
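If a standalone service were chosen instead of the Jenkins webhook plugin, a
minimal receiver could look like the following sketch, using Flask purely as an
example framework; the URL path and the placeholder handler are illustrative.

```python
from flask import Flask, request

app = Flask(__name__)


def handle_results(payload):
    """Placeholder for the processing and SQUAD submission steps below."""
    print("received", len(payload.get("results", [])), "results")


@app.route("/lava-callback", methods=["POST"])
def lava_callback():
    # LAVA posts job information and (when requested) results as a JSON
    # body to the callback URL configured in the job's notify block.
    handle_results(request.get_json(force=True))
    return "", 204


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```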
The main components for the new infrastructure can be broadly split into the
following phases: automated tests, manual tests, tasks management, and reporting
and visualization.
### Automated Tests
Workflow for automated tests:
- Jenkins builds images and triggers LAVA jobs.
- LAVA executes the automated test jobs and the results are saved in its database.
- LAVA triggers a user notification callback, attaching test job information
and results, to send to the tests processor system.
- The system opens an HTTP URL to wait for the LAVA callback in order to receive
test results.
- Test results are received by the tests processor system.
- Once test results are received, they are processed with the tool to convert
the test data into the SQUAD format.
- After the data is in the correct format, it is sent to SQUAD using the HTTP
API (a sketch of this submission step follows the list).
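A minimal sketch of that submission step is shown below. The instance URL and
token are placeholders, and the `/api/submit/...` endpoint layout and `Auth-Token`
header follow the upstream SQUAD documentation; both should be verified against
the deployed instance.

```python
import json

import requests

SQUAD_URL = "https://squad.example.apertis.org"  # placeholder instance
SQUAD_TOKEN = "xxxxxxxx"                          # placeholder submission token


def submit_to_squad(results, group, project, build, environment):
    """Submit processed results to SQUAD's submission API.

    `results` is assumed to be a simple {test name: "pass"/"fail"} mapping,
    matching the format the SQUAD submit endpoint expects.
    """
    url = f"{SQUAD_URL}/api/submit/{group}/{project}/{build}/{environment}"
    response = requests.post(url,
                             headers={"Auth-Token": SQUAD_TOKEN},
                             data={"tests": json.dumps(results)})
    response.raise_for_status()
```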
### Manual Tests
Test results will be entered manually by the tester using a new application,
in this workflow named `Test Submitter Application`.
This application will prompt the tester to enter each manual test result, and
will send the data to the SQUAD backend, as explained in the
[test data storage document][TestDataStorage]. A minimal sketch of such a
submitter appears after the workflow below.
The following workflow includes the processing of manual test results into the
CI loop:
- Tester manually executes test cases.
- Tester enters test results into the test submitter application.
- The application sends the test data to the tests processor system using a
reliable network protocol.
- Test results are received by the tests processor system infrastructure.
- Once test results are received, they are processed with the tools to convert
the test data into the SQUAD format.
- After the data is in the correct format, it is sent to SQUAD using the HTTP
API.
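As mentioned above, a command-line variant of the `Test Submitter App` could be as
small as the following sketch, which collects a verdict per test case and hands the
results to the same submission helper sketched earlier; the test case names and
verdict vocabulary are assumptions.

```python
def collect_manual_results(test_cases):
    """Prompt the tester for a verdict on each manual test case."""
    results = {}
    for name in test_cases:
        verdict = ""
        while verdict not in ("pass", "fail", "skip"):
            verdict = input(f"{name} [pass/fail/skip]: ").strip().lower()
        if verdict != "skip":
            results[name] = verdict
    return results


# Example usage: gather verdicts and hand them to the same SQUAD submission
# helper sketched in the automated-tests workflow (names are hypothetical).
# results = collect_manual_results(["sanity-boot", "connman-wifi"])
# submit_to_squad(results, "apertis", "manual-tests", "v2019.0", "amd64")
```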
### Tasks Management
This phase deals with processing the test results data in order to file and manage
Phabricator tasks and send notifications.
- Once all test results are stored in the SQUAD backend, they might still need
to be processed by other phases in the tests processor system, and sent to a
new phase to manage Phabricator tasks.
- The new Phabricator phase uses the test data to file new tasks following the
logic explained in the [tasks management]( {{< ref "#tasks-management" >}} ) section.
- The same phase or a new one could report results to Mattermost or via email.
### Reporting and Visualization
A new web application dashboard will be used to view test results and generate
reports and graphical statistics.
This web application will fetch results from the SQUAD backend and will process
them to generate the relevant statistics and graphics.
The weekly test report will be generated either periodically or at any time as
needed using this web application dashboard.
More details can be found in the [reporting and visualization document][TestDataReporting].
## General Workflow Overview
This section gives an overview of the complete workflow in the following
steps:
- Automated tests and manual tests are executed in different environments.
- Automated tests are executed in LAVA and results are sent to the `HTTP URL`
service opened by the `Tests Processor System` to receive the LAVA callback
carrying the test results.
- Manual tests are executed by the tester. The tester uses the `Test Submitter
App` to collect test results and send them to the `Tests Processor System`
using a reliable network protocol for data transfer.
- All test results are processed and converted to the SQUAD JSON format by the
`Test Processor and Analyzer`.
- Once test results are in the correct format, they are sent to the SQUAD backend
using the SQUAD HTTP API.
- Test results might still need to be processed by the `Test Processor and
Analyzer` in order to be sent to the new phases. Once results are processed,
they are passed to the `Task Manager` and `Notification System` phases to
manage Phabricator tasks and send email or Mattermost notifications,
respectively.
- From the SQUAD backend, the new `Web Application Dashboard` fetches test
results periodically or as needed to generate test result views, graphical
statistics, and reports.
The following diagram illustrates the above workflow:
![](/images/new_infra_ci_loop.svg)
## New Infrastructure Migration Steps
- Set up a SQUAD instance. This can be done using a Docker image, so the setup
should be very straightforward and convenient to replicate downstream.
- Extend the current `Test Processor System` to submit results to SQUAD. This
basically consists of using the SQUAD URL API to submit the test data.
- Convert the testcases from the wiki format to the strictly defined YAML format.
- Write an application to render the YAML testcases, guide testers through them and
provide them with a form to submit their results. This is the `Test Submitter App`
and can be developed as either a web frontend or a command line tool.
- Write the reporting web application which fetches results from SQUAD and renders
reports. This is the `Web App Dashboard` and it will be developed using existing
modules and frameworks, so that deployment and maintenance can be done in the same
way as for other infrastructure services.
## Maintenance Impact
The new components required for the new infrastructure are the `Test Submitter`,
`Web Application Dashboard` and SQUAD, along with some changes needed for the
`Test Processor System` to receive the manual test results and send test data
to SQUAD.
SQUAD is an upstream dashboard that can be deployed using Docker, so it can be
conveniently used by other projects and its maintenance effort won't be greater
than for other infrastructure services.
The test submitter and web application dashboard will be developed reusing existing
modules and frameworks for each of their functionalities; they mainly need to use
already well-defined APIs to interact with the rest of the services, and they will
be designed in such a way that they can be conveniently deployed (for example, using
Docker). They are not expected to be large applications, so maintenance should be
comparable to other tools in the project.
The test processor is a system tool, developed in a modular way, so each component
can reuse existing modules or libraries to implement the required functionality,
for example, making use of an existing HTTP module to access the SQUAD URL API. It
won't require a big maintenance effort and it will be practically the same as for
other infrastructure tools in the project.
Converting the test cases to the new YAML format can be done manually, and a small
tool can be used to assist with the format migration (for example, to sanitize the
format). This should be a one-time task, so no further maintenance is involved.
# Links
LAVA Notification Callback:
- https://lava.collabora.co.uk/static/docs/v2/user-notifications.html#notification-callback
Jenkins Webhook Plugin:
- https://wiki.jenkins.io/display/JENKINS/Webhook+Step+Plugin
Phabricator API:
- https://phabricator.apertis.org/conduit
Mattermost Jenkins Plugin:
- https://wiki.jenkins.io/display/JENKINS/Mattermost+Plugin
[TestDataStorage]: test-data-storage.md
[TestDataReporting]: test-data-reporting.md