diff --git a/content/designs/closing-ci-loop.md b/content/designs/closing-ci-loop.md
deleted file mode 100644
index 3ba0bc310399030892686e907e3230dd6ca3a666..0000000000000000000000000000000000000000
--- a/content/designs/closing-ci-loop.md
+++ /dev/null
@@ -1,596 +0,0 @@
-+++
-title = "Closing the Automated Continuous Integration Loop"
-short-description = "Close the automated CI loop using the existing infrastructure."
-weight = 100
-aliases = [
-  "/old-designs/latest/closing-ci-loop.html",
-  "/old-designs/v2019/closing-ci-loop.html",
-  "/old-designs/v2020/closing-ci-loop.html",
-  "/old-designs/v2021dev3/closing-ci-loop.html",
-]
-outputs = [ "html", "pdf-in",]
-date = "2019-09-27"
-+++
-
-# Background
-
-The last phase in the current CI workflow is running the LAVA automated tests,
-and there is no mechanism in place to properly process and report the results
-of these tests, which leaves the CI loop incomplete.
-
-# Current Issues
-
-The biggest issues are:
-
-  - Test results need to be checked manually in the LAVA logs and dashboard.
-
-  - Bugs need to be reported manually for test issues.
-
-  - Weekly test reports need to be created manually.
-    This point may only be partially addressed by this concept document, since
-    a proper data storage for test cases and test results has not been defined.
-
-  - There is no mechanism in place to conveniently notify about test issues.
-    Critical issues can easily be overlooked.
-
-# Proposal
-
-This document proposes a design around the available infrastructure to
-implement a solution that closes the CI loop.
-
-The document only covers automated tests, leaving manual tests for a later
-proposal with a more complete solution.
-
-# Benefits of closing the CI loop
-
-Closing the loop will save time and resources by automating the manual tasks of
-checking automated test results and reporting the issues they reveal. It will
-also provide the infrastructure foundation for further improvements in tracking
-the overall project health.
-
-From a design perspective, it will also help to keep a more complete workflow
-in place for the whole infrastructure.
-
-Some of the most important benefits:
-
-  - Checking automated test results will need minimal or no manual intervention.
-  - Automated test failures will be reported automatically and on time.
-  - It will provide a more consistent and accurate way to track issues found
-    by automated tests.
-  - It will help to keep test reports up to date.
-  - It will provide the infrastructure components that will help to implement
-    further improvements in tracking the overall project health.
-
-Though the project as a whole will benefit from the above points, some benefits
-are more relevant to specific project roles and areas. The following subsections
-list these benefits for each role.
-
-## Developers and Testers Benefits
-
-  - It will save time for developers and testers, since they won't need to
-    check automated test logs and test results manually in order to report
-    issues.
-  - Developers will be able to notice and work on critical issues much faster,
-    since failures will become visible sooner.
-
-## Managers Benefits
-
-  - Since automated test issues will be reported on time and more consistently,
-    managers will be able to make more accurate decisions during planning.
-
-## Product Teams Benefits
-
-  - The whole CI workflow for automated tests is properly implemented, so it
-    offers a more complete solution to other teams and projects.
-  - Closing the CI loop offers a more coherent infrastructure design that
-    other product teams can adapt to their own needs.
-  - Product teams will have a better view of the bugs opened in a given period,
-    and thus a better idea of the overall project health.
-
-# Overview of steps to close the CI loop
-
-This is an overview of the phases required to close the current CI loop:
-
-  - Test results should be fetched from LAVA.
-  - Test results should be processed and optionally saved somewhere.
-  - Test results should be analyzed.
-  - Tasks for test issues should be created from the analyzed test results.
-
-# Current Infrastructure
-
-This section explores the different services available in our infrastructure
-that are proposed to implement the remaining phases needed to close the CI
-loop.
-
-## LAVA User Notifications
-
-As of LAVA V2, there is a new feature called **Notification Callback**, which
-allows sending a GET or POST request to a specified URL to trigger some action
-remotely. When using a POST request, it also allows attaching and sending test
-job information and results.
-
-This can be used to send the test results back to Jenkins from LAVA for further
-processing in new pipeline phases.
-
-## Jenkins Webhook Plugin
-
-This plugin provides an easy way to block a build pipeline in Jenkins until an
-external system posts to a webhook.
-
-This can be used to wait for the automated test results sent by LAVA from a new
-Jenkins job responsible for triggering the automated tests.
-
-## Phabricator API
-
-Conduit is the developer API for Phabricator, which can be used to implement
-the management of tasks.
-
-This API can be used (either with tools or language bindings) to manage
-Phabricator tasks from a Jenkins phase in the main pipeline.
-
-## Mattermost
-
-Mattermost is the chat system used by the Apertis project.
-
-Jenkins already offers a plugin to send messages to Mattermost.
-
-This can be used to send notification messages to the chat channels, for
-example, to notify the team once a critical test starts failing, or when a bug
-has been updated.
-
-# CI Workflow Overview
-
-The main workflow basically consists of combining the above mentioned
-technologies to implement the different phases of the main CI pipeline.
-
-A general overview of the steps involved:
-
-  - Jenkins builds images and triggers LAVA jobs.
-  - The `webHook` pipeline plugin is used to wait for the LAVA test results
-    from Jenkins.
-  - LAVA executes the automated test jobs and the results are saved in its
-    database.
-  - LAVA triggers a user notification callback, attaching the test job
-    information and results, and sends it to the Jenkins webHook.
-  - Test results are received by Jenkins through the webHook.
-  - Test information is sent to a new `pipeline` to process and analyze the
-    test results.
-  - Once test results are processed and analyzed, they are sent to a new
-    `pipeline` to manage Phabricator tasks.
-  - Optionally, a new Jenkins phase could notify results to Mattermost or via
-    email.
-
-This complete loop will be executed every time new images are built.
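-
-The handoff between LAVA and Jenkins in the workflow above relies on each test
-job carrying a notification callback that points at the Jenkins webhook. As an
-illustration only, the sketch below shows how such a callback block could be
-injected into a job definition before submission; the `notify`/`callback` field
-names follow the LAVA v2 user notification documentation and should be checked
-against the deployed LAVA version, while the webhook URL and the helper itself
-are hypothetical.
-
-```python
-import yaml
-
-# Hypothetical webhook endpoint registered by the Jenkins webHook step.
-JENKINS_WEBHOOK_URL = "https://jenkins.example.net/webhook/abcd1234"
-
-
-def add_results_callback(job_definition: str, webhook_url: str) -> str:
-    """Return a LAVA job definition with a notification callback appended.
-
-    The callback asks LAVA to POST the test job results to the given URL once
-    the job has finished, so Jenkins can unblock and process them.
-    """
-    job = yaml.safe_load(job_definition)
-    job["notify"] = {
-        "criteria": {"status": "finished"},
-        "callback": {
-            "url": webhook_url,
-            "method": "POST",
-            "dataset": "results",
-            "content-type": "json",
-        },
-    }
-    return yaml.safe_dump(job)
-```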
-
-# Fetching Tests Results
-
-The initial and most important phase in closing the loop is fetching and
-processing the automated test results from LAVA.
-
-The solution proposed in this document is to use the webHook plugin to fetch
-the LAVA test results from Jenkins once the automated test job is finished.
-
-Currently, LAVA tests are submitted in the last stage of the Jenkins pipeline
-job that creates and publishes the images.
-
-Automated tests are organized in groups, which are submitted all at once using
-the `lqa` tool for each image type once the images are published.
-
-A webhook should be registered for each `test job` rather than for a group of
-tests, so a change in the way LAVA jobs are submitted is required.
-
-## Jenkins and LAVA Interaction
-
-The proposed solution is to separate the LAVA job submission stage from the
-main Jenkins pipeline job building images, and instead have a single Jenkins
-job that takes care of triggering the automated tests in LAVA once the images
-are published.
-
-The only required fields for the stage submitting the LAVA test jobs are the
-`image_name`, `profile_name`, and `version` of the image. A single Jenkins job
-could receive these values as arguments and trigger the automated tests for
-each of the respective images.
-
-The way LAVA jobs are submitted from Jenkins will also require some changes.
-The `lqa` tool currently submits several `groups` of test jobs at once, but
-since each test job requires a unique webhook, they will need to be submitted
-independently.
-
-One simple solution is to have `lqa` process the job templates first and then
-submit each processed job file with a unique webhook.
-
-Once all test jobs are submitted for a specific image type, the Jenkins
-executor will wait for all of their webhooks. This will block the executor,
-but since the webhook returns immediately for those jobs that have already
-posted their results to the webhook callback, the executor will effectively
-only block until the last completed test job sends its results back to Jenkins.
-
-After all results are received in Jenkins, they can be processed by the
-remaining stages required for tasks management.
-
-## Jenkins Jobs
-
-Since images are built from a single Jenkins job, the most sensible approach
-for the final implementation is to have a new Jenkins job receiving all the
-image types information and triggering tests for all of them, then a different
-job for processing test results, and possibly another one handling the tasks
-management phases.
-
-# Tasks Management
-
-One of the most important phases in closing the loop is reporting test issues
-in Phabricator.
-
-Test issues will be reported automatically in Phabricator as tasks per test
-case instead of tasks per issue. This has an important consequence explained in
-the [considerations]( {{< ref "#considerations" >}} ) section.
-
-This section gives an overview of the behaviour of this phase.
-
-## Workflow Overview
-
-Management of Phabricator tasks can be done as follows (see the sketch after
-this list):
-
-  1) Query Phabricator to find all open tasks with the tag `test-failure`.
-
-  2) Filter the list of received tasks to make sure only the exact tasks are
-     processed. For this, scanning for further specific fields in the task can
-     be helpful, for example, keeping only tasks with a specific task name
-     format.
-
-  3) Fetch the analyzed test results.
-
-  4) For each test, based on its results and checking the tasks list, do the
-     following:
-
-     a) Task exists: Add a comment to the task.
-
-     b) Task does not exist:
-        - If the test has status `failed`: Create a new task.
-        - If the test has status `passed`: Do nothing.
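-
-As an illustration of the workflow above, the following sketch uses the
-`python-phabricator` Conduit binding to look up open `test-failure` tasks and
-either comment on an existing task or create a new one. The project tag PHID,
-the task title format and the helper names are assumptions made for the
-example; the Conduit method names and parameter handling should be verified
-against the deployed Phabricator instance and library version.
-
-```python
-from phabricator import Phabricator
-
-# Hypothetical PHID of the `test-failure` project tag in Phabricator.
-TEST_FAILURE_PHID = "PHID-PROJ-xxxxxxxxxxxxxxxxxxxx"
-
-phab = Phabricator()  # credentials are read from ~/.arcrc
-phab.update_interfaces()
-
-
-def open_test_failure_tasks():
-    """Map task title -> task data for open tasks tagged `test-failure`."""
-    result = phab.maniphest.search(
-        constraints={"statuses": ["open"], "projects": [TEST_FAILURE_PHID]},
-    )
-    return {task["fields"]["name"]: task for task in result["data"]}
-
-
-def report_test_result(test_case, status, lava_log_url, open_tasks):
-    """Create or update a task for a single analyzed test result."""
-    title = "{} failed: automated test failure".format(test_case)
-    comment = "Automated test `{}` reported `{}`.\nLAVA logs: {}".format(
-        test_case, status, lava_log_url)
-
-    if title in open_tasks:
-        # Task exists: add a comment with the latest failure information.
-        phab.maniphest.edit(
-            objectIdentifier=open_tasks[title]["phid"],
-            transactions=[{"type": "comment", "value": comment}],
-        )
-    elif status == "failed":
-        # Task does not exist and the test failed: create a new task.
-        phab.maniphest.edit(transactions=[
-            {"type": "title", "value": title},
-            {"type": "description", "value": comment},
-            {"type": "projects.add", "value": [TEST_FAILURE_PHID]},
-        ])
-    # Task does not exist and the test passed: nothing to do.
-```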
-
-## Considerations
-
-  - The comment added to the task will contain general information about the
-    failure, with a link to the LAVA job logs.
-
-  - Tasks won't be reported per platform but per test case. Once a task for a
-    test case failure is reported, all platform failures for that test case
-    should be added as comments to that single task.
-
-  - Closing and verifying tasks will still require manual intervention. This
-    will help to avoid the following corner cases:
-
-    - Flaky tests that would otherwise end up in a series of new tasks that get
-      autoclosed.
-    - Tests failing on one image that also succeed on a different image.
-
-  - If a test starts failing again for a previously closed task, a new task
-    will be created automatically for it, and manual verification is required
-    to check whether it is the same previously reported issue, in which case it
-    is recommended to add a reference to the old task.
-
-  - If, after fixing an issue for a reported task, a new issue arises for the
-    same test case, the same old task will be updated with this new issue. This
-    is an effect of reporting tasks per test case instead of per issue. In such
-    a case, manual verification can be used to confirm whether or not it is the
-    same issue, and a new subtask can be manually created by the developer if
-    deemed necessary.
-
-## Phabricator Conventions
-
-Automating the management of Phabricator tasks will require establishing a few
-conventions in Phabricator. This will require minimal manual intervention.
-
-First of all, a specific user should be created in Phabricator to manage these
-tasks automatically.
-
-This user could be named `apertis-qa` or `apertis-tests`, and its only purpose
-will be to manage tasks automatically at this stage of the loop.
-
-A special tag and a specific task name format will also be used in tasks
-reported by this special user:
-
-  - The tag `test-failure` is the special tag for automated test failures.
-  - The task name will have the format: "{testcasename} failed: <Task title>".
-  - A `{testcasename}` tag can also be used if it is available for the test.
-
-# Design and Implementation
-
-This section gives a brief overview of the design of the main components needed
-to close the loop.
-
-Each of these components can be developed as independent modules, or as a
-single unit containing all the logic. The details and final design of these
-components depend on the most convenient approach chosen during implementation.
-
-## Design
-
-### Tests Processor
-
-This component will take care of processing the test results as they are
-received from LAVA.
-
-LAVA test results are sent to Jenkins in a `raw` format, so the work at this
-level could involve cleaning the data or even converting the test results to a
-new format so they can be more easily processed by the rest of the tools.
-
-### Tests Analyzer
-
-This component will make sure that the test results data is in a consistent and
-convenient format to be used by the next module (the `task manager`).
-
-This can be a new tool or just be part of the `test result processor`, running
-in its same Jenkins phase for convenience.
-
-### Tasks Manager
-
-This component will receive the fully analyzed test results data; ideally it
-shouldn't deal with any test data manipulation.
-
-It will take care of comparing the status of the test results against the
-Phabricator tasks, decide on the next steps, and manage those tasks
-accordingly.
-
-### Notifier
-
-This can be considered an `optional` component, and it can involve sending
-further forms of notifications to different services, for example, sending
-messages to `Mattermost` channels or emails notifying about new critical bugs.
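-
-To make the split between these components more concrete, the sketch below
-shows one possible shape for the tests processor and analyzer: it takes the raw
-per-test entries posted by LAVA (the field names used here are an assumed,
-simplified payload, not the exact LAVA export format) and reduces them to the
-minimal structure the task manager needs.
-
-```python
-from collections import namedtuple
-
-# Normalized view of a single test result, as consumed by the task manager.
-TestResult = namedtuple("TestResult", "suite name status log_url")
-
-
-def process_raw_results(raw_results):
-    """Clean up raw LAVA result entries (assumed simplified field names)."""
-    for entry in raw_results:
-        yield {
-            "suite": entry.get("suite", "unknown"),
-            "name": entry["name"],
-            "result": entry["result"].lower(),
-            "log_url": entry.get("job_url", ""),
-        }
-
-
-def analyze_results(processed_results):
-    """Map cleaned entries onto the structure used for task management."""
-    for entry in processed_results:
-        status = "passed" if entry["result"] == "pass" else "failed"
-        yield TestResult(entry["suite"], entry["name"], status, entry["log_url"])
-```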
-
-## Implementation
-
-As originally envisioned, each of the design components could be written using
-a scripting language, preferably one that already offers good integration with
-our infrastructure.
-
-The `Python` language is highly recommended, as it already offers plugins for
-all the related infrastructure, so it would require minimal effort to integrate
-a solution written in this language.
-
-As a suggested environment, Jenkins could be used as the main place to execute
-and orchestrate each of the components. They could be executed using a
-different pipeline for each phase, or just a single pipeline executing all the
-functionality.
-
-For example, once the LAVA results are fetched in Jenkins, a new pipeline phase
-receiving the test results can be started to execute the `test processor` and
-`test analyzer`, which in turn will send their output to a new pipeline phase
-to execute the `task manager` and later (if available) the `notifier`.
-
-## Diagram
-
-This is a diagram explaining the different infrastructure processes involved in
-the proposed design to close the CI loop.
-
-
-
-# Security Improvement
-
-The Jenkins webhook URL will be visible in the public LAVA test definitions,
-which might raise security concerns. For example, another process posting to
-the webhook before LAVA does would break the Jenkins job waiting for the test
-results.
-
-After researching several options to solve this issue, one solution has been
-found which consists in checking, on the Jenkins side, for a protected
-authorization header sent by LAVA when posting to the webhook.
-
-This solution requires changes both in the Jenkins plugin and in the LAVA code,
-and they need to be implemented as part of the solution for closing the CI
-loop.
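-
-As a rough illustration of the idea, and independently of how it would actually
-be wired into the Jenkins plugin and LAVA, the check amounts to comparing a
-shared secret sent in a request header against the value Jenkins expects. The
-header name and the helper below are assumptions made for the example.
-
-```python
-import hmac
-
-# Hypothetical header carrying the shared secret configured in LAVA.
-AUTH_HEADER = "X-Callback-Token"
-
-
-def is_authorized_callback(headers, expected_token):
-    """Accept the webhook POST only if it carries the expected secret token."""
-    received = headers.get(AUTH_HEADER, "")
-    # Constant-time comparison avoids leaking the token through timing.
-    return hmac.compare_digest(received, expected_token)
-```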
-
-# Implementation
-
-The final implementation of the solution proposed in this document will mainly
-involve developing tools that need to be executed in Jenkins and that will
-interact with the rest of the existing infra services: LAVA, Phabricator and
-optionally Mattermost.
-
-All tools and programs will be available from the project git repositories with
-their respective documentation, including how to set them up and use them.
-
-In addition to this, the final implementation will also include documentation
-about how to integrate, use and maintain this solution using the currently
-available infrastructure services, so other teams and projects can also make
-use of it.
-
-# Constraints or Limitations
-
-  - Some errors might not be trivially detected for automated tests, since
-    LAVA can fail in several ways; for example, infra errors can sometimes be
-    difficult to analyze and will still require manual intervention.
-
-  - The `webHook` plugin blocks the Jenkins pipeline. This might be an issue
-    in the long term, and it should remain an open point for further research
-    in a later version of this document or during implementation.
-
-  - This document deals with the existing infra, so a proper data storage has
-    not been defined for test cases and test results. Creation of the weekly
-    test reports will continue to require manual intervention.
-
-  - The test definitions for public LAVA jobs are publicly visible. The Jenkins
-    webhook URL will also be visible in these test definitions, which can be a
-    security concern. A solution for this issue is proposed in
-    [security improvement]( {{< ref "#security-improvement" >}} ).
-
-  - Closing and verifying tasks will still require manual intervention due to
-    the points explained in the
-    [considerations]( {{< ref "#considerations" >}} ) section.
-
-# New CI Infrastructure and Workflow
-
-The main changes in the new infrastructure are that test results and test cases
-will be stored in SQUAD and Git respectively, and that there will be mechanisms
-in place to visualise test results and send test issue notifications. The new
-infrastructure is defined in the [test data storage document][TestDataStorage].
-
-Manual tests are processed by the new infrastructure, so the new workflow will
-also cover closing the CI loop for manual tests.
-
-## Components and Workflow
-
-A new web service can be set up to receive the callback triggered by LAVA on a
-specific URL in order to fetch the automated test results, instead of using the
-Jenkins webhook plugin. This covers the case where the Jenkins webhook turns
-out not to be a suitable solution during implementation, either for the current
-CI loop infrastructure or for the new one.
-
-Therefore the following steps use the term `Tests Processor System` to refer to
-the infrastructure in charge of receiving and processing these test results,
-which can be set up either in Jenkins or as a new infrastructure service.
-
-The main components of the new infrastructure can be broadly split into the
-following phases: automated tests, manual tests, tasks management, and
-reporting and visualization.
-
-### Automated Tests
-
-Workflow for automated tests:
-
-  - Jenkins builds images and triggers LAVA jobs.
-  - LAVA executes the automated test jobs and the results are saved in its
-    database.
-  - LAVA triggers a user notification callback, attaching the test job
-    information and results, and sends it to the tests processor system.
-  - The system opens an HTTP URL to wait for the LAVA callback in order to
-    receive the test results.
-  - Test results are received by the tests processor system.
-  - Once test results are received, they are processed with the tool that
-    converts the test data into the SQUAD format.
-  - After the data is in the correct format, it is sent to SQUAD using the HTTP
-    API.
-
-### Manual Tests
-
-Test results will be entered manually by the tester using a new application,
-named the `Test Submitter Application` in this workflow.
-
-This application will prompt the tester to enter each manual test result, and
-will send the data to the SQUAD backend, as explained in the
-[test data storage document][TestDataStorage].
-
-The following workflow includes the processing of manual test results into the
-CI loop (a sketch of the SQUAD submission step follows the list):
-
-  - The tester manually executes test cases.
-  - The tester enters the test results into the test submitter application.
-  - The application sends the test data to the tests processor system using a
-    reliable network protocol.
-  - Test results are received by the tests processor system infrastructure.
-  - Once test results are received, they are processed with the tools that
-    convert the test data into the SQUAD format.
-  - After the data is in the correct format, it is sent to SQUAD using the HTTP
-    API.
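-
-Both workflows end by pushing the converted results to SQUAD over HTTP. A
-minimal sketch of that submission step is shown below; the
-`/api/submit/<group>/<project>/<build>/<environment>` route and the
-`Auth-Token` header follow the SQUAD documentation, while the instance URL,
-token and result payload are placeholders.
-
-```python
-import json
-
-import requests
-
-SQUAD_URL = "https://squad.example.net"      # placeholder SQUAD instance
-SQUAD_TOKEN = "secret-squad-token"           # placeholder auth token
-
-
-def submit_to_squad(group, project, build, environment, tests):
-    """Send a {test name: pass/fail} mapping to SQUAD's submit API."""
-    url = "{}/api/submit/{}/{}/{}/{}".format(
-        SQUAD_URL, group, project, build, environment)
-    response = requests.post(
-        url,
-        headers={"Auth-Token": SQUAD_TOKEN},
-        data={"tests": json.dumps(tests)},
-    )
-    response.raise_for_status()
-
-
-# Example call with placeholder values, shaped like the processor output.
-submit_to_squad("apertis", "apertis-images", "v2020.0", "amd64",
-                {"sanity/boot": "pass", "connectivity/wifi": "fail"})
-```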
-
-### Tasks Management
-
-This phase deals with processing the test results data in order to file and
-manage Phabricator tasks and send notifications.
-
-  - Once all test results are stored in the SQUAD backend, they might still
-    need to be processed by other phases in the tests processor system, and are
-    then sent to a new phase to manage Phabricator tasks.
-  - The new Phabricator phase uses the test data to file new tasks following
-    the logic explained in the
-    [tasks management]( {{< ref "#tasks-management" >}} ) section.
-  - The same phase, or a new one, could notify results to Mattermost or via
-    email.
-
-### Reporting and Visualization
-
-A new web application dashboard will be used to view test results and generate
-reports and graphical statistics.
-
-This web application will fetch results from the SQUAD backend and will process
-them to generate the relevant statistics and graphics.
-
-The weekly test report will be generated either periodically or at any time, as
-needed, using this web application dashboard.
-
-More details can be found in the
-[reporting and visualization document][TestDataReporting].
-
-## General Workflow Overview
-
-This section gives an overview of the complete workflow in the following steps:
-
-  - Automated tests and manual tests are executed in different environments.
-
-  - Automated tests are executed in LAVA and their results are sent to the
-    `HTTP URL` service opened by the `Tests Processor System` to receive the
-    LAVA callback carrying the test results.
-
-  - Manual tests are executed by the tester. The tester uses the `Test
-    Submitter App` to collect test results and send them to the `Tests
-    Processor System` using a reliable network protocol for data transfer.
-
-  - All test results are processed and converted to the SQUAD JSON format by
-    the `Test Processor and Analyzer`.
-
-  - Once test results are in the correct format, they are sent to the SQUAD
-    backend using the SQUAD HTTP API.
-
-  - Test results might still need to be processed by the `Test Processor and
-    Analyzer` in order to be sent to the new phases. Once results are
-    processed, they are passed to the `Task Manager` and `Notification System`
-    phases to manage Phabricator tasks and send email or Mattermost
-    notifications respectively.
-
-  - From the SQUAD backend, the new `Web Application Dashboard` fetches test
-    results periodically or as needed to generate test result views, graphical
-    statistics, and reports.
-
-The following diagram shows a visual representation of the above workflow:
-
-
-
-## New Infrastructure Migration Steps
-
-- Set up a SQUAD instance. This can be done using a Docker image, so the setup
-  should be very straightforward and convenient to replicate downstream.
-
-- Extend the current `Test Processor System` to submit results to SQUAD. This
-  basically consists of using the SQUAD URL API to submit the test data.
-
-- Convert the test cases from the wiki format to the strictly defined YAML
-  format.
-
-- Write an application to render the YAML test cases, guide testers through
-  them and provide them with a form to submit their results. This is the
-  `Test Submitter App`, and it can be developed as either a web frontend or a
-  command line tool (see the sketch after this list).
-
-- Write the reporting web application which fetches results from SQUAD and
-  renders reports. This is the `Web App Dashboard`, and it will be developed
-  using existing modules and frameworks in a convenient way, so that deployment
-  and maintenance can be done in the same way as for other infrastructure
-  services.
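-
-As a rough idea of the command line variant of the `Test Submitter App`, the
-sketch below renders a YAML test case, asks the tester for a verdict per step,
-and returns the payload that would be sent to the tests processor system. The
-test case schema (`name`, `steps`) and the output format are assumptions, since
-the strict YAML format is defined elsewhere.
-
-```python
-import json
-
-import yaml
-
-
-def run_manual_test(path):
-    """Guide the tester through one YAML test case and collect the results."""
-    with open(path) as handle:
-        test_case = yaml.safe_load(handle)
-
-    results = {}
-    print("Test case: {}".format(test_case["name"]))
-    for step in test_case.get("steps", []):
-        answer = ""
-        while answer not in ("pass", "fail", "skip"):
-            answer = input("{} [pass/fail/skip]: ".format(step)).strip().lower()
-        results[step] = answer
-
-    # In the real application this payload would be sent to the tests
-    # processor system; here it is only returned to the caller.
-    return {"name": test_case["name"], "results": results}
-
-
-if __name__ == "__main__":
-    print(json.dumps(run_manual_test("testcase.yaml"), indent=2))
-```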
-
-## Maintenance Impact
-
-The new components required for the new infrastructure are the
-`Test Submitter`, the `Web Application Dashboard` and SQUAD, along with some
-changes needed for the `Test Processor System` to receive the manual test
-results and send test data to SQUAD.
-
-SQUAD is an upstream dashboard that can be deployed using Docker, so it can be
-conveniently used by other projects, and its maintenance effort won't be higher
-than for other infrastructure services.
-
-The test submitter and web application dashboard will be developed reusing
-existing modules and frameworks for each of their functionalities. They mainly
-need to use already well defined APIs to interact with the rest of the
-services, and they will be designed in such a way that they can be conveniently
-deployed (for example, using Docker). They are not expected to be large
-applications, so their maintenance should be comparable to other tools in the
-project.
-
-The test processor is a system tool, developed in a modular way, so each
-component can reuse existing modules or libraries to implement the required
-functionality, for example, making use of an existing HTTP module to access the
-SQUAD URL API. It therefore won't require a big maintenance effort and will be
-practically the same as for other infrastructure tools in the project.
-
-Converting the test cases to the new YAML format can be done manually, and a
-small tool can be used to assist with the format migration (for example, to
-sanitize the format). This should be a one time task, so no further maintenance
-is involved.
-
-# Links
-
-LAVA Notification Callback:
-
-  https://lava.collabora.co.uk/static/docs/v2/user-notifications.html#notification-callback
-
-Jenkins Webhook Plugin:
-
-  https://wiki.jenkins.io/display/JENKINS/Webhook+Step+Plugin
-
-Phabricator API:
-
-  https://phabricator.apertis.org/conduit
-
-Mattermost Jenkins Plugin:
-
-  https://wiki.jenkins.io/display/JENKINS/Mattermost+Plugin
-
-
-[TestDataStorage]: test-data-storage.md
-
-[TestDataReporting]: test-data-reporting.md