Follow up FS corruption issue - part 1

A few approaches have been tried out on #84 (closed) and {T8660} but the root cause still escapes us.

It's time to sit down, think about alternatives, and:

  1. recap what we know about the issue
  2. define a clear strategy of which tests we should run next to shed more light on it

This task is timeboxed to 10pt.

Summary

This issue appears randomly on ostree images for versions v2022 and v2023. The issue could be due to either the image builder or the software in the image itself.

Because it happens on all devices, hardware- or architecture-specific issues have been ruled out.

Reproducibility

The issue happens randomly, more often on the armhf architecture. There is no known procedure to reproduce it in a controlled environment.

All tests in the next sections consist of pushing a branch with changes that could fix the issue and running the pipeline multiple times, hoping that the FS error no longer appears. When it does appear, it is usually on the 1st or 2nd pipeline, so 10 pipelines with no FS errors suggest that the change probably avoids the issue. This would only be a workaround; the actual fix of the root cause would still need to be done.
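The "10 clean pipelines" rule of thumb can be checked with a little arithmetic. The sketch below assumes that each pipeline fails independently with a fixed probability `p_fail`; both the independence and the sample values of `p_fail` are assumptions for illustration, since the only thing observed is that the error usually shows up on the 1st or 2nd pipeline.

```python
import math


def prob_all_clean(p_fail: float, runs: int) -> float:
    """Chance that `runs` pipelines all pass although the bug is still present.

    Assumes each pipeline hits the FS error independently with
    probability `p_fail` (a hypothetical value, not a measured one).
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    return (1.0 - p_fail) ** runs


def runs_needed(p_fail: float, confidence: float = 0.95) -> int:
    """Smallest number of clean runs after which a still-present bug
    would have shown up with at least `confidence` probability."""
    if not 0.0 < p_fail <= 1.0:
        raise ValueError("p_fail must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if p_fail == 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Sample failure rates are guesses used only to show the trend.
    for p in (0.5, 0.25, 0.1):
        print(p, prob_all_clean(p, 10), runs_needed(p))
```

Under these assumptions, if the error hit about half of the pipelines, 10 clean runs would be very unlikely to happen by chance. If the real failure rate were much lower, 10 runs would be weaker evidence and more runs would be needed.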

Tests done

| Test | Result | Comments |
|------|--------|----------|
| Apply inode fix patch set (pkg/linux!160 (merged)) | Not better | https://phabricator.apertis.org/T8660#286048 |
| Try setting UNIX_IO_NOZEROOUT to 1 | Not better | https://phabricator.apertis.org/T8660#278469, https://gitlab.apertis.org/infrastructure/apertis-image-recipes/-/pipelines/356149 |
| Check FS with fsck | Nothing seems wrong | https://phabricator.apertis.org/T8660#278469 |
| Use kernel 5.10 (from v2021) | Not better | https://phabricator.apertis.org/T8660#286800 |
| Check issues for ext4 fast_commits | fast_commits not used in Apertis | https://phabricator.apertis.org/T8660#286800 |
| Use ext3 instead of ext4 | FS-error not seen | ext3 works differently, may not be relevant to this issue |
| Deactivate ext4 lazy_init | N/A | Not tested yet, needs support in debos |
| Building v2023dev2 images in v2021 docker | N/A | That brings other issues like segfault on arm64 and kernel freeze on armhf |
| Build with KVM instead of UML | Not better | https://phabricator.apertis.org/T8660#287489 |
| Build images with older e2fsprogs (from v2021) | Not better | https://phabricator.apertis.org/T8660#287941 |
| Downgrade e2fsprogs, debos, uml and ostree on docker and built images | Not better | https://phabricator.apertis.org/T8660#288477 |
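For the untested "Deactivate ext4 lazy_init" entry, the following sketch shows what the underlying filesystem creation command could look like outside debos. It assumes that "lazy_init" refers to the mke2fs extended options `lazy_itable_init` and `lazy_journal_init`; the helper name and the device path are hypothetical, and this does not show how debos itself would need to be changed.

```python
from typing import List, Optional


def mkfs_ext4_no_lazy_init(device: str, label: Optional[str] = None) -> List[str]:
    """Build an mkfs.ext4 command line with lazy initialisation disabled.

    With both options set to 0, mke2fs writes the inode tables and the
    journal completely at creation time instead of deferring that work
    to the kernel after the first mount.
    """
    cmd = ["mkfs.ext4", "-E", "lazy_itable_init=0,lazy_journal_init=0"]
    if label:
        cmd += ["-L", label]
    cmd.append(device)
    return cmd


if __name__ == "__main__":
    # Only prints the command; running it would erase the target device.
    print(" ".join(mkfs_ext4_no_lazy_init("/dev/hypothetical-partition", "system")))
```

The function only builds the argument list, so it can be inspected safely before anything is run against a real device.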

Next steps

  • One possibility is to find more packages to downgrade and start over.
    • Even though downgrading linux alone has already been tried, downgrading it together with the other packages (e2fsprogs and ostree) on the image has been suggested.
    • Try to reproduce the issue more easily by forcing a full FS check at boot. This might not work, because the errors could appear later, when the FS is accessed.
  • Another approach would be to analyse the FS on the image and find what's wrong with it (but we can't be sure that the errors are there before boot).
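As a starting point for analysing the FS on a built image, a forced read-only check can be scripted so that it can be repeated on every pipeline artifact. This is a minimal sketch: it assumes `e2fsck` is installed and that `fs_image` points to a bare ext4 filesystem image (a full disk image with a partition table would first need the partition exposed, for example through a loop device). The function names are hypothetical.

```python
import subprocess
from typing import List

# Exit status bits as documented in the e2fsck(8) manual page.
E2FSCK_FLAGS = {
    1: "errors corrected",
    2: "errors corrected, system should be rebooted",
    4: "errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "cancelled by user request",
    128: "shared library error",
}


def decode_exit_status(status: int) -> List[str]:
    """Translate the e2fsck exit status bitmask into readable messages."""
    if status == 0:
        return ["no errors"]
    return [msg for bit, msg in E2FSCK_FLAGS.items() if status & bit]


def check_image(fs_image: str) -> List[str]:
    """Run a forced, read-only check on a filesystem image.

    -f forces the check even if the filesystem is marked clean;
    -n opens it read-only and answers "no" to every repair question,
    so the image itself is never modified.
    """
    result = subprocess.run(
        ["e2fsck", "-f", "-n", fs_image],
        capture_output=True,
        text=True,
    )
    return decode_exit_status(result.returncode)
```

Calling `check_image()` on the filesystem image right after the build and again after a first boot could help tell whether the errors are already present before boot or only appear later.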

Outcomes

Management data

This section is for management only, it should be the last one in the description.

Phabricator link: https://phabricator.apertis.org/T8924

Edited by Walter Lozano