Follow up FS corruption issue - part 1
A few approaches have been tried out on #84 (closed) and {T8660} but the root cause still escapes us.
It's time to sit down, think about alternatives, and:
- recap what we know about the issue
- define a clear strategy of which tests we should run next to shed more light on it
This task is timeboxed to 10pt.
Summary
This issue appears randomly on ostree images for versions v2022 and v2023. The issue could be due to either the image builder or the software in the image itself.
Because it happens on all devices, hardware- or architecture-specific issues have been ruled out.
Reproducibility
The issue happens randomly, more often on the armhf architecture. There is no known procedure to reproduce it in a controlled environment.
All tests in the next sections consist of pushing a branch with changes that could fix the issue and running the pipeline multiple times, hoping that the FS error no longer appears. When the error appears, it is usually on the 1st or 2nd pipeline, so 10 pipelines with no FS errors suggest that the change probably avoids it (as a workaround; the actual root cause would still need to be fixed).
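The "10 clean pipelines" rule of thumb can be sanity-checked with a small calculation. The sketch below is illustrative only: it assumes that pipelines fail independently with a fixed per-pipeline probability `p_fail`, which is a hypothetical parameter and not a measured value from this task.

```python
def prob_all_clean(p_fail: float, runs: int) -> float:
    """Chance that `runs` independent pipelines all pass while the bug is still present.

    p_fail is an ASSUMED per-pipeline failure probability, not a measured one.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    return (1.0 - p_fail) ** runs


# Compare a few assumed failure rates for a series of 10 pipelines.
for p in (0.5, 0.3, 0.1):
    print(f"p_fail={p:.1f}: P(10 clean runs | not fixed) = {prob_all_clean(p, 10):.4f}")
```

Under these assumptions, a failure rate around 50% (roughly what "usually on the 1st or 2nd pipeline" suggests) makes 10 clean runs by chance very unlikely, about 0.1%. If the real rate were closer to 10%, the same 10 clean runs would still happen by chance about 35% of the time, so the rule is only convincing when the failure rate is high.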
Tests done
Test | Result | Comments |
---|---|---|
Apply inode fix patch set (pkg/linux!160 (merged)) | Not better | https://phabricator.apertis.org/T8660#286048 |
Try setting UNIX_IO_NOZEROOUT to 1 | Not better | https://phabricator.apertis.org/T8660#278469, https://gitlab.apertis.org/infrastructure/apertis-image-recipes/-/pipelines/356149 |
Check FS with fsck | Nothing seems wrong | https://phabricator.apertis.org/T8660#278469 |
Use kernel 5.10 (from v2021) | Not better | https://phabricator.apertis.org/T8660#286800 |
Check issues for ext4 fast_commits | fast_commits not used in Apertis | https://phabricator.apertis.org/T8660#286800 |
Use ext3 instead of ext4 | FS-error not seen | ext3 works differently, may not be relevant to this issue |
Deactivate ext4 lazy_init | N/A | Not tested yet, needs support in debos |
Building v2023dev2 images in v2021 docker | N/A | That brings other issues like segfault on arm64 and kernel freeze on armhf |
Build with KVM instead of UML | Not better | https://phabricator.apertis.org/T8660#287489 |
Build images with older e2fsprogs (from v2021) | Not better | https://phabricator.apertis.org/T8660#287941 |
Downgrade e2fsprogs, debos, uml and ostree on docker and built images | Not better | https://phabricator.apertis.org/T8660#288477 |
Next steps
- One possibility is to find more packages to downgrade and start over.
- Although downgrading linux on its own has already been tried, downgrading it together with the other packages (e2fsprogs and ostree) on the image has been suggested.
- Try to make the issue easier to reproduce by forcing a full FS check at boot. This might not work because the errors could appear later, when the FS is accessed.
- Another approach would be to analyse the FS on the image and find what's wrong with it (but we can't be sure that the errors are present before boot).
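If the offline analysis of the image gets scripted, a forced read-only check could look like the sketch below. It is only an illustration, not the procedure used in this task: `fs_path` is a hypothetical path and is assumed to point at a bare ext4 filesystem (a partition device or a filesystem-only image), not at a full disk image with a partition table. The exit status meanings are taken from the e2fsck(8) man page.

```python
import subprocess
from typing import List

# Bit flags documented in e2fsck(8) for its exit status.
E2FSCK_EXIT_FLAGS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status: int) -> List[str]:
    """Translate an e2fsck exit status (a bitmask) into readable messages."""
    if status == 0:
        return ["no errors"]
    return [msg for bit, msg in E2FSCK_EXIT_FLAGS.items() if status & bit]


def check_filesystem(fs_path: str) -> List[str]:
    """Run a forced, read-only check on fs_path (hypothetical path, see assumptions)."""
    # -f forces the check even if the FS is marked clean,
    # -n opens it read-only and answers "no" to every repair question.
    result = subprocess.run(
        ["e2fsck", "-f", "-n", fs_path],
        capture_output=True,
        text=True,
    )
    return decode_e2fsck_status(result.returncode)
```

Because `-n` never modifies the filesystem, the same image can be checked before and after a boot to see whether the errors were already there, which is the open question in the last bullet.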
Outcomes
- apertis-image-recipes!519 (closed)
- pkg/util-linux!15 (merged)
- pkg/util-linux!16 (merged)
- pkg/util-linux!17 (closed, since the approach could lead to potential problems)
- pkg/util-linux!18 (closed, since the approach could lead to potential problems)
- apertis-image-recipes!529 (merged)
- apertis-image-recipes!531 (merged)
- apertis-image-recipes!532 (merged)
Management data
This section is for management only; it should be the last one in the description.
Phabricator link: https://phabricator.apertis.org/T8924