@wlozano Do you have links to other failures like this? Is it always on aaeon-UPN-EHLX4RE-A10-0864-cbg-2? I can't list the jobs of aaeon-UPN-EHLX4RE-A10-0864-cbg-2 on LAVA (error 500), even when logged in.
The first line set root='(hd0,gpt1)' is what LAVA sends to the DUT over the serial port. The second line grub> set root='(hd0,gpt1)' is what the DUT outputs on the serial port to show what was typed (the console echo).
The command seems complete there, so I'm not sure what can cause grub to miss the end of the value.
Also, because of the ', the grub console would probably fail as there would be no closing '.
Because it fails the same way on other devices, maybe the culprit is not the serial port but a bug in the grub console? That seems unlikely as well.
yeah indeed, if the ' was missing grub would complain (or really, it would go into multi-line input waiting for the final '), which would cause lava to time out waiting for the prompt... So apparently "sometimes" 2 bytes are missed on the input even though it echoes them... which is weird.
Especially as for the initial network boot the inputs are way way longer, so if it's a problem of e.g. overflows we'd see it there as well..
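To make the "2 bytes missed" hypothesis a bit more concrete, here is a minimal sketch that removes every possible pair of consecutive characters from the command and checks which removal gives the value grub complains about. The command and the observed value are taken from the logs discussed above; the helper names are made up for illustration, and this says nothing about where the bytes would actually be lost.

```python
# Command as sent by LAVA (from the job log discussed above).
CMD = "set root='(hd0,gpt1)'"
# What grub seems to have acted on, judging from the error message.
OBSERVED = "set root='(hd0,gpt'"


def drop_bytes(text, start, count=2):
    """Return text with `count` consecutive characters removed at `start`."""
    return text[:start] + text[start + count:]


def quotes_balanced(text):
    """True when single quotes pair up, i.e. grub would not wait for more input."""
    return text.count("'") % 2 == 0


offsets = range(len(CMD) - 1)
# Offsets where dropping two characters reproduces the observed value.
matches = [i for i in offsets if drop_bytes(CMD, i) == OBSERVED]
# Offsets where the damaged command would still be accepted as one line.
balanced = [i for i in offsets if quotes_balanced(drop_bytes(CMD, i))]

print("offsets reproducing the observed value:", matches)
print("2-byte drops keeping quotes balanced:", len(balanced), "of", len(offsets))
```

Only one offset reproduces the observed value, and the damaged command still has balanced quotes, which fits grub not dropping into multi-line input.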
@detlev @wlozano thus far our sample size is 2, which is a bit small. Could you analyse a bunch more failing test jobs to work out if:
- It's always the "boot-to-disk" custom commands that have this issue (i.e. not the initial flashing phase)
- The error is always the same (i.e. always missing the 1))
That might help us pin down if it's somewhat specific to setting env vars or...
I did some analysis of the grub code but it all looks quite good. I thought there could be an issue with the single quotes ' somehow messing up the strlen of the value (2 chars missing, 2 quotes) but my flex/yacc/bison knowledge is not good enough for something so subtle.
I couldn't find other jobs with this issue. @wlozano do you keep a list of these failures?
According to grub's code, the message error: disk `(hd0,gpt' not found. is only shown when parsing the (hd0 part. If that succeeds, it goes on to the partition probing.
So the real error here seems to be that the disk hd0 was not found. The fact that `(hd0,gpt is shown in the message can be another issue, maybe just related to the print function and the data is actually correct.
That could be a timing issue in grub (not a hardware issue as it started happening on different boards at the same time)
To be sure of that, I would suggest adding a printenv (or equivalent) in the LAVA grub boot script to check that the variable is indeed correct the next time this happens. (a simple change to get a better understanding)
@sjoerd do you know what kind of disk hd0 is supposed to be in LAVA on these boards? Is it a USB mass storage device?
hd0 should be the onboard eMMC; there aren't any extra storage devices plugged in on these afaik. You could add ls -l to the custom grub commands to see if that provides a bit more background on what's going on here. To show the env var you'd add echo ${root}
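As a rough sketch of that suggestion, the snippet below inserts the two diagnostic commands right after the set root line of a command list. The echo ${root} and ls -l commands come from the comment above; the surrounding command list (including the final boot) and the function name are assumptions for illustration, not the actual LAVA template.

```python
# Diagnostic commands suggested in the thread.
DIAGNOSTICS = ["echo ${root}", "ls -l"]


def with_diagnostics(commands):
    """Return a copy of `commands` with diagnostics added after `set root=`."""
    result = []
    for command in commands:
        result.append(command)
        if command.startswith("set root="):
            result.extend(DIAGNOSTICS)
    return result


# Hypothetical boot-to-disk command list; the real one lives in the LAVA job.
boot_commands = [
    "set root='(hd0,gpt1)'",
    "chainloader /efi/boot/bootx64.efi",
    "boot",
]

for line in with_diagnostics(boot_commands):
    print(line)
```

Printing the variable right after it is set would show whether the value is already truncated before chainloader runs.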
For reference though if either the partition or the hd doesn't exist you don't get those errors:
grub> set root=(hd1,gpt1)
grub> chainloader /efi/boot/bootx64.efi
error: disk `hd1,gpt1' not found.
grub> set root=(hd0,gpt4)
grub> chainloader /efi/boot/bootx64.efi
error: disk `hd0,gpt4' not found.
The only way to get error: disk `(hd0,gpt' not found. seems to be when actually inputting (hd0,gpt
Note that you do get (hd0,gpt in the message if you input that, so it only ends up stripping the parentheses iff there is a pair. For grub I'm using 2.11, same as our lava setups -- the build comes from https://gitlab.collabora.com/lava/grub
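The observed behaviour can be summarised in a small model. This is only a sketch of what the console output above shows (parentheses stripped only when both are present); it is not a transcription of grub's actual code, and the function names are invented.

```python
def disk_name_for_error(root):
    """Strip the surrounding parentheses only when both are present."""
    if len(root) >= 2 and root.startswith("(") and root.endswith(")"):
        return root[1:-1]
    return root


def not_found_message(root):
    """Format the message the way the grub console prints it."""
    return "error: disk `%s' not found." % disk_name_for_error(root)


print(not_found_message("(hd1,gpt1)"))  # complete value: parentheses stripped
print(not_found_message("(hd0,gpt"))    # truncated value: kept as typed
```

Under this model, a message containing `(hd0,gpt' means the value reaching the disk lookup had no closing parenthesis.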
@sjoerd I ran 150 jobs on the upsquared 6000 on staging and none failed when using the chainloader. I also tried 50 jobs with a character_delay of 20 ms and none failed either.
So that doesn't tell us that the character_delay is the solution, but that tells us that the character_delay doesn't hurt.
Does it make sense to add it to the production template? (Maybe 5 ms would be enough though)
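To put the zero-failure runs in perspective, here is a small sketch of how high the real failure rate could still be after N passes in a row. It assumes the jobs are independent and share one fixed failure probability, which may not hold for an intermittent issue like this one; the function name is made up.

```python
def max_failure_rate(passes, confidence=0.95):
    """Upper bound on the per-job failure rate after `passes` straight passes.

    Solves (1 - p) ** passes = 1 - confidence for p, assuming independent
    jobs with a constant failure probability.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / passes)


for jobs in (150, 50):
    print(f"{jobs} passes: failure rate below {max_failure_rate(jobs):.1%}")
```

So 150 clean runs only rule out failure rates above roughly 2%, and 50 clean runs only rule out rates above roughly 6%. A rarer failure could easily go unseen in both runs.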
Sorry, I don't understand what you mean by "using the chainloader". Regarding the 5 ms delay, I don't fully understand the consequences, but it looks OK to me.
Taking a look at this, I compared jobs 12885140 and 12885141. Both appear to be runs against apertis_v2025dev2-hmi-amd64-uefi_20240228.0317, testing different things, but the part of the test we care about seems basically the same. One fails, the other doesn't.
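On the consequences of the delay: as a rough estimate, assuming the delay is simply applied once per character sent (including the final newline), the extra time per command is small. The function name is made up, and the exact way LAVA applies character_delay is an assumption here.

```python
def added_time_ms(command, delay_ms):
    """Extra send time, assuming one delay per character plus the newline."""
    return (len(command) + 1) * delay_ms


command = "set root='(hd0,gpt1)'"
for delay in (20, 5):
    print(f"{delay} ms delay adds {added_time_ms(command, delay)} ms")
```

The cost grows linearly with command length, so the much longer network boot commands would be slowed down the most.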
After both echoes of grub> set root='(hd0,gpt1)'\n from the target board, there is a string of mainly ANSI escape sequences. These differ subtly. The working one ends with:
Decoding those, the former is positioning the cursor at successive locations between columns 25 and 28 on row 9 and printing 1)', the other prints ' at row 9, column 25.
I'm assuming this data is primarily meant to be rendered into a screen buffer? But it shows that the 1) has somehow been lost while the ' is kept!
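For anyone wanting to repeat that decoding on other jobs, here is a minimal sketch that pulls cursor-position sequences (ESC [ row ; col H) and the text that follows them out of a byte stream. The two sample strings are hypothetical reconstructions built from the description above, not the actual bytes from the job logs.

```python
import re

# ESC [ row ; col H (cursor position), followed by text up to the next ESC.
CUP = re.compile(r"\x1b\[(\d+);(\d+)H([^\x1b]*)")


def decode(stream):
    """Return (row, column, text) for each cursor-position sequence."""
    return [(int(row), int(col), text) for row, col, text in CUP.findall(stream)]


# Hypothetical samples matching the description, not copied from the logs.
working = "\x1b[9;25H1\x1b[9;26H)\x1b[9;27H'\x1b[9;28H"
failing = "\x1b[9;25H'"

for name, stream in (("working", working), ("failing", failing)):
    print(name, decode(stream))

# Column where "1)" should start, with 1-based columns after the prompt.
PROMPT = "grub> "
CMD = "set root='(hd0,gpt1)'"
expected_col = len(PROMPT) + CMD.index("1)") + 1
print("expected column for '1':", expected_col)
```

With a 6-character prompt the 1 of gpt1 lands on column 25, so a ' printed at column 25 means grub placed the closing quote exactly where the 1 should have been.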
that's grub over a serial line so it's expecting vt100 on the other side; the loss of characters can be both loss on input and output still (likely input).
Looked into dropping to "dumb" mode on the serial however the vt100 seems to be hard coded for a given target/platform set and the following message found in the grub source suggests pursuing this further may not be very fruitful:
/* FIXME: The dumb interface is not supported yet. */
Ran 400 iterations of the following script locally overnight, using QEMU with the grub binary from LAVA and the Apertis v2023 fixedfunction image. Zero failures occurred, but then this is a highly contrived serial interface.