Read the latest posts from The Geek's corner – Recycled Cloud.

from vdoreau

#lowlevel #alpine #kernel

Recently, we had to update a router running Alpine Linux. It is paired with a second Alpine router for redundancy, so the operation is stressful but no big deal. Standard procedure: upgrade the system, then reboot to make sure everything is in order and the latest kernel is in use. The second router took over the traffic, so far so good. Until the first router didn't come back up.

We have access to its console, so I connected to see what was happening aaand ... bingo, the router was stuck at boot with the following line from the kernel, printed after Alpine had taken over the boot process.

random: crng: init done

We had already faced this and were prepared for it. Unlike a default Alpine install, which only keeps the latest kernel, we had a backup kernel on the system, ready for this exact case. So naturally I rebooted, selected the other kernel aaand ... bad luck, still exactly the same. At least the kernel was out of the equation.

At this point I knew I had to go into the initramfs to debug the boot process (I had done it before and was able to solve a similar issue this way). On Alpine, the init script is /usr/share/mkinitfs/initramfs-init. Looking at it reveals that you can pass the single kernel option to tell the init script to spawn a shell before starting the root process (init on Alpine, systemd on Debian ...).

Here are the relevant lines from the script.

if [ "$SINGLEMODE" = "yes" ]; then
    echo "Entering single mode. Type 'exit' to continue booting."
    sh
fi
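To make the mechanism a bit more concrete, here is a simplified sketch of how an init script could derive that variable from the kernel command line. This is not the actual mkinitfs code, only an illustration; parse_singlemode is a hypothetical helper name, and on a real system the input would come from /proc/cmdline.

```shell
#!/bin/sh
# Simplified sketch, NOT the real initramfs-init logic.
# parse_singlemode is a hypothetical helper: it prints "yes" when the
# word "single" appears as a standalone option in the given command line.
parse_singlemode() (
    # Runs in a subshell so that disabling globbing does not leak out.
    set -f
    singlemode=no
    # Unquoted on purpose: split the command line on whitespace.
    for opt in $1; do
        case "$opt" in
            single) singlemode=yes ;;
        esac
    done
    echo "$singlemode"
)

# Typical use inside an init script (assumption):
#   SINGLEMODE="$(parse_singlemode "$(cat /proc/cmdline)")"
```

With this in place, appending single to the kernel command line at the bootloader prompt is what makes the check above drop you into a shell.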

From there, you basically have a shell in the initramfs, which is a working, minimal Linux system. You can then alter the boot process. After all, init is only a shell script running some commands.

At this point, the only clue I had was the few lines printed after Alpine started booting. Two of them mentioned changing permissions in the /run directory, and the last one was about randomness from the kernel.

I immediately connected the random stuff to boot-time entropy starvation, which we had already experienced in virtual machines, but only as a slowdown, never as a fully stuck boot. I'm not going to explain what boot-time entropy starvation is. If you want to know more, this article from the Debian wiki is a good starting point.

Naturally, I went this way and found out you can read the available entropy of a system from /proc/sys/kernel/random/entropy_avail. At the time, it gave me something like 80, which is below the 256 mentioned in this forum post.
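As a rough sketch of that check: the helper below reads the entropy counter and compares it to a threshold. entropy_status is a hypothetical name, and the default of 256 is simply the value mentioned above, not something I verified for every kernel version. The file path can be overridden, which is handy for trying it out on a plain file.

```shell
#!/bin/sh
# entropy_status is a hypothetical helper, not an existing tool.
# $1: file to read (defaults to the kernel's entropy counter)
# $2: threshold in bits (assumption: 256, as mentioned in the post)
entropy_status() {
    file="${1:-/proc/sys/kernel/random/entropy_avail}"
    threshold="${2:-256}"
    # Bail out with status 2 if the file cannot be read.
    avail="$(cat "$file")" || return 2
    if [ "$avail" -ge "$threshold" ]; then
        echo "ok: $avail bits available"
    else
        echo "low: $avail bits available (threshold $threshold)"
        return 1
    fi
}
```

Calling entropy_status with no arguments reads the real counter. In the situation described here it would have reported something like 80 and returned a non-zero status.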

So here I am, on my way to try to generate more entropy. Now you may have come across key generation tools (GnuPG, for instance) asking you to move your mouse to generate entropy. Well, of course this is not possible here since there is no X server. After trying a few commands and looking around, I ran cat on the available entropy again, and to my surprise ... it had increased! Interestingly, it increases with every command, and even just typing gibberish worked; likely every keystroke increases entropy a little bit.

I continued this way until the counter hit 256, which seems to be the max. And at this exact moment, the kernel printed random: crng: init done. Out of curiosity/instinct I typed exit, which gives control back to the init script to continue booting and ... it worked! The router was now booting as normal.
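What I did by hand can be sketched as a small polling loop that prints the counter until it reaches the target. wait_for_entropy is a hypothetical helper, and the default values are assumptions; note that the loop only watches the counter, it does not generate entropy by itself.

```shell
#!/bin/sh
# wait_for_entropy is a hypothetical helper, not an existing tool.
# $1: file to poll        (default: the kernel's entropy counter)
# $2: target in bits      (assumption: 256, the apparent maximum)
# $3: maximum attempts    (assumption: 60)
# $4: seconds between attempts (assumption: 1)
wait_for_entropy() {
    file="${1:-/proc/sys/kernel/random/entropy_avail}"
    target="${2:-256}"
    max_tries="${3:-60}"
    interval="${4:-1}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        avail="$(cat "$file")" || return 2
        echo "entropy_avail=$avail"
        # Stop as soon as the target is reached.
        if [ "$avail" -ge "$target" ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    # Target never reached within the allowed attempts.
    return 1
}
```

In the initramfs shell, running this in the background while typing in the foreground would show the value climbing with each keystroke, if the behaviour matches what I observed.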

In the end, this little APU was entropy starved. Now, this is supposed to have been fixed since Linux kernel 5.4, so I cannot entirely explain why it happens. To remedy this, we installed haveged, a userspace daemon that gathers entropy from hardware. This is a first step towards a full resolution, and we'll need to monitor it to verify it is sufficient.
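For reference, installing it on Alpine could look like the sketch below. It assumes the package and its OpenRC service are both named haveged, and that the boot runlevel is the right place for it; the exact steps we used are not recorded here, and enable_haveged is a hypothetical wrapper so that nothing runs until you call it as root.

```shell
#!/bin/sh
# Hypothetical wrapper: defines the steps without running them.
# Assumptions: package and service are both called "haveged",
# and starting it in the "boot" runlevel is early enough.
enable_haveged() {
    # Install the daemon from the Alpine repositories.
    apk add haveged || return 1
    # Start it automatically on every boot.
    rc-update add haveged boot || return 1
    # Start it right away, without waiting for a reboot.
    rc-service haveged start
}

# Run as root on the router:
#   enable_haveged
```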


When writing this blog post, and after some sleep, I realized something. At some point during the initial process, after I rebooted the router, I pinged it continuously to see when it would come back up.

Yay it pings, must be back up, let's ssh!

Oh, ssh doesn't respond, let's give it a little bit of time, must still be booting.

Ok weird, it's been a long time now, let's ping it again to see if it went down or something.

Still pings, ok, so it's taking a suspiciously long time to boot. Let's ssh again just to see if it has finished now.

Ok, works, phew, that was close, it did reboot on its own in the end.

Or so I thought.

It was only later that I understood: pinging an interface can generate randomness. So what I did, without even knowing it, was generate enough entropy, exactly as I did in the initramfs by issuing commands.
