Linux on HP

20 and

Intro

This article is in my category directory ‘sfreview’, software review, although it is actually about hardware and software. I encountered the problem, not very severely, in Linux Mint 18.3 Sylvia, more severely in Mint 19, less so in 20.1, and prohibitively so when trying to install Linux Mint 21.1 Vera. Why there were those differences, I don’t know.

Bodhi Linux (like Mint, based on Ubuntu) also could not be installed on the same hardware, due to this. I also noticed the problem in Alpine Linux 3.17, but there it was quite bearable.

So the issue isn’t specific to Mint. It is specific to Linux, in the sense that it doesn’t seem to occur with Windows. And it is specific to certain products by HP (Hewlett-Packard). In my case: an HP Pavilion x360 Convertible, bought 2 July 2016, when I urgently needed a computer because another one had suddenly broken down. It is convertible in the sense that it can be used as a tablet, and as a laptop. I only ever used it as a laptop computer.

HP supports only Windows, not Linux. So when a problem arises, you’re basically on your own, unless someone somewhere on the internet knows a solution.

So what is the problem? And what solutions have I found?

The problem

I started the HP computer from a USB-stick containing an installable ISO image for Linux Mint 21.1 Vera, with Cinnamon as the desktop. Later I used Ventoy, so more ISOs fit on one stick. Ventoy or the ISO starts a live session, in which you can try out Mint, and optionally you can install the system to hard disk from there.

There were three processes that together sucked up a lot of computer resources, both in processor power and in disk IO. Summarised output of command top:

 PR  NI  %CPU  COMMAND
-51   0  93,4  irq/123-aerdrv
 19  -1  93,4  systemd-journal
 20   0  81,4  rsyslogd

We see that the IRQ process has a priority (PR) of minus 51, which is actually very high: a higher number means a lower priority, 20 is normal, and a negative value is exceptionally high. That makes sense, because an IRQ, an interrupt request, is urgent: it asks the processor to interrupt its current activity and serve the IRQ first.

The nice value (NI) is usually zero, but for the systemd-journal process it is minus 1, which means a slightly higher priority. In this case that results in a PR of 19 instead of the usual 20.
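
How top arrives at those PR values can be sketched in a few lines of shell. This is only a sketch, under the assumption that top shows 20 plus the nice value for ordinary processes and minus 1 minus the real-time priority for real-time threads, and that the kernel runs its IRQ threads at real-time priority 50. The function names are made up for illustration.

```shell
# PR as shown by top for an ordinary (time-sharing) process: 20 + nice value.
pr_from_nice() {
    echo $((20 + $1))
}

# PR as shown by top for a real-time thread: -1 - real-time priority.
# Assumption: threaded IRQ handlers run with real-time priority 50.
pr_from_rtprio() {
    echo $((-1 - $1))
}

pr_from_nice 0       # rsyslogd
pr_from_nice -1      # systemd-journal
pr_from_rtprio 50    # irq/123-aerdrv
```

The three calls print 20, 19 and -51, matching the PR column above.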

The three processes together use almost 300% in the terms of top, where 100% stands for one logical processor. The actual numbers vary somewhat, between 80 and 105% each. This model has a chip (Intel i3, generation 6) that presents four logical processors: two physical cores, each running two hardware threads (Hyper-Threading).

This heavy load by itself doesn’t make the computer unresponsive, although it is slow. That’s probably because a fourth logical processor is still free to serve keyboard or mouse input, process it, and send results to the graphics processor, so you see things happening on the screen. And the other processors too can devote some time to other processes, giving them time slices.

When I noticed the same problem in Alpine Linux instead of Linux Mint, there was only one process that took up about 10% of available processor core capacity. Not nice, but not a problem. Probably under Alpine, the priority isn’t as high? Or the interrupt is handled more efficiently?

There was only one process then, which means there was no logging. And that’s where it bites under Linux Mint, in 21.1 and perhaps other versions too. These two logging processes write to disk files, /var/log/kern.log and /var/log/syslog. That has a clear impact on overall performance, even with the powerful and very fast hardware of modern computers.

When later the problem occurred after installation to hard disk, the log files had grown to 16 gigabytes each after just a few minutes! Not funny, although the disk is large enough to cope with it. This explains why in a live session these processes are prohibitive, and soon make the computer completely unusable: a live session has limited writable ‘disk’ space, in an overlay on top of the read-only squashfs file system, and that overlay actually resides in RAM. Very fast, but not as spacious as a magnetic hard disk or an SSD.

So after a few minutes: disk full, nothing else worked. With zero bytes of free storage, hardly anything can still function. There was not enough time to install the system. So scrap the hardware? That would be a shame. This HP Pavilion has a fine screen with very warm colours. And I think almost seven years isn’t too old for a computer. I want to use it for another five years if possible, although perhaps only as a backup computer in case the primary one fails.

(Side note: Shouldn’t log rotation have kicked in to handle those humongous log files? I thought I had it enabled. Perhaps it only checks every hour or so, and hadn’t found the time yet. Anyway, clearing the log manually cannot be done using something like sudo cat > syslog, because the shell (bash, for example) opens the log file for the redirection, and cat has obtained root capabilities, but bash has not. Workaround: sudo tee syslog, then press ctrl-d to end input.)
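
The tee workaround can be tried out safely on a scratch file. The sketch below is an illustration only: it feeds tee empty input from /dev/null, which should have the same effect as pressing ctrl-d straight away, and it also shows truncate from GNU coreutils as an alternative. For the real log one would use /var/log/syslog and put sudo in front of tee or truncate.

```shell
# Scratch file standing in for the log (not the real /var/log/syslog).
log=$(mktemp)
printf 'line 1\nline 2\n' > "$log"

# tee opens the file for writing itself, so under sudo the file is
# opened with root rights. Empty input leaves an empty file.
tee "$log" < /dev/null > /dev/null

# Alternative from GNU coreutils: set the file size to zero bytes.
truncate -s 0 "$log"

# Arithmetic expansion strips any padding that wc may add.
size_after=$(($(wc -c < "$log")))
echo "size after clearing: $size_after bytes"
rm -f "$log"
```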

Log file contents

Below I list a typical set of log file entries, taken from /var/log/kern.log. There are some 20 thousand such report blocks for each second! Some variation occurs, sometimes there are fewer lines per set, sometimes more lines are repeated. I added some line breaks and tabs for readability, where in reality everything was in one long line for each timestamp.

Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible
	kernel: [   41.417238] pcieport 0000:00:1c.4: PCIe Bus Error:
	severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible
	kernel: [   41.417240] pcieport 0000:00:1c.4:
	device [8086:9d14] error status/mask=00000001/00002000
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible
	kernel: [   41.417247] pcieport 0000:00:1c.4:
	[ 0] RxErr
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible
	kernel: [   41.417257] pcieport 0000:00:1c.4:
	AER: Corrected error received: 0000:00:1c.4
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible
	kernel: [   41.417264] pcieport 0000:00:1c.4:
	AER: can't find device of ID00e4
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible
	kernel: [   41.417266] pcieport 0000:00:1c.4:
	AER: Corrected error received: 0000:00:1c.4
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible
	kernel: [   41.417281] pcieport 0000:00:1c.4:
	AER: can't find device of ID00e4
Mar 19 18:51:03 rudhar-HP-Pavilion-x360-Convertible
	kernel: [   41.417283] pcieport 0000:00:1c.4:
	AER: Corrected error received: 0000:00:1c.4
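
A rate like the one mentioned above can be estimated by counting matching log lines per timestamp. The following is a minimal sketch, not the method actually used for this article: the function name count_per_second is made up, and the sample lines are abbreviated copies of the entries above, with a shortened host name and an invented fourth line, in the real one-line-per-entry layout.

```shell
# Count syslog-style lines per second that contain a given text.
# Reads the log on stdin; fields 1 to 3 form the timestamp.
count_per_second() {
    grep -F -- "$1" | awk '{ n[$1 " " $2 " " $3]++ }
        END { for (t in n) print t, n[t] }' | sort
}

# Small made-up sample, one line per log entry.
sample='Mar 19 18:51:03 host kernel: [   41.417257] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
Mar 19 18:51:03 host kernel: [   41.417264] pcieport 0000:00:1c.4: AER: can'\''t find device of ID00e4
Mar 19 18:51:03 host kernel: [   41.417266] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
Mar 19 18:51:04 host kernel: [   42.000001] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4'

printf '%s\n' "$sample" | count_per_second 'AER: Corrected error received'
```

On the affected system, something like count_per_second 'AER: Corrected error received' < /var/log/kern.log would give the real figures, provided the log is still small enough to read.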

Difficult to say whose fault this is, and who might have fixed it. Is it HP, who used faulty chips that cause errors that shouldn’t have happened? Or Intel, supplying the chipset? Is it something in PCI or PCIe (Peripheral Component Interconnect Express)? In MSI (Message Signalled Interrupts)? Or in AER maybe (Advanced Error Reporting), that shouldn’t report, or shouldn’t log, errors of severity “corrected”? Or should only report them three times and then stop, not the thousands or millions of times they occur?

I know far too little of this area of technology to be able to judge that. I did see someone mention the command lspci somewhere, which lists all PCI devices. It reveals that the mentioned device 00:1c.4 is a PCI bridge. There are four of them in the system:

00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)

The commands sudo lspci -v and sudo lspci -vv give me some more info. I see no difference between this bridge and the three others. There is this line:
Interrupt: pin A routed to IRQ 123
It mentions the same interrupt number that I also saw in top, in the name of the process doing the reporting: irq/123-aerdrv.
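
Picking the interrupt number out of the lspci output can be automated. A minimal sketch, with a made-up function name, that relies on the wording of the ‘routed to IRQ’ line as quoted above:

```shell
# Print the IRQ number found in lspci -v style text read on stdin.
irq_of_device() {
    sed -n 's/.*routed to IRQ \([0-9][0-9]*\).*/\1/p'
}

# Fed with the line quoted above, this prints 123.
printf '\tInterrupt: pin A routed to IRQ 123\n' | irq_of_device
```

On the real machine the input could come from sudo lspci -v -s 00:1c.4, where -s restricts the listing to that one device.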

If perhaps Hewlett-Packard could have fixed the problem, by a firmware update or something, it seems they don’t even try. They don’t take the problem seriously. In an HP Community forum, I found this reaction by an HP employee nicknamed A4Apollo. Quote:

HP does not support dual boot options unless the unit has been shipped with two operating systems.
   
You have to contact Linux support for more assistance.

Solution (phase 2)

Solutions are described in various places on the internet, all basically variants of the same thing. Not really solutions maybe, but workarounds. Those helped me in the past. They didn’t help me this time, but see the next chapter. I deliberately labelled that one phase 1 because, chronologically, I had to apply it first, although I learned about it last.

Those solutions I found on the internet entail that you edit the file /etc/default/grub. Where there is a line like:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
you change it to:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"
Or if the line isn’t there, add it. Then you run update-grub as root (it is sometimes called update-grub2, and Alpine Linux doesn’t have it), which, using grub-mkconfig and os-prober, generates a new /boot/grub/grub.cfg. That file controls the menu that the grub bootloader presents to the user, and through it the parameter pci=noaer will be passed to the Linux kernel.
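
The edit can also be scripted. The sketch below is an illustration, not a tested installer step: add_kernel_param is a made-up helper, it assumes GNU sed (for -i) and a parameter without slashes or ampersands, and it is demonstrated on a scratch copy instead of the real /etc/default/grub.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in the given
# file, unless the parameter is already present on that line.
add_kernel_param() {
    file=$1
    param=$2
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=\".*$param" "$file"; then
        return 0
    fi
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
}

# Try it on a scratch file that mimics /etc/default/grub.
scratch=$(mktemp)
printf 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > "$scratch"
add_kernel_param "$scratch" pci=noaer
add_kernel_param "$scratch" pci=noaer    # second call changes nothing
grub_line=$(grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$scratch")
echo "$grub_line"
rm -f "$scratch"
```

On a real system the helper would have to run with root rights on /etc/default/grub, preferably after making a backup copy, and be followed by sudo update-grub.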

This setting noaer means ‘No Advanced Error Reporting’. It doesn’t really solve the problem: the errors (which are automatically corrected) still happen, but they are no longer reported, so no longer logged, and no disk space is eaten up.

Some also suggest the setting pci=nomsi instead, which disables the MSI or Message Signalled Interrupts. Does it have consequences? It means interrupts should be done in a more traditional and old-fashioned manner? Does that always work correctly with all modern hardware? I don’t know. Some also mentioned pci=nommconf to disable Memory-Mapped PCI Configuration Space. No idea what that does.

Just pci=noaer worked for me.

Solution (phase 1)

The previous chapter, which I labelled phase 2, may help to understand this one.

The problem with the grub file edits is that they become effective only after a reboot. And when you reboot a Linux live session (i.e. without having installed anything to hard disk yet), the live stick will reinstate its own settings, so you have the interrupt looping problem yet again.

The solution I suddenly thought of (the evening of 19 March, I wish the idea had come up earlier!): in the grub menu that the booted ISO, or Ventoy, or wherever it comes from, presents to you, press the letter e! That's e for edit. Then you can add pci=noaer there, to the kernel line (the one that starts with linux), and boot with ctrl-x or F10, just before the live session starts! And the looping and logging problem will be gone!

This allowed me to finally install Linux Mint 21.1 Vera on my HP Pavilion x360 laptop. Then I still had to edit /etc/default/grub, run update-grub, and reboot, as described earlier, to make the solution permanent. I felt so relieved!
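
Whether the parameter really reached the running kernel can be checked in /proc/cmdline, both in the live session and after the reboot. A minimal sketch, with a made-up function name:

```shell
# Succeeds if the given kernel command line contains the parameter
# as a whole word.
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the command line of the running kernel.
if [ -r /proc/cmdline ] && has_kernel_param "$(cat /proc/cmdline)" pci=noaer; then
    echo "pci=noaer is active"
else
    echo "pci=noaer is not active"
fi
```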

In hindsight, the answer was already here, but I had overlooked it.

An earlier episode

(This chapter added 3 April 2023)

Later on I found some notes on how, in late February 2021, I managed to install Linux Mint 20.1 Ulyssa, also with Cinnamon if that makes any difference, on that same machine, an HP Pavilion x360 Convertible. That time I solved the looping problem, more or less, in a different way: by renice-ing.

This can be done in the command line (sudo renice nice-value process-number) or in top with the r command. A nice value of 19, if I understand process scheduling correctly, means the process only gets a time slice when nothing else wants to run, or if other processes have already used up a lot of CPU time.

I reniced the processes that top reported with the names systemd-journal and rsyslogd. The process irq/123-aerdrv however could not be reniced.
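
The effect of renice can be tried out harmlessly on a process of one’s own. The sketch below is an illustration, assuming a Linux system with /proc and the renice command: it uses a background sleep as a stand-in for the logging daemons. Raising the nice value of your own process needs no root; for systemd-journal or rsyslogd, sudo would be required.

```shell
# Start a harmless background process to experiment on.
sleep 60 &
pid=$!

# Raise the nice value to 19, the lowest scheduling priority.
renice -n 19 -p "$pid" > /dev/null

# Field 19 of /proc/PID/stat is the nice value.
nice_now=$(awk '{ print $19 }' "/proc/$pid/stat")
echo "nice value of process $pid is now $nice_now"

kill "$pid"
```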

The system was still quite slow, but it helped, and I managed to start the installation of Linux Mint from the live session run from a USB stick. During the installation I used sudo top to find active processes, which I reniced to minus 10, to give them a high scheduling priority. I had to repeat this several times, because not all processes continued to run during the whole installation.

The result of these measures was that the computer remained overburdened and slow, but the installation did continue, and finally made it to the end.


Newer Linux kernel

Earlier tests were all in OSes that use Linux kernels 5.4 and 5.15. Today, 5 April 2023, following a question on Stack Exchange, I tested in a live (uninstalled) session of Manjaro 22.0 Sikaris with Xfce, which has kernel 6.1, using the same Hewlett-Packard hardware as before. Output of uname -a was:
Linux manjaro 6.1.19-1-MANJARO #1 SMP PREEMPT_DYNAMIC Mon Mar 13 12:59:35 UTC 2023 x86_64 GNU/Linux

Result, seen in top: process systemd-journal used 90 to 100% of a CPU, and irq/123-aerdrv took about 25%. The system remained quite responsive, though. I noticed no rapidly growing logfile, disk usage remained stable.

Later that day, following instructions found on the internet, I tried to compile the latest stable kernel, 6.2.9, downloaded as a compressed tarball. That was under Linux Mint 21.1 Vera. It went well, but took a long time. Eventually, in the linking (ld) phase of vmlinux.o, the procedure was killed by the system for lack of memory, despite having 4 GB of RAM and 2 GiB of swap. That wasn’t enough. Strange. And a pity.

The next day I tried again on other hardware, now under Linux Mint 20.3 Una, and with 8 GB RAM and 2 GiB of swap space. That too was not enough. I now noticed in top that the offending process was not ld, but rather objtool, which had 6,7g of resident (not virtual) memory allocated (is that GB or GiB?). That should be possible, because I had purposely stopped Firefox, another memory hog. But the make process was killed with error code 137, which, I find, means 128 + 9: the process was ended by signal 9 (SIGKILL), as the kernel does to a process when the system runs out of memory.
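
Exit status 137 can be read as 128 + 9, the shell’s way of reporting that a process was ended by signal 9 (SIGKILL), which is the signal the kernel’s out-of-memory killer uses. A small sketch that reproduces the number without exhausting any memory:

```shell
# A child shell that ends itself with SIGKILL (signal 9).
status=0
sh -c 'kill -KILL $$' || status=$?

echo "exit status: $status"
echo "signal number: $((status - 128))"
```

This prints exit status 137 and signal number 9. The status alone does not prove that memory ran out, since anything can send SIGKILL. The kernel log (dmesg) normally names the process that the out-of-memory killer picked.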

I think there’s something wrong here. vmlinuz-5.15.0-69-generic is only about 11 MB (11468936 bytes, to be exact), so why should compiling and linking it require some 600 times more than that in RAM? OK, that file is compressed. But still.
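
A rough check of that ratio, as a sketch under the assumption that the 6,7g shown by top means 6.7 GiB:

```shell
# Assumption: the 6,7g in top means 6.7 GiB (1 GiB = 1073741824 bytes).
# Shell arithmetic is integer-only, so 6.7 is written as 67 / 10.
resident_bytes=$((67 * 1073741824 / 10))
image_bytes=11468936
ratio=$((resident_bytes / image_bytes))
echo "objtool held about $ratio times the size of the compressed kernel"
```

With these numbers the ratio comes out at 627.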