this post was submitted on 14 Jul 2024
Linux

[–] [email protected] 4 points 1 month ago* (last edited 1 month ago) (4 children)

I'm getting random reboots, tied to nothing I can identify. Micro computer, AMD Ryzen 5 5800H. New (<6mo) computer; no reused old components. 36GB RAM, which has passed a few runs of memtest. I have regularly seen the k10temp reading spike to the low 90s without a reboot, and when the reboots do happen I haven't noticed temps above 60. The only thing I've been able to correlate it with at all is composing email; I'm a fairly fast typist, and markdown-oxide goes berserk and uses well over 100% CPU (~165%) while I'm typing. I made the correlation because several of the reboots happened while I was composing emails (which I subsequently lost).

There is nothing in the logs from the previous boot (boot -1). Just normal logging and then reboot. Nothing at all suspicious, no weird errors. I struggle to use more than 50% memory, so memory contention is not an issue. It's like a sudden power cycle.

The system is on a UPS; my next avenue of investigation is the UPS itself, but power surges in the house shouldn't be a possibility; there are a half dozen other computers in the house, some on UPS, some not, and none of those are having issues.

I saw an article a few days ago about a tool to help track down mysterious reboots like this, but can't find it now. I don't know how software could help; it is literally: everything is working, the screens go blank, and in a second or so the BIOS posts.

I am suspicious of the CPU core temp readings, which I can't seem to get at. I get the GPU temp, which is never stressed (stays around 45C); and k10temp_tctl, which from what I can find is an edge temp and not the core temp; and all of the NVMe temps, which all stay in the 40s. But the fact that I don't know if I'm seeing what's really going on temp-wise in the CPU worries me. On the other hand, I don't think I've had it crash during a software update, which often includes compiling a bunch of Rust, C, Go, and whatever other packages, and I can see those pegging multiple cores.
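One way to check what the sensors said right before a reboot is to log every temperature the kernel exposes and force each sample to disk. This is only a rough sketch, assuming the standard hwmon sysfs layout (`/sys/class/hwmon/hwmon*/temp*_input`, in millidegrees C); the function names, the `temps.log` path, and the one-second interval are made up for illustration. It can only log what the kernel reports, so it won't reveal a sensor that isn't exposed at all.

```python
import glob
import os
import time


def _read(path, default=None):
    """Return the stripped contents of a sysfs file, or default if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return default


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/label": degrees_C} for every temp*_input under root."""
    temps = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for input_path in sorted(glob.glob(pattern)):
        chip_dir = os.path.dirname(input_path)
        hwmon = os.path.basename(chip_dir)                        # e.g. "hwmon2"
        prefix = os.path.basename(input_path)[: -len("_input")]  # e.g. "temp1"
        chip = _read(os.path.join(chip_dir, "name"), hwmon)
        label = _read(os.path.join(chip_dir, prefix + "_label"), prefix)
        raw = _read(input_path)
        try:
            # sysfs reports millidegrees Celsius
            temps[f"{hwmon}:{chip}/{label}"] = int(raw) / 1000.0
        except (TypeError, ValueError):
            continue  # sensor missing or returned something unparsable
    return temps


def append_sample(log_path, temps):
    """Append one timestamped line and fsync so it survives a sudden reset."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    fields = " ".join(f"{k}={v:.1f}" for k, v in sorted(temps.items()))
    line = f"{stamp} {fields}"
    with open(log_path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())
    return line


def log_temps(log_path="temps.log", interval_s=1.0, samples=None):
    """Log samples until `samples` is reached (forever if None)."""
    taken = 0
    while samples is None or taken < samples:
        append_sample(log_path, read_hwmon_temps())
        taken += 1
        time.sleep(interval_s)
```

Calling `log_temps()` in a terminal and leaving it running means the last line of `temps.log` after a reboot is the final reading that made it to disk. The `fsync` on every sample is deliberate: without it, the last few seconds of data would likely still be sitting in the page cache when the power drops.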

I'm at a loss. I've looked at everything I can think of, but still haven't gotten a hint about what is triggering this. I may just do a bunch of markdown editing with markdown-oxide enabled and see if I can reliably force it to happen, but that still wouldn't tell me why. I am certain it's not memory, and have mostly convinced myself it isn't temperature, unless it's something hidden I can't get a reading on.

Help?

Edit: it just occurred to me: how do I check for UPS issues when the NUT monitor is running on the computer connected to the UPS? If the UPS is stuttering, it's not going to get logged by NUT. I suppose I could connect a laptop and use it as the monitor, but this sounds like a lot of work to set up. What else should I try first?
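A cheaper first step than a second monitoring machine might be to poll the UPS locally and fsync each reading, so that the last status before a reboot survives on disk. The sketch below assumes NUT's `upsc` client is installed and that the UPS is configured under the name `myups` (a placeholder, as are the function names and the log path). It parses the `key: value` lines that `upsc` prints. It would not catch a dropout shorter than the polling interval, so a clean log doesn't rule the UPS out.

```python
import os
import subprocess
import time


def parse_upsc(text):
    """Parse `upsc` output (one "key: value" pair per line) into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return values


def poll_ups(ups="myups@localhost"):  # "myups" is a hypothetical UPS name
    """Run upsc once and return its variables as a dict."""
    result = subprocess.run(
        ["upsc", ups], capture_output=True, text=True, check=True
    )
    return parse_upsc(result.stdout)


def log_ups(log_path, values,
            keys=("ups.status", "input.voltage", "battery.charge")):
    """Append selected variables with a timestamp and fsync the file."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    fields = " ".join(f"{k}={values.get(k, '?')}" for k in keys)
    line = f"{stamp} {fields}"
    with open(log_path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())
    return line
```

Running `log_ups("ups.log", poll_ups())` in a loop every second or two would do. If the last entries before a reboot show `ups.status` flipping from `OL` (on line) to `OB` (on battery), or a sagging `input.voltage`, that points at the power path; not every UPS driver reports every variable, hence the `?` fallback.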

Edit 2: I've now run stress with 16 cores for multiple minutes a couple of times: once with -c (busy-work threads), and once with -m (busy-work using malloc/free). Both times, gotop showed all 16 cores gratifyingly pegged at 99/100%. Interestingly, k10temp never hit 90C, which I've seen it do before, but today is cool so that's probably helping. With mem-thrashing, I got a bunch of cached memory and finally saw free memory drop to 28%, which I rarely see on this machine because, when I set it up, I was tired of always fretting about memory use and decided to make it a non-issue by maxing the memory with 64GB. Anyway, that's the lowest I've ever noticed free memory drop to. Neither test crashed the machine. I may try longer runs, a half-hour maybe? But I'm now less inclined to suspect that it's thermal-load related.

[–] [email protected] 4 points 1 month ago* (last edited 1 month ago) (1 children)

Shot in the dark, but my latest instability was caused by the MSI board pushing (well, allowing) too much power to the CPU. In my case it's 13th gen Intel, so probably not the same thing. I've updated to a beta BIOS and set Intel's defaults.

One other thing that might or might not help is https://github.com/mchehab/rasdaemon
It helped me identify a failing CPU by logging MCE events/CPU errors.
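Before setting up rasdaemon, a quick check is to scan the kernel log for machine-check lines. This is just a sketch: it assumes a systemd journal, and the regex is my guess at the usual kernel wording (`mce:`, `[Hardware Error]`), not an exhaustive list. The function names are made up. With a hard reset the error often never reaches the previous boot's log, so it's worth scanning the current boot as well, since the kernel can report leftover machine-check data at startup.

```python
import re
import subprocess

# Typical markers of machine-check / hardware error lines in the kernel log.
HW_ERROR = re.compile(r"\[Hardware Error\]|\bmce:|machine check", re.IGNORECASE)


def find_hw_errors(lines):
    """Return only the lines that look like hardware/MCE reports."""
    return [line.rstrip("\n") for line in lines if HW_ERROR.search(line)]


def kernel_log(boot="0"):
    """Return kernel messages for a boot ("0" = current, "-1" = previous)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", boot, "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Something like `for boot in ("0", "-1"): print(find_hw_errors(kernel_log(boot)))` covers both boots. An empty result doesn't prove the CPU is fine; it only means nothing was logged.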

[–] [email protected] 3 points 1 month ago

That isn't the forensic tool I saw, but it looks like it could be really useful, thank you!

[–] [email protected] 3 points 1 month ago (1 children)
[–] [email protected] 0 points 1 month ago

Started to. There's a small learning curve as I only recently switched from grub to EFI, and am still figuring out how to manage stuff like this.

[–] [email protected] 1 points 1 month ago (1 children)

Replace markdown-oxide with another tool for a while; try breaking the correlation to find the cause.

[–] [email protected] 1 points 1 month ago

Yeah, I've disabled markdown-oxide for the moment, so I'll see what my uptimes look like for a bit.

I honestly can't imagine how a userspace program could cause this behavior, though. There's no memory pressure, and there are 16 cores in this CPU, fer chrissake. Even trying to peg the CPU, I didn't notice the md-oxide correlation until I started watching top; the temps weren't going up, performance wasn't impacted.

I thought for sure it was a memory (hardware) issue, but I've run several memtests and they come back clean. No odd kernel module crashes in the logs; no indication anything is wrong until - poof. Reboot.

[–] [email protected] 0 points 1 month ago

Couldn't it be that the PSU is failing? Check it with a multimeter! I'd look at that before the UPS, or maybe both.