Recurring rpm database corruption

Question 1

I manage multiple AlmaLinux 8 VPSes and dedicated servers, and recently I have been seeing a lot of RPM database corruption, specifically this error:

error: rpmdb: BDB0113 Thread/process 401048/139672503925632 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
Error: Error: rpmdb open failed

And yes, I have run this MANY times:

mkdir -p /tmp/rpm-backup
cp -vf /var/lib/rpm/__db.* /tmp/rpm-backup
rm -vf /var/lib/rpm/__db.*

rpm --quiet -qa
rpm --rebuilddb
dnf clean all

This does solve the issue, but only temporarily, and I do mean temporarily. On one specific instance, as soon as I run dnf update -y afterwards, it downloads the packages fine and starts the update, but then fails with the rpmdb error partway through the install.

I suspect this is also why I ran into issues like this one: https://forums.almalinux.org/t/unable-to-detect-release-version/4070

I do not know what to do. There is clearly an underlying issue: the servers have been rebooted multiple times, and I do not have the experience or expertise to find out why this keeps happening. The main concern is that it happens on many of my instances, not just one.

And yes, I tried following this: https://access.redhat.com/solutions/3330211. It is so bad that the database becomes corrupt while I try to install those packages, so I cannot install them without constantly running into the issue.

I would appreciate any help with how exactly to debug this, and I am happy to provide more details if needed.

Question 2

After sitting with this issue for close to two years (yes, I was a bit lazy about finding the solution), it boiled down to a server monitoring agent running a cron job every minute that contained the command timeout -s 9 5 needs-restarting -r. That command opens /var/lib/rpm, touches __db.* / .dbenv.lock, and, if it has not completed within 5 seconds, is force-killed with SIGKILL.

That is exactly the kind of pattern that corrupts or poisons the RPMDB environment. During a typical dnf update, the transaction would start installing the relevant packages, and partway through, the cron job's needs-restarting process would be hit with SIGKILL while it had the database open, leading to the constant corruption.
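For anyone who cannot remove such a job outright, here is a minimal sketch of a gentler wrapper. It assumes GNU coreutils timeout; the function name run_with_grace and the 60/30 second values are purely illustrative and not something the monitoring agent ships. SIGKILL can never be caught, so a killed process gets no chance to release the database. SIGTERM at least gives it that chance, and as far as I know rpm tries to close the database cleanly on SIGTERM, though I have not verified that for needs-restarting specifically.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (name and values are illustrative): run a command with
# a time limit, sending SIGTERM first and SIGKILL only after a grace period.
run_with_grace() {
    local limit="$1" grace="$2"
    shift 2
    # timeout(1): -s selects the first signal; -k sends SIGKILL if the command
    # is still running GRACE seconds after that first signal was sent.
    timeout -s TERM -k "$grace" "$limit" "$@"
}

# Possible replacement for the cron command: allow 60s of runtime, then
# up to 30s to shut down cleanly before the hard kill.
# run_with_grace 60 30 needs-restarting -r
```

Running the check less often than once a minute, with a limit longer than 5 seconds, would reduce the overlap with dnf transactions further.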

The proper solution is to spend the time auditing every process that touches /var/lib/rpm, because that is what pinpoints the culprit. Running DB repair commands only addresses the symptoms temporarily.
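One way to start such an audit is sketched below, assuming a Linux /proc filesystem and root privileges so that other users' processes are visible. The helper name list_db_users is made up for this example. It only sees processes that hold a file descriptor open under the directory at the moment it runs, so a short-lived cron job will only show up if the scan runs in a loop.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from any package): print "PID COMMAND" for each
# process that currently holds a file open under the given directory.
list_db_users() {
    local dir="${1:-/var/lib/rpm}"
    local fd target pid cmd
    for fd in /proc/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the open file; skip unreadable ones.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$dir"/*)
                pid=${fd#/proc/}
                pid=${pid%%/*}
                # cmdline is NUL-separated; turn it into a readable string.
                cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")
                printf '%s %s\n' "$pid" "$cmd"
                ;;
        esac
    done | sort -u
}

# Example: poll once per second and log anything touching the rpm database.
# while sleep 1; do list_db_users /var/lib/rpm; done
```

If auditd is available, a watch rule such as auditctl -w /var/lib/rpm -p rwa -k rpmdb-access, followed later by ausearch -k rpmdb-access -i, should record every access without polling. The key name rpmdb-access is an arbitrary label chosen here.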