Hello everyone,

Today the NVIDIA driver on my server stopped working out of nowhere. It was working yesterday and today it isn’t. I didn’t change anything yesterday or today.

My Plex container stopped working because of a problem with the NVIDIA card I use for transcoding. It’s a GTX 1650.

I ran nvidia-smi and it said Failed to initialize NVML: Driver/library version mismatch. I then upgraded the system, since my last upgrade was months ago and I thought it might help. It didn’t. I also rebooted a few times because some sources said that solves the issue, but it persisted.
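The mismatch error generally means the loaded kernel module and the user-space NVIDIA libraries are at different versions, which can happen when packages are upgraded in the background. A minimal sketch of how that could be checked, assuming the default apt history location /var/log/apt/history.log; the function name nvidia_apt_stanzas is made up for illustration:

```shell
#!/bin/sh
# Print every apt history stanza that mentions an nvidia package.
# $1 = history file (assumed default: /var/log/apt/history.log)
nvidia_apt_stanzas() {
    [ -r "${1:-/var/log/apt/history.log}" ] || return 0
    # RS="" makes awk read blank-line-separated stanzas as single records.
    awk 'BEGIN { RS = ""; ORS = "\n\n" } tolower($0) ~ /nvidia/' \
        "${1:-/var/log/apt/history.log}"
}

nvidia_apt_stanzas

# Version of the kernel module that is currently loaded, if any.
if [ -r /proc/driver/nvidia/version ]; then
    cat /proc/driver/nvidia/version
fi

# Versions of the installed NVIDIA packages (user-space libraries included).
if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | grep -i nvidia || echo "no nvidia packages installed"
fi
```

If a recent stanza shows NVIDIA packages being upgraded, the module version and the package versions printed afterwards are the two things to compare.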

So it was driver reinstall time. I purged the driver with apt purge nvidia* and then installed it again with ubuntu-drivers install --gpgpu nvidia:525-server. After a reboot, nvidia-smi gives this error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

lsmod | grep nvidia shows nothing and /proc/driver/nvidia/version doesn’t exist. I tried starting nvidia-persistenced with systemctl, but it gives this error:

Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 113 has read and write permissions for those files.

/dev/nvidia* doesn’t exist.
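The checks described above can be collected into one script. This is only a sketch: the helper name module_loaded is hypothetical, and it reads /proc/modules, the same listing that lsmod formats.

```shell
#!/bin/sh
# Succeed if a module name appears in a /proc/modules-style listing.
# $1 = module name, $2 = listing file (defaults to /proc/modules)
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded nvidia; then
    echo "nvidia module: loaded"
else
    echo "nvidia module: not loaded"
fi

# The driver exposes this file only while the module is loaded.
if [ -r /proc/driver/nvidia/version ]; then
    cat /proc/driver/nvidia/version
else
    echo "/proc/driver/nvidia/version: missing"
fi

# Device nodes normally appear only once the driver is up.
ls /dev/nvidia* 2>/dev/null || echo "/dev/nvidia*: no device nodes"

# Is a module available for the running kernel at all?
modinfo nvidia >/dev/null 2>&1 \
    || echo "no nvidia module found for kernel $(uname -r) (or modinfo is unavailable)"
```

If the last check fails, the packages are installed but no kernel module was built or shipped for the running kernel, which would explain why nothing loads.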

I’m very noobish when it comes to NVIDIA and Linux. It was a pain to set up initially and I was hoping it would never go wrong, but unfortunately here I am. I don’t really know which logs I should show you or which commands I should run to troubleshoot, so every tip is appreciated, and I will provide logs and anything else if needed.

System info:

  • Ubuntu Server 22.04
  • kernel: 5.15.0-76-generic
  • theoretically installed nvidia driver: nvidia-driver-525-server

Solution

I was using the ubuntu-drivers utility to install the driver, but it turns out it’s not that great. After installing with the manual method from https://help.ubuntu.com/community/NvidiaDriversInstallation, using the command apt install linux-modules-nvidia-${DRIVER_BRANCH}${SERVER}-${LINUX_FLAVOUR}, it’s working again.
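For illustration, this is how the three variables in that command might be filled in for the system described above. The values are assumptions inferred from the system info (driver nvidia-driver-525-server, kernel 5.15.0-76-generic), and the helper nvidia_modules_package is hypothetical:

```shell
#!/bin/sh
# Build the linux-modules-nvidia package name from its three parts.
# $1 = driver branch, $2 = "-server" or empty, $3 = kernel flavour
nvidia_modules_package() {
    printf 'linux-modules-nvidia-%s%s-%s\n' "$1" "$2" "$3"
}

# Assumed values, inferred from the system info in the post.
DRIVER_BRANCH=525
SERVER=-server
# Assumption: the flavour is the last dash-separated field of the
# kernel release string (5.15.0-76-generic -> generic).
KERNEL=5.15.0-76-generic
LINUX_FLAVOUR=${KERNEL##*-}

PACKAGE=$(nvidia_modules_package "$DRIVER_BRANCH" "$SERVER" "$LINUX_FLAVOUR")
echo "$PACKAGE"   # prints linux-modules-nvidia-525-server-generic

# Uncomment to actually install (needs root):
# sudo apt install "$PACKAGE"
```

On a live system, KERNEL could be set from uname -r instead of being hard-coded.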

  • wmassingham@lemmy.world
    1 year ago

    Does it even show up in lspci? To rule out your OS, boot a live system and see if the card is recognized there. A quick thing to check would be that your GPU is actually powered on (fully seated in the PCIe slot and connected to the necessary power).

    • Koma52@lemmy.worldOP
      1 year ago

      It shows up in lspci. Booting a live OS would be a little tricky because the server is in a wall-mounted rack, but I will try that if nothing else works. Thank you.

        • Koma52@lemmy.worldOP
          1 year ago

          I was using the ubuntu-drivers utility that this page mentions too, but it turns out it doesn’t work very well. I have now installed the driver with the manual method from this page, using apt install linux-modules-nvidia-${DRIVER_BRANCH}${SERVER}-${LINUX_FLAVOUR}, and it’s working. Thank you for the suggestion!