Loading Alternative cuDNN Library Versions in Tensorflow

Yilin (Jim) Shi
10 min read · May 16, 2021


This article depicts my explorations and solutions to a Tensorflow & cuDNN version problem back in November 2020.

Key Takeaways/Table of Contents

  1. Tensorflow pip packages/wheels require specific cuDNN library versions to work
  2. Tensorflow log level is controlled by the environment variable TF_CPP_MIN_LOG_LEVEL
  3. CUDA installation can be verified by checking nvcc and should not be confused with the NVIDIA GPU Driver installation
  4. cuDNN installation can be verified by checking the corresponding header and library files
  5. Tensorflow uses dlopen to load libraries
  6. Libraries are first searched in paths specified by the LD_LIBRARY_PATH environment variable (and I helped fix a bug in CSIL's bash initialization script)
  7. You can download cuDNN from NVIDIA’s official archive for various CUDA versions
  8. Although fish environment variables are arrays, LD_LIBRARY_PATH should still be set as a bash-style colon-separated string
  9. Solution 1: Use environment variables to specify the cuDNN library file
  10. Solution 2: Use tensorflow.load_library to manually load our cuDNN library file
  11. Solution 3: Use Docker for alternative CUDA runtime versions

Background

Recently I started using Tensorflow instead of PyTorch to easily optimize my trained models for the NVIDIA Jetson Xavier NX using TensorRT. Although the framework itself is pretty intuitive given my prior knowledge of PyTorch, the system side posed a problem.

1. Problem

Since my computer doesn’t have a dedicated GPU, I chose to run this demo using the computers in our school’s instructional lab. In this demo, I created a simple Keras sequential model that uses the convolution operation:

from tensorflow import keras

model = keras.Sequential(
    [
        keras.Input(shape=(250, 250, 3)),
        keras.layers.Conv2D(32, 5, strides=2, activation="relu"),
        keras.layers.Conv2D(32, 3, activation="relu", name="layer2"),
        keras.layers.Conv2D(32, 3, activation="relu"),
    ]
)

When I tried to feed it input, I got the following error:

E tensorflow/stream_executor/cuda/cuda_dnn.cc:318] Loaded runtime CuDNN library: 7.4.1 but source was compiled with: 7.6.4.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

Apparently, this is a version mismatch involving the cuDNN library, which handles the convolution operation.

2. Exploration: Change Tensorflow Log Level

The first action was to view Tensorflow’s debug messages. Tensorflow’s log verbosity is controlled by the environment variable TF_CPP_MIN_LOG_LEVEL:

  • 0: DEBUG
  • 1: INFO
  • 2: WARNING
  • 3: ERROR

We can set this variable to the most verbose level 0 directly in Python:

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"  # DEBUG, INFO, WARNING, ERROR: 0 ~ 3

It is crucial to set this environment variable before importing Tensorflow. Otherwise, Tensorflow will just use the value defined by the shell.

In case you don’t want to mess up your Python code, you may set the same thing in the shell (before running the Python script) using the env command:

env TF_CPP_MIN_LOG_LEVEL=0 python test.py

Running the same script again, I got the following info message right before it complained about the cuDNN version:

I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7

3. Exploration: Check CUDA Installation

To get the CUDA version, simply run nvcc --version (nvcc is the NVIDIA CUDA compiler). In my case, I got:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

Then, run which nvcc to obtain the CUDA installation path. And I got:

/usr/local/cuda-10.0/bin/nvcc

Evidently, we have CUDA version 10.0 installed on our lab computers.

NVIDIA GPU Driver Version vs CUDA Runtime Version

As a special note, you may obtain a different “CUDA version” by running nvidia-smi. That is fine because the 2 versions refer to 2 distinct components: nvidia-smi refers to the NVIDIA GPU driver, while nvcc refers to the NVIDIA CUDA Toolkit (runtime). Roughly speaking, the version from nvidia-smi is the max CUDA version that your driver supports and has nothing to do with the CUDA runtime version installed. In this article, we only use the runtime version reported by nvcc --version.
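To compare the two side by side, these two commands should be enough (the exact output layout varies across driver and toolkit versions):

nvcc --version   # NVIDIA CUDA Toolkit (runtime) version, e.g. release 10.0
nvidia-smi       # GPU driver version, plus the highest CUDA version that driver supports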

4. Exploration: Check cuDNN Installation

Run whereis cudnn.h to get the location of the cuDNN header. And I got:

cudnn: /usr/include/cudnn.h

Then we can check the version of our cuDNN installation by running cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2. The -A 2 option makes grep print 2 lines of trailing context after matching lines and is used to print the minor and patch version numbers in our case. After running that command, I got:

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"

So, just as Tensorflow complained, our lab indeed had the older cuDNN version 7.4.1.

I went into /usr/include and, by running ls -al, found out that cudnn.h is a symbolic link (indicated by the leading l in the permission bits):

lrwxrwxrwx 1 root root 26 Sep 25  2019 cudnn.h -> /etc/alternatives/libcudnn

Then I found out that /etc/alternatives/libcudnn is also a link by running ls -al /etc/alternatives/libcudnn:

$ ls -al /etc/alternatives/libcudnn
lrwxrwxrwx 1 root root 40 Sep 25 2019 /etc/alternatives/libcudnn -> /usr/include/x86_64-linux-gnu/cudnn_v7.h

Finally, I found the actual cuDNN header file: /usr/include/x86_64-linux-gnu/cudnn_v7.h.
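As a side note, readlink -f can resolve the whole symlink chain in a single step instead of following it link by link:

$ readlink -f /usr/include/cudnn.h
/usr/include/x86_64-linux-gnu/cudnn_v7.h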

In the previous “change log level” exploration, I found that the cuDNN library is called libcudnn.so.7. By running whereis libcudnn.so.7, I confirmed that it was sitting in the corresponding system library directory /usr/lib/x86_64-linux-gnu:

$ whereis libcudnn.so.7
libcudnn.so: /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/x86_64-linux-gnu/libcudnn.so

5. Exploration: How Tensorflow Loads Libraries

I switched my focus to Tensorflow’s dso_loader.cc file that is in charge of loading the cuDNN library.

First of all, Tensorflow tries to get the library filename from the library name:

auto filename = port::Env::Default()->FormatLibraryFileName(name, version);

Searching in the Tensorflow repo, I found the definition of FormatLibraryFileName in tensorflow/tensorflow/core/platform/default/load_library.cc:

string FormatLibraryFileName(const string& name, const string& version) {
  string filename;
#if defined(__APPLE__)
  if (version.size() == 0) {
    filename = "lib" + name + ".dylib";
  } else {
    filename = "lib" + name + "." + version + ".dylib";
  }
#else
  if (version.empty()) {
    filename = "lib" + name + ".so";
  } else {
    filename = "lib" + name + ".so" + "." + version;
  }
#endif
  return filename;
}

Since we’re using Linux, this function transforms "cudnn" and "7" into "libcudnn.so.7". Hmm, this seemed irrelevant to my problem.

Then we have the important part of dso_loader.cc: loading the library:

void* dso_handle;
port::Status status = port::Env::Default()->LoadDynamicLibrary(filename.c_str(), &dso_handle);

The source code for the LoadDynamicLibrary function can be found in tensorflow/tensorflow/core/framework/load_library.cc. This function checks whether library_filename has already been loaded and returns a cached handle if it has. In our case, the library hadn't been loaded before this call, so I focused on the following code:

Env* env = Env::Default();
Library library;
s = env->LoadDynamicLibrary(library_filename, &library.handle);

This refers to the LoadDynamicLibrary function in tensorflow/tensorflow/core/platform/default/load_library.cc:

Status LoadDynamicLibrary(const char* library_filename, void** handle) {
  *handle = dlopen(library_filename, RTLD_NOW | RTLD_LOCAL);
  if (!*handle) {
    return errors::NotFound(dlerror());
  }
  return Status::OK();
}

Here the dlopen function loads the library named by library_filename and returns a handle for it. An important property of this function is that it searches the colon-separated list of directories in the LD_LIBRARY_PATH environment variable. This looked very handy for my problem!

6. Exploration: Environment Variables

Since we knew that Tensorflow uses dlopen to load libraries, my first thought was to modify the $LD_LIBRARY_PATH variable. And indeed, by running echo $LD_LIBRARY_PATH, I got:

/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:{LD_LIBRARY_PATH}

This format looked strange because I was using the fish shell, and I expected the variable to be an array instead of a string. Also, the trailing :{LD_LIBRARY_PATH} looked very alien: maybe there was something wrong in the configuration/startup scripts!

As I mentioned in my previous blog post, “Visual Studio Code Remote fish/tcsh/csh Shells Hotfix”, I execute fish from .bashrc. Thus, fish should inherit environment variables from its parent bash.

I could not find anything that sets this variable in ~/.bash_profile or ~/.bashrc. So I turned my attention to the system-wide /etc/profile. That file executes all .sh scripts in the /etc/profile.d folder. Inside that folder, I found cuda-10.0.sh, which is responsible for setting the CUDA-related environment variables:

export PATH=/usr/local/cuda-10.0/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:{LD_LIBRARY_PATH}
export INCLUDE=/usr/local/cuda-10.0/include:{INCLUDE}

There are no dollar signs before the LD_LIBRARY_PATH and INCLUDE variables on the right-hand side, which explains why they went wrong. Better report this to the SFU CS Helpdesk quickly (they fixed it the next day, awesome)!
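For reference, the intended script presumably looks like the following, with the dollar signs restored (a sketch of the fix, not necessarily the exact file the helpdesk deployed):

export PATH=/usr/local/cuda-10.0/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
export INCLUDE=/usr/local/cuda-10.0/include:${INCLUDE}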

7. Exploration: Download cuDNN

I downloaded a newer version of cuDNN, cudnn-10.0-linux-x64-v7.6.5.32.tgz for CUDA 10.0, from NVIDIA's official archive. After extracting the archive with tar (see the command after the listing below), I got:

  • include/cudnn.h
  • lib64/

Inside the lib64/ directory, I had:

  • libcudnn.so -> libcudnn.so.7* (link)
  • libcudnn.so.7 -> libcudnn.so.7.6.5* (link)
  • libcudnn.so.7.6.5*
  • libcudnn_static.a
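For reference, a tar invocation along these lines unpacks the archive (the destination directory is a placeholder; adjust it so that the lib64/ directory ends up at the path you reference later):

mkdir -p /path/to/my/cudnn/library
tar -xzvf cudnn-10.0-linux-x64-v7.6.5.32.tgz -C /path/to/my/cudnn/library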

8. Exploration: Using Environment Variables

Since we already know that Tensorflow uses dlopen internally, I decided to add my custom cuDNN directory to LD_LIBRARY_PATH:

set -g LD_LIBRARY_PATH /path/to/my/cudnn/library/lib64 /usr/local/cuda-10.0/lib64 /usr/local/cuda/extras/CUPTI/lib64

Running the script that uses the convolution operation again, I encountered the same cuDNN version error:

I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
E tensorflow/stream_executor/cuda/cuda_dnn.cc:318] Loaded runtime CuDNN library: 7.4.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

I then decided to check whether the same thing happens when calling dlopen directly from a simple C program:

/* Needs: #define _GNU_SOURCE plus <dlfcn.h>, <link.h>, <limits.h>, <stdio.h>, <stdlib.h> */
void* handle = dlopen("libcudnn.so", RTLD_NOW | RTLD_LOCAL);
if (!handle) {
    printf("Failed!\n");
    return 1;
}
printf("Success!\n");
char* origin = malloc(sizeof(char) * (PATH_MAX + 1));
dlinfo(handle, RTLD_DI_ORIGIN, origin);    /* directory the library was loaded from */
printf("Library path: %s\n", origin);
struct link_map* m = NULL;
dlinfo(handle, RTLD_DI_LINKMAP, &m);       /* full path via the link map */
printf("Library path: %s\n", m->l_name);
dlclose(handle);

And surprisingly, both printf statements gave me the path to the system's cuDNN library:

$ ./dlopen_tryout.out
Success!
Library path: /usr/lib/x86_64-linux-gnu
Library path: /usr/lib/x86_64-linux-gnu/libcudnn.so

That means, somehow, dlopen ignored my custom library paths.

Then I thought of the other 2 paths in LD_LIBRARY_PATH defined by the system CUDA installation. Maybe they somehow interfered with dlopen's behavior. Thus, I re-ran the same program with only /path/to/my/cudnn/library/lib64 in my LD_LIBRARY_PATH:

$ env LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64 ./dlopen_tryout.out
Success!
Library path: /path/to/my/cudnn/library/lib64
Library path: /path/to/my/cudnn/library/lib64/libcudnn.so

It’s a success this time!

My first hypothesis was that LD_LIBRARY_PATH is not searched from front to back with the first match winning (which is how PATH behaves). So I changed the order of the entries in LD_LIBRARY_PATH and re-ran the program.

$ env LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:/path/to/my/cudnn/library/lib64 ./dlopen_tryout.out
Success!
Library path: /path/to/my/cudnn/library/lib64
Library path: /path/to/my/cudnn/library/lib64/libcudnn.so
$ env LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64 ./dlopen_tryout.out
Success!
Library path: /path/to/my/cudnn/library/lib64
Library path: /path/to/my/cudnn/library/lib64/libcudnn.so

No matter where I put my custom library path, dlopen was always able to find my custom cuDNN library. Thus, that hypothesis was incorrect.

My next hypothesis was that something in fish “hid” the environment variables from my program above. Thus, I retrieved the environment variables in C by calling getenv:

printf("PATH: %s\n", getenv("PATH"));
printf("LD_LIBRARY_PATH: %s\n", getenv("LD_LIBRARY_PATH"));

When I re-ran the same program, I got:

$ echo $LD_LIBRARY_PATH
/path/to/my/cudnn/library/lib64 /usr/local/cuda-10.0/lib64 /usr/local/cuda/extras/CUPTI/lib64
$ ./dlopen_tryout.out
PATH: /opt/software1/bin:/usr/local/software2/bin:/home/username/.local/bin:/usr/local/cuda-10.0/bin:/opt/software3:/opt/software4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/local/software5/bin:/usr/local/software6/bin://usr/local/software7/bin:/usr/local/software6/bin:/usr/local/software8/bin:/usr/local/software9/bin:/opt/software10/bin:/opt/software11/bin:/snap/software12/bin
LD_LIBRARY_PATH: (null)
Success!
Library path: /usr/lib/x86_64-linux-gnu
Library path: /usr/lib/x86_64-linux-gnu/libcudnn.so

Oh no! The LD_LIBRARY_PATH variable was not set at all in the child process!

After reading the fish man page exhaustively, I found that I had missed the -x option of the set command. This option causes the specified shell variable to be exported to child processes. In my previous attempts, the child processes didn't receive the variable because -x wasn't specified. Correcting this problem is as simple as adding an x to the previous set command:

set -gx LD_LIBRARY_PATH /path/to/my/cudnn/library/lib64 /usr/local/cuda-10.0/lib64 /usr/local/cuda/extras/CUPTI/lib64

When I ran the C program above, I got:

$ ./dlopen_tryout.out
PATH: /opt/software1/bin:/usr/local/software2/bin:/home/username/.local/bin:/usr/local/cuda-10.0/bin:/opt/software3:/opt/software4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/local/software5/bin:/usr/local/software6/bin://usr/local/software7/bin:/usr/local/software6/bin:/usr/local/software8/bin:/usr/local/software9/bin:/opt/software10/bin:/opt/software11/bin:/snap/software12/bin
LD_LIBRARY_PATH: /path/to/my/cudnn/library/lib64/usr/local/cuda-10.0/lib64/usr/local/cuda/extras/CUPTI/lib64
Success!
Library path: /usr/lib/x86_64-linux-gnu
Library path: /usr/lib/x86_64-linux-gnu/libcudnn.so

There’s another problem: the paths in LD_LIBRARY_PATH were not separated at all!

And now came my final hypothesis: maybe LD_LIBRARY_PATH still needs to be a bash-style colon-separated string, even in fish! So this time, I assigned a single colon-separated string to LD_LIBRARY_PATH:

set -gx LD_LIBRARY_PATH /path/to/my/cudnn/library/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64

And it’s a success: both the C and Python programs were using my custom library!

$ ./dlopen_tryout.out
PATH: /opt/software1/bin:/usr/local/software2/bin:/home/username/.local/bin:/usr/local/cuda-10.0/bin:/opt/software3:/opt/software4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/local/software5/bin:/usr/local/software6/bin://usr/local/software7/bin:/usr/local/software6/bin:/usr/local/software8/bin:/usr/local/software9/bin:/opt/software10/bin:/opt/software11/bin:/snap/software12/bin
LD_LIBRARY_PATH: /path/to/my/cudnn/library/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
Success!
Library path: /path/to/my/cudnn/library/lib64
Library path: /path/to/my/cudnn/library/lib64/libcudnn.so
$ python 16_keras_2.py

9. Solution 1: Use Environment Variables

There are multiple ways to set our environment variables.

fish Shell

In the fish shell, if your LD_LIBRARY_PATH is not empty and you want to retain the existing entries, run:

set -gx LD_LIBRARY_PATH $LD_LIBRARY_PATH:/path/to/my/cudnn/library/lib64

Otherwise, use:

set -gx LD_LIBRARY_PATH /path/to/my/cudnn/library/lib64

to purge previous LD_LIBRARY_PATH entries and/or to avoid the leading colon.

Again, make sure to colon-separate the paths in fish (just like in bash).

You may consider adding the line to your fish config file ~/.config/fish/config.fish to automatically run it in all fish sessions. To run it only in interactive sessions, check the exit status of the command status --is-interactive (returns 0 if fish is interactive).
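For example, a minimal config.fish snippet could look like this (using the same placeholder path as above; as noted in Solution 1, use the plain form without $LD_LIBRARY_PATH if the variable may be unset):

# ~/.config/fish/config.fish
if status --is-interactive
    set -gx LD_LIBRARY_PATH $LD_LIBRARY_PATH:/path/to/my/cudnn/library/lib64
end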

bash Shell

In the bash shell, if your LD_LIBRARY_PATH is not empty and you want to retain the existing entries, run:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/path/to/my/cudnn/library/lib64

Otherwise, use:

export LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64

to purge previous LD_LIBRARY_PATH entries and/or to avoid the leading colon.

You may consider adding the line to your bash config file ~/.bashrc to automatically run it in interactive sessions.
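If you want a single line that works whether or not LD_LIBRARY_PATH is already set, bash's parameter expansion can append the old value only when it exists (same placeholder path as above):

# ~/.bashrc
export LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}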

env Command

The env command sets the environment variables for the current execution only and does not contaminate the environment for other programs.

env LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64 python test.py

Python’s os.environ Object

Python’s os.environ object may be used to modify or query the environment. Run the following in Python before any call to the cuDNN library (note that the dynamic loader may read LD_LIBRARY_PATH only once at process startup, so if this has no effect on your system, fall back to the env command above):

os.environ["LD_LIBRARY_PATH"] = "/path/to/my/cudnn/library/lib64"

10. Solution 2: Use tensorflow.load_library

An alternative solution is to tell Tensorflow explicitly which cuDNN library to use before it loads one automatically. As mentioned in section 5, Tensorflow won’t load a library again if it’s already loaded, so we can simply load our custom cuDNN library before Tensorflow tries to load the system one.

Add the following code to our Python script:

import tensorflow as tf

tf.load_library("/path/to/my/cudnn/library/lib64/libcudnn.so")

Note that this solution is non-portable. If the same file doesn’t exist on another computer, this function will raise OSError.

11. Solution 3: Use Docker

Docker may be the easiest way to enable Tensorflow GPU support on Linux since only the NVIDIA GPU driver is required on the host machine (the NVIDIA CUDA Toolkit is not required).

You may use the -v argument to map a directory on the host's file system into the container.
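As a rough sketch (assuming the official tensorflow/tensorflow:latest-gpu image, the NVIDIA Container Toolkit installed on the host, and a test.py in the current directory), a run could look like:

docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace -w /workspace \
    tensorflow/tensorflow:latest-gpu \
    python test.py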

For more information, please refer to the official tutorial: https://www.tensorflow.org/install/docker
