Loading Alternative cuDNN Library Versions in Tensorflow
This article describes my explorations and solutions to a Tensorflow & cuDNN version problem back in November 2020.
Key Takeaways/Table of Contents
- Tensorflow pip packages/wheels require specific cuDNN library versions to work
- Tensorflow log level is controlled by the environment variable TF_CPP_MIN_LOG_LEVEL
- CUDA installation can be verified by checking nvcc and should not be confused with the NVIDIA GPU Driver installation
- cuDNN installation can be verified by checking the corresponding header and library files
- Tensorflow uses dlopen to load libraries
- Libraries are first searched in paths specified by the LD_LIBRARY_PATH environment variable (and I helped fix a bug in CSIL's bash initialization script)
- You can download cuDNN from NVIDIA's official archive for various CUDA versions
- Although fish environment variables are arrays, LD_LIBRARY_PATH is still initialized as a bash-style colon-separated string
- Solution 1: Use environment variables to specify the cuDNN library file
- Solution 2: Use tensorflow.load_library to manually load our cuDNN library file
- Solution 3: Use Docker for alternative CUDA runtime versions
Background
Recently I started using Tensorflow instead of PyTorch to easily optimize my trained models for the NVIDIA Jetson Xavier NX using TensorRT. Although the framework itself is pretty intuitive with my prior knowledge of PyTorch, the system side posed a problem.
1. Problem
Since my computer doesn’t have a dedicated GPU, I chose to run this demo using the computers in our school’s instructional lab. In this demo, I created a simple Keras sequential model that uses the convolution operation:
from tensorflow import keras

model = keras.Sequential(
    [
        keras.Input(shape=(250, 250, 3)),
        keras.layers.Conv2D(32, 5, strides=2, activation="relu"),
        keras.layers.Conv2D(32, 3, activation="relu", name="layer2"),
        keras.layers.Conv2D(32, 3, activation="relu"),
    ]
)
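The exact input doesn't matter much; a random tensor of the right shape, along the lines of this sketch, is enough to exercise the cuDNN-backed convolution kernels:
import numpy as np

x = np.random.rand(1, 250, 250, 3).astype("float32")  # a batch with one 250x250 RGB image
y = model(x)  # forward pass; this is where the cuDNN convolution kernels get invoked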
When I tried to feed it input, I got the following error:
E tensorflow/stream_executor/cuda/cuda_dnn.cc:318] Loaded runtime CuDNN library: 7.4.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Apparently, this is a version mismatch involving the cuDNN library, which is responsible for the convolution operation.
2. Exploration: Change Tensorflow Log Level
The first action was to view Tensorflow's debug messages. Tensorflow's log verbosity is controlled by the environment variable TF_CPP_MIN_LOG_LEVEL:
- 0: DEBUG
- 1: INFO
- 2: WARNING
- 3: ERROR
We can set this variable to the most verbose level 0 directly in Python:
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0" # DEBUG, INFO, WARNING, ERROR: 0 ~ 3
It is crucial to set this environment variable before importing Tensorflow. Otherwise, Tensorflow will just use the value defined by the shell.
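For clarity, the required ordering looks like this: set the variable first, then import Tensorflow:
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"  # must come before the import below

import tensorflow as tf  # Tensorflow reads TF_CPP_MIN_LOG_LEVEL at import time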
In case you don't want to mess up your Python code, you may set the same thing in the shell (before running the Python script) using the env command:
env TF_CPP_MIN_LOG_LEVEL=0 python test.py
Running the same script again, I got the following info message right before it complained about the cuDNN version:
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
3. Exploration: Check CUDA Installation
To get the CUDA version, simply run nvcc --version (nvcc is the NVIDIA CUDA compiler). In my case, I got:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
Then, run which nvcc to obtain the CUDA installation path. And I got:
/usr/local/cuda-10.0/bin/nvcc
Evidently, we have CUDA version 10.0 installed on our lab computers.
NVIDIA GPU Driver Version vs CUDA Runtime Version
As a special note, you may obtain a different “CUDA version” by running nvidia-smi. That is fine, because the two versions refer to two distinct components: nvidia-smi reports the NVIDIA GPU driver, while nvcc reports the NVIDIA CUDA Toolkit (runtime). Roughly speaking, the version from nvidia-smi is the maximum CUDA version that your driver supports and has nothing to do with the CUDA runtime version installed. In this article, we only use the runtime version reported by nvcc --version.
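If you prefer to check both numbers from Python, a small ctypes sketch against the CUDA runtime library also works (the library name libcudart.so.10.0 matches my lab's installation and may differ on yours):
import ctypes

# cudaDriverGetVersion / cudaRuntimeGetVersion are part of the CUDA runtime API
cudart = ctypes.CDLL("libcudart.so.10.0")
driver_version = ctypes.c_int()
runtime_version = ctypes.c_int()
cudart.cudaDriverGetVersion(ctypes.byref(driver_version))
cudart.cudaRuntimeGetVersion(ctypes.byref(runtime_version))
# Both are encoded as major * 1000 + minor * 10, so 10000 means CUDA 10.0
print("Max CUDA version supported by the driver:", driver_version.value)
print("Installed CUDA runtime version:", runtime_version.value)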
4. Exploration: Check cuDNN Installation
Run whereis cudnn.h to get the location of the cuDNN header. And I got:
cudnn: /usr/include/cudnn.h
Then we can check the version of our cuDNN installation by running cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2. The -A 2 option makes grep print 2 lines of trailing context after matching lines and is used here to print the minor and patch version numbers. After running that command, I got:
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
So, just as Tensorflow complained, our lab indeed had the old cuDNN version 7.4.1.
I went into /usr/include and found out that cudnn.h is a link (indicated by the initial l) by running ls -al:
lrwxrwxrwx 1 root root 26 Sep 25 2019 cudnn.h -> /etc/alternatives/libcudnn
Then I found out that /etc/alternatives/libcudnn is also a link by running ls -al /etc/alternatives/libcudnn:
$ ls -al /etc/alternatives/libcudnn
lrwxrwxrwx 1 root root 40 Sep 25 2019 /etc/alternatives/libcudnn -> /usr/include/x86_64-linux-gnu/cudnn_v7.h
Finally, I found the real cuDNN header file: /usr/include/x86_64-linux-gnu/cudnn_v7.h.
In the previous “change log level” exploration, I found out that the cuDNN library is called libcudnn.so.7. And by running whereis libcudnn.so.7, I confirmed that it was sitting in the corresponding library directory, /usr/lib/x86_64-linux-gnu/:
$ whereis libcudnn.so.7
libcudnn.so: /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/x86_64-linux-gnu/libcudnn.so
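As a quick cross-check from Python, you can also ask the loaded library itself for its version; cudnnGetVersion() returns the same major * 1000 + minor * 100 + patch encoding defined in the header above (a small sketch, assuming libcudnn.so.7 is resolvable):
import ctypes

cudnn = ctypes.CDLL("libcudnn.so.7")  # ctypes uses dlopen() under the hood
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print(cudnn.cudnnGetVersion())  # prints 7401 for cuDNN 7.4.1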
5. Exploration: How Tensorflow Loads Libraries
I switched my focus to Tensorflow's dso_loader.cc file, which is in charge of loading the cuDNN library.
First of all, Tensorflow tries to get the library filename from the library name:
auto filename = port::Env::Default()->FormatLibraryFileName(name, version);
Searching in the Tensorflow repo, I found the definition of FormatLibraryFileName in tensorflow/tensorflow/core/platform/default/load_library.cc:
string FormatLibraryFileName(const string& name, const string& version) {
string filename;
#if defined(__APPLE__)
if (version.size() == 0) {
filename = "lib" + name + ".dylib";
} else {
filename = "lib" + name + "." + version + ".dylib";
}
#else
if (version.empty()) {
filename = "lib" + name + ".so";
} else {
filename = "lib" + name + ".so" + "." + version;
}
#endif
return filename;
}
Since we're using Linux, this function transforms "cudnn" and "7" into "libcudnn.so.7". Hmm, this seemed irrelevant to my problem.
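Just to make the mapping concrete, here is a rough Python rendering of the Linux branch (purely illustrative):
def format_library_filename(name: str, version: str) -> str:
    # Mirrors the non-Apple branch of FormatLibraryFileName shown above
    return f"lib{name}.so" if not version else f"lib{name}.so.{version}"

print(format_library_filename("cudnn", "7"))  # libcudnn.so.7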
Then we have the important part of dso_loader.cc: loading the library:
void* dso_handle;
port::Status status = port::Env::Default()->LoadDynamicLibrary(filename.c_str(), &dso_handle);
The source code for the LoadDynamicLibrary function can be found in tensorflow/tensorflow/core/framework/load_library.cc. This function checks whether library_filename has already been loaded and returns a cached handle if it has. In our case, the library wasn't loaded before we called this function, so I focused on the following code:
Env* env = Env::Default();
Library library;
s = env->LoadDynamicLibrary(library_filename, &library.handle);
This refers to the LoadDynamicLibrary function in tensorflow/tensorflow/core/platform/default/load_library.cc:
Status LoadDynamicLibrary(const char* library_filename, void** handle) {
*handle = dlopen(library_filename, RTLD_NOW | RTLD_LOCAL);
if (!*handle) {
return errors::NotFound(dlerror());
}
return Status::OK();
}
Here the dlopen function loads the library_filename library and returns a handle for it. An important property of this function is that it searches the colon-separated list of directories given in the LD_LIBRARY_PATH environment variable. This looked very handy for my problem!
6. Exploration: Environment Variables
Since we knew that Tensorflow uses dlopen to load libraries, my first thought was to modify the $LD_LIBRARY_PATH variable. And indeed, by running echo $LD_LIBRARY_PATH, I got:
/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:{LD_LIBRARY_PATH}
This format looked strange because I was using the fish shell, and I expected the variable to be an array instead of a string. Also, the trailing :{LD_LIBRARY_PATH} looked very alien: maybe there was something wrong in the configuration/startup scripts!
As I mentioned in my previous blog post, “Visual Studio Code Remote fish/tcsh/csh Shells Hotfix”, I execute fish from .bashrc. Thus, fish should inherit environment variables from its parent bash.
I could not find anything that sets this variable in ~/.bash_profile or ~/.bashrc. So I turned my attention to the system-wide /etc/profile. That file executes all the .sh scripts in the /etc/profile.d folder. Inside that folder, I found cuda-10.0.sh, which is responsible for setting the environment variables:
export PATH=/usr/local/cuda-10.0/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:{LD_LIBRARY_PATH}
export INCLUDE=/usr/local/cuda-10.0/include:{INCLUDE}
There are no dollar signs before the variables LD_LIBRARY_PATH and INCLUDE (they should be referenced as ${LD_LIBRARY_PATH} and ${INCLUDE}), which explains why they went wrong. Better report this to the SFU CS Helpdesk quickly (they fixed it the next day, awesome)!
7. Exploration: Download cuDNN
I downloaded a newer version of cuDNN, cudnn-10.0-linux-x64-v7.6.5.32.tgz, for CUDA 10.0 from NVIDIA's official archive. Extracting the file with tar, I got:
include/cudnn.h
lib64/
Inside the lib64/ directory, I had:
libcudnn.so -> libcudnn.so.7* (link)
libcudnn.so.7 -> libcudnn.so.7.6.5* (link)
libcudnn.so.7.6.5*
libcudnn_static.a
8. Exploration: Using Environment Variables
Since we already know that Tensorflow uses dlopen internally, I decided to add my custom cuDNN directory to LD_LIBRARY_PATH:
set -g LD_LIBRARY_PATH /path/to/my/cudnn/library/lib64 /usr/local/cuda-10.0/lib64 /usr/local/cuda/extras/CUPTI/lib64
Running the script that uses the convolution operation again, I encountered the same cuDNN version error:
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
E tensorflow/stream_executor/cuda/cuda_dnn.cc:318] Loaded runtime CuDNN library: 7.4.1 but source was compiled with: 7.6.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
I then decided to check whether the same thing happens when calling dlopen directly from a simple C program (the snippet below also needs _GNU_SOURCE defined and the dlfcn.h, link.h, limits.h, stdio.h, and stdlib.h headers):
void* handle = dlopen("libcudnn.so", RTLD_NOW | RTLD_LOCAL);
if (!handle) {
    printf("Failed!\n");
    exit(1);
}
printf("Success!\n");
// Ask the loader for the directory the library was loaded from
char* origin = malloc(sizeof(char) * (PATH_MAX + 1));
dlinfo(handle, RTLD_DI_ORIGIN, origin);
printf("Library path: %s\n", origin);
// The link map records the full path of the loaded object
struct link_map* m = NULL;
dlinfo(handle, RTLD_DI_LINKMAP, &m);
printf("Library path: %s\n", m->l_name);
dlclose(handle);
And surprisingly, both printf statements gave me the path to the system's cuDNN library:
$ ./dlopen_tryout.out
Success!
Library path: /usr/lib/x86_64-linux-gnu
Library path: /usr/lib/x86_64-linux-gnu/libcudnn.so
That means, somehow, dlopen ignored my custom library paths.
Then I thought of the other two LD_LIBRARY_PATH entries defined by the system CUDA installation. Maybe they somehow interfered with dlopen's behavior. Thus, I re-ran the same program with only /path/to/my/cudnn/library/lib64 in my LD_LIBRARY_PATH.
$ env LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64 ./dlopen_tryout.out
Success!
Library path: /path/to/my/cudnn/library/lib64
Library path: /path/to/my/cudnn/library/lib64/libcudnn.so
It’s a success this time!
My first hypothesis was that LD_LIBRARY_PATH is not searched from front to back with the first match winning (which is how PATH behaves). So I changed the order of the entries in LD_LIBRARY_PATH and re-ran the program.
$ env LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:/path/to/my/cudnn/library/lib64 ./dlopen_tryout.out
Success!
Library path: /path/to/my/cudnn/library/lib64
Library path: /path/to/my/cudnn/library/lib64/libcudnn.so

$ env LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64 ./dlopen_tryout.out
Success!
Library path: /path/to/my/cudnn/library/lib64
Library path: /path/to/my/cudnn/library/lib64/libcudnn.so
No matter where I put my custom library path, dlopen was always able to find my custom cuDNN library. Thus, that hypothesis was incorrect.
My next hypothesis was that there was something wrong with fish that "hid" environment variables from my program above. Thus, I retrieved the environment variables in C by calling getenv:
printf("LD_LIBRARY_PATH: %s\n", getenv("LD_LIBRARY_PATH"));
When I re-ran the same program, I got:
$ echo $LD_LIBRARY_PATH
/path/to/my/cudnn/library/lib64 /usr/local/cuda-10.0/lib64 /usr/local/cuda/extras/CUPTI/lib64

$ ./dlopen_tryout.out
PATH: /opt/software1/bin:/usr/local/software2/bin:/home/username/.local/bin:/usr/local/cuda-10.0/bin:/opt/software3:/opt/software4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/local/software5/bin:/usr/local/software6/bin://usr/local/software7/bin:/usr/local/software6/bin:/usr/local/software8/bin:/usr/local/software9/bin:/opt/software10/bin:/opt/software11/bin:/snap/software12/bin
LD_LIBRARY_PATH: (null)
Success!
Library path: /usr/lib/x86_64-linux-gnu
Library path: /usr/lib/x86_64-linux-gnu/libcudnn.so
Oh no! It appeared as if the LD_LIBRARY_PATH variable was empty!
After reading the fish man page exhaustively, I found that I had missed the -x option of the set command. This option causes the specified shell variable to be exported to child processes. In my previous attempts, the child processes didn't get the variable because -x wasn't specified. Correcting this problem is as simple as adding an x to the previous set command:
set -gx LD_LIBRARY_PATH /path/to/my/cudnn/library/lib64 /usr/local/cuda-10.0/lib64 /usr/local/cuda/extras/CUPTI/lib64
When I ran the C program above, I got:
$ ./dlopen_tryout.out
PATH: /opt/software1/bin:/usr/local/software2/bin:/home/username/.local/bin:/usr/local/cuda-10.0/bin:/opt/software3:/opt/software4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/local/software5/bin:/usr/local/software6/bin://usr/local/software7/bin:/usr/local/software6/bin:/usr/local/software8/bin:/usr/local/software9/bin:/opt/software10/bin:/opt/software11/bin:/snap/software12/bin
LD_LIBRARY_PATH: /path/to/my/cudnn/library/lib64/usr/local/cuda-10.0/lib64/usr/local/cuda/extras/CUPTI/lib64
Success!
Library path: /usr/lib/x86_64-linux-gnu
Library path: /usr/lib/x86_64-linux-gnu/libcudnn.so
There was another problem: the paths in LD_LIBRARY_PATH were not separated at all!
And now came my final hypothesis: maybe fish still treats LD_LIBRARY_PATH as a bash-style colon-separated string! So this time, I tried to assign a bash-style colon-separated string to LD_LIBRARY_PATH:
set -gx LD_LIBRARY_PATH /path/to/my/cudnn/library/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
And it was a success: both the C and Python programs were now using my custom library!
$ ./dlopen_tryout.out
PATH: /opt/software1/bin:/usr/local/software2/bin:/home/username/.local/bin:/usr/local/cuda-10.0/bin:/opt/software3:/opt/software4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/local/software5/bin:/usr/local/software6/bin://usr/local/software7/bin:/usr/local/software6/bin:/usr/local/software8/bin:/usr/local/software9/bin:/opt/software10/bin:/opt/software11/bin:/snap/software12/bin
LD_LIBRARY_PATH: /path/to/my/cudnn/library/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
Success!
Library path: /path/to/my/cudnn/library/lib64
Library path: /path/to/my/cudnn/library/lib64/libcudnn.so

$ python 16_keras_2.py
9. Solution 1: Use Environment Variables
There are multiple ways to set our environment variables.
fish Shell
In the fish shell, if your LD_LIBRARY_PATH is not empty and you want to retain the existing entries, run:
set -gx LD_LIBRARY_PATH $LD_LIBRARY_PATH:/path/to/my/cudnn/library/lib64
Otherwise, use:
set -gx LD_LIBRARY_PATH /path/to/my/cudnn/library/lib64
to purge previous LD_LIBRARY_PATH entries and/or to avoid the leading colon.
Again, make sure to colon-separate the paths in fish (just like in bash).
You may consider adding the line to your fish config file ~/.config/fish/config.fish to automatically run it in all fish sessions. To run it only in interactive sessions, check the exit status of the command status --is-interactive (it returns 0 if fish is interactive).
bash Shell
In the bash shell, if your LD_LIBRARY_PATH is not empty and you want to retain the existing entries, run:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/path/to/my/cudnn/library/lib64
Otherwise, use:
export LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64
to purge previous LD_LIBRARY_PATH entries and/or to avoid the leading colon.
You may consider adding the line to your bash config file ~/.bashrc to automatically run it in interactive sessions.
env Command
The env command sets the environment variables for the current execution only and does not contaminate the environment for other programs.
env LD_LIBRARY_PATH=/path/to/my/cudnn/library/lib64 python test.py
Python’s os.environ Object
Python’s os.environ object may be used to modify or query the environment. Run the following in Python before any call to the cuDNN library:
os.environ["LD_LIBRARY_PATH"] = "/path/to/my/cudnn/library/lib64"
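If you would rather keep whatever the shell already put in LD_LIBRARY_PATH, a prepend along these lines works too (the directory is, of course, specific to my setup):
import os

custom = "/path/to/my/cudnn/library/lib64"
existing = os.environ.get("LD_LIBRARY_PATH", "")
# Prepend the custom directory so it is searched first, keeping any existing entries
os.environ["LD_LIBRARY_PATH"] = custom + (":" + existing if existing else "")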
10. Solution 2: Use tensorflow.load_library
An alternative solution is to tell Tensorflow explicitly which cuDNN library to use before it loads one automatically. As mentioned in section 5, Tensorflow won't load a library again if it's already loaded. So we can just load our custom cuDNN library before Tensorflow gets the chance to load the system one.
Add the following code to our Python script:
import tensorflow as tf

tf.load_library("/path/to/my/cudnn/library/lib64/libcudnn.so")
Note that this solution is not portable: if the same file doesn't exist on another computer, this function will raise an OSError.
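One way to keep the script usable on machines without the custom library is to guard the call (a sketch; the path is specific to my setup):
import os
import tensorflow as tf

custom_cudnn = "/path/to/my/cudnn/library/lib64/libcudnn.so"
if os.path.exists(custom_cudnn):
    try:
        tf.load_library(custom_cudnn)  # pre-load our cuDNN before Tensorflow needs it
    except OSError as err:
        print(f"Falling back to the system cuDNN: {err}")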
11. Solution 3: Use Docker
Docker may be the easiest way to enable Tensorflow GPU support on Linux since only the NVIDIA GPU driver is required on the host machine (the NVIDIA CUDA Toolkit is not required).
You may use the -v argument to map directories from the host's file system into the container.
For more information, please refer to the official tutorial: https://www.tensorflow.org/install/docker