Using Multiple GPUs
This page describes how to use Python's multiprocessing module to drive multiple GPUs by spawning one child process per GPU from a common parent. The example uses two GPUs, each fitting its own logistic regression model, and demonstrates how to pack up and communicate both common and private arguments from the parent to the children.
The key is to initialize Theano with the CPU as the target device, and then, inside each child process, override the assigned device with a specific GPU. According to a discussion on theano-users, this re-binding of devices can happen only once per process, so this approach would (probably) not work with threads, nor can it be used to switch GPUs within a process.
Here is a generic .theanorc file, similar to the file I use:
[global]
floatX = float32
device = cpu
openmp = True
base_compiledir = /path/to/base/dir
[nvcc]
fastmath = True
[blas]
ldflags = -L/path/to/blas/libs -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lm
[cuda]
root = /path/to/cuda/
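With this file in place, a plain import of Theano in any process should come up on the CPU, leaving the GPU choice to each child. A quick, optional sanity check (theano.config.device and theano.config.floatX are standard config attributes; the expected values assume the .theanorc above is the one being read):

import theano
print(theano.config.device)   # expected: cpu
print(theano.config.floatX)   # expected: float32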
At a high level, the procedure looks like this (a condensed sketch follows the list):
- set up your arguments in the parent process.
- launch each of your sub-processes.
- inside each sub-process, immediately import theano.sandbox.cuda and bind that process to a specific GPU, before importing the rest of theano.
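Stripped to its skeleton, with the argument contents and device names reduced to placeholders, that procedure looks roughly like the sketch below; the real, complete script appears at the end of this page.

from multiprocessing import Process, Manager

def f(shared_args, private_args):
    # Step 3, inside the child: bind the GPU first, then import theano itself.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])
    import theano
    # ... build and fit the model ...

if __name__ == '__main__':
    manager = Manager()
    shared_args = manager.list([{'n_steps': 1000}])    # step 1: arguments common to every child
    private_args = [{'gpu': 'gpu0'}, {'gpu': 'gpu1'}]  # step 1: one private dict per GPU
    procs = [Process(target=f, args=(shared_args, pa)) for pa in private_args]  # step 2
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()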
Looking more closely at the child side: in the example below, f is the function carried out by each worker sub-process, and the dictionary private_args, constructed in the parent process, holds the name of the GPU device that worker should use.
def f(shared_args, private_args):
    # At this point, no theano import statements have been processed, so the device is still unbound.
    # Import sandbox.cuda to bind the specified GPU to this subprocess,
    # then import the remaining theano and model modules.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams
    ...
That's it. After binding each sub-process to its GPU, proceed as normal. The example script below uses two GPUs to independently construct and fit two logistic regression models. The script does not address communication between sub-processes, nor does it perform any post-processing on each sub-process's output. If you do need inter-process communication, a Manager (declared in the parent process) provides a safe and easy way to do it.
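For instance, here is a minimal sketch of collecting results through a Manager (the worker name, the results dict, and the 'status' entry are illustrative, not part of the script below): pass each child a managed dict along with its private arguments and have it write its output back under its own GPU name.

from multiprocessing import Process, Manager

def worker(results, private_args):
    # Bind the GPU as usual, do the work, then report back through the managed dict.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])
    import theano
    # ... fit a model here; below we just record that this child finished ...
    results[private_args['gpu']] = {'status': 'done'}

if __name__ == '__main__':
    manager = Manager()
    results = manager.dict()
    procs = [Process(target=worker, args=(results, {'gpu': dev})) for dev in ('gpu0', 'gpu1')]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    print(dict(results))   # each child's entry is now visible in the parent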
Here is the script in full:
""" Test script that uses two GPUs, one per sub-process,
via the Python multiprocessing module. Each GPU fits a logistic regression model. """
# These imports will not trigger any theano GPU binding
from multiprocessing import Process, Manager
import numpy as np
import os
def f(shared_args, private_args):
    """ Build and fit a logistic regression model. Adapted from
    http://deeplearning.net/software/theano/tutorial/examples.html#a-real-example-logistic-regression
    """
    # Import sandbox.cuda to bind the specified GPU to this subprocess,
    # then import the remaining theano and model modules.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams

    rng = np.random

    # Pull the problem sizes from the shared arguments and generate a random dataset
    shared_args_dict = shared_args[0]
    N = shared_args_dict['N']
    feats = shared_args_dict['n_features']
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = shared_args_dict['n_steps']

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))            # Probability that target = 1
    prediction = p_1 > 0.5                             # The prediction thresholded
    xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy loss function
    cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to minimize
    gw, gb = T.grad(cost, [w, b])                      # Gradient of the cost w.r.t. w and b

    # Compile. allow_input_downcast reassures the compiler that we are ok using
    # 64 bit floating point numbers on the cpu, but only 32 bit floats on the gpu.
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)),
        allow_input_downcast=True)
    predict = theano.function(inputs=[x], outputs=prediction,
                              allow_input_downcast=True)

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])
if __name__ == '__main__':
    # Construct a dict to hold arguments that can be shared by both processes.
    # The Manager class is a convenient way to implement this.
    # See: http://docs.python.org/2/library/multiprocessing.html#managers
    #
    # Important: managers store information in mutable *proxy* data structures,
    # but any mutation of those proxy vars must be explicitly written back to the manager.
    manager = Manager()
    args = manager.list()
    args.append({})
    shared_args = args[0]
    shared_args['N'] = 400
    shared_args['n_features'] = 784
    shared_args['n_steps'] = 10000
    args[0] = shared_args

    # Construct the specific (private) args for each of the two processes
    p_args = {}
    q_args = {}
    p_args['gpu'] = 'gpu0'
    q_args['gpu'] = 'gpu1'

    # Run both sub-processes
    p = Process(target=f, args=(args, p_args,))
    q = Process(target=f, args=(args, q_args,))
    p.start()
    q.start()
    p.join()
    q.join()
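As a side note: if the sandbox.cuda call is ever inconvenient, another way to bind a device per child that I believe works with this generation of Theano is to set the THEANO_FLAGS environment variable inside the child before the first theano import; THEANO_FLAGS is Theano's standard mechanism for overriding .theanorc values. Treat the sketch below as an untested variant of f, not part of the script above.

import os

def f_env(shared_args, private_args):
    # Assumed alternative: pick the device via THEANO_FLAGS, which overrides the
    # device = cpu line in .theanorc, provided it is set before theano is imported.
    os.environ['THEANO_FLAGS'] = 'device=' + private_args['gpu']
    import theano
    import theano.tensor as T
    # ... build and fit the model exactly as in f above ...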