import numpy as np
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model

# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted in CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicate the model on G GPUs.
# This assumes that your machine has at least G available GPUs.
parallel_model = multi_gpu_model(model, gpus=G)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

# This `fit` call will be distributed across the G GPUs.
# Each GPU processes batch_size / G samples per batch
# (e.g., 32 samples each for a batch size of 256 on 8 GPUs).
parallel_model.fit(x, y, epochs=20, batch_size=batch_size)

# Save the model via the template model (which shares the same weights):
model.save('my_model.h5')

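To make the batch-splitting arithmetic in the comments above concrete, here is a minimal sketch in plain Python. The helper `per_gpu_batch_sizes` is hypothetical and is not part of Keras; it only illustrates an even split of one batch across GPUs, and the way it spreads a remainder is an assumption rather than a description of what `multi_gpu_model` does internally.

```python
def per_gpu_batch_sizes(batch_size, num_gpus):
    """Return how many samples each GPU would receive for one batch.

    Hypothetical helper for illustration: the batch is divided as evenly
    as possible, and any remainder is spread over the first few GPUs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    base, remainder = divmod(batch_size, num_gpus)
    return [base + 1 if i < remainder else base for i in range(num_gpus)]


# The case from the comments: a batch of 256 samples on 8 GPUs.
sizes_even = per_gpu_batch_sizes(256, 8)
print(sizes_even)  # [32, 32, 32, 32, 32, 32, 32, 32]

# A batch size that does not divide evenly (assumed handling).
sizes_uneven = per_gpu_batch_sizes(100, 8)
print(sizes_uneven)  # [13, 13, 13, 13, 12, 12, 12, 12]
```

Picking a batch size that is a multiple of the GPU count keeps every sub-batch the same size, which is why 256 on 8 GPUs works out to exactly 32 samples each.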
Using a single GPU, we obtained 63-second epochs with a total training time of 74m10s. With multi-GPU training using Keras and Python, epochs dropped to 16 seconds and total training time to 19m3s, a speedup of roughly 4x.
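As a quick check on the timings reported above, the short sketch below computes the speedup from both the total training times and the per-epoch times. The helper `to_seconds` is a hypothetical convenience function; the numbers are the ones quoted in the text.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair into total seconds."""
    return minutes * 60 + seconds


single_gpu_total = to_seconds(74, 10)  # 74m10s on a single GPU
multi_gpu_total = to_seconds(19, 3)    # 19m3s with multi-GPU training

total_speedup = single_gpu_total / multi_gpu_total
epoch_speedup = 63 / 16  # seconds per epoch, single GPU vs. multi-GPU

print(f"Total training time speedup: {total_speedup:.2f}x")  # 3.89x
print(f"Per-epoch speedup: {epoch_speedup:.2f}x")            # 3.94x
```

Both ratios land just under 4, so "roughly 4x" is a fair summary of the measured improvement.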