Fix training issues with TensorFlow

Ignacio Heredia Cachá requested to merge ignacio-br0 into master

Description

This is a PR to fix issues in the training process in DEEPaaS.

The current DEEPaaS runs training in a separate child process so that it can be cancelled. This process is created using the multiprocessing module. CUDA and multiprocessing are known not to work well together out of the box [1]. In addition, in the case of TensorFlow this approach causes problems even when running on CPUs only [2].

The proposed fix changes the process start method from fork (the default on Linux) to spawn [3].

[1] https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing
[2] https://github.com/tensorflow/tensorflow/issues/5448#issuecomment-258934405
[3] https://docs.python.org/3.6/library/multiprocessing.html#contexts-and-start-methods
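
For illustration, below is a minimal sketch of the spawn-based approach, not DEEPaaS's actual code: the `train` function and its arguments are placeholders standing in for the real training entry point. With a spawn context the child starts a fresh interpreter instead of inheriting the parent's (possibly CUDA/TensorFlow-initialized) state via fork, and it can still be cancelled by terminating the process.

```python
import multiprocessing
import time


def train(steps):
    """Illustrative stand-in for a model training loop."""
    for step in range(steps):
        time.sleep(1)  # placeholder for one training step
        print(f"finished step {step}")


if __name__ == "__main__":
    # Use a "spawn" start method rather than the Linux default "fork".
    ctx = multiprocessing.get_context("spawn")
    proc = ctx.Process(target=train, args=(100,))
    proc.start()

    # Training can be cancelled at any time by terminating the child.
    time.sleep(5)
    proc.terminate()
    proc.join()
```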

Type of change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Tests of the training function (and its cancellation) have been performed using TensorFlow, both on CPU and GPU.
