Fix training issues with TensorFlow

Ignacio Heredia Cachá requested to merge ignacio-br0 into master

Description

This is a PR to fix issues in the training process in DEEPaaS.

The current DEEPaaS runs training in a separate child process so that it can be cancelled. This process is created using the multiprocessing module. CUDA and multiprocessing are known not to work well together out of the box [1]. In addition, in the case of TensorFlow this approach causes problems even when running on CPUs only [2].

The proposed fix changes the process start method from fork (the default on Linux) to spawn [3].

[1] https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing
[2] https://github.com/tensorflow/tensorflow/issues/5448#issuecomment-258934405
[3] https://docs.python.org/3.6/library/multiprocessing.html#contexts-and-start-methods
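
For illustration, below is a minimal sketch of the spawn-based approach, not DEEPaaS's actual code: the `train` function and its arguments are placeholders standing in for the real training entry point. With a spawn context the child starts a fresh interpreter instead of inheriting the parent's (possibly CUDA/TensorFlow-initialized) state via fork, and it can still be cancelled by terminating the process.

```python
import multiprocessing
import time


def train(steps):
    """Illustrative stand-in for a model training loop."""
    for step in range(steps):
        time.sleep(1)  # placeholder for one training step
        print(f"finished step {step}")


if __name__ == "__main__":
    # Use a "spawn" start method rather than the Linux default "fork".
    ctx = multiprocessing.get_context("spawn")
    proc = ctx.Process(target=train, args=(100,))
    proc.start()

    # Training can be cancelled at any time by terminating the child.
    time.sleep(5)
    proc.terminate()
    proc.join()
```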

Type of change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Tests of the training function (and its cancellation) have been performed using TensorFlow, both on CPU and GPU.
