
unify train and predict pools

Ignacio Heredia Cachá requested to merge ignacio-br0 into master

This fixes GPU out-of-memory problems that occurred when we had two separate pools (one for predict, one for train). When running train then predict sequentially (or vice versa), each pool tried to claim the whole GPU, causing out-of-memory errors. This does not fix out-of-memory errors when running parallel tasks on the GPU (those errors also happened before this change).

CPU deployments shouldn't be affected.
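For context, here is a minimal sketch of the idea: a single pool shared by train and predict, so only one worker ever holds the GPU context at a time. This is an illustrative example, not the actual code in this MR; the function names (`run_train`, `run_predict`) and the `concurrent.futures` setup are assumptions.

```python
# Sketch: one shared pool for both train and predict, so a single worker
# process owns the GPU context at a time. Names are illustrative only.
from concurrent.futures import ProcessPoolExecutor

# Before: two separate pools, each worker grabbing the whole GPU.
# train_pool = ProcessPoolExecutor(max_workers=1)
# predict_pool = ProcessPoolExecutor(max_workers=1)

# After: a single pool shared by train and predict.
pool = ProcessPoolExecutor(max_workers=1)


def run_train(args):
    # placeholder for the package's actual training entry point
    return {"status": "trained", "args": args}


def run_predict(args):
    # placeholder for the package's actual prediction entry point
    return {"status": "predicted", "args": args}


def train(args):
    return pool.submit(run_train, args)


def predict(args):
    return pool.submit(run_predict, args)


if __name__ == "__main__":
    # Sequential train -> predict now reuses the same worker
    # (and therefore the same GPU context).
    print(train({"epochs": 1}).result())
    print(predict({"image": "cat.jpg"}).result())
```

Since both task types go through the same worker, sequential train/predict no longer allocates the GPU twice; parallel tasks would still contend for GPU memory, which matches the test results below.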

This has been tested with the image classification package on TF 1.14 with a GPU (GeForce GTX 1080). Summary of results:

  • predict then train: OK
  • train then predict: OK
  • train then train: OK
  • predict then predict: OK
  • predict in parallel (2 workers): Out of memory.
  • train in parallel (2 workers): Out of memory.

Additional tests on CPU:

  • warm: OK
  • predict in parallel (2 workers): OK
