Multi-GPU training

Different approaches to achieving parallelism when training deep neural networks:

  1. Split layer-wise (model/pipeline parallelism) – introduces a bottleneck, because the data must still flow through the layers in sequence
  2. Split neuron-wise (tensor parallelism) – the split often depends on how well the underlying matrix operations can be decomposed
  3. Split training data-wise (data parallelism) – the later merge of the gradient updates may degrade the training results
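As a minimal sketch of option 3, here is a simulated data-parallel SGD step in NumPy: each "device" computes the gradient on its own shard of the batch, and the gradients are then averaged (an all-reduce) before the shared weights are updated. The function names and the least-squares objective are illustrative assumptions, not from the post.

```python
import numpy as np

def local_gradient(w, x, y):
    """Gradient of the mean squared error 0.5 * ||x @ w - y||^2 / n w.r.t. w."""
    n = x.shape[0]
    return x.T @ (x @ w - y) / n

def data_parallel_step(w, x, y, num_devices=4, lr=0.1):
    # Split the batch across the (simulated) devices.
    x_shards = np.array_split(x, num_devices)
    y_shards = np.array_split(y, num_devices)
    # Each device computes a gradient on its local shard only.
    grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # All-reduce: average the per-device gradients, then update the weights.
    g = np.mean(grads, axis=0)
    return w - lr * g
```

With equally sized shards this average is exactly the full-batch gradient; the merge problems mentioned above show up once the updates become asynchronous or stale, which this synchronous sketch does not model.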

Any better ideas?
