Different approaches to achieving parallelism when training deep neural networks:
- Split layer-wise – introduces a bottleneck, because the data must still flow sequentially through the layers
- Split neuron-wise – the split usually depends on whether the underlying matrix operations can be decomposed
- Split training data-wise – merging the gradient updates afterwards may wash out the training results (see the sketch after this list).
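To make the data-wise split concrete, here is a minimal sketch, assuming a toy least-squares model: each "worker" computes a gradient on its own shard of the training data, the gradients are averaged (the "merge" mentioned above), and a single synchronized update is applied. The names `loss_grad` and `data_parallel_step` are purely illustrative and not taken from any framework.

```python
import numpy as np

def loss_grad(w, X, y):
    # Gradient of a simple least-squares loss 0.5 * ||Xw - y||^2 / n
    n = X.shape[0]
    return X.T @ (X @ w - y) / n

def data_parallel_step(w, shards, lr=0.1):
    # Each (X_i, y_i) shard plays the role of one worker's local data.
    grads = [loss_grad(w, X_i, y_i) for X_i, y_i in shards]
    avg_grad = np.mean(grads, axis=0)   # merge: average the per-worker gradients
    return w - lr * avg_grad            # single synchronized parameter update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    true_w = rng.normal(size=5)
    y = X @ true_w
    # Split the training data across 4 equally sized "workers"
    shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    w = np.zeros(5)
    for _ in range(200):
        w = data_parallel_step(w, shards)
    print("recovered weights close to truth:", np.allclose(w, true_w, atol=1e-2))
```

With equally sized shards and synchronous averaging this is mathematically equivalent to full-batch gradient descent, so nothing is lost; the concern in the list applies when workers update asynchronously or average infrequently, where stale or conflicting gradients can degrade the result.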
Any better ideas?