DDP and gradient sync
When we have enough resources we want to train our neural networks in parallel. The way to do this is to feed each GPU a different batch of data and train on all of them at the same time. For instance, if we have 8x A100s, each of the 8 GPUs processes its own batch in parallel. In PyTorch the standard way to do this is DDP (DistributedDataParallel, take a look at their docs)...
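To make this concrete, here is a minimal sketch of what a single-node DDP training loop can look like, assuming we launch it with `torchrun --standalone --nproc_per_node=8 train.py` (one process per GPU); the model, data, and hyperparameters below are placeholders, not anything from the original text:

```python
# Minimal single-node DDP sketch. torchrun sets RANK, LOCAL_RANK, WORLD_SIZE
# for each process, one process per GPU.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(10):
        # Each process should draw a *different* batch; with a real dataset a
        # DistributedSampler handles this. Random tensors stand in for data here.
        x = torch.randn(16, 1024, device=local_rank)
        y = torch.randn(16, 1024, device=local_rank)

        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()  # DDP all-reduces (averages) gradients across GPUs here
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The key point (and where the "gradient sync" in the title comes in) is that DDP hooks into `backward()`: gradients are all-reduced and averaged across the ranks, overlapped with the backward pass itself, so after the optimizer step every GPU holds identical weights even though each one saw a different batch.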