System Design — Large Scale Distributed Deep Networks

Ryan
Aug 2, 2021

This is a reading note.


Introduction

This post is a reading note of the paper Large Scale Distributed Deep Networks. We will present the general idea of how to train a large deep network with billions of parameters.

Setup

There are not many strategies for scaling a system. The most important techniques are partitioning and replication, which are special forms of the more general divide-and-conquer strategy. Training large deep neural networks is no exception; the challenge is how to do it efficiently and correctly.

There are two types of data in a neural network:

  • Training data
  • Parameters

Training data is more static and persistent. It’s similar to user data in a web application. We would expect that it’s relatively easy to partition training data.

Parameters are the state of the neural network, and they are shared among the “application servers”. Billions of parameters cannot fit on a single machine, so the parameters are stored across multiple servers.
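To make this concrete, here is a minimal sketch of sharding parameters across a fixed set of servers by key. The names (ParameterShard, shard_for, NUM_SHARDS) and the plain SGD update are illustrative assumptions, not the paper’s API:

```python
import zlib
import numpy as np

NUM_SHARDS = 4  # assumed number of parameter-server processes

class ParameterShard:
    """Holds one disjoint partition of the model's parameters."""
    def __init__(self):
        self.params = {}  # parameter name -> np.ndarray

    def apply_gradient(self, name, grad, lr=0.01):
        # Plain SGD update on a locally stored parameter.
        self.params[name] -= lr * grad

def shard_for(name):
    # Deterministically map a parameter name to the shard that owns it.
    return zlib.crc32(name.encode()) % NUM_SHARDS

shards = [ParameterShard() for _ in range(NUM_SHARDS)]

# Register parameters: each one lives on exactly one shard.
for name, shape in [("w1", (784, 256)), ("b1", (256,)), ("w2", (256, 10))]:
    shards[shard_for(name)].params[name] = np.zeros(shape)

# A worker that needs "w1" only talks to the shard that owns "w1".
shards[shard_for("w1")].apply_gradient("w1", np.random.randn(784, 256))
```

The point is that no single machine holds all the parameters, yet any worker can locate the owner of a given parameter without extra coordination.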

In the context of large deep networks, the “application servers” are the machines that perform the training. Because the neural network is too large to fit on a single machine, the training is carried out by a group of servers. Each server holds a subset of the parameters, and the servers need to communicate with each other during training to exchange parameter updates.
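The sketch below illustrates this kind of model parallelism in its simplest form: one network is split into a lower and an upper half owned by two different workers, and the activations are what cross the machine boundary. The two-way split and the Worker class are assumptions for illustration; the paper partitions the network graph across many more machines:

```python
import numpy as np

class Worker:
    """Owns a contiguous slice of the network's layers and their parameters."""
    def __init__(self, layer_sizes):
        self.weights = [np.random.randn(n_in, n_out) * 0.01
                        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

    def forward(self, x):
        for w in self.weights:
            x = np.maximum(x @ w, 0.0)  # ReLU layers
        return x

# Worker A holds the lower layers, worker B holds the upper layers.
worker_a = Worker([784, 512, 256])
worker_b = Worker([256, 128, 10])

batch = np.random.randn(32, 784)
hidden = worker_a.forward(batch)   # computed on machine A
logits = worker_b.forward(hidden)  # only 'hidden' crosses the network boundary
```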

Recall that training a neural network is an iterative process: we go through the training data multiple times so that the model can converge. This provides an opportunity to apply the replication strategy: the model at two different epochs can be considered as two different model instances. This is the “Model Replicas” concept mentioned in the paper.
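A minimal sketch of this asynchronous behaviour, in the spirit of the paper’s Downpour SGD: several replicas repeatedly pull the current parameters, compute a gradient on their own data shard, and push the update back without any locking. The threading setup, the least-squares loss, and the learning rate are assumptions for illustration:

```python
import threading
import numpy as np

class ParameterServer:
    """Central parameter store; replicas read and write without coordinating."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def pull(self):
        return self.w.copy()

    def push(self, grad, lr=0.05):
        self.w -= lr * grad  # gradients are applied as they arrive, possibly stale

def replica(ps, data_shard, steps=200):
    x, y = data_shard
    for _ in range(steps):
        w = ps.pull()                           # fetch (possibly stale) parameters
        grad = 2 * x.T @ (x @ w - y) / len(y)   # gradient of a least-squares loss
        ps.push(grad)                           # send the update asynchronously

rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
ps = ParameterServer(dim=5)

# Two model replicas train concurrently, each on its own partition of the data.
threads = []
for _ in range(2):
    x = rng.normal(size=(64, 5))
    threads.append(threading.Thread(target=replica, args=(ps, (x, x @ true_w))))
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance to true weights:", np.linalg.norm(ps.w - true_w))
```

Even though replicas may push stale gradients and overwrite each other, the shared parameters still approach the true weights on this toy problem, which mirrors the paper’s empirical observation that asynchronous updates work well in practice.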

Summary & Comment

  • The general idea is fairly “standard” for scaling a large system: the data-partitioning technique is applied to both the training data and the parameters.
  • Because the whole model cannot fit on a single machine, training is done by a group of servers and coordination is required.
  • The neural network itself is a dependency graph.
  • The paper was written in 2012, and at the time there was little theoretical support for why having model replicas update the parameters concurrently works.
