Apache MXNet - Distributed Training
This chapter is about distributed training in Apache MXNet. Let us start by understanding the modes of computation in MXNet.
Modes of Computation
MXNet, a multi-language ML library, offers its users the following two modes of computation −
Imperative mode
This mode of computation exposes an interface like the NumPy API. For example, in MXNet, use the following imperative code to construct a tensor of zeros on both the CPU and the GPU −
import mxnet as mx
tensor_cpu = mx.nd.zeros((100,), ctx=mx.cpu())
tensor_gpu = mx.nd.zeros((100,), ctx=mx.gpu(0))
As we see in the above code, MXNet specifies the location where to hold the tensor, either on the CPU or on a GPU device. In the above example, the GPU tensor is placed on the device at index 0. MXNet achieves remarkable utilisation of the devices because all the computations happen lazily instead of instantaneously.
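A minimal sketch of this lazy (asynchronous) execution on the CPU − the operations return immediately and only block when the result is actually needed −

import mxnet as mx

a = mx.nd.ones((1000, 1000), ctx=mx.cpu())
b = mx.nd.dot(a, a)     # enqueued to the backend engine, returns immediately

# A synchronization point (waiting explicitly, printing, or copying to NumPy)
# forces the pending computation to complete.
mx.nd.waitall()
print(b[0, 0].asscalar())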
Symbolic mode
Although the imperative mode is quite useful, one of its drawbacks is rigidity, i.e. all the computations need to be known beforehand along with pre-defined data structures.
On the other hand, symbolic mode exposes a computation graph like TensorFlow. It removes the drawback of the imperative API by allowing MXNet to work with symbols or variables instead of fixed/pre-defined data structures. Afterwards, the symbols can be interpreted as a set of operations as follows −
import mxnet as mx
x = mx.sym.Variable('X')
y = mx.sym.Variable('Y')
z = (x + y)
m = z / 100
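A short, hedged continuation showing one way to actually execute such a graph − the symbols above are only placeholders until concrete NDArray values are bound to them, for example with Symbol.eval −

import mxnet as mx

x = mx.sym.Variable('X')
y = mx.sym.Variable('Y')
m = (x + y) / 100

# Bind concrete values to the symbolic placeholders and run the graph.
# The keyword names must match the Variable names declared above.
outputs = m.eval(ctx=mx.cpu(),
                 X=mx.nd.ones((100,)),
                 Y=mx.nd.ones((100,)) * 99)
print(outputs[0][0:5])   # each element is (1 + 99) / 100 = 1.0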
Kinds of Parallelism
Apache MXNet supports distributed training. It enables us to leverage multiple machines for faster as well as more effective training.
Following are the two ways in which we can distribute the workload of training a NN across multiple devices, CPU or GPU −
Data Parallelism
In this kind of parallelism, each device stores a complete copy of the model and works with a different part of the dataset. The devices also update a shared model collectively. We can locate all the devices on a single machine or spread them across multiple machines.
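As a minimal single-machine sketch of this idea (the two-GPU context list is an assumption; on a CPU-only machine you would use [mx.cpu()]), each device holds a full copy of the parameters and receives its own slice of the batch −

import mxnet as mx
from mxnet import gluon, nd

ctx_list = [mx.gpu(0), mx.gpu(1)]   # assumed devices; adjust to your hardware

# Every device gets a complete copy of the model's parameters.
net = gluon.nn.Dense(10, in_units=20)
net.initialize(ctx=ctx_list)

# The batch is split across the devices; each device works on its own slice.
batch = nd.random.uniform(shape=(64, 20))
parts = gluon.utils.split_and_load(batch, ctx_list)
outputs = [net(part) for part in parts]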
Model Parallelism
It is another kind of parallelism, which comes in handy when models are so large that they do not fit into device memory. In model parallelism, different devices are assigned the task of learning different parts of the model. The important point to note here is that currently Apache MXNet supports model parallelism on a single machine only.
Working of distributed training
The concepts given below are key to understanding the working of distributed training in Apache MXNet −
Types of processes
Processes communicate with each other to accomplish the training of a model. Apache MXNet has the following three processes −
Worker
The job of a worker node is to perform training on a batch of training samples. The worker nodes will pull weights from the server before processing every batch and will send gradients to the server once the batch is processed.
Server
MXNet can have multiple servers for storing the model’s parameters and communicating with the worker nodes.
Scheduler
The role of the scheduler is to set up the cluster, which includes waiting for the message that each node has come up and for the port each node is listening to. After setting up the cluster, the scheduler lets all the processes know about every other node in the cluster, so that the processes can communicate with each other. There is only one scheduler.
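For reference, when MXNet’s launch tool starts these processes, each one learns its role from environment variables such as DMLC_ROLE. The following is only a hedged sketch of what a manually configured process might look like − the addresses and counts are placeholders, and creating the store blocks until the whole cluster is up −

import os
import mxnet as mx

# Placeholder values; in practice the launch tool sets these for every process.
os.environ.update({
    'DMLC_ROLE': 'worker',           # one of 'worker', 'server', 'scheduler'
    'DMLC_PS_ROOT_URI': '10.0.0.1',  # IP of the scheduler
    'DMLC_PS_ROOT_PORT': '9000',     # port the scheduler listens on
    'DMLC_NUM_SERVER': '1',
    'DMLC_NUM_WORKER': '2',
})

# Creating a 'dist_*' KVStore makes this process join the cluster in its role.
kv = mx.kv.create('dist_sync')
print(kv.rank, kv.num_workers)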
KV Store
KVStore stands for Key-Value store. It is a critical component used for multi-device training. It is important because the communication of parameters across devices, on a single machine as well as across multiple machines, goes through one or more servers with a KVStore for the parameters. Let’s understand the working of KVStore with the help of the following points −
Each entry in a KVStore is represented by a key and a value.
Each parameter array in the network is assigned a key, and the weights of that parameter array are referred to by the value.
After that, the worker nodes push gradients after processing a batch. They also pull updated weights before processing a new batch.
The notion of a KVStore server exists only during distributed training, and its distributed mode is enabled by calling the mxnet.kvstore.create function with a string argument containing the word dist −
kv = mxnet.kvstore.create('dist_sync')
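A small single-machine illustration of the push/pull pattern described in the points above, using a 'local' KVStore so it runs without a cluster (the key 3 and the shape are arbitrary) −

import mxnet as mx

kv = mx.kv.create('local')   # same push/pull API as the distributed store
shape = (2, 3)

# A parameter array is registered under a key together with its initial weights.
kv.init(3, mx.nd.ones(shape))

# Workers push gradients for that key after processing a batch ...
kv.push(3, mx.nd.ones(shape) * 2)

# ... and pull the stored value back before processing the next batch.
out = mx.nd.zeros(shape)
kv.pull(3, out=out)
print(out)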
Distribution of Keys
It is not necessary that all the servers store all the parameter arrays or keys; rather, they are distributed across different servers. Such distribution of keys across different servers is handled transparently by the KVStore, and the decision of which server stores a specific key is made at random.
KVStore, as discussed above, ensures that whenever a key is pulled, the request is sent to the server that has the corresponding value. What if the value of some key is large? In that case, it may be sharded across different servers.
Split training data
As users, we want each machine to work on different parts of the dataset, especially when running distributed training in data parallel mode. We know that, to split a batch of samples provided by the data iterator for data parallel training on a single worker, we can use mxnet.gluon.utils.split_and_load and then load each part of the batch on the device that will process it further.
On the other hand, in case of distributed training, we first need to divide the dataset into n different parts so that every worker gets a different part. Each worker can then use split_and_load to again divide its part of the dataset across the different devices on a single machine. All this happens through the data iterator. mxnet.io.MNISTIter and mxnet.io.ImageRecordIter are two such iterators in MXNet that support this feature.
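A hedged sketch of both steps in a data-parallel job − the MNIST file paths and the two-GPU context list are placeholders, and num_parts/part_index give each worker its own shard of the data −

import mxnet as mx
from mxnet import gluon

kv = mx.kv.create('dist_sync')      # provides this worker's rank and the worker count

# Each worker reads only its own 1/num_workers share of the dataset.
train_iter = mx.io.MNISTIter(
    image='data/train-images-idx3-ubyte',   # placeholder paths
    label='data/train-labels-idx1-ubyte',
    batch_size=64,
    num_parts=kv.num_workers,
    part_index=kv.rank)

ctx_list = [mx.gpu(0), mx.gpu(1)]   # assumed devices on this worker

for batch in train_iter:
    # Split this worker's batch across its local devices.
    data = gluon.utils.split_and_load(batch.data[0], ctx_list)
    label = gluon.utils.split_and_load(batch.label[0], ctx_list)
    # ... run forward/backward on each slice here ...
    break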
Weights updating
For updating the weights, KVStore supports the following two modes −
In the first method, the server aggregates the gradients and updates the weights using those gradients.
In the second method, the server only aggregates the gradients, and the workers update the weights locally.
If you are using Gluon, there is an option to choose between the methods stated above by passing the update_on_kvstore variable. Let’s understand it by creating the trainer object as follows −
trainer = gluon.Trainer(net.collect_params(), optimizer='sgd',
                        optimizer_params={'learning_rate': opt.lr,
                                          'wd': opt.wd,
                                          'momentum': opt.momentum,
                                          'multi_precision': True},
                        kvstore=kv, update_on_kvstore=True)
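A hedged sketch of how such a trainer is then used in the training loop (net, loss_fn, data, label and batch_size are assumed to come from the surrounding training script) −

from mxnet import autograd

# Assumed to exist in the training script: net, loss_fn, data, label, batch_size.
with autograd.record():
    output = net(data)
    loss = loss_fn(output, label)
loss.backward()

# With update_on_kvstore=True, step() pushes the gradients to the servers,
# which update the weights; the workers then pull the fresh parameters.
trainer.step(batch_size)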
Modes of Distributed Training
If the KVStore creation string contains the word dist, it means that distributed training is enabled. Following are the different modes of distributed training that can be enabled by using different types of KVStore −
dist_sync
As the name implies, it denotes synchronous distributed training. In this mode, all the workers use the same synchronized set of model parameters at the start of every batch.
The drawback of this mode is that, after each batch, the server has to wait to receive gradients from every worker before it updates the model parameters. This means that if a worker crashes, it halts the progress of all workers.
dist_async
As the name implies, it denotes asynchronous distributed training. In this mode, the server receives gradients from one worker and immediately updates its store. The server uses the updated store to respond to any further pulls.
The advantage, in comparison to dist_sync mode, is that a worker that finishes processing a batch can pull the current parameters from the server and start the next batch. The worker can do so even if the other workers have not yet finished processing the earlier batch. It is also faster than dist_sync mode because there is no cost of synchronization, although it can take more epochs to converge.
dist_sync_device
This mode is the same as dist_sync mode. The only difference is that, when there are multiple GPUs being used on every node, dist_sync_device aggregates gradients and updates weights on the GPU, whereas dist_sync aggregates gradients and updates weights in CPU memory.
It reduces expensive communication between GPU and CPU, which is why it is faster than dist_sync. The drawback is that it increases the memory usage on the GPU.
dist_async_device
This mode works the same as dist_sync_device mode, but in an asynchronous manner.