Regulating Network Traffic with Slave Throttling

Version: Deadline 7.2

Overview

One of the things we know and love about Deadline is that when a job is submitted, render nodes quickly become aware of it and can begin working immediately. A caveat, however, is that they all need access to the job’s auxiliary file(s) (e.g. a Maya or 3ds Max scene file) before they can start rendering. As a result, the network can become congested as render nodes all try to pull down the resources they need. This is especially noticeable for jobs with large auxiliary files.

Enter Slave Throttling. This is a feature which, when enabled, regulates network traffic by imposing a limit on how many Slaves are allowed to copy resources to their local caches at once. In this post, I’ll briefly explain how Slave Throttling works, and how it can be configured in Deadline for your render farm.

How it Works: The Throttle Queue

The purpose of Slave Throttling is to ensure that no more than a certain number of Slaves are copying files from the network at any given time. The maximum number of Slaves that are allowed to copy files concurrently is called the Throttle Limit.

The regular workflow for a Slave is to pick up a job, copy the resources it needs to render the job, and then begin rendering tasks from that job. When Slave Throttling is enabled, Slaves must first enter a Throttle Queue, where they wait their turn before they are allowed to copy files. Slaves can only leave the Throttle Queue if fewer Slaves than the Throttle Limit are currently copying files.

The illustration above shows a snapshot of a farm with seven nodes, where the Throttle Limit is two. There are three Slaves waiting in the Throttle Queue, two are copying files, and two have finished copying and are free to render tasks.

Note that Slave Throttling only applies when a Slave first picks up a job. Once a Slave has copied the necessary files for a job to its local cache, it can continue to render tasks from that job without reentering the Throttle Queue.
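
To make the queueing rule concrete, here is a minimal Python sketch of a Throttle Queue. It is a toy model, not Deadline's actual implementation: the class name, the method names, and the first-come-first-served ordering are all my own assumptions for illustration. It reproduces the seven-node snapshot described above, with a Throttle Limit of two.

```python
from collections import deque


class ThrottleQueue:
    """Toy model of a Throttle Queue (hypothetical, not Deadline's code)."""

    def __init__(self, throttle_limit):
        self.throttle_limit = throttle_limit
        self.waiting = deque()  # Slaves waiting their turn, in arrival order
        self.copying = set()    # Slaves currently copying files

    def request_turn(self, slave):
        """Return True if the Slave may copy files now, False if it must wait."""
        if slave in self.copying:
            return True
        if slave not in self.waiting:
            self.waiting.append(slave)
        # Assumption: first come, first served. Only the Slave at the head of
        # the queue may leave, and only while fewer Slaves than the limit copy.
        if self.waiting[0] == slave and len(self.copying) < self.throttle_limit:
            self.waiting.popleft()
            self.copying.add(slave)
            return True
        return False

    def finish_copy(self, slave):
        """The Slave reports that it has finished copying, freeing a slot."""
        self.copying.discard(slave)


queue = ThrottleQueue(throttle_limit=2)
finished = []

# Two Slaves get their turn, copy their files, and move on to rendering.
for name in ["slave1", "slave2"]:
    queue.request_turn(name)
    queue.finish_copy(name)
    finished.append(name)

# The remaining five ask for a turn; only two are admitted.
for name in ["slave3", "slave4", "slave5", "slave6", "slave7"]:
    queue.request_turn(name)

print(len(queue.waiting), len(queue.copying), len(finished))  # 3 2 2
```

A Slave that has already finished copying never calls request_turn again for the same job, which mirrors the note above about rendering further tasks without reentering the queue.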

More Details and a “What If?”

Each Slave is responsible for reporting to the Throttle Queue when it is ready to copy files, when it is copying, and when it has finished copying files. So, what if after a Slave has started copying files, it suddenly bursts into flames and becomes a small pile of ashes? A rational person’s first thought might be “Is everything okay?” by which they of course mean “The Slave never had a chance to report that it was finished. Will it hold up the Throttle Queue forever?”

The answer is a reassuring no: it won’t prevent other Slaves from copying files in its place. Each Slave in the queue maintains a “heartbeat”, which is a signal to the Throttle Queue that everything is going okay. If a Slave hasn’t updated its heartbeat within a configurable timeout, it is assumed that the worst has happened, and it is purged from the queue.
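
The heartbeat idea can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the HeartbeatTracker class, its method names, and the 60-second timeout are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatTracker:
    """Toy heartbeat bookkeeping (hypothetical, not Deadline's code)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # Slave name -> time of its most recent heartbeat

    def beat(self, slave, now):
        """Record that the Slave reported in at time `now` (in seconds)."""
        self.last_seen[slave] = now

    def purge_stale(self, now):
        """Remove and return the Slaves whose heartbeat is older than the timeout."""
        stale = [
            slave
            for slave, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        ]
        for slave in stale:
            del self.last_seen[slave]
        return stale


tracker = HeartbeatTracker(timeout_seconds=60)
tracker.beat("healthy", now=0)
tracker.beat("doomed", now=0)  # bursts into flames right after this

# Only the healthy Slave keeps reporting in.
for t in (20, 40, 60):
    tracker.beat("healthy", now=t)

purged = tracker.purge_stale(now=70)
print(purged)  # ['doomed']
```

Once a stale Slave is purged, its slot is free again, so the Slaves behind it are not held up.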

Setting Up Slave Throttling

In Deadline 7.2, the Slave Throttling options can be found in Configure Repository Options by selecting the Pulse Settings option and switching to the Throttling tab. In order to use Slave Throttling in Deadline 7.2, Deadline Pulse must be running and Slaves must be able to connect to it.

In Deadline 8.0 and beyond, Deadline Pulse is no longer necessary for Slave Throttling, and the options for it can be found in Slave Settings, rather than Pulse Settings.

After enabling Slave Throttling via the checkbox, three configurable options become available. The first is the Throttle Limit, which is the maximum number of Slaves that can concurrently copy files. In the default configuration (shown above), the Throttle Limit is 10.

The second controls how often a Slave checks the Throttle Queue to see if it can start copying files. In the default configuration, a Slave waits 20 seconds between requests for its turn.

The last option controls how long a Slave can go without reporting in before it is assumed to have gone offline, at which point it loses its position in the Throttle Queue. This is a multiplier that is applied to the update interval. For instance, in the default configuration, a Slave that has not updated its heartbeat in the last 3 × 20 = 60 seconds is removed from the Throttle Queue.
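
As a quick check of that arithmetic, here is a small Python sketch of the three options and the timeout derived from them. The field names are hypothetical and do not correspond to Deadline's actual setting names; the default values are the ones described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThrottleSettings:
    """The three throttling options, with hypothetical field names."""

    throttle_limit: int = 10           # max Slaves copying files at once
    update_interval_seconds: int = 20  # how often a Slave checks the queue
    stale_multiplier: int = 3          # multiplier applied to the interval

    @property
    def stale_timeout_seconds(self):
        """Seconds without a heartbeat before a Slave is removed."""
        return self.stale_multiplier * self.update_interval_seconds


defaults = ThrottleSettings()
print(defaults.stale_timeout_seconds)  # 60

# Halving the update interval also halves the timeout.
faster = ThrottleSettings(update_interval_seconds=10)
print(faster.stale_timeout_seconds)  # 30
```

Because the timeout is a multiple of the update interval, changing the interval changes how quickly an unresponsive Slave is dropped, so the two values are worth tuning together.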

Wrap-up

And that’s all there is to it. If you’ve noticed your network getting bogged down when render nodes chaotically try to copy files all at once, give Slave Throttling a try.