Zabbix Monitoring - How Busy Is Your Farm?

Version: Deadline 8.0

The Grand Plan

As a continuation of last week’s blog entry on your first steps into scripting in Deadline, let’s roll up our Pipeline TD sleeves and jump right into the deep end. We’ll look at Event Plugins working in partnership with our Scripting API to query statistical information from a Deadline farm, and then inject that information on a regular basis into a 3rd party IT monitoring system such as Zabbix. Gulp! Not to worry, all the code to make it work in Deadline 8.0 and onwards is already available for you on our GitHub site. Nifty, eh?

First, let’s remind ourselves what is available out of the box in Deadline for tracking metadata to your business advantage!

Farm Statistics

Through Farm Statistics, Deadline already tracks and stores all the metadata related to completed jobs in the queue and, critically, continues to store this information even after the jobs themselves have been deleted or archived. Additionally, other report types gather statistical information for individual Slaves, including each Slave’s running time, rendering time, and idle time. These reports also include the number of tasks the Slave has completed, the number of errors it has reported, and its average memory and CPU usage. You can use all of this information to figure out whether any Slaves aren’t being utilized to their full potential, BUT what if you want to know Slave usage per project? Note that some statistics can only be gathered if Pulse is running, so make sure you’re running the Pulse application in your farm setup. Lastly, don’t forget to actually Enable Statistics Gathering in your Repository Options as Super User.

Information is power, as they say, and with that in mind, Custom Reports can also be generated based on one of the built-in statistical objects in Deadline. The currently available data types are:

  • Completed Job Stats
  • Slave Stats
  • Repository Stats
  • Active Slave Stats
  • User Job Stats

These are super useful if you need to set up a custom report for your company, save it, and then let another user open the Farm Reports UI and run the report at a later date, or perhaps for a different date range. Of course, all the reports generated can have their data exported to the usual ‘data import’ friendly file formats such as *.csv and *.tsv for further processing in your favorite spreadsheet application.
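If a spreadsheet isn’t your thing, an exported *.csv can just as easily be processed in a few lines of Python. The following is only a minimal sketch: the column names and sample rows are invented for illustration and will not match a real Farm Reports export, so adjust them to whatever your own export contains.

```python
import csv
import io

# Hypothetical export: these column names and rows are made up for this
# example and will differ from a real Farm Reports export.
SAMPLE_EXPORT = """Slave Name,Running Time (h),Rendering Time (h),Idle Time (h)
render-01,100,80,20
render-02,100,25,75
"""


def utilization_by_slave(csv_text):
    """Return {slave name: rendering time / running time} for each row."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        running = float(row["Running Time (h)"])
        rendering = float(row["Rendering Time (h)"])
        # Guard against Slaves that never ran during the reporting window.
        result[row["Slave Name"]] = rendering / running if running else 0.0
    return result


def underused(csv_text, threshold=0.5):
    """List the Slaves whose utilization falls below the threshold."""
    usage = utilization_by_slave(csv_text)
    return sorted(name for name, used in usage.items() if used < threshold)
```

Calling underused(SAMPLE_EXPORT) on the sample data flags render-02, which spent most of its running time idle.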

Ssshhh...Secret Tricks

Did you know in Farm Reports that...

  • all the graphs can be saved as images by right-clicking them.
  • bar graphs can be zoomed by clicking and dragging the desired range for more granularity!
  • you can click and drag over the data columns or use shift/ctrl/cmd + click to select the columns of data you would like to copy’n’paste. You don’t have to use the export option.
  • the “Human Readable Copying” drop-down option controls the units of the copied data: disable it to copy base values such as bytes, or enable it to copy MB, GB, or TB. Useful if you don’t want to work out how many bytes are in a gigabyte! [Answers on a postcard!]

Back To Ground Zero

At VFX.co, the VFX Producers want to understand how much of the farm is being used at any one time for a given project in the studio, and to cross-reference how much time individual render nodes in each office spend rendering compared to sitting idle. The VFX Producers don’t use Deadline, but the IT team has made a Zabbix monitoring server available. IT already uses it to monitor the file servers and network, but alas it has no real insight into what the farm is doing beyond the obvious consequences of the entire render wall slamming the local file servers and network!

The rest of this blog post highlights how a solution was implemented using the Deadline OnHouseCleaning event callback. This gives us a callback that fires regularly, takes some of the farm statistical information, combines it with information stored at the Job and Slave level (the ExtraInfoX columns), and injects it all into a 3rd party graphing solution: Zabbix in this example. To communicate with the Zabbix server, we rely on 3rd party Python libraries, which ship in the plugin’s API folder.

Zabbix Event Plugin

The Zabbix event plugin will trigger each time that Pulse performs house cleaning, which by default is every 60 seconds. This interval can be adjusted in the House Cleaning section of the Repository Options. In addition, a Zabbix server must be configured and running with at least one Host available.

When this event plugin is triggered, it will create the necessary Zabbix items and graphs for the Host(s) you have chosen if they don’t already exist, and then it will collect and push statistics to the Zabbix items. You do not need to create the items or graphs manually.
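To make the “create if missing, then push” flow a little more concrete, here is a minimal sketch in plain Python. It assumes the statistics are pushed using the standard Zabbix sender (trapper) protocol, which this post does not confirm for the plugin itself, and the host name and item keys shown are hypothetical. The sketch only builds the packet; it does not open a network connection.

```python
import json
import struct


def missing_item_keys(wanted_keys, existing_keys):
    """Return the wanted keys that do not exist yet, preserving their order."""
    existing = set(existing_keys)
    return [key for key in wanted_keys if key not in existing]


def build_sender_packet(host, values):
    """Encode {item key: value} as a Zabbix sender request.

    The packet is the "ZBXD" signature, a 0x01 flag byte, the body length
    as an 8-byte little-endian integer, then the JSON body.
    """
    body = json.dumps({
        "request": "sender data",
        "data": [
            {"host": host, "key": key, "value": str(value)}
            for key, value in values.items()
        ],
    }).encode("utf-8")
    return b"ZBXD\x01" + struct.pack("<Q", len(body)) + body


# Hypothetical item keys, for illustration only.
WANTED = ["deadline.slaves.rendering", "deadline.slaves.idle"]
to_create = missing_item_keys(WANTED, ["deadline.slaves.rendering"])
packet = build_sender_packet("render-farm", {"deadline.slaves.rendering": 42})
```

In a real setup, the packet would be written to a TCP socket connected to the Zabbix server’s trapper port, and each key in to_create would first be created through the Zabbix API as a trapper item.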

Installation

To install, visit our GitHub repository and pull the latest version of the custom Zabbix plugin, then copy it into the \\your/repository/custom/events directory. Make sure you create a folder called Zabbix there, containing the two Zabbix.* plugin files and an API folder.

Here is a description of the files that are shipped with the Zabbix event plugin:

  • Zabbix.param: This file defines the controls used by the Deadline Monitor to configure the settings that are stored in the Deadline database.
  • Zabbix.py: This file contains all the code used to connect to Zabbix, generate the items and graphs, and calculate and push the statistics. We have added lots of comments to this code to explain what the plugin is doing if you’re interested in taking a look.
  • API: These are the various Zabbix API Python modules that the event plugin uses to communicate with Zabbix. These have not been modified by us.

Configuration

Open your Deadline Monitor, enter Super User Mode, and select Tools -> Configure Events. Then select the Zabbix plugin from the list on the left.

There are many settings you can configure here, but we’ll focus on the General, Connection, Slave Statistics, and Project Statistics settings. The rest of the settings are used to help name the various items and graphs in Zabbix, and can be left at their defaults.

General Settings

These are some general settings for the plugin.

  • State: This must be set to “Global Enabled” for Pulse to trigger this event plugin when doing house cleaning.
  • Verbose Logging: If set to True, more information will be logged when the event is triggered. You can see this information in Pulse’s log.

Connection Settings

These are the settings that are used by the plugin to create items and graphs in Zabbix and to push statistics to it.

  • Server URL: The URL for your Zabbix server.
  • Host Name or IP Address: The host name or IP address of your Zabbix server.
  • Port: The port used to connect to your Zabbix server.
  • Hosts: The hosts that the items and graphs will be created for (one per line).
  • User Name: The user name used to connect to your Zabbix server.
  • Password: The password used to connect to your Zabbix server.
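For the curious, this is roughly what talking to the Zabbix API with a Server URL, User Name, and Password looks like on the wire. Which client library the plugin uses is not covered in this post, so treat the following as a sketch of the raw JSON-RPC requests such a library would typically send. The URL and host name are placeholders, and whether your Server URL setting should already include the api_jsonrpc.php script name is an assumption to check against your own setup.

```python
import json


def api_endpoint(server_url):
    """Append the JSON-RPC script name unless the URL already ends with it."""
    url = server_url.rstrip("/")
    if url.endswith("api_jsonrpc.php"):
        return url
    return url + "/api_jsonrpc.php"


def rpc_request(method, params, request_id, auth=None):
    """Serialize a JSON-RPC 2.0 request, with an optional session token."""
    payload = {"jsonrpc": "2.0", "method": method,
               "params": params, "id": request_id}
    if auth is not None:
        payload["auth"] = auth
    return json.dumps(payload)


def login_request(user, password):
    # Older Zabbix releases call this parameter "user"; newer releases
    # renamed it to "username".
    return rpc_request("user.login", {"user": user, "password": password}, 1)


def host_lookup_request(host_names, auth_token):
    """Ask Zabbix for the IDs of the hosts listed in the Hosts setting."""
    params = {"output": ["hostid", "host"], "filter": {"host": host_names}}
    return rpc_request("host.get", params, 2, auth=auth_token)


# Placeholder values, for illustration only.
endpoint = api_endpoint("http://zabbix.example.com/zabbix/")
```

Each request is sent as an HTTP POST to the endpoint; the login response carries the session token that later calls pass along.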

Slave Statistics

These settings tell the event plugin where to get the Slave region information from. See the Slave Regions section below for more information.

  • Slave Regions: These are the names of the regions for the region-specific slave statistics (one per line).
  • Get Region From Extra Info: If set to True, the plugin will pull the Slave region information from one of the Extra Info properties in the Slave Settings. If set to False, then the region information will be pulled from the Slave’s region property in the Machine Settings.
  • Region Extra Info Index: If the plugin is pulling the region from the Slave’s Extra Info properties, this is the property to pull it from. For example, if the slave’s region name is stored in Extra Info 1, you would set this to 1.
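The way these three settings interact can be sketched in plain Python. The SlaveRecord class below is a stand-in invented for this example, not a Deadline API object, and the rendering/idle split is just one plausible statistic to compute per region.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveRecord:
    """Hypothetical stand-in for the Slave data the plugin reads."""
    name: str
    region: str = ""                                 # built-in region property
    extra_info: dict = field(default_factory=dict)   # Extra Info index -> value
    rendering: bool = False


def resolve_region(slave, from_extra_info, extra_info_index):
    """Pick the region source according to the plugin settings."""
    if from_extra_info:
        return slave.extra_info.get(extra_info_index, "")
    return slave.region


def slaves_per_region(slaves, regions, from_extra_info, extra_info_index):
    """Count rendering and idle Slaves for each configured region."""
    totals = {region: {"rendering": 0, "idle": 0} for region in regions}
    for slave in slaves:
        region = resolve_region(slave, from_extra_info, extra_info_index)
        if region not in totals:
            continue  # Slaves outside the configured regions are ignored.
        totals[region]["rendering" if slave.rendering else "idle"] += 1
    return totals


FARM = [
    SlaveRecord("node-01", region="London", extra_info={1: "Soho"}, rendering=True),
    SlaveRecord("node-02", region="London", extra_info={1: "Soho"}),
    SlaveRecord("node-03", region="Vancouver", extra_info={1: "Gastown"}, rendering=True),
]
```

With Get Region From Extra Info set to True and the index set to 1, node-01 lands in “Soho”; with it set to False, the same Slave lands in “London”.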

Project Statistics

These settings tell the event plugin where to get the Job project information from.

  • Project Names: The names of the projects for the project-specific statistics (one per line).
  • Project Extra Info Index: The job’s Extra Info property that contains the project information. For example, if you are storing the job’s project name in the job’s Extra Info 3 property, you would set this to 3.
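As a rough illustration of how a project name stored in a job’s Extra Info property can drive per-project statistics, here is a small sketch. The job dictionaries and the “rendering tasks” metric are assumptions made for this example; the post does not spell out exactly which per-project numbers the plugin calculates.

```python
def project_usage(jobs, project_names, extra_info_index):
    """Sum rendering tasks per configured project and compute each share.

    The share is a percentage of the tasks belonging to the configured
    projects only, not of the whole farm.
    """
    tasks = {name: 0 for name in project_names}
    for job in jobs:
        project = job["extra_info"].get(extra_info_index, "")
        if project in tasks:
            tasks[project] += job["rendering_tasks"]
    total = sum(tasks.values())
    share = {
        name: (100.0 * count / total if total else 0.0)
        for name, count in tasks.items()
    }
    return tasks, share


# Hypothetical jobs: the project name lives in Extra Info 3.
JOBS = [
    {"extra_info": {3: "ProjectA"}, "rendering_tasks": 6},
    {"extra_info": {3: "ProjectB"}, "rendering_tasks": 2},
    {"extra_info": {3: "Untracked"}, "rendering_tasks": 5},
    {"extra_info": {}, "rendering_tasks": 4},
]

tasks, share = project_usage(JOBS, ["ProjectA", "ProjectB"], 3)
```

Jobs whose Extra Info 3 is empty, or names a project that isn’t listed in Project Names, simply don’t contribute to any total.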

Slave Regions

There are two ways that you can configure which region a slave belongs to. You can either use Deadline’s built-in region feature, or you can simply specify a region name in one of the slave’s Extra Info properties (similar to how the project name is specified in the job’s Extra Info properties).

If you just want to use the Extra Info properties, simply select the slaves you want to add to a specific region in the slave list in the Monitor, then right-click and select Modify Slave Properties. Select the Extra Info page and type the name in the field (note which Slave Extra Info property you have set in your Zabbix event plugin configuration above). Also, make sure you have set Get Region From Extra Info to True in your plugin configuration.

If you want to use Deadline’s built-in Region feature, you must first create the regions, which can be done in the Region Settings in the Repository Options. Note that you don’t need to worry about the Cloud Region section.

After creating the regions, click on the Deadline Launcher icon in your system tray and select Change Region.... Pick the appropriate region in the Region drop-down box. Also, make sure you have set Get Region From Extra Info to False in your plugin configuration.

Sample Graphs

Here are some examples of graphs that were created by the Zabbix event plugin.

Wrapping Up

VFX Producers in VFX.co can now visualize node availability per office (region) as well as farm usage per project at any time via a 3rd party graphing system of their studio’s choice. Deadline’s Farm Statistics are extensive but can also be extended via our Python API. This example could be extended to inject telemetry metrics of your Slaves to be graphed against your IT infrastructure such as:

  • File server performance - graph aggregates
  • Slave disk i/o per job
  • Slave network i/o per job
  • Slave CPU/RAM levels per job

How could you take this further? Let us know if you extend this plugin, and feel free to create a pull request to contribute to our ever-growing GitHub code resource.