1000 Windows Servers on a 30-day rebuild

 

We run a large HTC (high-throughput computing) farm of 1100 physical Windows servers, and we decided to rebuild them on a 30-day cycle, meaning no server is ever more than 30 days old. The deployment is fully automated and zero-touch, built predominantly on Ansible and infrastructure-as-code (IaC) principles.
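As a rough illustration of what a 30-day cycle implies for 1100 servers, the arithmetic can be sketched as below. This is only a back-of-the-envelope sketch, not the pipeline itself: the function names, the node naming scheme, and the round-robin batching are all assumptions made for illustration.

```python
import math


def daily_rebuild_quota(total_servers: int, cycle_days: int) -> int:
    """Minimum rebuilds per day so every server is rebuilt once per cycle."""
    return math.ceil(total_servers / cycle_days)


def rebuild_batches(servers: list[str], cycle_days: int) -> list[list[str]]:
    """Spread servers over one batch per day of the cycle (round-robin)."""
    batches: list[list[str]] = [[] for _ in range(cycle_days)]
    for index, name in enumerate(servers):
        batches[index % cycle_days].append(name)
    return batches


# Hypothetical node names; the real naming scheme is not given in the abstract.
servers = [f"node{n:04d}" for n in range(1, 1101)]

quota = daily_rebuild_quota(len(servers), 30)
batches = rebuild_batches(servers, 30)

print(quota)                           # 37
print(max(len(b) for b in batches))    # 37
print(min(len(b) for b in batches))    # 36
```

In other words, keeping every server under 30 days old means rebuilding roughly 37 machines every day, which is why a zero-touch process matters at this scale.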


Why did we do it?
1) It prevents the build-up of unintentional artifacts on older nodes.
2) It allows frequent updating of software.
3) Servers do not require in-place patching.
4) It allows for more consistent node builds.
5) It reduces manual intervention and time spent on server builds.
6) It reduces troubleshooting time.
7) It delivers infrastructure to our research teams faster.


This automated pipeline was built using Ansible, Chocolatey, Jenkins, Git and DSC. There are approximately 300 touch-points in each build, so reaching this goal took significant engineering effort.

Some of the challenges we faced included multiple secure air-gapped environments, the fact that all servers are physical, and troubleshooting latency in software deployment.

We also have some great reporting and metrics, built with Splunk and Tableau, which we use to analyze, improve and tune our provisioning process and to highlight the business benefits.

Slides here

Presenters:

 

Nathan Michael, GResearch

 

Nic McElroy, GResearch