Here at Flex, our Ciii Rack Scale Platforms team regularly deploys OpenStack and Ceph on large clusters with various SDN platforms. With repeated multi-rack deployment, validation, benchmarking, and tear-down, automation plays a crucial role in keeping our operations agile. For a small automation team supporting a large group of engineers working across 200+ servers, selecting the right tools is essential to simplify deployment, test infrastructure installation, debugging, and results collection. This lets the team focus on reference architecture designs, benchmark logic, and results analysis.
We had originally developed a Python-based automation framework for our testing. Its tasks included configuring operating system and OpenStack settings through their APIs, launching test workloads, and parsing output. However, with a small team, the upkeep of such a large code base and the increasing complexity of test parameters became tedious. We started looking at configuration management (CM) tools. We wanted a CM tool that was based on Python but easy for non-developers to use and straightforward to troubleshoot. After building a small proof of concept, we determined Ansible met these requirements.
Moving the entire benchmarking process from a pure Python-built framework to Ansible took less than a month. Much of our feature code became Ansible modules, and we no longer had to worry about the transport method. We could now focus on improving the benchmarking process and expanding the tool set rather than debugging thousands of lines of code. When we ran into issues, we found the very active Ansible community to be very helpful, whether through the IRC channel, Google Groups, or GitHub issues.
Once we started to expand the use of our benchmarking playbooks to other engineers, we realized we needed to simplify the task management process further. Using bash scripts to execute Ansible commands overnight to run tests led to hours of reading logs of failed tests to determine which execution failed, during which task of which role, for which hosts. When it comes to the cloud, there are hundreds or thousands of virtual machines, and the logs get large fast.
Tower gave us a simple, centralized interface where developers and non-developers alike could easily use Ansible without the command line, along with key Tower features such as:
- Role-based Access
- Job Scheduling
- Credentials Management
- OpenStack Support
- Real-time Job Events
- SCM Support
- Inventory Management
Not only did Tower improve usability, it also increased our efficiency and improved the overall appeal of Ansible by providing a simple GUI where we could launch numerous jobs with varying parameters without once touching a text editor!
The Benchmarking Process
We run tests against both bare-metal hardware and large clusters of virtual machines. The hardware spans many vendors and may be running hypervisors such as KVM and ESXi. The operating systems also vary between Debian-based (Ubuntu) and RHEL-based (CentOS). Then, to top it off, we also validate various cloud platforms, both private (OpenStack, VMware) and public (AWS, GCE, Rackspace). Our playbooks need to be flexible and compatible across all of these.
Each playbook has its own set of tests and tools. We try to keep the plays as modular as possible to allow maximum reusability between platforms. That way, when developing a playbook for a new benchmark or test procedure, we only need to modify a couple of tasks.
The structure of our playbooks is divided into two distinct roles:
- Common - as the name states, this is the role shared by all playbooks regardless of benchmark or tool. It handles system configuration tasks such as services, common dependencies, monitoring agents, etc.
- The benchmark-specific role, which is divided into four phases:
  - Preparation - prepares the test
  - Execution - runs the test
  - Collection - collects test results
  - Cleanup - cleans up residual files
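Under these conventions, a top-level playbook can simply wire the two roles together. The sketch below is illustrative only; the role and group names are assumptions, not our exact layout:

```yaml
# Hypothetical top-level playbook sketch: the shared common role runs
# first, then the benchmark-specific role carries out its four phases
# (preparation, execution, collection, cleanup).
- hosts: test_clients
  roles:
    - common    # services, common dependencies, monitoring agents
    - mprime    # benchmark-specific tasks for this playbook
```

Keeping the benchmark-specific logic in its own role is what lets a new benchmark reuse everything else unchanged.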
```yaml
- name: Extracting to Test Machine
  unarchive: src=roles/client/files/p95v287.linux64.tar.gz dest=/tmp/test

- name: Make sure mPrime Process isn't running
  pkill: name=mprime signals=-9

- name: Create benchmark config file
  template: src=templates/prime.txt.j2 dest=/tmp/test/prime.txt mode=0755
```
In the example above, we extract mPrime (a popular system stress test) onto the remote host, ensure the host isn't already running mprime, and then render an mPrime configuration file from a template with our desired settings.
```yaml
- name: Executing Benchmark
  shell: ./execution_wrapper.sh /tmp/test/_.out
```
Because mPrime writes its output to the terminal by default, we wrap the execution in a shell script and pass parameters into it. The extensive use of variables throughout each task goes back to the idea of keeping our playbooks as generic as possible, reducing the changes needed between playbooks.
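The real wrapper is benchmark-specific, but the idea can be sketched roughly as follows. The function name and the demonstration command are ours for illustration, not the contents of the actual execution_wrapper.sh:

```shell
#!/bin/sh
# Hypothetical sketch of an execution wrapper: run a workload detached
# from the terminal and capture its output in a file, so the Ansible
# shell task has a result file to work with afterwards.
run_detached() {
    outfile="$1"; shift
    # nohup keeps the workload alive if the session drops;
    # stdout and stderr both land in the output file.
    nohup "$@" > "$outfile" 2>&1 &
    wait $!    # block until the workload finishes
}

# Demonstration with a stand-in command instead of mprime:
run_detached /tmp/demo.out echo "benchmark finished"
cat /tmp/demo.out
```

In the playbook, the output file path is passed in as the first argument, which is why the shell task above hands the wrapper a templated file name.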
```yaml
- name: Run Parser on Result File and Generate CSV
  script: parser.py "" /tmp/test/_.out

- name: Fetching Result CSV File to Control Host
  fetch: src=/tmp/test/__result.csv dest=/tmp/results/ flat=yes

- name: Fetching Raw Output to Control Host
  fetch: src=/tmp/test/_.out dest=/tmp/results/ flat=yes
```
To collect the data in a human-readable manner, we have a parser script for each benchmark. The script takes the raw output from the tests and returns a nicely formatted CSV file. We fetch both the raw output and the CSV back to the control host, upload them to Elasticsearch, tarball them, and send them to a file server for storage.
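A minimal parser of this kind can be sketched as below. The input format is invented for illustration; each real parser matches its own tool's output:

```python
# Hypothetical sketch of a per-benchmark parser: convert raw benchmark
# output into a CSV with one row per result line.
import csv
import io
import re

# Assumed (illustrative) raw format: "[Worker #N] ... passed/failed"
RESULT_RE = re.compile(
    r"^\[Worker #(?P<worker>\d+)\].*?(?P<result>passed|failed)",
    re.IGNORECASE,
)

def parse_to_csv(raw_text, csv_file):
    """Write a header row, then one CSV row per matching result line."""
    writer = csv.writer(csv_file)
    writer.writerow(["worker", "result"])
    for line in raw_text.splitlines():
        match = RESULT_RE.match(line)
        if match:
            writer.writerow([match.group("worker"),
                             match.group("result").lower()])

raw = "[Worker #1] Self-test 1024K passed!\n[Worker #2] Self-test 1024K passed!\n"
out = io.StringIO()
parse_to_csv(raw, out)
print(out.getvalue())
```

The same shape (regex over raw lines, rows into `csv.writer`) generalizes to each benchmark's output with only the pattern and columns changing.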
```yaml
- name: Clean Up - Removing Remote Test Directory
  file: path=/tmp/test/ state=absent
```
Last but not least, clean-up is quite simple for this benchmark. Since all the binaries are located within the extraction destination, we can simply delete the remote folder and call it a day.
We are always adding more benchmarks and tools to our list of playbooks. One area which we will tackle soon is network automation. With the increase in software-defined networking and Linux-based switch software, using Ansible to automate network configuration to test different network topologies on the fly will be an interesting task to take on.