How Flex Ciii Uses Ansible Tower for Benchmarking

June 23, 2016 by Hugh Ma


We love stories about how Ansible Tower has solved problems and made work easier. Special thanks to Hugh Ma from Flex for sharing his story about Ansible Tower.

Here at Flex, our Ciii Rack Scale Platforms team regularly deploys OpenStack and Ceph on large clusters with various SDN platforms. With repeated multi-rack deployment, validation, benchmarking, and teardown, automation plays a crucial role in improving the agility of our operations. For a small automation team to support a large group of engineers working across 200+ servers, it is necessary to select the right tools to simplify deployment, test infrastructure installation, debugging, and results collection. This enables the team to focus on reference architecture designs, benchmark logic, and results analysis. 

Background

We had originally developed a Python-based automation framework for our testing. Some of its tasks included configuring operating system and OpenStack settings through their APIs, launching test workloads, and parsing output. However, with a small team, upkeep of such a large code base and the increasing complexity of test parameters became tedious. We started looking at configuration management (CM) tools. We wanted a CM tool that was based on Python but easy for non-developers to use and straightforward to troubleshoot. After building a small proof-of-concept, we determined Ansible met these requirements. 

The Transition

Moving the entire benchmarking process from a pure Python-built framework to Ansible took less than a month. Much of our feature code became Ansible modules, and we no longer had to worry about the transport method. We could now focus on improving the benchmarking process and expanding the tool set rather than debugging thousands of lines of code. When we ran into issues, we found the very active Ansible community helpful, whether through the IRC channel, Google Groups, or GitHub issues.

Tower

Once we started to expand the use of our benchmarking playbooks to other engineers, we realized we needed to simplify the task management process further. Using bash scripts to run Ansible commands overnight meant hours of reading logs of failed tests to determine which execution failed, during which task of which role, and for which hosts. In the cloud, where there are hundreds or thousands of virtual machines, the logs get large fast.

Tower gave us a simple, centralized interface where developers and non-developers alike could easily utilize Ansible without the command-line, along with key Tower features such as:

  • Role-based Access
  • Job Scheduling
  • Credentials Management
  • OpenStack Support
  • Real-time Job Events
  • SCM Support
  • Inventory Management
  • Surveys

Not only did Tower improve usability, it increased work efficiency and improved the overall appeal of Ansible by providing a simple GUI where we could launch numerous jobs of varying parameters without once touching a text editor!
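
Everything available in the Tower GUI is also exposed through its REST API, so the same job templates can be launched programmatically when needed. The snippet below is only a minimal sketch: the Tower hostname, job template ID, and credential variables are placeholders, and the endpoint reflects the Tower v1 API rather than anything specific to our setup.

- name: Launch a Tower job template via the REST API (illustrative)
  uri:
    url: https://tower.example.com/api/v1/job_templates/42/launch/   # placeholder host and template ID
    method: POST
    user: "{{ tower_user }}"
    password: "{{ tower_pass }}"
    force_basic_auth: yes
    validate_certs: no
    status_code: 201
  register: tower_launch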

The Benchmarking Process

We run tests against both bare-metal hardware and large clusters of virtual machines. The hardware spans many vendors, which may be running hypervisors such as KVM and ESXi. The operating systems also vary between Debian-based (Ubuntu) and RHEL-based (CentOS). Then to top it off, we also validate various cloud platforms, both private (OpenStack, VMware) and public (AWS, GCE, Rackspace). Our playbooks need to be flexible and compatible across all of these.
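
As a rough illustration of how a single task list stays portable across these distributions (the package name here is purely illustrative), Ansible facts let us branch on the OS family:

- name: Install benchmark dependencies (Debian/Ubuntu)
  apt: name=sysstat state=present update_cache=yes
  when: ansible_os_family == "Debian"

- name: Install benchmark dependencies (RHEL/CentOS)
  yum: name=sysstat state=present
  when: ansible_os_family == "RedHat"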

Each playbook has its own set of tests and tools. We try to keep the plays as modular as possible to allow maximum reusability between platforms. That way, when developing a playbook for a new benchmark or test procedure, we only need to modify a couple tasks. 

The structure of our playbooks is divided into two distinct roles (a top-level playbook tying the two together is sketched after the list):

  • Common
    • As the name states, this is the role common to all playbooks, regardless of benchmark or tool. It handles system configuration tasks such as services, common dependencies, monitoring agents, etc.
  • Client
    • Preparation - prepares the test
    • Execution - runs the test
    • Collection - collects test results
    • Cleanup - cleans up residual files
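
A top-level playbook then composes the two roles. The sketch below is illustrative only; the host group and the variable passed to the client role are placeholders:

- hosts: benchmark_clients
  become: yes
  roles:
    - common
    - { role: client, benchmark: mprime }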

Preparation Example:

- name: Extracting  to Test Machine
  unarchive: src=roles/client/files/p95v287.linux64.tar.gz dest=/tmp/test

- name: Make sure mPrime Process isn't running
  pkill: name=mprime signals=-9

- name: Create benchmark config file
  template: src=templates/prime.txt.j2 dest=/tmp/test/prime.txt mode=0755
  become: no

In the above example, we extract mPrime (a popular system stress test) on our remote host, ensure the host isn't already running mprime, and then render an mPrime configuration file from a template with our desired settings.

Execution Example:

- name: Executing  Benchmark
  shell: ./execution_wrapper.sh /tmp/test/_.out
  args:
    chdir: /tmp/test/
  register: execute_return

Because mPrime writes its output to the terminal by default, we wrap the execution in a shell script and pass parameters into it. The heavy use of variables throughout each task goes back to the idea of keeping our playbooks as generic as possible to minimize changes between playbooks.

Collection Example:

- name: Run Parser on Result File and Generate CSV
  script: parser.py "" /tmp/test/_.out

- name: Fetching Result CSV File to Control Host
  fetch: src=/tmp/test/__result.csv dest=/tmp/results/ flat=yes

- name: Fetching Raw Output to Control Host
  fetch: src=/tmp/test/_.out dest=/tmp/results/ flat=yes

To collect the data in a human-readable manner, we have parser scripts for each benchmark. The script takes the raw output from the tests and returns a nicely formatted CSV file. We fetch both the raw output and the CSV back to the control host, upload them to Elasticsearch, tarball them, and send them to a file server for storage.
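
The upload and archival steps vary by environment; a simplified sketch of the control-host side might look like the following, where the Elasticsearch URL, index name, and document fields are placeholders rather than our actual setup:

- name: Upload a result summary to Elasticsearch
  uri:
    url: http://elastic.example.com:9200/benchmarks/result   # placeholder endpoint and index
    method: POST
    body: "{{ result_doc | to_json }}"
    body_format: json
    status_code: 201
  vars:
    result_doc:
      host: "{{ inventory_hostname }}"
      benchmark: mprime
  delegate_to: localhost

- name: Tarball the fetched results before pushing them to the file server
  command: tar czf /tmp/results.tar.gz /tmp/results/
  delegate_to: localhost
  run_once: yes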

Clean-Up Example:

- name: Clean Up - Removing Remote Test Directory
  file: path=/tmp/test/ state=absent

Last but not least, clean-up is quite simple for this benchmark. Since all the binaries are located within the extraction destination, we can just delete the remote folder and call it a day.


Looking Ahead

We are always adding more benchmarks and tools to our list of playbooks. One area which we will tackle soon is network automation. With the increase in software-defined networking and Linux-based switch software, using Ansible to automate network configuration to test different network topologies on the fly will be an interesting task to take on.

Hugh Ma

Hugh Ma is a Cloud Validation Developer at Flex Ciii. He focuses on supporting various groups in developing automation processes for benchmarking and validation. He also works on Openstack and DevOps. When he’s not writing scripts to make machines do his bidding, he is available at Hugh.Ma@flextronics.com. Otherwise, he’ll be on #Ansible IRC, or riding his motorcycle off into the sunset.

