Using Ansible Automation Platform, GitLab CE and webhooks to deploy IIS website

Inside Red Hat Ansible Automation Platform, the Ansible Tower REST API is the key mechanism for integrating automation into the processes and tools that already exist in an environment. With Ansible Tower 3.6 we have brought direct integration with webhooks from GitHub and GitLab, including the enterprise on-premises versions. This means that changes in source control can trigger automation to apply changes to infrastructure configuration, deploy new services, reconfigure existing applications, and more. In this blog, I'll run through a simple scenario and apply the new integrated webhook feature.

Environment

My environment consists of Ansible Tower (one component of Red Hat Ansible Automation Platform), GitLab CE with a project already created, and a code server running an IDE with the same git repository cloned. A single inventory exists on Ansible Tower with just one host, an instance of Windows Server 2019 running on a certified cloud. For this example, I'm going to deploy IIS on top of this Windows server and make some modifications to the html file that I'd like to serve from this site.
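
For reference, a minimal sketch of that inventory expressed as YAML might look like the following (the hostname and connection settings are placeholders; in Ansible Tower these values live in the inventory's host and group variables rather than in a file):

windows:
  hosts:
    winserver2019.example.com:
  vars:
    ansible_connection: winrm
    ansible_port: 5986
    ansible_winrm_transport: ntlm
    ansible_winrm_server_cert_validation: ignore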

My playbook to deploy IIS is very simple:

---
- name: Configure IIS
  hosts: windows

  tasks:
  - name: Install IIS
    win_feature:
      name: Web-Server
      state: present

  - name: Start IIS service
    win_service:
      name: W3Svc
      state: started

  - name: Create website index.html
    win_copy:
      src: files/web.html
      dest: C:\Inetpub\wwwroot\index.html

All that I am doing here is adding the Web-Server feature, starting IIS and copying my site's html file to the default location for web content being served by IIS. 

My html file is just as basic:

<html>
<title></title>
<body>

</body>
</html>

Objective and setup

What I would like to happen is that, for each merge request that makes changes to this one IIS site, the site should be redeployed with this basic html file.


GitLab Access Token

As my webhook is triggered, I would like to update the merge request created in GitLab with the status of my Ansible Tower job. 

To accomplish this, I first have to create a personal access token for my GitLab account so that Ansible Tower can access the GitLab API. This is pretty painless. All I have to do is navigate to my user settings and select "Access Tokens" from the left side navigation panel:

[Screenshot: GitLab user settings, Access Tokens page]

I give my access token an easily recognizable name of "Ansible Tower," set the expiration date to the end of the month, and scope this access token to just the API. Upon clicking "Create personal access token," the token itself becomes visible and a new entry is shown at the bottom of this page:

[Screenshot: the new personal access token listed in GitLab]

Next, I will use this token to create a new credential in Ansible Tower of type "GitLab Personal Access Token":

[Screenshot: creating the GitLab Personal Access Token credential in Ansible Tower]

Upon saving, Ansible Tower now has API access to my GitLab account. 

Ansible Tower Job Template

Now that Ansible Tower has the ability to update my merge requests, I need to configure webhook access to my job template that is configured to run my simple IIS playbook. Since the Ansible Tower 3.6 release, there is now a checkbox on each job template called ENABLE WEBHOOK.


Once I select the option to ENABLE WEBHOOK, I am presented with a few new fields. I select GitLab as my WEBHOOK SERVICE and supply the credential I created using my GitLab personal access token. The WEBHOOK URL is prepopulated with the path to this job template, and upon saving my modifications a WEBHOOK KEY is generated, which I will use to configure the project hook in GitLab. Also note that my project allows me to override the SCM branch, which means the project will pull updates from the "change-web-text" branch instead of master.

GitLab Project Hook integration

The next step takes me back to GitLab, this time navigating to the integrations page of the project for which I would like to execute the webhook.


On the integrations page, I supply the URL (WEBHOOK URL from my job template in Ansible Tower) and Secret Token (WEBHOOK KEY from my job template in Ansible Tower). I also specify the Trigger as "Merge request events" which means that the URL I specified will be launched anytime a merge request is opened.


In action: Updating my website text

Now that I've given Ansible Tower access to my projects using a personal access token as a new credential type, configured my job template to enable webhooks, and configured a Project Hook on GitLab to respond to merge request events on my project, I'm ready to make a test commit of my html file.

Here, I add text to the <title> and <body> tags of my html document and save the file:

[Screenshot: editing web.html in the IDE]

Once I've committed my change on my "change-web-text" branch, I will push my code, go back to GitLab and open a merge request to merge changes back into master.


Opening this merge request will now trigger my webhook which will deploy my web page changes to my IIS site. Because I have configured Ansible Tower with my personal access token, Ansible Tower will post a link to the job executed as a result of the webhook trigger on the merge request.

If all has been configured correctly, I should see a new job being executed that corresponds to the job template with the configured webhook. I should also see a project update job that has been kicked off, pulling the latest changes from my GitLab project.


Selecting the job for "iis website create", which is the job template I configured for webhook execution, shows that the job was LAUNCHED BY webhook. EXTRA VARIABLES will show a lot of project-specific configuration facts and, more importantly, the job output should show that the job is executing what it's supposed to.


Upon completion, I should be able to pull up the IP of my IIS server and see the changes to my incredible html page:

[Screenshot: the updated IIS website]

Takeaways

Webhooks introduced in Ansible Tower 3.6 are an incredibly powerful way to launch automation in response to events in source control. While this basic website is just a quick and simple example, applying this functionality to infrastructure as code, where all service configurations are defined in Ansible Playbooks, shows just how powerful this feature can be.




Deep dive on VLANS resource modules for network automation

In October of 2019, as part of Red Hat Ansible Engine 2.9, the Ansible Network Automation team introduced the concept of resource modules.  These opinionated network modules make network automation easier and more consistent for those automating various network platforms in production.  The goal for resource modules was to avoid creating overly complex jinja2 templates for rendering network configuration. This blog post goes through the eos_vlans module for the Arista EOS network platform.  I walk through several examples and describe the use cases for each state parameter and how we envision these being used in real world scenarios.

Before starting, let's quickly explain the rationale behind the naming of the network resource modules. Notice that for resource modules that configure VLANs there is a singular form (eos_vlan, ios_vlan, junos_vlan, etc.) and a plural form (eos_vlans, ios_vlans, junos_vlans).  The new resource modules, which we are covering today, are the plural form; the singular form has been deprecated. This was done so that those using the existing network modules would not have their Ansible Playbooks stop working and would have sufficient time to migrate to the new network automation modules.

VLAN Example

Let's start with an example of the eos_vlans resource module:

---
- name: add vlans
  hosts: arista
  gather_facts: false
  tasks:
    - name: add VLAN configuration
      eos_vlans:
        config:
          - name: desktops
            vlan_id: 20
          - name: servers
            vlan_id: 30
          - name: printers
            vlan_id: 40
          - name: DMZ
            vlan_id: 50

There is an implicit state parameter which defaults to merged (i.e. state: merged).  If we run this Ansible Playbook, VLANs 20, 30, 40 and 50 will be merged into the running configuration of any device in the arista group.  The show vlan output on a new Arista switch will look like the following:

rtr2#show vlan
VLAN  Name                             Status    Ports
----- -------------------------------- --------- -------------------------------
1     default                          active
20    desktops                         active
30    servers                          active
40    printers                         active
50    DMZ                              active

while the running configuration will look like the following:

rtr2#show running-config | s vlan
vlan 20
   name desktops
!
vlan 30
   name servers
!
vlan 40
   name printers
!
vlan 50
   name DMZ

Now let's make a change manually to the network configuration:

rtr2(config)#vlan 100
rtr2(config-vlan-100)#name artisanal_vlan
rtr2(config-vlan-100)#end
rtr2#wr
Copy completed successfully.

If I re-run the Ansible Playbook it returns with changed=0, because it only cares about VLANs 20, 30, 40 and 50. It won't remove VLAN 100 because the state parameter is set to merged by default, so it will only merge the data model it knows about. It is simply enforcing the configuration policy for the VLANs I am sending.

Using the 'state' parameter

What happens if I change the state parameter to replaced?  Just change the previous example to the following:

---
- name: add vlans
  hosts: arista
  gather_facts: false
  tasks:
    - name: add VLAN configuration
      eos_vlans:
        state: replaced
        config:
          - name: desktops
            vlan_id: 20
          - name: servers
            vlan_id: 30
          - name: printers
            vlan_id: 40
          - name: DMZ
            vlan_id: 50

The Ansible Playbook ran just like before with changed=0. Can we tell if it removed the artisanal_vlan 100?

rtr2#show vlan
VLAN  Name                             Status    Ports
----- -------------------------------- --------- -------------------------------
1     default                          active
20    desktops                         active
30    servers                          active
40    printers                         active
50    DMZ                              active
100   artisanal_vlan                   active

Nope! The goal of resource modules is to update existing resources to match the supplied data model. Since our data model (the key-value pairs that represent the VLANs, passed under the config parameter in the playbook) only includes VLANs 20, 30, 40 and 50, the eos_vlans module only updates parameters relevant to those particular VLANs.

Why would I use this versus merged? The major difference between merged and replaced is that merged just makes sure the commands represented within the data model are present, whereas replaced makes your resource match the data model. Let's look at the eos_vlans module to see what it considers part of the vlans resource.

There are three parameters currently used for the vlans resource:

  • name
  • state (active or suspend)
  • vlan_id (range 1-4094)
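
For illustration, a single entry in the config list that uses all three keys might look like this (the VLAN name, ID and suspend state are just example values):

- name: voice
  vlan_id: 60
  state: suspend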

Let's look at the following example:

Data Model Sent

- name: desktops
  vlan_id: 20

Existing Arista Config

vlan 200
   state suspend
!

This is how merged compares to replaced:

merged

vlan 200
  name desktops
  state suspend
!

replaced

vlan 200
   name desktops
!

The replaced parameter enforces the data model on the network device for each configured VLAN.  In the example above it will remove the state suspend because it is not within the data model.  To think of this another way, the replaced parameter is aware of commands that shouldn't be there as well as what should.

Using the overridden state parameter

What happens if I change the state parameter to overridden?  Just change the original example to the following:

---
- name: add vlans
  hosts: arista
  gather_facts: false
  tasks:
    - name: add VLAN configuration
      eos_vlans:
        state: overridden
        config:
          - name: desktops
            vlan_id: 20
          - name: servers
            vlan_id: 30
          - name: printers
            vlan_id: 40
          - name: DMZ
            vlan_id: 50

Now run the Ansible Playbook:

[Screenshot: playbook run output showing changed=1]

The Ansible Playbook now has changed=1.  But did it remove the artisanal_vlan 100?

Logging into one of the Arista devices confirms it did!

rtr2#show vlan
VLAN  Name                             Status    Ports
----- -------------------------------- --------- -------------------------------
1     default                          active
20    desktops                         active
30    servers                          active
40    printers                         active
50    DMZ                              active

The overridden parameter enforces the data model across the entire vlans resource.  This means it removes any VLANs that are not defined in the data model being sent.

Takeaways

There are currently three ways to push configuration using resource modules.  These are the merged, replaced and overridden parameters. These allow much more flexibility for network engineers to adopt automation in incremental steps.  We realize that most folks will start with the merged parameter as they gain familiarity with the new resource module concepts. Over time organizations will move towards the overridden parameter as they adopt a standard SoT (source of truth) for their data models, wherever they reside.




Agnostic network automation examples with Ansible and NRE Labs

On February 10th, the NRE Labs project launched four Ansible Network Automation exercises, made possible by Red Hat and Juniper Networks.  This blog post covers the job responsibilities of an NRE, the goal of NRE Labs, and a quick overview of the new exercises and the concepts Red Hat and Juniper are jointly demonstrating.  The intended audience for these initial exercises is someone new to Ansible Network Automation, with limited experience with Ansible and network automation. The initial network topology for these exercises covers Ansible automating Juniper Junos OS and Cumulus VX virtual network instances.

About NRE Labs

Juniper has defined an NRE, or network reliability engineer, as someone who can help an organization with modern network automation.  This concept has many different names, including DevOps for networks, NetDevOps, or simply network automation.  Juniper and Red Hat realized that this skill set is new to many traditional network engineers and worked together to create online exercises to help folks get started with Ansible Network Automation.  Specifically, Juniper worked with us through NRE Labs, a project they started and co-sponsor that offers a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. This works through short, simple exercises within your browser.  You can find NRE Labs at the following location: https://nrelabs.io

With Red Hat Ansible Engine 2.9 we introduced the concept of resource modules and native fact gathering, so I wanted to make sure that these exercises covered the latest and greatest aspects of Ansible Network Automation to make this turnkey for network engineers.  If you are new to resource modules, native fact gathering or even just the Juniper network platform, I think it is worth skimming through these exercises!

Let's begin with a network diagram:

[Network topology diagram: Ansible automating Juniper Junos and Cumulus VX instances]

Each of the four exercises has a different set of objectives outlined, step-by-step instructions and takeaways for your Ansible knowledge.

Exercise 1

This exercise covers what an Ansible INI-based inventory looks like, the Ansible configuration file (ansible.cfg) and running an Ansible Playbook for enabling NETCONF on Juniper Junos.  This exercise also illustrates the concept of idempotency and why it is important for network automation.
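
As a rough sketch (the group name and connection settings here are assumptions, not the exercise's exact playbook), the NETCONF-enablement play boils down to something like this:

---
- name: enable NETCONF on Juniper Junos
  hosts: juniper
  gather_facts: false
  connection: network_cli
  tasks:
    - name: ensure NETCONF is enabled (default port 830)
      junos_netconf: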

Exercise 2 - Facts

This exercise covers native fact gathering (using gather_facts: True) and using the debug module.  We show how to quickly print serial numbers and version numbers to the terminal window using just three tasks.
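
A minimal sketch of the idea, using the standard network fact names (the group name is an assumption):

---
- name: display device information
  hosts: juniper
  gather_facts: true
  tasks:
    - name: print the serial number
      debug:
        var: ansible_net_serialnum

    - name: print the OS version
      debug:
        var: ansible_net_version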

Exercise 3 - Resource Facts

This exercise covers more in depth fact gathering using the junos_facts module in conjunction with the new gather_network_resources parameter.  This allows the junos_facts module to gather facts from any resource module to read in network configurations and store them as YAML/JSON.  This exercise also covers converting these facts into a structured YAML file.
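
A sketch of that workflow, assuming we are interested in the interfaces resource (the resource name and destination path are placeholders):

---
- name: read network configuration as structured data
  hosts: juniper
  gather_facts: false
  tasks:
    - name: gather facts for the interfaces resource
      junos_facts:
        gather_network_resources:
          - interfaces

    - name: write the structured configuration to a YAML file
      copy:
        content: "{{ ansible_network_resources | to_nice_yaml }}"
        dest: "{{ inventory_hostname }}_interfaces.yml"
      delegate_to: localhost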

Exercise 4 - Network Configuration Templates

This exercise covers using and understanding host variables, using simple Jinja2 templating, using the junos_config module for Juniper Junos and the template module for Cumulus Linux.  The overarching goal of this exercise is using Ansible Network Automation to create an OSPF adjacency between the Cumulus VX device cvx11 and the Juniper Junos device vqfx1.
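
In broad strokes, the configuration push looks something like the sketch below; the template names, file path and task details are placeholders rather than the exercise's exact content:

---
- name: configure OSPF on the Juniper Junos device
  hosts: vqfx1
  gather_facts: false
  tasks:
    - name: render and load the OSPF configuration
      junos_config:
        src: templates/junos_ospf.j2

- name: configure OSPF on the Cumulus VX device
  hosts: cvx11
  gather_facts: false
  become: true
  tasks:
    - name: render the routing configuration from a template
      template:
        src: templates/cumulus_ospf.j2
        dest: /etc/frr/frr.conf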




How useful is Ansible in a Cloud-Native Kubernetes Environment?

A question I've been hearing a lot lately is "why are you still using Ansible in your Kubernetes projects?" Followed often by "what's the point of writing your book Ansible for Kubernetes when Ansible isn't really necessary once you start using Kubernetes?"

I spent a little time thinking about these questions, and the motivation behind them, and wanted to write a blog post addressing them, because it seems a lot of people may be confused about what Kubernetes does, what Ansible does, and why both are necessary technologies in a modern business migrating to a cloud-native technology stack (or even a fully cloud-native business).

One important caveat to mention upfront, and I quote directly from my book:

While Ansible can do almost everything for you, it may not be the right tool for every aspect of your infrastructure automation. Sometimes there are other tools which may more cleanly integrate with your application developers' workflows, or have better support from app vendors.

We should always guard against the golden hammer fallacy. No single infrastructure tool---not even the best Kubernetes-as-a-service platform---can fill the needs of an entire business's IT operation. If anything, we have seen an explosion of specialist tools as is evidenced by the CNCF landscape.

Ansible fits into multiple areas of cloud-native infrastructure management, but I would like to specifically highlight three areas in this post:

[Venn diagram: Container Build, Cluster Management, Application Lifecycle]

Namely, how Ansible fits into the processes for Container Builds, Cluster Management, and Application Lifecycles.

I'd especially caution against teams diving into Kubernetes head first without a broader automation strategy. Kubernetes can't manage your entire application lifecycle, nor can it bootstrap itself. You should not settle for automating the inside of a Kubernetes cluster while using manual processes to build and manage the cluster itself; this becomes especially dangerous if you manage more than one cluster, as is best practice for most environments (at least a staging and a production cluster, or a private internal cluster and a public-facing cluster).

Container Build

In the past decade, server management and application deployment became more and more automated. Usually, automation became more intuitive and maintainable, especially after the introduction of configuration management and orchestration tools like CFEngine, Puppet, Chef, and Ansible.

There's no great solution for all application deployments, though, even with modern automation tools. Java has WAR files and the VM. Python has virtual environments. PHP has scripts and multiple execution engines. Ruby has ruby environments. Running operations teams who can efficiently manage servers and deployments for five, ten, or more development stacks (and sometimes multiple versions of each, like Java 7, Java 8, and Java 11) is a failing proposition.

Luckily, containerization started to solve that issue. Instead of developers handing off source code and expecting operations to be able to handle the intricacies of multiple environments, developers hand off containers, which can be run by a compatible container runtime on almost any modern server environment.

But in some ways, things have stagnated in the container build realm; the Dockerfile, which was nothing more than a shell script with some Docker-specific DSL and hacky inline commands to solve image layer size issues, is still used in many places as the de facto app build script.

Geerling Blog 3

How many times have you encountered an indecipherable Dockerfile like this?

We can do better. Ansible can build and manage containers using Dockerfiles, sure, but Ansible is also very good at building container images directly---and nowadays, you don't even need to install Docker! There are lighter-weight open source build tools like Buildah that integrate with ansible-bender, an Ansible container build tool, to build containers using more expressive and maintainable Ansible Playbooks.
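
As a rough sketch of the idea (the metadata keys follow ansible-bender's documented playbook variables; the image names and tasks are placeholders, so check the project docs before relying on them):

---
- name: build a web application image
  hosts: all
  vars:
    ansible_bender:
      base_image: registry.fedoraproject.org/fedora:31
      target_image:
        name: my-web-app
        cmd: python3 /srv/app/run.py
  tasks:
    - name: install the runtime
      package:
        name: python3
        state: present

    - name: copy in the application
      copy:
        src: app/
        dest: /srv/app/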

There are other ways to build containers, too. But I lament the fact that many developers and sysadmins have settled on the lowest common denominator, the Dockerfile, to build their critical infrastructure components, when there are more expressive, maintainable, and universal tools like Ansible which produce the same end result.

Cluster Management

Kubernetes Clusters don't appear out of thin air. Depending on the type of clusters you're using, they require management for upgrades and integrations. Cluster management can become crippling, especially if, like most organizations, you're managing multiple clusters (multiple production clusters, staging and QA clusters, etc.).

If you're running inside a private cloud, or on bare metal servers, you will need a way to install Kubernetes and manage individual servers in the cluster. Ansible has a proven track record of being able to orchestrate multi-server applications, and Kubernetes itself is a multi-server application---which happens to manage one or thousands of other multi-server applications through containerization.

Projects like Kubespray have used Ansible for custom Kubernetes cluster builds and are compatible with dozens of different infrastructure arrangements.

Even if you use a managed Kubernetes offering, like AKS, EKS, or GKE, Ansible has modules like azure_rm_aks, aws_eks_cluster, and gcp_container_cluster, which manage clusters, along with thousands of other modules which simplify and somewhat standardize cluster management among different cloud providers.

Even if you don't need multi-cloud capabilities, Ansible offers useful abstractions like managing CloudFormation template deployments on AWS with the cloudformation module, or Terraform deployments with the terraform module.

It's extremely rare to have an application which can live entirely within Kubernetes and not need to be coordinated with any external resource (e.g. networking device, storage, external database service, etc.). If you're lucky, there may be a Kubernetes Operator to help you integrate your applications with external services, but more often there's not. Here, too, Ansible helps by managing a Kubernetes application along with external integrations, all in one playbook written in cloud-native's lingua franca, YAML.
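
A simplified sketch of that idea, using the k8s module plus a stand-in for whatever external system you need to touch (the manifest path and the final task are placeholders):

---
- name: deploy the application and its external dependencies
  hosts: localhost
  gather_facts: false
  tasks:
    - name: ensure the application Deployment exists
      k8s:
        state: present
        src: manifests/deployment.yml

    - name: update the external integration
      # Replace this with the module that manages your external resource
      # (DNS record, load balancer pool, database user, storage export, etc.)
      debug:
        msg: "Point external services at the new deployment here"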

I'll repeat what I said earlier: you should not settle for automating the inside of a Kubernetes cluster while using manual processes to build and manage your cluster---especially if you have more than one cluster!

Application Lifecycle

The final area where Ansible shows great promise is in managing applications inside of Kubernetes. Using Ansible to build operators with the Operator SDK, you can encode all your application's lifecycle management (deployment, upgrades, backups, etc.) inside of a Kubernetes operator to be placed in any Kubernetes cluster---even if you don't use Ansible to manage anything else in that cluster.

Rather than forcing developers and ops teams to learn Go or another specialized language to maintain an operator, you can build it with YAML and Ansible.
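
Under the hood, an Ansible-based operator is driven by a small watches.yaml file that maps a custom resource to the role or playbook that reconciles it; a sketch with placeholder group, kind and paths:

---
# Map the MyApp custom resource to an Ansible role inside the operator image
- version: v1alpha1
  group: apps.example.com
  kind: MyApp
  role: /opt/ansible/roles/myapp
  # A playbook can be referenced instead of a role:
  # playbook: /opt/ansible/myapp.yml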

There is a lot of promise here, though there are scenarios---at least, in the current state of the Operator SDK---where you might need to drop back to Go for more advanced use cases. The power comes in the ability to rely on Ansible's thousands of modules from within your running Application operator in the cluster, and in the ease of adoption for any kind of development team.

For teams who already use Ansible, it's a no-brainer to migrate their existing Ansible knowledge, roles, modules, and playbooks into Kubernetes management playbooks and Ansible-based operators. For teams new to Ansible, its flexibility for all things related to IT automation (Networking, Windows, Linux, Security, etc.) and ease of use make it an ideal companion for cloud-native orchestration.




Rebooting Network Devices with Ansible

With the Red Hat Ansible Automation Platform release in November, we released over 50 network resource modules to help make automating network devices easier and more turnkey for network engineers.  In addition to the new resource modules, Andrius also discussed fact gathering enhancements in his blog post, which means with every new resource module, users gain increased fact coverage for network devices.  For this blog post I want to cover another cool enhancement that may have gone unnoticed: the ability for network devices to make use of the wait_for_connection module.  If you are a network engineer who has operational Ansible Playbooks that need to reboot devices or take them offline, this module will help you write more programmatic playbooks that handle disconnects.  By leveraging wait_for_connection, network automation playbooks can look and behave more like playbooks for Linux or Windows hosts.

Comparing wait_for and wait_for_connection

There are two great modules that can wait for a condition to be met: wait_for and wait_for_connection.  I highly recommend avoiding the pause module if you can get away with it; it is the programming equivalent of putting a sleep in an Ansible Playbook.  Using either of the two wait_for modules is superior to random pauses within your Ansible Playbook because they are a more programmatic solution that adapts to devices taking different amounts of time to complete a task.  The other problem with the pause module is that using prompts does not work within Ansible Tower. A much better solution for human interaction would be to use an Ansible Tower workflow with an approval node.

The wait_for module can wait until a path on a filesystem exists, or until a port is active again.  This works great for most reboot use cases, except when a system cannot be logged into immediately after the port is up.  The wait_for_connection module extends the wait_for use case a bit further: it makes sure that Ansible can log back into the device and receive the appropriate prompts before completing the task. For Linux and Windows hosts it uses the ping or win_ping module; for network devices it makes sure the connection plugin that was last used can fully connect to the device.  At the time of this blog post this only works with the network_cli connection plugin.  This means that subsequent tasks can begin operating as intended as soon as wait_for_connection completes, whereas wait_for only knows that the port is open.
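
A rough sketch of the difference (the timers and port are placeholders):

- name: wait for SSH to start listening again (port check only)
  wait_for:
    host: "{{ ansible_host }}"
    port: 22
    delay: 30
  delegate_to: localhost

- name: wait until Ansible can actually log back into the device
  wait_for_connection:
    delay: 30
    timeout: 600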

Dealing with prompts

With networking devices when we perform operational tasks such as a reboot, there is often a prompt to confirm that you want to take an action.

For example on a Juniper vSRX device:

admin@rtr3> request system reboot
Reboot the system ? [yes,no] (no)

The user has to confirm the reload to be able to proceed. Something I neglected to cover in my cli_command deep dive blog was that the cli_command module can handle prompts. The cli_command module can even handle multiple prompts! For this example the Cisco router had not saved its config, and we are performing a reload. First the Cisco router will alert me that the System configuration has been modified, and ask me if I want to save this before I lose my running-configuration:

rtr1#reload

System configuration has been modified. Save? [yes/no]:

After confirming yes or no, you will receive a second prompt:

Proceed with reload? [confirm]

We need to build a task that can handle both prompts using the cli_command module:

---
- name: reboot ios device
  cli_command:
    command: reload
    prompt:
      - Save?
      - confirm
    answer:
      - y
      - y

The above task will answer yes to both prompts, saving the config and reloading the device. The list for prompt and the list for answer must match and be in the same order. This means that the answer for prompt[0] must be answer[0].

If you want to see a more detailed example of handling multiple prompts, here is an example of a password reset on a Juniper vSRX device.

Using reset_connection in combination

Now that you understand how to reboot the device with cli_command, we can combine that with wait_for_connection to create a reusable Ansible Playbook. However, we need to add one more task, meta: reset_connection, to make this work programmatically.

We need to make sure the current connection to the network device is closed so that the socket can be reestablished to the network device after the reboot takes place.  If the connection is not closed and the command timeout is longer than the time it takes to reboot, the persistent connection will attempt to reuse the closed SSH connection resulting in the failure "Socket is closed". A correct Ansible Playbook looks like this:

- reboot task (this is a snippet, full task removed for brevity)

- name: reset the connection
  meta: reset_connection

- name: Wait for the network device to reload
  wait_for_connection:
    delay: 10

Now we have an Ansible Playbook that can reconnect to network devices after a reboot is issued! For a full example please refer to this reboot.yml Ansible Playbook for Arista vEOS network devices.

Where to go next?

This blog helped outline how to create reusable Ansible Playbooks for rebooting network devices.  One of the next steps is obviously building out an Ansible Role that can reboot multiple network platforms.  I have gone ahead and created one and uploaded it to GitHub here.  This role will work on Juniper Junos, Cisco IOS and Arista EOS devices and can be easily modified to handle many more network operating systems.




Getting Started With Ansible Content Collections

With the release of Red Hat Ansible Automation Platform, Ansible Content Collections are now fully supported. Ansible Content Collections, or collections, represent the new standard of distributing, maintaining and consuming automation. By combining multiple types of Ansible content (playbooks, roles, modules, and plugins), flexibility and scalability are greatly improved.

Who Benefits?

Everyone!

Traditionally, module creators have had to wait for their modules to be marked for inclusion in an upcoming Ansible release or had to add them to roles, which made consumption and management more difficult. By shipping modules within Ansible Content Collections along with pertinent roles and documentation, and removing the barrier to entry, creators are now able to move as fast as the demand for their creations. For a public cloud provider, this means new functionality of an existing service or a new service altogether, can be rolled out along with the ability to automate the new functionality.

For the automation consumer, this means that fresh content is continuously made available for consumption. Managing content in this manner also becomes easier as modules, plugins, roles, and docs are packaged and tagged with a collection version. Modules can be updated, renamed, improved upon; roles can be updated to reflect changes in module interaction; docs can be regenerated to reflect the edits and all are packaged and tagged together. 

On top of this, before collections, it was not uncommon for modules to break or lack timely updates needed to interact with the services they were interfacing with. This often required Ansible users or Ansible Tower administrators to run multiple versions of Ansible in virtual environments in order to consume a patch that addressed a module issue. Ansible Content Collections bring stability and predictability by breaking modules out from the core distribution.

For automated organizations, this means that certified content is readily available to be applied to use-cases ripe for automation from day one.

Where to Find Collections

With the launch of Red Hat Ansible Automation Platform, Automation Hub will be the source for certified collections. Additionally, collections creators can also package and distribute content on Ansible Galaxy. Ultimately, it is up to the creator to decide the delivery mechanism for their content, with Automation Hub being the only source for Red Hat Certified Collections.

A Closer Look at Collections

An Ansible Content Collection can be described as a package format for Ansible content:

[Example collection directory tree]

This format has a simple, predictable data structure, with a straightforward definition:

  • docs/: local documentation for the collection
  • galaxy.yml: source data for the MANIFEST.json that will be part of the collection package
  • playbooks/: playbooks reside here 
    • tasks/: this holds 'task list files' for include_tasks/import_tasks usage
  • plugins/: all ansible plugins and modules go here, each in its own subdir
  • roles/: directory for ansible roles
  • tests/: tests for the collection's content

More information regarding collection metadata

Interacting with Collections

In addition to downloading collections through the browser, the ansible-galaxy command line utility has been updated to manage collections, providing much of the same functionality as has always been present to manage, create and consume roles. For example, ansible-galaxy collection init can be used to create a starting point for a new user created collection.


Along with the correct directory structure to start creating a collection from, this command also generates a metadata file that will be used while building the collection with namespace and collection name pre-populated:

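For example, running ansible-galaxy collection init my_namespace.my_collection produces a galaxy.yml roughly like the following (the values shown are illustrative defaults):

namespace: my_namespace
name: my_collection
version: 1.0.0
readme: README.md
authors:
  - your name <example@domain.com>
description: your collection description
license:
  - GPL-2.0-or-later
tags: []
dependencies: {}
repository: http://example.com/repository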

Where to Go Next

Ansible Content Collections were first introduced as tech preview in Ansible Engine 2.8 and are now fully supported in Ansible Engine 2.9 and are an integral part of Red Hat Ansible Automation Platform. Collections allow Red Hat Ansible Automation Platform to offer certified, stable content in order to continue expanding use cases for automation. Future posts will dive deeper into developing new collections and converting existing roles into collections.




Ansible and ServiceNow Part 3, Making outbound RESTful API calls to Red Hat Ansible Tower

Red Hat Ansible Tower offers value by allowing automation to scale in a checked manner - users can run playbooks for only the processes and targets they need access to, and no further. 

Not only does Ansible Tower provide automation at scale, but it also integrates with several external platforms. In many cases, this means that users can use the interface they are accustomed to while launching Ansible Tower templates in the background. 

One of the most ubiquitous self-service platforms in use today is ServiceNow, and many of the enterprise conversations we have with Ansible Tower customers focus on ServiceNow integration. With this in mind, this blog entry walks through the steps to set up your ServiceNow instance to make outbound RESTful API calls into Ansible Tower, using OAuth2 authentication.

The following software versions are used:

  • Ansible Tower: 3.4, 3.5
  • ServiceNow: London, Madrid

If you sign up for a ServiceNow Developer account, ServiceNow offers a free instance that can be used for replicating and testing this functionality. Your ServiceNow instance needs to be able to reach your Ansible Tower instance. Additionally, you can visit https://ansible.com/license to obtain a trial license for Ansible Tower. Instructions for installing Ansible Tower can be found here.

Preparing Ansible Tower

  1. In Ansible Tower, navigate to Applications on the left side of the screen. Click the green plus button on the right, which will present you with a Create Application dialog screen. Fill in the following fields:
     • Name: Descriptive name of the application that will contact Ansible Tower
     • Organization: The organization you wish this application to be a part of
     • Authorization Grant Type: Authorization code
     • Redirect URIS: https://<snow_instance_id>.service-now.com/oauth_redirect.do
     • Client Type: Confidential

  2. Click the green Save button on the right, at which point a window will pop up, presenting you with the Client ID and Client Secret needed for ServiceNow to make API calls into Ansible Tower. This will only be presented ONCE, so capture these values for later use.

  3. Next, navigate to Settings->System on the left side of the screen. You'll want to toggle the Allow External Users to Create Oauth2 Tokens option to on. Click the green Save button to commit the change.

Preparing ServiceNow

  1. Moving over to ServiceNow, navigate to System Definition->Certificates. This will take you to a screen of all the certificates ServiceNow uses. Click on the blue New button, and fill in these details:
     • Name: Descriptive name of the certificate
     • Format: PEM
     • Type: Trust Store Cert
     • PEM Certificate: The certificate to authenticate against Ansible Tower with. You can use the built-in certificate on your Tower server, located at /etc/tower/tower.cert. Copy the contents of this file into the field in ServiceNow.

     Click the Submit button at the bottom.

  2. In ServiceNow, navigate to System OAuth->Application Registry. This will take you to a screen of all the Applications ServiceNow communicates with. Click on the blue New button, and you will be asked what kind of OAuth application you want to set up. Select Connect to a third party Oauth Provider.

  3. On the new application screen, fill in these details:
     • Name: Descriptive Application Name
     • Client ID: The Client ID you got from Ansible Tower
     • Client Secret: The Client Secret you got from Ansible Tower
     • Default Grant Type: Authorization Code
     • Authorization URL: https://<tower_url>/api/o/authorize/
     • Token URL: https://<tower_url>/api/o/token/
     • Redirect URL: https://<snow_instance_id>.service-now.com/oauth_redirect.do

     Click the Submit button at the bottom.

  4. You should be taken out to the list of all Application Registries. Click back into the application you just created. At the bottom, there should be two tabs: click on the Oauth Entity Scopes tab. Under here, there is a section called Insert a new row.... Double click here, and fill in the field to say Writing Scope. Click on the green check mark to confirm this change. Then, right-click inside the grey area at the top where it says Application Registries and click Save in the menu that pops up.

  5. The writing scope should now be clickable. Click on it, and in the dialog window that you are taken to, type write in the Oauth scope box. Click the Update button at the bottom.

  6. Back in the Application Settings page, scroll back to the bottom and click the Oauth Entity Profiles tab. There should be an entity profile populated - click into it.

  7. You will be taken to the Oauth Entity Profile window. At the bottom, type Writing Scope into the Oauth Entity Scope field. Click the green check mark and update.

  8. Navigate to System Web Services -> REST Messages. Click the blue New button. In the resulting dialog window, fill in the following fields:
     • Name: Descriptive REST Message Name
     • Endpoint: The url endpoint of the Ansible Tower action you wish to do. This can be taken from the browsable API at https://<tower_url>/api
     • Authentication Type: Oauth 2.0
     • Oauth Profile: Select the Oauth profile you created

     Right-click inside the grey area at the top; click Save.

  9. Click the Get Oauth Token button on the REST Message screen. This will generate a pop-up window asking to authorize ServiceNow against your Ansible Tower instance/cluster. Click Authorize. ServiceNow will now have an OAuth2 token to authenticate against your Ansible Tower server.

  10. Under the HTTP Methods section at the bottom, click the blue New button. At the new dialog window that appears, fill in the following fields:
     • HTTP Method: POST
     • Name: Descriptive HTTP Method Name
     • Endpoint: The url endpoint of the Ansible Tower action you wish to do. This can be taken from the browsable API at https://<tower_url>/api
     • HTTP Headers (under the HTTP Request tab): the only HTTP Header that should be required is Content-Type: application/json

You can kick off a RESTful call to Ansible Tower using these parameters with the Test link.
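
As a concrete illustration (the job template ID is hypothetical), launching a job template uses its launch endpoint, and the POST body can optionally pass extra variables if the template prompts for them:

# Endpoint for both the REST Message and the HTTP Method:
#   https://<tower_url>/api/v2/job_templates/42/launch/
# Optional JSON body (expressed here as YAML) to pass extra variables:
extra_vars:
  requested_by: servicenow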


Testing connectivity between ServiceNow and Ansible Tower

Clicking the Test link will take you to a results screen, which should indicate that the RESTful call was sent successfully to Ansible Tower. In this example, ServiceNow kicks off an Ansible Tower Job Template, and the response includes the Job ID in Ansible Tower: 276.


You can confirm that this Job Template was in fact started by going back to Ansible Tower and clicking the Jobs section on the left side of the screen; a Job with the same ID should be in the list (and, depending on the playbook size, may still be in process):


Creating a ServiceNow Catalog Item to Launch an Ansible Tower Job Template

Now that you are able to make outbound RESTful calls from ServiceNow to Ansible Tower, it's time to create a catalog item for users to select in ServiceNow in a production self-service fashion. While in the HTTP Method options, click the Preview Script Usage link:


Copy the resulting script that appears, and paste it into a text editor to reference later.

  1. In ServiceNow, navigate to Workflow -> Workflow Editor. This will open a new tab with a list of all existing ServiceNow workflows. Click on the blue New Workflow button.

  2. In the New Workflow dialog box that appears, fill in the following options:
     • Name: A descriptive name of the workflow
     • Table: Requested Item sc_req_item

     Everything else can be left alone. Click the Submit button.

  3. The resulting Workflow Editor will have only a Begin and End box. Click on the line (it will turn blue to indicate it has been selected), then press delete to get rid of it.

  4. On the right side of the Workflow Editor screen, select the Core tab and, under Core Activities->Utilities, drag the Run Script option into the Workflow Editor. In the new dialog box that appears, type in a descriptive name, and paste in the script you captured from before. Click Submit to save the script.

  5. Draw a connection from Begin to the newly created Run Script box, and another from the Run Script box to End. Afterward, click on the three horizontal lines to the left of the workflow name, and select the Publish option. You are now ready to associate this workflow with a catalog item.

  6. Navigate to Service Catalog -> Catalog Definitions -> Maintain Items. Click the blue New button on the resulting item list. In the resulting dialog box, fill in the following fields:
     • Name: Descriptive name of the Catalog Item
     • Catalog: The catalog that this item should be a part of
     • Category: Required if you wish users to be able to search for this item

     In the Process Engine tab, populate the Workflow field with the workflow you just created. Click the Submit button. You've now created a new catalog item!

  7. Lastly, to run this catalog item, navigate to Self-Service -> Homepage and search for the catalog item you just created. Once found, click the Order Now button. You can see the results page pop up in ServiceNow, and you can confirm that the job is being run in Ansible Tower.

Congratulations! After completing these steps, you can now use a ServiceNow Catalog Item to launch Job and Workflow Templates in Ansible Tower. This is ideal for allowing end users to use a front end they are familiar with in order to perform automated tasks of varying complexities. This familiarity goes a long way toward reducing the time to value for the enterprise as a whole, rather than just the teams responsible for writing the playbooks being used.




Kubernetes Operators with Ansible Deep Dive, Part 2

In part 1 of this series, we looked at operators overall, and what they do in OpenShift/Kubernetes. We peeked at the Operator SDK, and why you'd want to use an Ansible Operator rather than other kinds of operators provided by the SDK. We also explored how Ansible Operators are structured and the relevant files created by the Operator SDK when building Kubernetes Operators with Ansible.

In this, the second part of the deep dive series, we'll:

  1. Take a look at creating an OpenShift Project and deploying a Galera Operator.
  2. Next we'll check the MySQL cluster, then setup and test a Galera cluster.
  3. Then we'll test scaling down, disaster recovery, and demonstrate cleaning up.

Creating the project and deploying the operator

We start by creating a new project in OpenShift, which we'll simply call test:

$ oc new-project test --display-name="Testing Ansible Operator"
Now using project "test" on server "https://ec2-xx-yy-zz-1.us-east-2.compute.amazonaws.com:8443"

We won't delve too much into this role; however, the basic operation is:

  1. Use set_fact to generate variables using the k8s lookup plugin or other variables defined in defaults/main.yml.
  2. Determine if any corrective action needs to be taken based on the above variables. For example, one variable determines how many Galera node pods are currently running. This is compared against the variable defined on the CustomResource. If they differ, the role will add or remove pods as needed.
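
A rough sketch of that pattern inside the role (the label selector, variable names and included task file are illustrative; the CustomResource's spec fields, such as galera_cluster_size, arrive as role variables, and the operator supplies the meta dictionary):

- name: count the Galera pods that are currently running
  set_fact:
    galera_pods_running: "{{ query('k8s', kind='Pod', namespace=meta.namespace, label_selector='app=galera') | length }}"

- name: add pods if the cluster is smaller than the requested size
  include_tasks: scale_up.yml
  when: galera_pods_running | int < galera_cluster_size | int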

To begin the deployment, we have a simple script, which builds the operator image and pushes it to the OpenShift registry for the test project:

$ cat ./create_operator.sh
#!/bin/bash

docker build -t docker-registry-default.router.default.svc.cluster.local/test/galera-ansible-operator:latest .
docker push docker-registry-default.router.default.svc.cluster.local/test/galera-ansible-operator:latest
kubectl create -f deploy/operator.yaml
kubectl create -f deploy/cr.yaml

Before we run this script, we need to first deploy the RBAC rules and custom resource definition for our Galera example:

$ oc create -f deploy/rbac.yaml
clusterrole "galera-ansible-operator" created
clusterrolebinding "default-account-app-operator" created
$ oc create -f deploy/crd.yaml
customresourcedefinition "galeraservices.galera.database.coreos.com" created

Now, we run the script (after using the login command to allow docker to connect to the OpenShift registry we created):

$ docker login -p $(oc whoami -t) -u unused docker-registry-default.router.default.svc.cluster.local
Login Succeeded

$ ./create_operator.sh
Sending build context to Docker daemon 490 kB
...
deployment.apps/galera-ansible-operator created
galeraservice "galera-example" created

In short order, we will see the galera-ansible-operator pod start up, followed by a single pod named galera-node-0001 and a LoadBalancer service which provides our ingress to our Galera cluster:

$ oc get all
NAME DOCKER REPO TAGS UPDATED
is/galera-ansible-operator docker-registry-default.router...:5000/test/galera-ansible-operator latest 3 hours ago

NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/galera-ansible-operator 1 1 1 1 4m

NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/galera-external-loadbalancer 172.30.251.195 172.29.17.210,172.29.17.210 33066:30072/TCP 1m
svc/glusterfs-dynamic-galera-node-0001-mysql-data 172.30.49.250 <none> 1/TCP 1m

NAME DESIRED CURRENT READY AGE
rs/galera-ansible-operator-bc6cd548 1 1 1 4m

NAME READY STATUS RESTARTS AGE
po/galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 4m
po/galera-node-0001 1/1 Running 0 1m

Verifying the MySQL cluster, initial setup and testing

We can use the describe function to see the status of our custom resource, specifically the size we specified:

$ kubectl describe -f deploy/cr.yaml |grep -i size
Galera _ Cluster _ Size: 1

Now that we have a MySQL cluster, let's test it using sysbench. As mentioned above, we have a system from which to do the testing so we can avoid internet round trips. But first, we'll need some info. We need to know the forwarded port we can connect to through the load balancing service created as part of the operator deployment:

$ oc get services

Next, we need to know the IP of the master. We can get this with oc describe:

$ oc describe node ec2-xx-yy-zz-1.us-east-2.compute.amazonaws.com| grep ^Addresses
Addresses: 10.0.0.46,ec2-xx-yy-zz-1.us-east-2.compute.amazonaws.com

So for this test, we'll be connecting to the IP 10.0.0.46 on port XXXXX. The port value 33066 was specified in the spec above, and is the port which will receive the forwarded traffic. We'll export those to make it a little easier to re-use our test commands.

From the test server:

$ export MYSQL_IP=10.0.0.46
$ export MYSQL_PORT=XXXXX

Before running sysbench, we need to create the database it expects (future versions of the Galera operator will be able to do this automatically):

$ mysql -h $MYSQL_IP --port=$MYSQL_PORT -u root -e 'create database sbtest;'

Next, we'll prepare the test by running sysbench using the OLTP read-only test with a table of 1 million rows:

$ sysbench --db-driver=mysql --threads=150 --mysql-host=${MYSQL_IP} --mysql-port=${MYSQL_PORT} --mysql-user=root --mysql-password= --mysql-ignore-errors=all --table-size=1000000 /usr/share/sysbench/oltp_read_only.lua prepare
sysbench 1.0.9 (using system LuaJIT 2.0.4)
Initializing worker threads...
Creating table 'sbtest1'...
Inserting 1000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'

...

Note that we use 150 threads here, as a single MySQL/MariaDB instance defaults to this size for its maximum connections allowed.

So now that everything's ready, let's run our first test with sysbench:

$ sysbench --db-driver=mysql --threads=150 --mysql-host=${MYSQL_IP} --mysql-port=${MYSQL_PORT} --mysql-user=root --mysql-password= --mysql-ignore-errors=all /usr/share/sysbench/oltp_read_only.lua run
sysbench 1.0.9 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 150
Initializing random number generator from current time
Initializing worker threads...
Threads started!
SQL statistics:
    queries performed:
        read:                            174776
        write:                           0
        other:                           24968
        total:                           199744
    transactions:                        12484  (1239.55 per sec.)
    queries:                             199744 (19832.77 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
General statistics:
    total time:                          10.0700s
    total number of events:              12484
Latency (ms):
         min:                                  3.82
         avg:                                120.66
         max:                               1028.51
         95th percentile:                    292.60
         sum:                            1506263.71
Threads fairness:
    events (avg/stddev):           83.2267/42.84
    execution time (avg/stddev):   10.0418/0.02

This was just one run, but re-running a few times produces similar results. So our one-node cluster can process about 20K queries/second. But a cluster with only one member isn't very useful - so let's scale it up. We do this by editing the custom resource we defined earlier and changing the galera_cluster_size variable. For now, we'll spin up to a three-node cluster:

$ oc edit -f deploy/cr.yaml
galeraservice.galera.database.coreos.com/galera-example edited
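
The edit itself just bumps the size field in the custom resource spec; roughly (the group and kind come from the CRD used in this example, while the exact API version and field layout depend on the example repo):

apiVersion: galera.database.coreos.com/v1alpha1
kind: GaleraService
metadata:
  name: galera-example
spec:
  galera_cluster_size: 3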

Next, we can verify OpenShift sees this new value:

$ kubectl describe -f deploy/cr.yaml | grep -i size
Galera _ Cluster _ Size: 3

And in short order, we see the Ansible operator receive an event signalling the change and start working to update the cluster:

$ oc get pods
NAME READY STATUS RESTARTS AGE
galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 30m
galera-node-0001 1/1 Running 0 26m
galera-node-0002 0/1 Running 0 1m
galera-node-0003 0/1 Running 0 56s

And after about a minute (each Galera node has to start and sync data from another member), we see the new pods become ready:

$ oc get pods
NAME READY STATUS RESTARTS AGE
galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 31m
galera-node-0001 1/1 Running 0 27m
galera-node-0002 1/1 Running 0 2m
galera-node-0003 1/1 Running 0 2m

Now that we have a three node cluster, we can re-run the same test as earlier:

$ sysbench --db-driver=mysql --threads=150 --mysql-host=${MYSQL_IP} --mysql-port=${MYSQL_PORT} --mysql-user=root --mysql-password= --mysql-ignore-errors=all /usr/share/sysbench/oltp_read_only.lua run
sysbench 1.0.9 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 150
Initializing random number generator from current time
Initializing worker threads...
Threads started!
SQL statistics:
    queries performed:
        read:                            527282
        write:                           0
        other:                           75326
        total:                           602608
    transactions:                        37663  (3756.49 per sec.)
    queries:                             602608 (60103.86 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
General statistics:
    total time:                          10.0247s
    total number of events:              37663
Latency (ms):
         min:                                  4.30
         avg:                                 39.88
         max:                               8371.55
         95th percentile:                     82.96
         sum:                            1501845.63
Threads fairness:
    events (avg/stddev):           251.0867/87.82
    execution time (avg/stddev):   10.0123/0.01

With dramatic results! Our cluster is now able to process 60K queries per second! How far can we take this? Well, if you noticed our node count at the start, we have five nodes in our k8s cluster, so let's make our Galera cluster match that:

$ oc edit -f deploy/cr.yaml
galeraservice.galera.database.coreos.com/galera-example edited
$ kubectl describe -f deploy/cr.yaml | grep -i size
Galera _ Cluster _ Size: 5

The Ansible operator starts growing the Galera cluster...:

$ oc get pods
NAME READY STATUS RESTARTS AGE
galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 35m
galera-node-0001 1/1 Running 0 32m
galera-node-0002 1/1 Running 0 7m
galera-node-0003 1/1 Running 0 7m
galera-node-0004 0/1 Running 0 38s
galera-node-0005 0/1 Running 0 34s

And again after about a minute or so we have a Galera cluster with five pods ready to serve queries:

$ oc get pods
NAME READY STATUS RESTARTS AGE
galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 36m
galera-node-0001 1/1 Running 0 33m
galera-node-0002 1/1 Running 0 8m
galera-node-0003 1/1 Running 0 8m
galera-node-0004 1/1 Running 0 1m
galera-node-0005 1/1 Running 1 1m

Oddly, the fifth node had a problem, but OpenShift retried it after it failed and it came up and joined the cluster. Great!

So let's rerun our same test once again:

$ sysbench --db-driver=mysql --threads=150 --mysql-host=${MYSQL_IP} --mysql-port=${MYSQL_PORT} --mysql-user=root --mysql-password= --mysql-ignore-errors=all /usr/share/sysbench/oltp_read_only.lua run
sysbench 1.0.9 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 150
Initializing random number generator from current time
Initializing worker threads...
Threads started!
SQL statistics:
    queries performed:
        read:                            869260
        write:                           0
        other:                           124180
        total:                           993440
    transactions:                        62090  (6196.82 per sec.)
    queries:                             993440 (99149.17 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
General statistics:
    total time:                          10.0183s
    total number of events:              62090
Latency (ms):
         min:                                  5.41
         avg:                                 24.18
         max:                                159.70
         95th percentile:                     46.63
         sum:                            1501042.93
Threads fairness:
    events (avg/stddev):           413.9333/78.17
    execution time (avg/stddev):   10.0070/0.00

And now we're hitting nearly 100K queries per second. Our cluster has thus far scaled linearly with the number of nodes we've spun up. At this point, though, we've maxed out the resources of our OpenShift cluster, and spinning up more Galera nodes doesn't help:

$ oc edit -f deploy/cr.yaml
galeraservice.galera.database.coreos.com/galera-example edited
$ kubectl describe -f deploy/cr.yaml | grep -i size
Galera _ Cluster _ Size: 9

$ oc get pods
NAME READY STATUS RESTARTS AGE
galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 44m
galera-node-0001 1/1 Running 0 41m
galera-node-0002 1/1 Running 0 16m
galera-node-0003 1/1 Running 0 16m
galera-node-0004 1/1 Running 0 9m
galera-node-0005 1/1 Running 1 9m
galera-node-0006 1/1 Running 0 1m
galera-node-0007 1/1 Running 0 1m
galera-node-0008 1/1 Running 0 1m
galera-node-0009 1/1 Running 0 1m

$ sysbench --db-driver=mysql --threads=150 --mysql-host=${MYSQL_IP} --mysql-port=${MYSQL_PORT} --mysql-user=root --mysql-password= --mysql-ignore-errors=all /usr/share/sysbench/oltp_read_only.lua run
sysbench 1.0.9 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 150
Initializing random number generator from current time
Initializing worker threads...
Threads started!
SQL statistics:
    queries performed:
        read:                            841260
        write:                           0
        other:                           120180
        total:                           961440
    transactions:                        60090  (5995.71 per sec.)
    queries:                             961440 (95931.35 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
General statistics:
    total time:                          10.0208s
    total number of events:              60090
Latency (ms):
         min:                                  5.24
         avg:                                 24.98
         max:                                192.46
         95th percentile:                     57.87
         sum:                            1501266.08
Threads fairness:
    events (avg/stddev):           400.6000/134.04
    execution time (avg/stddev):   10.0084/0.01

Performance actually decreased a bit! This shows that MySQL/MariaDB are pretty resource-intensive, so if you want to keep scaling out performance you may need to add more resources to the OpenShift cluster itself. Still, at this point our cluster is serving nearly 5x the traffic it handled when we originally started it up. Continued tuning of MySQL/MariaDB and Galera could push that further. However, the goal here was to show how to create an Ansible operator to control a very complex, data-oriented application.

Scaling the cluster down

Since those extra nodes aren't helping out (other than providing a bit more redundancy in the event of a failure), let's scale the cluster back down to five nodes:

$ oc edit -f deploy/cr.yaml
galeraservice.galera.database.coreos.com/galera-example edited
$ kubectl describe -f deploy/cr.yaml | grep -i size
Galera _ Cluster _ Size: 5

After a short while, we see the operator begin to terminate pods that are no longer required:

$ oc get pods
NAME READY STATUS RESTARTS AGE
galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 46m
galera-node-0001 1/1 Running 0 43m
galera-node-0002 1/1 Running 0 18m
galera-node-0003 1/1 Running 0 18m
galera-node-0004 1/1 Running 0 11m
galera-node-0005 1/1 Running 1 11m
galera-node-0006 0/1 Terminating 0 3m
galera-node-0007 0/1 Terminating 0 3m
galera-node-0008 0/1 Terminating 0 3m
galera-node-0009 0/1 Terminating 0 3m

Disaster recovery

Now, let's add some chaos. Looking at our first worker node, xx-yy-zz-2, we can see which pods are running on it:

$ oc describe node ec2-xx-yy-zz-2.us-east-2.compute.amazonaws.com
...
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
openshift-monitoring node-exporter-bqnzv 10m (0%) 20m (1%) 20Mi (0%) 40Mi (0%)
openshift-node sync-hjtmj 0 (0%) 0 (0%) 0 (0%) 0 (0%)
openshift-sdn ovs-55hw4 100m (5%) 200m (10%) 300Mi (4%) 400Mi (5%)
openshift-sdn sdn-rd7kp 100m (5%) 0 (0%) 200Mi (2%) 0 (0%)
test galera-node-0004 0 (0%) 0 (0%) 0 (0%) 0 (0%)
...

So galera-node-0004 is running here, along with some other infrastructure bits. Let's restart the instance from the AWS EC2 console and see what happens...

$ oc get nodes
NAME STATUS AGE
ec2-xx-yy-zz-1.us-east-2.compute.amazonaws.com Ready 1d
ec2-xx-yy-zz-2.us-east-2.compute.amazonaws.com NotReady 1d
ec2-xx-yy-zz-3.us-east-2.compute.amazonaws.com Ready 1d
ec2-xx-yy-zz-4.us-east-2.compute.amazonaws.com Ready 1d
ec2-xx-yy-zz-5.us-east-2.compute.amazonaws.com Ready 1d
ec2-xx-yy-zz-6.us-east-2.compute.amazonaws.com Ready 1d
ec2-xx-yy-zz-7.us-east-2.compute.amazonaws.com Ready 1d
ec2-xx-yy-zz-8.us-east-2.compute.amazonaws.com Ready 1d

Eventually, we see galera-node-0004 enter an unknown state:

$ oc get pods
NAME READY STATUS RESTARTS AGE
galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 50m
galera-node-0001 1/1 Running 0 47m
galera-node-0002 1/1 Running 0 22m
galera-node-0003 1/1 Running 0 22m
galera-node-0004 1/1 Unknown 0 16m
galera-node-0005 1/1 Running 1 16m

After a while the pod is terminated, at which point the Ansible operator restarts it:

$ oc get pods
NAME READY STATUS RESTARTS AGE
galera-ansible-operator-bc6cd548-46b2r 1/1 Running 5 55m
galera-node-0001 1/1 Running 0 52m
galera-node-0002 1/1 Running 0 27m
galera-node-0003 1/1 Running 0 27m
galera-node-0004 1/1 Running 1 1m
galera-node-0005 1/1 Running 1 21m

... and our cluster is back to its requested capacity!

Cleanup

Since this is a test, we'll want to clean up after ourselves. When we're done, we use the delete_operator.sh script to remove the custom resource and the operator deployment:

$ ./delete_operator.sh
galeraservice.galera.database.coreos.com "galera-example" deleted
deployment.apps "galera-ansible-operator" deleted

In a couple of minutes, everything is gone apart from the operator's image stream:

$ oc get all
NAME DOCKER REPO TAGS UPDATED
is/galera-ansible-operator docker-registry-default.router...:5000/test/galera-ansible-operator latest 4 hours ago

Summary

The Galera operator is a work in progress and is most definitely not ready for production. If you'd like to view the playbooks themselves, you can see the code here:

https://github.com/water-hole/galera-ansible-operator

We're going to be continuing development on this with the goal of making it the de facto example for other data storage applications. Thanks for reading!




Kubernetes Operators with Ansible Deep Dive, Part 1

Kubernetes Operators with Ansible Deep Dive, Part 1

This deep dive series assumes the reader has access to a Kubernetes test environment. A tool like minikube is an acceptable platform for the purposes of this article. If you are an existing Red Hat customer, another option is spinning up an OpenShift cluster through cloud.redhat.com. This SaaS portal makes trying OpenShift a turnkey operation.

In this part of the series, we'll:

  1. Take a look at operators overall, and what they do in OpenShift/Kubernetes.
  2. Take a quick look at the Operator SDK, and why you'd want to use an Ansible operator rather than the other kinds of operators provided by the SDK.
  3. Finally, look at how Ansible Operators are structured and the relevant files created by the Operator SDK.

What Are Operators?

For those who may not be very familiar with Kubernetes, it is, in its most simplistic description, a resource manager. Users specify how much of a given resource they want, and Kubernetes manages those resources to achieve the state the user specified. These resources can be pods (which contain one or more containers), persistent volumes, or even custom resources defined by users.

This makes Kubernetes useful for managing resources that don't contain any state (like pods of web servers or load balancing resources). However, Kubernetes doesn't provide any built-in logic for managing resources like databases or caches which are stateful and sensitive to restarts. Operators were created to bridge this gap by providing a way for users to specify a piece of code (traditionally written in Golang) tied to custom resource definitions in Kubernetes.

Operators were so named because they allow you to embed the operational logic of an application into an automated manager running on Kubernetes/OpenShift.

The Operator SDK, and a quick overview of Ansible Operators

Red Hat created the Operator Framework to make the job of creating and managing operators easier across their full lifecycle. As part of the framework, the Operator SDK is tasked with creating and building operators in an automated manner for users. Over time it has grown to support several operator types. In 2018, we began work on adding the Ansible Operator type to the SDK, with the goal of making it easier to build operators for Kubernetes environments based on Ansible.

Why use Ansible for Operators?

At first, operators were written in Golang. This immediately sets the bar somewhat high for anyone who wants to write an operator --- someone has to know a relatively low-level programming language to get started. On top of this, you must also be familiar with Kubernetes internals, such as the API and how events are generated for resources.

The Ansible Operator was created to address this shortcoming. It consists of two main pieces:

  1. A small chunk of Golang code, which handles the interface between Kubernetes/OpenShift and the operator.
  2. A container, which receives events from the above code and runs Ansible Playbooks as required.

That's it! The Ansible Operator and the Operator SDK abstract away all of the difficult parts of writing an operator and allow you to focus on what matters: managing your applications. If you already have a large base of Ansible knowledge in your organization, you can immediately begin managing applications using the Ansible Operator. A further bonus of using Ansible for your operators is that you immediately have access to any module that Ansible can run, which lets you incorporate off-cluster management tasks related to your application. For example:

  1. Creating DNS entries for your newly deployed applications
  2. Spinning up resources external to your cluster, such as storage or networking
  3. Performing off-site backups to external cloud services more easily
  4. Managing external load balancing based on custom metrics

These are just a few of the possibilities that Kubernetes Operators written with Ansible open up.
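To illustrate the first item above, a task along these lines could live in an operator role to publish a DNS record once the application is up. The route53 module is just one option among many, and the zone, record, and target values here are entirely hypothetical:

  - name: Create a DNS entry for the newly deployed application
    route53:
      state: present
      zone: example.com
      record: myapp.example.com
      type: CNAME
      ttl: 300
      value: myapp.apps.cluster.example.com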

Creating a Kubernetes Operator with Ansible from scratch

First, install the Operator SDK following their instructions. Once the install is complete, we can create a new operator with the following command:

$ operator-sdk new test-operator \
    --api-version=test.ansible-operator.com/v1 \
    --kind=Test \
    --type=ansible

INFO[0000] Creating new Ansible operator 'test-operator'.
...
INFO[0000] Project creation complete.

$ cd test-operator/

Kubernetes Operator with Ansible structure and files

Now that we have our Operator skeleton, let's take a look at some of the main files used when deploying Operators in general, as well as the ones the Ansible Operator type generates specifically. These are the:

  1. watches.yaml file.
  2. build directory.
  3. deploy directory.
  4. roles directory.

One other directory is present here as well: the molecule directory, which contains files to automate testing of your roles/playbooks using Molecule. We won't be covering Molecule here; it's noted only for the sake of completeness.

If you run ls -l in the test-operator directory created above, you'll see these files and directories after the new operator skeleton has been generated.

The watches.yaml file

This file is used by the Ansible Operator to tell Kubernetes/OpenShift which custom resources (based on the Group/Version/Kind fields) the operator is responsible for handling. It is the glue that ties our custom code to the Kubernetes API:

---
- version: v1
  group: test.ansible-operator.com
  kind: Test
  role: /opt/ansible/roles/test

The role entry above tells the operator to run that role directly when an event arrives for a matching resource, without specifying any other playbook boilerplate. However, if you are running more than one role in your operator, you can change that line to:

playbook: /opt/ansible/playbook.yaml

Also, you'll need to tweak the build/Dockerfile (more on this below) to copy the playbook into the container, so add this line:

COPY playbook.yaml ${HOME}/playbook.yaml

You would then create the specified playbook in the same directory as the watches.yaml file.
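That playbook can be as small as a single play against localhost that pulls in the roles your operator needs. A minimal sketch, with a hypothetical second role name:

---
- hosts: localhost
  gather_facts: no
  connection: local
  roles:
    - test
    - another_role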

The build directory

This directory contains a few files related to building the operator artifact. Because operators are just another application to OpenShift/Kubernetes, this artifact is a container built using a Dockerfile. The other files here are related to testing via Molecule, which we are not going to cover in this blog series.

The Dockerfile is very simple, so we won't delve into it much other than to say it is based on the ansible-operator image from quay.io and copies the roles directory and the watches.yaml file into the container image.

The deploy directory

This directory contains YAML files for deploying the operator into OpenShift/K8s using the oc CLI commands.

The CustomResourceDefinition (CRD) and CustomResource (CR) are defined in the deploy/crds/ directory. The CRD is what the watches.yaml file references, meaning all instances (CRs) of this definition will be controlled by our operator.

The CRD is defined in deploy/crds/test_v1_test_crd.yaml and is mostly boilerplate for OpenShift/Kubernetes:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: tests.test.ansible-operator.com
spec:
  ...

You can see the operator-sdk command above filled in most of these fields with the values we specified. By themselves, CRDs are not very useful; you need actual instances of what they define, and that is what CustomResources provide. Our CustomResource (CR) is defined in deploy/crds/test_v1_test_cr.yaml, and is relatively short (compared to the other YAML files, anyway):

apiVersion: test.ansible-operator.com/v1
kind: Test
metadata:
  name: example-test
spec:
  size: 3

Each of the values set under the spec entry becomes a variable passed into Ansible as an extra variable. Using these, we can customize the behavior of our operator. The default example creates an entry named size, which we can use in our roles to dynamically scale the application our operator is managing.
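For example, because size arrives as an ordinary Ansible variable, a role task can reference it directly. A trivial, purely illustrative sketch:

  - name: Show the requested size from the custom resource
    debug:
      msg: "The custom resource requested {{ size }} replicas"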

The deploy/role.yaml and deploy/role_binding.yaml files (not shown) define RBAC controls that grant access to manage the custom resources defined above. Role Based Access Control (RBAC) is not covered in this post, so again we're just mentioning these files for completeness.

Finally, the deploy/operator.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-operator
spec:
  ...

This file is quite long, but mainly it creates a new Deployment resource in OpenShift/Kubernetes, which helps ensure that our operator stays up and running.

The roles directory

This is the directory where you place any roles you wish to include with your operator, and it should be familiar to experienced Ansible users. As noted above, this directory is copied in its entirety into the Ansible Operator container, and the roles here can be referenced in the watches.yaml file or in other playbooks you include.

Roles commonly use the k8s module (included in Red Hat Ansible Automation since the 2.6 release) to manage resources on the cluster. If you are familiar with Kubernetes resource files, this module will be very intuitive (the YAML from a resource file can be copy/pasted directly as the input to this module). To learn more, you can read the documentation for the k8s module here:

https://docs.ansible.com/ansible/latest/modules/k8s_module.html
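As a purely illustrative example (not part of the generated skeleton), a role task using the k8s module to create a Service for the managed application might look like the following. The labels and port numbers are hypothetical, and the meta variable holding the CR's name and namespace is supplied to the role by the Ansible Operator:

  - name: Create a Service for the managed application
    k8s:
      definition:
        apiVersion: v1
        kind: Service
        metadata:
          name: '{{ meta.name }}-service'
          namespace: '{{ meta.namespace }}'
        spec:
          selector:
            app: '{{ meta.name }}'
          ports:
            - protocol: TCP
              port: 80
              targetPort: 8080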

Summary

This concludes our deep dive into operators, Operator SDK, and Ansible Operator creation and structure. Operators written using Ansible give you the power of operators in general, while allowing you to leverage preexisting Ansible expertise to quickly get up to speed on deploying applications on OpenShift or Kubernetes.




The Future of Ansible Content Delivery

The Future of Ansible Content Delivery

Every day, I'm in awe of what Ansible has grown to be. The incredible growth of the community and the viral adoption of the technology have resulted in a content management challenge for the project.

I don't want to echo a lot of what's been said by our dear friend Jan-Piet Mens or our incredible Community team, but give me a moment to take a shot at it.

Our main challenge is rooted in the ability to scale. The volume of pull requests and issues we see day to day severely outweighs the ability of the Ansible community to keep up with that rate of change.

As a result, we are embarking on a journey. This journey is one that we know that the community, both our content creators and content consumers, will be interested in hearing about.

This New World Order (tongue in cheek), as we've been calling it, is a model that will allow us to empower the community of contributors of Ansible content (read: modules, plugins, and roles) to provide their content at their own pace.

To do this, we have made some changes to how Ansible leverages content that is not "shipped" with it. In short, Ansible content will no longer have to be part of a milestone Core release of the Engine itself. We will be leveraging a delivery process and a content structure/format that alleviate much of the ambiguity and pain that currently exists due to tying plugins to the Core Engine.

The cornerstone of this journey is something you may have heard rumblings of out on the interwebs: the Ansible Content Collection, or Collection(s) for short.

To create Ansible Content Collections, we took a look at a lot of things already in practice. We looked at other tools, other packaging formats, delivery engines, repositories, and ultimately, ourselves. From all of that investigation, we feel we have come up with a pretty sound spec. Below we cover some of the details.

A Collection is a strict project/directory structure for Ansible content. Similar to the role directory structure, it highlights what is important to Ansible Playbook execution. Here's a graphic of that spec, created by my teammate, Tim Appnel.

Screenshot_future-of-content-1

As you can see, this structure looks very similar to roles, but there are some slight differences. Notice that the roles directory no longer contains a library folder? The idea here is that a Collection itself is the true encapsulation of every piece of content relevant to it and of the playbook that executes that content. So we've taken the libraries out of the various roles that could live in a collection and placed them at the top level, in the plugins directory. There, all types of plugins (yes, modules are there too, because modules are actually plugins) are usable by the roles and ultimately by any playbook that calls them. This content will be "installed" in a location the Engine is aware of, so it knows where to look for the content a playbook calls.
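In text form, the layout in that graphic looks approximately like this (a rough sketch of the spec as it stood at the time, not an authoritative reference):

collection/
├── galaxy.yml
├── README.md
├── docs/
├── plugins/
│   ├── modules/
│   ├── lookup/
│   └── filter/
├── roles/
├── playbooks/
└── tests/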

Also, with these changes, we have introduced some namespacing concepts into playbooks as well. Here's another graphic, by Tim, that is a snippet out of a playbook that highlights that namespacing.

Screenshot_future-of-content-2

So what we've got here is a very simple playbook. In this playbook we have highlighted the list of Collections that we're interested in using. For each task, we are using the FQCN (Fully Qualified Collection Name) path to the module. Of course, we still want to keep this simple, so playbook creators won't always have to fully qualify their content path. As you can see in the fourth task, creators can still use the shorthand name of a module; Ansible will search the list of collections on a first-come, first-served basis, as defined in the Ansible configuration or within the play itself.
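A minimal sketch of a playbook following that pattern (the namespace, collection, and module names here are invented for illustration):

---
- hosts: all
  collections:
    - my_namespace.my_collection
  tasks:
    - name: Call a module by its fully qualified collection name
      my_namespace.my_collection.my_module:
        state: present

    - name: Call the same module by its short name
      my_module:
        state: present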

That's about all I've got on Collections for now.

Happy Automating folks!