Desirable Databases Features
Postgres is a candidate database since it is already in use by Tower. MongoDB is the other database considered for an in depth analysis for many reasons including: (1) document store of the JSON structure, (2) arbitrary structure query support, (3) and the ability to run on a single machine and later scale.
Comparing the two databases begins with identifying our System Tracking requirements. What is System Tracking?
For Tower, System Tracking is time based snapshots of machine facts. Facts are key value pairs of system state. The inspiration comes from the open source Ansible. Ansible facts will be supported in System Tracking along with 3 other fact types: packages (i.e. rpm or apt-get), services (i.e. mongodb, apache2), and files (i.e. /var/log/message). We support these 3 facts as fact modules. We also want to allow for customer built modules. Facts generated by the 4 core modules plus user created modules would be stored in Tower’s System Tracking database; allowing customers to take advantage of our fact visualization features. Below are concrete examples of fact structures.
Side note. Many other databases (i.e Riak, CouchDB, Cassandra) were looked at for this in-depth comparison.
A JSON structure is easily consumable by Ansible. In this form, an Ansible user may reference facts in their playbooks, much like they already do with Ansible facts. It is not a requirement that the facts be stored in a JSON structure. However, it is a requirement that the facts be represented in a JSON structure for Ansible consumption and Tower API purposes.
Example packages JSON fact structure.
Side note. We did not figure out the module (i.e. packages, services, files) fact structure all at once. We took inspiration from the Ansible open source fact structure and worked on the package structure first. We communicated with the Ansible open source team to flush out a structure that was easiest to consume by playbooks. The other module fact structures followed.
At this point we have a good understanding of our data to be stored and retrieved. We now need to understand how the data will be presented to our users. Our UI team identified 2+1 views of the data (I will explain the +1).
The first view presents package version for comparison. We refer to this view as the host compare view.
The second view presents a single fact across a number of hosts at a given point in time. We call this view single fact.
Side note. Identifying the views actually came first in our system tracking feature process.
The third view (2+1) does not stand on its own but is complex enough to be called a view. For a chosen host, present the times at which facts were collected. We call this view the timeline.
From the above views we were able to identify common queries needed to generate the desired views. Full describing these queries required identifying the query parameters. Below are the identified queries and parameters. Note there is a 1-1 mapping to the views. This was a nicety. Future views may require multiple queries to generate.
MongoDB Data Structures
One database; two collections. (1) Host containing the hostname plus any other host details and (2) Fact containing a reference to host, timestamp, module name, and set of facts from the scan. Indexing is heavily used.
Postgres Data Structure
Making the aforementioned queries performant in Postgres requires a more granular data structure. Each key value pair from a fact scan is stored as a separate record. Fact key value pairs are differential. A fact record is only created if a fact has changed; otherwise the timestamp is updated to reflect the continued validity of the matching fact. This design was chosen to take advantage of indexing as much as possible. Other needed meta-data tables exist (i.e. host)..
With the queries and data identified we can now begin to construct our experiments. Since the system tracking will be an Enterprise customer offering we need to support facts for > 10k hosts for a period of 1-5 years. Thus, we must ensure (1) our queries are performant and our (2) data overhead is reasonable for a large data set.
We conduct an experiment for each query for a total of 3 experiments.
For all experiments the same data set is used. The database state was examined to verify data set was created correctly (see table below).
Each query is executed multiple times with randomized values chosen for query parameters. The number of query executions chosen resulted in a 99% confidence interval and (4) aggregating the query results (i.e. query time average, min, max, stdev).
For example let’s look at the host compare experiment.
4 TB EBS Storage RAID 0 SSD
*Postgres required we use a r3.large instance to speed up the database instantiation process. After the database was instantiate, the volume was attached to the m3.medium instance. Before we performed this step we decided we might as well run the experiment on the large instance (see Postgres Query Performance).
Note that the Y axis in the graph above is a log scale.
When ran on the same hardware configuration, m3.medium, Mongo outperformed postgres by orders of magnitude for the timeline and host compare query and a single order of magnitude for the single fact query. The observed large performance difference led us to run the postgres tests on a r3.large instance. We used the r3.large instance that we used to batch load in the data set to run the experiment. From the results you can see that Postgres performed better on the r3.large instance but that mongo still performed orders of magnitude better.
The MongoDB storage data overhead is acceptable and is close to the raw data being stored. Analyzing the Postgres storage data overhead is more complex due the differential design. The storage data overhead is sensitive to the fact change percentage; at 5% the storage requirements are less than MongoDB. Larger, real-world fact differences, would result in large storage requirements.
We chose MongoDB for our System Tracking database. MongoDB performs orders of magnitude better than Postgres on the same, even double, the hardware and has other desirable features (i.e. arbitrary JSON structure querying, horizontal scaling).
Postgres is in current use by our enterprise offering, Tower. From an operations standpoint, Postgres is an easy choice. Our inclusion of the experiment running on an r3.large instance shows that, given more resources, Postgres performs better for our System Tracking workload. Going forward with Postgres for the System Tracking feature would require development work not required by choosing MongoDB (i.e. serialize/deserialize fact data, converting user fact queries to sql).