Blue-Green Deployment Strategies for PCF Microservices

Blue-green deployment is a well-known pattern for updating software components by switching between simultaneously available environments or services. The context in which a blue-green deployment strategy is used can vary: switching between data centers, between web servers in a single data center, or between microservices in a Pivotal Cloud Foundry (PCF) deployment.

In a microservices architecture, it’s often challenging to monitor and measure the performance of a microservice updated via blue-green deployment, specifically when determining the impact on consumers of the service, the overall business process, and existing service level agreements (SLAs).

But there’s good news. With its built-in router, its commands for easily managing app requests, and its sound orchestration of containerized apps, PCF makes implementing a blue-green deployment trivial.

Our Example App

In this blog, we’ll focus on a simplified PCF Spring Boot microservice that implements a REST service to process orders, and a single orders endpoint for posting new orders. For simplicity’s sake, this service has a single “downstream” dependency on an account microservice, which it calls via the account service REST API.

You’ll find the example app here: https://github.com/jaholmes/orderapp.

The order-service uses the account-service to create an account. In this scenario, a user submits an order and requests that a new account be created as part of the order submission.

Deploying the Green Version of the Account Service

We will target the account-service to perform a blue-green deployment of a new version of the account-service, which includes some bug fixes. We’ll perform the blue-green deployment using the CF CLI map-route and unmap-route commands.

When we push the account-service app, we’ll adopt an app-naming strategy that appends -blue or -green to the app name, and assume our deployment pipeline would automatically switch between the two suffixes from one deployment to the next.

So our initial deployment, based on the manifest.yml here, would be:

$ cf push account-service-blue -i 5

After pushing this app, we create the production route and map it to the blue version of the app using cf map-route.

$ cf map-route account-service-blue apps.<pcf-domain> --hostname prod-account-service

Then we create the user-provided service.

$ cf cups account-service -p '{ "route": "prod-account-service.apps.<pcf-domain>" }'

And bind the order-service to this user-provided service by referencing it in the order-service’s manifest.yml.

When pushed, the order-service app consumes the account-service route from its environment variables, and uses this route to communicate with the account-service, regardless of whether it’s the blue or green version of the account-service.
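To make the binding concrete, here is a minimal sketch in Python of how an app could read that route from the VCAP_SERVICES environment variable that Cloud Foundry sets for bound services. The example order-service is a Spring Boot app, so this only illustrates the lookup and is not its actual code; the function name and the example.com domain are assumptions.

```python
import json
import os


def account_service_route(environ=os.environ, service_name="account-service"):
    """Return the 'route' credential of the named user-provided service, or None."""
    vcap = json.loads(environ.get("VCAP_SERVICES", "{}"))
    # Services created with `cf cups` are listed under the "user-provided" label.
    for service in vcap.get("user-provided", []):
        if service.get("name") == service_name:
            return service.get("credentials", {}).get("route")
    return None


if __name__ == "__main__":
    # Hypothetical environment mirroring the cf cups command above.
    sample_env = {
        "VCAP_SERVICES": json.dumps({
            "user-provided": [{
                "name": "account-service",
                "credentials": {"route": "prod-account-service.apps.example.com"},
            }]
        })
    }
    print(account_service_route(sample_env))
```

Because the app only ever sees the production route, nothing in it needs to change when the route is remapped from blue to green.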

The initial deployment shows both the order-service and the blue account-service running with five instances.

Both services have started the AppDynamics APM agent integrated in the Java buildpack and are reporting to an AppDynamics controller. The flowmap for the order service/orders endpoint shows the requests flowing from the order-service to the account-service, and the response time averaging 77 milliseconds (ms). A majority of that time is being consumed by the account-service. The instances are represented as nodes on the flowmap:

Performing Blue/Green Deployment

Now we’re ready to push the updated “green” account-service that implements fixes to known issues. We’ll push it and change the app name to “account-service-green” so it’s deployed separately from account-service-blue.

$ cf push account-service-green -i 5

At this point, the Apps Manager shows both versions of the account-service app running, but only the blue version is receiving traffic.

We can validate this by referencing a monitoring dashboard that displays calls per minute and response time for the blue and green versions of the account-service. Below, the dashboard shows no activity for account-service-green.

This dashboard distinguishes between blue and green versions by applying a filtering condition, which matches node names that start with “account-service-blue” or “account-service-green.”

These node-matching criteria match the node names assigned by the AppDynamics integration included in the Java buildpack, which uses the pattern <pcf-app-name>:<instance id>. Below is a list of the node names reporting under the accountservice tier that shows this pattern.
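As an illustration of the matching logic only (this is not the dashboard's actual configuration), the "starts with" filter can be sketched in Python. The function names and the sample node names in the usage section are hypothetical; only the <pcf-app-name>:<instance id> pattern comes from the buildpack integration described above.

```python
def deployment_color(node_name, app_base="account-service"):
    """Return "blue", "green", or None for a node named <pcf-app-name>:<instance id>."""
    for color in ("blue", "green"):
        # Mirrors the dashboard filter: node name starts with the colored app name.
        if node_name.startswith(f"{app_base}-{color}"):
            return color
    return None


def group_by_color(node_names):
    """Split node names into blue and green groups, ignoring unrelated nodes."""
    groups = {"blue": [], "green": []}
    for name in node_names:
        color = deployment_color(name)
        if color is not None:
            groups[color].append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical node names following the naming pattern.
    nodes = ["account-service-blue:0", "account-service-green:0", "order-service:2"]
    print(group_by_color(nodes))
```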

To complete the blue-green deployment, we use the cf map-route command to change the routing from the blue to green version.

$ cf map-route account-service-green apps.<pcf-domain> --hostname prod-account-service

$ cf unmap-route account-service-blue apps.<pcf-domain> --hostname prod-account-service

This instructs the CF router to route all requests to prod-account-service.apps.<pcf-domain> (used by the order-service app) to the green version. At this point, we want to evaluate as quickly as possible the performance impact of the green version on the order-service.

Our blue-green dashboard shows the traffic has indeed switched from the blue to the green nodes, but performance has degraded. Our blue version of the account-service was averaging well below 100 ms, but the green version is showing an uptick to around 150 ms (artificially introduced for the sake of example).

We see a proportional impact on the order service, which is taking an additional 100 ms to process requests. This could be a case where a rollback is necessary, which again is straightforward using cf map-route and unmap-route.
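A rollback is the same pair of commands with the colors reversed. The sketch below is a hypothetical Python helper, not part of the example app or its pipeline, that builds the two cf CLI invocations and prints them by default. The helper names and the example.com domain are assumptions; mapping the target app before unmapping the old one keeps at least one app behind the route throughout.

```python
import subprocess


def switch_route_commands(to_app, from_app, domain, hostname):
    """Build the cf commands that move a route from one app to another."""
    return [
        # Map first so the route always has an app behind it.
        ["cf", "map-route", to_app, domain, "--hostname", hostname],
        ["cf", "unmap-route", from_app, domain, "--hostname", hostname],
    ]


def switch_route(to_app, from_app, domain, hostname, dry_run=True):
    """Print the commands, or run them with the cf CLI when dry_run is False."""
    for cmd in switch_route_commands(to_app, from_app, domain, hostname):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Roll back: send traffic to blue again and take green out of the route.
    switch_route("account-service-blue", "account-service-green",
                 "apps.example.com", "prod-account-service")
```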

Baselines and Health Rules

Rather than relying strictly on dashboards to decide whether to roll back a deployment, we can establish thresholds in health rules that compare performance to baselines based on historical behavior.

For example, we could create health rules with thresholds based on the baseline or average performance of the order-service, and alert if the performance exceeds that baseline by two standard deviations.
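The arithmetic behind such a threshold is simple. The following Python sketch only illustrates the two-standard-deviation check; it is not how AppDynamics computes baselines internally, and the function name and sample response times are made up for the example.

```python
from statistics import mean, pstdev


def exceeds_baseline(history_ms, current_ms, num_std_devs=2.0):
    """Return True when current_ms is above baseline + num_std_devs * std dev."""
    baseline = mean(history_ms)
    threshold = baseline + num_std_devs * pstdev(history_ms)
    return current_ms > threshold


if __name__ == "__main__":
    # Hypothetical history averaging 77 ms, like the order-service before the switch.
    history = [70, 75, 80, 77, 83]
    print(exceeds_baseline(history, 177))  # 100 ms slower than the baseline
    print(exceeds_baseline(history, 80))   # within normal variation
```

A baseline built from historical behavior lets the same rule work for services with very different normal response times, since the threshold scales with each service's own variability.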

When we deployed the green version, we were quickly alerted to a performance degradation of the order-service, as shown in our dashboard:

We also defined a health rule that focuses on the aggregate baseline of the account-service (including average performance of all blue and green versions) to determine if the latest deployment of the account-service is behaving poorly. Again, this produced an alert based on our poorly performing green version:

The ability to identify slower-than-normal transactions and capture detailed diagnostic data is also critical to finding the root cause.

In the case of the slower account-service version, we can drill down within snapshots to the specific lines of code responsible for the slowness.

Monitoring Impact on a Business Process

In the previous dashboards, we tracked the impact of an updated microservice on clients and the overall application. If the microservice is part of a larger business process, it would also be important to compare the performance of the business process before and after the microservice was updated via blue-green deployment.

In the dashboard below, the order microservice, which depends on the account microservice, may be part of a business process that involves converting offers to orders. In this example, conversion rate is the key performance indicator we want to monitor, and a negative impact on conversion rates would be cause to consider a rollback.
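As a rough sketch of that comparison, assuming conversion rate is simply orders divided by offers, the before-and-after check might look like the Python below. The function names, the 10 percent tolerance, and the sample counts are all hypothetical; a real rule would be tuned to the business process.

```python
def conversion_rate(orders, offers):
    """Fraction of offers that were converted to orders."""
    if offers == 0:
        return 0.0
    return orders / offers


def should_consider_rollback(before, after, max_relative_drop=0.10):
    """before/after are (orders, offers) tuples for the two time windows."""
    rate_before = conversion_rate(*before)
    rate_after = conversion_rate(*after)
    if rate_before == 0:
        return False  # nothing to compare against
    relative_drop = (rate_before - rate_after) / rate_before
    return relative_drop > max_relative_drop


if __name__ == "__main__":
    # Hypothetical counts: 20% conversion before the switch, 15% after.
    print(should_consider_rollback((200, 1000), (150, 1000)))
```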

The Power of APM on PCF

While PCF makes deploying microservices relatively simple, monitoring the impact of updates is complex. It’s critical to have a powerful monitoring solution like AppDynamics, which can quickly and automatically spot performance anomalies and identify impacts on service clients and overall business processes.

Understanding Performance of PayPal as a Service (PPaaS)

In a previous post, “Agile Performance Testing – Proactively Managing Performance,” I discussed some of the challenges faced in managing a successful performance engineering practice in an Agile development model. Let’s continue with a real-world example, highlighting how AppDynamics simplifies the collection and comparison of Key Performance Indicators (KPIs) to give visibility into an Agile development team’s integration with PayPal as a Service (PPaaS).

Our dev team is tasked with building a new shopping cart and checkout capability for an online merchant. They have designed a simple Java Enterprise architecture with a web front-end, built on Apache TomEE, a set of mid-tier services, on JBoss AS 7, and have chosen to integrate with PayPal as the backend payment processor. With PayPal’s Mobile, REST and Classic SDKs, integrating secure payments into their app is a snap and our team knows this is a good choice.

However, the merchant has tight Service Level Agreements (SLAs) and it’s critical the team proactively analyze, and resolve, performance issues in pre-production as part of their Agile process. In order to prepare for meeting these SLAs, they plan to use AppDynamics as part of development and performance testing for end-to-end visibility, and to collect and compare KPIs across sprints.

The dev team is agile and continuously integrates into its QA test and performance environment. During one of the first sprints they created a basic checkout flow, which is shown below:

For this sprint they stubbed several of the service calls to PayPal, but coded the first step in authenticating — getting an OAuth Access Token, used to validate payments.

Enabling AppDynamics on their application was trivial, and the dev team got immediate end-to-end visibility into their application flow, performance timings across all tiers, as well as the initial call to PayPal. Based on some initial performance testing everything looks great!

NOTE: in our example AppDynamics is configured to identify backend HTTP Requests (REST Service Invocations) using the first 3 segments of the target URL. This is an easy change and the updated configuration is automatically pushed to the AppDynamics agent without any need to change config files, or restart the application.
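To illustrate what grouping by the first 3 segments means, here is a small Python sketch. It is not the AppDynamics implementation: the function name, the choice to prefix the host, and the interpretation of "segments" as URL path segments are assumptions, and the payment ID in the usage example is made up.

```python
from urllib.parse import urlsplit


def backend_name(url, segments=3):
    """Name a backend by its host plus the first `segments` path segments."""
    parts = urlsplit(url)
    path_segments = [s for s in parts.path.split("/") if s]
    return parts.netloc + "/" + "/".join(path_segments[:segments])


if __name__ == "__main__":
    # Calls that differ only after the third segment collapse into one backend.
    print(backend_name("https://api.sandbox.paypal.com/v1/oauth2/token"))
    print(backend_name("https://api.sandbox.paypal.com/v1/payments/payment/PAY-123/execute"))
```

Truncating at a fixed depth keeps per-resource identifiers out of the backend name, so each REST operation shows up as one entry instead of one per payment.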

In a later sprint, our dev team finished integrating the full payments process flow. They’re using PayPal’s SDK and while it’s a seamless integration, they’re unclear exactly what calls to PayPal are happening under the covers.

Because AppDynamics automatically discovers, maps, and scores all incoming transactions end-to-end, our dev team was able to get immediate and full visibility into two new REST invocations, authorization and payment.

The dynamic discovery of AppDynamics is extremely important in Agile, continuous integration, or continuous release models where code is constantly changing. Having to manually configure what methods to monitor is a burden that degrades a team’s efficiency.

Needing to understand performance across the two sprints, the team leverages AppDynamics’ Compare Releases functionality to quickly understand the difference between performance runs across the sprints.

The AppDynamics flow map visualizes the difference in transaction flow between the sprints, highlighting the additional REST calls required to fully process the payment. Also, the KPI comparison gives the dev team an easy way to quickly measure the differences in performance.

Performance has changed, as expected, when implementing the full payment processing flow. During a performance test AppDynamics automatically identifies and takes diagnostics on the abnormal transactions.

Transaction Snapshots capture source line of code call graphs, end-to-end across the Web and Service tiers. Drilling down across the call graphs, the dev team clearly identifies the payment service as the long running call.

AppDynamics provides full context on the REST invocation, and highlights that the SDK was configured to talk to PayPal’s sandbox environment, explaining the occasional high response times.

To recap, our Agile dev team leveraged AppDynamics to get deep end-to-end visibility across their pre-production application environment. AppDynamics release comparison provided the means to understand differences in the checkout flows across sprints, and the dynamic discovery, application mapping, and automatic detection allowed the team to quickly understand and quantify their interactions with PayPal. When transactions deviated away from normal, AppDynamics automatically identified and captured the slowness to provide end-to-end source line of code root-cause analysis.

Take five minutes to get complete visibility into the performance of your production applications with AppDynamics today.

Going Mobile with AppDynamics REST API

It’s always great when customers want to build their own applications on top of your data and platform. A few weeks back one of our customers in Europe decided to build their own mobile application, so they could monitor the performance of their mission-critical business transactions from any smart phone or mobile device. Here is the unedited story we received of how this customer went mobile with AppDynamics:

I looked into AppDynamics’ REST API and was very keen to use that data but was unsure of how I could visualize it. Since all data per monitoring point was available in either XML or JSON format, it seemed the ideal choice to go with a JavaScript user interface. After searching around I found the Dojo Gauges and started to code up a simple webapp using AppDynamics’ REST API and data we collect every day.

Technology Stack

Here is a quick overview of the architecture and technologies used in the mobile application. 

The Mobile Application

Our application is hosted within Django so I can use Django’s powerful dynamic admin interface for a backend. This is not implemented yet, so for now I am just using Django to serve some static content and proxy AJAX calls to AppDynamics.

The core application consists of:

  • index.html
  • dojo + jquery
  • views.js + settings.js
  • widget-glossygauge.js
  • widget-jira.js
  • images

Functionally, the webapp serves the basic static page layout to the browser on the client device and then instantiates the Dojo GlossyGauges, which I have extended to include their own embedded self-update timers. The update timers update the gauge values by making REST calls to AppDynamics through the proxy module. The proxy module is necessary because the browser blocks cross-domain AJAX requests. You probably don’t need it if you host the application on the same server as AppDynamics and use Apache + mod_proxy.
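For readers who want a feel for the data flow, here is a hedged Python sketch of the two steps a proxy like this performs: building a metric-data request for the controller and pulling the latest value out of the JSON response. It is not the customer's code. The helper names, controller host, application name, and the assumed response shape (a list of metrics, each with a metricValues list) are assumptions to check against your controller's REST API documentation.

```python
import json
from urllib.parse import quote, urlencode


def metric_data_url(controller, application, metric_path, minutes=5):
    """Build a metric-data REST URL that asks for JSON output."""
    query = urlencode({
        "metric-path": metric_path,
        "time-range-type": "BEFORE_NOW",
        "duration-in-mins": minutes,
        "output": "JSON",
    })
    app = quote(application)
    return f"{controller}/controller/rest/applications/{app}/metric-data?{query}"


def latest_value(payload):
    """Return the most recent value from a metric-data JSON payload, or None."""
    metrics = json.loads(payload)
    if not metrics or not metrics[0].get("metricValues"):
        return None
    return metrics[0]["metricValues"][-1]["value"]


if __name__ == "__main__":
    # Hypothetical controller, application, and Information Point metric path.
    print(metric_data_url("https://controller.example.com", "Shop",
                          "Information Points|Checkout|Average Response Time (ms)"))
    print(latest_value('[{"metricName": "x", "metricValues": [{"value": 120}, {"value": 95}]}]'))
```

A gauge's update timer would call the proxy on an interval and set its needle to the value returned by the second step.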

Here is a screenshot of our mobile application:

The gauges update themselves in real-time so there is no need to refresh the page.

Adding new gauges is as simple as exporting the Information Point REST-URL from AppDynamics and adding it to my settings.js file and then creating a view for it in views.js. This manual process will be replaced by simply adding it to the Django admin interface at a later stage and then dynamically generating views.js and settings.js via a django view.

It was also simple enough to extend the interface to get current open cases from JIRA and retrieve current events from our CMS systems.

It is also possible to show our performance data by location on a map layout as shown below:

If you would like to share any applications, plugins or custom reports that utilize AppDynamics data then drop me an email at: appman @ appdynamics (.) com.