Top Takeaways from KubeCon + CloudNativeCon

The Emerald City played host to the North American edition of KubeCon + CloudNativeCon 2018, bringing together technologists eager to share their knowledge of Kubernetes and Cloud Native topics. The event, hosted by the Cloud Native Computing Foundation (CNCF), sold out over a month before it opened, a strong sign of growing interest in all things cloud.

The figurative 800-pound gorilla at the conference was, of course, Kubernetes (written shorthand as K8s). Perhaps to soften K8s' reputation for complexity, Phippy the yellow giraffe and her animated friends were on hand to join the CNCF. Popularized by Matt Butcher and Karen Chu's "The Illustrated Children's Guide to Kubernetes," Phippy was brought onboard to help developers explain resource management in distributed Cloud Native applications to friends, family and anyone else who will listen.

The Omnipresent Cloud

Cloud computing continues to evolve at a rapid pace, and with the rise of IoT and edge computing, the cloud is showing up everywhere. Thanks to the hard work spearheaded by the CNCF, cloud-optimized platforms—everything from public cloud infrastructure to a secure private cloud in a datacenter—are now open to a broader range of workloads. You may be interacting with a Cloud Native workload right now, in fact. From your car acting as an edge cloud (or node)—making sure you arrive at your destination safely and on time—to the infrastructure powering this blog, the cloud is ubiquitous and evolving.

Projects Charging Forward

As noted by Aqua Security technology evangelist Liz Rice in her keynote, the CNCF has seen tremendous growth in recent years, both in the number of projects and in attendance, with this year's Seattle turnout more than doubling 2017's in Austin. Key projects are advancing, too. Envoy, which provides a scalable service mesh for communications between microservices and components, and Prometheus, an open-source systems monitoring and alerting toolkit, became graduated projects this year, joining Kubernetes at the CNCF's highest maturity level. In addition, a year-over-year comparison of the CNCF Landscape (an interactive guide to Cloud Native technologies) shows more icons being squeezed onto a single webpage—a good indication of more CNCF projects and vendors being brought onboard.

If Kubernetes is Happy, Are You?

The Kubernetes Web UI can give you a quick view of cluster health. And if you don't see any red on the dashboard, all's right with the world…right? Not necessarily. Cluster health only tells part of the story. Kubernetes is designed to be generic, accommodating a wide swath of workloads. Applications that hog cluster infrastructure by triggering an autoscaler can go unnoticed until the cluster limits are reached. Advancements like Operators can help make Kubernetes more application-specific, but too often the end-to-end picture of the user journey is incomplete. AppDynamics adds value here by monitoring both the Kubernetes platform and the workloads placed on it, giving you the efficiency and health of the platform along with the complete picture of the user journey—which can traverse Kubernetes and non-Kubernetes workloads.

Focus on Your Customers

With technology changing so rapidly, it's easy to feel left behind if you're not adopting the latest stack. At the conference, Aparna Sinha, Google's Product Manager for Kubernetes, gave an excellent interview on trends and capabilities of Kubernetes and Knative. A lot of what is driving this rapid change is direct feedback from Google's customers.

Organizations of all sizes are striving to strengthen the customer experience, but often it’s challenging to justify technology infrastructure to meet business or customer outcomes. AppDynamics is the premier platform for monitoring and enhancing the user journey across your Cloud Native applications. We’re continuing to expand our level of automation in orchestrators such as Kubernetes and Mesos, and we’re working to enable these platforms to take truly autonomous actions to enhance the user experience.

AIOps: Decision Center for Orchestrators

AIOps is one of those buzzwords that seems to be everywhere these days, and for good reason. The power that AIOps brings is its ability to guide orchestrators to act by analyzing and hypothesizing outcomes. Kubernetes and other orchestrators like Mesos and Nomad are very good at action (e.g., triggering a scaling event). Where they fall short, though, is in analysis because, again, they are trying to accommodate a wide swath of workloads. The worst-case scenario is that an autoscaler keeps getting triggered within a cluster—perhaps due to run-of-the-mill memory pressure or network saturation—and the resource manager can't place work anymore, leading to a queue of unfulfilled requests. Depending on the organization, a frantic Slack to an SRE would not be far behind.

With AIOps, this event could be avoided: the system could spot the trend and redeploy a version of the application and application infrastructure better suited to the load. Coupled with the power of Cisco's stack, AIOps-infused AppDynamics will soon help make systems more resilient without operator intervention.

Better Together

AppDynamics was excited to partner with Cisco at KubeCon + CloudNativeCon to run joint sessions on Cisco Cloud Center and the AppDynamics platform. We're increasing our cross-pollination with Cisco Container Platform and Cisco Cloud to monitor and manage the next generation of workloads, too.


Ravi Lachhman, in Cisco’s booth, gives a presentation on Cloud Native infrastructure.

If the pace of Cloud Native innovation keeps up with CNCF expansion, next year is certainly going to be exciting. AppDynamics and Cisco provide the engine that connects technical decisions with business outcomes. Be sure to sign up for and watch Cisco's Cloud Unfiltered Podcast, which includes interviews with the key technologists moving cloud workloads forward.

See You in SD!

We’re excited to see what next year has in store for the CNCF and Cloud Native workloads. KubeCon + CloudNativeCon 2019 will take place in sunny San Diego, and we hope to see you there. Don’t forget to fire up your favorite learning platforms such as Katacoda and Cisco DevNet to sharpen your Cloud Native skills. See you in San Diego!

Why Kubelet TLS Bootstrap in Kubernetes 1.12 is a Very Big Deal

Kubelet TLS Bootstrap, an exciting and highly-anticipated feature in Kubernetes 1.12, is graduating to general availability. As you know, the Kubernetes orchestration system provides such key benefits as service discovery, load balancing, rolling restarts, and the ability to maintain container counts by replacing failed containers. And by using Kubernetes-compliant extensions, you can seamlessly enhance system functionality. This is similar to how Istio (with Kubernetes) provides added benefits such as robust tracing/monitoring, traffic management, and so on.

Until now, however, Kubernetes did not provide similar automation features for security best practices, such as mutually-authenticated TLS connections (mutual-TLS or mTLS). These connections enable developers to use simple certificate directives that limit nodes to communicating only with predetermined services—all without writing a single line of additional code. Even though using TLS 1.2 certificates for service-to-service communication is a known best practice, very few companies deploy their systems with mutual-TLS. This lack of adoption is due mostly to the difficulty of creating and managing a public key infrastructure (PKI). This is why the new TLS Bootstrap module in Kubernetes 1.12 is so exciting: It provides features for adding authentication and authorization to each service at the application level.

The Power of mTLS

Mutual-TLS mandates that both the client and server must authenticate themselves by exchanging identities (certificates). mTLS is made possible by provisioning a TLS certificate to each Kubelet. The client and server use the TLS handshake protocol to negotiate and set up a secure encryption channel. As part of this negotiation, each party checks the validity of the other party’s certificate. Optionally, they can add more verification, such as authorization (the principle of least privilege). Hence, mTLS will provide added security to your application and data. Even if malicious software has taken over a container or host, it cannot connect to any service without providing a valid identity/authorization.

In addition, the Kubelet certificate rotation feature (currently in beta) provides an automated way to get a signed certificate from the cluster API server. The kubelet process accepts an argument, --rotate-certificates, which controls whether the kubelet will automatically request a new certificate as the current one nears expiration. The kube-controller-manager process accepts the argument --experimental-cluster-signing-duration, which controls the length of time each certificate will be in use.
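As a rough sketch, the two settings might be passed as follows when starting the kubelet and the controller manager. The file paths and the one-hour signing duration are assumptions for illustration, not recommended values:

$ kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \
    --kubeconfig=/etc/kubernetes/kubelet.conf \
    --rotate-certificates=true ...   # request a new certificate automatically as expiry approaches

$ kube-controller-manager --experimental-cluster-signing-duration=1h ...   # sign certificates valid for one hour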

When a kubelet starts up, it uses its initial certificate to connect to the Kubernetes API and issue a certificate-signing request. Upon approval (which can be automated with a few checks), the controller manager signs a certificate issued for a time period specified by the duration parameter. This certificate is then attached to the Certificate Signing Request. The kubelet uses an API call to retrieve the signed certificate, which it uses to connect to the Kubernetes API. As the current certificate nears expiration, the kubelet will use the same process described above to get a new certificate.
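If approval isn't automated, the pending requests can be inspected and approved by hand with kubectl; the CSR name below is a placeholder:

$ kubectl get csr                           # list pending certificate signing requests
$ kubectl certificate approve <csr-name>    # approve a specific request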

Since this process is fully automated, certificates can be created with a very short expiry time. For example, if the expiration time is one hour, even if a malicious agent gets hold of the certificate, the compromised certificate will still expire in an hour.

Robust Security and the Strength of APM

Mutual-TLS and automated certificate rotation give organizations robust security without having to spend heavily on firewalls or intrusion-detection services. mTLS is also the first step towards eliminating the distinction of trusted and non-trusted connections. In this new paradigm, connections coming from inside the firewall or corporate network are treated exactly the same as those from the internet. Every client must identify itself and receive authorization to access a resource, regardless of the originating host’s location. This approach safeguards resources, even if a host inside the corporate firewall is compromised.

AppDynamics fully supports mutually-authenticated TLS connections between its agents and the controller. Our agents running inside a container can communicate with the controller in much the same way as microservices connect to each other. In hybrid environments, where server authentication is available only for some agents and mutual authentication for others, it's possible to set up and configure multiple HTTP listeners in GlassFish—one for server authentication only, another for both server and client authentication. The agent and controller connections can be configured to use the TLS 1.2 protocol as well.

See how AppDynamics can provide end-to-end, unified visibility into your Kubernetes environment!

Blue-Green Deployment Strategies for PCF Microservices

Blue-green deployment is a well-known pattern for updating software components by switching between simultaneously available environments or services. The context in which a blue-green deployment strategy is used can vary: switching between data centers, between web servers in a single data center, or between microservices in a Pivotal Cloud Foundry (PCF) deployment.

In a microservices architecture, it’s often challenging to monitor and measure the performance of a microservice updated via blue-green deployment, specifically when determining the impact on consumers of the service, the overall business process, and existing service level agreements (SLAs).

But there’s good news. PCF—with its built-in router and commands for easily managing app requests, and its sound orchestration of containerized apps—makes implementing a blue-green deployment trivial.

Our Example App

In this blog, we’ll focus on a simplified PCF Spring Boot microservice that implements a REST service to process orders, and a single orders endpoint for posting new orders. For simplicity’s sake, this service has a single “downstream” dependency on an account microservice, which it calls via the account service REST API.

You’ll find the example app here: https://github.com/jaholmes/orderapp.

The order-service uses the account-service to create an account: in this scenario, when an order is submitted, the user requests that a new account be created as part of the order submission.
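For illustration, submitting an order might look something like the request below. The route, path and payload fields are assumptions for this example rather than the exact contract in the sample repo:

$ curl -X POST https://<order-service-route>/orders \
    -H "Content-Type: application/json" \
    -d '{"item": "widget", "quantity": 1, "createAccount": true}'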

Deploying the Green Version of the Account Service

We will target the account-service to perform a blue-green deployment of a new version of the account-service, which includes some bug fixes. We’ll perform the blue-green deployment using the CF CLI map-route and unmap-route commands.

When we push the account-service app, we'll adopt an app-naming strategy that appends -blue or -green to the app name, and assume our deployment pipeline would automatically switch between the two suffixes from one deployment to the next.
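A minimal manifest.yml for the blue version might look roughly like this; the memory, path and buildpack values are assumptions, and the manifest in the sample repo may differ:

applications:
- name: account-service-blue
  memory: 1G
  instances: 5
  path: target/account-service.jar
  buildpacks:
  - java_buildpack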

So our initial deployment, based on the manifest.yml here, would be:

$ cf push account-service-blue -i 5

After pushing this app, we create the production route and map it to the blue version of the app using cf map-route.

$ cf map-route account-service-blue apps.<pcf-domain> --hostname prod-account-service

Then we create the user-provided service.

$ cf cups account-service -p '{"route": "prod-account-service.apps.<pcf-domain>"}'

And bind the order-service to this user-provided service by referencing it in the order-service’s manifest.yml.
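That reference might look something like this in the order-service manifest.yml (the other values are assumed for illustration):

applications:
- name: order-service
  instances: 5
  path: target/order-service.jar
  services:
  - account-service   # the user-provided service created above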

When pushed, the order-service app consumes the account-service route from its environment variables, and uses this route to communicate with the account-service, regardless of whether it’s the blue or green version of the account-service.

The initial deployment shows both the order and blue account-service running with five instances.

Both services have started the AppDynamics APM agent integrated in the Java buildpack and are reporting to an AppDynamics controller. The flowmap for the order service/orders endpoint shows the requests flowing from the order-service to the account-service, and the response time averaging 77 milliseconds (ms). A majority of that time is being consumed by the account-service. The instances are represented as nodes on the flowmap:

Performing Blue/Green Deployment

Now we’re ready to push the updated “green” account-service that implements fixes to known issues. We’ll push it and change the app name to “account-service-green” so it’s deployed separately from account-service-blue.

$ cf push account-service-green -i 5

At this point, the Apps Manager shows both versions of the account-service app running, but only the blue version is receiving traffic.

We can validate this by referencing a monitoring dashboard that displays calls per minute and response time for the blue and green versions of the account-service. Below, the dashboard shows no activity for account-service-green.

This dashboard distinguishes between blue and green versions by applying a filtering condition, which matches node names that start with “account-service-blue” or “account-service-green.”

This node-matching criterion matches the node names assigned by the AppDynamics integration included in the Java buildpack, which uses the pattern <pcf-app-name>:<instance id>. Below is a list of the node names reporting under the accountservice tier that shows this pattern.

To complete the blue-green deployment, we use the cf map-route and unmap-route commands to shift routing from the blue version to the green version.

$ cf map-route account-service-green apps.<pcf-domain> --hostname prod-account-service

$ cf unmap-route account-service-blue apps.<pcf-domain> --hostname prod-account-service

This instructs the CF router to route all requests to prod-account-service.<pcf-domain> (used by the order-service app) to the green version. At this point, we want to evaluate as quickly as possible the performance impact of the green version on the order-service.

Our blue-green dashboard shows the traffic has indeed switched from the blue to the green nodes, but performance has degraded. Our blue version of the account-service was averaging well below 100 ms, but the green version is showing an uptick to around 150 ms (artificially introduced for the sake of example).

We see a proportional impact on the order service, which is taking an additional 100 ms to process requests. This could be a case where a rollback is necessary, which again is straightforward using cf map-route and unmap-route.

Baselines and Health Rules

Rather than relying strictly on dashboards to decide whether to roll back a deployment, we can establish thresholds in health rules that compare performance to baselines based on historical behavior.

For example, we could create health rules with thresholds based on the baseline or average performance of the order-service, and alert if the performance exceeds that baseline by two standard deviations.

When we deployed the green version, we were quickly alerted to a performance degradation of the order-service, as shown in our dashboard:

We also defined a health rule that focuses on the aggregate baseline of the account-service (including average performance of all blue and green versions) to determine if the latest deployment of the account-service is behaving poorly. Again, this produced an alert based on our poorly performing green version:

The ability to identify slower-than-normal transactions and capture detailed diagnostic data is also critical to finding the root cause.

In the case of the slower account-service version, we can drill down within snapshots to the specific lines of code responsible for the slowness.

Monitoring Impact on a Business Process

In the previous dashboards, we tracked the impact of an updated microservice on clients and the overall application. If the microservice is part of a larger business process, it would also be important to compare the performance of the business process before and after the microservice was updated via blue-green deployment.

In the dashboard below, the order microservice, which depends on the account microservice, may be part of a business process that involves converting offers to orders. In this example, conversion rate is the key performance indicator we want to monitor, and a negative impact on conversion rates would be cause to consider a rollback.

The Power of APM on PCF

While PCF makes deploying microservices relatively simple, monitoring the impact of updates is complex. It’s critical to have a powerful monitoring solution like AppDynamics, which can quickly and automatically spot performance anomalies and identify impacts on service clients and overall business processes.

Monitoring Kubernetes and OpenShift with AppDynamics

Here at AppDynamics, we build applications for both external and internal consumption. We’re always innovating to make our development and deployment process more efficient. We refactor apps to get the benefits of a microservices architecture, to develop and test faster without stepping on each other, and to fully leverage containerization.

Like many other organizations, we are embracing Kubernetes as a deployment platform. We use both upstream Kubernetes and OpenShift, an enterprise Kubernetes distribution on steroids. The Kubernetes framework is very powerful. It allows massive deployments at scale, simplifies new version rollouts and multi-variant testing, and offers many levers to fine-tune the development and deployment process.

At the same time, this flexibility makes Kubernetes complex in terms of setup, monitoring and maintenance at scale. Each of the Kubernetes core components (api-server, kube-controller-manager, kubelet, kube-scheduler) has quite a few flags that govern how the cluster behaves and performs. The default values may be OK initially for smaller clusters, but as deployments scale up, some adjustments must be made. We have learned to keep these values in mind when monitoring OpenShift clusters—both from our own pain and from published accounts of other community members who have experienced their own hair-pulling discoveries.

It should come as no surprise that we use our own tools to monitor our apps, including those deployed to OpenShift clusters. Kubernetes is just another layer of infrastructure. Along with the server and network visibility data, we are now incorporating Kubernetes and OpenShift metrics into the bigger monitoring picture.

In this blog, we will share what we monitor in OpenShift clusters and give suggestions as to how our strategy might be relevant to your own environments. (For more hands-on advice, read my blog Deploying AppDynamics Agents to OpenShift Using Init Containers.)

OpenShift Cluster Monitoring

For OpenShift cluster monitoring, we use two plug-ins that can be deployed with our standalone machine agent. AppDynamics' Kubernetes Events Extension, described in our blog on monitoring Kubernetes events, tracks every event in the cluster. The Kubernetes Snapshot Extension captures attributes of various cluster resources and publishes them to the AppDynamics Events API. The snapshot extension collects data on all deployments, pods, replica sets, daemon sets and service endpoints. It captures the full extent of the available attributes, including metadata, spec details, metrics and state. Both extensions use the Kubernetes API to retrieve the data, and can be configured to run at desired intervals.

The data these plug-ins provide ends up in our analytics data repository and instantly becomes available for mining, reporting, baselining and visualization. The data retention period is at least 90 days, which offers ample time to go back and perform an exhaustive root cause analysis (RCA). It also allows you to reduce the retention interval of events in the cluster itself. (By default, this is set to one hour.)

We use the collected data to build dynamic baselines, set up health rules and create alerts. The health rules, baselines and aggregate data points can then be displayed on custom dashboards where operators can see the norms and easily spot any deviations.

An example of a customizable Kubernetes dashboard.

What We Monitor and Why

Cluster Nodes

At the foundational level, we want monitoring operators to keep an eye on the health of the nodes where the cluster is deployed. Typically, you would have a cluster of masters, where core Kubernetes components (api-server, controller-manager, kube-scheduler, etc.) are deployed, as well as a highly available etcd cluster and a number of worker nodes for guest applications. To paint a complete picture, we combine infrastructure health metrics with the relevant cluster data gathered by our Kubernetes data collectors.

From an infrastructure point of view, we track CPU, memory and disk utilization on all the nodes, and also zoom into the network traffic on etcd. In order to spot bottlenecks, we look at various aspects of the traffic at a granular level (e.g., reads/writes and throughput). Kubernetes and OpenShift clusters may suffer from memory starvation, disks overfilled with logs or spikes in consumption of the API server and, consequently, the etcd. Ironically, it is often monitoring solutions that are known for bringing clusters down by pulling excessive amounts of information from the Kubernetes APIs. It is always a good idea to establish how much monitoring is enough and dial it up when necessary to diagnose issues further. If a high level of monitoring is warranted, you may need to add more masters and etcd nodes. Another useful technique, especially with large-scale implementations, is to have a separate etcd cluster just for storing Kubernetes events. This way, the spikes in event creation and event retrieval for monitoring purposes won't affect performance of the main etcd instances. This can be accomplished by setting the --etcd-servers-overrides flag of the api-server, for example:

--etcd-servers-overrides=/events#https://etcd1.cluster.com:2379;https://etcd2.cluster.com:2379;https://etcd3.cluster.com:2379

From the cluster perspective we monitor resource utilization across the nodes that allow pod scheduling. We also keep track of the pod counts and visualize how many pods are deployed to each node and how many of them are bad (failed/evicted).

A dashboard widget with infrastructure and cluster metrics combined.

Why is this important? Kubelet, the component responsible for managing pods on a given node, has a setting, --max-pods, which determines the maximum number of pods that can be orchestrated on that node. In Kubernetes the default is 110. In OpenShift it is 250. The value can be changed up or down depending on need. We like to visualize the remaining headroom on each node, which helps with proactive resource planning and prevents sudden overflows (which could mean an outage). Another data point we add there is the number of evicted pods per node.
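One quick way to eyeball a node's pod count and remaining headroom is with kubectl; the node name below is a placeholder:

$ kubectl describe node <node-name> | grep -A 5 "Non-terminated Pods"                     # pods currently scheduled on the node
$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>    # list those pods individually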

Pod Evictions

Evictions are caused by disk-space or memory starvation. We recently had an issue with the disk space on one of our worker nodes due to a runaway log. As a result, the kubelet produced massive evictions of pods from that node. Evictions are bad for many reasons. They will typically affect the quality of service or may even cause an outage. If the evicted pods have an exclusive affinity with the node experiencing disk pressure, and as a result cannot be re-orchestrated elsewhere in the cluster, the evictions will result in an outage. Evictions of core component pods may lead to the meltdown of the cluster.

Long after the incident where pods were evicted, we saw the evicted pods were still lingering. Why was that? Garbage collection of evictions is controlled by a setting in kube-controller-manager called --terminated-pod-gc-threshold. The default value is set to 12,500, which means that garbage collection won't occur until you have that many evicted pods. Even in a large implementation it may be a good idea to dial this threshold down to a smaller number.
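For example, a lower threshold could be passed to the controller manager, and failed (including evicted) pods can be listed directly; the value of 500 here is purely illustrative:

$ kube-controller-manager --terminated-pod-gc-threshold=500 ...                     # garbage-collect terminated pods much sooner
$ kubectl get pods --all-namespaces --field-selector status.phase=Failed            # list failed/evicted pods still lingering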

If you experience a lot of evictions, you may also want to check if kube-scheduler has a custom --policy-config-file defined with no CheckNodeMemoryPressure or CheckNodeDiskPressure predicates.

Following our recent incident, we set up a new dashboard widget that tracks a metric of any threats that may cause a cluster meltdown (e.g., massive evictions). We also associated a health rule with this metric and set up an alert. Specifically, we’re now looking for warning events that tell us when a node is about to experience memory or disk pressure, or when a pod cannot be reallocated (e.g., NodeHasDiskPressure, NodeHasMemoryPressure, ErrorReconciliationRetryTimeout, ExceededGracePeriod, EvictionThresholdMet).

We also look for daemon pod failures (FailedDaemonPod), as they are often associated with cluster health rather than issues with the daemon set app itself.
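Outside of the extensions, the same warning events can be spot-checked from the command line, for example:

$ kubectl get events --all-namespaces --field-selector type=Warning    # surface warning events such as evictions and pressure conditions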

Pod Issues

Pod crashes are an obvious target for monitoring, but we are also interested in tracking pod kills. Why would someone be killing a pod? There may be good reasons for it, but it may also signal a problem with the application. For similar reasons, we track deployment scale-downs, which we do by inspecting ScalingReplicaSet events. We also like to visualize the scale-down trend along with the app health state. Scale-downs, for example, may happen by design through auto-scaling when the app load subsides. They may also be issued manually or in error, and can expose the application to an excessive load.

Pending state is supposed to be a relatively short stage in the lifecycle of a pod, but sometimes it isn't. It may be a good idea to track pods with a pending time that exceeds a certain, reasonable threshold—one minute, for example. In AppDynamics, we also have the luxury of baselining any metric and then tracking any configurable deviation from the baseline. If you catch a spike in pending-state duration, the first thing to check is the size of your images and the speed of image download. One big image may clog the pipe and affect other containers. Kubelet has a flag, --serialize-image-pulls, which is set to "true" by default. It means that images will be loaded one at a time. Change the flag to "false" if you want to load images in parallel and avoid the potential clogging by a monster-sized image. Keep in mind, however, that you have to use Docker's overlay2 storage driver to make this work. In newer Docker versions this setting is the default. In addition to the kubelet setting, you may also need to tweak the max-concurrent-downloads flag of the Docker daemon to ensure the desired parallelism.

Large images that take a long time to download may also cause a different type of issue that results in a failed deployment. The Kubelet flag --image-pull-progress-deadline determines the point in time when the image will be deemed "too long to pull or extract." If you deal with big images, make sure you dial up the value of the flag to fit your needs.
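A sketch of the settings discussed above; the ten-minute deadline and download count are illustrative values, not recommendations:

$ kubelet --serialize-image-pulls=false --image-pull-progress-deadline=10m ...   # pull images in parallel and allow longer pulls

with the Docker daemon configured along these lines in /etc/docker/daemon.json:

{
  "storage-driver": "overlay2",
  "max-concurrent-downloads": 6
}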

User Errors

Many big issues in the cluster stem from small user errors (human mistakes). A typo in a spec—for example, in the image name—may bring down the entire deployment. Similar effects may occur due to a missing image or insufficient rights to the registry. With that in mind, we track image errors closely and pay attention to excessive image-pulling. Unless it is truly needed, image-pulling is something you want to avoid in order to conserve bandwidth and speed up deployments.

Storage issues also tend to arise due to spec errors, lack of permissions or policy conflicts. We monitor storage issues (e.g., mounting problems) because they may cause crashes. We also pay close attention to resource quota violations because they do not trigger pod failures. They will, however, prevent new deployments from starting and existing deployments from scaling up.

Speaking of quota violations, are you setting resource limits in your deployment specs?

Policing the Cluster

On our OpenShift dashboards, we display a list of potential red flags that are not necessarily a problem yet but may cause serious issues down the road. Among these are pods without resource limits or health probes in the deployment specs.

Resource limits can be enforced by resource quotas across the entire cluster or at a more granular level. Violation of these limits will prevent the deployment. In the absence of a quota, pods can be deployed without defined resource limits. Having no resource limits is bad for multiple reasons. It makes cluster capacity planning challenging. It may also cause an outage. If you create or change a resource quota when there are active pods without limits, any subsequent scale-up or redeployment of these pods will result in failures.
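For reference, per-container limits are declared in the pod spec; the numbers below are placeholders and should be sized to the application:

    resources:
      requests:
        cpu: 100m        # guaranteed share used for scheduling decisions
        memory: 256Mi
      limits:
        cpu: 500m        # hard ceiling enforced at runtime
        memory: 512Mi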

The health probes, readiness and liveness, are not enforceable, but it is a best practice to have them defined in the specs. They are the primary mechanism for the pods to tell the kubelet whether the application is ready to accept traffic and is still functioning. If the readiness probe is not defined and the pod takes a long time to initialize (based on the kubelet's default), the pod will be restarted. This loop may continue for some time, taking up cluster resources for no reason and effectively causing a poor user experience or an outage.

The absence of the liveness probe may cause a similar effect if the application is performing a lengthy operation and the pod appears to the kubelet as unresponsive.
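A minimal probe definition might look like the snippet below, assuming the application exposes a /healthz endpoint on port 8080 (both are assumptions for illustration):

    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10   # wait before the first readiness check
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30   # give the app time to start before liveness checks begin
      periodSeconds: 10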

We provide easy access to the list of pods with incomplete specs, allowing cluster admins to have a targeted conversation with development teams about corrective action.

Routing and Endpoint Tracking

As part of our OpenShift monitoring, we provide visibility into potential routing and service endpoint issues. We track unused services, including those created by someone in error and those without any pods behind them because the pods failed or were removed.

We also monitor bad endpoints pointing at old (deleted) pods, which effectively cause downtime. This issue may occur during rolling updates when the cluster is under increased load and API request-throttling is lower than it needs to be. To resolve the issue, you may need to increase the --kube-api-burst and --kube-api-qps config values of kube-controller-manager.
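For example (the values are illustrative; the right numbers depend on cluster size and load):

$ kube-controller-manager --kube-api-qps=50 --kube-api-burst=100 ...   # raise the controller manager's API request budget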

Every metric we expose on the dashboard can be viewed and analyzed in the list and further refined with ADQL, the AppDynamics query language. After spotting an anomaly on the dashboard, the operator can drill into the raw data to get to the root cause of the problem.

Application Monitoring

Context plays a significant role in our monitoring philosophy. We always look at application performance through the lens of the end-user experience and desired business outcomes. Unlike specialized cluster-monitoring tools, we are not only interested in cluster health and uptime per se. We’re equally concerned with the impact the cluster may have on application health and, subsequently, on the business objectives of the app.

In addition to having a cluster-level dashboard, we also build specialized dashboards with a more application-centric point of view. There we correlate cluster events and anomalies with application or component availability, end-user experience as reported by real-user monitoring, and business metrics (e.g., conversion of specific user segments).

Leveraging K8s Metadata

Kubernetes makes it super easy to run canary deployments, blue-green deployments, and A/B or multivariate testing. We leverage these conveniences by pulling deployment metadata and using labels to analyze performance of different versions side by side.

Monitoring Kubernetes or OpenShift is just a part of what AppDynamics does for our internal needs and for our clients. AppDynamics covers the entire spectrum of end-to-end monitoring, from the foundational infrastructure to business intelligence. Naturally, AppDynamics is used by many different groups of operators who may have very different skills. In that sense, we look at the platform as a collaboration tool that helps translate the language of APM to the language of Kubernetes and vice versa.

By bringing these different datasets together under one umbrella, AppDynamics establishes a common ground for diverse groups of operators. On the one hand you have cluster admins, who are experts in Kubernetes but may not know the guest applications in detail. On the other hand, you have DevOps in charge of APM or managers looking at business metrics, both of whom may not be intimately familiar with Kubernetes. These groups can now have a productive monitoring conversation, using terms that are well understood by everyone and a single tool to examine data points on a shared dashboard.

Learn more about how AppDynamics can help you monitor your applications on Kubernetes and OpenShift.

How Top Investment Banks Accelerate Transaction Time and Avoid Performance Bottlenecks

A complex series of interactions must take place for an investment bank to process a single trade. From the moment it's placed by a buyer, an order is received by front-office traders and passed through to middle- and back-office systems that conduct risk management checks, matchmaking, clearing and settlement. Then the buyer receives the securities and the seller the corresponding cash. Once complete, the trade is sent to regulatory reporting, which ensures the transaction was processed under the right regulatory requirements. One AppDynamics customer, a major financial firm, utilizes thousands of microservices to complete this highly complex task countless times throughout the day.

To expedite this process, banks have implemented straight-through processing (STP), an initiative that allows electronically entered information to move between parties in the settlement process without manual intervention. But one of the banks’ biggest concerns with STP is the difficulty of following trades in real-time. When trades get stuck, manual intervention is needed, often impacting service level agreements (SLAs) and even trade reconciliation processes. One investment firm, for instance, told AppDynamics that approximately 20% of its trades needed manual input to complete what should have been a fully automated process—a bottleneck that added significant overhead and resource requirements. And with trade volumes increasing 25% year over year, the company needed a fresh approach to help manage its rapid growth.

AppDynamics' Business Transactions (BT) enabled the firm to track and follow trades in real time, end-to-end through its systems. A BT traces through all the necessary systems and microservices—applications, databases, third-party APIs, web services, and so on—needed to process and respond to a request. In investment banking, a BT may include everything from placing an order, completing risk checks or calculations, booking and confirming different types of trades, and even post-trade actions such as clearing, settlement and regulatory reporting.

The AppDynamics Business Journey takes this one step further by following a transaction across multiple BTs; for example, following an individual trade from order through capture and then to downstream reporting. The Business Journey provides true end-to-end, time-based tracking against SLAs, and traces the transaction across each step to monitor performance and ensure completion.

Once created, the Business Journey allows you to visualise key metrics with out-of-the-box dashboards.

Real-Time Tracking with Dashboards

Prior to AppDynamics, one investment bank struggled to track trades in real time. They were doing direct queries on the database to find out how many trades had made it downstream to the reporting database. This method was slow and inefficient, requiring employees to create and share small Excel dashboards, which lacked real-time trade information. AppDynamics APM dashboards, by comparison, enabled them to get a real-time, high-level overview of the health and performance of their system.

After installing AppDynamics, the investment bank instrumented a dashboard to show all the trades entering its post-trade system throughout the day. This capability proved hugely beneficial in helping the firm monitor trading spikes and ensure it was meeting its SLAs. And Business IQ performance monitoring made it possible to slice and dice massive volumes of incoming trades to gain real-time insights into where the transactions were coming from (i.e., which source system), their value, whether they met the SLAs, and which ones failed to process. Additionally, AppDynamics Experience Level Management provided the ability to report compliance against specific processing times.

Now the bank could automate complex processes and remove inefficient manual systems. Prior to AppDynamics, there was a team dedicated to overseeing more than 200 microservices. They had to determine why a particular trade failed, and then pass that information onto the relevant business teams for follow-up to avoid losing business. But too often a third-party source would send invalid data, or update its software and send a trade in an updated format unfamiliar to the bank’s backend system, creating a logistical mess too complex for one human to manage. With Business IQ, the bank was able to immediately spot and follow up on invalid trades.

Searching for Trades Across All Applications

Microservices offer many advantages but can bring added complexity as well. The investment bank had hundreds of microservices but lacked a fast and efficient way to search for an individual trade. In the event of a problem, they would take the trade ID and look into the log files of multiple microservices. On average, they had to open up some 40 different log files to locate a problem. And although the firm had an experienced support staff that knew the applications well, this manual process wasn’t sustainable as newer, inexperienced support people were brought onboard. Nor would this system scale as trade volume increased.

By using Business IQ to monitor every transaction across all microservices, the bank was able to easily monitor individual trade transactions throughout the lifecycle. And by capturing the trade ID, as well as supplementary data such as the source, client, value and currency, they could then go into AppDynamics Application Analytics and very quickly identify specific transactions. For example, they could enter the trade ID and see every transaction for the trade across the entire system.

This feature was particularly loved by the support staff, which now had immediate access to all of a trade's interactions within a single screen, as well as the ability to easily drill down to find the root cause of a failed transaction.

Tracking Regulatory SLAs in Real Time

Prior to AppDynamics, our customer didn’t have an easy way to track the progress of a trade in real time. Rather, they were manually verifying that trades were successfully being sent to regulatory reporting systems, as well as ensuring that this was completed within the required timeframe. This was difficult to do in real time, meaning that when there was an issue, often it was not found until after the SLA had been breached. With AppDynamics they were able to set up a dashboard to visualise data in real time; the team then set up a health rule to indicate if trade reporting times were approaching the SLA. They also configured an alert that enabled them to proactively see and resolve any issues ahead of an SLA breach.

Proactively Tracking Performance after Code Releases

The bank periodically introduces new functionality to meet the latest business or regulatory requirements, in particular MiFID II, introduced to improve investor protection across Europe by harmonizing the rules for all firms with EU clients. Currently, new releases happen every week, but this rate will continue to increase. These new code releases introduce risk, as previous releases have either had a negative impact on system performance or have introduced new defects. In one two-month period, for instance, the time required to capture a trade increased by about 20%. If this continued, the bank would have had to scale out hugely—buying new hardware at significant cost—to avoid breaching its regulatory SLA.

The solution was to create a comparative dashboard in AppDynamics that showed critical Business Transactions and how they were being changed between releases (response times, errors, and so on). If any metric degraded from the previous version or deviated from a certain threshold, it would be highlighted on the dashboard in a different color, making it easier to decide whether to proceed with a rollout or determine which new feature or change had caused the deviation.

Preventing New Hardware Purchases

After refining its code based on AppDynamics’ insights, the bank saw a dramatic 6X performance improvement. This saved them from having to—in their words—“throw more hardware at the problem” by buying more CPU processing power to push through more trades.

By instrumenting their back office systems with AppDynamics, the bank gained deep insights that enabled them to refine their code. For instance, calls to third-party APIs were taking place unnecessarily and trades were being captured unintentionally within multiple different databases. Without AppDynamics, it’s unlikely this would have been discovered. The insight enabled the bank to make some very simple changes to fine-tune code, resulting in a significant performance improvement and enabling the bank to save money by scaling with their existing hardware profile.

Beneficial Business Outcomes

From the bank's perspective, one of the greatest gains of going with AppDynamics was the ability to follow a trade through its many complex services, from the moment an order is placed, through to capture and down to regulatory reporting. This enabled them to improve system performance, avoid expensive (and unnecessary) hardware upgrades, quickly search for trade IDs to locate and find the root cause of issues, and proactively manage SLAs.

See how AppDynamics can help your own business achieve positive outcomes.

How Anti-Patterns Can Stifle Microservices Adoption in the Enterprise

In my last article, Microservice Patterns That Help Large Enterprises Speed Development, Deployment and Extension, we went over some deployment and communication patterns that help keep microservices manageable as you use more of them. I also promised that in my next post, I’d get into how microservice patterns can become toxic, create more overhead and unreliability, and become an unmanageable mess. So let’s dig in.

First things first: Patterns are awesome.

They help formalize ideas into reusable chunks that you can distribute and communicate easily to your teams. Here are some useful things that patterns do for engineering departments of any size:

  • Make ideas distributable

  • Lower the barriers to success for junior members

  • Create building blocks

  • Create consistency in disparate/complex systems

Patterns are usually an intentional act. When we put patterns in place, we’re making a clear choice and acting on a decision to make things less painful. But not all patterns are helpful. In the case of the anti-pattern, it has the potential to create more trouble for the engineering team and the business.

The Anatomy of An Anti-Pattern

Like patterns, anti-patterns tend to be identifiable and repeatable.

Anti-patterns are rarely intentional, and usually you identify them long after their effects are visible. Individuals in your organization often make well-meaning (if poor) choices in the pursuit of faster delivery, rushed deadlines, and so on. These anti-patterns are often perpetuated by other employees who decide, “Well, this must be how it’s done here.”

In this way, anti-patterns can become a very dangerous norm.

For companies that want to migrate their architecture to microservices, anti-patterns are a serious obstacle to success. That’s why I’d like to share with you a few common anti-patterns I’ve seen repeated many times in companies making the switch to microservices. These moves eventually compromised their progress and created more of the problems they were trying to avoid.

Data Taffy

The first anti-pattern is the most common—and the most subtle—in the chaos and damage it causes.

The Problem

The data taffy anti-pattern can manifest in a few different ways, but the short explanation is that it occurs when all services have full access to all objects in the database.

That doesn’t sound so bad, right?

You want to be able to create complex queries and complex data-ingestion scenarios to go across many domains. So at first glance, it makes sense to have everything call what it needs directly from the database. But that’s a problem when you need to scale an individual domain in your application. Rarely does data grow uniformly across all domains, but rather does so in bursts on individual domains. Sometimes it’s very difficult to predict which domains will grow the fastest. The entangled data becomes a lot like taffy: difficult to pull apart. It stretches and gets stuck in the cogs of business.

In this scenario, companies will have lots of stored procedures, embedded complex queries across many services, and object relationship managers all accessing the database—each with its own understanding of how a domain is “supposed” to be used. This nearly always leads to data contamination and performance issues.

But there are even bigger challenges, most notably when you need to make structural changes to your database.

Here’s an example based on a real-life experience I had with a large, privately owned company, one that started small and expanded rapidly to service tens of thousands of clients. Say you have a table whose primary key is an int starting at 0. You have 2,147,483,647 objects before you’ll run out of keys—no big deal, right?

So you start building out services, and this table becomes a cornerstone object in your application that every other domain touches in some meaningful way. Before you know it, there are 125 applications calling into this table from queries or stored procedures, totaling some 13,000 references to the table. Because this is a core table it gets a ton of data. Soon you’re at 2,100,000,000 objects with 10,000,000 new records being added daily.

You have four days before things go bad—real bad.

You try adding negative values to buy time, not realizing that half the services have hard-coded rules that IDs must be greater than 0. So you bite the bullet and manually scrub through EVERY SERVICE to find every instance of every object that has been created that uses this data, and then update the type from an integer to a large integer. You then have to update several other tables, objects, and stored procedures with foreign key relationships. This becomes a hugely painful effort with all hands on deck frantically trying to keep the company’s flagship product from being DOA.

Clearly, not an ideal scenario for any company.

Now if this domain was contained behind a single service, you'd know exactly how the table is being used and could find creative solutions to maintain backwards compatibility. After all, you can do a lot with code that you simply can't do by changing a data value in a database. For instance, in the example above, there are about 800 million available IDs that could be reclaimed and mapped for new adds, which would buy enough time for a long-term plan that doesn't require a frantic, all-hands-on-deck approach. This could even be combined with a two-key system based on a secondary value used to partition the data effectively. In addition, there's one partitionable field we could use to give us 10,000x more available integers, as well as a five-year window to create more permanent solutions with no changes to any consuming services.

This is just one anecdote, but I have seen this problem consistently halt scale strategies for companies at crucial times of growth. So try to avoid this anti-pattern.

How to Solve

To solve the data taffy problem, you must isolate data to specific domains, accessible only via the services designed to serve them. The data may start on the same database but use schema and access policy to limit access to a single service. This enables you to change databases, create partitions, or move to entirely new data storage systems without any other service or system having to know or care.

Dependency Disorder

Say you’ve switched to microservices, but deployments are taking longer than ever. You wish you had never tried to break down the monolith. If this sounds familiar, you may be suffering from dependency disorder.

The Problem

Dependency disorder is one of the easiest anti-patterns to detect. If you have to know the exact order that services must be deployed to keep them from failing, it’s a clear signal the dependencies have become nested in a way that won’t scale well. Dependency disorder generally comes from domains calling sideways from one domain’s stack to another (instead of down the stack from the UI to the gateway) and then to the services that enable the gateway. Another big problem resulting from dependency disorder: unknown execution paths that take arbitrarily long times to execute.

How to Solve

An APM solution is a great starting point for resolving dependency disorder problems. Try to utilize a solution that provides a complete topology of your service execution paths. By leveraging these maps, you can make precision cuts in the call chain and refocus gateways to make fan-out calls that execute asynchronously rather than doing sideways calls. For some examples of helpful patterns, check out part one of this series. Ideally, we want to avoid service-to-service calls that create a deep and unmanageable call stack and favor a wider set of calls from the gateway.

Microlith

Microliths are basically well-meaning, clearly defined service paths that take dependency disorder to its maximum entropic state.

The Problem

Imagine having a really well-designed service, database and gateway implementation that you decide to isolate into a container—you feel great! You have a neat-and-tidy set of rules for how data gets stored, managed and scaled.

Believing you’ve reached microservice nirvana, you breathe a sigh of relief and wait for the accolades. Then you notice the releases gradually start taking longer and longer, and that data-coupling is happening in weird ways. You also find yourself deploying nearly the entire suite of services with every deployment, causing testing issues and delays. More and more trouble tickets are coming in each quarter, and before you know it, the organization is ready to scrap microservices altogether.

The promise of microservices is that there are no rules—you just put out whatever you want. The problem is that without a clear definition of how data flows down the stack, you’re basically creating a hybrid problem between data taffy and dependency disorder.

How to Solve

The mediation process here is effectively the same as with the dependency disorder. If you are working with a full-blown microlith, however, it will take some diligence to get back to stable footing. The best advice I can give is, try to get to a point where you can deploy a commit as soon as it’s in. If your automation and dependency orders are well-aligned, new service features should always be ready to roll out as soon as the developer commits to the code base. Don’t stand on formality. If this process is painful, do it more. Smooth out your automated testing and deployment so that you can reliably get commits deployed to production with no downtime.

Final Thoughts

I hope this gets the wheels spinning in your head about some of the microservices challenges you may be having now or setting yourself up for in the future. This information is all based on my personal experiences, as well as my daily conversations with others in the industry who think about these problems. I’d love to hear your input, too. Reach out via email, chase.aucoin@appdynamics.com; Twitter, https://twitter.com/ChaseAucoin; or LinkedIn, https://www.linkedin.com/in/chaseaucoin/. If you’ve got some other thoughts, I’d love to hear from you!

Microservice Patterns That Help Large Enterprises Speed Development, Deployment and Extension

This is the first in a two-part series on microservice patterns and anti-patterns. In this article, we’ll focus on some useful patterns that, when leveraged, can speed up development, deployment, and extension. In the next article, we’ll focus on how microservice patterns can become toxic, create more overhead and unreliability, and become an unmanageable mess.

Microservice patterns assume a large-scale enterprise-style environment. If you’re working with only a small set of services (1 to 5) you won’t feel the positive impact as strongly as organizations with 10, 100, or 1000+ services. My biggest pattern for startups and smaller projects is to not overthink and add complexity for complexity’s sake. Patterns are meant to aid in solving problems—they are not a hammer with which to bludgeon the engineering team. So use them with this in mind.

Open for Extension, Closed for Modification

We’re going to start our patterns talk with a principle rather than a pattern. Software development teams working on microservices get more mileage out of this one principle than almost any other pattern or principle. It’s the “O” in Robert C. Martin’s (Uncle Bob’s) SOLID principles.

In short, being open for extension and closed for modification means leaving your code open to add new functionality via inheritance but closed for direct modifications. I take a bit of a looser definition that tends to be more pragmatic. My definition is, “Don’t break existing contracts.” It’s fine to add methods but don’t change the signature of existing methods. Nor should you change the functionality of an established method.

Why is This Pattern So Powerful in Microservices?

When you have disparate teams working on different services that have to interoperate with one another, you need a certain level of reliability. I, as a consumer of a service, need to be able to depend on the service in the future, even as new features are added.

How Do We Manifest this Pattern?

Easy. Don’t break your contracts. There’s never a good reason to break an existing production contract. However, there could be lots of good reasons to add to it, thereby making it “open for extension and closed for modification.” For example, if you have to start collecting new data as part of a service, add a new endpoint and set a timeline to deprecate the old service call, but don’t do both in a single step. Likewise with data management: if you need to rename a column of data, just add a new column and leave the old one in place for a while. When you deprecate an old service, it’s a good time to do any clean-up that goes with that deprecation. If you can adhere to this principle, everyone in your organization will have a better development experience.
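
As a minimal sketch in plain Java (the interface and type names are hypothetical, used only for illustration), extending a contract looks like adding a new method alongside the old one and marking the old one deprecated until its consumers have migrated:

    // Minimal sketch (hypothetical names): the existing contract stays intact,
    // new capability arrives as an addition, and the old call is retired only
    // after its consumers have had time to migrate.
    public interface CustomerService {

        // Hypothetical return types, included only so the sketch is self-contained.
        record Customer(String id, String name) {}
        record CustomerWithPreferences(String id, String name, String preferredChannel) {}

        // Existing contract: signature and behavior stay exactly as they were.
        // Marked deprecated once the replacement below ships; removed much later.
        @Deprecated
        Customer getCustomer(String customerId);

        // Extension: a new call that also returns the newly collected data.
        // Existing callers are unaffected; new callers opt in.
        CustomerWithPreferences getCustomerWithPreferences(String customerId);
    }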

Pattern: Enterprise Services with SPA Gateways

When we start building out large applications and moving towards a microservice paradigm, the issue of manageability quickly rises to the surface. We have to address manageability at many layers, and also must consider dependency management. Microservices can quickly become a glued-together mess with tightly coupled dependencies. Instead of having a monolithic “big ball of mud,” we create a “mudslide.”

One way to address these problems is to introduce the notion of Enterprise Domain Services that are responsible for the tasks within different domains in your organization, and then combine that domain-specific logic into more meaningful activities (i.e., product features) at the Single Page Application (SPA) gateway layer. The SPA gateway takes some subset of the overall functionality of an application (i.e., a single page’s worth) and codifies that functionality, delegating the “hard parts” (persistence, state management, third-party calls, etc.) to the associated enterprise services. In this pattern, each enterprise service owns its own data, whether as a single database, a collection of databases, or a schema within a larger enterprise database.
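
Here is a minimal sketch in plain Java (the page, service and type names are hypothetical) of the shape this takes: the SPA gateway owns one page’s worth of functionality and delegates the hard parts to the enterprise domain services:

    import java.util.List;

    // Minimal sketch (hypothetical names): a SPA gateway for a "checkout" page.
    // The gateway codifies one page's worth of functionality by composing
    // enterprise domain services, each of which owns its own data and state.
    public class CheckoutPageGateway {

        // Enterprise domain services, each responsible for its own domain's tasks.
        interface CartService    { Cart cartFor(String customerId); }
        interface PricingService { Quote quoteFor(Cart cart); }
        interface PaymentService { Receipt charge(String customerId, Quote quote); }

        record Cart(String customerId, List<String> skus) {}
        record Quote(double total) {}
        record Receipt(String confirmationNumber) {}

        private final CartService carts;
        private final PricingService pricing;
        private final PaymentService payments;

        CheckoutPageGateway(CartService carts, PricingService pricing, PaymentService payments) {
            this.carts = carts;
            this.pricing = pricing;
            this.payments = payments;
        }

        // One page-level activity ("place order") built from domain-level tasks;
        // the hard parts (persistence, state, third-party calls) live downstream.
        Receipt placeOrder(String customerId) {
            Cart cart = carts.cartFor(customerId);      // cart domain owns cart data
            Quote quote = pricing.quoteFor(cart);       // pricing domain owns price rules
            return payments.charge(customerId, quote);  // payment domain owns the charge
        }
    }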

Pattern: SPA Services with Gateway and ETL

Now we are going to ramp up the complexity a bit. One of the big questions people run into when they start down the microservices path is, “How do I join complex data?” In the Enterprise Services with SPA gateways example above, you would just call into multiple services. This is fine when you’re combining two to three points of data, but what about when you need really in-depth questions answered? How do you find, for instance, all the demographic data for one region’s customers who had invoices for the green version of an item in the second quarter of the year?

This question isn’t incredibly difficult if you have all the data together in a single database, but then you might start violating the single responsibility principle pretty fast. The goal, then, is to delegate that responsibility to a service that’s really good at just joining data via ETL (Extract, Transform, Load). ETL is a data warehousing pattern in which you extract data from disparate data sources, transform the data into something meaningful to the business, and load the transformed data somewhere else for use. The team that owns the domain asking these types of demographic questions will be responsible for the care and feeding of the services that perform the ETL, the database or schema where the transformed data is stored, and the service(s) that provide access to it.
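
As a rough sketch in plain Java (source, store and type names are hypothetical), the ETL service owned by that team extracts from the source domains through their services, transforms the rows into a business-meaningful shape, and loads the result into its own reporting store:

    import java.util.List;

    // Minimal ETL sketch (hypothetical names): extract from two domains through
    // their services, transform the rows into a business-meaningful shape, and
    // load the result into a reporting store owned by the team asking the
    // demographic questions.
    public class InvoiceDemographicsEtl {

        interface CustomerSource { List<Customer> customersInRegion(String region); }
        interface InvoiceSource  { List<Invoice> invoicesFor(String customerId, String quarter); }
        interface ReportingStore { void save(DemographicFact fact); }

        record Customer(String id, String region, int age) {}
        record Invoice(String customerId, String sku, String variant) {}
        record DemographicFact(String region, int age, String sku, String variant) {}

        private final CustomerSource customers;
        private final InvoiceSource invoices;
        private final ReportingStore warehouse;

        InvoiceDemographicsEtl(CustomerSource customers, InvoiceSource invoices, ReportingStore warehouse) {
            this.customers = customers;
            this.invoices = invoices;
            this.warehouse = warehouse;
        }

        void run(String region, String quarter) {
            // Extract: pull from each domain via its service, not its database.
            for (Customer customer : customers.customersInRegion(region)) {
                for (Invoice purchase : invoices.invoicesFor(customer.id(), quarter)) {
                    // Transform: join the two domains into one query-friendly fact...
                    DemographicFact fact = new DemographicFact(
                            customer.region(), customer.age(), purchase.sku(), purchase.variant());
                    // ...and Load it into the warehouse this team owns.
                    warehouse.save(fact);
                }
            }
        }
    }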

Why Not Just Make a Multi-Domain Call at the Database?

This is a fair question, and on a small project it may be reasonable to do so. But on large projects with lots of moving parts, each part must be able to move independently. If we are combining directly at the DB level, we’re pretty much guaranteeing that the data will only ever travel together on that single DB, which is no big deal with small volumes of data. However, once we start dealing with tens, hundreds or thousands of terabytes, this becomes a much bigger deal, as it greatly impacts the way we scale domains independently. Using ETLs and data warehousing strategies to provide an abstraction layer over the movement and combination of our data might require us to update our ETL if we move the data around. But this feat is much more manageable than trying to untangle thousands of nested stored procedures across every domain.

Closing Thoughts

Remember, these are just some of the available patterns. Your goal here is to solve problems, not create more problems. If a particular pattern isn’t working well for your purpose, it’s okay to create your own, or mix and match.

One of the easiest ways to get a handle on large projects with lots of domains is to use tools like AppDynamics with automatic mapping functionality to get a better understanding of the dependency graph. This will help you sort out the tangled mess of wires.

Remember, the best way to eat an elephant is one bite at a time.

In my next blog, we’ll look at some common anti-patterns, which at first may seem like really good ideas. But anti-patterns can cause problems pretty quickly and make it difficult to scale and maintain your projects.

Advances In Mesh Technology Make It Easier for the Enterprise to Embrace Containers and Microservices

More enterprises are embracing containers and microservices, which bring along additional networking complexities. So it’s no surprise that service meshes are in the spotlight now. There have been substantial advances recently in service mesh technologies—including Istio 1.0, HashiCorp’s Consul 1.2.1, and Buoyant merging Conduit into Linkerd—and for good reason.

Some background: service meshes are pieces of infrastructure that facilitate service-to-service communication—the backbone of all modern applications. A service mesh allows for codifying more complex networking rules and behaviors such as a circuit breaker pattern. AppDev teams can start to rely on service mesh facilities, and rest assured their applications will perform in a consistent, code-defined manner.
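
For readers new to the term, here is a rough sketch in plain Java (hypothetical and not tied to any particular mesh) of the circuit breaker behavior a mesh can apply declaratively on behalf of your services, so application code no longer has to hand-roll it:

    import java.time.Duration;
    import java.time.Instant;

    // Rough sketch of circuit breaker behavior; with a service mesh this logic is
    // expressed as configuration on the mesh rather than hand-rolled in each app.
    // The threshold and cooldown values here are illustrative only.
    public class CircuitBreaker {

        private final int failureThreshold = 5;                    // trip after 5 straight failures
        private final Duration cooldown = Duration.ofSeconds(30);  // stay open for 30 seconds

        private int consecutiveFailures = 0;
        private Instant openedAt = null;

        public synchronized boolean allowRequest() {
            if (openedAt == null) {
                return true;                                        // closed: traffic flows normally
            }
            if (Duration.between(openedAt, Instant.now()).compareTo(cooldown) >= 0) {
                openedAt = null;                                    // cooldown elapsed: let a trial request through
                consecutiveFailures = 0;
                return true;
            }
            return false;                                           // open: fail fast and protect the callee
        }

        public synchronized void recordSuccess() {
            consecutiveFailures = 0;
        }

        public synchronized void recordFailure() {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();                           // trip the breaker
            }
        }
    }

With a mesh in place, this trip-and-cool-down behavior is defined once as mesh configuration and applied uniformly to every service-to-service call.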

Endpoint Bloom

The more services and replicas you have, the more endpoints you have. And with the container and microservices boom, the number of endpoints is exploding. With the rise of Platform-as-a-Service offerings and container orchestrators, new terms like ingress and egress are becoming part of the AppDev team’s vernacular. As you go through your containerization journey, multiple questions will arise around connectivity. Application owners will have to define how and where their services are exposed.

The days of providing the networking team with a context/VIP to add to web infrastructure—such as services.acme.com/shoppingCart over port 443—are fading. Today, AppDev teams are more likely to hand over a Kubernetes YAML manifest to add services.acme.com/shoppingCart to the Ingress controller, and then describe a behavior. Example: the shopping cart Pod needs to talk to the shopping cart validation Pod, which should be reachable only from the shopping cart, while the inventory is kept on another set of Redis Pods that can’t be exposed to the outside world at all.

You’re juggling all of this while navigating constraints set by defined and deployed Kubernetes networking. At this point, don’t be alarmed if you’re thinking, “Wow, I thought I was in AppDev—didn’t know I needed a CCNA to get my application deployed!”

The Rise of the Service Mesh

When navigating the “fog of system development,” it’s tricky to know all the moving pieces and connectivity options. With AppDev teams focusing mostly on feature development rather than connectivity, it’s very important to make sure all the services are discoverable to them. Investments in API management are the norm now, with teams registering and representing their services in an API gateway or documenting them in Swagger, for example.

But what about the underlying networking stack? Services might be discoverable, but are they available? Imagine a Venn diagram of AppDev vs. Sys Engineer vs. SRE: Who’s responsible for which task? And with multiple pieces of infrastructure to traverse, what would be a consistent way to describe networking patterns between services?

Service Mesh to the Rescue

Going back to the endpoint bloom, consistency and predictability are king. Over the past few years, service meshes have been maturing and gaining popularity. Here are some great places to learn more about them:

Service Mesh 101

In the Istio model, Istio provides the mesh, and applications participate in it via a sidecar proxy—Envoy, in Istio’s case.

Your First Mesh

DZone has a very well-written article about standing up your first Java application in Kubernetes to participate in an Istio-powered service mesh. The article goes into detail about deploying Istio itself in Kubernetes (in this case, Minikube). For an AppDev team, the new piece would be creating the all-important routing rules, which are deployed to Istio.

Which One of these Meshes?

The New Stack has a very good article comparing the pros and cons of the major service mesh providers. The post lays out the problem in granular detail and discusses which factors to consider when determining whether your organization is even ready for a service mesh.

Increasing Importance of AppDynamics

With the advent of the service mesh, barriers are falling, enabling services to communicate more consistently, especially in production environments.

If tweaks are needed to the routing rules—a timeout, for example—it’s best to be able to pinpoint which remote calls make the most sense for the change. AppDynamics can examine service endpoints, which provides much-needed data for these tweaks.

As for the service mesh itself, AppDynamics in Kubernetes can even monitor the health of the applications deployed on your Kubernetes cluster.

With the rising velocity of new applications being created or broken into smaller pieces, AppDynamics can help make sure all of these components are humming at their optimal frequency.

Strangler Pattern: Migrate to Microservices from a Monolithic App

In my 20-plus years in the software industry, I’ve worn a lot of hats: developer, DBA, performance engineer and—for the past 10 years prior to joining AppDynamics—software architect. I’ve been coding since sixth grade and have seen some pretty dramatic changes over the years, from punch cards and 8-inch floppies to DevOps and microservices.

This may surprise you, but during my career I’ve spent more time fixing broken software than building new and innovative applications. I’ve encountered pretty much every variety of enterprise software snafu, many requiring time-consuming fixes I had to do manually. There was a silver lining to this pain, however: I learned a lot about what does and doesn’t work in software development and deployment. Below are some observations drawn from my experiences in the field.

Enter the Strangler

You may already be familiar with the “Strangler Pattern,” an architectural framework for updating or modernizing software and enterprise applications. While the concept isn’t new—esteemed author Martin Fowler was discussing the pattern way back in 2004—it’s even more relevant for today’s microservices- and DevOps-focused organizations.

Essentially, the term is a metaphor for modern software development. The strangler tree, or strangler fig, is the popular name for a variety of tropical and subtropical plant species. Its seeds sprout in the canopy of a host tree and send roots down to the ground, enveloping and sometimes killing the host, and shrouding the carcass of the original tree under a thick lattice of vines.

This “strangler” effect is not unlike what an organization experiences when transitioning from a monolithic legacy application to microservices—breaking apart pieces of the monolith into smaller, modular components that can be built and deployed faster. While the enterprise version of the strangler tree won’t kill off its host entirely—some legacy functions won’t transfer to microservices and must remain—the strategy is essential for any organization striving for agile development.

A Hybrid Approach

The Strangler Pattern is a representation of agility within the enterprise. If you’re moving in the agile direction or doing a legacy modernization, you’re using the Strangler, whether you realize it or not.

The pattern helps software developers, architects, and the business side align the direction of their legacy transition. Anytime you hear “cloud hybrid,” “hybrid cloud” or “on-prem plus cloud,” it’s a Strangler Pattern, as the organization must maintain connectivity between its legacy application and the microservices it’s pushing to the cloud.
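
As a minimal sketch in plain Java (the names and paths are hypothetical, used only for illustration), that connective tissue is often a “strangler facade” that routes already-migrated paths to the new microservices and sends everything else to the legacy application:

    import java.util.Set;

    // Minimal strangler-facade sketch (hypothetical names and paths): the facade
    // sits in front of both worlds, sending carved-out paths to the new
    // microservices while everything else still reaches the legacy monolith.
    public class StranglerFacade {

        interface Backend { String handle(String path, String payload); }

        private final Backend legacyMonolith;
        private final Backend cloudMicroservices;

        // Paths already migrated; this set grows as the vines spread.
        private final Set<String> migratedPrefixes = Set.of("/invoices", "/catalog");

        StranglerFacade(Backend legacyMonolith, Backend cloudMicroservices) {
            this.legacyMonolith = legacyMonolith;
            this.cloudMicroservices = cloudMicroservices;
        }

        String route(String path, String payload) {
            boolean migrated = migratedPrefixes.stream().anyMatch(path::startsWith);
            Backend target = migrated ? cloudMicroservices : legacyMonolith;
            return target.handle(path, payload);
        }
    }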

When enterprises start their agile journey, they soon realize there’s a huge cost in trying to reverse-engineer legacy code—much of it written for mainframes in COBOL, C++, or older versions of .NET and Java—to make it smaller and more modular. They also discover that the hybrid, or strangler, approach, while certainly not easy, is easier than trying to rewrite everything.

Agile Enterprise vs. Agile Development

Developers are very quick to adopt agile. This may seem like a good thing, but it’s actually one of the core problems in organizations I’ve worked with over the years: developers are agile-ready, but the organization is not.

For a legacy transition to work, the focus must be on the agile enterprise, not just agile development. Processes must be in place to determine requirements, mock up screens, hash out wireframes, and generally move things along. Some businesses have this down—Google, Amazon and Netflix come to mind—but many companies don’t have these processes in place. Rather, they jump in head first, quickly going to microservices because of the buzz, without really considering the implications of what this will mean to their organizational requirements. The catalyst may be a new CTO coming in and saying, “Let’s move to the cloud.”

But a poorly conceived microservices transition, one where the entire enterprise neither embraces the agile philosophy nor understands what it means to go to a microservices strategy, can have disastrous consequences.

Bad App, Big Bill

Developers, DevOps and infrastructure folks usually understand what it means to adopt a microservices strategy, but the business side doesn’t always get it.

There are a lot of misconceptions about what microservices are and what they can do. For the Strangler Pattern to work, you need a comprehensive understanding of the potential impacts of a cloud transition.

I’ve seen situations where an application ran great on a developer’s local desktop, where it wasn’t a problem if the app woke every few seconds, checked a database, went back to sleep, and repeated this process over and over. But a month after pushing this app to the cloud, the company got a bill from its cloud provider for thousands of dollars of CPU time. Clearly, no one at the company considered the ramifications of porting this app directly to the cloud, rather than optimizing it for the cloud.
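
For illustration, here is a rough sketch in plain Java (hypothetical names, not the actual application) of the kind of loop that is harmless on a developer’s desktop but keeps compute billing running around the clock once it lands on pay-per-use cloud infrastructure:

    // Rough sketch of the costly pattern (hypothetical names): the worker never
    // truly idles, so on pay-per-use infrastructure it accrues compute charges
    // around the clock. A queue- or event-triggered design would let the platform
    // scale this work to zero between requests.
    public class OrderPollingWorker {

        interface OrderRepository {
            boolean hasPendingOrders();      // wakes up and queries the database
            void processPendingOrders();
        }

        void run(OrderRepository orders) throws InterruptedException {
            while (true) {
                if (orders.hasPendingOrders()) {
                    orders.processPendingOrders();
                }
                Thread.sleep(5_000);         // sleep a few seconds, then do it all again, forever
            }
        }
    }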

The moral? There are different approaches for different cloud models, migrations and microservice strategies. It’s critically important for your organization to understand the pros and cons of each approach, and how you as a developer or architect can work with the organization’s agile enterprise strategy.

Lift-and-Shift: A Fairy Tale

When attempting a “lift and shift”—moving an entire application or operation to the cloud—companies often adopt a methodical approach. They start with static services that can be moved easily, and services that don’t contain sensitive company data or personal customer information.

The ultimate goal of lift-and-shift is to move everything to the cloud, but in my experience that’s a fairy tale. It’s aspirational but not achievable: You’re either building for the cloud from the ground up or lifting some services, usually the ones easiest to shift. Whenever I mention “lift and shift” to developers, architects and customers, they usually laugh because they’ve gone through enough pain to understand it’s not entirely possible, and that their organization will be in a hybrid or transitional state for an extended period of time.

If you lift-and-shift an application that runs great on-prem, it’s likely to suddenly start burning resources and force you to scale unnecessarily, because the code was written to perform in an environment very different from the one it’s now in. The Strangler Pattern again comes into play: understand which elements of your application or service are a natural fit for the cloud (those with elasticity requirements or unpredictable scale, for example) and move them first. You can then move the remaining pieces more easily into a cloud environment that behaves more predictably.

Putting It All Together

There are plenty of challenges that come with cloud migration. Your enterprise, top to bottom, must be ready for the move. If you haven’t invested in a DevOps strategy along with the people capable of executing it, codifying all the dependencies and deployment options to make an application run efficiently in the cloud, you’ll likely find your team pushing bug fixes all day and troubleshooting problems, rather than being agile and developing code and features that help users.

The ability to monitor your environment is fundamentally important as well. Being able to see the performance of your services, to quickly root-cause a breakdown in communication between microservices and their dependencies, is absolutely critical to an effective strategy.

Without an agile transformation, you’ll never truly achieve a microservices architecture. That’s why the Strangler Pattern is a good approach. You can’t just say one day, “We’re going agile and into the cloud,” and the whole organization replies, “Okay, we’re ready!”

You’ll meet resistance. Processes will have to be rewritten, and code and deployment pipelines will need to change. It’s a huge undertaking.

The Strangler’s piecemeal approach makes a lot more sense. You don’t want to learn that you screwed up your continuous integration and deployment pipeline on 500 new microservices. It’s much wiser to start with two or three services and learn what works and what doesn’t.

I’ve experienced first-hand all of the software migration problems I’ve described above. I’ve repaired them manually, which was time-consuming and painful. The silver lining is that I learned a lot about fixing broken software, and now I’m able to share these hard-earned lessons.

In conclusion, be sure to implement a DevOps strategy for continuous integration and deployment, as well as good monitoring and logging systems to understand your performance.

The AppD Approach: Deployment Options for .NET Microservices Agent

There are numerous ways to develop .NET applications, and several ways to run them. As the landscape expands for .NET development—including advances in .NET Core with its cross-platform capabilities, self-contained deployments, and even the ability to run an ASP.NET Core app on a Raspberry Pi with the upcoming .NET Core 2.1 ARM32 support—it’s only fitting that AppDynamics should advance its abilities to monitor this new landscape.

One of these advancements is our new .NET Microservices Agent. Like .NET Core, this agent has evolved to become more portable and easier to use, providing more value to our customers who monitor .NET Core applications. Its portability and refinement enable a couple of installation options, both of which align closely with the movement to host .NET applications in the cloud, the development of microservices, and the growing use of containers. This flexibility in deployment was a requirement of our customers, who had concerns about the one-size-fits-all deployment options of some of our competitors. These deployment methods include:

  • Installing via the AppDynamics Site Extension in Azure

  • Installing via the NuGet package bundled with the application

Each method has its advantages and disadvantages:

AppDynamics Site Extension

    • Advantage: Azure Site Extension is an easy deployment method that decouples the AppDynamics agent from the code. A couple of clicks and some basic configuration settings and—voila!—an Azure App Service has an AppDynamics monitoring solution.

    • Disadvantage: It is an Azure App Service-only option. Should the application need to be moved to another service such as Azure Service Fabric, a different installation method would be needed.

AppDynamics NuGet Package

    • Advantage: The NuGet package installation method is extremely versatile. Since it’s bundled with the application, wherever the application goes, the agent and monitoring go too. It’s an excellent option for microservices and containers.

    • Disadvantage: Its biggest advantage is also a drawback, as coupling the agent with the application increases operational requirements. Agent updates, for instance, would require small configuration changes and redeployments.

The Easy Option: AppDynamics Site Extension

Azure provides the ability to add Site Extensions, a simple way to add functionality and tooling to an Azure App Service.

In the case of AppDynamics’ .NET Microservices Agent, Site Extensions is a wonderful deployment method that allows you to set up monitoring on an Azure App Service without having to modify your application. This method is great for an operations team that either wants to monitor an existing Azure App Service without deploying new bits, or decouple the monitoring solution from the application.

The installation and configuration of the AppDynamics Site Extension is simple:

  1. Add the Site Extension to the App Service from the Site Extension Gallery.

  2. Launch the Controller Configuration Form and set up the Agent.

As always, Azure provides multiple ways to do things. Let’s break down these simple steps and show installation from two perspectives: from the Azure Portal, and from the Kudu service running on the Azure App Service Control Manager site.

Installing the Site Extension via the Azure Portal

The Azure Portal provides a very easy method to install the AppDynamics Site Extension. As the Portal is the most common interface when working with Azure resources, this method will feel the most comfortable.

Step 1: Add the Site Extension

  • Log into the Azure Portal at https://portal.azure.com and navigate to the Azure App Service to install the AppDynamics Site Extension.

  • In the menu sidebar, click the Extensions option to load the list of currently installed Site Extensions for the Azure App Service. Click the Add button near the top of the page (see below) to load the Site Extension Gallery, where you can search for the latest AppDynamics Site Extension.

  • In the “Add extension” blade, select the AppDynamics Site Extension to install.
    (The Portal UI is not always the most friendly. If you hover over the names, a tooltip should appear showing the full extension name.)

  • After choosing the extension, click OK to accept the legal terms, and OK again to finish the selection. Installation will start, and after a moment the AppDynamics Site Extension will be ready to configure.

Step 2: Launch and Configure

  • To configure the AppDynamics Agent, click the AppDynamics Site Extension to bring up the details blade, and then click the Browse button at the top. This will launch the AppDynamics Controller Configuration form for the agent.

  • Fill in the configuration settings from your AppDynamics Controller, and click the Validate button. Once the agent setup is complete, monitoring will start.

  • Now add some load to the application. In a few moments, the app will show up in the AppDynamics Controller.

Installing the Site Extension via Kudu

Every Azure App Service is created with a secondary site running the Kudu service, which you can learn more about in the projectkudu repository on GitHub. The Kudu service is a powerful tool that gives you a behind-the-scenes look at your Azure App Service. It’s also where Site Extensions run. Installing the AppD Site Extension from the Kudu service is just as simple as from the Azure Portal.

Step 1: Add Site Extension

  • Log in to the Azure Portal at https://portal.azure.com and navigate to the Azure App Service to install the AppDynamics Site Extension.

  • The Kudu service is easy to access via the Advanced Tools selection on the App Service sidebar.

  • Another option is to log in directly to the secondary site’s URL by inserting “.scm” before the “.azurewebsites.net” domain. For example: http://appd-appservice-example.azurewebsites.net becomes http://appd-appservice-example.scm.azurewebsites.net. (You can read more about accessing the Kudu service in the projectkudu wiki.)

  • On the Kudu top menu bar, click the Site Extensions link to view the currently installed Site Extensions. To access the Site Extension Gallery, click the Gallery tab.

  • A simple search for “AppDynamics” will bring up all the available AppDynamics Site Extensions. Simply click the add “+” icon on the Site Extension tile to install.

  • On the “terms acknowledgement” dialog pop-up, click the Install button.

  • Finish the setup by clicking the “Restart Site” button on the upper right. This will restart the SCM site and prepare the AppDynamics Controller Configuration form.

Step 2: Launch and Configure

  • Once the restart completes, click the “Launch” icon (play button) on the Site Extension tile. This will launch the AppDynamics Controller Configuration form.

  • Follow the same process as before by filling in the details and clicking the Validate button.

  • The agent is now set up, and AppDynamics is monitoring the application.

AppDynamics Site Extension in Kudu Debug Console

One of the advantages of the Kudu service is the ability to use the Kudu Debug Console to locate App Service files, including the AppDynamics Site Extension installation and AppDynamics Agent log files. Should the Agent need configuration changes, such as adding a “tier” name, you can use the Kudu Debug Console to locate the AppDynamicsConfig.json file and make the necessary modifications.

The Versatile Option: AppDynamics NuGet Packages

The NuGet package installation option is the most versatile deployment method, as the agent is bundled with the application. Wherever the application goes, the agent and monitoring solutions go too. This method is great for monitoring .NET applications running in Azure Service Fabric and Docker containers.

AppDynamics currently has four separate NuGet packages for the .NET Microservices Agent, and each is explained in greater detail in the AppDynamics documentation. Your choice of package should be based on where your application will be hosted, and which .NET framework you will use.

In the example below, we will use the package best suited for an Azure App Service, for a comparison to the Site Extension.

Installing the AppDynamics App Service NuGet Package

The method for installing a NuGet package will vary by tooling, but for simplicity we will assume a simple web application is open in Visual Studio, and that we’re using Visual Studio to manage NuGet packages. If you’re working with a more complex solution with multiple applications bundled together, NuGet package installation will vary by project deployment.

Step 1: Getting the Correct Package

  • On the web app project, right-click and bring up the context menu. Locate and click “Manage NuGet Packages…”.  This should bring up the NuGet Package Manager, where you can search for “AppDynamics” under the Browse tab.  

  • Locate the correct package—in this case, the “AppService” option—select the appropriate version and click Install.

  • Do a build of your project to add the AppDynamics directory to your project.

  • The agent is now installed and ready to configure.

Step 2: Configure the Agent

  • Locate the AppDynamicsConfig.json in the AppDynamics directory and fill in the Controller configuration information.

  • Publish the application to Azure and add some load to the application to test if monitoring was set up properly.

I hope these steps give you an overview of how easy it is to get started with our .NET Microservices Agent. Make sure to review our official .NET Microservices Agent and Deploy AppDynamics for Azure documentation for more information.