Monitoring Kubernetes and OpenShift with AppDynamics

Here at AppDynamics, we build applications for both external and internal consumption. We’re always innovating to make our development and deployment process more efficient. We refactor apps to get the benefits of a microservices architecture, to develop and test faster without stepping on each other, and to fully leverage containerization.

Like many other organizations, we are embracing Kubernetes as a deployment platform. We use both upstream Kubernetes and OpenShift, Red Hat’s enterprise Kubernetes distribution. The Kubernetes framework is very powerful. It allows massive deployments at scale, simplifies new version rollouts and multi-variant testing, and offers many levers to fine-tune the development and deployment process.

At the same time, this flexibility makes Kubernetes complex in terms of setup, monitoring and maintenance at scale. Each of the Kubernetes core components (api-server, kube-controller-manager, kubelet, kube-scheduler) has quite a few flags that govern how the cluster behaves and performs. The default values may be OK initially for smaller clusters, but as deployments scale up, some adjustments must be made. We have learned to keep these values in mind when monitoring OpenShift clusters—both from our own pain and from published accounts of other community members who have experienced their own hair-pulling discoveries.

It should come as no surprise that we use our own tools to monitor our apps, including those deployed to OpenShift clusters. Kubernetes is just another layer of infrastructure. Along with the server and network visibility data, we are now incorporating Kubernetes and OpenShift metrics into the bigger monitoring picture.

In this blog, we will share what we monitor in OpenShift clusters and give suggestions as to how our strategy might be relevant to your own environments. (For more hands-on advice, read my blog Deploying AppDynamics Agents to OpenShift Using Init Containers.)

OpenShift Cluster Monitoring

For OpenShift cluster monitoring, we use two plug-ins that can be deployed with our standalone machine agent. AppDynamics’ Kubernetes Events Extension, described in our blog on monitoring Kubernetes events, tracks every event in the cluster. Kubernetes Snapshot Extension captures attributes of various cluster resources and publishes them to the AppDynamics Events API. The snapshot extension collects data on all deployments, pods, replica sets, daemon sets and service endpoints. It captures the full extent of the available attributes, including metadata, spec details, metrics and state. Both extensions use the Kubernetes API to retrieve the data, and can be configured to run at desired intervals.

The data these plug-ins provide ends up in our analytics data repository and instantly becomes available for mining, reporting, baselining and visualization. The data retention period is at least 90 days, which offers ample time to go back and perform an exhaustive root cause analysis (RCA). It also allows you to reduce the retention interval of events in the cluster itself. (By default, this is set to one hour.)

We use the collected data to build dynamic baselines, set up health rules and create alerts. The health rules, baselines and aggregate data points can then be displayed on custom dashboards where operators can see the norms and easily spot any deviations.

An example of a customizable Kubernetes dashboard.

What We Monitor and Why

Cluster Nodes

At the foundational level, we want monitoring operators to keep an eye on the health of the nodes where the cluster is deployed. Typically, you would have a cluster of masters, where core Kubernetes components (api-server, controller-manager, kube-scheduler, etc.) are deployed, as well as a highly available etcd cluster and a number of worker nodes for guest applications. To paint a complete picture, we combine infrastructure health metrics with the relevant cluster data gathered by our Kubernetes data collectors.

From an infrastructure point of view, we track CPU, memory and disk utilization on all the nodes, and also zoom into the network traffic on etcd. In order to spot bottlenecks, we look at various aspects of the traffic at a granular level (e.g., reads/writes and throughput). Kubernetes and OpenShift clusters may suffer from memory starvation, disks overfilled with logs, or spikes in consumption of the API server and, consequently, of etcd. Ironically, monitoring solutions themselves are often known for bringing clusters down by pulling excessive amounts of information from the Kubernetes APIs. It is always a good idea to establish how much monitoring is enough and dial it up only when necessary to diagnose issues further. If a high level of monitoring is warranted, you may need to add more masters and etcd nodes. Another useful technique, especially in large-scale implementations, is to have a separate etcd cluster just for storing Kubernetes events. This way, spikes in event creation and event retrieval for monitoring purposes won’t affect the performance of the main etcd instances. This can be accomplished by setting the --etcd-servers-overrides flag of the api-server, for example:

--etcd-servers-overrides=/events#https://etcd1.cluster.com:2379;https://etcd2.cluster.com:2379;https://etcd3.cluster.com:2379

From the cluster perspective, we monitor resource utilization across the nodes that allow pod scheduling. We also keep track of pod counts and visualize how many pods are deployed to each node and how many of them are bad (failed or evicted).

A dashboard widget with infrastructure and cluster metrics combined.

Why is this important? Kubelet, the component responsible for managing pods on a given node, has a setting, --max-pods, which determines the maximum number of pods that can be orchestrated. In Kubernetes the default is 110; in OpenShift it is 250. The value can be changed up or down depending on need. We like to visualize the remaining headroom on each node, which helps with proactive resource planning and prevents sudden overflows (which could mean an outage). Another data point we add to this view is the number of evicted pods per node.
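
As an illustration (not taken from our production configs), the cap can be set in a kubelet configuration file; depending on your distribution and version, the same setting may instead be exposed as the --max-pods flag or through a node config file:

# Illustrative KubeletConfiguration fragment; maxPods caps how many pods
# the kubelet will schedule on this node. The value here is only an example.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250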

Pod Evictions

Evictions are caused by disk-space or memory starvation. We recently had an issue with the disk space on one of our worker nodes due to a runaway log. As a result, the kubelet produced massive evictions of pods from that node. Evictions are bad for many reasons. They will typically affect the quality of service or may even cause an outage. If the evicted pods have an exclusive affinity with the node experiencing disk pressure, and as a result cannot be re-orchestrated elsewhere in the cluster, the evictions will result in an outage. Evictions of core component pods may lead to a meltdown of the cluster.

Long after the incident in which the pods were evicted, we saw that the evicted pods were still lingering. Why was that? Garbage collection of evictions is controlled by a setting in kube-controller-manager called --terminated-pod-gc-threshold. The default value is set to 12,500, which means that garbage collection won’t occur until you have that many evicted pods. Even in a large implementation, it may be a good idea to dial this threshold down to a smaller number.
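
As a hypothetical sketch (the value is an example, not our production setting), the threshold can be lowered where the kube-controller-manager command line is defined, for instance in its static pod manifest:

# Hypothetical fragment of a kube-controller-manager static pod manifest.
# Lowering --terminated-pod-gc-threshold makes garbage collection of
# terminated pods kick in far sooner than the 12,500 default.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --terminated-pod-gc-threshold=100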

If you experience a lot of evictions, you may also want to check whether kube-scheduler has a custom --policy-config-file defined that omits the CheckNodeMemoryPressure or CheckNodeDiskPressure predicates.

Following our recent incident, we set up a new dashboard widget that tracks a metric covering threats that could lead to a cluster meltdown (e.g., massive evictions). We also associated a health rule with this metric and set up an alert. Specifically, we’re now looking for warning events that tell us when a node is about to experience memory or disk pressure, or when a pod cannot be reallocated (e.g., NodeHasDiskPressure, NodeHasMemoryPressure, ErrorReconciliationRetryTimeout, ExceededGracePeriod, EvictionThresholdMet).

We also look for daemon pod failures (FailedDaemonPod), as they are often associated with cluster health rather than issues with the daemon set app itself.

Pod Issues

Pod crashes are an obvious target for monitoring, but we are also interested in tracking pod kills. Why would someone be killing a pod? There may be good reasons for it, but it may also signal a problem with the application. For similar reasons, we track deployment scale-downs, which we do by inspecting ScalingReplicaSet events. We also like to visualize the scale-down trend along with the app health state. Scale-downs, for example, may happen by design through auto-scaling when the app load subsides. They may also be issued manually or in error, and can expose the application to an excessive load.

Pending state is supposed to be a relatively short stage in the lifecycle of a pod, but sometimes it isn’t. It may be a good idea to track pods with a pending time that exceeds a certain, reasonable threshold—one minute, for example. In AppDynamics, we also have the luxury of baselining any metric and then tracking any configurable deviation from the baseline. If you catch a spike in pending-state duration, the first thing to check is the size of your images and the speed of image download. One big image may clog the pipe and affect other containers. The kubelet has a flag, --serialize-image-pulls, which is set to “true” by default, meaning images are pulled one at a time. Change the flag to “false” if you want to pull images in parallel and avoid the potential clogging by a monster-sized image. Keep in mind, however, that you have to use Docker’s overlay2 storage driver to make this work; in newer Docker versions this storage driver is the default. In addition to the kubelet setting, you may also need to tweak the max-concurrent-downloads flag of the Docker daemon to ensure the desired parallelism.

Large images that take a long time to download may also cause a different type of issue that results in a failed deployment. The kubelet flag --image-pull-progress-deadline determines how long an image pull can go without progress before it is canceled for taking “too long to pull or extract.” If you deal with big images, make sure you dial up the value of this flag to fit your needs.
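
As a minimal sketch (assuming a kubelet that reads a KubeletConfiguration file; your distribution may expose these settings as command-line flags instead), the image-pull behavior discussed above might be adjusted like this:

# Illustrative kubelet setting only; adjust to your environment.
# --image-pull-progress-deadline is typically passed as a kubelet flag,
# and Docker-side parallelism is tuned via max-concurrent-downloads in
# /etc/docker/daemon.json.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false   # pull images in parallel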

User Errors

Many big issues in the cluster stem from small user errors (human mistakes). A typo in a spec—for example, in the image name—may bring down the entire deployment. Similar effects may occur due to a missing image or insufficient rights to the registry. With that in mind, we track image errors closely and pay attention to excessive image-pulling. Unless it is truly needed, image-pulling is something you want to avoid in order to conserve bandwidth and speed up deployments.

Storage issues also tend to arise due to spec errors, lack of permissions or policy conflicts. We monitor storage issues (e.g., mounting problems) because they may cause crashes. We also pay close attention to resource quota violations because they do not trigger pod failures. They will, however, prevent new deployments from starting and existing deployments from scaling up.

Speaking of quota violations, are you setting resource limits in your deployment specs?

Policing the Cluster

On our OpenShift dashboards, we display a list of potential red flags that are not necessarily a problem yet but may cause serious issues down the road. Among these are pods without resource limits or health probes in the deployment specs.

Resource limits can be enforced by resource quotas across the entire cluster or at a more granular level. Violation of these limits will prevent the deployment. In the absence of a quota, pods can be deployed without defined resource limits. Having no resource limits is bad for multiple reasons. It makes cluster capacity planning challenging. It may also cause an outage. If you create or change a resource quota when there are active pods without limits, any subsequent scale-up or redeployment of these pods will result in failures.
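
For reference, here is a hedged sketch of what explicit requests and limits look like in a deployment spec; the names and numbers are placeholders, not a sizing recommendation:

# Fragment of a Deployment pod template with explicit resource settings.
spec:
  containers:
  - name: example-app            # placeholder name
    image: example/app:1.0       # placeholder image
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi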

The health probes, readiness and liveness, are not enforceable, but it is a best practice to have them defined in the specs. They are the primary mechanism for the pods to tell the kubelet whether the application is ready to accept traffic and is still functioning. If the readiness probe is not defined and the pod takes a long time to initialize (based on the kubelet’s default), the pod will be restarted. This loop may continue for some time, taking up cluster resources for no reason and effectively causing a poor user experience or an outage.

The absence of a liveness probe may have a similar effect if the application is performing a lengthy operation and the pod appears unresponsive to the kubelet.
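
A minimal sketch of both probes in a container spec, assuming the application exposes HTTP health endpoints; the paths, port and timings are placeholders to adapt to your app:

# Container fragment with readiness and liveness probes defined.
containers:
- name: example-app
  image: example/app:1.0
  readinessProbe:               # is the app ready to receive traffic?
    httpGet:
      path: /healthz/ready
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 5
  livenessProbe:                # is the app still alive?
    httpGet:
      path: /healthz/live
      port: 8080
    initialDelaySeconds: 30
    periodSeconds: 10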

We provide easy access to the list of pods with incomplete specs, allowing cluster admins to have a targeted conversation with development teams about corrective action.

Routing and Endpoint Tracking

As part of our OpenShift monitoring, we provide visibility into potential routing and service endpoint issues. We track unused services, including those created by someone in error and those without any pods behind them because the pods failed or were removed.

We also monitor bad endpoints pointing at old (deleted) pods, which effectively cause downtime. This issue may occur during rolling updates when the cluster is under increased load and API request throttling is set lower than it needs to be. To resolve the issue, you may need to increase the --kube-api-burst and --kube-api-qps config values of kube-controller-manager.
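
As a hypothetical example (the numbers are illustrative, not tuning advice), the two flags are raised where kube-controller-manager is configured, for instance in its static pod manifest:

# Fragment of a kube-controller-manager manifest raising the client-side
# throttling limits used when talking to the API server.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --kube-api-qps=50      # sustained requests per second
    - --kube-api-burst=100   # short-term burst allowance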

Every metric we expose on the dashboard can be viewed and analyzed in the list and further refined with ADQL, the AppDynamics query language. After spotting an anomaly on the dashboard, the operator can drill into the raw data to get to the root cause of the problem.

Application Monitoring

Context plays a significant role in our monitoring philosophy. We always look at application performance through the lens of the end-user experience and desired business outcomes. Unlike specialized cluster-monitoring tools, we are not only interested in cluster health and uptime per se. We’re equally concerned with the impact the cluster may have on application health and, subsequently, on the business objectives of the app.

In addition to having a cluster-level dashboard, we also build specialized dashboards with a more application-centric point of view. There we correlate cluster events and anomalies with application or component availability, end-user experience as reported by real-user monitoring, and business metrics (e.g., conversion of specific user segments).

Leveraging K8s Metadata

Kubernetes makes it super easy to run canary deployments, blue-green deployments, and A/B or multivariate testing. We leverage these conveniences by pulling deployment metadata and using labels to analyze performance of different versions side by side.

Monitoring Kubernetes or OpenShift is just a part of what AppDynamics does for our internal needs and for our clients. AppDynamics covers the entire spectrum of end-to-end monitoring, from the foundational infrastructure to business intelligence. By its nature, AppDynamics is used by many different groups of operators who may have very different skills. That’s why we look at the platform as a collaboration tool that helps translate the language of APM into the language of Kubernetes and vice versa.

By bringing these different datasets together under one umbrella, AppDynamics establishes a common ground for diverse groups of operators. On the one hand you have cluster admins, who are experts in Kubernetes but may not know the guest applications in detail. On the other hand, you have DevOps in charge of APM or managers looking at business metrics, both of whom may not be intimately familiar with Kubernetes. These groups can now have a productive monitoring conversation, using terms that are well understood by everyone and a single tool to examine data points on a shared dashboard.

Learn more about how AppDynamics can help you monitor your applications on Kubernetes and OpenShift.

How Top Investment Banks Accelerate Transaction Time and Avoid Performance Bottlenecks

A complex series of interactions must take place for an investment bank to process a single trade. From the moment it’s placed by a buyer, an order is received by front-office traders and passed through to middle- and back-office systems that conduct risk management checks, matchmaking, clearing and settlement. Then the buyer receives the securities and the seller the corresponding cash. Once complete, the trade is sent to regulatory reporting, which ensures the transaction was processed under the right regulatory requirements. One AppDynamics customer, a major financial firm, utilizes thousands of microservices to complete this highly complex task countless times throughout the day.

To expedite this process, banks have implemented straight-through processing (STP), an initiative that allows electronically entered information to move between parties in the settlement process without manual intervention. But one of the banks’ biggest concerns with STP is the difficulty of following trades in real-time. When trades get stuck, manual intervention is needed, often impacting service level agreements (SLAs) and even trade reconciliation processes. One investment firm, for instance, told AppDynamics that approximately 20% of its trades needed manual input to complete what should have been a fully automated process—a bottleneck that added significant overhead and resource requirements. And with trade volumes increasing 25% year over year, the company needed a fresh approach to help manage its rapid growth.

AppDynamics’ Business Transactions (BTs) enabled the firm to track and follow trades in real time, end-to-end through its systems. A BT traces through all the necessary systems and microservices—applications, databases, third-party APIs, web services, and so on—needed to process and respond to a request. In investment banking, a BT may include everything from placing an order, completing risk checks or calculations, and booking and confirming different types of trades, to post-trade actions such as clearing, settlement and regulatory reporting.

The AppDynamics Business Journey takes this one step further by following a transaction across multiple BTs; for example, following an individual trade from order through capture and then to downstream reporting. The Business Journey provides true end-to-end, time-based tracking against SLAs, and traces the transaction across each step to monitor performance and ensure completion.

Once created, the Business Journey allows you to visualise key metrics with out-of-the-box dashboards.

Real-Time Tracking with Dashboards

Prior to AppDynamics, one investment bank struggled to track trades in real time. They were doing direct queries on the database to find out how many trades had made it downstream to the reporting database. This method was slow and inefficient, requiring employees to create and share small Excel dashboards, which lacked real-time trade information. AppDynamics APM dashboards, by comparison, enabled them to get a real-time, high-level overview of the health and performance of their system.

After installing AppDynamics, the investment bank instrumented a dashboard to show all the trades entering its post-trade system throughout the day. This capability proved hugely beneficial in helping the firm monitor trading spikes and ensure it was meeting its SLAs. And Business IQ performance monitoring made it possible to slice and dice massive volumes of incoming trades to gain real-time insights into where the transactions were coming from (i.e., which source system), their value, whether they met the SLAs, and which ones failed to process. Additionally, AppDynamics Experience Level Management provided the ability to report compliance against specific processing times.

Now the bank could automate complex processes and remove inefficient manual systems. Prior to AppDynamics, there was a team dedicated to overseeing more than 200 microservices. They had to determine why a particular trade failed, and then pass that information on to the relevant business teams for follow-up to avoid losing business. But too often a third-party source would send invalid data, or update its software and send a trade in an updated format unfamiliar to the bank’s backend system, creating a logistical mess too complex for one human to manage. With Business IQ, the bank was able to immediately spot and follow up on invalid trades.

Searching for Trades Across All Applications

Microservices offer many advantages but can bring added complexity as well. The investment bank had hundreds of microservices but lacked a fast and efficient way to search for an individual trade. In the event of a problem, they would take the trade ID and look into the log files of multiple microservices. On average, they had to open up some 40 different log files to locate a problem. And although the firm had an experienced support staff that knew the applications well, this manual process wasn’t sustainable as newer, inexperienced support people were brought onboard. Nor would this system scale as trade volume increased.

By using Business IQ to monitor every transaction across all microservices, the bank was able to easily monitor individual trade transactions throughout the lifecycle. And by capturing the trade ID, as well as supplementary data such as the source, client, value and currency, they could then go into AppDynamics Application Analytics and very quickly identify specific transactions. For example, they could enter the trade ID and see every transaction for the trade across the entire system.

This feature was particularly loved by the support staff, who now had immediate access to all of a trade’s interactions within a single screen, as well as the ability to easily drill down to find the root cause of a failed transaction.

Tracking Regulatory SLAs in Real Time

Prior to AppDynamics, our customer didn’t have an easy way to track the progress of a trade in real time. Rather, they were manually verifying that trades were successfully being sent to regulatory reporting systems, as well as ensuring that this was completed within the required timeframe. This was difficult to do in real time, meaning that when there was an issue, often it was not found until after the SLA had been breached. With AppDynamics they were able to set up a dashboard to visualise data in real time; the team then set up a health rule to indicate if trade reporting times were approaching the SLA. They also configured an alert that enabled them to proactively see and resolve any issues ahead of an SLA breach.

Proactively Tracking Performance after Code Releases

The bank periodically introduces new functionality to meet the latest business or regulatory requirements, in particular MiFID II, introduced to improve investor protection across Europe by harmonizing the rules for all firms with EU clients. Currently, new releases happen every week, but this rate will continue to increase. These new code releases introduce risk, as previous releases have either had a negative impact on system performance or have introduced new defects. In one two-month period, for instance, the time required to capture a trade increased by about 20%. If this continued, the bank would have had to scale out hugely—buying new hardware at significant cost—to avoid breaching its regulatory SLA.

The solution was to create a comparative dashboard in AppDynamics that showed critical Business Transactions and how they were being changed between releases (response times, errors, and so on). If any metric degraded from the previous version or deviated from a certain threshold, it would be highlighted on the dashboard in a different color, making it easier to decide whether to proceed with a rollout or determine which new feature or change had caused the deviation.

Preventing New Hardware Purchases

After refining its code based on AppDynamics’ insights, the bank saw a dramatic 6X performance improvement. This saved them from having to—in their words—“throw more hardware at the problem” by buying more CPU processing power to push through more trades.

By instrumenting their back office systems with AppDynamics, the bank gained deep insights that enabled them to refine their code. For instance, calls to third-party APIs were taking place unnecessarily and trades were being captured unintentionally within multiple different databases. Without AppDynamics, it’s unlikely this would have been discovered. The insight enabled the bank to make some very simple changes to fine-tune code, resulting in a significant performance improvement and enabling the bank to save money by scaling with their existing hardware profile.

Beneficial Business Outcomes

From the bank’s perspective, one of the greatest gains of going with AppDynamics was the ability to follow a trade through its many complex services, from the moment an order is placed, through capture and down to regulatory reporting. This enabled them to improve system performance, avoid expensive (and unnecessary) hardware upgrades, quickly search trade IDs to locate the root cause of issues, and proactively manage SLAs.

See how AppDynamics can help your own business achieve positive outcomes.

How Anti-Patterns Can Stifle Microservices Adoption in the Enterprise

In my last article, Microservice Patterns That Help Large Enterprises Speed Development, Deployment and Extension, we went over some deployment and communication patterns that help keep microservices manageable as you use more of them. I also promised that in my next post, I’d get into how microservice patterns can become toxic, create more overhead and unreliability, and become an unmanageable mess. So let’s dig in.

First things first: Patterns are awesome.

They help formalize ideas into reusable chunks that you can distribute and communicate easily to your teams. Here are some useful things that patterns do for engineering departments of any size:

  • Make ideas distributable

  • Lower the barriers to success for junior members

  • Create building blocks

  • Create consistency in disparate/complex systems

Patterns are usually an intentional act. When we put patterns in place, we’re making a clear choice and acting on a decision to make things less painful. But not all patterns are helpful. In the case of the anti-pattern, it has the potential to create more trouble for the engineering team and the business.

The Anatomy of An Anti-Pattern

Like patterns, anti-patterns tend to be identifiable and repeatable.

Anti-patterns are rarely intentional, and usually you identify them long after their effects are visible. Individuals in your organization often make well-meaning (if poor) choices in the pursuit of faster delivery, rushed deadlines, and so on. These anti-patterns are often perpetuated by other employees who decide, “Well, this must be how it’s done here.”

In this way, anti-patterns can become a very dangerous norm.

For companies that want to migrate their architecture to microservices, anti-patterns are a serious obstacle to success. That’s why I’d like to share with you a few common anti-patterns I’ve seen repeated many times in companies making the switch to microservices. These moves eventually compromised their progress and created more of the problems they were trying to avoid.

Data Taffy

The first anti-pattern is the most common—and the most subtle—in the chaos and damage it causes.

The Problem

The data taffy anti-pattern can manifest in a few different ways, but the short explanation is that it occurs when all services have full access to all objects in the database.

That doesn’t sound so bad, right?

You want to be able to create complex queries and complex data-ingestion scenarios that go across many domains. So at first glance, it makes sense to have everything call what it needs directly from the database. But that becomes a problem when you need to scale an individual domain in your application. Data rarely grows uniformly across all domains; rather, it grows in bursts on individual domains, and it is often very difficult to predict which domains will grow the fastest. The entangled data becomes a lot like taffy: difficult to pull apart. It stretches and gets stuck in the cogs of the business.

In this scenario, companies will have lots of stored procedures, embedded complex queries across many services, and object relationship managers all accessing the database—each with its own understanding of how a domain is “supposed” to be used. This nearly always leads to data contamination and performance issues.

But there are even bigger challenges, most notably when you need to make structural changes to your database.

Here’s an example based on a real-life experience I had with a large, privately owned company, one that started small and expanded rapidly to service tens of thousands of clients. Say you have a table whose primary key is an int starting at 0. You have 2,147,483,647 objects before you’ll run out of keys—no big deal, right?

So you start building out services, and this table becomes a cornerstone object in your application that every other domain touches in some meaningful way. Before you know it, there are 125 applications calling into this table from queries or stored procedures, totaling some 13,000 references to the table. Because this is a core table it gets a ton of data. Soon you’re at 2,100,000,000 objects with 10,000,000 new records being added daily.

You have four days before things go bad—real bad. (At 10,000,000 new records a day, the remaining 47 million or so keys are gone in well under a week.)

You try adding negative values to buy time, not realizing that half the services have hard-coded rules that IDs must be greater than 0. So you bite the bullet and manually scrub through EVERY SERVICE to find every instance of every object that has been created that uses this data, and then update the type from an integer to a large integer. You then have to update several other tables, objects, and stored procedures with foreign key relationships. This becomes a hugely painful effort with all hands on deck frantically trying to keep the company’s flagship product from being DOA.

Clearly, not an ideal scenario for any company.

Now if this domain had been contained behind a single service, you’d know exactly how the table was being used and could find creative solutions to maintain backwards compatibility. After all, you can do a lot with code that you simply can’t do by changing a data value in a database. For instance, in the example above, there are about 800 million available IDs that could be reclaimed and mapped for new adds, which would buy enough time for a long-term plan that doesn’t require a frantic, all-hands-on-deck approach. This could even be combined with a two-key system based on a secondary value used to partition the data effectively. In addition, there’s one partitionable field we could use to give us 10,000x more available integers, as well as a five-year window to create more permanent solutions with no changes to any consuming services.

This is just one anecdote, but I have seen this problem consistently halt scale strategies for companies at crucial times of growth. So try to avoid this anti-pattern.

How to Solve

To solve the data taffy problem, you must isolate data to specific domains that are only accessible via services designed to serve them. The data may start out in the same database, but you use schema and access policy to limit access to a single service. This enables you to change databases, create partitions, or move to entirely new data storage systems without any other service or system having to know or care.

Dependency Disorder

Say you’ve switched to microservices, but deployments are taking longer than ever. You wish you had never tried to break down the monolith. If this sounds familiar, you may be suffering from dependency disorder.

The Problem

Dependency disorder is one of the easiest anti-patterns to detect. If you have to know the exact order that services must be deployed to keep them from failing, it’s a clear signal the dependencies have become nested in a way that won’t scale well. Dependency disorder generally comes from domains calling sideways from one domain’s stack to another (instead of down the stack from the UI to the gateway) and then to the services that enable the gateway. Another big problem resulting from dependency disorder: unknown execution paths that take arbitrarily long times to execute.

How to Solve

An APM solution is a great starting point for resolving dependency disorder problems. Try to utilize a solution that provides a complete topology of your service execution paths. By leveraging these maps, you can make precision cuts in the call chain and refocus gateways to make fan-out calls that execute asynchronously rather than doing sideways calls. For some examples of helpful patterns, check out part one of this series. Ideally, we want to avoid service-to-service calls that create a deep and unmanageable call stack and favor a wider set of calls from the gateway.

Microlith

Microliths are, essentially, well-meaning, cleanly isolated services that take dependency disorder to its maximum entropic state.

The Problem

Imagine having a really well-designed service, database and gateway implementation that you decide to isolate into a container—you feel great! You have a neat-and-tidy set of rules for how data gets stored, managed and scaled.

Believing you’ve reached microservice nirvana, you breathe a sigh of relief and wait for the accolades. Then you notice the releases gradually start taking longer and longer, and that data-coupling is happening in weird ways. You also find yourself deploying nearly the entire suite of services with every deployment, causing testing issues and delays. More and more trouble tickets are coming in each quarter, and before you know it, the organization is ready to scrap microservices altogether.

The perceived promise of microservices is that there are no rules—you just put out whatever you want. The problem is that without a clear definition of how data flows down the stack, you’re basically creating a hybrid of the data taffy and dependency disorder problems.

How to Solve

The mediation process here is effectively the same as with the dependency disorder. If you are working with a full-blown microlith, however, it will take some diligence to get back to stable footing. The best advice I can give is, try to get to a point where you can deploy a commit as soon as it’s in. If your automation and dependency orders are well-aligned, new service features should always be ready to roll out as soon as the developer commits to the code base. Don’t stand on formality. If this process is painful, do it more. Smooth out your automated testing and deployment so that you can reliably get commits deployed to production with no downtime.

Final Thoughts

I hope this gets the wheels spinning in your head about some of the microservices challenges you may be having now or setting yourself up for in the future. This information is all based on my personal experiences, as well as my daily conversations with others in the industry who think about these problems. I’d love to hear your input, too. Reach out via email, chase.aucoin@appdynamics.com; Twitter, https://twitter.com/ChaseAucoin; or LinkedIn, https://www.linkedin.com/in/chaseaucoin/. If you’ve got some other thoughts, I’d love to hear from you!

Microservice Patterns That Help Large Enterprises Speed Development, Deployment and Extension

This is the first in a two-part series on microservice patterns and anti-patterns. In this article, we’ll focus on some useful patterns that, when leveraged, can speed up development, deployment, and extension. In the next article, we’ll focus on how microservice patterns can become toxic, create more overhead and unreliability, and become an unmanageable mess.

Microservice patterns assume a large-scale, enterprise-style environment. If you’re working with only a small set of services (1 to 5), you won’t feel the positive impact as strongly as organizations with 10, 100, or 1,000+ services. My biggest piece of advice for startups and smaller projects is to not overthink things or add complexity for complexity’s sake. Patterns are meant to aid in solving problems—they are not a hammer with which to bludgeon the engineering team. So use them with this in mind.

Open for Extension, Closed for Modification

We’re going to start our patterns talk with a principle rather than a pattern. Software development teams working on microservices get more mileage out of this one principle than most any other pattern or principle. It’s a classic pattern from the SOLID principles of Robert C. Martin (Uncle Bob).

In short, being open for extension and closed for modification means leaving your code open to add new functionality via inheritance but closed for direct modifications. I take a bit of a looser definition that tends to be more pragmatic. My definition is, “Don’t break existing contracts.” It’s fine to add methods but don’t change the signature of existing methods. Nor should you change the functionality of an established method.

Why is This Pattern So Powerful in Microservices?

When you have disparate teams working on different services that have to interoperate with one another, you need a certain level of reliability. I, as a consumer of a service, need to be able to depend on the service in the future, even as new features are added.

How Do We Manifest this Pattern?

Easy. Don’t break your contracts. There’s never a good reason to break an existing production contract. However, there could be lots of good reasons to add to it, thereby making it “open for extension and closed for modification.” For example, if you have to start collecting new data as part of a service, add a new endpoint and set a timeline to deprecate the old service call, but don’t do both in a single step. Likewise with data management: if you need to rename a column of data, just add a new column and leave the old column alone for a while. When you retire an old service, that’s a good time to do any clean-up that goes with the deprecation. If you can adhere to this principle, everyone in your organization will have a better development experience.

Pattern: Enterprise Services with SPA Gateways

When we start building out large applications and moving towards a microservice paradigm, the issue of manageability quickly rises to the surface. We have to address manageability at many layers, and also must consider dependency management. Microservices can quickly become a glued-together mess with tightly coupled dependencies. Instead of having a monolithic “big ball of mud,” we create a “mudslide.”

One way to address these problems is to introduce the notion of Enterprise Domain Services that are responsible for the tasks within different domains in your organization, and then combine that domain-specific logic into more meaningful activities (i.e., product features) at the Single Page Application (SPA) gateway layer. The SPA gateway serves to take some subset of the overall functionality of an application (i.e., a single page worth) and codify that functionality, delegating the “hard parts” (persistence, state management, third-party calls, etc.) off to the associative enterprise services. In this pattern, each enterprise service either owns its own data as a single database, a collection of databases, or as an owned schema as part of a larger enterprise database.

Pattern: SPA Services with Gateway and ETL

Now we are going to ramp up the complexity a bit. One of the big questions people run into when they start down the microservices path is, “How do I join complex data?” In the Enterprise Services with SPA Gateways example above, you would just call into multiple services. This is fine when you’re combining two or three points of data, but what about when you need really in-depth questions answered? How do you find, for instance, all the demographic data for one region’s customers who had invoices for the green version of an item in the second quarter of the year?

This question isn’t incredibly difficult if you have all the data together in a single database. But then you might start violating single responsibility principles pretty fast. The goal here then is to delegate that responsibility to a service that’s really good at just joining data via ETL (Extract, Transform, Load). ETL is a pattern for data warehousing where you extract data from disparate data sources, transform the data into something meaningful to the business, and load the transformed data somewhere else for utilization. The team that owns the domain that will be asking these types of demographic questions will be responsible for the care and feeding of services that perform the ETL, the database or schema where the transformed data is stored, and the services(s) that provide access to it.

Why Not Just Make a Multi-Domain Call at the Database?

This is a fair question, and on a small project it may be reasonable to do so. But on large projects with lots of moving parts, each part must be able to move independently. If we are combining directly at the DB level, we’re pretty much guaranteeing that the data will only ever travel together on that single DB, which is no big deal with small volumes of data. However, once we start dealing with tens, hundreds or thousands of terabytes, this becomes more of a big deal as it greatly impacts the way we scale domains independently. Using ETLs and data warehousing strategies to provide an abstraction layer on the movement and combination of our data might require us to update our ETL if we move the data around. But this feat is much more manageable than trying to untangle thousands of nested, stored procedures across every domain.

Closing Thoughts

Remember, these are just some of the available patterns. Your goal here is to solve problems, not create more problems. If a particular pattern isn’t working well for your purpose, it’s okay to create your own, or mix and match.

One of the easiest ways to get a handle on large projects with lots of domains is to use tools like AppDynamics with automatic mapping functionality to get a better understanding of the dependency graph. This will help you sort out the tangled mess of wires.

Remember, the best way to eat an elephant is one bite at a time.

In my next blog, we’ll look at some common anti-patterns, which at first may seem like really good ideas. But anti-patterns can cause problems pretty quickly and make it difficult to scale and maintain your projects.

Advances In Mesh Technology Make It Easier for the Enterprise to Embrace Containers and Microservices

More enterprises are embracing containers and microservices, which bring along additional networking complexities. So it’s no surprise that service meshes are in the spotlight now. There have been substantial advances recently in service mesh technologies—including Istio 1.0, HashiCorp’s Consul 1.2.1, and Buoyant merging Conduit into Linkerd—and for good reason.

Some background: service meshes are pieces of infrastructure that facilitate service-to-service communication—the backbone of all modern applications. A service mesh allows for codifying more complex networking rules and behaviors such as a circuit breaker pattern. AppDev teams can start to rely on service mesh facilities, and rest assured their applications will perform in a consistent, code-defined manner.

Endpoint Bloom

The more services and replicas you have, the more endpoints you have. And with the container and microservices boom, the number of endpoints is exploding. With the rise of Platform-as-a-Service offerings and container orchestrators, new terms like ingress and egress are becoming part of the AppDev team vernacular. As you go through your containerization journey, multiple questions will arise around the topic of connectivity. Application owners will have to define how and where their services are exposed.

The days of providing the networking team with a context/VIP to add to web infrastructure—such as services.acme.com/shoppingCart over port 443—are fading. Today, AppDev teams are more likely to hand over a Kubernetes YAML manifest to add services.acme.com/shoppingCart to the Ingress controller, and then describe a behavior. For example: the shopping cart Pod needs to talk to the shopping cart validation Pod, which can only be accessed by the shopping cart, because the inventory is kept on another set of Redis Pods, which can’t be exposed to the outside world.
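
For flavor, here is a hedged sketch of the kind of manifest that hand-off might contain; the host, path and port come from the example above, and the Service name and API version are assumptions to adapt to your cluster:

# Illustrative Ingress exposing the shopping cart service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shopping-cart
spec:
  rules:
  - host: services.acme.com
    http:
      paths:
      - path: /shoppingCart
        pathType: Prefix
        backend:
          service:
            name: shopping-cart    # assumed Service name
            port:
              number: 443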

You’re juggling all of this while navigating constraints set by defined and deployed Kubernetes networking. At this point, don’t be alarmed if you’re thinking, “Wow, I thought I was in AppDev—didn’t know I needed a CCNA to get my application deployed!”

The Rise of the Service Mesh

When navigating the “fog of system development,” it’s tricky to know all the moving pieces and connectivity options. With AppDev teams focusing mostly on feature development rather than connectivity, it’s very important to make sure all the services are discoverable to them. Investments in API management are the norm now, with teams registering and representing their services in an API gateway or documenting them in Swagger, for example.

But what about the underlying networking stack? Services might be discoverable, but are they available? Imagine a Venn diagram of AppDev vs. Sys Engineer vs. SRE: Who’s responsible for which task? And with multiple pieces of infrastructure to traverse, what would be a consistent way to describe networking patterns between services?

Service Mesh to the Rescue

Going back to the endpoint bloom, consistency and predictability are king. Over the past few years, service meshes have been maturing and gaining popularity. Here are some great places to learn more about them:

Service Mesh 101

In the Istio model, applications participate in a service mesh. Istio acts as the mesh, and then applications can participate in the mesh via a sidecar proxy—Envoy, in Istio’s case.

Your First Mesh

DZone has a very well-written article about standing up your first Java application in Kubernetes to participate in an Istio-powered service mesh. The article goes into detail about deploying Istio itself in Kubernetes (in this case, Minikube). For an AppDev team, the new piece would be creating the all-important routing rules, which are deployed to Istio.
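
To give a sense of what such a rule can look like, here is a hedged sketch of an Istio VirtualService; the host and subset names are placeholders (not taken from the DZone article), and the subsets assume a matching DestinationRule defined elsewhere:

# Illustrative Istio routing rule splitting traffic between two versions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: shopping-cart
spec:
  hosts:
  - shopping-cart
  http:
  - route:
    - destination:
        host: shopping-cart
        subset: v1     # assumed subset from a DestinationRule
      weight: 90
    - destination:
        host: shopping-cart
        subset: v2
      weight: 10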

Which One of these Meshes?

The New Stack has a very good article comparing the pros and cons of the major service mesh providers. The post lays out the problem in granular format, and discusses which factors you should consider to determine if your organization is even ready for a service mesh.

Increasing Importance of AppDynamics

With the advent of the service mesh, barriers are falling and enabling services to communicate more consistently, especially in production environments.

If tweaks are needed on the routing rules—for example, a time out—it’s best to have the ability to pinpoint which remote calls would make the most sense for this task. AppDynamics has the ability to examine service endpoints, which can provide much-needed data for these tweaks.

For the service mesh itself, AppDynamics running in Kubernetes can even monitor the health of your applications deployed on the cluster.

With the rising velocity of new applications being created or broken into smaller pieces, AppDynamics can help make sure all of these components are humming at their optimal frequency.

Strangler Pattern: Migrate to Microservices from a Monolithic App

In my 20-plus years in the software industry, I’ve worn a lot of hats: developer, DBA, performance engineer and—for the past 10 years prior to joining AppDynamics—software architect. I’ve been coding since sixth grade and have seen some pretty dramatic changes over the years, from punch cards and 8-inch floppies to DevOps and microservices.

This may surprise you, but during my career I’ve spent more time fixing broken software than building new and innovative applications. I’ve encountered pretty much every variety of enterprise software snafu, many requiring time-consuming fixes I had to do manually. There was a silver lining to this pain, however: I learned a lot about what does and doesn’t work in software development and deployment. Below are some observations drawn from my experiences in the field.

Enter the Strangler

You may already be familiar with the “Strangler Pattern,” an architectural framework for updating or modernizing software and enterprise applications. While the concept isn’t new—esteemed author Martin Fowler was discussing the pattern way back in 2004—it’s even more relevant for today’s microservices- and DevOps-focused organizations.

Essentially, the term is a metaphor for modern software development. The strangler tree, or strangler fig, is the popular name for a variety of tropical and subtropical plant species. It germinates in the upper branches of a host tree and sends roots down to the ground, enveloping and sometimes killing its host, and shrouding the carcass of the original tree under a thick lattice of roots.

This “strangler” effect is not unlike the experience an organization encounters when transitioning from a monolithic legacy application to microservices—breaking apart pieces of the monolith into smaller, modular components that can be built faster and deployed more quickly. While the enterprise version of the strangler tree won’t kill off its host entirely—some legacy functions won’t transfer to microservices and must remain—the strategy is essential for any organization striving for agile development.

A Hybrid Approach

The Strangler Pattern is a representation of agility within the enterprise. If you’re moving in the agile direction or doing a legacy modernization, you’re using the Strangler, whether you realize it or not.

The pattern helps software developers, architects, and the business side align the direction of their legacy transition. Anytime you hear “cloud hybrid,” “hybrid cloud” or “on-prem plus cloud,” it’s a Strangler Pattern, as the organization must maintain connectivity between its legacy application and the microservices it’s pushing to the cloud.

When enterprises start their agile journey, they soon realize there’s a huge cost to trying to reverse-engineer legacy code—much of it written for mainframes in COBOL, or in C++, or in older .NET or Java—to make it smaller and more modular. They also discover that the hybrid, or strangler, approach, while certainly not easy, is easier than trying to rewrite everything.

Agile Enterprise vs. Agile Development

Developers are very quick to adopt agile. This may seem like a good thing, but it’s actually one of the core problems in organizations I’ve worked with over the years: Developers are agile-ready, the organization is not.

For a legacy transition to work, the focus must be on the agile enterprise, not just agile development. Processes must be in place to determine requirements, mock up screens, hash out wireframes, and generally move things along. Some businesses have this down—Google, Amazon and Netflix come to mind—but many companies don’t have these processes in place. Rather, they jump in head first, quickly going to microservices because of the buzz, without really considering the implications of what this will mean to their organizational requirements. The catalyst may be a new CTO coming in and saying, “Let’s move to the cloud.”

But a poorly conceived microservices transition, one where the entire enterprise neither embraces the agile philosophy nor understands what it means to go to a microservices strategy, can have disastrous consequences.

Bad App, Big Bill

Developers and DevOps and infrastructure folks usually understand what it means to go to a microservices strategy, but the business doesn’t always get it.

There are a lot of misconceptions about what microservices are and what they can do. For the Strangler Pattern to work, you need a comprehensive understanding of the potential impacts of a cloud transition.

I’ve seen situations where an application ran great on a developer’s local desktop, where it wasn’t a problem if the app woke every few seconds, checked a database, went back to sleep, and repeated this process over and over. But a month after pushing this app to the cloud, the company got a bill from its cloud provider for thousands of dollars of CPU time. Clearly, no one at the company considered the ramifications of porting this app directly to the cloud, rather than optimizing it for the cloud.

The moral? There are different approaches for different cloud models, migrations and microservice strategies. It’s critically important for your organization to understand the pros and cons of each approach, and how you as a developer or architect can work with the organization’s agile enterprise strategy.

Lift-and-Shift: A Fairy Tale

When attempting a “lift and shift”—moving an entire application or operation to the cloud—companies often adopt a methodical approach. They start with static services that can be moved easily, and services that don’t contain sensitive company data or personal customer information.

The ultimate goal of lift-and-shift is to move everything to the cloud, but in my experience that’s a fairy tale. It’s aspirational but not achievable: You’re either building for the cloud from the ground up or lifting some services, usually the ones easiest to shift. Whenever I mention “lift and shift” to developers, architects and customers, they usually laugh because they’ve gone through enough pain to understand it’s not entirely possible, and that their organization will be in a hybrid or transitional state for an extended period of time.

If you lift-and-shift an application that runs great on-prem, it’s likely to suddenly start spinning up resources, causing you to scale unnecessarily because the code was written to perform in an environment that’s very different from the one it’s now in. The Strangler Pattern again comes into play: understand which elements of your application or service are a natural fit for the cloud (e.g., they have elasticity requirements or unpredictable scale) and move them first. You can then move the remaining pieces more easily into a cloud environment that behaves more predictably.

Putting It All Together

There are plenty of challenges that come with cloud migration. Your enterprise, top to bottom, must be ready for the move. If you haven’t invested in a DevOps strategy along with the people capable of executing it, codifying all the dependencies and deployment options to make an application run efficiently in the cloud, you’ll likely find your team pushing bug fixes all day and troubleshooting problems, rather than being agile and developing code and features that help users.

The ability to monitor your environment is fundamentally important as well. Being able to see the performance of your services, to quickly root-cause a breakdown in communication between microservices and their dependencies, is absolutely critical to an effective strategy.

Without an agile transformation, you’ll never truly achieve a microservices architecture. That’s why the Strangler Pattern is a good approach. You can’t just say one day, “We’re going agile and into the cloud,” and the whole organization replies, “Okay, we’re ready!”

You’ll meet resistance. Processes will have to be rewritten, code and deployment processes will need to change. It’s a huge undertaking.

The Strangler’s piecemeal approach makes a lot more sense. You don’t want to learn that you screwed up your continuous integration and development pipeline on 500 new microservices. It’s much wiser to start with two or three services and learn what works and what doesn’t.

I’ve experienced first-hand all of the software migration problems I’ve described above. I’ve repaired them manually, which was time-consuming and painful. The silver lining is that I learned a lot about fixing broken software, and now I’m able to share these hard-earned lessons.

In conclusion, be sure to implement a DevOps strategy for continuous integration and deployment, as well as good monitoring and logging systems to understand your performance.

The AppD Approach: Deployment Options for .NET Microservices Agent

There are numerous ways to develop .NET applications, and several ways to run them. As the landscape expands for .NET development—including advances in .NET Core with its cross-platform capabilities, self-contained deployments, and even the ability to run an ASP.NET Core app on a Raspberry Pi with the upcoming .NET Core 2.1 ARM32 support—it’s only fitting that AppDynamics should advance its abilities to monitor this new landscape.

One of these advancements is our new .NET Microservices Agent. Like .NET Core, this agent has evolved to become more portable and easier to use, providing more value to our customers who monitor .NET Core applications. Its portability and refinement enable a couple of installation options, both of which align closely with the movement to host .NET applications in the cloud, the development of microservices, and the growing use of containers. This flexibility in deployment was a requirement of our customers, who had concerns about the one-size-fits-all deployment options of some of our competitors. These deployment methods include:

  • Installing via the AppDynamics Site Extension in Azure

  • Installing via the NuGet package bundled with the application

Each method has its advantages and disadvantages:

AppDynamics Site Extension

    • Advantage: Azure Site Extension is an easy deployment method that decouples the AppDynamics agent from the code. A couple of clicks and some basic configuration settings and—voila!—an Azure App Service has an AppDynamics monitoring solution.

    • Disadvantage: It is an Azure App Service-only option. Should the application need to be moved to another service such as Azure Service Fabric, a different installation method would be needed.

AppDynamics NuGet Package

  • Advantage: The NuGet package installation method is extremely versatile. Since it’s bundled with the application, wherever the application goes, the agent and monitoring go too. It’s an excellent option for microservices and containers.

  • Disadvantage: Its biggest advantage is also a drawback, as coupling the agent with the application increases operational requirements. Agent updates, for instance, would require small configuration changes and redeployments.

The Easy Option: AppDynamics Site Extension

Azure provides the ability to add Site Extensions, a simple way to add functionality and tooling to an Azure App Service.

In the case of AppDynamics’ .NET Microservices Agent, Site Extensions is a wonderful deployment method that allows you to set up monitoring on an Azure App Service without having to modify your application. This method is great for an operations team that either wants to monitor an existing Azure App Service without deploying new bits, or decouple the monitoring solution from the application.

The installation and configuration of the AppDynamics Site Extension is simple:

  1. Add the Site Extension to the App Service from the Site Extension Gallery.

  2. Launch the Controller Configuration Form and set up the Agent.

As always, Azure provides multiple ways to do things. Let’s break down these simple steps and show installation from two perspectives: from the Azure Portal, and from the Kudu service running on the Azure App Service Control Manager site.

Installing the Site Extension via the Azure Portal

The Azure Portal provides a very easy method to install the AppDynamics Site Extension. As the Portal is the most common interface when working with Azure resources, this method will feel the most comfortable.

Step 1: Add the Site Extension

  • Log in to the Azure Portal at https://portal.azure.com and navigate to the Azure App Service where the AppDynamics Site Extension will be installed.

  • In the menu sidebar, click the Extensions option to load the list of currently installed Site Extensions for the Azure App Service. Click the Add button near the top of the page to load the Site Extension Gallery, where you can search for the latest AppDynamics Site Extension.

  • In the “Add extension” blade, select the AppDynamics Site Extension to install.
    (The Portal UI is not always the most friendly. If you hover over the names, a tooltip should appear showing the full extension name.)

  • After choosing the extension, click OK to accept the legal terms, and OK again to finish the selection. Installation will start, and after a moment the AppDynamics Site Extension will be ready to configure.

Step 2: Launch and Configure

  • To configure the AppDynamics Agent, click the AppDynamics Site Extension to bring up the details blade, and then click the Browse button at the top. This will launch the AppDynamics Controller Configuration form for the agent.

  • Fill in the configuration settings from your AppDynamics Controller, and click the Validate button. Once the agent setup is complete, monitoring will start.

  • Now add some load to the application. In a few moments, the app will show up in the AppDynamics Controller.

Installing the Site Extension via Kudu

Every Azure App Service is created with a secondary site running the Kudu service, which you can learn more about from the projectkudu project on GitHub. The Kudu service is a powerful tool that gives you a behind-the-scenes look at your Azure App Service. It’s also the place where Site Extensions are run. Installing the AppD Site Extension from the Kudu service is just as simple as from the Azure Portal.

Step 1: Add Site Extension

  • Log in to the Azure Portal at https://portal.azure.com and navigate to the Azure App Service where the AppDynamics Site Extension will be installed.

  • The Kudu service is easy to access via the Advanced Tools selection on the App Service sidebar.

  • Another option is to log in directly to the secondary site’s URL by inserting “.scm” before the “.azurewebsites.net” domain. For example: http://appd-appservice-example.azurewebsites.net becomes http://appd-appservice-example.scm.azurewebsites.net. (You can read more about accessing the Kudu service in the projectkudu wiki.)

  • On the Kudu top menu bar, click the Site Extensions link to view the currently installed Site Extensions. To access the Site Extension Gallery, click the Gallery tab.

  • A simple search for “AppDynamics” will bring up all the available AppDynamics Site Extensions. Simply click the add “+” icon on the Site Extension tile to install.

  • On the “terms acknowledgement” dialog pop-up, click the Install button.

  • Finish the setup by clicking the “Restart Site” button on the upper right. This will restart the SCM site and prepare the AppDynamics Controller Configuration form.

Step 2: Launch and Configure

  • Once the restart completes, click the “Launch” icon (play button) on the Site Extension tile. This will launch the AppDynamics Controller Configuration form.

  • Follow the same process as before by filling in the details and clicking the Verify button.

  • The agent is now set up, and AppDynamics is monitoring the application.

AppDynamics Site Extension in Kudu Debug Console

One of the advantages of the Kudu service is the ability to use the Kudu Debug Console to locate App Service files, including the AppDynamics Site Extension installation and the AppDynamics Agent log files. Should the Agent need configuration changes, such as adding a “tier” name, you can use the Kudu Debug Console to locate the AppDynamicsConfig.json file and make the necessary modifications.
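As a rough illustration, the portion of AppDynamicsConfig.json you would touch for a tier name might look like the snippet below. The field names and values are assumptions for the example only; confirm the exact schema in the AppDynamics documentation:

    {
      "application": {
        "name": "appd-appservice-example",
        "tier": "web-frontend"
      }
    }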

The Versatile Option: AppDynamics NuGet Packages

The NuGet package installation option is the most versatile deployment method, as the agent is bundled with the application. Wherever the application goes, the agent and monitoring solutions go too. This method is great for monitoring .NET applications running in Azure Service Fabric and Docker containers.

AppDynamics currently has four separate NuGet packages for the .NET Microservices Agent, and each is explained in greater detail in the AppDynamics documentation. Your choice of package should be based on where your application will be hosted, and which .NET framework you will use.

In the example below, we will use the package best suited to an Azure App Service, to allow a comparison with the Site Extension.

Installing the AppDynamics App Service NuGet Package

The method for installing a NuGet package will vary by tooling, but for simplicity we will assume a simple web application is open in Visual Studio, and that we’re using Visual Studio to manage NuGet packages. If you’re working with a more complex solution that bundles multiple applications together, NuGet package installation will vary according to how each project is deployed.

Step 1: Getting the Correct Package

  • Right-click the web app project to bring up the context menu, then locate and click “Manage NuGet Packages…”. This brings up the NuGet Package Manager, where you can search for “AppDynamics” under the Browse tab.

  • Locate the correct package—in this case, the “AppService” option—select the appropriate version and click Install.

  • Do a build of your project to add the AppDynamics directory to your project.

  • The agent is now installed and ready to configure.

Step 2: Configure the Agent

  • Locate the AppDynamicsConfig.json file in the AppDynamics directory and fill in the Controller configuration information (an illustrative example follows these steps).

  • Publish the application to Azure and add some load to the application to test if monitoring was set up properly.
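For reference, a minimal AppDynamicsConfig.json might look something like the sketch below. The field names and values are illustrative assumptions, so check the official AppDynamics documentation for the exact schema and substitute your own Controller host, account and access key:

    {
      "controller": {
        "host": "your-controller.saas.appdynamics.com",
        "port": 443,
        "ssl": true,
        "account": "your-account-name",
        "access-key": "your-access-key"
      },
      "application": {
        "name": "appd-appservice-example",
        "tier": "web-frontend"
      }
    }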

I hope these steps give you an overview of how easy it is to get started with our .NET Microservices Agent. Make sure to review our official .NET Microservices Agent and Deploy AppDynamics for Azure documentation for more information.

Getting Started with Containers and Microservices

Get Ahead of Microservices and Container Proliferation with Robust App Monitoring

Containers and microservices are growing in popularity, and why not? They enable agility, speed, and resource efficiency for many of the tasks developers work on daily. They are light in terms of coding and interdependencies, which makes it much easier and less time-consuming to deliver apps to users or migrate applications from legacy systems to cloud servers.

What Are Containers and Microservices?

Containers are isolated workload environments in a virtualized operating system. They speed up workload processes and application delivery because they can be spun up quickly, and they address application-portability challenges because they are not tied to software on specific physical machines.

Microservices are a type of software architecture that is lightweight and limited in scope. Applications are composed of small, single-function, self-contained units that work together through APIs that are not tied to any specific language. A microservices architecture is faster and more agile than traditional application architecture.
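As a concrete, intentionally simplified sketch, a single microservice might be nothing more than a small ASP.NET Core app exposing one narrowly scoped HTTP API; the service and endpoint names below are invented for illustration:

    using Microsoft.AspNetCore;
    using Microsoft.AspNetCore.Builder;
    using Microsoft.AspNetCore.Hosting;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Extensions.DependencyInjection;

    public class Program
    {
        // Host one small, self-contained service with a single responsibility.
        public static void Main(string[] args) =>
            WebHost.CreateDefaultBuilder(args).UseStartup<Startup>().Build().Run();
    }

    public class Startup
    {
        public void ConfigureServices(IServiceCollection services) => services.AddMvc();
        public void Configure(IApplicationBuilder app) => app.UseMvc();
    }

    // The entire public surface of this microservice: one language-agnostic HTTP endpoint.
    [Route("api/[controller]")]
    public class InventoryController : Controller
    {
        [HttpGet("{sku}")]
        public IActionResult Get(string sku) => Ok(new { Sku = sku, InStock = 42 });
    }

Many such services, each independently deployable and scalable, then compose into the larger application.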

The Importance of Monitoring

For containers and microservices to be most effective and impactful as they are adopted, technology leaders must plan how they will be monitored and how code will be written within them. They also must understand how developers will use them.

Foundationally, all pieces and parts of an enterprise technology stack should be planned, monitored, and measured. Containers and microservices are no exception. Businesses should monitor them to manage their use according to a planned strategy, so that best-practice standards (e.g., security protocols, sharing permissions, when to use them and when not to) can be identified, documented, and shared. Containers and microservices also must be monitored to ensure both the quality and security of digital products and assets.

To do all of this, an organization needs robust application monitoring capabilities that provide full visibility into the containers and microservices, as well as insight into how they are being used and their influence on goals such as better productivity or faster time-to-market.

Assessing Your Application Monitoring Capabilities

Some of the questions that enterprises should ask as they assess their application-monitoring capabilities are:

  • How can we ensure development and operations teams are working together to use containers and microservices in alignment with enterprise needs?

  • Will we build our own system to manage container assignment, clustering, etc.? Or should we use third-party vendors that will need to be monitored?

  • Will we be able to monitor code inside containers and the components that make up microservices with our current application performance management (APM) footprint?

  • Do we need more robust APM to effectively manage containers and microservices? And how do we determine the best solution for our needs?

To answer those questions and learn more about containers and microservices—and how to effectively use and manage them—read Getting Started With Containers and Microservices: A Mini Guide for Enterprise Leaders.

This mini eBook expands on the topics discussed in this blog and includes an 8-point plan for choosing an effective APM solution.

Go to the guide.

KubeCon + CloudNativeCon: A Diverse and Growing Community

The motto for this year’s KubeCon + CloudNativeCon, “Keep Cloud Native Weird,” proved to be as much a prediction as a slogan when temperatures plummeted last week and snow began to fall in Austin, Texas. Despite Austin uncharacteristically turning into a winter wonderland, attendance for this third annual event was truly impressive, boasting over 4,100 attendees. Contrast this with just a few hundred attendees at its first rendition back in 2015, and you can see how quickly the community focused on containerization, dynamic orchestration, and microservices has grown. And with good reason.

These practices, all key tenets of cloud native, have seen a huge upshift in adoption over the past few years. Couple this with the growing utilization and support of open source software by even the largest companies, and it’s easy to see why the community around the projects hosted by the Cloud Native Computing Foundation (CNCF) has exploded over the past three years. And as the CNCF has grown, so too has the number of projects created and maintained by that community.

As Dan Kohn, executive director of the CNCF, said during his opening keynote, the number of projects expanded from just four in 2016 (Kubernetes, Prometheus, OpenTracing, and fluentd) to 14 in 2017.

In addition to nurturing technical innovation, the CNCF has been going the extra mile to keep its community open to all. This commitment was exemplified by the $250,000 raised to support 103 diversity scholarships for this year’s event. These scholarships were awarded to people from underrepresented and/or marginalized groups in the technology and/or open source communities. Working for a company which prides itself on diversity, I’m glad to see more groups making the effort to ensure that their communities are open and accepting of everyone.

Overall, KubeCon + CloudNativeCon was an incredible event, and one not to be missed. But if you did miss this year’s event, fear not: KubeCon + CloudNativeCon will be coming to Copenhagen, Shanghai, and Seattle in 2018!

Application Architecture With Azure Service Fabric

Is Azure the dominant cloud-based development infrastructure of the future? There’s some good evidence to support that claim. At last year’s Dell World conference in Austin, TX, Microsoft CEO Satya Nadella announced on stage that there are only two horses in the contest for control of the cloud. “It’s a Seattle race,” Nadella said. “Amazon clearly is the leader, but we are number two. We have a huge run-rate. All up, our cloud business last time we talked about it was over $8 billion of run-rate.”

Normally, you could dismiss that as typical marketing speak, but market analysts tend to agree with him. Gartner’s Magic Quadrant for Cloud Infrastructure as a Service Report found that there are only two leaders in the space. AWS is ahead, but Microsoft Azure’s offerings are growing faster. Gartner concluded, “Microsoft Azure, in addition to Amazon Web Services, is showing strong legs for longevity in the cloud marketplace, with other vendors falling further to the rear and confined to more of a vendor-specific or niche role.”

The Rundown on Azure Service Fabric and Microservices

Service Fabric is the new middleware layer from Microsoft designed to help companies scale, deploy, and manage microservices. Service Fabric supports both stateless and stateful microservices. In stateful microservices, Service Fabric co-locates your application code with its storage, reducing latency, and automatically provides replication services in the background to improve the availability of your services.
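Here is a minimal sketch of what a stateful service can look like with the Service Fabric .NET SDK; the service name and the counter it keeps are invented for illustration:

    using System;
    using System.Fabric;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.ServiceFabric.Data.Collections;
    using Microsoft.ServiceFabric.Services.Runtime;

    internal sealed class VisitCounterService : StatefulService
    {
        public VisitCounterService(StatefulServiceContext context) : base(context) { }

        protected override async Task RunAsync(CancellationToken cancellationToken)
        {
            // State lives in a reliable dictionary that Service Fabric replicates
            // across nodes in the background for availability.
            var counts = await StateManager
                .GetOrAddAsync<IReliableDictionary<string, long>>("counts");

            while (!cancellationToken.IsCancellationRequested)
            {
                using (var tx = StateManager.CreateTransaction())
                {
                    await counts.AddOrUpdateAsync(tx, "visits", 1, (key, value) => value + 1);
                    await tx.CommitAsync();
                }
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
            }
        }
    }

Because the reliable collection is replicated by the platform, the service keeps its state close to the code instead of calling out to an external store on every request.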

Azure Service Fabric improves the deployment process for customers embracing DevOps with features like rolling upgrades and automatic rollback during deployments.

Empowering customers to deliver microservices using Azure Service Fabric is a key contributor to Microsoft’s revenue growth, with Azure expanding 102 percent year-over-year.

Top enterprises betting on Azure services today include global chocolatier The Hershey Company, Amazon’s e-commerce competitor Jet.com, digital textbook builder Pearson, GE Healthcare, and broadcaster NBC Universal. Azure is an optimized, multi-platform cloud solution that can power workloads running on Windows and Linux, using .NET, Node.js, and a host of other runtimes, making it easy for customers deploying applications that scale with microservices to adopt it regardless of language or underlying OS.

Why Microsoft Chose Microservices Over Monolithic

When Microsoft started running cloud-scale services such as Bing and Cortana, it ran into several challenges with designing, developing, and deploying apps at cloud scale. These were services that were always on and in high demand, requiring frequent updates with zero downtime. The microservices architecture made much more sense than a traditional monolithic approach.

Microsoft’s Mark Fussell defined the problem with monolithic: “During the client-server era, we tended to focus on building tiered applications by using specific technologies in each tier. The term ‘monolithic application’ has emerged for these approaches. The interfaces tended to be between the tiers, and a more tightly coupled design was used between components within each tier. Developers designed and factored classes that were compiled into libraries and linked together into a few executables and DLLs.”

There were certainly benefits to that methodology at the time, in terms of simplicity and faster calls between components using inter-process communication (IPC). Everybody is on one team testing a single piece of software, so it’s easier to coordinate tasks and collaborate without each person explaining what they’re working on at a given moment.

Azure and Microservices

The monolithic approach started to fail when the app ecosystem turbocharged the speed of user expectations. If you want to scale a monolithic app, you have to clone it out onto multiple servers or virtual machines (or containers, but that’s another story). In short, there was no easy way to break out and scale components rapidly enough to satisfy the business needs of enterprise-level app customers. The entire development cycle was tightly interconnected by dependencies and divided by functional layers, such as web, business, and data. If you wanted to do a quick upgrade or fix, you had to wait until testing was finished on the earlier work. Monolithic architecture and agility didn’t mix.

The microservices approach is to organize a development project around independent business functionalities. Each can scale up or down at its own rate. Each service is its own unique instance that can be deployed, tested, and managed across all the virtual machines. This aligns more closely with the way business actually works in a world of zero-latency expectations and rapid traffic spikes.

In reality, many development teams start with the monolithic approach and then break it up into microservices based on which functional areas need to be changed, upgraded, or scaled. Today, DevOps teams responsible for microservices projects tend to be highly cost-effective but insular, and APIs and communication channels to other microservices can suffer without strong leadership and foresight.

How Azure Service Fabric Helps

Azure Service Fabric is a distributed systems platform that assigns each microservice, whether stateless or stateful, a unique name. Service Fabric streamlines the management, packaging, and deployment of microservices, so DevOps teams and admins can forget about infrastructure complexities and get down to implementing workloads. Microsoft defines Azure Service Fabric as “the next-generation middleware platform for building and managing these enterprise-class, tier-1, cloud-scale applications.”

Azure Service Fabric is behind services like Azure SQL Database, Azure DocumentDB, Cortana, Microsoft Power BI, Microsoft Intune, Azure Event Hubs, Azure IoT Hub, and Skype for Business. You can create a wide variety of cloud-native services that can immediately scale up across thousands of virtual machines. Service Fabric is flexible enough to run on Azure, on your own bare-metal on-premises servers, or on any third-party cloud. More importantly — especially if you’re an open-source house — Service Fabric can also deploy services as processes or in containers.

Azure Container Services

Open-source developers can use Azure Container Service along with Docker for container orchestration and scale operations. You’re free to work with Mesos-based DC/OS, Kubernetes, or Docker Swarm and Compose, and Azure will optimize the configuration for .NET. The containers and your app configuration are fully portable. You can modify the size, the number of hosts, and which orchestrator tools you want to use, and then leave the rest to Azure Container Service.

All of the most popular development tools and frameworks are compatible, because Azure Container Service exposes the standard API endpoints for each orchestration engine. That opens the door for all of the most common visualizers, monitoring platforms, continuous integration tools, and whatever the future brings. For .NET developers or those who have worked with the Visual Studio IDE, the Azure interface presents a familiar user experience. Developers can use Azure and the cross-platform fork of .NET known as .NET Core to create open-source projects running ASP.NET applications on Linux, Windows, or even Mac.

Taking on New Challenges With Service Fabric

Microsoft’s role as a hybrid cloud expert gives Azure an edge over virtual-only competitors like AWS and Google Cloud. Azure’s infrastructure comprises hundreds of thousands of servers, content distribution networks, edge computing nodes, and fiber optic networks. Azure is built and managed by a team of experts working around the clock to support services for millions of businesses all over the planet.

Developers experienced with microservices have found it valuable to architect around the concept of smart endpoints and dumb pipes. In this approach, each microservice functions independently, decoupled from the others but as cohesive as possible: it receives requests, acts on its own domain logic, and sends off a response. Microservices can then be choreographed using RESTful protocols, as detailed by James Lewis and Martin Fowler in their microservices guide from 2014.

If you’re dealing with workloads that burst unpredictably, you want an infrastructure that’s reliable and secure, backed by data centers that are environmentally sustainable. Azure lets you instantly generate a virtual machine with 32TB of storage driving more than 50,000 IOPS. Then, your team can tap into data centers with hundreds of thousands of CPU cores to solve seemingly impossible computational problems.

AppDynamics for Azure

In the end, the user evaluates the app as a singular experience. You need application monitoring that makes sure all the microservices are working together seamlessly and with no downtime. The AppDynamics App iQ platform is built to handle the flood of data coming from .NET and Azure applications. You can monitor all of the .NET performance data from inside Azure, as well as frameworks and runtimes like WebAPI, OWIN, MVC, and ASP.NET Core on the full framework, by deploying AppDynamics agents in Azure websites, worker roles, Service Fabric, and containers. In addition, you can monitor the performance of queues and storage for services like Azure SQL Server and Service Bus. This provides end-to-end visibility into your production services running in the cloud.

The asynchronous nature of microservices makes it nearly impossible to track down a root failure as it cascades through services unless you have solid monitoring in place. With AppDynamics, you’ll be able to visualize the service path from end to end for every single interaction — all the way from the origination through the downstream service calls. Otherwise, you’ll get lost in the complexity of microservices and lose all the benefits of building on the Azure infrastructure.

While we see many developers in the Microsoft space attracted to Azure, AppDynamics recognizes that Azure is a cross-platform solution supporting both Windows and Linux. In addition to the .NET runtimes, AppDynamics provides a rich set of monitoring capabilities for many of the modern technologies used in the Azure cloud, including Node.js, PHP, Python and Java applications.

Learn more

Learn more about our .NET monitoring solution or take our free trial today.