What’s up with the Network and Storage teams?


A few weeks ago I was presenting at CMG Performance and Capacity 2013, and during my presentation we (a few audience members and I) got slightly side-tracked. The conversation turned to the question of why it is so hard to get performance data from the network and storage teams. Audience members asked me why, when they requested this type of data, they were typically stonewalled by these organizations.

I didn’t have a good answer for this question and in fact I have run into the same problem. Back when I was working in the Financial Services sector I was part of a team that was building a master dashboard that collected data from a bunch of underlying tools and displayed it in a drill-down dashboard format. It was, and still is, a great example of how to provide value to your business by bringing together the most relevant bits of data from your tooling ecosystem.

This master dashboard was focused on applications and included as many of the components involved in delivering each application as possible. Web servers, application servers, middleware, databases, OS metrics, business metrics, and so on were all included, and the key performance indicators (KPIs) for each component were available within the dashboard. The entire premise of this master dashboard relied upon getting access to the tools that collected the data, either via API or through database queries.
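To make the idea concrete, here is a minimal sketch of that kind of aggregation. It is not the dashboard we actually built; the class, the component names, and the KPI values are all hypothetical, and the collectors are stubs standing in for real API calls or database queries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A collector is any callable returning KPI name -> value for one component.
# In practice it would wrap an API call or a database query (assumption).
Collector = Callable[[], Dict[str, float]]


@dataclass
class ApplicationView:
    """Hypothetical per-application drill-down: component -> KPI -> value."""
    name: str
    collectors: Dict[str, Collector] = field(default_factory=dict)

    def register(self, component: str, collector: Collector) -> None:
        self.collectors[component] = collector

    def snapshot(self) -> Dict[str, Dict[str, float]]:
        result: Dict[str, Dict[str, float]] = {}
        for component, collect in self.collectors.items():
            try:
                result[component] = collect()
            except Exception:
                # A source we cannot reach shows up as an empty entry
                # instead of breaking the whole view.
                result[component] = {}
        return result

    def missing(self) -> List[str]:
        """Components for which no KPIs could be collected."""
        return sorted(c for c, kpis in self.snapshot().items() if not kpis)


def web_server_kpis() -> Dict[str, float]:
    # Stub values for illustration only.
    return {"requests_per_sec": 120.0, "avg_response_ms": 85.0}


def storage_kpis() -> Dict[str, float]:
    # Simulates a team that has not granted access to its data.
    raise PermissionError("no access granted")


def build_example_view() -> ApplicationView:
    view = ApplicationView("example-app")
    view.register("web_server", web_server_kpis)
    view.register("storage", storage_kpis)
    return view
```

Calling `build_example_view().missing()` lists the components with no data, which is exactly the gap a blocked data source leaves in a view like this.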

The only problems our group faced in getting access to the data we needed were with the network and storage teams. Why was that? Was it because these teams did not have the data we needed? Was it because these teams did not want anyone to see when they were experiencing issues? Was it for some other reason?

I know the network team had the data we required because they eventually agreed to provide the KPIs we had asked for. This was great, but the process was very painful and took way longer than it should have. Still, we eventually got access to the data. The big problem is that we never got access to the storage data. To this day I still don’t know why we were blocked at every turn but I’m hoping that some of the readers of this blog can share their insight.

Back in the day, when my team was chasing the storage team for access to their monitoring data, there weren’t really any tools that we could find for performance monitoring of storage arrays besides the tools that came with the arrays. These days I would have been able to get the data I needed for NetApp storage by using AppDynamics for Databases (which includes NetApp performance monitoring capabilities). You can read more about it by clicking here.

Have you been stonewalled by the network or storage or some other team? Did you ever get what you were after? Based upon my experiences talking with a lot of different folks at different organizations this seems to be a significant problem today. Are you on a network or storage team? Does your team freely share the data they have? Please share your experience, insight, or questions in the comments below. Just to clarify, I hold no ill will against any of you network or storage professionals out there. I’d just like to hear some opinions and gain some perspective.

  • Carl Robinson

    Good Points. So we have a similar need but actually are the Networking team.
    It is my experience that the Network is a “Root Blame” for everything. That being said, it creates a great divide in many organizations, I am sure.
    As engineers, our focus has mostly been on transport efficiency. The closer we get to monitoring power and light, the closer we get to real root causes. In most mature environments we have built-in redundancy that appears to be a problem for most monitoring tools out of the box, including proprietary application flows.

    This poses a unique challenge for application teams to follow packet flows across port-channels or equal cost routed links.

    The network typically does not account for packet reassembly, as this is handled higher in the stack by the client.
    Finding TCP sequence breaks, or worse, UDP loss, is almost a fate worse than death for an engineer,
    and many hours can be spent looking for the smallest changes in latency and retransmission.

    Most network teams rely on ICMP, SNMP, and Syslog for monitoring. The last decade has introduced NetFlow, which helps tremendously in reducing deep dives into packet captures, and IP SLA, which creates synthetic transactions across the infrastructure and can give per-hop performance metrics.

    However, even with all these advances there is still a hole in how we bind performance KPI symptoms together. I liken it to chasing shiny objects, only to find out that they are a reflection.

    Most application and network deployments do a poor job of conveying the architectural “linchpins” that account for the KPI’s success. A little probing and prodding usually reveals that each silo is ignorant of the other.

    • Hi Carl, thanks for the informative comment. It sounds like an extremely challenging environment. I’m curious about your initial statement “So we have a similar need but actually are the Networking team.”… Does that mean that you need information from another silo or that you need better monitoring within your own realm?

      Also, as a member of a network team do you ever get requests for making monitoring information available to application teams? If so, how do you deal with it?