how to monitor from the resource stack all the way to the biz stack

January 11, 2012

Some time ago I had a revelation flying from NYC to Buffalo in a Soviet-style propeller plane. It came about in the form of sheer fear. This is a quick flight and I was en route to a big content provider’s brand new data center that gets its power feed from Niagara Falls. About 20 minutes into the flight we entered a dark storm. The other 10 or so passengers were all pinned to the windows looking at a gnarly black cumulus-type cloud starting to surround us, even though it was well before 7am. Bammm! First of the many drops as the turbulence heated up. I was catching at least 2 to 3 inches of air off my seat as we wobbled and dropped entering this weather pattern. Not looking good!


Only thing I could do was to take my mind off something I had no control over. Obviously I could not ask the pilot to turn around. Basically we had about 45 min and had to ride it out. And then the lights went out!!


I was sure at this point that we were going to fall right out of the sky and this was the end of the line for me. I thought about my friends waiting for me at the data center and my family. How much worse would it get? And it got worse. The passenger to my left got sick, and I mean sick. I guess the MickeyD big breakfast had not settled well and the ups and downs of the ride got the better of the ol’ chap. Whatever pressure system we were riding on could not justify a Mc-egg and sausage sandwich.


I was trying to convince myself that this was as bad as it was going to get and the only thing left was for it to get better. This meant coming up with scenarios in my head such as “what if the pilot was blind!”. What? The pilot is blind?


I bet you would never fly in a plane where the pilot was actually blind. In the same way, you would never operate large scale applications blind. And what exactly do you need to have visibility into?


I have broken down the stack and recommended some basic things below:

In general businesses that are based on web centric services will have three layers:


-              Compute resources: this encompasses network, storage, disk space, I/O, memory, CPU and other pieces that map easily to hardware. Any of the out-of-the-box SNMP solutions should work. Depending on your appetite for initial capex and your ability to run these systems, there are numerous services and packages available to you. Take a look at . I am going to pick just three out of the long list of available packages:


o             Nagios: been around for a while. Open source and free. Easy to set up but will require some development and upkeep. Having implemented this numerous times, I can tell you to take the “what goes in = what comes out” approach to this platform. It comes from the old-school stack of NetSaint, with a previous lead from SATAN (no longer around). There are multiple MIBs available from the community for this stack.


o             NetFlow: more a network monitor but extendable to layer 4 on systems as well. Simple, but will also need development.


o             NetIQ: $$$ but it comes as a service. Enough said.



-              Service stack: these are the bits of software glued together to make things run. If on a .NET stack, then IIS services, db calls, and .NET-specific service architecture pieces and how they interconnect. If on an open source or a non-.NET stack, then (depending on the technology used under the hood) the actual code files, the build processes, the Apache/lighttpd services and whatnot, all mapped and interrelated. Here are a few good tools:


o             SolarWinds Orion: you can’t go wrong. It has a self-discovery module with lots of bells and whistles. In terms of dollars and cents, it starts at around $7K but is well worth it. It will map out the service layer pretty thoroughly and is able to suppress unwanted red flags as needed. It does require deployment time, either as a service or standalone.


o             HP OpenView: this is probably the best stack available, with all kinds of modules. However, it is super expensive and more geared towards large production environments. It will also require significant deployment resources as well as maintenance.


o             Nexvu: great service engine analyzer.   Easy to deploy but $$$.



-              Business: this will directly depend on the business flow. There is no out-of-the-box solution. In general you can take some simple BI tools and build your KPIs in them in a way that is trackable. A great, reasonably priced tool called QlikView is one of the best choices I have seen. Others include Pentaho. Pentaho is great, free and open source; however, it will require some serious dev time to get it to a state in which it is usable.
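Going back to the compute resources layer for a moment: a Nagios-style check is just a small program that prints one status line and reports its result through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). Here is a minimal sketch of a disk check in Python. The 80% and 90% thresholds and the function names are my own assumptions for illustration, not something that ships with Nagios.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a status code (thresholds are assumptions)."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status_code, one_line_message) for the filesystem holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is reported as UNKNOWN, not OK.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_pct = 100.0 * usage.used / usage.total
    status = classify(used_pct, warn, crit)
    return status, f"DISK {LABELS[status]} - {used_pct:.1f}% used on {path}"


def main():
    """Print the status line and return the exit code for the scheduler."""
    code, message = check_disk()
    print(message)
    return code

# To run it as a plugin, end the script with:
#   import sys; sys.exit(main())
```

This is the “what goes in = what comes out” idea in practice: the monitoring server only knows what the check tells it, so the thresholds have to reflect what actually matters for your systems.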


Now how do you glue the three layers to get a holistic view from the resource layer all the way to biz layer? Another blog for another time.

content at the edge

June 2, 2011

A while back I was reminiscing with Mr. Kobayashi about what we used to call the “Nauziation steering committee”. It was a weekly shakedown where we would take our thoughts, in power freaking point format of course, and get into a room. Then the “executives” would show up and beat us up on what we put up on the board.

Every Thursday morning at 11AM, I would shake hands with the fellow engineers as if I was headed to the gallows. It was the final farewell to a week-long journey of mixing up deep-down architecture and engineering and putting a business face on it all. They all knew the drill: if the smiley face on my big content provider messenger was laughing, then all was good; but if it was a devil or a sad face, they knew the situation was not going as planned. The “nauziation steering committee” was for the big projects: anything with a price tag bigger than $XM or so, or the strategic projects that had a big impact across the board. And it was not just for network. It included anything and everything Infrastructure. Data Centers, Storage and even systems got a chance to make their way into the dungeons of the nauziation steering committee. The usual members included: FBI Special Agent Jack Baer, head of all big content provider Ops; Mr. Kobayashi, at the time Principal Architect; Roger “Verbal” Kint, big content provider’s co-founder; Special Agent Dave Kujan, Verbal’s right-hand man; Redfoot the Fence, our boss; and Senor Righteous, Chief Architect and mine and Mr. Kobayashi’s counterpart, also reporting to Redfoot the Fence.

There were subject matter experts that would rotate in and out depending on the subject of the week. The group would grill us on numbers, dollars, operations, architecture and just about anything they found a hole in. Each one of these characters had their own special knack for things and we had to manage it all. For example, we all knew Redfoot the Fence would often turn on us (if it was not going his way) in the middle of the session and take sides, even though he had seen the preso beforehand and had approved it. Senor Righteous was generally quiet, and unless he had something to say (in order to back us up) he would take the 5th and simply observe. Verbal could pick out the one number that was off in a spreadsheet from a mile away, and Kujan was the twist in any conversation that took hold. We all did admit that “as a function of time” we all learned a hell of a lot from Redfoot the Fence. I would go as far as saying that he was probably the most engaging boss we ever had. It was truly a love/hate relationship….but somehow we managed to make it work.

Anyways, the reason I bring this up is that the concept of CDN made it to the nauziation steering committee a few times and finally led to the implementation of a strategy where all CDN pricing was disrupted globally. I remember working on design, capacity planning, deployment strategy and the dollars and cents, and the stress of it all.

But what is a CDN? Content Delivery Networking is basically taking the content and storing it as close as possible to the user. Cacheable objects are loaded onto servers that are strategically located in rich peering locations or within a transit provider’s network, and the user’s request for a specific type of content is served from these locations instead of traveling all the way to the data center where the content was first created or uploaded. These pieces of content need not be dynamic. Static objects work best, and we will later on get to how a CDN will work for dynamic objects.

Generally speaking, the caches have a relationship with the origin servers. Origin servers are those that the content is loaded up to and shared within applications. For example, if a picture of Mr. T is on the big content provider front page and it is also in big content provider news, which is a different application, that picture is loaded onto the origin server and a dynamic link for the object is embedded into the website. Furthermore, these origin servers are located in data centers and not at network exchange points. The caches for the CDN systems, however, are located at these exchange points and they fetch the Mr. T picture from the origin; therefore millions of users only have to travel to the exchange point and not all the way to the data center. This makes a site much faster to load and reduces the bandwidth cost for a content provider.
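To make the cache and origin relationship concrete, here is a toy sketch in Python. It is not how any particular CDN is built; the class name, the TTL value, the object path and the in-memory “origin” are all assumptions made up for illustration. The point is only that the first request for an object goes to the origin, and later requests are answered from the edge until the cached copy expires.

```python
import time


class EdgeCache:
    """Toy edge cache: answer locally, fall back to the origin on a miss."""

    def __init__(self, fetch_from_origin, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch_from_origin   # callable(url) -> bytes
        self._ttl = ttl_seconds           # how long a cached copy stays fresh
        self._clock = clock               # injectable so expiry can be tested
        self._store = {}                  # url -> (expires_at, body)
        self.hits = 0
        self.misses = 0

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and entry[0] > now:
            self.hits += 1                # served from the exchange point
            return entry[1]
        self.misses += 1                  # one trip to the data center
        body = self._fetch(url)
        self._store[url] = (now + self._ttl, body)
        return body


# Hypothetical origin holding a single object.
ORIGIN_OBJECTS = {"/img/mr_t.jpg": b"<jpeg bytes>"}
origin_requests = []


def fetch_from_origin(url):
    origin_requests.append(url)           # record every trip to the origin
    return ORIGIN_OBJECTS[url]


edge = EdgeCache(fetch_from_origin)
for _ in range(1000):                     # a thousand users ask for Mr. T
    edge.get("/img/mr_t.jpg")
```

After the loop, `origin_requests` holds a single entry while `edge.hits` has counted the other 999 requests, which is the bandwidth saving in miniature.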

To get a bit deeper into the subject, we first start off by seeing how a page is dynamically built. The best, and easiest, way is to get the Mozilla Firefox browser installed on your machine. Then do a big content provider search for “firebug” or go directly to

Firebug is a Firefox plug-in and it is safe. Once installed, you will see a little bug logo in the lower right hand corner of your browser.

Double click on the bug symbol and once a panel is displayed, you will see a bunch of boxes on the upper row of the firebug display. The button names include: Console; HTML; CSS; Script; DOM and Net.

Click on the “Net” button and then click on Enable. Then in the next line of the firebug display, click on “All”.

After that, in your browser’s location bar type http://www.big content

Each one of these links is a way for the dynamic page to fetch the right content for the right part of the CSS style sheet. In the case of big content, if looked at in the US, there are roughly 47 objects that make up the page. Some of them are advertisements, some of them are dynamic text (such as news, which changes on a regular basis) and some are pictures. These pictures may be the objects associated with Mail, Autos or even the infamous big content provider logo. All of these objects are cacheable on some sort of a CDN. And if you scroll over the links displayed in firebug, you will see the small cacheable objects are the first to load. That is because they reside closest to the user.
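If you want a rough count of the objects without opening firebug, the same idea can be sketched with Python’s standard `html.parser` module. This is only an approximation under stated assumptions: it looks at the static HTML, so anything a script pulls in later is missed, and the sample page and paths below are invented for illustration.

```python
from html.parser import HTMLParser


class ObjectCollector(HTMLParser):
    """Collect the URLs of objects a browser would fetch for a page."""

    def __init__(self):
        super().__init__()
        self.objects = []  # list of (tag, url) in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script", "iframe") and attrs.get("src"):
            self.objects.append((tag, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            # Only count links the browser downloads to render the page.
            if "stylesheet" in rel or "icon" in rel:
                self.objects.append((tag, attrs["href"]))


def list_page_objects(html):
    """Return every (tag, url) pair found in the given HTML text."""
    collector = ObjectCollector()
    collector.feed(html)
    collector.close()
    return collector.objects


# Made-up page for illustration only.
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="/js/app.js"></script>
</head><body>
<img src="/img/logo.png" alt="logo">
<img src="/img/mr_t.jpg">
<a href="/news">News</a>
</body></html>
"""

for tag, url in list_page_objects(SAMPLE_PAGE):
    print(tag, url)
```

The plain `<a href>` link is deliberately not counted, since the browser does not fetch it until the user clicks. Every object that is listed is a candidate for living on a cache instead of the origin.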

To be continued…….

Build it to scale from Day 1

February 5, 2011

so let me tell you a little story…..

Scale from day one

“Design and build your startup or small Internet business infrastructure to grow with you and minimize tech debt. Make your startup more attractive to buyers by making integration easy” – not sure who said this, but it was not me…however it sounded good enough to end up on my blog.

It was a typical “valley” gathering. Some VC with way too much money had invited a bunch of people to the roof of some really expensive hotel in San Francisco. After a few tequila shots, I snapped out of my 4:20 mode and began talking to the various guests. It seemed the most interesting conversation was taking place around the Beluga Caviar table, flush with all kinds of raw edibles. I crammed in and found myself between the Twitter guy and the Google dude. Others included heads of ops and architecture from eBay, Microsoft, Facebook and so on.

I knew these guys and the conversation was wrapped around the usual things: Asia bandwidth prices; load balancer limitations; layer 7 routing and so on. But a rather gentle being who had, by accident, stumbled upon this group was simply looking at us like a deer in headlights. Ah! Fresh blood.

I made my way over to him and introduced myself. He was an Admin at a startup. A rarity in my circle of friends. He had absolutely no idea what we were discussing, so I asked him what he was doing. He told me about the idea behind the startup and the fact that they had just run out of capacity on their very first server. Wow! A problem of scale.

So I started out by saying: “Amigo….what iz t u serving up that made you run out of capacity?” I caught myself halfway through, knowing I sounded like a crack dealer looking for the soft spot.

“well, we are doing some XYZ virtual SaaS, app which solves the ABC problem for 100000 gizzilion people”.
to be continued……..

So you asked….here it is: What it takes to work in Ops for the big boys?

February 5, 2011

You asked for it, and here it is……
“what it takes to be part of ops for one of the web giants?”
Well…it ain’t for the faint of heart, but once you get the basics down, it is like surfing the big waves.

First blog entry regarding “what it takes….”

“what is an AS?” asked the young Jedi.

“Like an ASN…Autonomous system…dude, like BGP…are you kidding me? We are three weeks into this project and you are asking me what an AS is?” I replied and watched the blank stare on the intern’s face. Obviously what I had asked him to deliver vs. what he knew were a bit different and it was going to be a long summer indeed. What are these kids learning at these prestigious schools? The school’s name had landed him a lucrative internship, but as far as the basic knowledge necessary….well, it is not there. Even in a Master’s program.

It was another day in paradise. Woken up at 6:30 AM by the VP of engineering asking me to sit in for him during the glorious morning call-in for Operations. I knew the trick. On the days when he was going to get beat up, I would get his infamous call. And he was very nice during those calls. Generally they had to do with “some conference call conflict” he had. Too bad I knew exactly the people he was going to have the call with, and I could bet my life’s savings that they were going to be sound asleep during the proposed time for that call. So I had a window of roughly half an hour to scramble and find out what issues had surfaced overnight. And I have friends…friends in the right places. 2 minutes before the call I scramble my notes and get into character.
I dial in. Beep…”This is Reza” …Beep…”this is Mr. Green”…Beep…”This is Mr. Orange” followed by another 8 beeps.

I tell myself that most of these people are truly not needed on this call, including myself.
After the initial greetings, we get into the various reports. The NOC is the first to go, followed by Data Center Operations. I was to go over Network Operations, Storage, Systems and architecture in general. My turn: paint a true picture as to the cause of a DNS meltdown that took the east coast of the US down. Everyone goes silent for about 20 seconds. My boss’s boss, who is essentially the COO of the company, breaks the silence. “Well…why the heck did they tell me it was a capacity issue?” I had no answers…and playing politics and managing “up” was not part of my DNA. So I told him “not sure what you were under the impression of, but rotations on the authoritative side are not a resource-intensive process….as a matter of fact we have 5X the capacity we need…the problem is how the authoritatives are replicating between one another. What happened last night was a human error”. And with that I knew that someone, somewhere, including my boss, was going to think I was throwing someone under the bus. Definitely my intention. When you build very large systems and need to run them, human error becomes one of the core and fundamental flaws that cause outages.

After this call, I avoided showing up to my cube for an hour or so. As usual, I did my morning rounds. Ended up in my partner in crime Mr. Green’s “Existentialist” chair. I listened intently to him going over the necessities of Anycast in the grand network he was building out in Asia. We went over the intricacies of ACLs and blocked tunnels in China, network diversity in Taiwan, Singapore data center pricing and finally metro-based fiber routes in Japan. The conversation was a physically moving one, meaning that I followed him as he got his usual morning green tea, stepped outside to smoke and made various hallway greetings. By the end of the conversation it was 9AM; time for the Europe calls.

The above couple of hours’ worth of experience, and similar experiences related to grand IP-centric infrastructures, are fundamentally what I am focused on in my blogs. I hope that by breaking it down, the mysticism clouding the “Big guys” becomes more digestible. In the end, if others are building and looking at things as we do, it truly makes our job much easier.

What is ?

January 27, 2011

I started my job as the new CTO of Reply! First thing I was asked is to draw up the product/service roadmap for 2011. Before doing that, I asked myself, what is Reply!:

“We operate a proprietary auction marketplace that facilitates online locally-targeted marketing. We aggregate customer prospects for advertisers from many different online traffic sources and categorize those customer prospects based on user-provided information regarding a product or service of interest to the user and the location at which the user desires to purchase the product or receive the service. Our marketplace provides locally-targeted advertisers with performance-based marketing solutions on a cost-per-“Enhanced Click” or cost-per-lead basis. Our Enhanced Clicks are generated by customer prospects and provide user-submitted category information and the location at which the product will be purchased or the service will be rendered. In addition to providing all of the information contained in an Enhanced Click, our leads also provide our advertisers with the customer prospect’s contact information. We rank the quality of each customer prospect based upon our historical experience and other factors regarding the propensity of the prospect to take action, which enables advertisers to differentiate their bids for Enhanced Clicks and leads based on the quality of the customer prospect.

Our marketplace simplifies online locally-targeted marketing by eliminating an advertiser’s need to develop and maintain complex, expensive infrastructure and teams of experts to source online consumer traffic from many different channels, including search engine marketing, display and email. Additionally, compared to traditional lead generation businesses, our marketplace provides advertisers greater control over quality, volume and price, and therefore enables our advertisers to optimize their marketing efforts and better manage their cost per transaction. Our marketplace allows advertisers to adjust their bids on a real-time basis. Regardless of the advertiser’s level of sophistication, our marketplace is designed to deliver customer prospects in the format that best addresses the advertiser’s needs. The customer prospects can take the form of an Enhanced Click delivered to an advertiser’s website, or the advertiser can choose to receive the customer prospect in the form of a ready-to-call lead, bypassing the need for the advertiser to develop the necessary infrastructure to convert Enhanced Clicks into ready-to-call leads. Our technology allows us to offer our services for any industry. We currently serve advertisers primarily in the automotive, home improvement, insurance and real estate industries. “

Next Blog Post….What do we want to be?

I guess I will start by talking about CDNs

November 24, 2008

Well here it is folks, my first blog ever.
I am quite excited that I finally mustered up the courage to do this. Plain and simple!

My blogs will be generally centered around the whole data networking business and large scale networks. Anything from data center related problems to fiber paths and etc etc etc.

Today I will put CDNs on the block since they are near and dear to my heart.

I won’t really get into the history (will leave that to a whole new blog), but for some odd reason mom and pop CDNs are popping up left and right. Seems Akamai will actually have to work for their money.

On one hand you have Level3 CDN, competing directly for the same dollars as Akamai. They are offering a service that is definitely not as good as Akamai’s, but it is definitely an alternative to dealing with Akamai. No API, no portal and let’s face it… it is a poor man’s answer to CDN.

I won’t really get into Limelight, since they are simply in survival mode. The whole 703 patent issue has left a bitter taste in everyone’s mouth and these poor chaps are the true recipients of Akamai’s “strive for righteousness”. Amazon’s CloudFront is simply a reselling of Limelight i

Coming right up is Panther. Lean, mean and ready to play, this company was formed by the guys who started DoubleClick (now part of Google). They definitely “suck less”, but they are a startup and you are paying for just that. Do not expect a major NOC or customizable code.

Internap is there and their Squid-based rollout is limited. They have a funky rollout with a supernode model. They have roughly 100 gigs worth of capacity.

Verizon now has a service based on Velocix. This is so new that no one really knows the true details. I will have info in a month or so.

There are many other CDNs out there and in my later posts I would like to talk about what is going on in Asia and South America.

