System Overhaul

It’s become a bit of a routine over the last two weeks or so. Come in to work in the morning, look at the numbers, find that they have gone off, spend the morning investigating why they’d gone off. Spend the afternoon implementing a patch in response to the findings of the investigation. Leave the system to run for the rest of the day, come in the next morning, check the numbers, find that they have drifted again…

On and on and on, like a never-ending puzzle. Every time, we hose out a category of bugs, only to find another emerge, either from increased clarity, or as a result of the previous fixes.

Everyone was getting frustrated. Not least, yours truly.

“If you can’t fit the whole picture in your head, something is very wrong”
– Wise words from a colleague

It was exactly what I was struggling with. Every morning, I’d come in, look at the numbers going off, and spend a good amount of time just loading up the entire platform in my tiny brain to play out the scenarios, and form a logical explanation for why they were off.

Every patch we added was an added complexity that could very likely come back and bite us further down the track. Some of them actually did.

We needed to take a different approach.

It turned out we’d made a few design decisions early on that didn’t account for one of the “heavier” components coming online more recently. Between the complexities of developing that new component and helping it play well with the rest of the platform, we didn’t have much capacity left over to even consider revisiting the design of some of the other components.

So we’re going to take some time to re-engage with the broader design issues and come up with a more permanent solution. Sure, it’s weeks of sunk cost down the drain, but it’s also potentially months of future debugging that we won’t need to do.

Glad we’ve come this far. No one could have seen this coming, and it’s unlikely we’d have got here any other way.

So here’s to a new week, a deep overhaul that settles it once and for all, and moving on to productive work.

The bigger you are

It had been a long week at work. End of sprint, we’d somehow managed to graft our contraption onto the business’s operations without causing too much damage. It wasn’t without its hiccups, but I’m led to believe it was an overall win.

So I was looking forward to my usual 2.5km ride to the station, and my 40-odd-minute journey in a metal can. Metro had other plans for me.

The train was stopped at the station – all the doors were open, and people were emptying out of the carriages and making their way to the bus stop.

There had been a train/car accident at Cheltenham station. One of the staff at the bus stop suggested I ride to Moorabbin station, since taking the bike on the bus would be a bit of a stretch. “You look like a good rider”, he said, probably referring to my very manly looking cycling tights.

So I tried to mirror the severe disappointment in Metro all around me. Secretly, I was delighted at the opportunity to do something out of the norm. I powered up the GPS tracker on my phone and was on my way.

This, I had to capture – all 11km of it. It would make good conversation fodder for the weekend compared to my usual nerdy contributions.

Along the way, the road blocks set up by the police nearer to the scene of the accident hinted at the seriousness of the matter, but nothing prepared me for the swell at Moorabbin station.

Lots of other less fortunate people were having a less than ideal start to their weekends. At that point, any sense that mine was a predicament quickly vanished – I was having a ball by comparison.

These photos are a stark reminder of the fire that I play with on a daily basis. I engineer information systems for a high-volume online retailer. Not quite the scale of a metropolitan train network, but analogous enough.

The fundamental goal of any large-scale system is to harness economies of scale to reduce waste and increase efficiency. But what many fail to understand is that with any large-scale monolithic system, the stakes increase exponentially with the gains. Potential points of failure proliferate with every corner cut, and it only takes a few minute defects before the whole thing crumbles into a sorry heap.

Scale is a gallant champion, but makes for a horrendous and putrid failure.

HTTP logs in RESTful SOA

The RESTful SOA system that we’re building is turning out to be a bit of a beast. We’re approaching a point where there is a lot of inter-service chatter going on, and because there are so many independent moving parts in play, it is hard to keep track of where things are ‘crapping’ out.

This is where Apache/nginx server logs and HTTP status codes have come into their own.

While performance takes a big hit when inter-service communication is done over HTTP, it comes with some advantages. The HTTP gap we’ve wedged in between services has allowed us to debug and interact with each service individually. It has also given us far more visibility into how our services are responding to each other, and to user requests.
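To make that concrete, here’s a minimal sketch of poking a single service directly over HTTP, the same way an upstream service would call it. The host, port and endpoint are hypothetical, and Python’s requests library stands in for whatever the calling service would actually use:

```python
# A minimal sketch of calling one service directly over HTTP.
# The host, port and endpoint below are hypothetical.
import requests

resp = requests.get(
    "http://inventory.internal:8080/stock/SKU-123",  # assumed internal service URL
    headers={"Accept": "application/json"},
    timeout=5,
)

# The status code we see here is the same one that lands in the service's access log.
print(resp.status_code)
print(resp.json() if resp.ok else resp.text)
```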

With our services, we’ve worked very hard to adhere closely to the standard HTTP status codes – e.g. 401 Unauthorized, 403 Forbidden, 404 Not Found, 400 Bad Request, etc.
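As a rough illustration of what that adherence looks like (a Flask-style handler with made-up data is assumed purely for the example – it isn’t our actual code), each failure mode maps onto the status code you’d expect to see in the logs:

```python
# A rough sketch of mapping failure modes onto standard HTTP status codes,
# using a hypothetical Flask service with in-memory placeholder data.
from flask import Flask, jsonify, request

app = Flask(__name__)

ORDERS = {"1001": {"status": "shipped"}}  # placeholder data

@app.route("/orders/<order_id>")
def get_order(order_id):
    # Placeholder auth check: missing/incorrect credentials -> 401 Unauthorized
    if request.headers.get("X-Api-Key") != "expected-key":
        return jsonify(error="unauthorised"), 401
    order = ORDERS.get(order_id)
    # Unknown resource -> 404 Not Found
    if order is None:
        return jsonify(error="no such order"), 404
    # Success -> 200 OK
    return jsonify(order), 200

if __name__ == "__main__":
    app.run(port=5000)
```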

In the past week, it seems our diligence has started to pay off: we’ve been able to quickly diagnose issues by simply tail-ing the HTTP log files, watching the URL requests and tracking the status codes. This has helped us narrow the scope of concern very quickly and zero in on the source of a problem.
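For flavour, here’s a small sketch of the kind of thing this enables – following an access log and tallying status codes per request path so problem endpoints stand out. The log path and the “combined” log format regex are assumptions, not our actual setup:

```python
# A small sketch: follow an access log (like `tail -f`) and count status codes
# per request path. The log path and format below are assumptions.
import collections
import re
import time

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical log location
LINE_RE = re.compile(r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')

def follow(path):
    """Yield lines as they are appended to the file."""
    with open(path) as fh:
        fh.seek(0, 2)  # start at the end of the file
        while True:
            line = fh.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

counts = collections.Counter()
for line in follow(LOG_PATH):
    match = LINE_RE.search(line)
    if not match:
        continue
    key = (match.group("path"), match.group("status"))
    counts[key] += 1
    # Surface client/server errors (4xx/5xx) as they happen
    if match.group("status")[0] in "45":
        print(match.group("status"), match.group("path"), "seen", counts[key], "times")
```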