Jacques Mattheij

technology, coding and business

German BND Aided NSA in Spying on Dutch, French and Austrians

According to this article in ‘Der Spiegel (German)’, Deutsche Telekom (the largest German telecommunications infrastructure operator) aided the BND (the German intelligence service, the ‘Bundesnachrichtendienst’) in tapping internet connections to Austria, France and the Netherlands (and most likely others), apparently in collaboration with the NSA, which also gained access to the data. Those connections carried traffic from many other countries besides the ones they led to.

Because of the strength of the German internet backbone and Germany’s central location, international traffic involving one of its neighbouring countries is likely to pass through Germany on the way to its destination. So much for ‘spying amongst friends’ being ‘not done’.

The German telco says it simply aided the BND and does not know what happened to the data after it left its premises.

A representative of the Green Party in Austria says he has concrete proof of the scheme, which seems to have involved unfiltered access to the packet stream through one of Europe’s most important internet transit points, the exchange in Frankfurt (a.M.). Weirdly, the arrangement appears to have traded preferential access to international traffic for a promise not to capture German traffic.

Do Not Disclose Your Salary to Recruiters

Recruiters are not my favorite people in the tech ecosystem. They’re the people who will contact you on behalf of some client (the company that pays them) to try to find new employees. They typically get paid a fairly hefty sum per employee found, and if ever there was a moment when the ‘you are the product’ meme was even remotely true, it is when you are dealing with a recruiter. As with all products, you have a price, and that price matters; in fact, it matters a lot.

Typically they will ask you a bunch of questions to figure out whether you are suitable for the task at hand, and another bunch that have to do with the budget they’ve been given, to see whether they are going to be in the ballpark and to make the future salary negotiations easier for themselves and their client.

The key question is ‘How much are you currently earning?’.

Note that you are absolutely not under any obligation to answer this question! That’s your private information and if you pass this little bit of information to the recruiter you’re doing several things:

  • you have just blown your negotiation position out of the water

  • you have passed information from your current employer to the recruiter and your potential future employer

  • you have made it that much harder for all savvy negotiators at your current company to play their cards close to their chest with this recruiter and your potential future employer

  • you are assisting in a covert price-fixing scheme between employers, in which they all attempt to pay roughly the same rather than negotiate with each individual according to the value presented

  • all of the above holds even if the prospective employer does not hire you

The number of scenarios in which this information can be abused is large. One of the more interesting ones is when an employer is not actually looking to hire you but someone else, and is using you to social-engineer a picture of the salary scales at your company in order to hire others who may be more aware of their value.

Salary negotiations are an information game, and handing over such valuable information for no return at all is a fool’s strategy.

When asked this question, I would suggest you either ask the recruiter for their budget in return, or simply state your salary requirement (rather than your current salary). That way you either volunteer only what you can part with without hurting your position too much (and can say ‘no thanks’ if that figure is then used to lowball you), or you gain some information yourself which you can use to make your decision.

But do not state your current salary to someone who is effectively on the other side of the negotiating table, especially since it isn’t just yourself that you are hurting!

Fifty

So, it has finally happened. I just turned 50, and I can’t say it was something I was looking forward to. First the good news: today seems to be a day like any other, so if you’re younger than I am, I’m happy to inform you from the far side of the great divide that there is nothing that feels abruptly different. The bad news is that even though nothing feels abruptly different, plenty of things feel as if they’ve been changing subtly over time, and not all of those changes are for the better.

Health-wise I fortunately have nothing to complain about. Other than some spare parts that were stolen from me during a visit to some medical institution, and the usual assortment of dental and vision issues that are for now an almost unavoidable part of aging, I have been really lucky so far. Wounds take a bit longer to heal too, so I have to be more careful. Not smoking, not drinking and not doing any drugs probably helped, but there is no way of being 100% sure of that. Even absent proof, I believe those choices contributed strongly to my present state of health, and that they are the reason that whenever I’m working or playing alongside people substantially younger I don’t slow them down too much (well, that’s what I hope anyway).

Even so, it’s a bit like the Sesame Street ‘banana in your ear’ skit: it isn’t quite proof, but the statistics seem to be on my side. It’s just that statistics don’t say much about any individual to begin with, so what looks like good decision-making could easily be luck.

When I’m in a group of people my own age I feel as though I’m around old people; I can’t help it. On a first encounter I instinctively use the formal form of address in Dutch, which is really strange because I’m going to have to admit that I’m one of them. And yet I don’t feel old at all, though I have to work a bit harder than in the past to stay fit and I find that pulling all-nighters is no longer without a price. Other than that, from a physical point of view it’s business as usual.

What has really changed most over the years is my head.

My patience for bullshit has definitely worn thin over the years; ditto my patience for people who hurt others for no particular reason, whether accidentally or maliciously. This has always been a pet peeve, but I have to admit that I’m a bit shocked at how forcefully I can position myself when I perceive injustice of any kind. The whole Aaron Swartz affair still has me raging and keeps me awake at night: the idiocy of mindless bureaucrats pushing brilliant young people (who in their eyes are apparently conceited enough to think that we shouldn’t wait until the next five generations have passed to reform the world, but should do it today) to the point of suicide just to set an example, and all their little helpers and apologists, makes me see red and dream dreams of terrible violence.

The flip side is that subjects that would have upset me greatly in the past now leave me cold; the reassuring fact that the world will keep on turning even when something upsetting has happened gives me peace where in the past it would have left me disturbed. I’m less inclined to micro-manage things, more relaxed, more at ease with letting others play a larger role, and far more inclined to accept that there are multiple valid ways to reach a certain goal than I would have been in the past. At least, so I hope :)

On the whole I feel a lot better and happier than I did when I was 40!

The part where I feel I’ve been most fortunate is the people that I’ve gotten to know over the years. The variety in cultural background, skills, geographic breadth and so on is simply astounding. And with every new person that I come in contact with I appreciate more that there is an interesting story behind just about everybody. All it takes is to be willing to listen. This is the thing that drives me to learn ever more languages.

For me ‘50’ is a bit magical in the sense that I can hardly escape the fact that I am now very, very likely to be past the halfway mark of how long I’ll end up living, even if everything the future still holds works out in my favor.

Admitting that I’ll never amount to much that my 20-year-old self would have been really proud of (even if I do reach 100, which I more than likely will not) is definitely an admission of defeat, and I don’t have any clear plan (or even any plan) for how I would go about doing better. Thirty years just melted away like so much snow; the words ‘opportunity cost’ are now fully grokked. But then again, it is hard to define what I should have changed within my ’circle of influence’ to give me more striking power, or how well I would have been able to wield such power. The person who indirectly caused me to start writing this blog in the first place died in March 2008, and I can hardly believe that it’s already been that long. What struck me about him is that he had more or less the same ideas I did but much better tools, so I had just decided to put my brain to work for him when I received a phone call telling me that that would not be happening. That’s one road whose destination I’ll never find out, and I’m terribly sorry about that.

On the technology front things are moving ever faster: the gramophone lasted for many decades, the CD for a few, and new audio distribution formats now spring up and disappear with the seasons. At least MP3 seems to be here to stay. Ditto for other media: books are about to go extinct (and I’ll miss them), and computer programs are now monstrosities of layer upon layer of libraries cobbled together to create ever more slickly polished user interfaces showing ever more inane content and advertising. It is hard to stay relevant as an aging techie in that world, but that will not stop me from relentlessly re-educating myself about the field I have chosen to work in. I’m sure there will come a time when I am no longer able to keep up, but for now this is one area where (self-)discipline and perseverance can still level the playing field, and I (still) have those in spades.

It’s probably official that I’m no longer ‘employable’ in the regular sense of the word: any company looking at me would see a person who does not care overly much about formalities or undeserved respect, who is going to be extremely hard to manage, and who is going to upset most or all of their holy apple-carts when allowed to range free. So rather than pursuing employment directly I’m now relegated to the pasture that old programmers and other technical professionals go to: consulting. Usually this involves auditing other people’s work, figuring out why things went wrong, helping to make the things that went wrong go right again, writing a (really) large invoice and then moving on to the next job.

It’s a win-win: I don’t have to spend years in the same chair, and the companies that hire me for short-term contracts get the best of me without having to keep me busy once the biggest problems have been solved.

So, onwards. How many years of this hardworking version of me I still have in me I can’t tell: maybe 5, maybe 10, most likely not 15, though it might just happen.

What definitely hasn’t changed is my curiosity: the future is going to be interesting and I can’t wait to see it, to see how its machinery is put together and how that machinery will affect and shape our lives.

50? Who cares ;)

Computers Are Brain Amplifiers

The lever, the transistor, the vacuum tube and the computer all have something in common. They’re amplifiers: they allow a relatively small change or capability in one domain to have a much larger effect.

Let me give you some examples. The crowbar is an instance of the lever concept: using a crowbar, a relatively puny human can lift enormous weights simply by trading the distance one end of the lever moves against how far the other end will move. Push one end down a meter and the other end will go up 5 cm or so, a 20:1 mechanical advantage that allows me (and you!) to lift approximately a ton. Similar to the lever are the block and tackle, gears, hydraulics and so on.

The transistor and the vacuum tube do the same for electrical signals. A small signal is used to control a larger one, which for instance can be used to create a radio where the very weak signal of the antenna is used to control a much larger signal driving a loud-speaker (there is a bit more to it than that, but that’s the basic element).

Where the lever works on mechanical energy and the transistor or vacuum tube works with electrical energy, the computer operates on information.

A computer allows you to amplify the power of your brain considerably by trading understanding for extreme speed of execution. This allows a person who is not well versed in math, for instance, to arrive at the correct answer to a math problem either by trying a whole bunch of candidate solutions in a row (called ‘brute forcing’) or by using the computer to work out a rough approximation of the answer, which then leads to the crucial insight required to get an exact one.

Let me give you an illustration of how this might work. I’m a big fan of Project Euler; it’s a great way to get started with a new language and/or to learn a bit more math. My math skills are always below where I want them to be, so for me Project Euler serves a double purpose.

One interesting entry is problem #307:

Chip Defects

k defects are randomly distributed amongst n integrated-circuit chips produced by a factory (any number of defects may be found on a chip and each defect is independent of the other defects).

Let p(k,n) represent the probability that there is a chip with at least 3 defects. For instance p(3,7) ≈ 0.0204081633.

Find p(20 000, 1 000 000) and give your answer rounded to 10 decimal places in the form 0.abcdefghij

So, how is a math-challenged person going to solve that particular problem without a complete understanding of either the math or the problem domain?

Simple: use a computer! It will make you appear smarter than you really are by allowing you to trade understanding and math skills for the enormous speed with which it can operate.

The first thing to recognize, after a careful reading of the problem, is that there are two possible ways to solve it: the first is to plug the input variables directly into a formula that gives an exact answer; the second is to simulate a chip factory and work out the ratio from the results of the simulation.

Project Euler problems always helpfully include a ‘toy’ version of the real problem to help you get started and to give you an idea of whether or not you’re on the right track. Here the toy version is the p(3,7) one where you’re supposed to get 0.0204 etc as the answer.

Simulating this problem is not very hard: you break the problem down into a large number of simulation ‘runs’, tally in how many of those runs there is a chip with 3 or more defects, and finally divide that tally by the total number of runs. That’s still something that I can figure out, so let’s program it (this is in ‘Erlang’, the language that I’m currently (slowly) learning; the syntax may be a bit off-putting, but I’ve included quite a few comments that should help you read the code):

% frequencies returns a list with a tuple per word indicating the
% word and the frequency with which it occurred in the input list

frequencies(L) ->
    dict:to_list(lists:foldl(fun(Word, Dict) ->
            case dict:is_key(Word, Dict) of
                    true -> dict:store(Word, dict:fetch(Word, Dict)+1, Dict);
                    false -> dict:store(Word, 1, Dict)
            end
    end, dict:new(), L)).

% highestfrequency will return the number of times the highest
% frequency element of a list occurs, so for [1, 3, 2, 1, 3, 5] 
% it will return 2, the number of times 1 and 3 occurred

highestfrequency(L) -> lists:max([B || {_A,B} <- frequencies(L)]).

% returns a list of random elements from a range (non-consuming, so the
% same element can be present multiple times)

randompicks(Npicks, Range) -> [random:uniform(Range) || _ <- lists:seq(1, Npicks)].

% returns how frequently the most frequently occurring circuit id
% was present in a list of randomly assigned defects

simulate(Defects, Circuits) -> highestfrequency(randompicks(Defects, Circuits)).

% onerun performs all the runs of the simulation in turn and
% returns the number of runs in which a chip had 3 or more defects

onerun(Ndefect, 0, _Defects, _Circuits) -> Ndefect;

onerun(Ndefect, Runs, Defects, Circuits) ->
    case simulate(Defects, Circuits) >= 3 of
           true -> onerun(Ndefect+1, Runs-1, Defects, Circuits);
           false -> onerun(Ndefect, Runs-1, Defects, Circuits)
    end.

problem_307(Defects, Circuits) ->
    Nruns = 10000, % the number of times we run the simulation
    onerun(0, Nruns, Defects, Circuits) / Nruns.

problem_307() -> problem_307(20000,1000000).
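
For reference, this is roughly what a session looks like in the Erlang shell, assuming the code above is saved as euler307.erl with a module header added (-module(euler307). plus an -export list for problem_307/0 and problem_307/2); the exact figures will differ from run to run because of the random element:

1> c(euler307).
2> euler307:problem_307(3, 7).     % the toy problem
3> euler307:problem_307().         % the full problem, takes a few minutes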

Running this on the toy problem produced 0.0204, so it seems to be working. Running it on the real problem produces (after a few minutes) 0.7291, so the actual answer should be close to that.

If we were running a chip factory and needed an answer, 0.7291 would already be a lot better than no answer at all. But the problem explicitly asks for the answer to 10 decimal places, presumably to stop people from solving it through simulation, so this is not a solution to the problem as stated.

By increasing the number of ‘runs’ we can obtain higher precision, but because of the random element involved we will most likely never get the solution to 10 decimal places. So if you came to this post for the ‘real’ answer to problem 307 you’ll have to look elsewhere, but if you wanted an illustration of how computers really can help you solve problems that you otherwise could not, then I hope you found what you were looking for.

Let’s Talk About Your Backups

Hard to believe, but the most frequently encountered ‘to-fix’ during due-diligence has to do with what you’d think would be a formality: having proper backups.

Every year I see quite a few companies from the inside, and it is (still!) a total surprise to me how many of them have a pretty good IT operation yet manage to get the basics wrong when it comes to something as simple as backing up the data and software that is critical to running their business. Of course it’s easy to see why backups are an underappreciated piece of the machinery: you rarely need them. But when you really do need them, not having them can rapidly escalate a simple hardware failure into an existential crisis.

Here is a list of the most common pitfalls encountered in the recent past:

  • Not having any backups at all. Yep, you read that right: no backups, because supposedly they’re no longer needed; after all, we’ve got RAID, the cloud and so on. WRONG. Hosting in the cloud or having a RAID setup is not a substitute for a backup system. Backups are there for when your normal system turns abnormal; they allow you to recover from the unforeseen and unplanned-for (usually called disaster recovery) and will, perhaps after some time to order hardware, allow your business to function again once the crisis has been dealt with. So hosting your stuff ‘in the cloud’ and using RAID are not backup strategies.

  • Not having everything backed up. What should you back up? EVERYTHING. Documentation, build tools, source code, all the data, databases, configuration files and so on. Missing even a single element that could be backed up in seconds right now might cost you days or even weeks of downtime if something unforeseen happens. Making backups is very easy; re-creating software, or trying to re-create a database that was lost, especially when you’re missing any of the other components as well, is going to be many orders of magnitude harder. A common way to get into this situation is to set up a good backup regime initially but not to update it when the service is upgraded or modified; over time the gap between what’s backed up and what actually runs the production environment slowly widens.

  • Not verifying backups. After you’ve made your backup you want to verify that it was made properly. The best way to do this is to restore the backup, rebuild a system comparable to the live system from it (preferably from scratch), and then verify that this system actually works. If you can do that, and you do it regularly (for instance after every backup), you’ll have very high confidence that you can get your systems back online in case of an emergency. One good way to do this is to install the test system from the backup (after suitably anonymizing the data if you’re required to, or if that is policy, and it should be!). A small supplementary check is sketched after this list.

  • Not having your backups in a different location than your primaries. If your datacenter burns down it doesn’t help that your backup sat in another rack, or maybe even in the same rack, on the same machine, just in a different VM. Backups should be separated from the originals, immediately after creation (or even during it), by as large a gap as you can manage.

  • Having your backups and your originals accessible to the same individuals or roles. What this means is that you are exactly one bad leaver or one compromised account away from being put out of business. The path through which backups can be erased should be different from the path through which they are created (and should involve different people!). Don’t end up like CodeSpaces.

  • Not cycling your backups. A single copy of your stuff is better than nothing, but a good backup strategy implies having multiple copies going back in time. Yes, this can get costly if you only do ‘full’ backups, but if you do incrementals with periodic full backups (for instance, a weekly full plus daily incrementals) the cost can be kept under control, and it will allow you to recover from more complex issues than simply losing all your data and restoring it. This means you will have multiple full backups and multiple partials that can get you as close to any given date, with respect to the state of your systems, as you want.

  • Having all the backups and originals on the same media. This is unwise for several reasons: for one, it implies that all the storage is spinning and online, which makes it vulnerable; second, it implies that if there is something wrong with any of the copies there might be something wrong with all of them. In that case you have a problem.

  • Replication in lieu of backups. If you use a replicating file store (such as a NAS), even if it is from a top-tier vendor, you still need backups, no matter what the sales brochure says. Just having your data replicated 3 or even 5 times is not the same as having a backup. One single command will wipe out all your data and all the replicas; do not rely on replication as your backup strategy.

  • If backups are encrypted, store the decryption keys outside of the systems that are being backed up! It’s great that you have an internal wiki where you document all this stuff, but if you’re standing there with your system down, that wiki might be down too, and it contains exactly what you need to get your data back. Print that stuff out and stick it in a vault or some other safe place so that you can reach it when you need it most. The same goes for account credentials in case you use a third-party service as part of your backup strategy, and for a list of emergency telephone numbers.

  • Not having a disaster recovery plan. Go through the motions of restoring after a total crash without referring to the currently running live environment, document each and every little thing you find missing, fix those issues, then test again until you can recover reliably. Re-do this periodically to make sure that nothing has crept in that didn’t make it into the backup plan.
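
As a supplement to the restore test mentioned above (not a replacement for it), a cheap first-line check is to compare a checksum manifest of the restored tree against the original. Here is a minimal sketch in Erlang, the language used elsewhere on this blog; the module name and directory arguments are purely illustrative, and it assumes both trees are mounted locally and that the directory paths are passed without a trailing slash:

-module(backup_check).
-export([manifest/1, compare/2]).

% manifest builds a list of {RelativePath, MD5} pairs for every regular
% file under Dir (reads whole files into memory, fine for a sketch only)

manifest(Dir) ->
    filelib:fold_files(Dir, ".*", true,
        fun(Path, Acc) ->
            {ok, Bin} = file:read_file(Path),
            Rel = string:substr(Path, length(Dir) + 2),
            [{Rel, erlang:md5(Bin)} | Acc]
        end, []).

% compare returns the relative paths that are missing from the restored
% copy or whose contents differ from the original

compare(OriginalDir, RestoredDir) ->
    Restored = manifest(RestoredDir),
    [Path || {Path, Sum} <- manifest(OriginalDir),
             proplists:get_value(Path, Restored) =/= Sum].

An empty list from compare("/srv/live", "/mnt/restore-test") (hypothetical paths) means every file came back with identical content; it still tells you nothing about whether the restored system actually boots and serves traffic, which is what the full restore test is for.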

The Army of the New Independents

Almost every first-world country has them: a legion of newly minted companies with just one person active in the company. They’re called 1099’ers, ZZPers, Freie Dienstnehmer and so on depending on the location, but other than that the situations are quite comparable.

For many of these newly minted independent contractors the decision to go that route was born out of necessity rather than being a choice made freely. One day they showed up at work, decided what to have for lunch that day, and before they could eat it they were out on the kerb with their box of personal belongings under their arm and the local equivalent of a pink slip in their pocket.

Too proud to go on the dole, too old (or the local economy too lousy) to get another job, these people decided that if no other options were left, independence would be better than starvation. It’s the new flexibility, brothers and sisters; better roll with the tide. In a lot of these cases their first contract is - surprise - with their former employer, to continue doing just what they were doing the other day.

Quite a few companies have caught on to the fact that they can lay off their expensive full-timers and re-hire them as ‘flex’ workers who are independent enough that they no longer need all the trimmings they had as regular employees. Huge cost savings for the companies and a much more streamlined business.

In my view this is all a huge step backward. Companies and their employees are engaged in a social contract in which the companies make bank by assuming some of the risks and the employees hold up their end of the bargain by providing value for their wage. As soon as one side or the other no longer wishes to hold up its end of that deal, something has to give.

And here is where it will go wrong: as an independent contractor you will likely (especially initially) not be able to afford any kind of insurance that will compensate you for lost income in case you are unable to work for whatever reason. You will likely not be able to pay towards your eventual retirement fund, and you will likely not be able to cover the extra costs that your fledgling company has to incur in order to comply with all the administrative rules, and so on. In short, you are now assuming all the risk, and that carries a price tag.

But because the Army of New Independents is, as a rule, not savvy about this little detail, you’ll be caught in a race to the bottom, where Joe from around the corner (who has no idea about any of this) can and will undercut you in price if you charge what you have to in order to make ends meet and to be more or less as secure as you were as an employee. That Joe will eventually go bankrupt is currently not on his radar, and his customers obviously do not care at all. And when and if Joe hits the wall, there will be a whole legion of newly minted Joes to take his place.

This royally sucks because in the longer term it translates into a very large shift of burden from companies to the rest of society, or alternatively (depending very much on the location) a much larger number of people who will end up in serious trouble.

Another option is for a number of independents in a locality to join forces and to present themselves as a single unified front.

So, to all you newly minted independents: try very, very hard not to get caught in the race to the bottom; in the end it is your own throat that you’re cutting. If you’re going to work as an independent, as a rule your hourly rate should be a pretty high multiple of the money you were making as an employee, in order to offset your downtime, insurance and increased costs. If you can’t manage that, you’re doing it wrong and you should rethink your strategy. Ballpark figure: anything below 70 ‘credits’ ($/£/€) per hour for IT work in the developed world, even if you do not have a reputation yet, is almost certainly too low.

If your customers are not complaining about your rates you are doing it wrong. (They’re still hiring you, right? Ever met a CFO who loved paying bills?) I aim for roughly 50% rejections based on the rate I’m charging; if it is less than that I’m too cheap, if it is more than that I end up with too much ‘down’ time.

Update: there was an interesting Twitter conversation in response to this post about my formula for calculating a baseline hourly rate when starting from a salary. I came up with 4.5:1 at a minimum. The reason (and this is still wrong, because we’re calculating from ‘need’ rather than from ‘value’, but that’s another topic entirely) is that you’re going to have to assume roughly 35-40% taxes, savings for downtime (assuming 50% downtime), and administrative overhead plus insurance. The person asking the question figured that going from 60K to 140/hour would be a ‘hard sell’, but that’s actually not that big a jump from a customer’s perspective: they’re not required to keep you on a minute past the time you are working on a specific job for them, you take care of all the paperwork, you assume some risk by invoicing after a chunk of the work is done, you take care of your own pension and healthcare, and so on. Where I live it probably costs 140/hour or more to hire a half-decent plumber!
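
To put illustrative numbers on that ratio (assuming roughly 2,000 salaried hours per year): a 60K annual salary works out to about 30 per hour, and multiplying that by the 4.5 factor gives roughly 135 per hour, which is in line with the 140/hour figure from that conversation.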

All the Technology but None of the Love

This post has been re-written several times, so please forgive me if it does not come across as coherently as I would like. The main reason for the re-write was this post by Martin Wedge. Originally I planned to scrap the whole thing, but maybe having a different perspective will help: I’ve deleted the parts that he covers and have tried to massage the rest into something resembling a continuous whole, but there are still big gaps, apologies for that.

When I wrote this the first time around it was titled ‘Those Cranky Old Men’, which was flawed for two obvious reasons: it isn’t just the men that are cranky, and they’re not all that old. But since the majority of the complainers really are men (or at least the ones I’m presently aware of), and since even Linus is now officially old, and a graybeard at that, it wasn’t the worst possible title.

But it missed out on conveying the most important part of what I wrote, which is that today, with technology at our fingertips that the majority of scientists and passionate programmers from not all that long ago would gladly have sold their grandmothers or a kidney for, we are further away than ever from the future that I foresaw as a young kid (17 or so) finding his way into the world of software. A good part of that is the enormous influx of people into tech who are in it just for the money and who couldn’t give a damn about how they achieve their goals.

The ‘hacker’ moniker (long a badge of honor, now a derogatory term) applied to quite a few of those who were working to make computers do useful stuff. The systems were so limited (typically: 16K RAM, graphics 256x192, monochrome, or maybe a bit better than that and a couple of colours (16 if you were lucky) a while later) that you had to get really good at programming if you wanted to achieve anything worth displaying to others. And so we got good, or changed careers. The barrier to entry was so high that those who made it across really knew their stuff.

Of course that’s not what you want; you want all this to be as accessible as possible. But real creativity starts when resources are limited. Being able to work your way around the limitations and to make stuff do what it wasn’t intended for in the first place (digital re-purposing) is what hacking was all about to me. I simply loved the challenges, and the patient mirror that was the screen, mercilessly showing me what I’d done wrong over and over again until I finally learned how to express myself fluently and (relatively) faultlessly.

Being a young and very naive kid, I ended up working for what today would be called ‘founders’ of companies, people who had no tech chops whatsoever but who were good at spotting talent and aiming that talent at their own goals. This made some of those people very rich indeed (and I saw none of the proceeds). But instead of becoming bitter I learned the ways of business alongside the ways of technology, and when the internet hit I saw my chance and finally managed to make a (small) dent in the box that surrounded me.

If you’re a regular reader of James Hague, of the Martin Wedge post linked above, or of any one of a number of other ‘old guys’, then you’ll notice a recurring theme: learned to program as a kid, totally fell in love with it, and now comes across as a grumpy old man about to tell you to get off his lawn.

I think I have a handle on why this common thread exists between all those different writings. In a word: disappointment.

What you could do with the hardware at your disposal today is absolutely incredible compared to what we are actually doing with it. Notice the complete lack of eye candy on all these pages? Substance over form. Compare that to the latest GitHub announcement of some two-bit project with a catchy name and a super nicely designed logo but only an initial commit of a blank page. All form, no substance.

And then there’s the entrepreneurial climate. That’s a subject all by itself: the dog-eat-dog world of the new exploitation. CEOs of two-person companies; non-technical co-founders looking for a ‘first hire’ at a terribly low salary with 0.05% stock to create their ‘vision’, which will surely ‘change the world’.

Want to see an entrepreneur change the world? This guy has the right idea. Another way to share cat pictures? Not so much. Most of these guys couldn’t code their way out of a wet paper bag if they had to, and, just like the people who managed to get my best time for peanuts, they are looking to repeat history.

Let me give you a hint when you’re looking for employment at a newly founded start-up: if you don’t feel the offer you’ve received is worth spending $2K or so on a lawyer to have it vetted, then you are probably better off without it (thanks, brudgers).

To expand a bit on that: if you love technology but you are not business savvy, you are essentially ripe for the plucking by people who are business savvy and who don’t give a damn about the tech other than when it suits their goals (usually: get rich quick).

The equation for becoming a low numbered employee at a start-up goes something like this:

Let ‘p’ be the probability of failure of a start-up at its current stage (pre-seed, seed, series-A)

Let ‘c’ be your proposed compensation per month.

Let ‘m’ be the market rate monthly for a person of your capabilities in your locality.

Let ‘e’ be the average exit for companies in that line of business.

Let ‘s’ be your proposed stock grant as a fraction of the total issued stock.

Let ‘t’ be the time to exit in months for the average start-up.

If ((1-p) * s * e) < (m-c)*t

Then you’re much better off just getting a market-rate salary; even if you’re not in it for the money, there is no reason to set yourself up for being (ab)used.
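
To make that concrete with purely illustrative numbers: suppose p = 0.9, s = 0.5% (0.005), e = 50,000,000, m = 10,000 per month, c = 8,000 per month and t = 48 months. The expected value of the stock is (1 - 0.9) * 0.005 * 50,000,000 = 25,000, while the salary you give up is (10,000 - 8,000) * 48 = 96,000, so by this yardstick the market-rate salary wins comfortably.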

Simply treat that stock as you would treat any lottery ticket and look at your compensation package accordingly. Don’t let your love of the technology get the better of you and allow you to be used, lest you become another one of those ‘cranky old people’ in the longer term.

You’re probably better off keeping your love of the subject and chasing your own particular variety of rainbow; maybe you’ll come up with something great without burning out on the next advertising or marketing scheme. The world needs more people who love this work and fewer who burn out on crap projects or repeat the same mistakes over and over again. Know your history, what was tried and why it succeeded or failed; this alone will save you years of your life.

Learn about efficiency, learn how that stuff works under the hood, and your value will go up accordingly. Of course, that’s easy for me to say: when I encountered computers the revolution was just about to get underway, so I ended up taking it all in piecemeal, one little step at a time. When you’re thrown in at the deep end in 2015 it will definitely seem overwhelming. But that’s just the surface; it’s all logical, and you can dive down as far and as fast as you want to. Keep the love for the tech alive, ignore the get-rich-quick types and treat programming like you would treat any creative skill, like music, painting, woodworking or making stuff with glass or metal. It’s a craft and an art; as much as people have been trying to make it into an industry, without creativity you can’t make good software.

Please do not become one of those people in tech that are just in it for the money but that actually hate the technology itself.

All Programming Is Bookkeeping

Programmers tend to loathe writing bookkeeping software. Just thinking about doing something so mundane and un-sexy as writing a double-entry bookkeeping system is probably enough to make a whole raft of programmers contemplate switching careers in earnest.

This is interesting because at heart all programming is bookkeeping. If a program works well, that means all the results come out exactly as expected; there is no room at all for even a little bit of uncertainty in the result. So if you’re going to write correct software, the internals of that software will look very much like a bookkeeping program!

Let me give you a concrete example of what I mean to clarify this:

Imagine you’re building a computer system to administer a telephone switch. The switch has a whole bunch of lines attached to it: some lines are ‘trunks’, lines over which you can forward calls to other telephone switches, and some lines connect to subscribers. Each subscriber is represented by a wire pair, and by manipulating the signals on the wires and listening to them you can communicate with the phone on the other side.

Your options for the wires to the subscribers are: detecting the phone going ‘off hook’, detecting the phone going ‘on hook’, detecting a short across the wires that lasts less than 50 ms, and so on. The trunk lines are much the same as the subscriber lines; the only difference is that you have a limited number of them to route calls ‘out’, and instead of being tagged with subscriber ids they are tagged with their long-distance prefixes.

A bunch of hardware will take care of abstracting all this and presents a nice and clean port-mapped interface for your phone switchboard, or maybe it appears as a chunk of memory. If you’re doing all this in the present time then you’re going to be talking some digital protocol to a primary rate interface or some chunk of hardware further upstream using TCP/IP.

No matter how it is implemented though, you’re going to be looking after incoming calls, outgoing calls, subscribers that are available to be called, subscribers that are setting up calls, subscribers with calls in progress and trunks that are engaged, free or used for dialing or in-system messaging.

Whenever a call starts or ends you’ll have to log that call for billing purposes, and so on.

If you don’t approach this the way you would approach a bookkeeping problem, with checks and balances, you’re going to be tearing your hair out (or it’ll go prematurely grey) over questions such as “Why do we have so many dropped calls?” and “How come the billing doesn’t match the subscriber’s recollection?”.

To fix this here is what you can do:

For every interaction in the system you assign a simple counter. Phone goes off-hook: counter. Phone goes on-hook: counter. Call start: counter. Call end: counter. Calls in progress: a number (that’s a harder one, but it should be incremented whenever a call is set up and decremented when a call ends), etc., etc. And then you go and define all the ways in which these numbers relate to each other. At any given point in time the number of calls set up minus the number of calls disconnected should equal the number of calls in progress. The number of calls in progress should always be >= 0. The number of ticks billed to the user should equal the number of ticks the subscriber lines were ‘in call’ on non-local numbers, and that number in turn should equal the number of ticks on the trunks.

You can use this information in several ways:

  • when you’re debugging: when you are searching for a bug, the information in the counters can guide you to the location of a potential problem. The counters and variables have all these nicely defined relations, and if one of the relations does not hold it will stand out like a big red flashing light. This indicates that your understanding of how the system works is not accurate, that the system does not work the way it should, or that there is an alternative path through the code that wasn’t properly instrumented. In any case, something needs to be repaired to get the bookkeeping to match again.

  • when you’re testing: you define a number of scenarios and how you think they should affect the various counters and variables. Then you set up your tests, run them and check the state of the counters and variables after every test has run to make sure you get the desired result. If you’re doing longer test runs you might want to have thousands of concurrent sessions to put the system through its paces and then you check afterwards if the aggregate of the various counters and variables accurately reflects what you did during the test run.

  • to give you confidence that you have indeed fully mastered the system and that there are no undefined states or chunks of un-instrumented code that affect your billing or internal bookkeeping. This will help you get enough sleep at night.

  • this bookkeeping is so useful that I usually keep it around in persistent storage (typically a database table) on a per-minute or per-hour basis. That way I can reach back in time if anything is amiss and get to within an hour of the introduction of a problem. That’s usually close enough to be able to narrow down where in the system the problem is located.

Similar techniques apply to things like memory allocation, process initiation, opening files and so on.

In most languages you can then assert that these conditions hold at any point during the execution of your tests (usually with something called ‘assert’). Such conditions are called ’invariants’: things that should always hold true while your program runs.
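
To make this concrete, here is a minimal sketch in Erlang (the language used elsewhere on this blog) of a handful of such counters plus an invariant check; the counter names are illustrative, not taken from a real switch:

% the counters are kept in a map and threaded through the call handling code

new_counters() ->
    #{calls_started => 0, calls_ended => 0, calls_in_progress => 0}.

bump(Key, Delta, Counters) ->
    maps:put(Key, maps:get(Key, Counters) + Delta, Counters).

call_started(C) ->
    bump(calls_in_progress, 1, bump(calls_started, 1, C)).

call_ended(C) ->
    bump(calls_in_progress, -1, bump(calls_ended, 1, C)).

% the invariants: started - ended must equal in-progress, and in-progress
% may never go negative; the pattern matches act as assertions and the
% process crashes with a badmatch the moment either one is violated

check_invariants(#{calls_started := S, calls_ended := E, calls_in_progress := P} = C) ->
    P = S - E,
    true = P >= 0,
    C.

In a test run you thread the counter map through every simulated event and call check_invariants after each step (or at the end of the run); a crash points you straight at the step where the bookkeeping stopped balancing, and the same counters can be persisted per minute or per hour as described above.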

In the end all programming is bookkeeping, even when at first glance it has nothing to do with bookkeeping.

Please Do Not Be a One Trick Pony

Paul Simon has a song called ‘One Trick Pony’ with a bit in it that goes like this:

He’s a one trick pony
One trick is all that horse can do
he does one trick only
It’s the principal source of his revenue

In the world of programming, being a one-trick pony is not an option. What it means is that when the one technology you’re currently married to becomes obsolete, you’ll instantly be out of a job or out of customers. Sure, it pays to be an expert in ‘x’ while ‘x’ is in fashion. But that is a limited window of time.

Do yourself a favor and recognize that the tech world is moving much faster than it used to, and that having only one blade in your arsenal is a surefire way to eventual obsolescence. It also helps you appreciate other points of view in debates about languages and their relative merits (not as important as you might think) and will make you a much better programmer. The single biggest boost to your programming career will come when you master a second technology as far removed from the one you are currently using as you can stomach.

So if you’re a die-hard Python fan, learn yourself some Clojure, Erlang or maybe even Prolog if you’re feeling brave. If JavaScript is your thing, then move to the server side for a while in a language other than JavaScript. If PHP is your first love, then please do go and learn some Ruby or C#. If you know how to use just one framework and it looks as if you’re going to be doing this forever, then forget about that and build a non-toy project in another framework. If you’ve only done client-server stuff, then try your hand at embedded, learn some assembly (yes, that’s still useful knowledge, even if in a more indirect way than in the past).

Lifetime careers no longer exist. I knew a bunch of people who were charging $200 per hour to do HTML in the early ’90s. FrontPage ate their lunch. Don’t become a dinosaur waiting for the impact of the comet in your safe little life; be adaptive, and start adapting before you really have to.

Oh, and Happy 2015!

Saving a Project and a Company

I often get asked what exactly it is that I do for a living, and usually I can’t really talk about it due to non-disclosure agreements and possible fall-out, but in this case I got permission to do a write-up of my most recent job.

Recently I got called in on a project that had gone off the rails. The system (a relatively low-traffic website with a few simple pages) was built using Angular and Symfony2, used postgres and redis for its back end, and ran CentOS as the operating system for the Unix portion of the system, all on 8 HP blades with 128 GB of RAM each in two blade enclosures, with a very large HP 3par storage array underneath, as well as some exotic hardware-related Windows machines and one special-purpose application server running a Java app. At first glance the hardware should easily have been able to deal with the load: the system had a 1 Gbps uplink and every blade ran multiple (VMware) VMs. After interviewing the (former) CTO I got the impression that he thought the system was working A-ok. Meanwhile users were leaving or screaming blue murder, and management was getting seriously worried about the situation being out of control and possibly beyond repair; an existential threat if there ever was one.

So for the last 5 weeks I’ve been working pretty much day and night on trying to save the project (and indirectly the company) and to get it to a stage where the users are happy again. I don’t mind doing these jobs; they take a lot of energy and they are pretty risky for me financially, but in the end, when - if - you can turn the thing around, it is very satisfying. Usually I have a pretty good idea of what I’m getting into. This time, not so much!

The job ended up being teamwork: it was way too much for a single person, and I’m very fortunate to have found at least one kindred spirit at the company, as well as a network of friends who jumped to my aid at first call. It was quite an amazing experience to see a team of such quality materialize out of thin air and go to work as if they had been working together for years. My ‘plan B’ was an extremely rapid re-write of the system, but fortunately that wasn’t necessary.

It turned out that the people who had originally built the system (a company in a former Eastern Bloc country) had done a reasonably good job when it came to building the major portions of the application. On the systems level there was a lot that was either inefficient or flat-out wrong. The application software was organized more or less OK, but due to time pressure and mismanagement of the relationship and of the project as a whole, the builders had gotten more and more sloppy towards the end of the project in an attempt to deliver something that clearly wasn’t ready by the agreed-upon deadline. I don’t blame them for this; they were placed in a nearly impossible situation by their customer. The (Dutch, not associated with the company that built the whole thing) project manager was dictating what hardware they were going to have to run the project on (I have a suspicion why, but that’s another story) and how it was all set up, and so the whole thing had to become the justification for all the expenses made up front. He also pushed them to release the project at a stage where they were pretty much begging him (in writing) for more time and had made it abundantly clear that they felt the system was not really ready for production. That letter ended up in a drawer somewhere. A single clueless person in a position of trust with non-technical management, an outsourced project and a huge budget: what could possibly go wrong… And so the big red button was pushed, the system was deployed to production, and from there it went downhill very rapidly. By the time I got called in the situation had become very serious indeed.

Here is an enumeration of some of the major things that we found:

When I first started working on the project the system comprised a whopping 130 VMs, each of them seriously restricted in terms of CPU and memory (typically 3 GB RAM, 2 or 4 cores). I wish I were joking here, but I’m not: for every silly system function there was a VM, a backup VM and another two make-believe ones running in another DC (they weren’t; that second blade enclosure was sitting one rack over). Now that it’s all done, the whole thing is running comfortably on a single server. Yes, that puts all your eggs in one basket. But that single server has an MTBF that is a (large) multiple of that of the system the way it was set up before, and it does not suffer from all the communications overhead and possible sources of trouble that are part and parcel of distributed systems. Virtualization, when used properly, is a very powerful tool. But you can also use it to burn up just about any CPU and memory budget without getting much performance (or even reliability) in return. Don’t forget that if you assign a VM to just about every process you are denying the guest OS the ability to prioritize and schedule; you’re relying entirely on the VM architecture (and hence on yourself) to divide resources fairly, in a mostly static fashion, and that setup doesn’t have the in-depth knowledge the guest OS has about the multiple processes it is scheduling. Never mind the fact that each and every one of those VMs has to be maintained, kept in sync, tuned and secured. The overhead of keeping a VM spun up is roughly equivalent to keeping a physical server alive. So use virtualization, but use it with care, not with abandon, and be aware of where and how virtualization will affect your budgets ($, cycles, memory). Use it for the benefits it provides (high availability, isolation, snapshotting). Over time we got rid of most of the VMs; we’re left with a handful now, carefully selected with regard to functional requirements. This significantly reduced total system complexity and potential attack surface, and made a lot of the problems they were experiencing tractable and eventually solvable.

Application-level documentation was non-existent. For the first few days we were basically charting the system as we worked our way through it, to figure out what we had, which bits went where and how they communicated. Having some usable application-level documentation would have been a real time saver here, but as usual with jobs like these, documentation is the thing everybody hates to do and pushes as far into the future as possible; it’s usually seen as some kind of punishment to have to write it. What I wouldn’t have given on day #1 for a nice two-level drawing and description of how the whole thing was supposed to work.

The system was scaled out prematurely. The traffic levels were incredibly low for a system this size (fewer than 10K visitors daily) and still it wouldn’t perform. First you scale ‘up’ as far as you can, then you scale ‘out’. You don’t start with scaling out, especially not if you have fewer than 100K visitors per day; at that level of traffic a well-tuned server (possibly a slightly larger one) is what you need. Maybe at some point you’d have to off-load a chunk of the traffic to another server (static images, for instance, or the database). And if you can no longer run your site comfortably on a machine with 64 cores and 256 GB of RAM or so (the largest still-affordable server that you can get quickly today), then the time has come to scale out. But you want to push that point out as far as you can, because the overhead and associated complexity of a clustered system compared to a single server will slow down your development, make it much harder to debug and troubleshoot, and in general eat up your most precious resource (the time the people on your project have) in a hurry. So keep an eye out for the point where you are going to have to scale out and try to plan for it, but don’t do it before you have to. The huge cluster outlined above should have been able to support more than one million users per day for the application the company runs, and yet it did not even manage to support 10K. You can blow your way through any budget (dollars, memory, cycles and storage) if you’re careless.

There was no caching of any kind, anywhere. No opcode cache for the PHP code in Symfony, no partials caching, no full-page cache for those pages that are the same for all non-logged-in users, no front-end caching using Varnish or something to that effect, no caching of frequently repeated database queries and no caching of their results either. All this adds up to tremendous overhead. Caching is such a no-brainer that it should be the first action you take once the system is more or less feature complete and you feel you need more performance; the return is huge for the amount of time and effort invested. Adding some basic caching reduced the CPU requirements of the system considerably at a slight expense in memory. We found a weird combination of PHP version and opcode cache on the servers: PHP 5.5 with a disabled xcache. This is strange for several reasons; PHP 5.5 provides its own opcode cache, but that one had not been compiled in. After enabling xcache the system load jumped up considerably without much of an explanation to go with that (it should have gone down!). Finally, after gaining a few gray hairs and following a suggestion from one of the people I worked with, we threw out xcache and recompiled PHP to enable OPcache support, and then everything was fine. One more substantial jump in performance.

Sessions were stored on disk. The HP 3par is a formidable platform when used appropriately, but in the end it is still rotating media and there is a cost associated with that. Having a critical high-update resource on the other side of a wire doesn’t help, so we moved the sessions to RAM. Eventually these will probably be migrated to redis so they survive a reboot. Moving the sessions to RAM significantly reduced the time required per request.

The VMs were all ‘stock’; beyond some basic settings (notably max open files) they weren’t tuned at all for their workload. The database VM, for instance, had all of 3 GB of RAM and was run with default settings plus some half-assed replication thrown in. It was slow as molasses. Moving the DB to the larger VM (with far more RAM) and tuning it to match the workload significantly improved performance.

The database didn’t have indices beyond primary keys. It’s hard to believe that in this day and age there are people who call themselves DBAs who will let a situation like that persist for more than a few minutes, but apparently it’s true. Tons of queries were dutifully logged by postgres as being too slow (more than 1000 ms per query), typically because they were scanning tables with a few hundred thousand or even millions of records. Adding indices and paring down the databases to what was actually needed (one neat little table contained a record of every request ever made to the server along with the associated response…) again made the system much faster than it had been up to that point.

The system experienced heavy load spikes once every hour, and at odd moments during the night. These spikes would take the system from its then average load of 1.5 to 2 or so all the way to 100 and beyond, and caused the system to become unresponsive. This took some time to track down; eventually we found two main causes (with the aid of the man who had configured the VMware and storage subsystem, to rule out any malfunctioning there). The first is that the Linux kernel elevator sort and the 3par get into a tug of war over who knows best what the drive geometry looks like; setting the queue scheduler to ‘noop’ got rid of part of the problem, but the hourly load spike remained. It turned out that postgres has an ‘auto vacuum’ setting which, when enabled, causes the database to go on an introspective tour every hour, and this was the cause of the enormous periodic loads. Disabling auto vacuum and running it once nightly, when the system is very quiet anyway, solved that problem.

The system was logging copious information to disk: on every request, large amounts of data would be written to log files that grew extremely large. In a bit of a hurry, the builders had made the root of the web tree world-writeable, and the log files were stored there, for easy access by the general public, wanna-be DDoSers, hackers and competitors. So disabling these log files (they were there for debugging purposes) killed two birds with one stone: it significantly reduced the amount of data written to disk, for a performance gain, and it closed a major security hole.

What didn’t help either is that the hardware - in spite of being pretty pricey - broke down during all this. Kudos to VMware: I can confirm that their high-availability solution can save your ass in situations like that, but still, it’s pretty annoying to have to deal with hardware failures on top of all the software issues. One of the blades failed and was fixed (faulty memory module), and then a few weeks later another blade failed (cause still unknown). Highly annoying, and for hardware this expensive I’d expect better. It is probably nothing but bad luck.

Besides all of the above there were numerous smaller fixes. The load is now at 0.6 or so when the system is busy. That’s still too high for my taste and I’m sure it can be improved upon, but it is more than fast enough now to keep the users happy, and spending more time to make it faster still would be premature optimization. We also fixed a ton of other issues that had a direct impact on the user experience (front-end stuff in Angular, and some back-end PHP code), but since I’ve been mostly concentrating on system-level stuff, that’s what this post is about. The company is on a very long road to recovering lost business now, and it will take them a while to get there. But the arterial bleeding has been stopped, and they’re doing an OK job of managing the project now, with an almost entirely new local team working in concert with the original supplier of the bespoke system. The emphasis will now be on testing methodology and incremental improvements, evolutionary rather than revolutionary, and I’m sure they’ll do fine.

Instrumental in all this work was a system that we set up very early in the project to track the interaction between users and the system in a fine-grained manner using a large number of counters. This allowed us to do detailed analysis of the system under load in a non-intrusive way, and it also gave us a quick way to gauge the effect of a change (better or worse). In a sense this became a mini bookkeeping system that tracked the interactions in such a way that if it all worked the way it should, this would be reflected in certain relationships between the counters. A mismatch indicated either a lack of understanding of how the system worked or pointed to a bug (usually the latter…). Fixing the holes then incrementally improved the bookkeeping until, for most of these counters, we hit a margin of error small enough that we were confident the system was working as intended. A few hold-outs remain, but these have no easy fixes and will take longer to squash.

Lessons learned for the management of this particular company:

  • Trust but verify

It’s OK to trust people you hand a chunk of your budget and responsibilities to, but verify their credentials before you trust them, and make sure they stay on track over time by checking up on them regularly. Demand to be shown what is being produced. Do not become isolated from your suppliers, especially not when working with freelancers. And if you’re not able to do the verification yourself, get another party on board to do it for you, one that is not part of the execution. Do not put the executive and control functions in the same hands.

  • Don’t be sold a bill of goods

Know what you need. If you need an ‘enterprise level’ solution, make sure your business is really ‘enterprise level’. It sounds great to have a million dollars’ worth of hardware and software licenses, but if you don’t actually need them it’s just money wasted.

  • Know your business

Every business has a number of key parameters that determine the health of that business. Know yours, codify them and make sure that every system you run in-house ties into them in real time, or as close to real time as you can afford (once per minute is fine). Monitor those KPIs, and if they are off, act immediately.

  • Be prepared to roll back

If the new system you accept on Monday causes a huge drop in turnover by Tuesday, roll back, analyze the problem and fix it before you try to deploy again. A roll-back is a no-brainer: if the drop is on the order of a few percent it may be worth letting the system continue to run, but if you are operating at any serious level of turnover a roll-back is probably the most cost-efficient solution. A drop in turnover may be explained by some other cause, but usually it is simply indicative of one or more problems with the new release.

  • Work incrementally, release frequently

Try to stay away from big-bang releases as if your company depends on it (it does). Releasing bit by bit while monitoring those KPIs like a hawk is what will save your business from disaster; it will also make it much easier to troubleshoot any problems because the search space is so much smaller.

This was a pretty heavy job physically; there were days when I got home at 4:30 am and was back in the traces at 10:30 the next day. That probably doesn’t sound too crazy until you realize I live 200 km away from where the work was done; part of the time I stayed in a holiday resort nearby just to save the time and fuel wasted on traveling. I’ve been mostly sleeping for the last couple of days, and recovery will take a while after this one; I’m definitely no longer in my 20s, when working this hard came easy. Even so, I am happy they contacted me to get their problems resolved, and I’m proud to have worked with such an amazing team thrown together in such a short time.

Thanks guys! And a pre-emptive Happy and Healthy 2015 to everybody reading this.