Esoteric Curio
The Internet is not without a sense of irony:
The website is up.
In case they fix it: thewebsiteisup.com just spewed PHP errors at the time I wrote this. (Update: they fixed it.)
With the release of ZFS on Solaris 10, I sat down and marveled at the opportunities for off-site backups. I have already written a bit about ZFS detailing why I think it kicks so much ass. With zfs send and zfs receive, one can manage block-level incremental backups and restores. What's missing? An elegant hack leveraging that to provide a simple and reliable backup infrastructure for a network of ZFS-capable machines (including Mac OS X and FreeBSD now, BTW).
So, I sat down and wrote Zetaback -- which is currently 1032 lines of Perl code (including complete documentation) plus a thin agent on remote machines that is 290 lines of Perl code (including complete documentation). I'd like to note that the only reason there is documentation, let alone complete documentation, is because of Eric Sproul. This really demonstrates to me that "Keep It Simple Stupid" still works for important tasks.
Zetaback is a rather full-featured backup and restore system. It can manage multiple hosts, multiple ZFS filesystems per host, and both frequency and retention policies on full and incremental backups. It can report policy violators (things that haven't been backed up within the policy). It can manage the archiving of backups. It provides both non-interactive and interactive restores. It has an excellent command line syntax. And most importantly, it has saved my ass more times than I can count.
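To make the zfs send / zfs receive idea concrete, here is a minimal sketch of how an incremental stream between two snapshots could be piped to another machine. It only builds and prints the command line; it is not Zetaback's code, and the dataset, snapshot, and host names (tank/home, monday, tuesday, backuphost, backup/home) are made up for illustration.

```python
import shlex


def incremental_send_commands(dataset, prev_snap, new_snap, backup_host, target_dataset):
    """Build the two halves of an incremental `zfs send | zfs receive` pipeline.

    `zfs send -i A B` emits only the blocks that changed between snapshots
    A and B; `zfs receive` applies that stream to a dataset on the far side.
    All names passed in here are hypothetical examples.
    """
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    receive = ["ssh", backup_host, "zfs", "receive", target_dataset]
    return send, receive


def as_shell_pipeline(send, receive):
    """Render the two argument lists as a single shell pipeline string."""
    return " | ".join(" ".join(shlex.quote(arg) for arg in cmd) for cmd in (send, receive))


if __name__ == "__main__":
    send, receive = incremental_send_commands(
        "tank/home", "monday", "tuesday", "backuphost", "backup/home"
    )
    print(as_shell_pipeline(send, receive))
```

Everything a real backup system adds (scheduling, retention, deciding when to take a full instead of an incremental) sits on top of this one pipeline.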
I'm not usually big on awards... I find the single unexpected email from someone saying: "damn that was useful, thanks!" to be more gratifying most of the time. However, Zetaback was one of the first projects we put up on labs, so being a 3rd place winner in the OpenSolaris Community Innovation Awards is pretty exciting.
So, you have an app. You can't change the code. Now this isn't the common case when I try to scale things. I usually roll up my sleeves and ignore application stack boundaries. This is a unique case where, for political reasons, I can't touch the app. So... the app was a tiny little site, then it got popular on Facebook and CollegeHumor. Instead of pushing 5-10 megabits, it was falling apart at around 105 megabits due to resource saturation (one box wasn't enough) and ended up needing to push 200 megabits.
200 megabits isn't all that much traffic anymore, but when the application wasn't written to scale horizontally, you are at the mercy of its raw performance and must scale vertically. If the application hasn't had a lot of focus on profiling and performance tuning, it means you are going to hit that extremely painful price point of vertical scaling. In this case, the architecture went live with an expectation of a 20Mbit/s peak and BOOM. Because it needed to be fixed quickly, purchasing new hardware was a problem for scheduling reasons more than financial ones. We have plenty of similar hardware available, just nothing with twice the RAM and twice the cores and twice the disks.
The reason this app couldn't scale horizontally is that it used not only a shared DB (which is very very common) but also the filesystem, and thus needed a shared filesystem. So, how do you fix that without modifying the app? You study the app and look for patterns of use that can be exploited.
First we looked at the database. In this case, it was not being pushed very hard. We could easily handle a tenfold increase in traffic without exhausting database resources... That was a relief, because scaling a database "behind the scenes" without any application access can be more than a few hours' exercise. Next we found that the app itself (PHP) was taxing memory, CPUs and disk I/O pretty heavily. The most important were memory and CPU, but disk I/O was a close second. This meant that if we just installed the app on another machine and NFS exported the first machine's mounts, it would "work" but not achieve our performance requirements because of I/O saturation. Quick testing in this arena showed about a 15% increase in capacity -- just not enough.
So, this app needs a shared FS. Why? Well, the user uploads assets, and then through the life of their session, the app serves them back to that user. EASY, session sticky load balancing (by source IP or by introduced cookie on the load balancer). Because of the nature of this app, session sticky load balancing produced extremely inequitable load distribution and we would have had to bump up to three servers. Not ideal, but acceptable -- this is triage. One step forward, flat on our face: it appears that under certain circumstances, the images I upload are served to another user, whose session may be stuck to a different server.
So, basically, all I need is to glue the static assets (uploaded by users) together under a common URL (and push 200 Mbit/s or so). Some assets are on one server, some on another, and I have no way of knowing which server owns the asset without looking in the FS... or asking over HTTP and getting a 404 back.
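For readers who haven't set up sticky balancing by source IP, the idea looks roughly like the sketch below. It is an illustration in Python, not the load balancer's actual algorithm: the hashing scheme, the backend names, and the per-session bandwidth figures are all assumptions. It also shows why the load can come out lopsided: a session stays where it lands, no matter how heavy it is.

```python
import hashlib


def pick_backend(client_ip, backends):
    """Map a client IP to one backend, always the same one for the same IP."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def bandwidth_per_backend(sessions, backends):
    """Sum the (hypothetical) Mbit/s of each session onto its sticky backend."""
    load = {backend: 0 for backend in backends}
    for client_ip, mbits in sessions:
        load[pick_backend(client_ip, backends)] += mbits
    return load


if __name__ == "__main__":
    backends = ["app1", "app2"]
    # Made-up sessions: a few heavy users dominate the total traffic.
    sessions = [("192.0.2.10", 40), ("192.0.2.11", 2), ("192.0.2.12", 35), ("192.0.2.13", 1)]
    print(bandwidth_per_backend(sessions, backends))
```

Stickiness guarantees that an uploader finds their own files again; it does nothing for a second user who lands on the other server and asks for the same asset.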
I just happen to have a Varnish instance to provide content acceleration for other bits of infrastructure. And Varnish has (as its major selling point, IMO) the VCL language that allows me to script how it handles requests and satisfies them.
If I get a request, I want to try server one; if I get a 404, I'd like to retry the request against server two. I want it fast, efficient, and it'd be great to cache it. If it isn't fast and efficient, I've simply moved my problem instead of addressing it. With two servers this works well, because serving a single 404 from server one is cheap. As the number of servers goes up, this solution completely falls apart, because a chain of 404s stops being cheap. Remember, triage.
# The two application servers; each holds only the assets uploaded to it.
backend obscuredserver1 {
    .host = "10.225.209.89";
    .port = "80";
}
backend obscuredserver2 {
    .host = "10.225.209.90";
    .port = "80";
}

sub vcl_recv {
    if (req.http.host ~ "^fqdn\.of\.caching\.server$") {
        # First attempt goes to server one; after a restart, try server two.
        if (req.restarts == 0) {
            set req.backend = obscuredserver1;
        } else {
            set req.backend = obscuredserver2;
        }
    }
    if (req.request != "GET" && req.request != "HEAD") {
        pipe;
    }
    lookup;
}

sub vcl_fetch {
    # A 404 from the first backend triggers a single retry via restart.
    if (req.http.host ~ "^fqdn\.of\.caching\.server$" &&
        req.restarts == 0 && obj.status == 404) {
        restart;
    }
    if (!obj.cacheable) {
        pass;
    }
    if (obj.http.Set-Cookie) {
        pass;
    }
    set obj.prefetch = -30s;
    deliver;
}
Now, this is an excerpt; my varnishes here have some other logic for other services that I can't share... However, they are rather lightly used. That particular instance went from serving an average of 6 Mbits/second to peaking at 200 Mbits/second. And the system load jumped from 0.01 to 0.06. It's nice when a triage exercise results in a quick hack that doesn't bust at the seams -- we've got plenty of headroom.
I in no way consider this successful scaling; I consider it successful triage by creative engineering (a.k.a. a hack). And for those that like pretty pictures, these demonstrate that when you encounter capacity issues, it isn't always pretty and graceful. Queueing theory is complicated and sometimes results in everyone getting screwed. Here's a visualization of queueing theory making trouble.
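If VCL isn't your thing, the request flow can be modeled in a few lines of Python. This is only a sketch of the idea (try the first server, fall through to the next on a 404, cache what comes back), not a description of Varnish internals. The in-memory "backends", the helper names, and the choice to cache only 200 responses are assumptions made for the example.

```python
def make_backend(files):
    """Return a fake backend: a function mapping a path to (status, body)."""
    def fetch(path):
        if path in files:
            return 200, files[path]
        return 404, b""
    return fetch


def fetch_with_fallback(path, backends, cache):
    """Try each backend in order, moving on only when the answer is a 404."""
    if path in cache:
        return cache[path]
    status, body = 404, b""
    for fetch in backends:
        status, body = fetch(path)
        if status != 404:
            break
    # Assumption for this sketch: only successful responses are cached.
    if status == 200:
        cache[path] = (status, body)
    return status, body


if __name__ == "__main__":
    server1 = make_backend({"/uploads/a.png": b"A"})
    server2 = make_backend({"/uploads/b.png": b"B"})
    cache = {}
    print(fetch_with_fallback("/uploads/b.png", [server1, server2], cache))
    print(fetch_with_fallback("/uploads/nope.png", [server1, server2], cache))
```

Every miss costs one extra round trip per server ahead of the owner, which is why this holds up for two servers and gets worse with each one added.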


A long time ago, I wrote integration into the portable version of OpenSSH to allow direct authentication against an RSA ACE (SecurID) server. I've received many thanks over time for the work and I'm aware that it is used at some (very large) organizations. However, as with most security related things, people tend not to talk about what they do. As it is open source and no registration is required to download the patch, I think I might have underestimated the deployments.
Quite some time ago, Jim Matthews over at NASA took over maintenance of the patch. This sort of seamless transition of ownership is why I really love open source. Jim does a great job.
Since that patch's inception, it has been hosted on my old static projects page. That meant that James had to send me a copy to post every time a new version of the patch came out. How 1998. Anyway, since we went through all the effort of setting up open source hosting, how about I use it! The OpenSSH+SecurID integration effort has moved to labs! Get your one-time-password, two-factor security while it's hot!

