Skip to: categories | main content
Entries from December 2006
About meOne of the things that ZFS boast most is its scalability -- Z is for zetabyte after all. Trivia question: what is the first thing you do after you put data on your production ZFS volume? That's right, you back it up to your backup infrastructure. A lot of systems use tar or other archive like derivatives to manage backups. This technique is particularly awful with databases. Databases usually consist of very very large files (multi-gigabyte) that have minimal changes to them. With full archive systems, any attempt at incremental backups results in horrible space and time inefficiency as a small (8192 byte) change in a datafile will necessitate the whole file to be backed up in the next incremental.
Enter block-level incremental (BLI) backups. The idea here is that you ask your filesystem to track which blocks change from a certain moment in time. And you can ask the filesystem for all blocks of a filesystem (view consistent, of course) and then later ask it for the changeset. In other words its like doing:
Filesystems have supported this type of behavior for a while now (Veritas VxFS has a magnificent implementation). Needless to say I was ecstatic when I read the zfs manpage and learned of the 'zfs send' and 'zfs recv' operation. Functionally, they implement BLIs.
We have a database on which we have around 1TB of information on zfs. So, I figured we'd whip together a script to tie in zfs send (including incremental support) to our Veritas NetBackup infrastructure.
We have three mount points that we need to snap and send to NetBackup, so I create three FIFOs on disk and fork off three parallel 'zfs send' operations. Then I fork off three parallel netbackup jobs (one for each FIFO). We have three tape heads so, they all actually run in parallel and should fly like the wind (all over GigE).
# date; ./zbackup.sh -s -l 2006121402; date Thu Dec 14 12:58:43 EST 2006 ./zbackup.sh: backuplabel: 2006121402 full zfs destroy intmirror/xlogs@lastfull zfs destroy xsr_slow_1/pgdata@lastfull zfs destroy xsr_slow_2/pgdata@lastfull Backing up as '2006121402' starting postgres backup on label 2006121402 zfs snapshot intmirror/xlogs@lastfull zfs snapshot xsr_slow_1/pgdata@lastfull zfs snapshot xsr_slow_2/pgdata@lastfull stopping postgres backup on label 2006121402 /sbin/zfs send intmirror/xlogs@lastfull >> /pgods/scratch/intmirror:xlogs.lastfull.full & /sbin/zfs send xsr_slow_1/pgdata@lastfull >> /pgods/scratch/xsr_slow_1:pgdata.lastfull.full & /sbin/zfs send xsr_slow_2/pgdata@lastfull >> /pgods/scratch/xsr_slow_2:pgdata.lastfull.full & Sat Dec 23 15:39:47 EST 2006
SWEET JESUS! That's a 9 day, 2 hour, 41 minutes and 4 second backup. Somehow I think that doing daily incremental backups is out of the question. I tried zfs send redirected to /dev/null (just to demonstrate that netbackup was not the bottleneck) and, as expected, there was no noticeable speedup. I've tested this on some other machines and got the send operation to run quite fast. However, any time a very competitive I/O load is added, it just suffers miserably and becomes so slow that it is useless.
Reading the source code to the ZFS layer leads me to believe that all the operations for doing the send are scheduled serially (each after the previous completes) and compete equally for system I/O with all other processes. I saw no intuitive way to make the ioctl()s with ZFS act as if they were more important that other things going on in the system. This leads me to believe that it may not be so easy to fix. However, those Sun engineers have wicked tricks up their sleeves and tend to pull of some amazing feats. So, here's hoping!
Until then, I hereby suggest that the 'zfs send' be renamed 'zfs trickle'.
I've been working with Robert on benchmarking some search solutions. He's done quite a bit of work loading and transforming the citizendium data into Postgres and testing tsearch2. Now we're supposed to run all this stuff head to head against Lucene to see how PostgreSQL can hold up against Lucene.
Yes, Lucene is specifically designed for search, but there are many advantages to using something like PostgreSQL is it performs on par. The details of the search can be described more articulately in SQL than in a search grammar. Additionally, it would allow us to later join the search results against "other" data for the purposes of simple intersection as well as altering the relevance based on some piece of data known outside of Lucene. Of course, one could simply go reindex the dataset with the new data, but using SQL one can easily just alter the SQL expression to achieve the desired results. So, Robert tested tsearch2 in postgres and came up with some pretty reasonable results on the 3.5 million document dataset -- a sustained 350 qps with no tuning. I figure we could easily up that by 50% with some elbow grease.
There are requirements: (1) data can be added and updated in the index quickly enough and (2) it can scale to 10000 qps. So, the postgres solution seems reasonable so far, assuming we can bump 350 up to 525 with some effort, we're talking about loading this over a cluster of (10000/525)/0.70 ~ 27 machines. Why 0.70? -- never run a production system at over 70% capacity regularly.
So, while I am not a big fan of Java, I can certainly code it. I whipped out Lucene and started to index the data that Rob stuck in the postgres instance:
The net result of the indexing process was indexing a two long integers, a date, a static text token and two english paragraphs (tokenized) as well as a varying set of "tag" data (untokenized, individual words).
3741453ms indexing 3639937 documents.
That's about 970 documents/second. That will certainly meet the requirements for new indexing.
A simple test of search terms limited to documents in the 30 days varies widely with concurrency, like:
+(description:honda) +(description:atv) +(description:aftermarket) \ +(description:oem) +(description:handlebars) \ +(created_on:[20061121 TO 20070101])
So, the first lesson learned is that Lucene is quite sensitive to concurrency.
| concurrency | qps |
|---|---|
| 1 | 69 |
| 10 | 306 |
| 20 | 273 |
| 50 | 185 |
| 100 | abort/segv/java explosion |
300 queries per second isn't abysmal, but certainly not what I was expecting.
Now we issue queries like:
+(description:kenwood) +(description:area) +(description:in) \ +(description:bethesda) +(description:maryland)
1257 queries per second. Not too shabby.
Assuming that we have a mix of date range search, full text searches and tag searches, we can guesstimate around 800 queries per second. (10000/800)/0.70 comes to about 18 servers. Not so bad.
I have to say I'm a little surprised I didn't see faster numbers here. PostgreSQL has a lot of overhead (maintaining MVCC and all the other ACID guarantees) and chimes in with a very very respectable 350 qps.
A while ago, I posted looking for an aspiring developer and somehow the blogosphere delivered. Here's my second attempt.
OmniTI (currently) manages hundreds of machines running pretty much every flavor of operating system imaginable (almost zero Windows). We manage systems, routers, switches, load balancers, storage area networks -- basically, every operational component of today's large architectures. If you've ever seen me speak, I think one of the things that entertains people the most is the amusing anecdotes. With big systems processing critical business transactions, when @#$% hits the fan, the faint of heart end up with a moist seat.
Currently, we're looking for a junior SA to learn what it is to manage large systems. And like my previous post, here are some rules of engagement:
"Is this job right for me?" you might ask...
If you're interested in playing with big toys in an agile 24x7 environment, then feel free to apply. Send your resume to jobs at omniti dot com.

