speeding up postgres online backup compression

Recently I stumbled over this blog entry where the benefits of xargs -P are outlined. In case you don’t know about -P yet, it allows you to specify the number of processes xargs runs in parallel. So together with the -n switch you can run an arbitrary number of parallel jobs for your task without having to care about job control.
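As a minimal illustration (the file names here are made up), this compresses four files with at most two bzip2 processes running at a time:

printf '%s\n' sales.csv users.csv orders.csv logs.csv | xargs -r -n 1 -P 2 bzip2 -9

-n 1 hands every bzip2 a single file, -P 2 keeps two of them running until the list is exhausted.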

For quite some time I have been watching the online backup of our big PostgreSQL data warehouse being compressed for 4 days, so when I read this it instantly came to my mind that we could drastically reduce the time it takes to bzip the backup file. The changes to our backup script were easy: instead of tar’ing everything into one big file, I piped the output to split and cut it into nice 512 MB chunks.

tar c --ignore-failed-read --numeric-owner --exclude 'lost+found' --exclude 'pg_xlog' -f - /mnt/myPGDATA | split -b 536870912 -d - myPGDATAbackup.tar.

This command will create myPGDATAbackup.tar.00, myPGDATAbackup.tar.01, myPGDATAbackup.tar.02, etc. files in your current working directory, each at most 512 MB in size.
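One caveat worth adding (not part of the original command): with 512 MB chunks a large data warehouse easily exceeds 100 pieces, and depending on your split version the default two-digit numeric suffix is either widened automatically or simply runs out. To keep the names uniform you can set the suffix length yourself, e.g. -a 4 for up to 10000 chunks (about 5 TB at 512 MB each):

tar c --ignore-failed-read --numeric-owner --exclude 'lost+found' --exclude 'pg_xlog' -f - /mnt/myPGDATA | split -b 536870912 -a 4 -d - myPGDATAbackup.tar.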

Afterwards you just run xargs on these files, with CoreCount being the number of cores you want compressing files in parallel:

ls -1 myPGDATAbackup.tar.* | xargs -r -n 1 -P $CoreCount bzip2 -9
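Each bzip2 replaces its chunk with a .bz2 file. For completeness, restoring is just the reverse; a quick sketch, with the restore target directory being an assumption:

cat myPGDATAbackup.tar.*.bz2 | bunzip2 | tar xf - -C /mnt/restore

bzip2 decompresses concatenated streams, so piping the chunks through bunzip2 in glob order yields the original tar stream again.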

You can assign as many cores as you want to your backup job (use CoreCount=0 to spawn as many processes as there are files), however you should be careful not to overwhelm your I/O backend. After increasing the memory to 512 MB * $CoreCount we were able to hold all backup files currently being compressed in the buffer cache, so the cores do not have to wait for the I/O subsystem to catch up with read requests.
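A possible way to pick CoreCount automatically and sanity-check the memory math (the nproc call is an assumption, not taken from our actual script):

CoreCount=$(nproc)   # number of available cores, e.g. 8
# with 512 MB chunks, holding one chunk per bzip2 process in the page cache
# needs roughly 512 MB * CoreCount, e.g. 8 cores -> about 4 GB of RAM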

The performance improvement is amazing. Our online backup compression time went from 49 hours to 5.5 hours! Now we should speed up the data gathering via tar, but that is another story.


1 comment so far

  1. Gonzalo Gil

    I tried it, but I don’t have as much backup space as the database size…

    so, using the idea, I implemented this solution:

    find $PGDATA -type f -not -path "$PGDATA/lost+found/*" -not -path "$PGDATA/pg_xlog/*" -not -path "$PGDATA/postmaster.pid" | xargs -r -n 2 -P $CoreCount bzip2 -9 -k

    so I parallelize the bzip2 and use less disk

    then I do:

    find $PGDATA -name "*.bz2" | xargs tar cf "$DATE"_backup_full.tar --remove-files
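    Putting the two steps together as one sketch (variable defaults here are assumptions; tar’s -T - reads the file list from stdin, so a long list is not split across several tar invocations that would overwrite each other’s archive):

    CoreCount=$(nproc)    # assumption: one bzip2 per core
    DATE=$(date +%Y%m%d)  # assumption: date-stamped archive name
    # compress each data file in place, keeping the originals (-k) so the cluster stays intact
    find "$PGDATA" -type f -not -path "$PGDATA/lost+found/*" -not -path "$PGDATA/pg_xlog/*" -not -path "$PGDATA/postmaster.pid" | xargs -r -n 2 -P "$CoreCount" bzip2 -9 -k
    # collect the compressed copies into one archive and delete them as they are added
    find "$PGDATA" -name '*.bz2' | tar cf "${DATE}_backup_full.tar" --remove-files -T -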

