Slow shared memory allocation

Recently we were deploying a Sybase ASE 12.5.4 64-bit on CentOS 5.3 with kernel 2.6.18-164.11.1.el5 and 48GB of RAM available. The ASE was configured to use 32GB of shared memory with ‘lock shared memory = 1’ and ‘allocate max shared memory = 1’. The database was running smoothly until we created a ramfs filesystem as a storage place for our tempdb device files. We created four empty files with dd for log and data, with a total size of 10GB. The creation of the devices from within the ASE went without problems, and the targeted performance boost was measurable.
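For illustration, the setup looked roughly like this; the mount point, file names and individual sizes are made up for this sketch, only the 10GB total is from our setup:

# mount a ramfs for the tempdb devices
mkdir -p /ramdisk
mount -t ramfs ramfs /ramdisk

# create empty device files for tempdb data and log (10GB in total)
dd if=/dev/zero of=/ramdisk/tempdb_data1.dat bs=1M count=3072
dd if=/dev/zero of=/ramdisk/tempdb_data2.dat bs=1M count=3072
dd if=/dev/zero of=/ramdisk/tempdb_data3.dat bs=1M count=2048
dd if=/dev/zero of=/ramdisk/tempdb_log.dat bs=1M count=2048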
However, the next time we restarted the database it did not come up properly but hung with the following log messages:

00:00000:00000:2010/04/17 10:19:58.70 kernel  Using config area from primary master device.
00:00000:00000:2010/04/17 10:19:58.72 kernel  Detected 16 physical CPU's
00:00000:00000:2010/04/17 10:19:58.73 kernel  Locking shared memory into physical memory.

The system started to behave very weirdly: a ‘ps xa’ was hanging, and an ‘ls /proc’ became unresponsive as well. We investigated the issue further with strace and found out that the ASE was hanging in its mlock(2) call:

6128  shmget(0x1107fde5, 34359738368, IPC_CREAT|IPC_EXCL|0600) = 98307
6128  shmat(98307, 0, 0)                = ?
6128  mlock(0x2aaaabddd000, 34359738368
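For reference, a trace like the one above can be captured by running the dataserver under strace; something along these lines, where the output file and the path of the RUN server script are examples:

strace -f -o /tmp/ase_boot.trace /opt/sybase/ASE-12_5/install/RUN_SYBASE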

After several rounds of config switch flipping we were able to track the problem down to the ramdisk we were using. Since the ramdisk is empty after every reboot, we must create the Sybase tempdb device files prior to the start of the ASE; if we do not, the database is not able to finish its startup recovery and does not boot up properly. However, if we do not create the device files, the database gets past its critical mlock call and starts up normally within seconds (up to the point where it tries to read the tempdb device files..). This behavior cropped up regardless of whether we used ramfs or tmpfs for the ramdisk. To add more confusion, the behavior was only reproducible with shared memory allocations > 16GB.

The big surprise came after we left the database in this state in the evening and came back to work the next day, only to discover that the database was up and running happily. Indeed, the mlock call had not been hanging forever; it had simply taken two hours to finish before the startup of the ASE continued. At this point we involved the Sybase technical support, but they weren’t very helpful, as this problem seemed to be new to them as well. Several rounds of strace runs and Q&A mails went around without any result.

Eventually we were able to work around the issue by starting the Sybase ASE without the tempdb device files created and putting the database into the background immediately. We executed a “sleep 4” on the shell and created the device files just in time for Sybase to start recovery on them. Obviously this is a very crappy workaround and the timing depends on a number of factors, but it bought us some time to find the real issue, and we were able to hand the database over to the application developers for testing.
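In script form, the workaround looked roughly like this; the path of the RUN server script is a placeholder, and the device files are the ones from the sketch above:

#!/bin/sh
# Start the ASE in the background while the tempdb device files are still
# missing, so that the critical mlock() call finishes quickly.
/opt/sybase/ASE-12_5/install/RUN_SYBASE &

# Give the dataserver just enough time to get past the mlock() call ...
sleep 4

# ... and create the device files just before recovery tries to open them.
dd if=/dev/zero of=/ramdisk/tempdb_data1.dat bs=1M count=3072
dd if=/dev/zero of=/ramdisk/tempdb_data2.dat bs=1M count=3072
dd if=/dev/zero of=/ramdisk/tempdb_data3.dat bs=1M count=2048
dd if=/dev/zero of=/ramdisk/tempdb_log.dat bs=1M count=2048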

I asked a friend of mine to write me a small program which does nothing else than allocate an arbitrary amount of shared memory and release it afterwards. And indeed, the problem was fully reproducible with this program! The memory allocation was dog slow when grabbing the memory with mlock, so Sybase itself was out of the game.
A couple of Google rounds later I found an interesting article on LWN describing the behavior we were facing. Together with kernel/Documentation/sysctl/vm.txt it pointed me to the kernel tunable /proc/sys/vm/zone_reclaim_mode, which was set to 2 on our system. Setting this value to 0 made the kernel process the mlock call way faster, giving us back the normal fast startup speed of Sybase.
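Checking and changing the tunable is a one-liner each; the sysctl.conf entry makes the setting survive reboots:

# show the current setting (2 on our system)
cat /proc/sys/vm/zone_reclaim_mode

# disable zone reclaim at runtime
echo 0 > /proc/sys/vm/zone_reclaim_mode

# make the change persistent across reboots
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf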

To make a long story short: the kernel tries to allocate my 32GB of shared memory sequentially from its local memory zone pool. Since we had already allocated 10GB of RAM to the ramdisk, it hits a lot of non-reclaimable, non-swappable pages. Iterating through all 10GB of occupied RAM takes time, in our case two hours. Setting zone_reclaim_mode = 0 makes the kernel skip the reclaim attempt and fall back to other zones right away, hence the shared memory allocation is faster and the database starts quickly.
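If you want to see the NUMA node layout that these zones correspond to on your own machine, numactl (available in the CentOS repositories) can show it:

numactl --hardware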

It seems that the kernel developers also think that zone_reclaim_mode = 0 is a good idea, and it will be the default in future kernel versions. Let’s hope Sybase and Red Hat (CentOS) take this change from upstream and integrate it into their products / troubleshooting guides. I also think that this issue could probably hit you on Oracle, PostgreSQL or other databases as well; probably anything that allocates shared memory.
