Friday, November 18, 2016

How THP and Kernel Version Affect MySQL Stability

Over the last year, I faced multiple stall issues. The MySQL service would stall at any time without leaving a clue in any log, and it was getting restarted after 15 minutes of stall. We (my team and I) were clueless.

Initially, we were under the impression that we had hit a MySQL bug. We went through the MySQL bug reports and tried multiple options:


1. Moving the redo logs to a magnetic disk
2. Optimizing the flushing method and threads
3. Optimizing all databases (recreating them with mysqldump and restore)

Still no luck; we were as clueless as before.

We moved forward and tried capturing all the information using pt-stalk.

pt-stalk --user= --ask-pass --collect --daemonize --run-time=10 --sleep=10 --cycles=3 --dest= --log=

On the next failure we analyzed the collected data and were still unable to find the root cause (RCA). We were under the impression that some queries were causing this behavior, so we changed our focus and tried to optimize every query we could. Still, we kept facing random downtime: MySQL stalled, and we had to restart it because it stopped responding.


We used oprofile with pt-stalk, which led us to issues with THP (Transparent Huge Pages). We also got a clue from Oliver's blog post.
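To see whether THP is active on a host, the kernel's sysfs files can be read directly. The snippet below is only a minimal sketch, not the exact procedure we followed: the helper name thp_active_mode is made up for illustration, and it assumes the standard sysfs locations (mainline kernels use transparent_hugepage, RHEL/CentOS 6 kernels use redhat_transparent_hugepage).

```shell
#!/bin/sh
# Hypothetical helper: extract the active mode from a sysfs line such as
# "[always] madvise never" (the kernel puts the active value in brackets).
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Report the THP state for whichever sysfs path exists on this host.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        echo "$f: $(thp_active_mode "$(cat "$f")")"
    fi
done

# To disable THP until the next reboot (run as root, adjust the path
# to the one that exists on your system):
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
#   echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

A value of "never" means THP is off. The echo commands do not survive a reboot, so they usually need to go into a boot script or onto the kernel command line as well.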


We disabled THP. One system became stable, while another was still stalling randomly. We started using perf to dig deeper into the system calls.


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.81 1122.971901       27982     40132     14704 futex
 19.98  316.848873     1460133       217         6 restart_syscall
  5.70   90.315204       52448      1722           io_getevents

It led us to the futex bug that was present in kernel-2.6.32-504. According to the blogger:


The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone.  If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.
It was leaving us clueless every time. First we tested this by attaching strace to the stalled process, which resumed the MySQL process. Later we upgraded our kernel version, and the MySQL service started working perfectly fine.
A detailed discussion of this bug is available at:

https://groups.google.com/forum/#!searchin/mechanical-sympathy/futex/mechanical-sympathy/QbmpZxp6C64/BonaHiVbEmsJ
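A quick way to check whether a host is running the kernel series mentioned above is to compare the output of uname -r against it. This is only a rough sketch: the function name is_suspect_kernel is hypothetical, and it matches the whole 2.6.32-504 series, so later errata builds in that series that may already carry the fix would be flagged too. Check your distribution's kernel changelog before drawing conclusions.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel release string belongs
# to the 2.6.32-504 series named in this post.
is_suspect_kernel() {
    case "$1" in
        2.6.32-504|2.6.32-504.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the kernel this host is currently running.
if is_suspect_kernel "$(uname -r)"; then
    echo "Kernel $(uname -r) is in the 2.6.32-504 series; check for the futex fix."
else
    echo "Kernel $(uname -r) is not in the 2.6.32-504 series."
fi
```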



Learning

1. THP is not good for databases.
2. Linux expertise is always needed for troubleshooting.
3. Start thinking outside the database when troubleshooting (MySQL doesn't always hit a bug :) )