Over the last year, I faced multiple stall issues. The MySQL service would stall at any time without leaving a clue in any log, and it was restarted after 15 minutes of stall. We (my team and I) were clueless.
Initially, we were under the impression that we might have hit a bug. We went through the MySQL bug reports and tried multiple options:
1. Moving the redo logs to magnetic disk
2. Optimizing the flushing method/threads
3. Optimizing all databases (recreating them with mysqldump and restore)
Still no luck; we were still clueless.
We moved forward and tried capturing all the information we could using pt-stalk:
pt-stalk --user= --ask-pass --collect --daemonize --run-time=10 --sleep=10 --cycles=3 --dest= --log=
On the next failure we analyzed the collected data and were still unable to find the RCA. We were under the impression that some queries were causing this behavior, so we changed our focus and tried to optimize every query we could. Still we faced random downtime: MySQL would stall and we had to restart it because it stopped responding.
We used oprofile with pt-stalk, which led us to issues with THP (Transparent Huge Pages). We also got a clue from Oliver's blog post.
We disabled THP. One system became stable, while another was still stalling randomly. We started using perf to dig deeper into the system calls.
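For readers who want to check THP on their own hosts, here is a minimal sketch and not necessarily the exact steps we ran. It assumes the usual sysfs layout: /sys/kernel/mm/transparent_hugepage on mainline kernels and /sys/kernel/mm/redhat_transparent_hugepage on RHEL 6 style kernels. The helper name thp_active_mode is made up for illustration.

```shell
#!/bin/sh
# Print the active THP mode from a sysfs-style value such as
# "always madvise [never]"; the active mode is the bracketed word.
# (thp_active_mode is a hypothetical helper name.)
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Report the mode for whichever sysfs path exists on this host.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        echo "$f: $(thp_active_mode "$(cat "$f")")"
    fi
done

# To disable THP until the next reboot (as root), write "never" to the
# "enabled" and "defrag" files under the same directory, for example:
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
#   echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

A value of "never" means THP is off. The change made with echo does not survive a reboot, so it also has to go into whatever boot-time configuration the host uses.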
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.81 1122.971901       27982     40132     14704 futex
 19.98  316.848873     1460133       217         6 restart_syscall
  5.70   90.315204       52448      1722           io_getevents
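The summary above has the column layout that strace prints with its -c option. As a rough sketch of how to spot the dominant system call in such a summary, the hypothetical helper below (top_syscall is an illustrative name, not part of any tool) reads a summary on stdin and prints the call with the largest share of time.

```shell
#!/bin/sh
# Read an "strace -c" style summary on stdin and print
# "<syscall> <percent>" for the row with the largest "% time" value.
# Header, separator and "total" rows are ignored.
# (top_syscall is a hypothetical helper name.)
top_syscall() {
    awk 'BEGIN { max = -1 }
         $NF != "total" && $1 ~ /^[0-9]+\.[0-9]+$/ && $1 + 0 > max {
             max = $1 + 0; name = $NF
         }
         END { if (name != "") printf "%s %.2f\n", name, max }'
}

# Possible usage, with <pid> standing in for the mysqld process id and
# the output path chosen only for illustration:
#   strace -c -f -p <pid> -o /tmp/syscall_summary.txt   # stop with Ctrl-C
#   top_syscall < /tmp/syscall_summary.txt
```

Fed the three rows shown above, this prints "futex 70.81", which is the signal that pointed us at futex.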
It led us to the futex bug in kernel-2.6.32-504 described by the blogger.
It was leaving us clueless every time. First we tested this by attaching strace, which resumed the MySQL process. Later we upgraded our kernel version and the MySQL service started working perfectly fine.
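A quick way to see whether a host runs a kernel from the series mentioned above is to look at uname -r. This is only a sketch: the helper name is_suspect_kernel is made up, and it flags every 2.6.32-504 build for manual review, because whether a particular build already carries the fix has to be confirmed in the vendor's changelog.

```shell
#!/bin/sh
# Return 0 if the given kernel release string belongs to the
# 2.6.32-504 series named in this post, 1 otherwise.
# (is_suspect_kernel is a hypothetical helper name.)
is_suspect_kernel() {
    case "$1" in
        2.6.32-504|2.6.32-504.*) return 0 ;;
        *) return 1 ;;
    esac
}

release=$(uname -r)
if is_suspect_kernel "$release"; then
    echo "Kernel $release is in the 2.6.32-504 series; check the vendor changelog for the futex fix"
else
    echo "Kernel $release is not in the 2.6.32-504 series"
fi
```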
A detailed discussion of this bug is available at
https://groups.google.com/forum/#!searchin/mechanical-sympathy/futex/mechanical-sympathy/QbmpZxp6C64/BonaHiVbEmsJ
Learning
1. THP is not good for databases.
2. Linux expertise is always needed for troubleshooting.
3. Start thinking outside the database when troubleshooting (MySQL does not always hit a bug :) )