{"id":43,"date":"2012-05-19T15:00:00","date_gmt":"2012-05-19T15:00:00","guid":{"rendered":"http:\/\/www.olivierdoucet.info\/wordpress\/?p=43"},"modified":"2012-05-19T15:00:00","modified_gmt":"2012-05-19T15:00:00","slug":"debugging-a-mysql-stall","status":"publish","type":"post","link":"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/","title":{"rendered":"Debugging a MySQL stall"},"content":{"rendered":"<div>This article follows \u00ab\u00a0finding an usual cause of max_user_connections\u00a0\u00bb.<\/div>\n<div>After debugging a max_user_connections and a cache expiration problem in the customer app, we still experienced stalls of the mysqld process. MySQL refused all connections for 5 to 20 seconds, with no obvious reason. I&rsquo;ll try to explain what I did to solve it.<\/div>\n<p><!--more--><\/p>\n<h1>Go on, sherlock<\/h1>\n<div>CPU usage was the following :&nbsp;<\/div>\n<div><img decoding=\"async\" src=\"\/blog\/public\/iostatarticle.png\" alt=\"IOSTAT article\" style=\"margin: 0 auto; display: block; \" title=\"IOSTAT article, mai 2012\" \/><\/div>\n<div>Legend: light green is user usage ; dark green is system usage ; violet is IOwait. Y-axis is thread usage (one thread at 100% == 100 in the graph).<\/div>\n<div>Some useful informations on this image :&nbsp;<\/div>\n<div>&#8211; there is no visible pattern on when the problem is happening (confirmed with a graph on 2 days).<\/div>\n<div>&#8211; CPU usage is increasing for a minute &#8230;<\/div>\n<div>&nbsp; and this usage is done by the system<\/div>\n<div>I created a small bash script to get all data I needed when problem happened :&nbsp;<\/div>\n<p><code><\/p>\n<div>while sleep 1; do<\/div>\n<div>mysql -e \"SELECT 1\" &gt; \/dev\/null<\/div>\n<div>&nbsp; if [ \"$?\" != \"0\" ]; then<\/div>\n<div>&nbsp; &nbsp; echo \"Collect !\"<\/div>\n<div>&nbsp; &nbsp; &nbsp; &nbsp; STAMP=$(date \"+%H%M%S\")<\/div>\n<div>&nbsp; &nbsp; &nbsp; &nbsp; echo \"Problem ! Start collecing\"<\/div>\n<div># add here command line to be executed<\/div>\n<div>sleep 5<\/div>\n<div>&nbsp; &nbsp;else echo -ne \".\\r\"<\/div>\n<div>&nbsp; &nbsp;fi<\/div>\n<div>done ;<\/div>\n<p><\/code><\/p>\n<div>I first used this script to record a &lsquo;top&rsquo; session to check which processes were eating all CPU. To get top output for 10 seconds I used the following command line :<\/div>\n<pre>top -b -d 1 -n 10 &gt; $STAMP\"_top\" &amp;<\/pre>\n<div>Remember $STAMP is the timestamp used in my bash script to group outputs.<\/div>\n<div>Result ? mysqld is the top process, with linux kernel process [kworker]. Not a useful information by itself, but it definitely shows there is something wrong happening in the kernel.<\/div>\n<div>How do you debug a kernel on a production server ? After checking various websites, I decided to use \u00ab\u00a0oprofile\u00a0\u00bb (http:\/\/oprofile.sourceforge.net). It consists of a kernel driver and a daemon for collecting sample data, and several post-profiling tools for turning data into information. To my mind, pros of this tool are :&nbsp;<\/div>\n<div>\n<ul>\n<li>Sampling (compared to tools like strace)<\/li>\n<li>Can sample kernel too (compared to &#8230; strace)<\/li>\n<li>gather data in one place, and allow you to zip samples to perform the analysis you want on an other server.<\/li>\n<\/ul><\/div>\n<div>Unfortunately I didn&rsquo;t have oprofile on this server, so we decided to reboot the service in the middle of the night with the customer approval. The day after, we were able to sample some data.<\/div>\n<div>First, let&rsquo;s setup the tool with very low overhead :&nbsp;<\/div>\n<pre>opcontrol --setup --vmlinux=\/root\/vmlinux --event=CPU_CLK_UNHALTED:4000000<\/pre>\n<div>If you wondered : yes, I retrieved the uncompressed version of vmlinuz to \/root\/vmlinux. This is mandatory to profile kernel.<\/div>\n<div>Then, let&rsquo;s record a normal session to see what&rsquo;s happening when everything is fine :&nbsp;<\/div>\n<pre>opcontrol --start --vmlinux=\/root\/vmlinux<br \/>sleep 5<br \/>opcontrol --stop<br \/>opcontrol --dump<br \/>opcontrol --save=idlesession<\/pre>\n<div>Top 5 of symbols were the following :&nbsp;<\/div>\n<pre>samples &nbsp;% &nbsp; &nbsp; &nbsp; &nbsp;symbol name<br \/>462229 &nbsp; 99.0696 &nbsp;poll_idle<br \/>559 &nbsp; &nbsp; &nbsp; 0.1198 &nbsp;clear_page_c<br \/>185 &nbsp; &nbsp; &nbsp; 0.0397 &nbsp;find_busiest_group<br \/>139 &nbsp; &nbsp; &nbsp; 0.0298 &nbsp;clflush_cache_range<br \/>114 &nbsp; &nbsp; &nbsp; 0.0244 &nbsp;mutex_spin_on_owner<\/pre>\n<div>Everything normal here.<\/div>\n<div>Note : I was not able to see symbols on the same server because of a buggy binutils library. This is a known problem that is affecting oprofile. Solution was simple : copy \/var\/lib\/oprofile to another server and use opreport with &#8211;session-dir like this :&nbsp;<\/div>\n<pre>opreport --session-dir=\/root\/var\/lib\/oprofile --symbols session:idlesession vmlinux |head<\/pre>\n<div>&nbsp;&nbsp;<\/div>\n<div>You&rsquo;ll need vmlinux binary in your current folder (for debugging vmlinux symbols. Or if you want it system-wide, just use &#8211;root. See oprofile doc for more informations).<\/div>\n<div>I then modified the bash script I used earlier, and added the following :&nbsp;<\/div>\n<pre>opcontrol --start --vmlinux=\/root\/vmlinux<br \/>sleep 5<br \/>opcontrol --stop<br \/>opcontrol --dump<br \/>opcontrol --save=$STAMP<\/pre>\n<div>We waited a few hours, and collected two sessions. Results were very close so here is the output :<\/div>\n<pre>samples &nbsp;% &nbsp; &nbsp; &nbsp; &nbsp;symbol name<br \/>1000515 &nbsp;78.9433 &nbsp;poll_idle<br \/>217103 &nbsp; 17.1300 &nbsp;compaction_alloc<br \/>6822 &nbsp; &nbsp; &nbsp;0.5383 &nbsp;compact_zone<br \/>5304 &nbsp; &nbsp; &nbsp;0.4185 &nbsp;spin_lock<br \/>4378 &nbsp; &nbsp; &nbsp;0.3454 &nbsp;flush_tlb_others_ipi<\/pre>\n<div>Very interesting ! There were many CPU cycles in compaction_alloc. What is this function ? Let&rsquo;s read kernel source code mm\/compaction.c around line 410 :&nbsp;<\/div>\n<p><code><\/p>\n<div>406 \/*<\/div>\n<div>407 &nbsp;* This is a migrate-callback that \"allocates\" freepages by taking pages<\/div>\n<div>408 &nbsp;* from the isolated freelists in the block we are migrating to.<\/div>\n<div>409 &nbsp;*\/<\/div>\n<div>410 static struct page *compaction_alloc(struct page *migratepage,<\/div>\n<div>411 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; unsigned long data,<\/div>\n<div>412 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; int **result)<\/div>\n<p><\/code><\/p>\n<div>By experience, I know that this can be related to &lsquo;huge pages&rsquo; functionality. Are they enabled ?&nbsp;<\/div>\n<pre>&nbsp;sysctl -a |fgrep vm.nr_hugepages<br \/>&nbsp;vm.nr_hugepages = 0<\/pre>\n<div>Seems not, but wait a second : what about \u00ab\u00a0transparent huge pages\u00a0\u00bb ? This is closely related and after some thought, may fit more for the problem we have currently. If you want to read more about \u00ab\u00a0THP\u00a0\u00bb, this article by Jonathan Corbet is really interesting : <a href=\"http:\/\/lwn.net\/Articles\/423584\/\" hreflang=\"en\">http:\/\/lwn.net\/Articles\/423584\/<\/a>. You can also take a look at the official documentation inside kernel written by the same guy : <a href=\"http:\/\/lwn.net\/Articles\/423592\/\" hreflang=\"en\">http:\/\/lwn.net\/Articles\/423592\/<\/a><\/div>\n<div>To check if THP are enabled, do the following :&nbsp;<\/div>\n<pre>cat \/sys\/kernel\/mm\/transparent_hugepage\/enabled<br \/>[always] madvise never<\/pre>\n<div>This was confirmed by seeing a process \u00ab\u00a0khugepaged\u00a0\u00bb in processlist \ud83d\ude09<\/div>\n<h1>\u00ab\u00a0So, what happened ?\u00a0\u00bb<\/h1>\n<div>This server has 128GB RAM. Memory can get very fragmented, and once in a while, the Linux Kernel tried to migrate THP to defrag memory. This process seemd too agressive in our case and resulted to stall.&nbsp;<\/div>\n<h1>\u00ab\u00a0Can it be fixed ?\u00a0\u00bb<\/h1>\n<div>Long trend solution would be making this migration less aggressive. After a few reads on Linux Kernel mailing list, I found that someone already mentioned that the current code is too drastic and can be less aggressive : <a href=\"https:\/\/lkml.org\/lkml\/2011\/7\/26\/103\" hreflang=\"en\">https:\/\/lkml.org\/lkml\/2011\/7\/26\/103<\/a><\/div>\n<div>This patch is not in kernel 3.2 nor in current trunk and I do not want to maintain kernel patches in our compil process \ud83d\ude42 We could also tune khugepaged behaviour with keys in \/sys\/kernel\/mm\/transparent_hugepage\/khugepaged\/.<\/div>\n<div>But I needed a quick fix, so we disabled THP completely like this :&nbsp;<\/div>\n<pre>&nbsp;echo \"never\" &gt; \/sys\/kernel\/mm\/transparent_hugepage\/enabled<\/pre>\n<div>We now waited for 24 hours and no more stalls ! Problem solved \\o\/<\/div>\n<h1>Conclusion<\/h1>\n<div>In this process I discovered that oprofile can be a great ally when you are fighting kernel strange behaviour, or even if you want to know what MySQL is doing. I could not use gdb nor strace because it was crashing MySQL everytime (this is because this is a very big server with 64 threads and heavy load&#8230;). Oprofile gave me the results I wanted and has no impact on performance when it is not active.<\/div>\n<div>Once again, solving this bug required large panel of skills and I&rsquo;m very proud of it \ud83d\ude42 Fortunately, 99% of problems encountered everyday are not as tough as this one.<\/div>\n","protected":false},"excerpt":{"rendered":"<div>This article follows \u00ab\u00a0finding an usual cause of max_user_connections\u00a0\u00bb.<\/div>\n<div>After debugging a max_user_connections and a cache expiration problem in the customer app, we still experienced stalls of the mysqld process. MySQL refused all connections for 5 to 20 seconds, with no obvious reason. I&rsquo;ll try to explain what I did to solve it.<\/div>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[50,30,36,49],"class_list":["post-43","post","type-post","status-publish","format-standard","hentry","category-php-sql","tag-kernel","tag-linux","tag-mysql","tag-thp"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Debugging a MySQL stall - My thoughts<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Debugging a MySQL stall - My thoughts\" \/>\n<meta property=\"og:description\" content=\"This article follows &quot;finding an usual cause of max_user_connections&quot;.After debugging a max_user_connections and a cache expiration problem in the customer app, we still experienced stalls of the mysqld process. MySQL refused all connections for 5 to 20 seconds, with no obvious reason. I&#039;ll try to explain what I did to solve it.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/\" \/>\n<meta property=\"og:site_name\" content=\"My thoughts\" \/>\n<meta property=\"article:published_time\" content=\"2012-05-19T15:00:00+00:00\" \/>\n<meta name=\"author\" content=\"olivier\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@odoucet\" \/>\n<meta name=\"twitter:site\" content=\"@odoucet\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"olivier\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/\",\"url\":\"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/\",\"name\":\"Debugging a MySQL stall - My thoughts\",\"isPartOf\":{\"@id\":\"https:\/\/www.olivierdoucet.info\/blog\/#website\"},\"datePublished\":\"2012-05-19T15:00:00+00:00\",\"author\":{\"@id\":\"https:\/\/www.olivierdoucet.info\/blog\/#\/schema\/person\/2a02230ae6aeac086b3379e5853c6f96\"},\"breadcrumb\":{\"@id\":\"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.olivierdoucet.info\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Debugging a MySQL stall\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.olivierdoucet.info\/blog\/#website\",\"url\":\"https:\/\/www.olivierdoucet.info\/blog\/\",\"name\":\"My thoughts\",\"description\":\"En fran\u00e7ais and in english :)\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.olivierdoucet.info\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.olivierdoucet.info\/blog\/#\/schema\/person\/2a02230ae6aeac086b3379e5853c6f96\",\"name\":\"olivier\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/www.olivierdoucet.info\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/febd16d726b2266a512e951d03424de3?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/febd16d726b2266a512e951d03424de3?s=96&d=mm&r=g\",\"caption\":\"olivier\"},\"sameAs\":[\"http:\/\/\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Debugging a MySQL stall - My thoughts","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/","og_locale":"fr_FR","og_type":"article","og_title":"Debugging a MySQL stall - My thoughts","og_description":"This article follows \"finding an usual cause of max_user_connections\".After debugging a max_user_connections and a cache expiration problem in the customer app, we still experienced stalls of the mysqld process. MySQL refused all connections for 5 to 20 seconds, with no obvious reason. I'll try to explain what I did to solve it.","og_url":"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/","og_site_name":"My thoughts","article_published_time":"2012-05-19T15:00:00+00:00","author":"olivier","twitter_card":"summary_large_image","twitter_creator":"@odoucet","twitter_site":"@odoucet","twitter_misc":{"\u00c9crit par":"olivier","Dur\u00e9e de lecture estim\u00e9e":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/","url":"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/","name":"Debugging a MySQL stall - My thoughts","isPartOf":{"@id":"https:\/\/www.olivierdoucet.info\/blog\/#website"},"datePublished":"2012-05-19T15:00:00+00:00","author":{"@id":"https:\/\/www.olivierdoucet.info\/blog\/#\/schema\/person\/2a02230ae6aeac086b3379e5853c6f96"},"breadcrumb":{"@id":"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/#breadcrumb"},"inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.olivierdoucet.info\/blog\/2012\/05\/19\/debugging-a-mysql-stall\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.olivierdoucet.info\/blog\/"},{"@type":"ListItem","position":2,"name":"Debugging a MySQL stall"}]},{"@type":"WebSite","@id":"https:\/\/www.olivierdoucet.info\/blog\/#website","url":"https:\/\/www.olivierdoucet.info\/blog\/","name":"My thoughts","description":"En fran\u00e7ais and in english :)","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.olivierdoucet.info\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"fr-FR"},{"@type":"Person","@id":"https:\/\/www.olivierdoucet.info\/blog\/#\/schema\/person\/2a02230ae6aeac086b3379e5853c6f96","name":"olivier","image":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/www.olivierdoucet.info\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/febd16d726b2266a512e951d03424de3?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/febd16d726b2266a512e951d03424de3?s=96&d=mm&r=g","caption":"olivier"},"sameAs":["http:\/\/"]}]}},"_links":{"self":[{"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/posts\/43","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/comments?post=43"}],"version-history":[{"count":0,"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/posts\/43\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/media?parent=43"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/categories?post=43"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.olivierdoucet.info\/blog\/wp-json\/wp\/v2\/tags?post=43"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}