philipcristiano's posterous

 

Interesting analysis of Googlebot requests

We did it. We solved one of the big unsolved SEO (Search Engine Optimization) mysteries of modern times. It took quite some time and dragged us down into the deep pits of the TCP/IP and HTTP 1.1 specifications, but we finally emerged victorious.

What’s the mystery?

Sometimes in Google Webmaster Tools -> Sitemaps you see error messages like: “We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.”

[Screenshot: Webmaster Tools - Sitemaps]

Then you start investigating: yes, the file is there; yes, it is accessible to Googlebot; yes, it has content; yes, it follows the format specified on sitemaps.org.

Then you dig deeper: the error logs, the access logs, all the logs. And then you realize: that f*cking GET request does not exist in the timespan Google reported (to be sure, you look at a longer timespan; still nothing there).

No GET request!
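
The log check described above can be sketched roughly as follows. This is only an illustration: it assumes an access log in the common Apache/nginx style with a `[dd/Mon/yyyy:HH:MM:SS +zzzz]` timestamp, and the function name `find_requests`, the path `/sitemap.xml` and the sample lines are made up for the example.

```python
import re
from datetime import datetime, timezone

# Assumption: timestamps look like [01/May/2011:12:00:05 +0000], followed
# by the quoted request line. Adjust the pattern to your server's format.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)')


def find_requests(log_lines, path, start, end):
    """Return the log lines that record a GET for `path` between start and end."""
    hits = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a request line we understand
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if (match.group("method") == "GET"
                and match.group("path") == path
                and start <= ts <= end):
            hits.append(line)
    return hits


# Hypothetical sample data, not taken from a real log.
sample_log = [
    '66.249.66.1 - - [01/May/2011:12:00:05 +0000] "GET /index.html HTTP/1.1" 200 512',
    '66.249.66.1 - - [01/May/2011:12:00:07 +0000] "GET /slow-page HTTP/1.1" 200 2048',
]
window_start = datetime(2011, 5, 1, 11, 0, tzinfo=timezone.utc)
window_end = datetime(2011, 5, 1, 13, 0, tzinfo=timezone.utc)
sitemap_hits = find_requests(sample_log, "/sitemap.xml", window_start, window_end)
index_hits = find_requests(sample_log, "/index.html", window_start, window_end)
```

An empty `sitemap_hits` for the window Google reported is exactly the "no GET request" situation: the server never logged the request at all.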

You look at the network, you look at the DNS: could it be that the requests went astray? You dig and dig, and even bother to open a ticket with your server hosting company. Still nothing comes up, no leads.

Then you dig deeper: TCP/IP and HTTP 1.1

  • You realize that Googlebot makes multiple (we counted up to 11) GET requests in one single TCP/IP connection (which is OK according to the HTTP 1.1 spec).
  • You realize (with the help of Stack Overflow) that these multiple GET requests in the same TCP/IP connection are processed in sequence (one after the other).
  • You realize that if one of these GET requests has a major time lag (is much slower than the other GET requests), Google cuts the TCP/IP connection.
  • Because all the GET requests in the connection were processed in sequence, all the GET requests after the cut are lost. You don’t see them in the error/access logs as they were never processed, even though they were sent.
  • You see an error in Google Webmaster Tools, without a trace in your logfiles.
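
The chain of events in the list above can be illustrated with a small toy model. It is a simulation, not real networking, and it rests on assumptions: the name `simulate_connection`, the `client_patience` threshold and the example durations are invented here (Google's actual cut-off is not known), and the slow request itself is assumed to still be processed and logged.

```python
def simulate_connection(durations, client_patience):
    """Toy model of one connection carrying several GET requests.

    durations: seconds the server needs for each request, in the order
        the requests were sent on the connection.
    client_patience: hypothetical number of seconds the client waits
        for a single response before cutting the connection.

    Returns (logged, lost): indexes of requests the server processed
    (and would log) and indexes of requests that were never processed.
    """
    logged, lost = [], []
    connection_open = True
    for index, seconds in enumerate(durations):
        if not connection_open:
            # Sent by the client, but the server never got around to it.
            lost.append(index)
            continue
        # Requests are handled strictly one after the other.
        logged.append(index)
        if seconds > client_patience:
            # The client gives up on the slow response and closes the
            # connection, taking every queued request down with it.
            connection_open = False
    return logged, lost


# Five requests; the third one is far slower than the rest.
logged, lost = simulate_connection([0.1, 0.2, 9.0, 0.1, 0.1], client_patience=5.0)
print("in the logs:", logged)
print("never processed:", lost)
```

If the sitemap request happens to be one of the queued requests behind the slow one, it ends up in the lost list: reported as an error by Google, yet absent from every log on the server.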

SEO Mystery solved.

If you don’t understand a single word I just wrote, please remember: we are geeks.

Filed under  //   Google   HTTP   Sitemap  
