How to tell Bing how often to crawl your website




Crawl delay and the Bing crawler, MSNBot

One day I found that my server was very slow. After digging further, I realized that the biggest contributor to the server's load was my web server, Apache Tomcat. I checked the web server and application logs from that period and found the following entries:
2013-04-30 09:47:17,286 [http-apr-80-exec-24] [SEARCH ENGINE ACCESS:157.55.35.93] >>> Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
2013-04-30 09:47:17,286 [http-apr-80-exec-66] [SEARCH ENGINE ACCESS:157.55.35.93] >>> Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
2013-04-30 09:47:17,286 [http-apr-80-exec-70] [SEARCH ENGINE ACCESS:157.55.35.93] >>> Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
2013-04-30 09:47:17,364 [http-apr-80-exec-61] [SEARCH ENGINE ACCESS:157.56.93.193] >>> Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
2013-04-30 09:47:17,614 [http-apr-80-exec-31] [SEARCH ENGINE ACCESS:157.55.35.93] >>> Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
2013-04-30 09:47:17,630 [http-apr-80-exec-1] [SEARCH ENGINE ACCESS:157.55.35.93] >>> Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
2013-04-30 09:47:17,676 [http-apr-80-exec-68] [SEARCH ENGINE ACCESS:157.55.35.93] >>> Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
2013-04-30 09:47:17,926 [http-apr-80-exec-14] [SEARCH ENGINE ACCESS:157.55.34.183] >>> Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)


It clearly shows that Bing was crawling my website at 8 requests per second. Sustained over an hour, that comes to 28,800 requests per hour, or around 700,000 hits a day. This is a significant load, more than a small or mid-sized website should have to handle.
If each page hit also triggers database access, the load on your server is even higher. All of this slows down overall server performance and, in the end, has a negative impact on Google rankings (site speed factor).
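
The rate estimate above can be reproduced from the log excerpt itself. The following is a minimal Python sketch, assuming the log format shown above (timestamp with milliseconds at the start of each line); the function name requests_per_second and the "bingbot" marker argument are illustrative choices, not part of any tool mentioned here.

```python
import re
from collections import Counter

# The eight entries from the log excerpt: (time, thread number, IP address).
UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
ENTRIES = [
    ("09:47:17,286", 24, "157.55.35.93"),
    ("09:47:17,286", 66, "157.55.35.93"),
    ("09:47:17,286", 70, "157.55.35.93"),
    ("09:47:17,364", 61, "157.56.93.193"),
    ("09:47:17,614", 31, "157.55.35.93"),
    ("09:47:17,630", 1, "157.55.35.93"),
    ("09:47:17,676", 68, "157.55.35.93"),
    ("09:47:17,926", 14, "157.55.34.183"),
]
LOG_LINES = [
    f"2013-04-30 {t} [http-apr-80-exec-{n}] [SEARCH ENGINE ACCESS:{ip}] >>> {UA}"
    for t, n, ip in ENTRIES
]

# Capture the timestamp down to the second, dropping the milliseconds.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")


def requests_per_second(lines, marker="bingbot"):
    """Count log lines containing `marker`, grouped by whole second."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and marker in line:
            counts[match.group(1)] += 1
    return counts


counts = requests_per_second(LOG_LINES)
peak = max(counts.values())   # busiest second in the excerpt
per_hour = peak * 3600        # if that rate were sustained for an hour
per_day = per_hour * 24       # and for a full day

print(peak, per_hour, per_day)
```

For the excerpt above this gives 8 requests in one second, which extrapolates to 28,800 per hour and 691,200 per day. A single busy second is only a rough indicator; running the same count over a longer stretch of the log gives a more reliable picture.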

The simplest solution is to control the crawl rate by telling search engine crawlers how fast they may crawl your website.

Specify the crawl delay in the robots.txt file
Bing supports the directives of the Robots Exclusion Protocol (REP) as listed in a site's robots.txt file, which is stored at the root folder of a website. The robots.txt file is the only valid place to set a crawl-delay directive for MSNBot.
The robots.txt file can be configured to employ directives set for specific bots and/or a generic directive for all REP-compliant bots. Bing recommends that any crawl-delay directive be made in the generic directive section for all bots to minimize the chance of code mistakes that can affect how a site is indexed by a particular search engine.
Note that any crawl-delay directives set, like any REP directive, are applicable only on the web server instance hosting the robots.txt file.
How to set the crawl delay parameter
In the robots.txt file, within the generic user agent section, add the crawl-delay directive as shown in the example below:
User-agent: *
Crawl-delay: 1
Note: If you only want to change the crawl rate of MSNBot, you can create another section in your robots.txt file specifically for MSNBot to set this directive. However, specifying directives for individual user agents, in addition to using the generic set of directives, is not recommended. This is a common source of crawling errors as sections dedicated to specific user agent directives are often not updated with those in the generic section. An example of a section for MSNBot would look like this:
User-agent: msnbot
Crawl-delay: 1
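Before deploying the file, it can help to confirm that the directive is at least syntactically readable. The sketch below uses Python's standard urllib.robotparser module to read a robots.txt that combines the two examples above; the value 5 in the MSNBot section and the helper name crawl_delay_for are assumptions made for illustration. It shows how Python's parser reads the file, which is not necessarily how Bing interprets it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a generic section plus a dedicated MSNBot section.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 5

User-agent: *
Crawl-delay: 1
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the crawl delay Python's parser reports for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay_msnbot = crawl_delay_for(ROBOTS_TXT, "msnbot")    # dedicated section
delay_other = crawl_delay_for(ROBOTS_TXT, "somebot")    # generic section

print(delay_msnbot, delay_other)
```

A result of None for a user agent means the parser found no usable crawl-delay value for it, which is worth investigating before relying on the file.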
The crawl-delay directive accepts only positive, whole numbers as values. Consider the value listed after the colon as a relative amount of throttling down you want to apply to MSNBot from its default crawl rate. The higher the value, the more throttled down the crawl rate will be.
Bing recommends using the lowest value possible, if you must use any delay, in order to keep the index as fresh as possible with your latest content. We recommend against using any value higher than 10, as that will severely affect the ability of the bot to effectively crawl your site for index freshness.
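The two rules above (positive whole numbers only, and nothing higher than 10) are easy to check automatically. Here is a minimal Python sketch of such a check; the function name check_crawl_delays and its max_recommended parameter are hypothetical, and the parsing is deliberately simple rather than a full robots.txt parser.

```python
def check_crawl_delays(robots_txt, max_recommended=10):
    """Return (line number, message) pairs for questionable crawl-delay values."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        # Drop comments, then split "name: value".
        line = raw.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "crawl-delay":
            continue
        value = value.strip()
        if not (value.isascii() and value.isdigit()) or int(value) == 0:
            problems.append((number, f"not a positive whole number: {value!r}"))
        elif int(value) > max_recommended:
            problems.append((number, f"higher than {max_recommended}: {value}"))
    return problems


good = check_crawl_delays("User-agent: *\nCrawl-delay: 1\n")
fractional = check_crawl_delays("User-agent: *\nCrawl-delay: 0.5\n")
too_high = check_crawl_delays("User-agent: *\nCrawl-delay: 20\n")

print(good, fractional, too_high)
```

An empty list means every crawl-delay line in the file follows both rules; each reported pair points at the line that needs attention.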


Here are some examples of possible values:

Crawl-delay setting      Index refresh speed
No crawl delay set       Normal
1                        Slow
5                        Very slow
10                       Extremely slow



