I have this in my .htaccess file:
Code:
RewriteCond %{HTTP_USER_AGENT} ^baiduspider [NC]
RewriteRule .* - [F]
RewriteCond %{HTTP_USER_AGENT} ^yandex [NC]
RewriteRule .* - [F]
and also this in robots.txt:
Code:
User-agent: Baiduspider
Disallow: /

User-agent: Yandex
Disallow: /
Yet they're still accessing the forum. Does anyone have any other ideas for putting a complete stop to both bots accessing a site? They seem to ignore robots.txt and get around the .htaccess rules.
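One possible reason the .htaccess rules above get bypassed: the `^` anchors the pattern to the very start of the User-Agent header, and crawlers commonly send a string that begins with something like `Mozilla/5.0 (compatible; Baiduspider/2.0; ...`, in which case `^baiduspider` never matches. A minimal sketch of an unanchored version, assuming mod_rewrite is enabled and .htaccess overrides are allowed on the host:

```apache
# Assumption: mod_rewrite is loaded and AllowOverride permits these directives.
RewriteEngine On
# No ^ anchor, so the bot name matches anywhere in the User-Agent header.
# [NC] makes the comparison case-insensitive.
RewriteCond %{HTTP_USER_AGENT} (baiduspider|yandex) [NC]
# "-" leaves the URL untouched; [F] answers with 403 Forbidden.
RewriteRule .* - [F]
```

A quick way to check is to request a page with a spoofed agent, for example `curl -I -A "Mozilla/5.0 (compatible; Baiduspider/2.0)" http://your-site/` (the hostname is a placeholder), and confirm the response is 403.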
From what I'm reading, Baidu completely ignores the robots.txt file and is causing trouble for lots of people. The recommended action appears to be to edit the httpd.conf file to say:
Code:
SetEnvIfNoCase User-Agent "^Baidu" bad_bot
<Directory />
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Directory>
I can't confirm or deny how well this works; it's just the most popular suggestion I'm seeing across the net.
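Two things worth noting about that snippet. `Order`/`Allow`/`Deny` is Apache 2.2 syntax, which on Apache 2.4 only works through mod_access_compat, and the `^` anchor only matches if the User-Agent string actually starts with "Baidu". A rough Apache 2.4 equivalent might look like the sketch below; the directory path is a hypothetical placeholder, and the Yandex line is added here only because the original poster wants both bots gone:

```apache
# Flag any request whose User-Agent contains either name
# (case-insensitive, no ^ anchor).
SetEnvIfNoCase User-Agent "Baidu" bad_bot
SetEnvIfNoCase User-Agent "Yandex" bad_bot

# Hypothetical document root; replace with the real path on your server.
<Directory "/home/example/public_html">
    <RequireAll>
        # Everyone is allowed in...
        Require all granted
        # ...except requests flagged as bad_bot above.
        Require not env bad_bot
    </RequireAll>
</Directory>
```

Putting this in the main server config rather than per-site .htaccess files means it applies to every site under that directory tree, which is what the original poster is after.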
Thank you very much, I will try that. Hopefully it will work and remove the need for the rules in each site's .htaccess and robots.txt, as I don't want them on any of our sites.
It appears that neither spider takes a blind bit of notice of the robots.txt so anything in there will be completely ignored. I've read a few cases where it's spidering forums set as private areas as well.
You would think they would respect them, wouldn't you? I always thought having such rules was like a "no" to them, please bugger off. The main reason I want rid of them is that every time they come online they don't just bring one; there are at least 100 spiders crawling as fast as hell, and our VPS can't handle that on top of good bots, guests and members, not at the rate they crawl at either. I don't want to upgrade to accommodate them when we get little to no traffic from them each month. It just doesn't seem worth it lol. I have added that ruleset using the includes editor in WHM, hope it works!
You'd think that you could complain about it or something, I am sure that we would all be really annoyed and complaining if Google was crawling our sites and we had no choice but to accept that they ignore robots.txt rules.
I know nothing about coding, but the following code has been added to a forum with Baiduspider problems and it stopped them coming.
Code:
RewriteEngine On
<Files *.*>
order allow,deny
allow from all
deny from 220.181.
</Files>
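For anyone reusing that block: `deny from 220.181.` is a partial-address match, which covers the same addresses as the CIDR range 220.181.0.0/16, and the `RewriteEngine On` line is not needed for access control. `<Files *.*>` also only applies to file names containing a dot. A trimmed-down sketch of the same idea, still in Apache 2.2 syntax (so it assumes mod_access_compat on Apache 2.4):

```apache
# Apache 2.2-style access control; applies to everything under the
# directory this .htaccess file lives in.
Order Allow,Deny
Allow from all
# Same addresses as the partial "220.181." form, written as CIDR.
Deny from 220.181.0.0/16
```

Bear in mind this blocks every visitor in that range, not only the spider.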
Well no, it is the number of them coming online. It isn't just one bot, it is up to 100 at once, and every bot takes a slot that could be a user online. Our forum doesn't gain much traffic from Baidu's search engine, roughly 20 visits per month, and I don't think that will ever increase much, so I would rather block them and allow for 100 extra guests or users online. They do use bandwidth, which I am not fussed about at all; it is just the number they're sending, it isn't just one or two. It isn't just Baidu either, the Yandex spider behaves the same way. Thanks Barry, I have added a whole IP list along with the one you posted to the .htaccess, just in case they get on somehow.
Certainly. I just hope they don't come back :rofl: If they get through all of these measures, God knows what will keep them out :X3:
Well, I'm getting hit with Baidu today...but don't see/notice any issues thus far. About 36 right now....
For a client I ended up blocking the Baidu IP ranges in the firewall because they were making thousands upon thousands of hits on every page every day, draining up to 200 GB/month from the client. The problem with blocking them in .htaccess is that they'll still be served a "Forbidden" page, which still means hits on the server and CPU cycles. Although it's a static page and won't use much, it will *still* use up connections and resources, even on a small scale. In a shared environment, dropping Baidu in the system firewall simply won't be an option at most providers, but on a VPS/dedicated server it would be up to you.
Would adding them to the firewall be more effective than having the rules posted earlier in Apache's config?