stevel

Updated spiders.txt Official Support Topic


Just post the lines showing accesses by the IPs you are worried about. Just one line from each would be fine. You're looking for the User Agent string which, for normal users, shows the name of the browser. For well-behaved spiders, it will have an identification such as Googlebot. There are also individuals who use generic software to create their own spiders, sometimes for not nice reasons. Those may be hard to identify, sometimes they pretend to be a regular web browser.

I am unable to see those IP addresses in the access_logs.

Any other suggestions?


I think this is one

86.145.13.224 - - [17/May/2007:11:09:43 +0100] "GET /images/infobox/corner_right.gif HTTP/1.1" 200 86 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.0.0)"


I doubt that's an entry from a spider. That looks like a normal user browser. If you can't find the entries in the access log, then perhaps whatever you're using to detect these sessions is broken.


I think I got one!

82.153.66.61 - - [18/May/2007:08:41:48 +0100] "GET /product_info.php?products_id=28761 HTTP/1.0" 200 100374 "-" "-"

In my Who's Online page, this IP address shows as unknown, and it has been crawling my site for a few weeks now.

How can I get rid of it?

Please help.

 

Kunal


I think I got one!

65.214.39.180 - - [18/May/2007:04:36:53 +0100] "GET /robots.txt HTTP/1.1" 200 3750 "-" "-"

65.36.241.81 - - [18/May/2007:04:33:38 +0100] "HEAD / HTTP/1.1" 200 - "-" "InternetSeer.com"

82.153.66.61 - - [18/May/2007:08:41:48 +0100] "GET /product_info.php?products_id=28761 HTTP/1.0" 200 100374 "-" "-"

64.233.182.136 - - [18/May/2007:08:47:12 +0100] "GET /images/247mid/LEXW002348148.jpg HTTP/1.0" 404 292 "-" "-"

In my Who's Online page, the IP address shows as unknown, and it has been crawling my site for a few weeks now.

How can I get rid of it?

Please help.

 

Kunal


Other than the line for InternetSeer, all of the user agent strings are blank. You cannot use spiders.txt to defend against those.
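For anyone who wants to check this on a larger log, here is a minimal PHP sketch that pulls the user agent out of each line. It assumes the log is in Apache's "combined" format, as the lines posted above appear to be, and the function names are hypothetical. It does not handle escaped quotes inside a field.

```php
<?php
// Extract the client IP, request line, and user agent from one line of an
// Apache "combined" format access log. Returns null if the line does not match.
function parse_access_log_line(string $line): ?array
{
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }
    return [
        'ip'         => $m[1],
        'time'       => $m[2],
        'request'    => $m[3],
        'status'     => (int) $m[4],
        'referer'    => $m[6],
        'user_agent' => $m[7],
    ];
}

// A "-" in the log means the client sent no User-Agent header.
function has_user_agent(array $entry): bool
{
    return $entry['user_agent'] !== '' && $entry['user_agent'] !== '-';
}

// Two of the lines posted above: one with a user agent, one without.
$lines = [
    '65.36.241.81 - - [18/May/2007:04:33:38 +0100] "HEAD / HTTP/1.1" 200 - "-" "InternetSeer.com"',
    '82.153.66.61 - - [18/May/2007:08:41:48 +0100] "GET /product_info.php?products_id=28761 HTTP/1.0" 200 100374 "-" "-"',
];
foreach ($lines as $line) {
    $entry = parse_access_log_line($line);
    if ($entry === null) {
        continue; // skip lines that are not in the expected format
    }
    $ua = has_user_agent($entry) ? $entry['user_agent'] : '(blank)';
    echo $entry['ip'], ' => ', $ua, "\n";
}
```

Only entries that report a non-blank user agent give you a string that spiders.txt could match against.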

 

You show four different IPs and it would appear these are four independent accesses of your site. If there is a particular IP you want to block, you can do that in a .htaccess file. The one that puzzles me is 64.233.182.136 - this is a google.com IP but there is no user agent string? That is highly unusual.
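As a rough illustration of the .htaccess approach, a per-IP block might look like the fragment below. This is a sketch that assumes Apache 2.2 or earlier and a server configuration whose AllowOverride setting permits access directives in .htaccess. Apache 2.4 replaces these directives with Require. The address is just one of the IPs posted above, used as an example.

```apache
# Deny one address and allow everyone else (Apache 2.2-and-earlier syntax).
# With "Order Allow,Deny", a request matching both Allow and Deny is denied.
Order Allow,Deny
Allow from all
Deny from 82.153.66.61
```

Keep in mind that a determined crawler can simply come back from a different address, so this only helps against a persistent single IP.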


Hi Steve

 

I have just seen something strange in my Google Sitemaps report. It says:

URLs restricted by robots.txt (1841), all coming from ...allprods.php?...buy_now&products_id=

Does this mean that robots.txt and spiders.txt are working as they should, or is there another problem? This is the first time this has happened, so I'm not sure what is going on.

 

Thanks for your help.

 

Julie


I don't know - it suggests that something in your robots.txt is blocking Google's spider. Tell me your site URL (in a PM if you don't want to post it) and I'll take a look.

 

Google would not care about spiders.txt - in fact, it would be happy that it is not getting a session ID.


Thanks. I've pm'd you.


Hi, in my Who's Online section I have been seeing the IP address 192.168.1.72 for many days. I noticed that this address is adding many items to the cart and even accessing the Write Review page, but no one can access that page without logging in. So how did this IP reach the page without creating an account? I checked all accounts, and no account holder tried to access this page.

I also got some IPs from Taiwan and China, which made me more worried.

Please guide me on what is happening and whether this is something to worry about.

 

Thanks and regards

 

zee


192.168.anything.anything is a "non-routable" IP range reserved for private networks. So the only way that that IP could show up when accessing your store is:

 

1) Whatever is displaying the IP is getting it wrong

2) The access is coming from a system on the same local network as the server

 

I keep seeing people report odd results from "who's online" displays. I don't trust them. Access logs are all I trust, and this is also where you will get the user agent string useful for spiders.txt filtering.
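To check whether an address falls in one of the private ranges, something like the following PHP sketch would do. The function name is hypothetical, and it covers only the three RFC 1918 IPv4 blocks (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16), not loopback or IPv6.

```php
<?php
// Return true if $ip is a dotted-quad IPv4 address inside one of the
// RFC 1918 private ranges, which are not routable on the public internet.
function is_private_ipv4(string $ip): bool
{
    $addr = ip2long($ip);
    if ($addr === false) {
        return false; // not a valid IPv4 address
    }
    $ranges = [
        ['10.0.0.0', 8],
        ['172.16.0.0', 12],
        ['192.168.0.0', 16],
    ];
    foreach ($ranges as [$network, $bits]) {
        // Keep only the leading $bits bits of both addresses and compare them.
        $mask = -1 << (32 - $bits);
        if (($addr & $mask) === (ip2long($network) & $mask)) {
            return true;
        }
    }
    return false;
}

var_dump(is_private_ipv4('192.168.1.72')); // the address reported above
var_dump(is_private_ipv4('82.153.66.61')); // a public address from the earlier logs
```

If the access log shows the same private address, the request really did come from the server's own network. If only the "who's online" display shows it, the display is the more likely culprit.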


Hi,

It seems my site is being hacked!

61.238.244.86 - - [22/May/2007:20:24:01 +0100] "POST /contact_us.php?action=send&osCsid=f45040a83f4a5da38577d0215a1062f4 HTTP/1.0" 500 599 "-" "-"

This IP address seems to be sending out spam email from my contact_us.php file.

 

Please help me stop this.

Kunal


Hi Steve,

I've read through this thread, all 19 pages, and am cross-eyed now! I wonder if you can help:

 

My problem:

For the last 5 days, Yahoo Slurp has been using up huge amounts of bandwidth. It seems to be stuck and keeps trying to do the following actions:

 

1) buy_now

2) notify

3) write_review

4) constantly hitting the cookie_usage page

5) hitting pages that are disallowed in robots.txt i.e. contact_us & conditions

 

From this thread, I have done:

1) added the code mentioned for includes/modules/product_listing (case 'PRODUCT_LIST_BUY_NOW':)

2) added cookie_usage.php to robots.txt

3) added product_reviews_write to robots.txt

 

Please can you tell me what else I should do?

Should I add the code for includes/functions/general.php for sort links?

Do I need to add the same code for includes/modules/product_listing to other files? If so - which files please?

My robots.txt says Disallow: /con which I thought would cover the contact_us & conditions files - why is slurp not obeying that?

 

I'm a bit stuck with this. I don't want to banish Slurp, but it's not being very well behaved and is p-ing me off somewhat.

Thanks so much in advance for anyone who can help.

Tiger

 

PS

I don't have a problem with session ids - already taken care of that.


I'm feeling lucky today......maybe someone will answer my post!

I do try and answer a simple post when I can just to give something back.

------------------------------------------------

PM me? - I'm not for hire


I find Yahoo Slurp to be a poorly-behaved spider, but I don't have this level of problem on my sites. Adding cookie_usage.php to robots.txt accomplishes nothing - what you need to do is prevent Yahoo from seeing links to pages you don't want it to see. On my sites, I display these links only if $session_started is true. Some of the links I protect this way are:

 

- Sort links

- Review Write

- Any Buy Now or add-to-cart or account pages

- Shopping cart

 

If the spider cannot see the links, it cannot follow them, though Yahoo remembers links it has seen and persists in trying to follow them even if it gets errors.


Hi Steve,

Thanks for replying. I agree, Slurp isn't the most intelligent or well-behaved bot! enigma1 has given me a bit of code to try: a permanent 301 header redirect for when the bots get redirected to the cookie_usage page after trying to do an action. He thinks that's easier than hiding the links from spiders.

 

I'll try that and see how it goes.

 

Do you have an opinion on IRLbot/3.0 (compatible; MSIE 6.0; http://irl.cs.tamu.edu/crawler), he's also been bothering me and I read it's a bad bot - what do you think about it? Should it be banned? Can't see it doing me any good as it's not a search engine but it does eat bandwidth - why should I pay for that...

 

Thanks again

Tiger



Hi all,

 

How can I stop this terrible bot Twiceler-0.9 from eating up almost all of my bandwidth?

I have the newest update of spiders.txt, but this does not stop this damn bot.

 

Thanks,

Eric

Edited by Dynatech


Entries in spiders.txt do not prevent spiders from crawling your site. All this does is keep the spider from getting a session, which keeps it from adding things to the cart and venturing into the account pages. If you want to "ban" a spider, your first step should be an entry in robots.txt, but you'll have to visit the spider's web site to see what ID it looks for. If this does not work, you can add a rewrite rule that gives a "failure" to visitors with specific strings in their user agent strings.
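For the robots.txt step, the usual form is a `User-agent:` line naming the token the spider documents, followed by `Disallow: /`. For the fallback, a rewrite rule in .htaccess might look roughly like the sketch below. It assumes mod_rewrite is available and that "Twiceler" really is a substring of the user agent seen in your access logs. That string is taken from the question above, so check your own logs before relying on it.

```apache
# Return 403 Forbidden to any client whose User-Agent contains "Twiceler".
# [NC] makes the match case-insensitive; [F] sends the 403 response.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Twiceler [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, this does not depend on the spider's cooperation, but it only works as long as the spider keeps identifying itself honestly.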

 

Hiding links may be more work, but it is far better than giving 301 redirects for pages, as the spiders will keep trying the links. Also, hiding the page sort links will dramatically decrease the bandwidth usage as the spiders won't keep revisiting the category pages with different sort orders. I think this will also improve ranking.


 

Thanks Steve,

 

I think the code I added to not show links if there is no SID worked for buy_now on this file: includes/modules/product_listing.php

 

Change

CODE
case 'PRODUCT_LIST_BUY_NOW':
  $lc_align = 'center';
  $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';
  break;

to

CODE
case 'PRODUCT_LIST_BUY_NOW':
  $lc_align = 'center';
  // Only show the Buy Now link when a session exists, so spiders never see it.
  if ($session_started) {
    $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';
  } else {
    $lc_text = ' ';
  }
  break;

 

 

I do agree that the bots trying to do the actions is wasting bandwidth. I'm interested in a different opinion from the 301 redirect I mentioned. If I wanted to implement the code to hide the links if there is no session, what would I need to do?

 

From my logs, the problem with actions (buy_now, notify, write review) is with these files:

product_reviews_info.php

products_new.php

product_reviews.php

product_info.php

 

Will the session check work for these files? If so, how should it be written, and can you give the exact path of these files?

 

Thanks again for your advice, much appreciated.

 

Tiger



You do not want to block product_info.php - that is the page that describes an individual product. You also may not want to block product_reviews, as that means spiders can't see the reviews. The general method you show is what I use.

 

On my stores, if there is no session, I do not display links for:

- Buy Now links

- product_reviews_write

- shopping cart

- login

- checkout_shipping

- account

- account_create

- product listing sort links
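The same check can be wrapped in a small helper so it does not have to be repeated for every link in the list above. This is only a sketch: `link_if_session()` is a hypothetical function, not part of osCommerce, and the link markup is a stand-in for whatever the template builds with its own functions. Only the `$session_started` flag comes from the store itself.

```php
<?php
// Return the link markup only when the visitor has a session. Otherwise
// return a placeholder so the layout cell is not left empty.
// Hypothetical helper: the flag passed in mirrors osCommerce's $session_started.
function link_if_session(bool $session_started, string $link_html, string $placeholder = '&nbsp;'): string
{
    return $session_started ? $link_html : $placeholder;
}

// Stand-in markup for a Buy Now link (illustrative URL only).
$buy_now = '<a href="product_info.php?action=buy_now&amp;products_id=28761">Buy Now</a>';

echo link_if_session(true, $buy_now), "\n";  // visitor with a session: link is shown
echo link_if_session(false, $buy_now), "\n"; // spider without a session: placeholder only
```

Since spiders listed in spiders.txt never get a session, they never see the link and have nothing to follow.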


Thanks Steve, I'll try to suss out what I need to do.

I've had some bot-like behaviour on my site, but it's getting a session ID. There's no user agent, but it comes from IP 208.99.195.54.

 

How would I know if this is a bot? It's not someone browsing the store, I'm sure.

 

Thanks

Tiger



Hello! I was wondering if anyone can help. One of my payment modules doesn't work when I use spiders.txt, as one of the lines is blocking the payment provider, and when a customer goes back to the site their order doesn't complete. If I remove spiders.txt, their order completes, but obviously I would love to continue using spiders.txt. Is there a way I can find out which line to remove without manually going through each line and placing an order each time?

The payment processor is https://securetrading.net/authorize/form.cgi

I was previously aware that I should remove java/ from the spiders.txt file to fix the securetrading module, but I think there might be another one now too. Any help would be much appreciated, thank you :)

kev.


I've already removed java/ from the list, as apparently that is one that can block securetrading, but there's another...
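One way to narrow this down without placing test orders is to take the user agent that the payment provider's callback reports in the access log and test it against every spiders.txt entry at once. The sketch below assumes entries are compared as lowercase substrings of the user agent, one entry per line. The function name, the sample entries, and the sample user agent are all hypothetical and need to be replaced with real values.

```php
<?php
// List every spiders.txt entry that occurs in the given user agent string.
// Assumption: each entry is matched as a lowercase substring of the
// lowercased user agent, one entry per line.
function matching_spider_entries(string $user_agent, array $entries): array
{
    $ua = strtolower($user_agent);
    $matches = [];
    foreach ($entries as $entry) {
        $needle = strtolower(trim($entry));
        if ($needle !== '' && strpos($ua, $needle) !== false) {
            $matches[] = $needle;
        }
    }
    return $matches;
}

// Hypothetical data: replace $entries with the result of file() on your
// spiders.txt, and $callback_ua with the user agent from the access log
// line for the payment provider's callback request.
$entries = ["crawl\n", "java/\n", "spider\n", "libwww\n"];
$callback_ua = 'Java/1.4.2_05';

print_r(matching_spider_entries($callback_ua, $entries));
```

Any entry the function returns is a candidate for removal. If the callback's user agent is blank in the log, spiders.txt is not what is blocking it.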
