stevel

Updated spiders.txt Official Support Topic

Ok, good. Notice that the user agent contains "slurp". This means that it will not be assigned a NEW session, but if the spider comes in with a session already in the URL, it will keep it regardless. This problem can come up if you ran for a while without "prevent spider sessions" on (you DO have it on in admin, right?) and Yahoo remembered the URL with sessions.

 

Use this contrib to take care of that.
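
To make the rule in this post concrete, here is a minimal sketch. It is not osCommerce's actual code: it assumes spiders.txt boils down to a list of substrings matched case-insensitively against the user agent, that the session parameter has the default name osCsid, and the helper names is_spider() and should_start_session() are made up for illustration.

```php
<?php
// Illustrative only: is_spider() and should_start_session() are hypothetical
// helpers, not functions from osCommerce.

// True if the user agent contains any of the given spider strings
// (case-insensitive substring match).
function is_spider(string $userAgent, array $spiderStrings): bool
{
    $userAgent = strtolower($userAgent);
    foreach ($spiderStrings as $needle) {
        $needle = strtolower(trim($needle));
        if ($needle !== '' && strpos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Models the rule from the post: a spider is never given a NEW session,
// but a session ID that is already in the URL is kept regardless.
function should_start_session(
    string $userAgent,
    array $query,
    array $spiderStrings,
    string $sessionName = 'osCsid'
): bool {
    if (isset($query[$sessionName])) {
        return true; // session ID arrived in the URL, so it is kept
    }
    return !is_spider($userAgent, $spiderStrings);
}

$spiders = ['slurp', 'googlebot', 'msnbot'];
var_dump(should_start_session('Yahoo! Slurp', [], $spiders));                  // bool(false)
var_dump(should_start_session('Yahoo! Slurp', ['osCsid' => 'abc'], $spiders)); // bool(true)
```

The second call is the problem case: the spider remembered a URL that already carried a session ID, so turning on "prevent spider sessions" alone does not get rid of it.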

steve, even though that contrib works flawlessly, it has one crucial limitation:

 

RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]

 

 

this implies that you need to add every spider known to mankind to this line, basically what you have been doing with spiders.txt.

 

so why not re-use spiders.txt for that as realized in chemo's php version of this:

 

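if ( $spider_flag == true ){
  // only act when the session name (e.g. osCsid) appears in the requested URL
  // stripos() replaces eregi(), which was deprecated and later removed from PHP
  if ( stripos($_SERVER['REQUEST_URI'], tep_session_name()) !== false ){
    // rebuild the URL of the current script without the session parameter
    $location = tep_href_link(basename($_SERVER['SCRIPT_NAME']), tep_get_all_get_params(array(tep_session_name())), 'NONSSL', false);
    header("HTTP/1.0 301 Moved Permanently");
    header("Location: $location"); // redirect...bye bye
    exit; // stop here so the page is not rendered after the redirect
  }
}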

 

to be used in application_top.php after the spider identification and, if you use SEO URLs, after the inclusion of that class.
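
For anyone curious what "rebuild the URL without the session parameter" amounts to outside osCommerce, here is a rough sketch in plain PHP. strip_session_param() is a hypothetical helper, not part of osCommerce or of the contrib, and the session name osCsid is assumed to be the default.

```php
<?php
// Hypothetical helper for illustration; osCommerce itself does this with
// tep_href_link() and tep_get_all_get_params().
function strip_session_param(string $requestUri, string $sessionName = 'osCsid'): string
{
    $parts = parse_url($requestUri);
    if ($parts === false) {
        return $requestUri; // seriously malformed URL: leave it alone
    }

    $path = $parts['path'] ?? '/';
    $params = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params); // query string -> array
    }
    unset($params[$sessionName]);            // drop the session ID only

    $query = http_build_query($params, '', '&');
    return $query === '' ? $path : $path . '?' . $query;
}

echo strip_session_param('/product_info.php?products_id=1300&osCsid=14c31bf87f746de0c19a3e'), "\n";
// prints: /product_info.php?products_id=1300
```

A spider that requests the first URL would then be sent to the second one with a 301, which is what tells it to forget the session ID.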


Treasurer MFC


I'm away from my sources so can't look at the code, but my recollection is that the search of spiders.txt (and hence the setting of $spider_flag) is skipped if the spider came in with a session ID already in the URL. Good thing in general as that's an expensive operation, though if there's a cookie set, you could skip it.

 

I was not thinking of listing every spider, just those known to be a problem, but the method you propose is nice if it doesn't impact normal users.

the enquiry of spiders.txt is only skipped if you force cookies.


Treasurer MFC

well, and of course if you do not prevent spider sessions, or when the user agent is void.

 

but the former should never have been an option and the latter is obvious.


Treasurer MFC

Actually, I did not have prevent spider sessions set to true (I'm rather red-faced right now). Thank you for pointing this out.

 

I set this up before I ever opened my store, but didn't know much about osC (still don't, but getting better).

 

Thanks for your help!

Nancy


Thanks Steve,

 

Turning that attribute on prevented the 'carts' and the session IDs, which allows me to tell which connections are spiders and which are customers.

 

Q: (if I may) Is it normal for the Yahoo Slurp spider and the Google bots to sit on your site 24x7?

 

If they do, is that a good thing or not?

 

Google is on there about 3/4 of the time, and Yahoo has at least 1 connection all the time.

 

Should this concern me?

 

Thanks!!

Nancy


Well, it isn't unusual when they are first indexing your site, especially if you have lots of links and it looks as if the URLs are different. Eliminating sessions can help. Another thing you can do, which I think is mentioned earlier in this thread, is to disable display of the product listing sort links if there is no session. Another is to not display "buy now" links without a session.
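
A minimal sketch of the "no session, no buy now link" idea, assuming your store has some flag that says whether a session was started (passed in here as $sessionStarted). The function name buy_now_link() and the URL layout are made up for illustration and will differ from what your templates actually use.

```php
<?php
// Hypothetical helper: returns the "buy now" link only for visitors with a
// session. Spiders (which get no session) see no link to follow.
function buy_now_link(bool $sessionStarted, int $productId): string
{
    if (!$sessionStarted) {
        return ''; // nothing for a spider to crawl
    }
    return '<a href="index.php?action=buy_now&amp;products_id=' . $productId . '">Buy Now</a>';
}

echo buy_now_link(true, 1300), "\n";  // link is rendered
echo buy_now_link(false, 1300), "\n"; // empty line: link suppressed
```

The same test can wrap the product listing sort links, so a spider sees one URL per listing page instead of one per sort order.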


Thanks!

I'll try it!


I've got a spider that's not being detected with the current spiders.txt

 

the log line is:

72.14.199.68 - - [05/Feb/2007:02:30:59 +0100] "GET /shop/rss.php HTTP/1.1" 200 1749 www.perfectpassion.co.uk "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)" "-"

 

Thanks,

Tom


Do you see it crawling your entire site? From what I can find out, Feedfetcher is just looking for RSS and Atom feeds. I generally leave out of spiders.txt those that don't pull up store pages. Are there other hits, especially those with session IDs?


Fine. Adding that one to spiders.txt would not accomplish anything anyway. But if you do see spiders getting session IDs, then by all means let me know!

 

I just posted an update to the contrib - the rate of new spiders has fallen off quite a bit - I had not seen a new one for a couple of months.


38.98.120.75

 

This constantly crawls my site.




Does anyone know what this is? It's showing up in my website constantly. At this moment, it's been in my site for over 17 consecutive hours. I looked up the IP address and I see this (copying and pasting):

 

OrgName: Inktomi Corporation

OrgID: INKT

Address: 701 First Ave

City: Sunnyvale

StateProv: CA

PostalCode: 94089

Country: US

 

NetRange: 74.6.0.0 - 74.6.255.255

CIDR: 74.6.0.0/16

NetName: INKTOMI-BLK-6

NetHandle: NET-74-6-0-0-1

Parent: NET-74-0-0-0-0

NetType: Direct Allocation

NameServer: NS1.YAHOO.COM

NameServer: NS2.YAHOO.COM

NameServer: NS3.YAHOO.COM

NameServer: NS4.YAHOO.COM

NameServer: NS5.YAHOO.COM

 

-------------------

 

And it's creating session ids for everything from product pages to the privacy policy. ???

 

Andrea


That's Yahoo Slurp. If you have "Prevent Spider Sessions" set to TRUE, you should not be getting sessions. What user agent shows in your access logs for these entries?


Yes - are you using my updated spiders.txt? It should have the string "slurp" in it that will catch this. What's the URL of your store? I can check to see if the spider check is working properly.


My website is soapoperaworld.com

 

I only replaced my spiders.txt file with your updated file around an hour ago.

 

I'm seeing another thing now that I've seen a million times before, too. IP address-lookup is showing this:

 

OrgName: Microsoft Corp

OrgID: MSFT

Address: One Microsoft Way

City: Redmond

StateProv: WA

PostalCode: 98052

Country: US

 

NetRange: 65.52.0.0 - 65.55.255.255

CIDR: 65.52.0.0/14

NetName: MICROSOFT-1BLK

NetHandle: NET-65-52-0-0-1

Parent: NET-65-0-0-0-0

NetType: Direct Assignment

NameServer: NS1.MSFT.NET

NameServer: NS5.MSFT.NET

NameServer: NS2.MSFT.NET

NameServer: NS3.MSFT.NET

NameServer: NS4.MSFT.NET

----------------------------------

 

The weird thing is...it wasn't there an hour ago yet it says, in Who's Online, it's been there for over 18 hours. It seems to have replaced the Yahoo Slurp entry.

 

Agent: msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)

 

It's creating session ids, as well.


That entry for the MSNbot has now been replaced again with Yahoo Slurp.

 

Is that normal? For different spiders to be exchanging places? I mean...whatever is going on...it's now been there for over 18 hours and the IP address keeps jumping every few minutes, back and forth, from MSN to Yahoo Slurp.

 

Here's what I'm seeing in Who's Online at the moment:

 

18:31:49 0 Guest 74.6.69.160 19:56:53 14:28:28 /product_info.php?products_id=1300&osCsid=14c31bf87f746de0c19a3e

 

As you can see, the spider is sitting on product number 1300 with a session ID attached.


I tried your site with various user agent strings and all looks well. My guess is that you had Prevent Spider Sessions off for a while and these spiders picked up the links with session IDs. It's not that they're getting new ones. You should be able to see this by looking for the first access from a given IP - if it comes in with a session ID, then Prevent Spider Sessions isn't going to remove it.
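
If you want to run that check against your access log rather than by eye, something like the following sketch could do it. It assumes a log format like the one quoted earlier in this thread (IP first, request in double quotes) and the default session name osCsid. The function name and the sample lines are invented for illustration; the IPs are documentation addresses, not real spiders.

```php
<?php
// Hypothetical helper: for each IP, report whether its FIRST request in the
// log already carried a session ID in the URL.
function first_requests_with_session(array $logLines, string $sessionName = 'osCsid'): array
{
    $first = [];
    foreach ($logLines as $line) {
        // Capture the client IP and the requested URI.
        if (!preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "[A-Z]+ (\S+)/', $line, $m)) {
            continue; // not a line in the expected format
        }
        [, $ip, $uri] = $m;
        if (!isset($first[$ip])) {
            $first[$ip] = strpos($uri, $sessionName . '=') !== false;
        }
    }
    return $first;
}

// Made-up sample lines, for illustration only.
$sample = [
    '192.0.2.10 - - [05/Feb/2007:02:30:59 +0100] "GET /product_info.php?products_id=1300&osCsid=abc123 HTTP/1.1" 200 1749',
    '192.0.2.20 - - [05/Feb/2007:02:31:05 +0100] "GET /index.php HTTP/1.1" 200 5120',
    '192.0.2.10 - - [05/Feb/2007:02:31:10 +0100] "GET /index.php HTTP/1.1" 200 5120',
];
print_r(first_requests_with_session($sample));
```

An IP that maps to true arrived with a session ID already in the URL, which points at remembered links rather than at new sessions being handed out. For a real log you would pass in something like file('access.log', FILE_IGNORE_NEW_LINES).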

 

There is a contrib Spider Session Remover which you can use to try to get the spiders to remove the session IDs.

 

I'm not sure what you're seeing that makes you think there's "switching" going on.

Edited by stevel


hi guys

 

i get all spiders listed, but for the last few weeks one of them is called mozilla

 

copy/paste of those lines

 

Active Bot with session 00:46:40 Mozilla 85.10.36.100 21:40:47 22:27:27 /product_print.php?products_id=497&language=Sl No Not Found

 

Active Bot with session 00:00:00 Mozilla 74.6.87.44 22:27:22 22:27:22 /customer_testimonials.php?testimonial_id=21 No Not Found

 

Active Bot with session 00:18:24 Mozilla 85.10.36.125 22:08:01 22:26:25 /product_print.php?products_id=472&language=It No Not Found

 

Active Bot with session 00:00:00 Mozilla 74.6.69.167 22:23:17 22:23:17 /cookie_usage.php No Not Found

 

Inactive Bot with session 00:00:00 Mozilla 74.6.69.220 22:22:16 22:22:16 /index.php No Not Found

 

Inactive Bot with session 07:11:20 Mozilla 66.249.66.2 15:10:46 22:22:06 /customer_testimonials.php?testimonial_id=17 No Not Found

 

 

any idea?

 

thanx


Post again with the relevant lines from your web server access log. All I can do is look up IP addresses here.

 

85.10.36.125 is someone from Slovenia

74.6.87.44 is Yahoo. This should not show up with a session unless it is following a link with a session ID

74.6.69.167 and 74.6.69.220 are also Yahoo

66.249.66.2 is Google

 

I do not trust whatever contrib you are using that displays these lines. It is clearly misidentifying at least some of these.

 

It is interesting that one of the Yahoo visitors got to your cookie_usage page. That suggests that you have active links to the cart or "buy now" and that this particular visitor has no session, another reason to think that the "with session" is bogus.


i'll reinstall the contrib to see what happens.

 

thanx
