stevel

Updated spiders.txt Official Support Topic


Your English is fine.

 

Ask your host where your web access logs are stored. This varies from host to host. You will want to find the log from the day of the access. It is a series of lines, one per access. Search it for the IP address and you should find one or more lines showing a GET access from that IP. When you find one, post a sample line here.

 

Here's one from one of my sites, as an example:

 

66.249.70.76 - - [12/Apr/2008:00:05:53 -0400] "GET /pg-070708.php HTTP/1.1" 200 22812 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

 

Yours may look somewhat different, but it should be recognizable.
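
If the log is large, the lookup described above can be scripted. The following is only a rough Python sketch, assuming your log lines resemble the sample in this post; the function name `find_hits` and the second sample line (with a documentation-only IP) are made up for illustration, and your host's format may need a different pattern.

```python
import re

# Matches the leading part of a common/combined-style access log line.
# Assumption: the layout resembles the sample line in this thread.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)(?P<rest>.*)$'
)


def find_hits(lines, ip):
    """Return parsed entries for every log line whose client IP equals `ip`."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m.group("ip") != ip:
            continue
        # Any quoted fields after the byte count (referrer, user agent, ...).
        quoted = re.findall(r'"([^"]*)"', m.group("rest"))
        hits.append({
            "time": m.group("time"),
            "method": m.group("method"),
            "path": m.group("path"),
            "status": int(m.group("status")),
            "quoted_fields": quoted,
        })
    return hits


sample = [
    # Sample line quoted in this post.
    '66.249.70.76 - - [12/Apr/2008:00:05:53 -0400] "GET /pg-070708.php HTTP/1.1" '
    '200 22812 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    # Hypothetical line from a different client, for contrast.
    '192.0.2.1 - - [12/Apr/2008:00:06:10 -0400] "GET /index.php HTTP/1.1" '
    '200 1000 "-" "ExampleBrowser/1.0"',
]

for hit in find_hits(sample, "66.249.70.76"):
    print(hit["time"], hit["method"], hit["path"], hit["quoted_fields"])
```

To run it against a real log, pass the open file instead of `sample`; the user agent is normally one of the quoted fields at the end of the line.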

 

The various stats and "who's online" additions don't give you this information. Also, I have found that the "who's online enhancements", etc., are often mistaken about bots having sessions. The line from the access log should have a clue about this too.


OK, here's the line:

64.124.148.65 - - [04/May/2008:00:16:31 +0200] "GET /tower-crane-p-58.html HTTP/1.1" 200 27735 "-" "Mozilla/5.0 (compatible; FatBot 2.0; http://www.thefind.com/crawler)"

 

So you're right, it's FatBot, and it should be caught by "tbot", but I don't know why it's not working. Also, for the last two days the Yahoo bot has been crawling my site, and it seems it is also creating sessions (or at least that's what Who's Online and Visitor Web Stats say).

Googlebot, on the other hand, is not creating sessions.

Any ideas?

 

 

By the way, just to be sure: spiders.txt has to be located in /includes, right?


It is not getting a session. If it were, you'd see the session ID in the URL. (VERY few if any crawlers accept cookies.) Not only would "tbot" catch this but so would the string "crawl". Yes, spiders.txt goes in /includes and you must enable "Prevent Spider Sessions" in admin.
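
To see why both "tbot" and "crawl" would catch that user agent, here is a small Python sketch. It assumes each spiders.txt line is applied as a case-insensitive substring test against the user agent, which is what the remarks above imply; it is an illustration only, not the store's actual PHP code, and the function name `matching_entries` is made up.

```python
def matching_entries(user_agent, spider_strings):
    """Return the spider strings that occur anywhere in the user agent.

    Assumption: matching is a plain, case-insensitive substring test,
    one test per non-empty line of spiders.txt.
    """
    ua = user_agent.lower()
    matches = []
    for entry in spider_strings:
        entry = entry.strip().lower()
        if entry and entry in ua:
            matches.append(entry)
    return matches


# The two strings mentioned in this post; a real spiders.txt has many more.
spider_strings = ["tbot", "crawl"]

fatbot = "Mozilla/5.0 (compatible; FatBot 2.0; http://www.thefind.com/crawler)"
# A typical browser user agent, used here only as a contrasting example.
browser = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(matching_entries(fatbot, spider_strings))
print(matching_entries(browser, spider_strings))
```

Under that assumption, the FatBot user agent matches both strings ("FatBot" contains "tbot", and "crawler" contains "crawl"), while the browser user agent matches neither and would be given a session as usual.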

 

If you'll post or PM me the URL of your store, I'll test it to make sure it's working. But otherwise, I'd say that everything is working the way it should.

ScanAlert will already be detected as its UA includes the string "/bot".

Any idea why ScanAlert is still regularly able to start a session?

 

se2-scan02.scanalert.com

Name: Guest

ID: 0

 

IP Address: 209.67.114.33

 

User Agent: Mozilla/5.0 (compatible; MSIE 7.0; MSIE 6.0; ScanAlert; +http://www.scanalert.com/bot.jsp) Firefox/2.0.0.3

 

osCsid: 3ee6fd417750254a5b5782dd968dfff1



If ScanAlert accepts cookies (very rare for a spider) and came in with a session ID in the URL, then spiders.txt would be skipped. Another possibility is that the display you're looking at is mistaken. A third is that your store is not properly using spiders.txt. If you'll give me the URL of your store I can test it.


From what I can tell, spiders.txt is not being used on your store. I switched the user agent to Googlebot and it still got a session. I did notice with your store something I have seen with others in that even on the first page, the links don't have a session ID in the URL, indicating that a cookie was set initially. I'd be curious to know how that was done. But in any event, the "prevent spider sessions" code is not running.


Hmm, that's strange, I've never seen Googlebot, Yahoo!, msnbot, Jakarta or you name it starting a session... but maybe I made a mistake somewhere in settings or in a contribution usage.

 

This is in my Sessions:

 

Allow Auto Login - true

Session Directory - /tmp

Force Cookie Use - False

Check SSL Session ID - False

Check User Agent - False

Check IP Address - False

Prevent Spider Sessions - True

Recreate Session - False

 

It's been a while, but I believe the missing session ID was achieved by a contribution called BR&R. I didn't really like the session ID in the URL, and switching Force Cookie Use to "True" wasn't an option, hence the BR&R.

Edited by mr_absinthe


The settings don't help because you've clearly changed the code that handles session starting. You'd have to debug the code in application_top.php and follow the code flow.


This one has just started spidering my site and seems to be picking up sessions:

 

64.40.117.118 - - [07/Jul/2008:02:50:51 +0200] "GET /robots.txt HTTP/1.0" 200 367 www.mysite.co.uk "-" "Sphere Scout&v4.0 - scout at sphere dot com" "-"

Edited by perfectpassion


The spiders at 82.99.30.52 and 82.99.30.13 have a session ID and keep adding stuff to the cart.

 

Can someone tell me how to stop this?

 

Thanks

Edited by SpiceUp


OrgName: RIPE Network Coordination Centre

OrgID: RIPE

Address: P.O. Box 10096

City: Amsterdam

StateProv:

PostalCode: 1001EB

Country: NL

 

ReferralServer: whois://whois.ripe.net:43

 

NetRange: 82.0.0.0 - 82.255.255.255

CIDR: 82.0.0.0/8

NetName: 82-RIPE

NetHandle: NET-82-0-0-0-1

Parent:

NetType: Allocated to RIPE NCC

NameServer: NS-PRI.RIPE.NET

NameServer: NS3.NIC.FR

NameServer: SEC1.APNIC.NET

NameServer: SEC3.APNIC.NET

NameServer: SUNIC.SUNET.SE

NameServer: TINNIE.ARIN.NET

Comment: These addresses have been further assigned to users in

Comment: the RIPE NCC region. Contact information can be found in

Comment: the RIPE database at http://www.ripe.net/whois

RegDate: 2002-11-23

Updated: 2004-03-16

 

# ARIN WHOIS database, last updated 2008-07-15 19:10

# Enter ? for additional hints on searching ARIN's WHOIS database.

 

 

What do I add to my spiders.txt to stop this spider from getting a session ID?

 

Thanks

Edited by SpiceUp


Is this what you need?

 

82.99.30.70 - - [16/Jul/2008:13:30:03 -0500] "GET /store/catalog/product_info.php?products_id=792&osCsid=5eb127c7df7b17cad70f1e422efd6e71 HTTP/1.0" 200 34628 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


Yes, that's it. Unfortunately, the user agent looks like a normal interactive user, so there is no string for spiders.txt to match. You can add a "Deny from" entry to your .htaccess file (if your web host supports that) to block the 82.99.30.* IP range. spiders.txt won't help you here.
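
Before blocking a range, it can help to confirm which requests carry a session ID and whether the client really falls inside 82.99.30.*. A minimal Python sketch follows, assuming the session ID travels in the `osCsid` query parameter as in the log line posted above; the helper names `session_id_from_path` and `in_blocked_range` are made up for illustration.

```python
import ipaddress
from urllib.parse import urlsplit, parse_qs

# The 82.99.30.* range discussed in this thread, written as a /24 network.
BLOCKED_NET = ipaddress.ip_network("82.99.30.0/24")


def session_id_from_path(path):
    """Return the osCsid value from a request path, or None if absent."""
    query = urlsplit(path).query
    values = parse_qs(query).get("osCsid")
    return values[0] if values else None


def in_blocked_range(ip):
    """True if the address lies inside the 82.99.30.* range."""
    return ipaddress.ip_address(ip) in BLOCKED_NET


# Request path and client IP taken from the log line posted in this thread.
path = ("/store/catalog/product_info.php"
        "?products_id=792&osCsid=5eb127c7df7b17cad70f1e422efd6e71")
print(session_id_from_path(path))
print(in_blocked_range("82.99.30.70"))
```

A request that shows a session ID, a browser-like user agent, and an address in the range is the kind of visitor that only an IP block can stop.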


Have you tried to disallow that page in your "robots.txt" file?

:unsure:



All the other Yahoo bots are OK; only this one gets a session ID.

Edited by SpiceUp

Have you tried to disallow that page in your "robots.txt" file?

:unsure:

 

Can you please tell me how to do that?

 

Thanks
