stevel

Updated spiders.txt Official Support Topic

Thank you,

tbot was already in the top section of my latest update of spiders.txt, before bot/.

Being unsure, I added it to the bottom section as well...


Well, it doesn't need to be there twice. I'll look at this more this evening and see if I can spot a problem. Send me a PM with your store URL (or post it here) so I can verify that you have Prevent Spider Sessions on properly.


Thank you - I sent you a PM.

I have added them to my robots.txt as

User-agent: fatbot
Disallow: /

I am hoping this is correct to get rid of them. They have 8 separate connections to my site now. I don't want them eating my bandwidth, and they are messing up my Who's Online with so many connections.

Perhaps I should ban their IP? How would I do this?


Unfortunately, you don't know if this robot even reads robots.txt, and if it did, you don't know what identifier it uses in robots.txt. You could ban the IP for now if you wanted to, but I assume that at some point you wouldn't mind it indexing your store if it was well behaved.

 

What puzzles me, though, is that you told me in your PM that you had Force cookie use set to true. If so, then there's no way a spider should be able to add items to a cart. I'll test this too.


You were right, adding them to robots.txt didn't do anything (perhaps the wrong name).

I banned their IPs through .htaccess and this has gotten rid of them. At some point I may want them to spider, but they have been there for days, with 8 different connections filling cart items. Yikes, enough of that.

The settings I sent you in the PM were copied and pasted directly from my store configuration.
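
For readers asking the same "how would I do this?" question, an IP ban in .htaccess might look roughly like the fragment below. This is a sketch only: it assumes an Apache 2.2-style server, and the addresses are documentation placeholders, not the bot's real IPs.

```apache
# Apache 2.2-style access control (assumed server version).
# 192.0.2.10 and 198.51.100.0/24 are placeholder addresses.
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
Deny from 198.51.100.0/24
```

On Apache 2.4 the same idea is usually expressed with `Require all granted` and `Require not ip ...` inside a `<RequireAll>` block. Either way the ban is easy to lift later if the bot starts behaving.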


There are spiders that actually do accept cookies.


Treasurer MFC


Ok, this is interesting. If you have Force Cookie Use set to on, then "Prevent Spider Sessions" is bypassed. That means that if the spider accepts a cookie, it will get a session, no matter what is in spiders.txt.
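
The interaction described here can be modelled in a few lines. The function below is a simplified illustration of the decision, not the osCommerce source; its name and parameters are made up for the example.

```php
<?php
// Simplified model of the session start-up decision described above.
// All names are illustrative assumptions, not osCommerce identifiers.
function should_start_session(
    bool $forceCookieUse,
    bool $preventSpiderSessions,
    bool $hasTestCookie,
    bool $onSpiderList
): bool {
    if ($forceCookieUse) {
        // Only the test cookie is consulted; the spider list is never read.
        return $hasTestCookie;
    }
    if ($preventSpiderSessions) {
        // Spider list is consulted only when cookies are not forced.
        return !$onSpiderList;
    }
    return true;
}

// A listed spider that returns the test cookie still gets a session:
var_dump(should_start_session(true, true, true, true));   // bool(true)
// With Force Cookie Use off, the same spider is refused:
var_dump(should_start_session(false, true, true, true));  // bool(false)
```

The point of the sketch is the ordering: once the cookie branch is taken, the spider list has no say.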


Exactly. I rewrote the spider handling a long time ago in the belief that no spider accepts cookies.

So I always try to read the cookie first; if I can, I know it cannot be a spider and I need not go through the spiders list evaluation, which saves time.

That was until I suddenly got spiders filling baskets.

So I made the checks anyway, but wrote an entry in the error log whenever a client had a cookie set but was still identified as a spider by the list.

Not many, but they are there.


Treasurer MFC


Amanda, care to share your code for this?

 

I always hated the way osC did cookies - it seemed to me that it should be possible to store the cookie right away if the browser accepted it and fall back on the sid in the URL if not. But I never got "a round tuit" to see if I could figure out how to implement this.

 

Force Cookie Use also doesn't work for shared SSL because of the overly simplistic code in tep_redirect.


Well, I posted that a long time ago in Tips and Tricks under "speeding things up with a cookie".

Basically, osC only sets a test cookie if you force cookies, which is a waste. Why not always set a test cookie?

Besides that, I also set cookies for screen resolution and response time anyway via BR&R, so I can pick. Those cookies last up to 30 days or so and are renewed. So when a client comes in, I always check whether I can read that test cookie (though I could pick either one), and I always try to set one again.

If I can read it, why check the spider list on every page load?

Of course, on the very first visit or after cookie expiration, the spiders list needs to be consulted again, but even then only on the first page load unless cookies are blocked.

However, with a few spiders seemingly having my cookie, I have to do an extra check on them just the same. It also means that Force Cookie Use is not a foolproof defence against spiders.

How I have it in application_top.php:

 

// spider identification
$user_agent = strtolower(getenv('HTTP_USER_AGENT'));
$session_started = false;
$spider_flag = false;
$cookies_exist = false;

// always set a test cookie
tep_setcookie('cookie_test', 'ThankYou', time()+60*60*24*30, $cookie_path, $cookie_domain);

if ((isset($_COOKIE['cookie_test'])) && ($_COOKIE['cookie_test'] != '')) {
  // cookie present
  $cookies_exist = true;
  if ( (stristr($user_agent, 'wisenutbot'))
    or (stristr($user_agent, 'omniexplorer'))
    or (stristr($user_agent, 'converacrawler'))
     ) {
    // known spiders which have my cookie, write agent to errorlog to verify
    error_log('spider cookie: '.$_COOKIE['cookie_test']."\n".'Agent:'.$user_agent."\n".'ip: '.$browser_ip."\n");
    $spider_flag = true;
  } else {
    tep_session_start();
    $session_started = true;
  }
} else {
  // no cookie set yet, check spiders list
  require(DIR_WS_INCLUDES . 'spider_check.php');
  if (!$spider_flag) {
    tep_session_start();
    $session_started = true;
  }
}

// used to suppress session id in url
if (!tep_session_is_registered('cookies_exist')) {
  tep_session_register('cookies_exist');
}


Treasurer MFC
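
The contents of spider_check.php are not shown in the thread. A minimal sketch of what such a check could do, assuming spiders.txt holds one lower-case user-agent fragment per line and that a plain substring match is enough, is given below. The helper name `is_listed_spider` is hypothetical.

```php
<?php
// Hypothetical helper: true when any non-empty list entry occurs as a
// substring of the lower-cased user agent. Not the author's spider_check.php.
function is_listed_spider(string $userAgent, array $spiderEntries): bool
{
    $agent = strtolower($userAgent);
    foreach ($spiderEntries as $entry) {
        $entry = strtolower(trim($entry));
        if ($entry !== '' && strpos($agent, $entry) !== false) {
            return true;
        }
    }
    return false;
}

// In a store the entries would presumably be read from spiders.txt with file();
// here a small inline list stands in for it.
$entries = ["slurp\n", "converacrawler\n", "\n"];

var_dump(is_listed_spider(
    'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
    $entries
)); // bool(true)
var_dump(is_listed_spider('Mozilla/5.0 (Windows; U) Firefox/1.5', $entries)); // bool(false)
```

Because the match is by substring, a short entry can shadow longer ones, which is why the order and duplication of entries in spiders.txt rarely matters.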


Here you can see from the error log that this spider has the cookie set. The content is simply a space and not the value we set, but still, if you check for the existence of the test cookie, it will return true.

 

[22-May-2006 03:58:38] spider with cookie:
Agent:converacrawler/0.9d (+http://www.authoritativeweb.com/crawl)
ip: 63.241.61.7

[22-May-2006 03:58:58] spider with cookie:
Agent:converacrawler/0.9d (+http://www.authoritativeweb.com/crawl)
ip: 63.241.61.7


Treasurer MFC
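
Since the logged cookie held a single space rather than the value that was set, one possible tightening is to compare against the expected value instead of testing for any non-empty content. This is a suggestion sketched under that assumption, not something from the original code; `has_valid_test_cookie` is a made-up name and 'ThankYou' is the value used in the snippet above.

```php
<?php
// Hypothetical stricter check: require the exact value that was set,
// rather than accepting any non-empty cookie content.
function has_valid_test_cookie(array $cookies, string $expected = 'ThankYou'): bool
{
    return isset($cookies['cookie_test']) && $cookies['cookie_test'] === $expected;
}

var_dump(has_valid_test_cookie(['cookie_test' => 'ThankYou'])); // bool(true)
var_dump(has_valid_test_cookie(['cookie_test' => ' ']));        // bool(false)
var_dump(has_valid_test_cookie([]));                            // bool(false)
```

A spider that echoes the cookie back unchanged would still pass, so the spider-list check remains necessary for those.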


Hi there,

I've got a quick question:

I have enabled "Prevent Spider Sessions" in my shop, I use robots.txt, and of course I have an updated version of your great spiders.txt file included.

Is this enough to prevent spiders (those included in the spiders file) from taking the session ID into the index?

I have no SID killer or similar installed yet.

As I have read in this thread, there could be a problem with the "buy now" button.

Can you give me a recommendation on what to do about that problem?

Thanks in advance,
kind regards
Andreas


If the spider does not have a session, it won't be able to add items to the cart through "buy now", but it will try to follow the link and will end up at the cookie_usage page. If you have converted the buy now buttons to forms, this won't happen.

 

Spiders that accept cookies, though, may still add items to the cart by following buy now links.
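
As a rough illustration of "converting the buy now buttons to forms": crawlers generally follow links (GET requests) but do not submit forms, so rendering the button as a POST form keeps them away from the add-to-cart action. The helper below is a hypothetical sketch; its name and the field names `action` and `products_id` are assumptions for the example, not taken from a specific contribution.

```php
<?php
// Hypothetical helper that renders "buy now" as a POST form instead of a link.
// Field names are assumptions for illustration only.
function buy_now_form(string $actionUrl, int $productId): string
{
    return '<form method="post" action="' . htmlspecialchars($actionUrl, ENT_QUOTES) . '">'
         . '<input type="hidden" name="action" value="buy_now">'
         . '<input type="hidden" name="products_id" value="' . $productId . '">'
         . '<input type="submit" value="Buy Now">'
         . '</form>';
}

echo buy_now_form('index.php', 42), "\n";
```

The code that handles the cart action would also need to read these values from the POST data for the form to work.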


I have this contribution in combination with a good robots.txt file, and lately I've been seeing a lot of bots on my cookie usage page and other places they shouldn't be. Has something changed that I'm missing?

 

Pete


I find the fun in everything.


UGH!

 

Here are my settings:

Title                    Value
Session Directory        /tmp
Force Cookie Use         False
Check SSL Session ID     False
Check User Agent         True
Check IP Address         False
Prevent Spider Sessions  True
Recreate Session         False

 

Here is my current who's online (enhanced)

 

Active Bot with session 00:00:00 Mozilla 72.30.132.23 17:51:40 17:51:40 /cookie_usage.php No Not Found
Active Bot with session 00:00:00 Mozilla 68.142.249.88 17:50:19 17:50:19 /cookie_usage.php No Not Found
Active Bot with session 00:00:00 Mozilla 72.30.103.8 17:48:44 17:48:44 /small-pink-floating-heart-candle-iridescent-opalescent-glitter- No Not Found
Active Bot with session 00:00:00 Mozilla 72.30.128.12 17:48:09 17:48:09 /cookie_usage.php No Not Found
Inactive Bot with session 00:00:00 Mozilla 72.30.110.26 17:47:31 17:47:31 /cookie_usage.php No Not Found
Inactive with no Cart 00:00:47 Guest Admin 17:43:57 17:44:44 /create_account.php Yes Not Found
Inactive Bot with session 00:00:00 Mozilla 72.30.129.113 17:39:37 17:39:37 /cookie_usage.php No Not Found

 

Things have not been like this in a LONG time... what the heck is wrong?

 

Help!

 

Thanks,

 

Pete


I find the fun in everything.


Nothing has changed that I know of. Those IPs are Yahoo search, but that doesn't really tell me anything. I'd want to see the entries from the access log for these visits. My guess is that you have "Buy It" links that the bot is following.


I am almost certain I have eliminated all my buy it now buttons...

 

Here is a sample from my log:

 

72.30.133.110 - - [24/May/2006:23:59:18 -0400] "GET /robots.txt HTTP/1.0" 200 895 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

 

 

Is that what you're looking for?

 

the site is: thebestcandles (dot) com

 

Thanks for helping,

 

Pete


I find the fun in everything.


I'm not sure I understand what you're looking for. When I go to my server's "Raw Log" (where I can see yesterday's entries... but this problem was also around yesterday...)

 

Here's a "batch" that I see...

 

72.30.133.110 - - [24/May/2006:23:24:28 -0400] "GET /robots.txt HTTP/1.0" 200 895 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

72.30.103.225 - - [24/May/2006:23:24:29 -0400] "GET /cookie_usage.php HTTP/1.0" 200 24415 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

72.30.131.205 - - [24/May/2006:23:24:29 -0400] "GET /cookie_usage.php HTTP/1.0" 200 24402 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

72.30.133.110 - - [24/May/2006:23:31:07 -0400] "GET /robots.txt HTTP/1.0" 200 895 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"


I find the fun in everything.


Ah. You've encountered "the spider that never forgets". At some point in the past Yahoo managed to get onto cookie_usage.php. Unlike most search engines which just follow links, Yahoo remembers all the pages it has visited and tries to fetch them - again and again and again. Even in the face of repeated 404s.

 

In your case, Yahoo sees nothing amiss. Do add cookie_usage.php to robots.txt - it will probably help - in a year or two - maybe. The spider does not have a session, despite what your earlier "who's online" excerpt suggested.
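
For reference, the robots.txt rule being suggested would look something like this. It is a minimal fragment that assumes cookie_usage.php sits in the site root; a store installed in a subdirectory would need the path adjusted.

```text
User-agent: *
Disallow: /cookie_usage.php
```

This only asks well-behaved crawlers to stop requesting the page; it does not remove URLs a crawler already remembers.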


It is in robots.txt... but, yes at one time... when I was young and stupid. LOL... it did go there. I guess that's just the way it is.

 

Nothing I can do to help it I suppose?


I find the fun in everything.


Hi Steve,

thanks for your quick answer.

I have cookie_usage.php in my robots.txt, so I guess with that I am done, right?

Great.

Tonight I had a visit from Slurp (Yahoo). Only one click was tracked, with a very short visit time of 1 second.

Maybe I'm in the wrong forum or thread for my question, but is this correct, only one click?

I have a startup page with a single link on it ("to the shop").

Thanks,
Regards
Andreas

 

 


Andreas, Question.

 

How were you able to track the number of clicks?

 

Pete


I find the fun in everything.


Most contributions such as Who's Online should do that.

Or you may have an analysis tool from your hosting provider.

 

Regards
