sheepiedog Posted May 18, 2006

Thank you. tbot was already there at the top section of my latest update of spiders.txt, before bot/. Being unsure, I added it to the bottom section as well.
stevel Posted May 18, 2006

Well, it doesn't need to be there twice. I'll look at this more this evening and see if I can spot a problem. Send me a PM with your store URL (or post it here) so I can verify that you have Prevent Spider Sessions turned on properly.

Steve
Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
sheepiedog Posted May 18, 2006

Thank you - I sent you a PM. I have added them to my robots.txt as:

User-agent: fatbot
Disallow: /

I am hoping this is correct to get rid of them. They have 8 separate connections to my site now. I don't want them eating my bandwidth, and they are messing up my who's online with so many connections. Perhaps I should ban their IP? How would I do this?
stevel Posted May 18, 2006

Unfortunately, you don't know if this robot even reads robots.txt, and if it did, you don't know what identifier it uses in robots.txt. You could ban the IP for now if you wanted to, but I assume that at some point you wouldn't mind it indexing your store if it was well behaved. What puzzles me, though, is that you told me in your PM that you had Force Cookie Use set to true. If so, then there's no way a spider should be able to add items to a cart. I'll test this too.
sheepiedog Posted May 19, 2006

You were right: adding them to robots.txt didn't do anything (perhaps the wrong name). I banned their IPs through .htaccess and this has gotten rid of them. At some point I may want them to spider, but they have been there for days, and with 8 different connections filling cart items - yikes, enough of that. The settings I sent you in the PM were copied and pasted directly from my store configuration.
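For anyone wondering how the .htaccess ban works, a minimal sketch using the Apache 2.x access-control directives that were current at the time; the IP below is a documentation placeholder, not the actual crawler's address:

```apache
# Deny a misbehaving crawler by IP (Apache mod_authz_host, 2.2-era syntax)
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
```

Unlike robots.txt, this is enforced by the server itself, so it works even for crawlers that ignore robots.txt; the downside is it also blocks them once they start behaving.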
boxtel Posted May 19, 2006

(quoting stevel) "...If so, then there's no way a spider should be able to add items to a cart."

There are spiders that actually do accept cookies.

Treasurer MFC
stevel Posted May 19, 2006

OK, this is interesting. If you have Force Cookie Use set to on, then "Prevent Spider Sessions" is bypassed. That means that if the spider accepts a cookie, it will get a session, no matter what is in spiders.txt.
boxtel Posted May 19, 2006

Exactly. I rewrote the spider stuff a long time ago in the belief that no spider accepts cookies. So I always read the possible cookie first; if I can, I know it cannot be a spider and I need not go through the spiders-list evaluation, which saves time. That was until I suddenly got spiders filling baskets. So I made the checks anyway, but wrote an entry in the error log whenever I had a client who had a cookie set but was still identified as a spider by the list. Not many, but they are there.
stevel Posted May 19, 2006

Amanda, care to share your code for this? I always hated the way osC did cookies - it seemed to me that it should be possible to store the cookie right away if the browser accepted it, and fall back on the SID in the URL if not. But I never got "a round tuit" to see if I could figure out how to implement this. Force Cookie Use also doesn't work for shared SSL because of the overly simplistic code in tep_redirect.
boxtel Posted May 19, 2006

Well, I posted that a long time ago in Tips and Tricks under "speeding things up with a cookie". Basically, osC only sets a test cookie if you force cookies - a waste; why not always set a test cookie? Next to that, I also set cookies for screen resolution and response time anyway via BR&R, so I can pick either one. Those cookies last up to 30 days or so and are renewed. So when a client comes in, I always check to see if I can read that test cookie, and I always try to set one again. If I can read it, why check the spider list on every page load? Of course, on the very, very first visit or after cookie expiration, the spiders list needs to be consulted again, but even then only on the first page load, unless cookies are blocked. However, with a few spiders seemingly having my cookie, I have to do an extra check on them just the same. It also means that force cookies is not a foolproof defence against spiders.
How I have it in application_top.php:

    // spider identification
    $user_agent = strtolower(getenv('HTTP_USER_AGENT'));
    $session_started = false;
    $spider_flag = false;
    $cookies_exist = false;

    // always set a test cookie
    tep_setcookie('cookie_test', 'ThankYou', time()+60*60*24*30, $cookie_path, $cookie_domain);

    if ((isset($_COOKIE['cookie_test'])) && ($_COOKIE['cookie_test'] != '')) {
      // cookie present
      $cookies_exist = true;
      if ( (stristr($user_agent, 'wisenutbot')) or
           (stristr($user_agent, 'omniexplorer')) or
           (stristr($user_agent, 'converacrawler')) ) {
        // known spiders which have my cookie; write agent to error log to verify
        // ($browser_ip is assumed to be set earlier in application_top.php)
        error_log('spider cookie: '.$_COOKIE['cookie_test']."\n".'Agent: '.$user_agent."\n".'ip: '.$browser_ip."\n");
        $spider_flag = true;
      } else {
        tep_session_start();
        $session_started = true;
      }
    } else {
      // no cookie set yet, check spiders list
      require(DIR_WS_INCLUDES . 'spider_check.php');
      if (!$spider_flag) {
        tep_session_start();
        $session_started = true;
      }
    }

    // used to suppress session id in url
    if (!tep_session_is_registered('cookies_exist')) {
      tep_session_register('cookies_exist');
    }
boxtel Posted May 23, 2006

Here you can see from the error log that this spider has the cookie set. The content is simply a space and not the actual value we set it to, but still: if you check for the existence of the test cookie, it will return true.

[22-May-2006 03:58:38] spider with cookie: Agent:converacrawler/0.9d (+http://www.authoritativeweb.com/crawl) ip: 63.241.61.7
[22-May-2006 03:58:58] spider with cookie: Agent:converacrawler/0.9d (+http://www.authoritativeweb.com/crawl) ip: 63.241.61.7
Andreas2003 Posted May 25, 2006

Hi there, got a quick question: I have enabled "Prevent Spider Sessions" in my shop, I use robots.txt, and of course have an updated version of your great spiders.txt file included. Is this enough to prevent spiders (those included in the spiders file) from taking the session ID into the index? I have no SID killer or similar installed yet. As I have read in this thread, there could be a problem with the "buy now" button. Can you give me a recommendation on what to do about that problem? Thanks in advance, kind regards, Andreas
stevel Posted May 25, 2006

If the spider does not have a session, it won't be able to add items to the cart through "buy now", but it will try to follow the link and will end up at the cookie_usage page. If you have converted the buy now buttons to forms, this won't happen. For the spiders that accept cookies, though, they may still add items to the cart by following buy now links.
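For reference, converting a buy now link to a form means the add-to-cart action goes through a POST instead of a crawlable GET link, which spiders do not submit. A minimal sketch of the idea; the product ID and parameter names here are illustrative, not osCommerce's exact markup:

```html
<!-- A GET link like this is followed by spiders and can add items to a cart: -->
<!-- <a href="product_info.php?products_id=42&action=buy_now">Buy now</a> -->

<!-- The same action as a POST form is not followed by link-crawling spiders: -->
<form method="post" action="product_info.php?products_id=42">
  <input type="hidden" name="action" value="buy_now">
  <input type="submit" value="Buy now">
</form>
```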
FixItPete Posted May 25, 2006

I have this contribution in combination with a good robots.txt file, and lately I've been seeing a lot of bots in my cookie usage page and other places they shouldn't be. Has something changed that I'm missing? Pete

I find the fun in everything.
FixItPete Posted May 25, 2006

UGH! Here are my settings:

Session Directory: /tmp
Force Cookie Use: False
Check SSL Session ID: False
Check User Agent: True
Check IP Address: False
Prevent Spider Sessions: True
Recreate Session: False

Here is my current who's online (enhanced):

Active Bot with session 00:00:00 Mozilla 72.30.132.23 17:51:40 17:51:40 /cookie_usage.php No Not Found
Active Bot with session 00:00:00 Mozilla 68.142.249.88 17:50:19 17:50:19 /cookie_usage.php No Not Found
Active Bot with session 00:00:00 Mozilla 72.30.103.8 17:48:44 17:48:44 /small-pink-floating-heart-candle-iridescent-opalescent-glitter- No Not Found
Active Bot with session 00:00:00 Mozilla 72.30.128.12 17:48:09 17:48:09 /cookie_usage.php No Not Found
Inactive Bot with session 00:00:00 Mozilla 72.30.110.26 17:47:31 17:47:31 /cookie_usage.php No Not Found
Inactive with no Cart 00:00:47 Guest Admin 17:43:57 17:44:44 /create_account.php Yes Not Found
Inactive Bot with session 00:00:00 Mozilla 72.30.129.113 17:39:37 17:39:37 /cookie_usage.php No Not Found

Things have not been like this in a LONG time... what the heck is wrong? Help! Thanks, Pete
stevel Posted May 26, 2006

Nothing has changed that I know of. Those IPs are Yahoo search, but that doesn't really tell me anything. I'd want to see the entries from the access log for these visits. My guess is that you have "Buy It" links that the bot is following.
FixItPete Posted May 26, 2006

I am almost certain I have eliminated all my buy it now buttons... Here is a sample from my log:

72.30.133.110 - - [24/May/2006:23:59:18 -0400] "GET /robots.txt HTTP/1.0" 200 895 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Is that what you're looking for? The site is: thebestcandles (dot) com. Thanks for helping, Pete
stevel Posted May 26, 2006

Almost. I want to see the entry from the bot that got a 302 redirect to cookie_usage.php. It will usually be the entry just before the one for cookie_usage.php itself.
FixItPete Posted May 26, 2006

I'm not sure I understand what you're looking for. When I go to my server's "Raw Log" (which only shows me yesterday... but this problem was also around yesterday...), here's a "batch" that I see:

72.30.133.110 - - [24/May/2006:23:24:28 -0400] "GET /robots.txt HTTP/1.0" 200 895 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
72.30.103.225 - - [24/May/2006:23:24:29 -0400] "GET /cookie_usage.php HTTP/1.0" 200 24415 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
72.30.131.205 - - [24/May/2006:23:24:29 -0400] "GET /cookie_usage.php HTTP/1.0" 200 24402 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
72.30.133.110 - - [24/May/2006:23:31:07 -0400] "GET /robots.txt HTTP/1.0" 200 895 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
stevel Posted May 26, 2006

Ah. You've encountered "the spider that never forgets". At some point in the past, Yahoo managed to get onto cookie_usage.php. Unlike most search engines, which just follow links, Yahoo remembers all the pages it has visited and tries to fetch them - again and again and again, even in the face of repeated 404s. In your case, Yahoo sees nothing amiss. Do add cookie_usage.php to robots.txt - it will probably help - in a year or two - maybe. The spider does not have a session, despite what your earlier "who's online" excerpt suggested.
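For anyone following along, the robots.txt entry Steve suggests would look something like this (well-behaved crawlers honor it, though as he notes it can take a long time before a crawler drops a remembered URL):

```
User-agent: *
Disallow: /cookie_usage.php
```

robots.txt must sit at the site root, and Disallow matches by URL path prefix, so this blocks compliant crawlers from fetching the cookie usage page regardless of query string.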
FixItPete Posted May 26, 2006

It is in robots.txt... but yes, at one time, when I was young and stupid, LOL, it did go there. I guess that's just the way it is. Nothing I can do to help it, I suppose?
Andreas2003 Posted May 26, 2006

Hi Steve, thanks for your quick answer. I have cookie_usage.php in my robots.txt, so I guess with that I am done, or? Great. Tonight I had a visit from Slurp (Yahoo). Only one click was tracked, and a very short visit time of one second. Maybe I'm in the wrong forum or thread for my question, but is this correct, only one click? I have a startup page where one link is placed ("to the shop"). Thanks, regards, Andreas
FixItPete Posted May 26, 2006

Andreas, question: how were you able to track the number of clicks? Pete
stevel Posted May 26, 2006

Robots vary their behavior. The better ones don't access a lot of your pages at the same time.
Andreas2003 Posted May 27, 2006

Most of the contribs a la who's online etc. should do that. Or you may have an analysis tool from your website provider. Regards