stevel

Updated spiders.txt Official Support Topic


... do I just FTP spiders.txt to my site and replace the default one? Is that it?

If you're looking at the same contribution I am (http://addons.oscommerce.com/info/2455), it says "A replacement for catalog/includes/spiders.txt - updated with newly seen spiders and optimized for quicker processing. For 2.2-MS2 or later." The readme file is worth looking at as well.

 

To answer your question directly, yes, that's all you have to do. Decide which file you want to use, rename it if you pick the large one, and replace the stock one.

Thanks! Very easy!

Hi all, I am still having problems with livebot getting session IDs. Googlebot does not, but livebot and msnbot are starting to annoy me.

 

livebot-65-55-210-42.search.live.com 22:55:16 22:55:16 /cookie_usage.php Yes Not Found

Name: Guest

 

ID: 0

 

IP Address: 65.55.210.42

 

User Agent: msnbot/1.1 (+http://search.msn.com/msnbot.htm)

 

I do have nbot in my spiders.txt, but it does not seem to work. Any clues would be appreciated.

I do not trust the display you are showing here. Post the entry from your web access log showing the GET of the page from msnbot.

 

 

GET /index.php cPath=42 80 - 65.55.210.37 msnbot/1.1+(+http://search.msn.com/msnbot.htm) 200 0 0 5654 298

GET /shopping_cart.php osCsid=1red00mdgjglncmjijk8rg5ig0 80 - 65.55.210.35 msnbot/1.1+(+http://search.msn.com/msnbot.htm) 200 0 0 4751 331

 

It keeps getting a session ID and being identified in Who's Online as a customer, not a bot.

I also have this one; I am not sure why it is spidering my site:

 

GET /index.php - 80 - 208.122.4.142 FreeWebMonitoring+SiteChecker/0.1+(+http://www.freewebmonitoring.com) 200 0 64 343 238

Are you sure that msnbot hasn't held onto an old osCsid and is using that to access your site? If the first access has the ID, that is likely.

 

That SiteChecker is probably not spidering your site - it is just looking to see if the site is up. You should not see any access other than index.php.

That is what I thought at first, but I have never seen msnbot without a session ID.

 

Is there a way I can force a 301 redirect if the page is hit by a bot listed in spiders.txt? Over time that should remove any session IDs from the index.

Your help, by the way, is greatly appreciated!

url is http:// shop . calibraweighing . co.uk

My test shows that msnbot does not get assigned a session on new visits. You may need to get rid of the sessions that it has previously indexed with Spider Session Remover.

Thank you for that. The mod is an Apache rewrite, which IIS cannot do out of the box, so it is not much use to me.

Hmm. Well, you can do the equivalent in PHP by searching the user agent string for msnbot and if you find it and $session_started is true, use the "header" command to do a 301 redirect to the same URL minus the sid.

Hmm, what an idea. Even better, if we spent some time turning spiders.txt into an array first, this would speed up the processing somewhat, don't you think?

It gets turned into an array when processed in application_top.php. But what you're implying is that the array will get searched for every connection. As it is now, it gets searched only if there is no sid in the URL (or cookie), and then the only effect is to not start a new session.
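
As a rough sketch of that flow (not the actual application_top.php code): the file is read into an array, one entry per line, and the array is searched only when the request carries no sid. The function names `load_spiders` and `should_start_session` and the lowercase substring match are assumptions made for illustration.

```php
<?php
// Hypothetical helper: read spiders.txt into an array, one entry per line.
function load_spiders($path) {
  $lines = is_readable($path)
    ? file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
    : array();
  // Trim whitespace, lowercase, and drop entries that end up empty.
  $clean = array_map('strtolower', array_map('trim', $lines));
  return array_values(array_filter($clean, 'strlen'));
}

// Hypothetical helper: decide whether a new session should be started.
// $has_sid is true when a sid already arrived in the URL or a cookie.
function should_start_session($has_sid, $user_agent, array $spiders) {
  if ($has_sid) {
    return true; // existing session: the spider list is never consulted
  }
  $user_agent = strtolower($user_agent);
  foreach ($spiders as $spider) {
    if (strpos($user_agent, $spider) !== false) {
      return false; // known spider without a sid: do not start a session
    }
  }
  return true;
}
```

Note that a spider arriving with an old sid in the URL skips the list entirely in this sketch, which would match the msnbot behaviour reported above.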

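Maybe something like this. My PHP is a bit rusty, so don't laugh!

<?php
// Only act when the session ID appears in the request URL.
if (stripos($_SERVER['REQUEST_URI'], 'oscsid') !== false) {
  $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  $bots = array('msnbot', 'nbot');
  // eregi() cannot search an array of names and no longer exists in PHP 7,
  // so test each name in turn with a case-insensitive substring match.
  foreach ($bots as $bot) {
    if (stripos($user_agent, $bot) !== false) {
      // Send the 301 status together with the Location header.
      header('Location: http://www.example.com/newurl.html', true, 301);
      exit();
    }
  }
}
?>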

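Or maybe add an & between the entries in spiders.txt, then:

<?php
// Only act when the session ID appears in the request URL.
if (stripos($_SERVER['REQUEST_URI'], 'oscsid') !== false) {
  // Read the '&'-separated spider names into an array, skipping empty entries.
  $contents = file_get_contents('spiders.txt');
  $spiders_array = array_filter(array_map('trim', explode('&', $contents)), 'strlen');
  $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

  // Case-insensitive substring match of each spider name against the user agent.
  foreach ($spiders_array as $spider) {
    if (stripos($user_agent, $spider) !== false) {
      header('Location: http://www.example.com/newurl.html', true, 301);
      exit();
    }
  }
}
?>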

You like eregi, don't you? :D I would generally code the test for the SID as isset($_GET['osCsid']). I don't quite get what you're doing with the &. The "prevent spider sessions" code already creates an array, one element per record in spiders.txt.

 

I think a good compromise would be to test for the GET parameter, because for MOST users, that will be present for only one page and the cookie will take care of the rest. So if the osCsid GET parameter is present, do the spiders.txt search anyway, and if found (and the sid is in the URL), do the 301 redirect.
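
Something along these lines, as a minimal sketch rather than tested shop code. The function names `strip_sid` and `is_listed_spider` are made up for illustration, the spider list is hard-coded instead of read from spiders.txt, and the guard at the bottom assumes it runs before any output is sent.

```php
<?php
// Hypothetical helper: rebuild the request URI without the osCsid parameter.
function strip_sid($request_uri, $sid_name = 'osCsid') {
  $parts = parse_url($request_uri);
  $path = isset($parts['path']) ? $parts['path'] : '/';
  $params = array();
  if (isset($parts['query'])) {
    parse_str($parts['query'], $params);
  }
  unset($params[$sid_name]);
  $query = http_build_query($params);
  return $query === '' ? $path : $path . '?' . $query;
}

// Hypothetical helper: case-insensitive substring match against the spider list.
function is_listed_spider($user_agent, array $spiders) {
  foreach ($spiders as $spider) {
    if ($spider !== '' && stripos($user_agent, $spider) !== false) {
      return true;
    }
  }
  return false;
}

// Only search the list when the sid is in the URL, then 301 to the clean URL.
$spiders = array('msnbot', 'googlebot'); // assumption: normally loaded from spiders.txt
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (isset($_GET['osCsid'], $_SERVER['REQUEST_URI']) && is_listed_spider($agent, $spiders)) {
  header('Location: ' . strip_sid($_SERVER['REQUEST_URI']), true, 301);
  exit();
}
```

Ordinary visitors never reach the list search unless the sid is in the URL, so the extra cost stays limited to the first page view for most of them.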

Hi, I downloaded the updated spiders.txt, but right now it seems a bot is crawling my site:

64.124.148.21 k01.fatlens.com

64.124.148.22 k02.fatlens.com

64.124.148.23 k03.fatlens.com

64.124.148.24 k04.fatlens.com

64.124.148.26 k06.fatlens.com

64.124.148.27 k07.fatlens.com

64.124.148.28 k08.fatlens.com

64.124.148.65 k10.fatlens.com

64.124.148.66 k11.fatlens.com

64.124.148.67 k12.fatlens.com

 

And it's creating sessions. I searched Google for fatlens, and it seems it's a bot from thefind.com (I added my site there some days ago).

I tried to find something similar to "fatlens" in spiders.txt but didn't find anything. What should I do?

 

thanks

Show me a line from your access log for one of these. The IP doesn't help.

 

A Google search suggests that the user agent includes the string "Fatbot" which the spiders.txt string "tbot" should pick up.
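
To check by hand which spiders.txt entry would catch a given user agent, a small throwaway script helps. This is only a sketch: the function name `matching_spider` is made up, the match is assumed to be a case-insensitive substring test, and the Fatbot user agent string below is a placeholder, not a real log entry.

```php
<?php
// Hypothetical helper: return the first list entry found inside the user agent,
// or null when nothing matches.
function matching_spider($user_agent, array $spiders) {
  foreach ($spiders as $spider) {
    $spider = trim($spider);
    if ($spider !== '' && stripos($user_agent, $spider) !== false) {
      return $spider;
    }
  }
  return null;
}

$spiders = array('googlebot', 'tbot', 'nbot');

// Placeholder user agent containing "Fatbot"; the real string may differ.
var_dump(matching_spider('ExampleCrawler/1.0 (Fatbot)', $spiders));
// User agent taken from the msnbot log lines earlier in the thread.
var_dump(matching_spider('msnbot/1.1 (+http://search.msn.com/msnbot.htm)', $spiders));
```

Once you have the real user agent from your access log, paste it in place of the placeholder.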

Thanks for the fast reply, Steve. Sorry, I'm a newbie at this and not sure where to look for my access log. I installed "Visitor Web Stats" and "Who's online enhancement" and that's what I'm using, but I think you mean something else.

In Visitor Web Stats it just shows as:

64.124.148.67

k12.fatlens.com 05/04/2008 00:00:36 1>>> 00:00:00 Guest en-us,en;q=0.5

english Direct

 

for comparison, another line for the googlebot shows

66.249.72.137

crawl-66-249-72-137.googlebot.com 05/03/2008 07:23:11 36>>> 16:36:46 Guest [Mozilla]

english Direct

 

 

And in Who's Online, it's just k01.fatlens.com.

Where should I look for the log? Does my host have to give me access?

Thanks, and sorry for my English :)
