stevel

Updated spiders.txt Official Support Topic

I can't find a robots.txt file on my server, and someone told me that I need one. Can someone please provide a link to download it?

 

Thanks

Rishi Patel

Look in the contributions area. Search for robots.


The Coopco Underwear Shop

 

If you live to be 100 years of age, that means you have lived for 36,525 days. Don't waste another, there aren't many left.

Hi - I can see my robots.txt file but I cannot see any spiders.txt file.... how do I see it?

 

I am looking in the public_html folder and it's not there.

 

Please advise....

 

Thanks


Whats the point of a signature?

Look in your catalog/includes/ directory for spiders.txt.

robots.txt is in catalog/


The Coopco Underwear Shop

 

If you live to be 100 years of age, that means you have lived for 36,525 days. Don't waste another, there aren't many left.

I just spotted this one browsing as a guest:

msnbot-Products

Name: msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)

IP Address: 65.55.252.40

Isn't the nbot entry in spiders.txt supposed to catch these? Or, because -Products/1.0 is there, do I need to add a new entry?

Hmm. On second look, it IS being treated as a bot everywhere except the categories. It doesn't seem to like my SEO URL structure for categories. Has anyone else ever run into this?

Google and Yahoo have no problem with my categories like this (as in, they are not shown as a customer/guest on Who's Online when indexing categories, like msnbot-Products is).

Edited by eww

What behavior do you see that is a problem? Do you see it having a session created? spiders.txt would prevent that.
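For anyone wondering why a short entry like nbot would cover msnbot-Products/1.0: as far as I understand it, the spider check lowercases the user agent and looks for each spiders.txt entry as a substring. Here is a minimal standalone sketch of that idea. The function name is_spider() and the three sample entries are my own illustration, not the stock osCommerce code.

```php
<?php
// Sketch of substring-based spider detection.
// is_spider() and $entries are hypothetical names used for illustration only.
function is_spider(string $userAgent, array $spiderEntries): bool
{
    $userAgent = strtolower($userAgent);
    foreach ($spiderEntries as $entry) {
        $entry = strtolower(trim($entry));
        if ($entry === '') {
            continue; // ignore blank lines in the list
        }
        if (strpos($userAgent, $entry) !== false) {
            return true; // any entry found inside the user agent marks a spider
        }
    }
    return false;
}

$entries = ['googlebot', 'nbot', 'slurp']; // assumed subset of a spiders.txt file
$ua = 'msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)';
var_dump(is_spider($ua, $entries)); // bool(true), because "msnbot" contains "nbot"
```

If the match works this way, no separate entry is needed for the -Products variant. A browser-style user agent such as MSIE 6 contains none of the entries, so it is treated as a normal visitor.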

It seems that a bot is crawling my site, creating sessions and adding random products to the cart. Here are some lines from my access log:

208.99.195.54 - - [30/Dec/2008:19:20:31 -0700] "GET / HTTP/1.1" 200 6882

208.99.195.54 - - [30/Dec/2008:22:17:24 -0700] "GET /product_info.php?pName=ts&osCsid=f735ae1ed1e7085d43bece7f1bb19579 HTTP/1.0" 200 15237 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"

208.99.195.54 - - [30/Dec/2008:22:18:22 -0700] "GET /account_history_info.php HTTP/1.0" 302 26 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"

208.99.195.54 - - [30/Dec/2008:22:18:53 -0700] "GET /account.php?osCsid=f735ae1ed1e7085d43bece7f1bb19579 HTTP/1.0" 302 26 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"

208.99.195.54 - - [30/Dec/2008:22:26:35 -0700] "GET /create_account.php?guest_account=true&osCsid=f735ae1ed1e7085d43bece7f1bb19579 HTTP/1.0" 200 15170

208.99.195.54 - - [30/Dec/2008:22:47:55 -0700] "GET /account_edit.php HTTP/1.0" 302 26

208.99.195.54 - - [30/Dec/2008:22:47:58 -0700] "GET /account_newsletters.php HTTP/1.0" 302 26

208.99.195.54 - - [30/Dec/2008:22:52:40 -0700] "GET /document.all. HTTP/1.0" 404 309 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"

208.99.195.54 - - [30/Dec/2008:22:52:43 -0700] "GET /checkout_shipping.php HTTP/1.0" 302 26

Why do you think this is a bot? I see no evidence of that. Everything, including the times of access, suggests a human. The user agent is that of MSIE 6; while this can be forged, it also makes it impossible to filter out based on spiders.txt.

Well, its behavior suggests to me that it's a bot. It creates a cart, leaves, then comes back, creates a new cart, and always adds many random products, with no consistency. It jumps from page to page and never stays longer than two seconds on each page. It can get as far as checkout_shipping.php without even logging in or getting a cart. The access times you see were just picked at random from the access log. As I said, its activity changes every second or two. If I posted the whole activity log of this "human/bot", you would see that the behavior is strange for a human.
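The "never stays longer than two seconds" claim can be checked against the access log instead of eyeballing it. Below is a rough sketch that measures what fraction of one IP's consecutive requests arrive within two seconds of each other. The function name, the two-second threshold, and the sample lines (which use a documentation-only IP address) are all made up for illustration.

```php
<?php
// Sketch: what share of an IP's consecutive requests are <= $maxGap seconds apart?
// rapid_request_ratio() is a hypothetical helper, not part of osCommerce.
function rapid_request_ratio(array $logLines, string $ip, int $maxGap = 2): ?float
{
    $times = [];
    foreach ($logLines as $line) {
        // Expect lines starting with: IP - - [dd/Mon/yyyy:hh:mm:ss +zzzz]
        if (!preg_match('/^(\S+) \S+ \S+ \[([^\]]+)\]/', $line, $m)) {
            continue;
        }
        if ($m[1] !== $ip) {
            continue;
        }
        $dt = DateTime::createFromFormat('d/M/Y:H:i:s O', $m[2]);
        if ($dt !== false) {
            $times[] = $dt->getTimestamp();
        }
    }
    if (count($times) < 2) {
        return null; // not enough requests to measure any gap
    }
    sort($times);
    $fast = 0;
    for ($i = 1; $i < count($times); $i++) {
        if ($times[$i] - $times[$i - 1] <= $maxGap) {
            $fast++;
        }
    }
    return $fast / (count($times) - 1);
}

// Made-up sample lines, for demonstration only.
$sample = [
    '192.0.2.10 - - [30/Dec/2008:22:00:00 -0700] "GET / HTTP/1.0" 200 100',
    '192.0.2.10 - - [30/Dec/2008:22:00:01 -0700] "GET /a.php HTTP/1.0" 200 100',
    '192.0.2.10 - - [30/Dec/2008:22:00:02 -0700] "GET /b.php HTTP/1.0" 200 100',
    '192.0.2.10 - - [30/Dec/2008:22:05:00 -0700] "GET /c.php HTTP/1.0" 200 100',
];
printf("%.2f\n", rapid_request_ratio($sample, '192.0.2.10')); // prints 0.67
```

A ratio close to 1 over a long stretch of the log would support the bot theory. It still cannot prove it, and it gives spiders.txt nothing to match on.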

Ok, but there's nothing in the information available which one could use to automatically decide that it's a bot. Is the IP address always the same? If so, you could block it in .htaccess or just add a test for that IP in the spider test.

What do you mean by saying add a test for that IP in the spider test? How do I do that? Yes, the IP is always the same.

In includes/application_top.php, just after this code:

 

  } elseif (SESSION_BLOCK_SPIDERS == 'True') {
    $user_agent = strtolower(getenv('HTTP_USER_AGENT'));
    $spider_flag = false;

add this:

    if (tep_get_ip_address() == '208.99.195.54') $spider_flag = true;

I have a frequent visitor that is referred by the following URL:

http://search.live.com/results.aspx?q=cookie

They have a session and usually have one item in their cart. The last set of digits in their IP address changes, and the IP always resolves to Microsoft Corp.

 

http://www.showmyip.com/?ip=65.55.109.146

 

If this is a bot, why does it have a session and why doesn't it show as a bot? I have the enhanced who's online added to the store.

 

Tim

 

Also, I updated my spiders.txt file, and Prevent Spider Sessions is set to True.

What is the user agent string from the access log? Does the first URL it tries contain an osCsid= session ID in the URL?

 

I'd be rather astonished that any bot has a referral URL at all.
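If it helps with pulling those details out of the log, here is a small sketch that splits one Apache-style access log line into its fields and reports whether the requested URL already carries an osCsid= parameter. The function name and the returned array keys are my own choices, and the pattern assumes the common/combined log layout seen in the entries posted earlier in this thread.

```php
<?php
// Sketch: extract IP, URL, referrer and user agent from one access log line.
// parse_log_line() is a hypothetical helper; the key names are arbitrary.
function parse_log_line(string $line): ?array
{
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)'
             . '(?: "([^"]*)" "([^"]*)")?/';
    if (!preg_match($pattern, $line, $m)) {
        return null; // line is not in the expected format
    }
    return [
        'ip'             => $m[1],
        'time'           => $m[2],
        'method'         => $m[3],
        'url'            => $m[4],
        'status'         => (int) $m[5],
        'referer'        => $m[7] ?? '', // missing when the line has no referrer field
        'user_agent'     => $m[8] ?? '',
        'has_session_id' => strpos($m[4], 'osCsid=') !== false,
    ];
}

// One of the log lines posted earlier in this thread.
$line = '208.99.195.54 - - [30/Dec/2008:22:17:24 -0700] "GET /product_info.php?pName=ts&osCsid=f735ae1ed1e7085d43bece7f1bb19579 HTTP/1.0" 200 15237 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"';
$entry = parse_log_line($line);
echo $entry['user_agent'], "\n";
var_dump($entry['has_session_id']); // bool(true)
```

The first request from a given IP is the one to look at: if it already has an osCsid in the URL, the visitor probably followed a link that had a session ID baked in.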

Steve,

 

Here are some entries from the access log:

 

65.55.110.167 - - [14/Jan/2009:06:34:04 -0500] "GET /immobilizer-900000-volt-cell-phone-stun-p-616.html?action=buy_now&page=1&sort=2d HTTP/1.0" 302 - "http://search.live.com/results.aspx?q=cookie" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
65.55.110.167 - - [14/Jan/2009:06:34:04 -0500] "GET /shopping_cart.php?cPath=28&page=1&sort=2d&osCsid=ef787039f88b28f4739d1789f3c2c213 HTTP/1.0" 200 34252 "http://search.live.com/results.aspx?q=cookie" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
65.55.110.167 - - [14/Jan/2009:06:34:05 -0500] "GET /stylesheet.css HTTP/1.0" 200 6637 "http://myknifestore.net/shopping_cart.php?cPath=28&page=1&sort=2d&osCsid=ef787039f88b28f4739d1789f3c2c213" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322

 

 

 

65.55.109.237 - - [14/Jan/2009:08:55:26 -0500] "GET /military-issue-swmi-p-364.html?action=buy_now&page=1&sort=2d HTTP/1.0" 302 - "http://search.live.com/results.aspx?q=cookie" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
65.55.109.237 - - [14/Jan/2009:08:55:27 -0500] "GET /shopping_cart.php?cPath=35&page=1&sort=2d&osCsid=13094c900428b9ca1531b764867defd3 HTTP/1.0" 200 34335 "http://search.live.com/results.aspx?q=cookie" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
65.55.109.237 - - [14/Jan/2009:08:55:27 -0500] "GET /stylesheet.css HTTP/1.0" 200 6637 "http://myknifestore.net/shopping_cart.php?cPath=35&page=1&sort=2d&osCsid=13094c900428b9ca1531b764867defd3" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322

As far as Prevent Spider Sessions is concerned, that's a human user. There's nothing one can use to say "this is a bot". In fact, what you show looks very much like a human clicking on a Buy Now button, complete with loading a stylesheet (which a bot would NEVER do.) I will agree, though, that the IP is for msnbot. Very weird.

I realize there is no accounting for human behavior, but this is really weird. This user comes back numerous times with the same referrer, adds one item to the cart, and leaves. The last set of digits in the IP changes, but it resolves to MSN every time I check it.

 

Thanks for taking the time to look at this,

 

Tim

 

The activity that knifeman described comes from msnbot's cloaking detector, which has been running since the summer or fall of '07. See this blog post from Vanessa Fox on Search Engine Land, which discusses the behavior of an early version of this non-bot bot. It continues to crawl my site from time to time. I just live with it, but it is annoying. I may block the subnet(s) that it runs from in application_top.php, as suggested a few posts back for another bot that doesn't identify itself, though changing the Buy Now buttons to form buttons may be an even better idea.
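For the subnet idea, a single-IP comparison like the one posted earlier won't work, since the last octet keeps changing. A rough sketch of a CIDR range test follows. The helper name ip_in_cidr() is mine, it handles IPv4 only, it assumes 64-bit PHP, and the 65.55.109.0/24 range is only a guess based on the addresses quoted in this thread, not a verified list of the checker's subnets.

```php
<?php
// Sketch: test whether an IPv4 address falls inside a CIDR range.
// ip_in_cidr() is a hypothetical helper; assumes 64-bit PHP integers.
function ip_in_cidr(string $ip, string $cidr): bool
{
    $parts = explode('/', $cidr, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[1])) {
        return false; // not in "a.b.c.d/n" form
    }
    $bits = (int) $parts[1];
    $ipLong = ip2long($ip);
    $netLong = ip2long($parts[0]);
    if ($ipLong === false || $netLong === false || $bits > 32) {
        return false;
    }
    // Build a mask with the top $bits bits set, e.g. /24 -> 0xFFFFFF00.
    $mask = $bits === 0 ? 0 : ((~0 << (32 - $bits)) & 0xFFFFFFFF);
    return ($ipLong & $mask) === ($netLong & $mask);
}

var_dump(ip_in_cidr('65.55.109.146', '65.55.109.0/24')); // bool(true)
var_dump(ip_in_cidr('65.55.110.167', '65.55.109.0/24')); // bool(false)
```

In application_top.php the call would presumably replace the string comparison, passing tep_get_ip_address() as the first argument and setting $spider_flag to true on a match. I haven't tested that against a live store.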

 

--Glen

Edited by SteveDallas

The best thing I can suggest is to replace your Buy Now link buttons with a form button. There's a contrib to do this. Few if any bots follow forms.

 

I searched for this form in the contrib library, but couldn't find it. Do you remember the title or contrib number?

 

--Glen

Thanks Glen,

 

That explains everything. It was bugging me because the bot crawls my site every day. I checked with MSN and I am not blocked.

 

Tim

 


I ended up changing my "Buy Now" buttons to forms, as outlined in the SID Killer contribution. While the MSN cloaking checker still crawls my site, usually a minute or two after msnbot has indexed the same page, I haven't noticed it creating carts. In making the change, I removed a previous mod that I had installed that enables the Buy Now button only if a session ID has been assigned. I figured that I no longer needed that, since the whole point of making form buttons is that bots won't follow them.
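To give a rough idea of what "changing the buttons to forms" means in markup terms, here is a generic sketch that renders a Buy Now button as a POST form rather than a plain link. This is not the SID Killer code. The function name, the products_id field name, and the example action URL are assumptions for illustration, and the real contribution may build its form differently.

```php
<?php
// Sketch: render "Buy Now" as a POST form instead of an <a href> link,
// so crawlers that only follow links never trigger the add-to-cart action.
// render_buy_now_form() and the field name are hypothetical.
function render_buy_now_form(string $actionUrl, int $productId, string $label = 'Buy Now'): string
{
    $action = htmlspecialchars($actionUrl, ENT_QUOTES, 'UTF-8');
    $text = htmlspecialchars($label, ENT_QUOTES, 'UTF-8');
    return '<form method="post" action="' . $action . '">'
         . '<input type="hidden" name="products_id" value="' . $productId . '">'
         . '<input type="submit" value="' . $text . '">'
         . '</form>';
}

echo render_buy_now_form('index.php?action=buy_now', 616), "\n";
```

The point of the design is only that the add-to-cart request is no longer reachable through a crawlable link, which matches what I saw: the checker still visits the pages but no longer creates carts.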

 

If it's any consolation, the cloaking checker seems to visit less often over time.

 

--Glen
