boxtel Posted December 31, 2004 Quoting an earlier reply: "Lewis, Yep. Comes by my sites regularly. It's Microsoft's newest search engine that is supposed to compete with Google. ed" It is ever present. Treasurer MFC
ckyshop.co.uk Posted December 31, 2004 Thanks. Is it in spiders.txt? It appears in my who's online all the time, even when 'session create' is turned off. Thanks for any help/comments. Regards, Lewis Hill
stevel Posted December 31, 2004 Author It is in the "Updated spiders.txt" contribution. It is not in the spiders.txt that comes with 2.2-MS2. Steve Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
ckyshop.co.uk Posted January 3, 2005 I am using the updated spiders.txt but can't seem to find it... is it listed as msnbot? Cheers again
stevel Posted January 3, 2005 Author It's "nbot". Any substring will match, and this one catches two or more spiders.
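For anyone curious how that matching works: osC lowercases the user agent and does a plain substring test against each spiders.txt entry (in 2.2 the check lives in includes/application_top.php, if memory serves). A minimal Python sketch of the logic, illustrative only since the real code is PHP:

```python
def is_spider(user_agent, spider_strings):
    """Mimic osC's check: True if any spiders.txt entry is a
    substring of the lowercased user agent string."""
    ua = user_agent.lower()
    return any(s in ua for s in spider_strings if s)

# "nbot" catches MSN's spider:
print(is_spider("msnbot/1.0 (+http://search.msn.com/msnbot.htm)", ["nbot"]))  # True
# An ordinary browser does not match:
print(is_spider("Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)", ["nbot"]))  # False
```

This is why one short entry can cover several spiders: any agent whose name contains the fragment is matched.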
ckyshop.co.uk Posted January 3, 2005 Ahh, OK, I see. Sorry for not understanding. Thanks.
kilroy13 Posted January 24, 2005 Nice add. I was just having some issues last night with bots. I had four bots on the site at the same time (and they were all indexing a 5000-product site!). I updated my who's online earlier, so it showed Googlebot, NewMSNBot, and I can't remember the others... MSN was getting sessions, and when I did searches on MSN the links to the site did indeed have session IDs. The eXavaBot bot was showing 25+ individual entries in who's online. Hopefully this, along with a robots.txt that denies indexing of the product display page, will limit these guys.
stevel Posted January 24, 2005 Author Why would you want to deny the product info page to the spiders? I'd think that's exactly where you would want them to be. You just want to keep them off pages that require a session, so their indexes don't fill up with useless "cookie usage" pages.
Sincraft Posted February 9, 2005 Quick question for you. I was checking what an affiliate program was looking at and found out they were adding things to my cart. So my questions are: 1) how do I add them to my spiders.txt, and 2) how do I flush my database of useless carts? Http Code: 200 Date: Feb 09 00:17:32 Http Version: HTTP/1.0 Size in Bytes: 28702 Referer: - Agent: LinksManager.com (http://linksmanager.com/linkchecker.html)
stevel Posted February 9, 2005 Author I would suggest adding the string linksmanager to your spiders.txt. As for adding to carts, these would be session carts that will go away eventually. You could look at the sessions table for these entries and delete the records, but it may be more trouble than it's worth.
oldworldcharms Posted February 14, 2005 I am trying to understand the spiders.txt file. What is the difference between spiders.txt and robots.txt? I noticed I have a spiders.txt file in both the root and the includes folder. Where should it be? If I want to disallow a page, how can this be done? (Where is it placed, how is it written, and is there anything else I should know?) Thanks. Your help is appreciated. Elizabeth http://oldworldcharms.net
stevel Posted February 14, 2005 Author Elizabeth, spiders.txt and robots.txt have two very different purposes.

spiders.txt is used only by osCommerce and should reside in your catalog/includes folder. If you enable the "Prevent Spider Sessions" option in osC admin, osC will not start a session for a visit by a spider identified by a "user agent" string that contains any of the strings in spiders.txt. For example, Google's spider, when it opens a page on your site, has a user agent string containing "Googlebot". osC converts the user agent to lowercase and then matches each of the strings in spiders.txt against it; the string "ebot", for instance, is found in "googlebot". When a match is found, osC allows the page to load but does not start a session. This means that any of the site features that require a session, such as "add to cart" or opening any of the account pages, will be unavailable to the spider. The benefit of this is that spiders will not waste site resources adding items to carts through "Buy Now" buttons and updating session information. More important, it keeps spiders from including session IDs in their indexes.

robots.txt goes in the top-level folder of your site and is read by most spiders. The file contains your rules for which parts of your site a spider may not visit; this can be specified on a per-spider basis or for all spiders. The spider name you give in robots.txt is one assigned by the spider implementor and may have no relation to the user agent string.

You really should use both files: robots.txt to keep spiders out of pages you typically would not want indexed, such as account and cart pages, and spiders.txt to stop spiders from generating new sessions and from including session IDs in their indexes (it does not prevent spiders from indexing your site). For more information on robots.txt and how it is used, see The Robot Exclusion Standard.
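As a concrete illustration of the robots.txt side, here is a minimal sketch. The paths are examples only and assume a stock osC install under /catalog; adjust them to your own layout:

```
# robots.txt - illustrative only; paths assume osC lives in /catalog
User-agent: *
Disallow: /catalog/account.php
Disallow: /catalog/shopping_cart.php
Disallow: /catalog/login.php
Disallow: /catalog/checkout_shipping.php
```

Note that the rules apply to all spiders here (`User-agent: *`); a per-spider section would start with that spider's published name instead.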
Guest Posted February 19, 2005 Steve, First and foremost, you da Man!!! :thumbsup: I wanted to make you aware that the current spiders list will send a browser with the Teoma Toolbar to the cookies_usage page. Based on my research, Teoma's robot reports itself as: Mozilla/2.0 (compatible; Ask Jeeves/Teoma) I would recommend just adding a / in front of the current entry for teoma. I added this to my site and successfully tested with the Teoma Toolbar.
stevel Posted February 19, 2005 Author Marcello, Thanks for the suggestion. I had not heard of the Teoma Toolbar before. I'll include this in the next update.
Guest Posted February 21, 2005 Steve, I've had the following IPs trolling my site for the last two days: 222.95.33.29, 222.95.37.100. Showmyip.com shows: CHINANET jiangsu province network. The same URL will show up multiple times in who's online and list different referrers. The referrer changes almost every time they click, including: www.bbxiong.com, nitreaous.exitfuel.com, world-of-tour.net, www.freefa.net, www.jebest.com. Any ideas? I'm pretty sure it's a bot, but I don't know what the agent is. ed
stevel Posted February 21, 2005 Author No idea, sorry. I have seen some IP ranges access sites as if they were bots, but they either don't request robots.txt or supply an uninformative user agent; the latter makes it impossible to filter in spiders.txt. One could try to maintain a list of IPs and either filter them in .htaccess or add yet another mechanism, but unless the activity poses a problem for your site, my advice is to ignore it. At some point it starts costing you more processing power per legitimate page reference than you save by the added filtering.
Guest Posted February 21, 2005 Just wanted to drop a note to Steve and company: This contribution is essential, and your work keeping it up to date is appreciated by myself and everyone I recommend it to (which is everyone). Keep up the great work! Bobby
Guest Posted February 21, 2005 Steve, I second Bobby's comments. Your work keeping this contrib updated is extremely valued and greatly appreciated. Now, if I could just get rid of those trawlers... Thanks, Ed
stevel Posted February 21, 2005 Author Thanks for the kind words. If the trawlers are a problem for you, put the appropriate Deny From lines in a .htaccess. See http://httpd.apache.org/docs/mod/mod_access.html#deny for syntax.
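For example, a minimal .htaccess sketch using the mod_access syntax of that era, with the two IPs reported above (illustrative only; test on your own host before relying on it):

```
# .htaccess - block the two IPs reported earlier in the thread
Order Allow,Deny
Allow from all
Deny from 222.95.33.29
Deny from 222.95.37.100
```

Deny from also accepts partial addresses (e.g. `Deny from 222.95.`) if you want to block a whole range, at the cost of possibly shutting out legitimate visitors in it.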
Guest Posted February 22, 2005 Steve, I may go that route. Thanks for the link. I tweaked who's online to give me the user agent. The offending IPs are showing the following: User Agent: mozilla/4.0 (compatible; msie 4.0; windows 98) This isn't a person, it's a bot. It has been regularly hitting the site every 30 seconds to 15 minutes for three days. It only looks at a single page each time. Its referrer changes each time, and the ones that I can go to do not have links to this site. "Curiouser and curiouser" ed
sargenle Posted February 27, 2005 Speaking of robots.txt: which files should be searched by the bots anyway? I have a small standard robots.txt file that I got as a contribution. I would like to make my robots.txt more useful in denying access to files that are useless when indexed. Could someone help me with a list of files/directories that should be allowed to be indexed? Leon
stevel Posted February 27, 2005 Author Search engines will search whatever is linked. You can encourage them by installing the "All Products" contribution and changing the product_info.php page so that the product name is displayed in an h1 tag rather than plain text. Edited February 27, 2005 by stevel
jond Posted March 1, 2005 Does the "crawl" entry in the file cover MSIECrawler?
stevel Posted March 1, 2005 Author Yes - it covers any user agent containing the string "crawl" (case-blind).
selectronics4u Posted April 11, 2005 I have been getting spidered by gecko, but it looks like it only spiders images. Is this in spiders.txt, or should it be added? Thanks, Don