
osCommerce


Updated spiders.txt Official Support Topic


stevel


  • 3 weeks later...

Nice add. I was just having some issues last night with bots. Had 4 bots on site at the same time (and they were all indexing a 5000 product site!).

 

I updated my who's online earlier so it showed Googlebot, NewMSNBot and can't remember the others...

 

MSN was getting sessions :/ And I did searches on MSN and the links to the site did indeed have session ids...

 

The eXavaBot bot was showing 25+ individual entries on who's online.

 

Hopefully this along with a robots.txt which denies indexing to the product display page will limit these guys.


Why would you want to deny the product info page to the spiders? I'd think that's exactly where you would want them to be. You just want to keep them off pages that require a session, so their indexes don't fill up with useless "cookie usage" pages.


  • 3 weeks later...

Quick question for you. I was checking what an affiliate program was looking at on my site and found that it was adding things to my cart. So: #1, how do I add it to my spiders.txt, and #2, how do I flush useless carts from my database?

 

Http Code: 200 Date: Feb 09 00:17:32 Http Version: HTTP/1.0 Size in Bytes: 28702

Referer: -

Agent: LinksManager.com (http://linksmanager.com/linkchecker.html)

 

 

S


I would suggest adding the string:

 

linksmanager

 

to your spiders.txt. As for adding to carts, these would be session carts that will go away eventually. You could look at the sessions table for these entries and delete the records, but it may be more trouble than it's worth.


I am trying to understand spiders.txt files. What is the difference between spiders.txt and robots.txt?

 

I noticed I have a spiders.txt file in both the root and the includes folder. Where should this be?

 

If I want to disallow a page, how can this be done? (Where is it placed, how is it written, and is there anything else I should know?)

 

Thanks

Your help is appreciated

Elizabeth

 

http://oldworldcharms.net

 

 

 


Elizabeth,

 

spiders.txt and robots.txt have two very different purposes.

 

spiders.txt is used only by osCommerce and should reside in your catalog/includes folder. If you enable the "Prevent Spider Sessions" option in osC admin, osC will not start a session for a visit by a spider identified by a "user agent" string that contains any of the strings in spiders.txt.

 

For example, Google's spider, when it opens a page on your site, has a user agent string containing "Googlebot". osC converts the user agent to lowercase and then each of the strings in spiders.txt is matched against it. The string "ebot" is found in "googlebot". When a match is found, osC allows the page to load but does not start a session. This means that any of the site features that require a session, such as "add to cart", or opening any of the account pages, will be unavailable to the spider. The benefit of this is that spiders will not waste site resources adding items to carts through "Buy Now" buttons and updating session information. More important, it keeps spiders from including session IDs in their indexes.
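The matching described above can be sketched in a few lines. This is only an illustration in Python, not osCommerce's actual PHP code; the function names and the two sample list entries are assumptions made for the example, and the real spiders.txt contains many more strings.

```python
def load_spider_strings(text):
    """Return the non-empty lines of a spiders.txt-style list, lowercased."""
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def is_spider(user_agent, spider_strings):
    """True if any listed string occurs anywhere in the lowercased user agent."""
    ua = user_agent.lower()
    return any(s in ua for s in spider_strings)


# Illustrative entries only, not the real contents of spiders.txt.
SPIDERS = load_spider_strings("ebot\nlinksmanager\n")

googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

# A session is started only when no entry matches the user agent.
start_session_for_google = not is_spider(googlebot_ua, SPIDERS)
start_session_for_browser = not is_spider(browser_ua, SPIDERS)
```

Because the test is a plain substring match, short entries such as "ebot" cover many spiders at once, but they can also catch an ordinary browser whose user agent happens to contain the same letters.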

 

robots.txt goes in the top-level folder of your site and is read by most spiders. The file contains your rules for which parts of your site a spider may not visit - this can be specified on a per-spider basis or for all spiders. The spider name you give in robots.txt is one assigned by the spider implementor and may have no relation to the user agent string.

 

You really should use both files. Use robots.txt to keep spiders out of pages you typically would not want indexed, such as account and cart pages. spiders.txt does not prevent spiders from indexing your site, but does stop them from generating new sessions and from including session IDs in their indexes.
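As a rough sketch of the robots.txt side, the Python snippet below parses a small rule set with the standard library's urllib.robotparser and checks which pages a well-behaved spider would be allowed to fetch. The /catalog/ prefix and the choice of pages to disallow are assumptions for illustration; adjust them to your own install.

```python
from urllib.robotparser import RobotFileParser

# Example rules only: keep all spiders out of session-dependent pages.
# The /catalog/ path prefix is an assumption about where the shop lives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalog/account.php
Disallow: /catalog/shopping_cart.php
Disallow: /catalog/login.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Product pages stay open to spiders; the cart page does not.
can_index_product = parser.can_fetch(
    "Googlebot", "/catalog/product_info.php?products_id=1"
)
can_index_cart = parser.can_fetch("Googlebot", "/catalog/shopping_cart.php")
```

Note that robots.txt is only a request: a spider that ignores the file can still fetch the disallowed pages.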

 

For more information on robots.txt and how it is used, see The Robot Exclusion Standard.


Steve,

 

First and foremost, you da Man!!! :thumbsup:

 

I wanted to make you aware that the current spiders list will send a browser with the Teoma Toolbar to the cookies_usage page.

 

Based on my research Teoma's robot reports itself as:

 

Mozilla/2.0 (compatible; Ask Jeeves/Teoma)

 

I would recommend just adding a / in front of the current entry for teoma. I added this to my site and successfully tested the Teoma Toolbar.
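To see why the leading slash helps, here is a small Python sketch of the substring test. The robot's user agent is the one quoted above; the toolbar user agent is hypothetical, made up only to show a browser string that mentions Teoma without the slash.

```python
def matches(entry, user_agent):
    """spiders.txt-style test: is the entry a substring of the lowercased UA?"""
    return entry.lower() in user_agent.lower()


# Robot user agent as reported in this post.
teoma_robot_ua = "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"

# Hypothetical toolbar user agent, assumed for illustration only.
toolbar_ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Teoma Toolbar)"

old_entry_hits_toolbar = matches("teoma", toolbar_ua)    # browser treated as a spider
new_entry_hits_toolbar = matches("/teoma", toolbar_ua)   # browser left alone
new_entry_hits_robot = matches("/teoma", teoma_robot_ua) # robot still caught
```

Under that assumption, the plain "teoma" entry catches both the robot and the toolbar user, while "/teoma" only matches the robot.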


Steve,

 

I've had the following IPs trolling my site for the last 2 days:

222.95.33.29,

222.95.37.100.

 

Showmyip.com shows: CHINANET jiangsu province network

 

The same URL will show up multiple times in who's online and list different referrers. The referrer changes almost every time they click, including: www.bbxiong.com, nitreaous.exitfuel.com, world-of-tour.net, www.freefa.net, www.jebest.com.

 

Any ideas? I'm pretty sure it's a bot but I don't know what the agent is.

 

ed


No idea, sorry. I have seen some IP ranges access sites as if they were bots, but they either don't request robots.txt or supply an uninformative user agent - the latter makes it impossible to filter in spiders.txt.

 

One could try to maintain a list of IPs and either filter them in .htaccess or add yet another mechanism, but unless the activity poses a problem for your site, my advice is to ignore it, as at some point it starts costing you more processing power per legitimate page reference than you save by added filtering.


Just wanted to drop a note to Steve and company:

 

This contribution is essential and your work keeping it up to date is appreciated by myself and everyone I recommend it to (which is everyone).

 

Keep up the great work!

 

Bobby


Steve,

 

I second Bobby's comments. Your work keeping this contrib updated is extremely valued and greatly appreciated.

 

Now, if I could just get rid of those trawlers...

 

Thanks,

Ed


Thanks for the kind words. If the trawlers are a problem for you, put the appropriate Deny From lines in a .htaccess. See http://httpd.apache.org/docs/mod/mod_access.html#deny for syntax.
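For a short list of addresses, the Deny From lines can be written by hand. As a minimal sketch, the Python helper below builds such a block from a list of IPs, using the two addresses reported earlier in this thread. The function name is made up for the example, and the output assumes the older Order/Allow/Deny syntax of mod_access; Apache 2.4 replaces it with Require rules unless mod_access_compat is loaded.

```python
import ipaddress


def htaccess_deny_block(addresses):
    """Build mod_access-style directives that block the given IPs or networks."""
    # Allow is evaluated first, then Deny, so listed addresses end up blocked.
    lines = ["Order Allow,Deny", "Allow from all"]
    for addr in addresses:
        # Validation only: raises ValueError if addr is not an IP or CIDR range.
        ipaddress.ip_network(addr, strict=False)
        lines.append(f"Deny from {addr}")
    return "\n".join(lines)


# The two addresses reported earlier in this thread.
print(htaccess_deny_block(["222.95.33.29", "222.95.37.100"]))
```

The printed lines can be pasted into the .htaccess file in the catalog folder. Keep the list short, since every request is checked against it.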


Steve,

 

I may go that route. Thanks for the link.

 

I tweaked who's online to give me the user agent. The offending IPs are showing the following:

User Agent: mozilla/4.0 (compatible; msie 4.0; windows 98)

 

This isn't a person, it's a bot. It has been regularly hitting the site at intervals of 30 seconds to 15 minutes for 3 days. It only looks at a single page each time. Its referrer changes each time, and the ones that I can go to do not have links to this site.

 

"Curiouser and curiouser"

 

ed



 

Speaking of robots.txt.

 

Which files should be searched by the bots anyway?

 

I have a small standard robots.txt file that I got as a contribution.

 

I would like to make my robots.txt more useful in denying access to files that are useless when indexed.

 

Could someone help me with a list of files/directories that should be allowed to be indexed?

 

Leon


Search engines will search whatever is linked. You can encourage them by installing the "All Products" contribution and changing the product_info.php page so that the product name is displayed in an h1 tag rather than plain text.

