
Updated spiders.txt Official Support Topic


stevel


Check your access log for accesses by the payment service and look at the user agent string.

 

I was unaware that the presence of java/ was a problem.


  • 2 weeks later...

Nothing - spiders.txt is not for banning individual users. But what you can do is look for the following code in includes/application_top.php:

 

    if ($spider_flag == false) {
      tep_session_start();
      $session_started = true;
    }

 

Just before this insert:

 

    $ip_address = tep_get_ip_address();
    if ($ip_address == '66.195.77.130') $spider_flag = true;

 

This will prevent that specific IP from getting a session. If you want to add more, just repeat the "if" line and specify another IP.
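If you have several addresses to block, a slightly tidier variant of the same approach is to keep them in an array. This is an untested sketch, not stock osCommerce code, and the addresses shown are only placeholders:

    // IP addresses that should never get a session (placeholders - use the
    // addresses you actually see in your own access log).
    $no_session_ips = array('66.195.77.130', '208.99.195.54');

    $ip_address = tep_get_ip_address();
    if (in_array($ip_address, $no_session_ips)) {
      $spider_flag = true;
    }

Either way the effect is the same: $spider_flag is already true by the time the session-start block above runs, so no session is created for that visitor.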


Hi Steve,

I have this IP address creating sessions on my site all the time and adding random products.

208.99.195.54 - - [09/Jul/2007:08:48:52 +0100] "GET /interactive-whiteboards-c-60.html?osCsid= HTTP/1.0" 200 46338 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"

It used to crawl my sites about 6 months ago and then stopped; now it seems to be back again in a big way.

 

Can you help?

Kunal


Thanks Steve, I'll try to suss out what I need to do.

 

I've had some bot-type behaviour on my site, but it's getting a session ID. There's no user agent, but it comes from IP 208.99.195.54.

 

How would I know if this is a bot? It's not someone browsing the store, I'm sure.

 

Thanks

Tiger

Tiger,

I am also getting the same IP address coming up on my site and adding products. It has been doing this for a while now. I am pretty convinced it is not a customer but a bot. You can tell these things by the kind of items it adds to its cart.

 

Kunal


If it's a bot, it's trying hard to pretend it isn't one. You can't use spiders.txt for this, but you can use the IP test in application_top.php that I suggested earlier.


Steve,

Thanks for your response.

Is there a way to add multiple IP addresses in the format you gave earlier?

Also, can we somehow block unknown IP addresses, or refuse people entry to the site if the IP address is unknown?

 

Kunal


As I suggested earlier, just add another "if" line for another IP. I don't know what you mean by an "unknown" IP. You ALWAYS know the IP of the connection.


 

For the last 3 months I have had an unknown IP address roaming my site and creating sessions.

00:00:00 Guest unknown 17:22:16 17:22:16 Belkin F1U125UKIT - F1U125UKIT (Product) Yes Not Found

The above line is from my Who is online and that is what I see.

I have checked my access log too, but somehow it does not appear there.

 

I'd appreciate any help on this as I can't get rid of it at all.

 

Kunal


I don't know what triggers the "unknown" in that line, but it certainly isn't that the IP address is not known. Perhaps it's not finding a hostname translation, though these are unreliable in many cases.

 

You'll have to study the actual access log to see what you can do about it.


  • 3 weeks later...
Ok, my error. You have to make TWO changes to tep_create_sort_heading in general.php. Change this:

    global $PHP_SELF;

    $sort_prefix = '';
    $sort_suffix = '';

    if ($sortby) {

to this:

    global $PHP_SELF;
    global $session_started;

    $sort_prefix = '';
    $sort_suffix = '';

    if ($sortby && $session_started) {

 

Hi Steve,

I tried this code in includes/functions/general.php but the bots are still sorting products - did I do something wrong?

 

Thanks

Tiger



If they have remembered URLs with sort values, they'll keep using them. But not displaying the sort values will help down the road.

 

Thanks Steve,

I see. I wish my memory were as good as the bots'! I guess there's no way to wipe their memory?

 

Not sure if this is off topic, but I tried to create a site map after I installed the sort code above. The xml-sitemaps bot must have gotten a SID, as it was sorting products and adding items to the cart. The sitemap is then useless as it's full of sort links. Any way to stop it getting a SID?

 

Thanks again

Tiger



There's a contribution "spider session remover" which uses rewrite rules in .htaccess to remove SIDs from incoming spider links, with the disadvantage that you have to name the spiders (Yahoo, Google, etc.) You could extend that to sort links.

 

A drastic thing you can do, but one that will kill all existing SIDs spiders have, is to change the session ID name string; the default is osCsid. This is defined in the session code somewhere (I'm away from my sources).
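For reference, in a stock osCommerce 2.2 that name is set in includes/application_top.php; if your copy matches the stock file, that one call is the place to change (try the rename on a test copy of the store first):

    // includes/application_top.php (stock osCommerce 2.2)
    // Changing 'osCsid' to another string invalidates every SID that
    // spiders have memorised in old URLs.
    tep_session_name('osCsid');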

 

As for the sort tags, one thing I did for another store was, after I disabled the sort links for spiders, add a new parameter to the URL, such as &nn=1. If the incoming URL had a sort tag and NOT nn, it got a 404 response. (One could also do a 301 permanent redirect, but I have found some spiders to ignore that.) This was custom code I no longer have handy, so I'll have to leave it as an exercise for the reader.
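Since that custom code is not available, here is only a minimal sketch of the idea; the nn parameter name comes from the description above, while the use of $_GET and the placement early in includes/application_top.php are assumptions for illustration:

    // Reject remembered URLs that carry a sort tag but not the new nn marker.
    // (Illustrative only - the original custom code is not available.)
    if (isset($_GET['sort']) && !isset($_GET['nn'])) {
      header('HTTP/1.0 404 Not Found');  // or send a 301 to the cleaned URL instead
      exit;
    }

Whichever response you choose, it has to be sent before osCommerce produces any page output, which is why early in application_top.php is the natural spot.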


Thanks Steve,

I have Spider Session Remover but don't know how to adapt it for sort links.

 

Also, how can I stop site map crawlers getting SIDs? I can't seem to find their User Agent.

 

Thanks again

Tiger



If crawlers are still getting SIDs, then perhaps you have not properly enabled "prevent spider sessions". If you'll give me your store URL I'll check.

 

I'm sorry to say that giving you detailed instructions on eliminating sort links is beyond what I'll be able to do for you. If you're using the Spider Session Remover, the idea is to apply the same technique it uses to remove "osCsid" to the "sort" parameter.


Thanks again,

I'll have another look at the code in the .htaccess file.

 

I have set prevent spider sessions to true, which works well - it's just that the ones crawling for site maps seem to get a SID. I'm assuming that's because they're not listed in spiders.txt, but I don't know their name anyhow.

 

 

Tiger



How do you know they're crawling the site map? If you find the entries in your web server access log, there will be a user agent string. If the spider is well-intentioned, it will have a UA string that identifies itself (and isn't empty or pretending to be MSIE, for example.)


I am trying to make a site map, so I enter my URL and they crawl the site in order to build it. Some have no user agent; they get SIDs, add stuff to the cart, see sort links, try to write reviews, and so on. Most of the site map generators only allow 500 links, and when they do all the things I mentioned, I always reach the maximum allowed but end up with a site map full of links I don't want or need.

 

Maybe I'm just using rubbish site map generators? Anyone know of any good ones?



Oh, I get it now. The sitemap generator ought to supply a distinct user agent. If it doesn't, find another. However, I find that the All Products Page contribution works just fine for making it easy for spiders to walk the site.


I didn't know about "all products" so I'll take a look,

 

thanks for the tip.

Tiger


