stevel

Updated spiders.txt Official Support Topic

597 posts in this topic

Check your access log for the accesses by the payment service and look at the user agent string.

 

I was unaware that the presence of java/ was a problem.


Hello,

 

Lately the following host has been regularly visiting and adding items to the shopping cart: http://www.66-195-77-130.static.twtelecom.net/ ... a headache, really.

I would just like this user not to get a session ID.

How can I do that? What do I put in spiders.txt?

 

dca


Nothing - spiders.txt is not for banning individual users. But what you can do is look for the following code in includes/application_top.php:

 

    if ($spider_flag == false) {
      tep_session_start();
      $session_started = true;
    }

 

Just before this insert:

 

    $ip_address = tep_get_ip_address();
    if ($ip_address == '66.195.77.130') $spider_flag = true;

 

This will prevent that specific IP from getting a session. If you want to add more, just repeat the "if" line and specify another IP.
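
If the list of IPs grows, repeating the "if" line gets unwieldy. One possible alternative, shown here only as a sketch, is to keep the addresses in an array and test membership. The helper name is_blocked_ip and the $blocked_ips variable are made up for this example and are not part of osCommerce.

```php
<?php
// Hypothetical helper for illustration; not an osCommerce function.
// Returns true when $ip_address exactly matches one of the listed IPs.
function is_blocked_ip($ip_address, $blocked_ips)
{
    // Third argument "true" forces strict (type and value) comparison.
    return in_array($ip_address, $blocked_ips, true);
}

// Example list using the two addresses mentioned in this topic.
$blocked_ips = array('66.195.77.130', '208.99.195.54');

// In includes/application_top.php this would sit just before the
// "if ($spider_flag == false)" test, along the lines of:
//   if (is_blocked_ip(tep_get_ip_address(), $blocked_ips)) $spider_flag = true;
```

The effect is the same as a series of "if" lines: a matching visitor is treated like a spider and does not get a session.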


Hi Steve,

This IP address keeps creating sessions on my site and adding random products to the cart.

208.99.195.54 - - [09/Jul/2007:08:48:52 +0100] "GET /interactive-whiteboards-c-60.html?osCsid= HTTP/1.0" 200 46338 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"

It used to crawl my sites about 6 months ago and then stopped. Now it seems to be back again in a big way.

 

Can you help?

Kunal

Thanks Steve, I'll try to sus out what I need to do.

 

I've had some bot-type behaviour on my site, but it's getting a session ID. There's no user agent, but it comes from IP 208.99.195.54.

 

How would I know if this is a bot? It's not someone browsing the store, I'm sure.

 

Thanks

Tiger

Tiger,

I am also getting the same IP address coming up on my site and adding products. It has been doing this for a while now. I am pretty convinced it is not a customer but a bot. You can tell by the kind of items it adds to its cart.

 

Kunal


If it's a bot, it's trying hard to pretend it isn't one. You can't use spiders.txt for this, but you can use the IP test in application_top.php that I suggested earlier.


Steve,

Thanks for your response.

Is there a way to add multiple IP addresses in the format you gave earlier?

Also, can we somehow block unknown IP addresses, or refuse entry to the site if the IP address is unknown?

 

Kunal


As I suggested earlier, just add another "if" line for another IP. I don't know what you mean by an "unknown" IP. You ALWAYS know the IP of the connection.


 

For the last 3 months I have had an unknown IP address roaming my site and creating sessions.

00:00:00 Guest unknown 17:22:16 17:22:16 Belkin F1U125UKIT - F1U125UKIT (Product) Yes Not Found

The line above is from my Who's Online page, and that is all I see.

I have checked my access log too, but somehow it does not appear there.

I'd appreciate any help on this, as I can't get rid of it at all.

 

Kunal


I don't know what triggers the "unknown" in that line, but it certainly isn't that the IP address is not known. Perhaps it's not finding a hostname translation, though these are unreliable in many cases.

 

You'll have to study the actual access log to see what you can do about it.

Ok, my error. You have to make TWO changes to tep_create_sort_heading in general.php. Change this:

    global $PHP_SELF;

    $sort_prefix = '';
    $sort_suffix = '';

    if ($sortby) {

to this:

    global $PHP_SELF;
    global $session_started;

    $sort_prefix = '';
    $sort_suffix = '';

    if ($sortby && $session_started) {

 

Hi Steve,

I tried this code in includes/functions/general.php but the bots are still sorting products - did I do something wrong?

 

Thanks

Tiger


If they have remembered URLs with sort values, they'll keep using them. But not displaying the sort values will help down the road.


 

Thanks Steve,

I see. I wish my memory was as good as the bots'! I guess there's no way to wipe their memory?

Not sure if this is off topic, but I tried to create a site map after I installed the sort code above. The xml-sitemaps bot must have gotten a SID, as it was sorting products and adding items to the cart. The sitemap is then useless, as it's full of sort links. Is there any way to stop it getting a SID?

 

Thanks again

Tiger


There's a contribution, "Spider Session Remover", which uses rewrite rules in .htaccess to remove SIDs from incoming spider links, with the disadvantage that you have to name the spiders (Yahoo, Google, etc.). You could extend that to sort links.

A drastic option, but one that will invalidate all the SIDs that spiders have already stored, is to change the name of the session ID parameter (the default is osCsid). This is defined in the session code somewhere (I'm away from my sources).

 

As for the sort tags, one thing I did for another store was, after I disabled the sort links for spiders, add a new parameter to the URL, such as &nn=1. If the incoming URL had a sort tag and NOT nn, it got a 404 response. (One could also do a 301 permanent redirect, but I have found some spiders to ignore that.) This was custom code I no longer have handy, so I'll have to leave it as an exercise for the reader.
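
For anyone attempting that exercise, the check might look roughly like the sketch below. It is not the original custom code, which is no longer available. The parameter names "sort" and "nn" come from the description above; the function name should_reject_sort_request is invented for this example, and where exactly to call it in the store is left open.

```php
<?php
// Hypothetical helper for illustration; not part of osCommerce.
// Returns true when a request carries a sort tag but lacks the
// marker parameter that the store's own sort links would add.
function should_reject_sort_request($query)
{
    return isset($query['sort']) && !isset($query['nn']);
}

// Possible use early in the page, before any output is sent:
//   if (should_reject_sort_request($_GET)) {
//       header('HTTP/1.0 404 Not Found');
//       exit;
//   }
```

Old URLs that spiders remembered (sort but no nn) then get a 404, while sort links generated by the store for real sessions keep working because they include nn.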


Thanks Steve,

I have Spider Session Remover but don't know how to adapt it for sort links.

 

Also, how can I stop site map crawlers getting SIDs? I can't seem to find their User Agent.

 

Thanks again

Tiger


If crawlers are still getting SIDs, then perhaps you have not properly enabled "prevent spider sessions". If you'll give me your store URL I'll check.

 

I'm sorry to say that giving you detailed instructions on eliminating sort links is beyond what I'll be able to do for you. If you're using the Spider Session Remover, the idea is to handle "sort" the same way it removes "osCsid".
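
To illustrate the idea only: the contribution itself works through .htaccess rewrite rules, which are not reproduced here. The sketch below does the equivalent stripping in PHP instead, removing named parameters from a URL so that "sort" is treated like "osCsid". The function name strip_query_params and the sample URL are made up for the example.

```php
<?php
// Hypothetical helper for illustration; this is a PHP stand-in, not the
// rewrite-rule method the Spider Session Remover contribution uses.
// Removes the named query parameters from a relative URL.
function strip_query_params($url, $names)
{
    $parts = parse_url($url);
    $query = array();
    if (isset($parts['query'])) {
        // Split "a=1&b=2" into an associative array.
        parse_str($parts['query'], $query);
    }
    foreach ($names as $name) {
        unset($query[$name]);
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    // Rebuild the query string from whatever parameters remain.
    $rebuilt = http_build_query($query, '', '&');
    return $rebuilt === '' ? $path : $path . '?' . $rebuilt;
}

$clean = strip_query_params(
    '/index.php?cPath=60&sort=2a&osCsid=abc123',
    array('sort', 'osCsid')
);
```

A spider requesting the original URL could then be redirected to the cleaned one, in the same spirit as the SID removal.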


Thanks again,

I'll have a look again at the code for the htaccess file.

 

I have set "prevent spider sessions" to true, which works well. It's just that the ones crawling for site maps seem to get a SID. I'm assuming that's because they're not listed in spiders.txt, but I don't know their names anyhow.

 

 

Tiger


How do you know they're crawling the site map? If you find the entries in your web server access log, there will be a user agent string. If the spider is well-intentioned, it will have a UA string that identifies itself (and isn't empty or pretending to be MSIE, for example.)
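
As a rough aid for reading the access log, here is a sketch that pulls the IP and user agent out of one log line. It assumes the server writes the common Apache "combined" format, as in the sample line posted earlier in this topic; the function name parse_combined_log_line is invented for the example, and lines with escaped quotes inside a field will not match this simple pattern.

```php
<?php
// Hypothetical helper for illustration. Parses one line of an access log
// in Apache "combined" format; returns null if the line does not match.
function parse_combined_log_line($line)
{
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }
    return array(
        'ip'         => $m[1],
        'time'       => $m[2],
        'request'    => $m[3],
        'status'     => (int) $m[4],
        'referer'    => $m[6],
        'user_agent' => $m[7],
    );
}

// Sample line taken from earlier in this topic.
$line = '208.99.195.54 - - [09/Jul/2007:08:48:52 +0100] '
      . '"GET /interactive-whiteboards-c-60.html?osCsid= HTTP/1.0" 200 46338 "-" '
      . '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"';
$entry = parse_combined_log_line($line);
```

Here $entry['user_agent'] holds the MSIE string, which is exactly the case of a visitor that looks like a browser and so cannot be caught through spiders.txt.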


I am trying to make a site map, so I enter my URL and they crawl the site in order to build it. Some have no user agent; they get SIDs, add stuff to the cart, see sort links, try to write reviews, and so on. Most of the site map generators only allow 500 links, and when they do all the things I mentioned, I always reach the maximum and end up with a site map full of links I don't want or need.

 

Maybe I'm just using rubbish site map generators? Anyone know of any good ones?


Oh, I get it now. The sitemap generator ought to supply a distinct user agent. If it doesn't, find another. However, I find that the All Products Page contribution works just fine for making it easy for spiders to walk the site.


I didn't know about "All Products", so I'll take a look.

 

thanks for the tip.

Tiger


Hi Steve. Thanks for the file. Could you please clarify how much slower the page load is with the "large" file?


Not really. I just figure it's a lot more data to read and loop through on every page access.
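
To show why the file size matters, here is a simplified sketch of a substring-style spider check: every entry is compared against the user agent until one matches, so a longer list means more comparisons on each page access. This is an illustration under that assumption, not a copy of the code in application_top.php, and the function name user_agent_matches_spider is made up.

```php
<?php
// Hypothetical helper for illustration; not an osCommerce function.
// Returns true if any non-empty spider entry occurs in the user agent.
function user_agent_matches_spider($user_agent, $spider_lines)
{
    $user_agent = strtolower($user_agent);
    if ($user_agent === '') {
        return false; // an empty user agent cannot match any entry
    }
    foreach ($spider_lines as $line) {
        $needle = strtolower(trim($line));
        if ($needle !== '' && strpos($user_agent, $needle) !== false) {
            return true; // stop at the first matching entry
        }
    }
    return false;
}

// Entries as they might be read from a file, one per line.
$spider_lines = array("googlebot\n", "slurp\n", "java/\n");
```

With matching like this, an entry such as "java/" catches any client whose user agent contains that text, while a bot that sends an empty or MSIE-style user agent matches nothing.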

