stevel

Updated spiders.txt Official Support Topic


That's an IP, not a user agent. What's the user agent string from the access log? That IP is not assigned to a specific domain.

 

It is PSI, Performance Systems International? Within minutes it is pulling up every page. I basically went and blocked it completely off my site. Then variants of that IP started crawling.


Remember what the Bible says: He who is without sin, cast the first rock. And I shall smoketh it.

Fine. Adding that one to spiders.txt would not accomplish anything anyway. But if you do see spiders getting session IDs, then by all means let me know!

 

I just posted an update to the contrib - the rate of new spiders has fallen off quite a bit - I had not seen a new one for a couple of months.

 

Hi Stevel,

 

There is one bot which gets a session ID on my site, has added almost all products to the shopping cart, and is always there:

217.106.233.192

It is a WebMoney bot, I guess.

Is it possible to add it to the spiders list to prevent it from getting sessions?

I added "webmone" or "money" to the spiders.txt file, but without any luck.


What is the user agent from your access log?

 

If you want to block a specific IP, you can do that in the .htaccess file with the line:

 

Deny from x.x.x.x

 

There's no reverse DNS (not even a domain name) associated with that IP so I don't know what else to advise you.


The entry in my access log looks like this:

 

217.106.233.192 - - [26/Mar/2007:05:04:47 -0400] Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20060824/1.7.12 (Mozilla; http://mozilla.org; anonymous@anonymous.org)


Hi

 

I have an IP address, 213.123.219.228, that is behaving like more than a customer, and I suspect it is a spider... with a massive basket and a session :angry: This is the first time I have had this happen.

 

Name: Guest

 

ID: 0

 

IP Address: 213.123.219.228

 

User Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

 

I have tried to look at my logs (which I have no clue about :blink: ) and they only go up to Dec 06. I do not know how to look at the current logs. I downloaded the one from my host's control panel and that was the last date showing.

 

Please can you tell me what to do? Urgent & totally confused.

 

Thanks

Julie


Now it is creating loads of baskets! I don't know what is happening, but would appreciate any suggestions.

 

Thanks

Julie :thumbsup:


Me again! :lol:

 

The IP resolves to btopenworld.com, which is possibly bt-yahoo.com etc.

 

Does this help with a spider identification?


No, it doesn't help. The user agent string looks like a generic web browser, but it's easy and common for badly behaved spiders to pretend to be a web browser. If it's a single IP that is the problem, add a "Deny from" line in .htaccess to block it.


Thanks Steve

 

If it is Yahoo (BT use Yahoo for their search engine) it would be preferable not to block it. Is BT being a naughty spider? Tut tut. I have the amendment in "some" file :blush: suggested by Boxtel for getting these sessions removed, and I have also asked my host if they can get my logs for me. When they do I will try to give you the info needed. It was really weird seeing it, and I panicked a bit too!

 

Failing this I will block it, as I take it that it isn't doing me any good?

 

Thanks for your help. :thumbsup:

Julie


It's not Yahoo - it's a BT customer who is running their own spider. Ask BT if they would ask the user for that IP to please stop spidering other sites. You should simply block the IP.

 

Yahoo uses its own IPs and distinctive user agents for its spiders.


:huh: :angry:

Why would a BT customer want to use a spider to search other sites? I am so naive!

 

Is there a special place or way to add

Deny from 213.123.219.228

to the .htaccess file please as I have never had to do this before?

 

Thanks

Julie


Well, there are a number of possibilities. One is that they're looking for insecure sites to exploit. Another is that they're playing with spidering sites. A third, and perhaps more innocuous, is that they're using a tool to fetch an entire site, though these USUALLY have their own user agent string.

 

If you have a local copy of the .htaccess file (which you should), add that line to the bottom of the file, then upload it to the site, being sure to use ASCII mode. If you don't have a local copy, download the file in ASCII mode first. You may find it easier to name the file htaccess.txt on your local computer, upload it that way, delete the old .htaccess, and rename the new file to .htaccess. Windows sometimes has problems with files that have only an extension and no base name.
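
For anyone who would rather script the edit than do it by hand, here is a rough Python sketch of the "add the line to the bottom of the file" step. It assumes the older Apache "Deny from" access-control syntax is what your server accepts (newer Apache releases use a different mechanism, so check with your host), and the function name add_deny_line is made up for illustration. It only works on the text of a local copy; uploading is still up to you.

```python
import ipaddress


def add_deny_line(htaccess_text, ip):
    """Return htaccess_text with a 'Deny from <ip>' line appended at the bottom.

    Hypothetical helper: the directive syntax is an assumption about the server.
    """
    # ip_address() raises ValueError for anything that is not a valid IP
    ip = str(ipaddress.ip_address(ip))
    line = f"Deny from {ip}"

    # Use plain LF line endings, as an ASCII-mode transfer to a Unix host would
    text = htaccess_text.replace("\r\n", "\n")

    # Do nothing if the same rule is already present
    if line.lower() in (existing.strip().lower() for existing in text.split("\n")):
        return text

    # Make sure the previous last line is terminated before appending
    if text and not text.endswith("\n"):
        text += "\n"
    return text + line + "\n"


if __name__ == "__main__":
    local_copy = "Options -Indexes\r\n"
    print(add_deny_line(local_copy, "213.123.219.228"))
```

Keeping the function free of file and FTP handling means you can try it on a copy of the file's text before touching the live site.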

Edited by stevel


OK I am lost! :blush:

 

Is it OK to show you my .htaccess file, as I have looked at it for the first time and don't know where to put this line? :-" I'm not sure whether I'm using ASCII, although I have noticed it switch to binary when I FTP... or am I really lost!

 

Thanks

Julie


Hi,

I have been using this contribution for 2 urs now and all seems to be working well.

 

For the last few days I have noticed a search engine spider or a hacker on my site, but the IP address shows as unknown:

 

Online   | Name     | IP Address    | Entry    | Last Click | Last URL | Session? | Referer?
00:02:30 | Guest    | 81.10.82.136  | 08:27:20 | 08:29:50 | /hp-nw9440-cd-t2700-233g-2gb-100g-ey616eaabu-p-49234.html | Yes | Yes
00:00:00 | Guest    | 87.75.128.182 | 08:29:09 | 08:29:09 | (Product) | Yes | Not Found
00:00:00 | Guest    | 87.75.128.182 | 08:28:12 | 08:28:12 | (Product) | Yes | Not Found
00:00:00 | Guest    | unknown       | 08:28:05 | 08:28:05 | Pioneer S-V40UK (Product) | Yes | Not Found
00:00:00 | Guest    | unknown       | 08:28:03 | 08:28:03 | Sandisk Memory Stick - SDMSH-64-E10 (Product) | Yes | Not Found
09:00:42 | Mozilla  | 66.249.72.195 | 23:26:33 | 08:27:15 | /ctx-m-110.html?sort=4d&page=1 | No | Not Found
00:00:00 | Guest    | 192.168.1.72  | 08:26:51 | 08:26:51 | LAMP FOR TOSHIBA TDP-MT700 PROJECTOR (Product) | Yes | Not Found
00:00:00 | Guest    | unknown       | 08:26:14 | 08:26:14 | Lenovo ThinkCentre A52 P4 2.8 512 80 DVD XPP - VSA72UK (Product) | Yes | Not Found
00:01:12 | Guest    | 194.73.121.7  | 08:24:45 | 08:25:57 | /sharp-xvz3000-p-46493.html | Yes | Yes
00:00:00 | Mozilla  | 74.6.67.158   | 08:25:33 | 08:25:33 | /samsung-ppm42m5s-42-silver-plasma-screen-p-8705.html | No | Not Found
00:01:14 | ShopWiki | 38.98.120.87  | 08:24:07 | 08:25:21 | /acer-al1722hs-etl0408073-p-7320.html | No | Not Found
00:00:00 | Guest    | unknown       | 08:24:31 | 08:24:31 | Sandisk 1GB Memory Stick - SDMSPD-1024-E10M (Product) | Yes | Not Found
00:00:00 | Mozilla  | 74.6.74.83    | 08:23:32 | 08:23:32 | /sahara-s2000-p-26188.html | No | Not Found
00:00:00 | Mozilla  | 193.47.80.42  | 08:22:18 | 08:22:18 | Audica Tower CS-T1 Silver (pair) (Product) | No | Not Found

 

How can I stop this, as it seems to be creating a session?

Please help!!


I would say that you need to ask the author of the enhanced "Who's online" contrib you are using why "unknown" is shown. As for spiders.txt, I'd need to see the access log entries for these visits to see what there is that can be blocked.


First, thanks Stevel for the contrib. Looks like a lot of work.

 

Second, can you explain to me what the difference is between these:

 

! *******************Best Spiders List***********************!

! architext spider

! ask jeeves

! crawler

! crawle

etc...

 

and

 

! ****************knocker Spiders List!**********************!

 

.bot

/bot

/teoma

_bot

abcdatos

abot

accoona

acme

acoon

etc...

 

I'm sorry, I didn't really know what to do with this file other than replace the stock one in my /includes folder (and prevent spider sessions, of course), but am I seeing "! googlebot" commented out? (The ones on top are all prefixed with "!".)

 

Thanks!

 

HerpAddict 87

Edited by Herpaddict87


The text you show is not from my contrib, so I can't comment on it. I have seen some other lists which, in some cases, show that the authors don't understand how spiders.txt is processed.

 

In my contrib, the string "ebot" catches googlebot. I'll add that all spider strings in spiders.txt must be lowercase and that extra comment lines slow down processing.
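
To make the point about "ebot" and lowercase concrete, here is a small Python sketch of the kind of matching described above. It assumes that each line of spiders.txt is tested as a plain substring of the lowercased user agent, which is what the advice implies; it is not the shop's actual code. The three-entry list and the function names are made up for illustration.

```python
def load_spider_strings(text):
    """Return the non-empty lines of a spiders.txt-style list, exactly as written."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_spider(user_agent, spider_strings):
    """True if any listed string occurs anywhere in the lowercased user agent."""
    agent = user_agent.lower()
    return any(s in agent for s in spider_strings)


# Hypothetical short list; a real spiders.txt is much longer
spiders = load_spider_strings("ebot\nslurp\nshopwiki\n")

googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

print(is_spider(googlebot, spiders))        # "ebot" is a substring of "googlebot"
print(is_spider(browser, spiders))          # nothing in the list matches a plain browser
print(is_spider(googlebot, ["Googlebot"]))  # a capitalised entry can never match
```

Because the user agent is lowercased before the comparison, an entry containing capital letters never matches anything. Every line in the file is also one more comparison per request, which is why padding the list with comment lines costs time.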


Thanks Stevel! Now I understand my own confusion! I must have been looking at a different contrib. I now have your latest .txt uploaded. :)

 

HerpAddict87


Today I am getting a lot of spiders on my site adding random products to the cart.

The IP addresses are:

86.142.246.187

208.99.195.54

 

How can I stop them?

If you're interested in using spiders.txt to stop these, then you need the user agent string from the access logs. IPs aren't helpful.

What is the best way to show the access_log? The file size is about 3MB, which is too big to copy here.

Also, how do we read the log to find out who to block using spiders.txt?

I am also getting a spider on my site which creates sessions, but its IP address comes up as "unknown" on Who's Online.

 

Pls help!!

 

Kunal


Just post the lines showing accesses by the IPs you are worried about. Just one line from each would be fine. You're looking for the User Agent string which, for normal users, shows the name of the browser. For well-behaved spiders, it will have an identification such as Googlebot. There are also individuals who use generic software to create their own spiders, sometimes for not-so-nice reasons. Those may be hard to identify, since they sometimes pretend to be a regular web browser.
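
If the log is too big to read by eye, a short script can pull out just the user agents for the IPs in question. The Python sketch below assumes the log is in Apache's "combined" format, where the user agent is the last quoted field; your host's format may differ. The sample lines are invented for illustration (only the MSIE user agent and the IPs appear earlier in this topic), and agents_for_ips is a made-up name.

```python
import re

# Apache "combined" format (assumed):
# ip ident user [time] "request" status size "referer" "user agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def agents_for_ips(lines, ips):
    """Map each wanted IP to the set of user agent strings it sent."""
    wanted = set(ips)
    found = {}
    for line in lines:
        match = COMBINED.match(line)
        if match and match.group("ip") in wanted:
            found.setdefault(match.group("ip"), set()).add(match.group("agent"))
    return found


# Invented sample lines, for illustration only
SAMPLE_LOG = [
    '213.123.219.228 - - [01/Apr/2007:10:15:02 +0100] "GET /index.php HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
    '66.249.72.195 - - [01/Apr/2007:10:15:03 +0100] "GET /ctx-m-110.html HTTP/1.1" '
    '200 7300 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for ip, agents in sorted(agents_for_ips(SAMPLE_LOG, ["213.123.219.228"]).items()):
        print(ip, sorted(agents))
```

To run it on a real log, pass an open file instead of SAMPLE_LOG, for example agents_for_ips(open("access_log"), [...]). Lines that do not fit the assumed format, or whose fields contain escaped quotes, are simply skipped.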
