stevel

Updated spiders.txt Official Support Topic


I'm using the September update of the spiders.txt. shopwiki is mentioned in that file.

 

However, it seems to be ignoring the spiders.txt, as it's been sucking up a ridiculous amount of bandwidth without any return.

 

Is there anything else that I can do to make sure that shopwiki is denied access to my site?

You can contact the shopwiki.com support people - they were very responsive to a question I asked a short time ago. (The shopwiki spider has an annoying habit of trying variations on URLs, truncating them at punctuation points. I was told that this was their attempt to "optimize". It gets me a lot of 404 errors.)

 

If you truly want to deny access, you can add a "Deny from" entry to your .htaccess for their IP range (I don't know what it is offhand), but that won't stop them from trying.
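As a sketch, a "Deny from" block in an Apache 2.x .htaccess looks like the following. The IP range here is a documentation placeholder (RFC 5737), not ShopWiki's actual netblock - look their real range up via whois before using this:

```apache
# Deny a crawler by IP range. 192.0.2.0/24 is a placeholder;
# substitute the crawler's real netblock from a whois lookup.
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
```

As noted above, this returns a 403 to matching requests but does not stop the crawler from continuing to try.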

I was having the same problem, but shopwiki has obeyed the robots.txt since I disallowed them... so far, at least!
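A minimal robots.txt disallow along these lines should do it. The User-agent token below is an assumption - confirm the exact agent string ShopWiki's crawler sends by checking your access log:

```
User-agent: ShopWiki
Disallow: /
```

This only works because the crawler chooses to honor robots.txt; it is a request, not an enforcement mechanism.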


A couple more you might want to add, which mainly crawl French sites:

 

GET /links.php HTTP/1.1" 200 71316 "-" "BIGLOTRON (Beta 2;GNU/Linux)

It comes regularly but only fetches a few pages.

 

GET /robots.txt HTTP/1.0" 200 1666 "-" "Graal (http://www.gralon.net)

It comes very often to update its directory/search engine and crawls about 500 pages at a time, resulting in a big shopping cart :)


Hello all,

 

I have a spider cruising around my site: 216.113.181.67, and it seems to be identified as EBay.

 

http://www.showmyip.com/?ip=216.113.181.67

 

 

OrgName: eBay, Inc

OrgID: EBAY

Address: 2145 Hamilton Ave

City: San Jose

StateProv: CA

PostalCode: 95008

Country: US

 

 

I have put "ebay" into spiders.txt but it does not prevent this one from getting sessions - sometimes about 10 or 15 at a time!

 

Any ideas what to do? Does it have another name or user-agent?

 

Thanks for all ideas....

It shows no user-agent in the whois entry... that is what is puzzling me.

 

(The other bots seem to).


WHOIS entries rarely show user agent strings - in fact I have yet to see one that does. What I was asking for was the user agent string from your access log. That is what the "prevent spider sessions" feature looks at, not IPs.
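To pull the user agent strings out of a combined-format access log, something like this one-liner works. The log path is an assumption - adjust it for your server's layout:

```shell
# List distinct user-agent strings with hit counts, busiest first.
# In the Apache "combined" log format the user agent is the 6th
# double-quote-delimited field on each line.
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head
```

Look up the IPs you are curious about in the output; the string printed there is what the "prevent spider sessions" feature compares against spiders.txt.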


Hi Steve,

 

I got hits from someone with the referer "PycURL/7.15.5".

 

I already have an updated version of your spider list, but do you know if this is a regular (search-engine) spider or an unfriendly one?

 

I'm asking because I have hits from two different IP addresses (one located in Brazil and one in Saudi Arabia), and both had this referer as mentioned above.

 

Thanks for your opinion in advance,

kind regards

Andreas


pycURL is a Python interface to the cURL library. It is not "bad", but you can assume that anyone using pycURL is not browsing your site normally and can be treated as a spider.

 

Add the string:

 

pycurl

 

to spiders.txt for now. I'll add this in the next update.
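For context (as I understand the contribution), spiders.txt is simply a list of substrings, one per line, that osCommerce matches against a lowercased copy of the visitor's user agent - so the addition is a single line. The surrounding entries shown here are typical existing ones:

```
googlebot
yahoo! slurp
pycurl
```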


Hi Steve,

 

thanks for your reply.

pycURL is already in your spiders.txt, isn't it?

 

I assumed so, because I was of the opinion that if that referer shows up in my stats, it must already be listed there.

 

You also said that this is not normal browsing. Does this mean someone is trying to grab something from my site with pycURL?


I'm away from my files so I am not sure if it is there. But I think "curl" is there which would take care of this. Yes, this does indicate some sort of automated grabber.


robots.txt will work only if the spider obeys it, and that is not likely in this case. You can block that user agent using .htaccess. Is this "visitor" causing problems for you?
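One way to block by user agent, sketched for Apache with mod_rewrite (this assumes mod_rewrite is available and enabled; the [NC] flag makes the match case-insensitive, so it catches "PycURL" as well):

```apache
# Return 403 Forbidden to any request whose User-Agent contains "pycurl".
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} pycurl [NC]
RewriteRule .* - [F,L]
```

The same pattern works for any other agent string - change the RewriteCond, or add more of them with the [OR] flag.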


"grabber" is a term used for any kind of automated web page reader that stores copies of what it finds on the web page. A search engine spider is a grabber, but usually people reserve this term for scripts other than search engine spiders. For example, I know of a script that grabs copies of any favicons it finds on a site.

 

For the purpose of spiders.txt, you'd like to be able to recognize non-human visitors so as to avoid assigning a session to them. Being listed in spiders.txt does NOT restrict a non-human visitor from seeing the pages on your site (other than those that require a session, such as the cart).

 

If you have a non-human visitor that is causing you problems, such as excessive bandwidth, you have to look to other means to stop them. Well-behaved scripts do obey robots.txt, but there are many not well behaved (often run by individuals.) For these, you have to resort to other means such as IP and user agent blocks in .htaccess.


It turns out that "curl" wasn't in spiders.txt and it definitely needs to be - especially due to pyCURL. I have updated the contrib to include this and some other strings.


Hello,

 

Newbie here... but I'm getting a lot of hits from

74.6.86.148

74.6.66.51

74.6.73.248

74.6.86.148

74.6.87.103

 

and so on

 

The network ID says it is the Inktomi Corporation (from Domain Dossier).

 

These connections are constant, make huge guest carts, and the connections are multiplying. I now have 6.

 

Is this a spider? I am using your spiders_large.txt in my site (renamed of course) but it is not preventing this.

 

Is this an 'ok' connection, or something I should worry about? It seems to be cycling through all my products, over and over.

 

Anyone have advice?

 

Thanks,

Nancy

That's one of Yahoo's spiders. Do you have the user agent string from the access log? Typically Yahoo's spiders have "slurp" in the UA, which spiders.txt includes.

 

 

Yippee... I figured out where to find the answer!

 

Yes, it does... here is a paste of one of them:

 

74.6.74.31 - - [11/Dec/2006:15:09:38 -0600] "GET /products_new.php?action=buy_now&products_id=219&osCsid=f56ba6b26df5bafbf65ddae3118e7f88 HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

 

 

Incidentally, the number of connections has gone down... I now have only 4, but their carts have grown - and one now contains 26 items!

 

Thanks,

Nancy

