stevel

Updated spiders.txt Official Support Topic


Like adding && ($session_started), so the condition reads:

if (isset($HTTP_GET_VARS['products_id']) && ($session_started)) {

in /includes/boxes/product_notifications.php, etc...
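To illustrate the effect of that guard, here is a sketch (the helper function below is mine, not osCommerce code; the real edit is just adding `&& ($session_started)` to the existing `if`). `$session_started` is set by application_top.php and stays false for visitors matched by spiders.txt, so the box is skipped for spiders:

```php
<?php
// Illustrative helper, not part of osCommerce: the box should only
// render when a product is selected AND a real session exists
// (i.e. the visitor was not identified as a spider).
function show_notifications_box(array $get_vars, $session_started) {
    return isset($get_vars['products_id']) && ($session_started);
}

// Spider on a product page (no session): box suppressed.
var_dump(show_notifications_box(array('products_id' => '42'), false)); // bool(false)
// Human visitor with a session on a product page: box shown.
var_dump(show_notifications_box(array('products_id' => '42'), true));  // bool(true)
```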


· willross


Posted Oct 14 2005, 04:02 PM

Stevel: "...This option will keep some customers from purchasing at your store and will break your store if your domain name for HTTPS is not the same as for HTTP..."

 

I'll guess using a subdomain name for HTTPS is allowed, like

 

https://admin.mydomain.com

http://www.mydomain.com

 

And will this subdomain setting work with the option "Force Cookie Use = TRUE" without breaking my store?

 

-------------

I found this (if it's valid still):

 

http://www.oscommerce.info/kb/osCommerce/D...plementations/4

-------------

"As the cookie is set on the top level domain of the web server, the secured https server must also exist on the same domain.

 

For example, the force cookie usage implementation will work for the following servers:

 

http://www.domain-one.com

https://www.domain-one.com, or https://ssl.domain-one.com

 

but not for the following servers:

 

http://www.domain-one.com

https://ssl.hosting_provider.com/domain-one/"

simply use:

if ($spider_flag) {
	// do not show the JavaScript
} else {
	// show the JavaScript
}
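To make that concrete, here is a minimal sketch (the helper name and the coolmenu.js filename are my assumptions; `$spider_flag` is set by application_top.php when the visitor's user agent matches an entry in includes/spiders.txt):

```php
<?php
// Hypothetical helper: return the menu <script> tag for human visitors,
// and nothing for spiders, so crawlers don't fetch the menu JavaScript.
function menu_script_tag($spider_flag) {
    if ($spider_flag) {
        return ''; // spider: do not show js
    }
    return '<script type="text/javascript" src="coolmenu.js"></script>'; // show js
}

// In the template, e.g. wherever the menu script is emitted:
// echo menu_script_tag($spider_flag);
```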

I am also using coolmenu.

Can someone please give me a sample of their page showing where this code should go?

I am completely new to the whole thing and really amazed by all of it.

 

Would really appreciate your help.

Do you think updating the spiders.txt would reduce the bandwidth of my site? Currently it's going up to 9GB and costing me a fortune.

 

Thanks in advance.

Kunal


spiders.txt does not, on its own, reduce the bandwidth used by spiders. It simply prevents spiders from getting a session registered, so spiders don't do things such as add-to-cart and, more importantly, it keeps session IDs out of the links they record.

 

You can use the information that a spider is visiting to avoid presenting the spider with links that you only want a human to see, such as the product listing column sort links. Turning those off WILL cut down on bandwidth considerably.


Steve,

Thank you for your prompt response.

Can you advise how to go about making the changes you have suggested?

 

Appreciate your help.

 

Regards,

Kunal


Some people have mentioned another "googlebot" and I have found it. It is not actually a bot; it is a direct allocation from Google that is used for evaluating sites using or applying for their services (abuse checks also). Here is the run-down:

 

 

OrgName: Google Inc.

OrgID: GOGL

Address: 1600 Amphitheatre Parkway

City: Mountain View

StateProv: CA

PostalCode: 94043

Country: US

 

NetRange: 66.249.64.0 - 66.249.95.255

CIDR: 66.249.64.0/19

NetName: GOOGLE

NetHandle: NET-66-249-64-0-1

Parent: NET-66-0-0-0-0

NetType: Direct Allocation

NameServer: NS1.GOOGLE.COM

NameServer: NS2.GOOGLE.COM

Comment:

RegDate: 2004-03-05

Updated: 2004-11-10

 

OrgTechHandle: ZG39-ARIN

OrgTechName: Google Inc.

OrgTechPhone: +1-650-318-0200

OrgTechEmail: arin-contact@google.com

 

Hope this clears up some confusion...


· willross


Hello,

 

I use the latest spiders.txt, and in my Configuration/Sessions I have:

Session Directory /tmp

Force Cookie Use True

Check SSL Session ID True

Check User Agent True

Check IP Address True

Prevent Spider Sessions True

Recreate Session True

However, all spiders that visit my store receive a session ID. Why is this happening?

 

I'll appreciate any ideas. Thanks,

Irina.


Willross, the only thing that would be relevant here is if this other Googlebot spiders a site and has a user agent string not detected by spiders.txt. What user agent does it use?

 

Irina, it's difficult to tell without actually testing your store and looking at the files. I will comment that you should set all the "Check" values to False, or else many customers will be unable to use your store.

 

The way I would diagnose this is to add some code to a page to do a print_r of the user agent string (I'd have to look up the variable name) and perhaps add some diagnostic code to the code that uses spiders.txt. I have not heard of a general problem with this feature, though.
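A minimal version of that diagnostic might look like this (my own helper, not stock osCommerce code; I believe osCommerce 2.2 reads the same string via getenv('HTTP_USER_AGENT'), but treat that as an assumption). It emits the visitor's user agent as an HTML comment so you can compare it against the entries in spiders.txt:

```php
<?php
// Hypothetical diagnostic helper: format the user agent from the
// server variables as an HTML comment, escaping it for safety.
function ua_debug_comment(array $server) {
    $ua = isset($server['HTTP_USER_AGENT']) ? $server['HTTP_USER_AGENT'] : '(none)';
    return '<!-- user agent: ' . htmlspecialchars($ua) . ' -->';
}

// Drop this near the top of a page to see what each visitor sends:
echo ua_debug_comment($_SERVER), "\n";
```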


Thanks for your reply, stevel. I set all my "Check" values to False as you recommended. What else can I do to solve this problem?

 

Thanks a lot,

Irina.


Sorry, I had not meant to suggest that the "Check" values were related to the problem. It was just something I thought you should know.

 

Send me a Private Message with a link to your store and I can try it out. To actually debug it, though, I'd need permission to modify files on your site server. If you want me to do that, send me the FTP server name, login name and password in a private message.


Hello:

 

::!Newb Alert!:: :blush:

 

I am trying to use gsitemap by Vigos http://www.vigos.com/products/gsitemap/

 

When I use the spider to make the map, it gets a session ID

 

I have no idea how to find the user agent string.

 

I have all session values set to false (except Prevent Spider Sessions) :thumbsup:

 

I don't know if it is just gsitemap getting the session ID or all the bots. How do I also check this?
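One way to check this yourself is to look for osCommerce's session parameter (osCsid) in your server access log. The log lines below are stand-ins for illustration; substitute your host's real access log path:

```shell
# Sample access-log lines standing in for real entries:
cat > /tmp/access_sample.log <<'EOF'
66.249.64.1 - - "GET /product_info.php?products_id=42&osCsid=abc123 HTTP/1.1" 200
68.142.250.112 - - "GET /index.php HTTP/1.1" 200
EOF

# Count requests where a session ID leaked into the URL; a nonzero
# count means some visitor (possibly a bot) was given a session.
grep -c 'osCsid=' /tmp/access_sample.log   # -> 1
```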


Just check your access log and find the accesses from this spider. Figure out what an appropriate string would be, for example, "gsitemap", and add it to spiders.txt. Remember that the string added to spiders.txt must be lower case - it will match any case in the user agent.
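The lower-case matching works roughly like this (a sketch of the check application_top.php performs against spiders.txt; the function name is mine, not osCommerce's):

```php
<?php
// Sketch of the case-insensitive spider check: each line of spiders.txt
// is a lower-case fragment, and it matches if it appears anywhere in
// the lower-cased user agent string.
function is_spider($user_agent, array $spider_strings) {
    $user_agent = strtolower($user_agent);
    foreach ($spider_strings as $spider) {
        $spider = trim($spider);
        if ($spider !== '' && strpos($user_agent, $spider) !== false) {
            return true; // matched: no session will be registered
        }
    }
    return false;
}

// Adding "gsitemap" to spiders.txt would catch the Vigos crawler:
var_dump(is_spider('Vigos Gsitemap/1.0', array('googlebot', 'slurp', 'gsitemap'))); // bool(true)
var_dump(is_spider('Mozilla/4.0 (compatible; MSIE 6.0)', array('googlebot', 'slurp'))); // bool(false)
```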


Hi there,

 

I got this spider "User Agent: mozilla/5.0 (compatible; yahoo! slurp; http://help.yahoo.com/hel

IP Address 68.142.250.112" on my website. I want to get rid of it but have failed every time. How can I put this in my robots.txt file?

 

B.t.w., anyone have an idea where this spider comes from?

 

regards,

Taosan


You mean spiders.txt and "slurp" is already there. It is Yahoo's spider. Why do you want to "get rid of it"? Do you not want your site indexed by Yahoo? Note that spiders.txt does not prevent spiders from indexing your site - it helps them do it better.


Steve,

 

I wasn't sure if it was Yahoo's spider because it was indexing my site all the time. I mean 24x7, so I was/am suspicious about it. Look at the URL... if I click on it I get a 404 error.

 

I know that "slurp" is in the spiders.txt file, so I thought this spider was a new one.

 

regards,

Taosan


The user agent has been truncated in what you posted. It should be "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)". The IP you give is Yahoo's.

 

If your site is new, Yahoo is trying to find it all. See http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html for how to slow it down.
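For reference, the slowdown Yahoo describes is a robots.txt directive along these lines (the delay value of 10 seconds is just an example; pick what suits your server):

```
User-agent: Slurp
Crawl-delay: 10
```

Note this only slows Slurp down; it does not stop indexing, and not all spiders honor Crawl-delay.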

 

You may also want to consider reducing the number of redundant links a spider can find on your site. In particular, the standard product listing has links at the top of each column for sorting, ascending or descending. That is potentially 3**ncolumns combinations of URLs that the spiders could see, all with really the same information. An easy way to deal with this is to edit the function tep_create_sort_heading in includes/functions/general.php. Change the line:

	if ($sortby) {

to:

	if ($sortby && $session_started) {

This will suppress the sort links for spiders while still leaving them for human visitors.


I have spiders.txt in includes; it's chmoded 740, is that correct?

 

Other issues hopefully can get help with.

 

Google has only indexed my home page. I have over 1200 Products.

 

IM me for a link to the url if you need it.

 

I installed SEO Sitemap contribution

http://www.oscommerce.com/community/contributions,2076

 

and

 

SEO for the osCommerce (2.2 Milestone 2)

http://www.jjwdesign.com/seo_oscommerce.html

 

I don't know much about htaccess files but I followed the instructions for the SEO Sitemap Contribution. My store is on the root.

 

My htaccess file (chmod 644) is as follows. Any recommendations?

 

# $Id: .htaccess,v 1.3 2003/06/12 10:53:20 hpdl Exp $
#
# This is used with Apache WebServers
#
# For this to work, you must include the parameter 'Options' to
# the AllowOverride configuration
#
# Example:
#
# <Directory "/usr/local/apache/htdocs">
#   AllowOverride Options
# </Directory>
#
# 'All' will also work. (This configuration is in the
# apache/conf/httpd.conf file)

# The following makes adjustments to the SSL protocol for Internet
# Explorer browsers

<IfModule mod_setenvif.c>
 <IfDefine SSL>
SetEnvIf User-Agent ".*MSIE.*" \
		 nokeepalive ssl-unclean-shutdown \
		 downgrade-1.0 force-response-1.0
 </IfDefine>
</IfModule>

# If Search Engine Friendly URLs do not work, try enabling the
# following Apache configuration parameter
#
# AcceptPathInfo On

# Fix certain PHP values
# (commented out by default to prevent errors occurring on certain
# servers)
#
#<IfModule mod_php4.c>
#  php_value session.use_trans_sid 0
#  php_value register_globals 1
#</IfModule>

RewriteEngine on
Options +FollowSymlinks
DirectoryIndex home.html home.php index.php index.html
AddType application/x-httpd-php php php4 php3 html htm
RewriteRule ^sitemap_categories.html$ sitemap_categories.php [L]
RewriteRule ^sitemap_products.html$ sitemap_products.php [L]
RewriteRule ^category_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)\.html$ index.php?cPath=$1_$2_$3_$4_$5_$6 [L]
RewriteRule ^category_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)\.html$ index.php?cPath=$1_$2_$3_$4_$5 [L]
RewriteRule ^category_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)\.html$ index.php?cPath=$1_$2_$3_$4 [L]
RewriteRule ^category_([1-9][0-9]*)_([1-9][0-9]*)_([1-9][0-9]*)\.html$ index.php?cPath=$1_$2_$3 [L]
RewriteRule ^category_([1-9][0-9]*)_([1-9][0-9]*)\.html$ index.php?cPath=$1_$2 [L]
RewriteRule ^category_([1-9][0-9]*)\.html$ index.php?cPath=$1 [L]
RewriteRule ^product_([1-9][0-9]*)\.html$  product_info.php?products_id=$1 [L]

 

Thanks in advance,

Brady


 

Steve, your tep_create_sort_heading change suppressed the sort header links on the page for me even for a regular visitor. Could it be due to "Force Cookie Use" being enabled? With "Force Cookie Use" enabled I do not have the osc session ID showing in the URL, so that spiders can better index the site.

 

Is there something else to add to this to enable it to be used when "Force Cookie Use" is enabled?

 

Thanks,

 

John


John,

 

If it suppressed links for you as a normal visitor, you are not getting a session, which is bad. Force Cookie Use does not make spiders "better index the site", but it does drive away some customers. Can you send me a link to your site?

 

Brady, spiders.txt can have the same protection as other files in includes - 644 or 755 is fine. Using spiders.txt would not prevent Google from indexing your site. It can take some time for Google to index new sites - weeks or even months. Is it visiting your product pages? (Look at the access log.)


 

 

Thanks Steve for the info. For now I have turned off the Force Cookie Use option because it was causing other issues as well, which I haven't found an answer to on the forums here yet.

You are probably right about it driving customers away. The biggest issue I was having with cookies forced was that when you go to log in, you get the "Cookie Usage" warning page, because, as you noted above, a session hadn't started. The strange thing is, if I click on "My Account" I get the log-in page like normal, but if you try to go straight to log-in, you get the cookie page. Then after the cookie is set, no more problem. Like I said, it seems like too much trouble.

I was under the impression that spiders did not like the "oscid" in the URL, and that having it there would not be a good thing. Or will they just not get it when they crawl the site if I have spider-friendly URLs enabled? How can I test what they will see vs. a regular customer?

 

If you want to check my site anyway, it is www.greenmountainspecialties.com. Right now I haven't added the previous fix back in, but will in the next day or so to see what happens.

 

Your thoughts on the oscid and cookies is also appreciated.

 

Thanks,

 

John


John, your problem seems to be an incorrect configure.php, so that the cookie cannot be set. Please make sure that HTTP_COOKIE_DOMAIN is 'www.greenmountainspecialties.com' and nothing more. I'd guess that you have a similar problem in the HTTPS defines - the COOKIE_DOMAIN defines must match the domain (host name) portion only of the corresponding _SERVER define.
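For illustration, the relevant defines in includes/configure.php would look like this. The domain values are John's from above (yours will differ), and the HTTPS lines assume SSL is served from the same host, per Steve's earlier caveat:

```php
// includes/configure.php (illustrative values only)
define('HTTP_SERVER', 'http://www.greenmountainspecialties.com');
define('HTTPS_SERVER', 'https://www.greenmountainspecialties.com');
// Cookie domains: host name only, matching the _SERVER defines above.
define('HTTP_COOKIE_DOMAIN', 'www.greenmountainspecialties.com');
define('HTTPS_COOKIE_DOMAIN', 'www.greenmountainspecialties.com');
```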

 

ewww, Becomebot is detected by the line "ebot".


 

OK, I have messed around with all different combinations in my configure.php file, and they don't seem to make a real difference.

I am setting a cookie, but not until the first attempt to access the "My Account" page. In other words, no test cookie is ever sent until I click on "My Account". Once that is done, the browser has the test cookie and a session will start, so that when "My Account" redirects to "Login", the test cookie exists and "Login" can load (if you click on "Login" first, you get the cookie_usage page). No problem with cookies after that, unless I delete the cookie file.

I guess I don't really understand how this is supposed to work. When I have FORCE_COOKIES turned off, as soon as someone clicks on a page other than the home page, the session ID is generated in the URL. Then that session ID carries through to all pages. With FORCE_COOKIES turned on, shouldn't we serve the test cookie upon load of the index page? Then it would be available for any subsequent page load.

 

Or is this the way it IS supposed to work, and my shop isn't doing it correctly? Any thoughts or clarification on how this should all work would be greatly appreciated.

 

Thanks,

 

John


 

how it works by default (and how I use it now):

 

http://forums.oscommerce.com/index.php?showtopic=182189


Treasurer MFC

