stevel

Updated spiders.txt Official Support Topic


Brady, spiders.txt can have the same protection as other files in includes - 644 or 755 is fine. Using spiders.txt would not prevent Google from indexing your site. It can take some time for Google to index new sites - weeks or even months. Is it visiting your product pages? (Look at the access log.)

Excuse me, where can I look for this? :rolleyes:

I would like to know which spiders access my website!

You'll have to ask your web host where the access log is. It varies.
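Once you do find the access log, a quick way to see which spiders visit is to scan the user-agent field of each request. Here is a minimal Python sketch, assuming the common Apache "combined" log format; the sample lines and the spider-name list are made up for illustration (real spiders.txt entries are substrings like these):

```python
import re

# Two sample lines in Apache "combined" log format (made-up data for illustration).
SAMPLE_LOG = (
    '66.249.65.232 - - [10/Jan/2006:20:50:20 +0000] "GET /product_info.php HTTP/1.1" '
    '200 32959 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"\n'
    '10.0.0.5 - - [10/Jan/2006:20:51:02 +0000] "GET /index.php HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)

# Name fragments as they might appear in spiders.txt.
SPIDERS = ["googlebot", "msnbot", "slurp"]

def spider_hits(log_text):
    """Return (request, user_agent) pairs whose agent contains a spider fragment."""
    line_re = re.compile(r'\S+ \S+ \S+ \[[^\]]*\] "([^"]*)" \d+ \S+ "[^"]*" "([^"]*)"')
    hits = []
    for line in log_text.splitlines():
        m = line_re.match(line)
        if m:
            request, agent = m.groups()
            if any(s in agent.lower() for s in SPIDERS):
                hits.append((request, agent))
    return hits

for request, agent in spider_hits(SAMPLE_LOG):
    print(request, "<-", agent)
```

Point this at your real log file (the path varies by host, as noted above) and extend the SPIDERS list as needed.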

Thanks!

I thought maybe it was an osC or a contrib log.. :D

You should not be getting a second session ID. What are your values for HTTP_SERVER, HTTPS_SERVER, HTTP_COOKIE_DOMAIN and HTTPS_COOKIE_DOMAIN?

 

 

Steve,

 

First let me say thank you for your patience in helping me understand the way this should work (and everyone else's help too!). I must be missing something or just not being clear in explaining things. Here are my paths from configure.php:

 

 define('HTTP_SERVER', 'greenmountainspecialties.com'); // eg, http://localhost - should not be empty for productive servers
 define('HTTPS_SERVER', 'greenmountainspecialties.com'); // eg, https://localhost - should not be empty for productive servers
 define('ENABLE_SSL', true); // secure webserver for checkout procedure?
 define('HTTP_COOKIE_DOMAIN', 'greemountainspecialties.com');
 define('HTTPS_COOKIE_DOMAIN', 'greenmountainspecialties.com');
 define('HTTP_COOKIE_PATH', '/');
 define('HTTPS_COOKIE_PATH', '/');
 define('DIR_WS_HTTP_CATALOG', '/');
 define('DIR_WS_HTTPS_CATALOG', '/');

 

My catalog is in the root, and my SSL cert is issued to 'greenmountainspecialties.com'

 

When I test things out, I look in my cookies folder to see when the cookie gets set (I must refresh between each step). I don't see it appear in the cookies folder until after the 2nd SSL request. Here is an example:

 

(NOTE: For this test, I disabled all the IF statements in application_top.php relating to spiders, etc., so that it should force the cookie on first page load regardless of the setting in Admin.)

 

Load home page -> no cookie

Click a product -> no cookie

Click Add to Cart -> no cookie

Click login -> Get the Cookie Usage Page -> sets test cookie in cookie folder (there is a [2] at the end of the cookie name - what is that for?)

Cookie text:

cookie_test
please_accept_for_session
greenmountainspecialties.com/
1024
249890944
29758436
483192304
29752401
*

Click login again -> Get the login page -> cookie file name changes to show [1] at the end, but file text is the same:

Cookie text:

cookie_test
please_accept_for_session
greenmountainspecialties.com/
1024
1019890944
29758436
1256632304
29752401
*

 

Now I am able to navigate, add products to the cart, and check out. I guess what I think should happen is that the test cookie should be set by application_top when a visitor comes to the first page (regardless of which page it is). Then the next page load will accept that cookie for the remaining session. Am I not understanding correctly?

 

Again, your patience and help is greatly appreciated.

 

John


I'm amazed that works at all. Try this:

define('HTTP_SERVER', 'http://greenmountainspecialties.com'); // eg, http://localhost - should not be empty for productive servers
define('HTTPS_SERVER', 'https://greenmountainspecialties.com'); // eg, https://localhost - should not be empty for productive servers


 

I am constantly amazed at this stuff that it works at all, and also at all the great folks in this osCommerce community who try to help everyone! I switched the configure file back to that (which is how I originally had it, but I just wanted to test different possibilities), and got the same results.

 

Before I go any further, I realize that this is the spiders.txt support thread, so if this subject is getting too far off topic, I will be happy to submit it to a different thread. It just seems like it is pertinent to the topic, and since there are some knowledgeable folks here, I hope that it will be OK. If not, let me know.

 

OK, a little more experimenting. With cookies forced on, on the first page load, I do not get a session_id in the SESSIONS table. The header sent looks like this:

 

HTTP/1.1 200 OK
Date: Fri, 09 Dec 2005 01:53:23 GMT
Server: Apache/1.3.34 (Unix) mod_ssl/2.8.25 OpenSSL/0.9.7a FrontPage/5.0.2.2635 mod_throttle/3.1.2
P3P: CP="CAO DSP COR CURa PSAa IVDi CONi OUR NOR STP IND PHY ONL UNI PUR COM NAV INT DEM STA",policyref="/w3c/p3p.xml"
Set-Cookie: cookie_test=please_accept_for_session; expires=Sunday, 08-Jan-06 01:53:23 GMT; path=/; domain=greemountainspecialties.com
Connection: close
Content-Type: text/html

 

With force cookies off, I DO get the session_id in the SESSIONS table on the first page load of any page. There is no session_id shown in the URL, but if you view the header that is sent, it looks like this:

 

HTTP/1.1 200 OK
Date: Fri, 09 Dec 2005 01:51:48 GMT
Server: Apache/1.3.34 (Unix) mod_ssl/2.8.25 OpenSSL/0.9.7a FrontPage/5.0.2.2635 mod_throttle/3.1.2
P3P: CP="CAO DSP COR CURa PSAa IVDi CONi OUR NOR STP IND PHY ONL UNI PUR COM NAV INT DEM STA",policyref="/w3c/p3p.xml"
Set-Cookie: osCsid=eb96b9bebeb7df48651ec53936b003e0; path=/; domain=greemountainspecialties.com
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Content-Type: text/html

 

The only difference I see in the two headers is that when we do not force cookies, the session cookie that is sent has the Expires date on the next line, instead of as part of the cookie. On the next page load with force cookies off, you then see the osCsid in the URL, and everything works as it should - no cookie_usage page. So why doesn't it work the same with cookies forced on? It seems to me that I set the cookie, but then it is not recognized on subsequent page loads until you enter a secure session.
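One detail worth a second look in both headers above: the Set-Cookie lines carry domain=greemountainspecialties.com (missing the first "n" of "green"), while the pages are served from greenmountainspecialties.com. A browser discards any cookie whose domain attribute does not cover the request host, so a typo in HTTP_COOKIE_DOMAIN would produce exactly this symptom. A rough Python sketch of that matching rule (simplified; real browsers apply a few more checks):

```python
def domain_matches(request_host, cookie_domain):
    """Loose sketch of the browser's cookie domain-match rule:
    the cookie's domain must equal the request host or be a
    dot-suffix of it; otherwise the cookie is discarded."""
    cookie_domain = cookie_domain.lstrip(".").lower()
    request_host = request_host.lower()
    return request_host == cookie_domain or request_host.endswith("." + cookie_domain)

host = "greenmountainspecialties.com"          # the host the site is served from
cookie_domain = "greemountainspecialties.com"  # the domain attribute in the headers above (note the missing "n")

print(domain_matches(host, cookie_domain))     # False: the browser never stores this cookie
```

If the domains really do differ only by that typo, correcting HTTP_COOKIE_DOMAIN in configure.php should be the first thing to try.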

 

Although I realize I could just not use cookies, I think it is more secure that way (no session ids in the URL for people to share or spiders to find accidentally), and it has now become an insane obsession of mine to get to the bottom of this (and hopefully share the learning with anyone else having similar problems).

 

Thanks again,

 

John


Those are 2 different cookies: the test one from force cookies, and the session id one after it sees that the test cookie was accepted.


Treasurer MFC


I made the change from

if ($sortby) {

to

if ($sortby && $session_started) {

 

and I also got the list headings to be not-sortable, logged in or logged out is same.

 

My admin settings are:

Session Directory /tmp

Force Cookie Use False

Check SSL Session ID False

Check User Agent False

Check IP Address False

Prevent Spider Sessions True

Recreate Session False

 

site is: http://www.sewingprose.com. It's been live for about 6 weeks now. Well, except for "those days". :)

 

please advise; thanks.


Toward Continued Success - - > Carol Hawkey - - > KidsLearnToSew.com - - > Wyoming, USA

Mods Installed - - > Authnet AIM2 - Bundled Products 1.4 - Fancier Invoice 6.1 - Email_HTML_Order_Link_Fixed - Header Tags Controller - Login aLa Amazon - JustOneAttribute - Article Manager - SPPC w/PB - spiders.txt - Dangling Carrot/Olive - Printable Catalog - CCGV(trad)

Planned Mods - - > Purchase Without Account - USPS Label - Ultimate SEO


Carol, you seem to have a lot more problems than just this. If I open your catalog page, I get a page saying that the page is missing. If I click on a category, I get a blank page. I can't see enough of your site to help diagnose this.


You have no idea :(

Thanks for the attempt Steve, but I had a major malfunction about 5 days ago, and it's bad enough that I have to start over from scratch. The backup was corrupt. Actually I think that's what started it: I did the admin/restore, lost TABLE sessions, and everything went downhill from there. Yes, I've tried all the forum's suggestions on what to do for that, but it all failed like quicksand. I'll post again later. Just popped on today to see if there's a thread for what to do when admin/modules.php won't list the left_column for EDIT on only one payment->module.




The MSN spider appears to be creating sessions on my site; I'm not sure whether it's the Ultimate SEO mod I've installed or the spiders.txt.

 

For example, a listing from MSN using site:www.gadget-and-gizmos.co.uk:

 

Brodit Proclip BMW Mini 01-04 Angled mount 15.00 Quick Find Use keywords to find the product you are looking for. Advanced Search Information Terms & Conditions Delivery Links Contact Us New Products Displaying ...

 

www.gadget-and-gizmos.co.uk/products_new.php?osCsid=d97d64567f3591a5175a01855d758fda

 

 

Any ideas?


You don't know if it's creating sessions or if it is holding onto a session from before you added the new spiders.txt, since 2.2-MS2 doesn't include msnbot in its list. You have to study your access logs to see.


stevel

 

I noticed that the agent of a bot can change during indexing. Have you seen this? Here is an example:

 

Host: 66.249.65.232

Http Code: 200 Date: Jan 10 20:50:20 Http Version: HTTP/1.1 Size in Bytes: 32959

Referer: -

Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

 

Host: 66.249.65.232

Http Code: 302 Date: Jan 10 20:53:08 Http Version: HTTP/1.1 Size in Bytes: 5

Referer: -

Agent: Mediapartners-Google/2.1

 

Sometimes they drop the "Agent" too, and I only see it as: Host: crawl-66-249-65-232.googlebot.com

 

Any ideas?

 

Also, on another subject, I noticed a hacker-type visiting my store from a rather peculiar source. This person used Google and searched for: inurl:"php?shop=" site:us

 

Then hit the site adding variables to urls like: ../../../../../../../../../../../../../etc/passwd

 

Nothing seems to have happened, but in the "View Visitors" contribution, only this guy's IP address fails to pop up and show the pages visited. The IP of the hacker is: 85.159.40.1


· willross



Googlebot and Mediapartners-Google are separate bots. Both are represented in spiders.txt. The latter, I think, is used if you participate in Google AdSense. My guess is that your reporting tool is losing the user agent. You should read the actual access log to see what is happening.

 

The "hacker-type" thing is someone using Google to search for potential vulnerabilities. It isn't necessarily a spider and isn't something appropriate for spiders.txt. There's a lot of this sort of thing going on.


stevel

 

I'll look through the raw logs and see. The latter was just another observation while writing the previous post. I didn't know if it was worthy of a new thread, so at least it can be found with a content search. Thanks.



I can't even find Googlebot in the latest spiders.txt update?



OK, I'm back up. Boy-Howdy, I don't want to go through that again!

Referencing my previous post of Dec 26: when I use

if ($sortby && $session_started) {

in functions/general.php,

my product columns are not sortable.

They are sortable if I revert to

if ($sortby) {

 

TIA




Carol,

 

Your site is not setting the session cookie properly. Your HTTP_COOKIE_DOMAIN should be defined as 'sewingprose.com' and nothing else.

 

Why don't you send me e-mail at the address in the spiders.txt readme? I can help you with the $session_started issue, which is not really relevant to this thread (other than that I mentioned it earlier).

Share this post


Link to post
Share on other sites

Ok, my error. You have to make TWO changes to tep_create_sort_heading in general.php. Change this:

  global $PHP_SELF;

  $sort_prefix = '';
  $sort_suffix = '';

  if ($sortby) {

to this:

  global $PHP_SELF;
  global $session_started;

  $sort_prefix = '';
  $sort_suffix = '';

  if ($sortby && $session_started) {


Hi Everyone,

 

I have a couple of questions about these bots and spiders etc.. I have a lot of catching up to do regarding my osCommerce cart and the things going on around us as website and shop owners.. As I've mentioned in other posts recently.. I'm currently redoing my website, and I haven't updated or done anything really since 2002. So yes, it's safe to say.. I'm behind.. LOL..

And as I add contributions and read more and more.. I'm seeing all this write-up about bots and spiders..

 

Okay.. To get to the point here.. What are spiders exactly? What are these bots? And why are they coming to our websites and adding our products to their carts, basically making life miserable for us shop owners.. Because basically, that's how I'm interpreting this.

 

Also, on my shop that is still up currently for public view.. I've checked into admin and clicked on Who's Online, and I think I've actually seen these spiders in action.. I've seen the same IP multiple times with all this stuff in the cart.. And I can't figure out who and what would do this, and why.. I think I'm in the right thread to ask these questions anyway.. LOL And sorry if it sounds stupid.. But like I've said.. I'm a bit behind the times here.. And I'm playing catch-up real quick!

 

So what are they? And why do they do this? Other than educated guessing.. I'm thinking they're doing analyses of websites etc.. But that's all I can think of..

 

Thanks :)

 

Christine


If it ain't broke, don't fix it! :)


Christine,

 

spider = bot = an automated computer program sent out by a search engine. The spider attempts to visit each page on your website; record the URL, title, and some of the content (visible on-page text and meta tags); and send it back to the search engine database. That process is called "indexing".

 

You want this to happen so that the search engines know about your website and you get visitors = customers = sales! However, you want to control the spiders. They often go where you don't want them to go and see what you don't want them to see. You can use the spiders.txt and robots.txt contribs to control them.

 

spiders.txt contrib - copy the spiders.txt file from the contrib to your catalog/includes directory. Make sure that in the admin page, Configuration->Sessions->Prevent Spider Sessions is turned on (true). The contrib has an updated list of the most common spider user agents (the names they give your web server when they visit). With Prevent Spider Sessions on, if the visitor's user agent matches an entry in spiders.txt, it doesn't get an osCsid (session id). No session id means the spider can't add a product to a cart, and it also won't send a session id back to the search engine database.
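Loosely, the check Ed describes can be sketched like this (a Python rendition of the behavior, not the actual osCommerce PHP; the short list here stands in for a real spiders.txt):

```python
def is_spider(user_agent, spider_fragments):
    """Rough sketch of the Prevent Spider Sessions check: if any
    spiders.txt entry appears as a substring of the lowercased
    user agent, treat the visitor as a spider and start no session."""
    ua = user_agent.lower()
    return any(frag and frag in ua
               for frag in (f.strip().lower() for f in spider_fragments))

spiders = ["googlebot", "msnbot", "slurp"]  # stand-ins for real spiders.txt entries

print(is_spider("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", spiders))  # True
print(is_spider("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", spiders))  # False
```

Because the match is a substring test, a single lowercase fragment like "googlebot" covers the many version strings a bot may present.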

 

robots.txt - you didn't ask about it, but it is related and probably needs to be added or updated too. The contrib gives an example file called robots.txt that you put in your catalog/ directory. It lists the names of files and directories that you don't want the spiders to see. So the admin/ directory should be in there, as well as login.php, etc.
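For example, a minimal robots.txt along the lines Ed describes might look like this (the paths are the stock osCommerce ones; adjust them to your own install, and check the contrib's example file for a fuller list):

```text
User-agent: *
Disallow: /admin/
Disallow: /includes/
Disallow: /login.php
Disallow: /account.php
Disallow: /checkout_shipping.php
```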

 

Hope that helps,

ed


Answers to osCommerce's most persistent questions! Tips & Tricks | Configuration | Common Problems.

Seek and ye shall find Contributions.

My Contributions

My Blog


Thank you Ed,

 

The explanation you've provided is very helpful and educational as well.. Made perfect sense, and at least now.. I know what the hell they are and what they want.. LOL :)

 

I do however have the latest file that Steve put out yesterday, and I replaced the spiders.txt file that I already had in my includes folder with it..

 

But you're right, I do NOT have a robots.txt, and I think I will be needing it.. I did a search in the contributions for it.. I got something back with my search efforts.. But I'm not sure if that's an up-to-date file or not.. If you would be so kind, could you direct me to the file that I'll be needing for sure? And it goes in the catalog directory, right? Just double checking here.. :)

 

Thanks again Ed.. You've been a great help!

 

 

Christine

