[Contribution] Spider Session Remover


peterr

So, I'm guessing the last part of the log is what you're looking at?

 

Stuff like this would be normal users:

*Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

*Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

 

Stuff like this would be spiders:

*msnbot/1.0 (+http://search.msn.com/msnbot.htm)

*Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

 

Right? Also, in my cPanel error log I notice a lot of attempted views of robots.txt. I currently don't have one; think I should?

 

And if anybody can answer my .htaccess questions it would be greatly appreciated. Thanks :D

Link to comment
Share on other sites

Bob,

 

Is this support thread still monitored? I see it's been months since the last activity.

 

Why would the thread need to be monitored? If people have no questions, then there _is_ no activity, right? :)

 

Anyhow, I'm wondering a couple of things. First off, what log files are you guys looking at? I've never paid attention to any log files, just webstats periodically. Are you talking about the error logs that can be viewed through cPanel?

 

I see other people have answered your questions about what logs to look at, .... the web server logs, or raw access logs.

 

The only thing/s relevant to look for in the logs, for this contribution, are:

 

Spider entries with the session included.

 

Anything else is irrelevant to this contribution, and should be posted elsewhere. :D

 

Secondly, is the .htaccess file ready to be simply added to my root directory, or are modifications needed? I know I read somewhere that some paths related to my specific URL needed to be entered in the file, but I'm unsure if I need to or not. Could you give a quick example of what changes need to be made to the supplied .htaccess file, if any?

 

Thanks a lot for this contribution. I've been learning a lot lately about search engines, search engine optimization, and now some of the after effects  :-"

 

Between the 'readme', the 'install' file, and the sample .htaccess file that all come with the contribution, there is sufficient information for you to install/use the contribution.

 

However, you need to read the 'readme', etc, and also read the early posts in this thread, to find out if you _really_ do need this contribution.

 

Peter

Link to comment
Share on other sites

Hi,

 

When I download the raw access logs, I get a MS-DOS app.

 

Usually, the name of the web server log file contains the domain name as a suffix. Your website must be a ".com", so the file name ends in .com, which Windows treats as an MS-DOS program. It only _looks_ like an MS-DOS app; it is a plain text log.

 

Just open it in any text editor. Notepad will do; Crimson Editor is free and very good.

 

Peter

Link to comment
Share on other sites

Hi,

 


The web server log entries for spiders should not include a session ID. If none of your spider log entries have a session ID (sid), then you have no need for this contribution, and you will only be complicating matters by using it.

 

As the old saying goes: if it ain't broke, don't fix it.
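
One way to check is to scan the raw access log for spider requests that still carry a session ID. Here is a minimal Python sketch of that check, assuming the log is in Apache's "combined" format (user agent as the last quoted field) and the session parameter is the default osCsid. The function name and the sample lines are made up for illustration.

```python
import re

# Spider names taken from the contribution's sample .htaccess; extend as needed.
SPIDER_RE = re.compile(r"msnbot|slurp|googlebot", re.IGNORECASE)
# Default osCommerce session parameter in the requested URL.
SID_RE = re.compile(r"[?&]osCsid=[0-9a-z]+", re.IGNORECASE)
# Combined log format: ... "REQUEST" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')


def spider_hits_with_sid(lines):
    """Return (request, agent) pairs where a spider requested a URL containing a SID."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not in combined format; skip rather than guess
        request, agent = match.group("request"), match.group("agent")
        if SPIDER_RE.search(agent) and SID_RE.search(request):
            hits.append((request, agent))
    return hits


# Hypothetical sample lines, for illustration only.
sample = [
    '1.2.3.4 - - [01/Jul/2005:10:00:00 +0000] "GET /index.php?osCsid=abc123 HTTP/1.1" 200 512 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '1.2.3.5 - - [01/Jul/2005:10:00:01 +0000] "GET /index.php HTTP/1.1" 200 512 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '1.2.3.6 - - [01/Jul/2005:10:00:02 +0000] "GET /index.php?osCsid=def456 HTTP/1.1" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
]
for request, agent in spider_hits_with_sid(sample):
    print(agent, "->", request)
```

For a real log you could pass an open file object instead of the sample list, since the function only iterates over lines. If it returns nothing, the spiders are not being handed session IDs.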

 

It is not essential that you have a robots.txt, but it is advisable: _most_ spiders/bots look for it, so having one will cut down on the 404 messages. They do not have to 'obey' the rules you place in robots.txt, however _most_ do.

 

This is what we usually put on osCommerce sites.

 

User-Agent: *
Disallow: /login.php
Disallow: /create_account.php
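
If you want to confirm how a well-behaved bot would read those rules before uploading the file, Python's standard urllib.robotparser can parse them. This is only a sketch: the host name is the thread's placeholder, and the helper name is made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /login.php
Disallow: /create_account.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="msnbot"):
    """True if a bot that honours robots.txt may fetch this path."""
    return parser.can_fetch(agent, "http://yourwebsite.com" + path)


for path in ("/index.php", "/login.php", "/create_account.php"):
    print(path, "allowed" if allowed(path) else "disallowed")
```

This only tells you what the file asks for; as noted above, bots are not obliged to obey it.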

 

Peter

Edited by peterr
Link to comment
Share on other sites

Between learning about search engine optimization, spider habits, and now this SID issue, I have a handful of days wrapped up in this. I know for a fact I need this contribution; that isn't the question. I've read everything you've included in your contribution, but I've also read a link to an article you mentioned somewhere on taming the beast.com related to this. It showed examples where you have to include certain paths related to your server, or something of the sort. I'm simply asking if the .htaccess file you included in your contribution can be used as is or if it has to be further modded for use. I've never even opened an .htaccess file before, so it's all Greek to me. Any help would be appreciated, because at this point, even after reading everything available to me, I am lost.

Thanks.

Link to comment
Share on other sites

Hi,

 

1. Do you have 'Prevent Spider Sessions' set in admin?

 

2. Do search results for your website on Google, Yahoo, etc. include the session ID?

 

3. You ask if the .htaccess file supplied can be used as is. Well, it is set up to look for any of:

 

* msnbot

* slurp

* googlebot

 

so you would have to modify it according to the spider/s that are showing session IDs in search engine results for your site.

 

4. Placement, .... it goes in the web root path of course.

 

5. Good article at http://httpd.apache.org/docs-2.0/mod/mod_rewrite.html

 

6. The best advice I can give, for you to be sure that this will work for you, is to set up a 'test' path, which would be a complete copy of your osCommerce (/catalog/ path) files, so that you have:

 

http://yourwebsite.com/test

 

(you will have to mod the 2 configure.php files to point to the test path).

 

then add the .htaccess file in the ../test path, the one from the contribution, and modify this line:

 

RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]

 

to:

 

RewriteCond %{HTTP_USER_AGENT} !(firefox) [NC]

 

The "NC" (no case) flag makes the match case-insensitive. Now, I'm assuming you have Firefox, because anyone who wants a secure browser shouldn't be using IE. :lol:

 

Anyway, you no doubt get the idea, you modify the .htaccess in the .../test path, to reflect the agent name of the browser that you use.

 

Now, you should be ready to test the mod_rewrite, use this URL:

 

http://yourwebsite.com/test/index.php?osCs...7b54174fae9c9b7

 

and if the /test path, and .htaccess have been modified correctly, the URL _should_ be re-written as:

 

http://yourwebsite.com/test/index.php

 

Setting up a "test" path might seem like a bit of work, however you should have one anyway, so that modifications are never done on a live website, they are done in 'test', and only moved to the live site, when appropriate.
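
The user-agent gate being edited in step 6 can be modelled in a few lines of Python, which may help when deciding what to put between the brackets. This is a sketch only: it uses Python's re module rather than Apache's regex engine (the two agree for simple alternations like these), and the function name is made up.

```python
import re


def rewrite_applies(user_agent, names):
    """Model of the gate: the SID-stripping rules run only if the agent matches `names`.

    In the .htaccess, `RewriteCond %{HTTP_USER_AGENT} !(names) [NC]` followed by
    `RewriteRule .* - [S=2]` skips the two stripping rules when the agent does
    NOT match, so the rules apply exactly when this search succeeds.
    """
    return re.search(names, user_agent, re.IGNORECASE) is not None


# User-agent strings quoted earlier in this thread.
FIREFOX = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4"
MSNBOT = "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

print(rewrite_applies(MSNBOT, "msnbot|slurp|googlebot"))   # spider list, bot request
print(rewrite_applies(FIREFOX, "msnbot|slurp|googlebot"))  # spider list, browser request
print(rewrite_applies(FIREFOX, "firefox"))                 # test setting, browser request
```

With the shipped spider list a Firefox request is left alone, which is why the test path swaps in the name of your own browser.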

 

HTH

 

Peter

Link to comment
Share on other sites

Yes to your first two questions.

 

Placement: when you say root, do you mean the main root directory where my main site resides, or the subdirectory (/catalog in my instance) where the actual copy of OSC resides? Sometimes access to my catalog would be through the main site in the root directory, and sometimes it would be through a URL that points directly at the "/catalog" directory, if that makes any difference.

 

As for the test you mentioned, I do like the idea of trying that. I don't have a test copy of OSC; my changes are made to a live site, and if anything goes majorly wrong I have a complete site backup on hand to fix any issues. So I'm guessing it wouldn't hurt anything to try the test on my live site, as I'm only adding/modifying one file. I also use IE (actually a tabbed browser that runs off the IE engine), so what do I add for IE? This?:

 

RewriteCond %{HTTP_USER_AGENT} !(MSIE 6.0) [NC]

 

or just IE? or....?

 

Also, in the .htaccess file you supplied it says:

# This is used with Apache WebServers
#
# For this to work, you must include the parameter 'Options' to
# the AllowOverride configuration
#
# Example:
#
# <Directory "/usr/local/apache/htdocs">
#  AllowOverride Options
# </Directory>

 

Do I need to do something related to that?

 

Thanks again for your time, its appreciated.

Link to comment
Share on other sites

I've been playing around. I had to add the + to "Options +FollowSymLinks" and move it up to get rid of the 403 error. I then changed the

RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]

to:

RewriteCond %{HTTP_USER_AGENT} !(msie) [NC]

and tried accessing the site using a SID. It worked: it eliminated the SID in my browser address bar. Then I tried a spider simulator site:

 

http://www.webconfs.com/search-engine-spider-simulator.php

 

I tried the default spider list that comes in the .htaccess file you supplied, as well as trying "msie" and "mozilla", and that site always returns links with SIDs in them. Any insight as to why?

 

Also, I checked the .htaccess file that's currently on my server. It has some site-specific lines in it. Should I add all those lines into this replacement .htaccess file? Thanks

Link to comment
Share on other sites

Hi,

 

Then I tried a spider simulator site:

 

http://www.webconfs.com/search-engine-spider-simulator.php

 

I tried the default spider list that comes in the .htaccess file you supplied, as well as trying "msie" and "mozilla", and that site always returns links with SIDs in them. Any insight as to why?

 

You will have to look in your web server logs, and find out what agent name the spider simulator uses, then place _that_ agent name in .htaccess.
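
To find that agent name, it can help to tally the user-agent field of the log right after running the simulator. A minimal Python sketch, again assuming Apache's "combined" log format; the agent name "ExampleSimulator/1.0" and the sample lines are invented for illustration, not what the simulator actually sends.

```python
import re
from collections import Counter

# In the "combined" log format the user agent is the last quoted field.
AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def agent_counts(lines):
    """Count requests per user-agent string."""
    counts = Counter()
    for line in lines:
        match = AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines; "ExampleSimulator/1.0" is a made-up agent name.
sample = [
    '1.2.3.4 - - [01/Jul/2005:10:00:00 +0000] "GET /catalog/index.php HTTP/1.1" 200 512 "-" "ExampleSimulator/1.0"',
    '1.2.3.4 - - [01/Jul/2005:10:00:01 +0000] "GET /catalog/index.php HTTP/1.1" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
    '1.2.3.4 - - [01/Jul/2005:10:00:02 +0000] "GET /catalog/login.php HTTP/1.1" 200 512 "-" "ExampleSimulator/1.0"',
]
for agent, count in agent_counts(sample).most_common():
    print(count, agent)
```

Any agent in the tally that is neither your own browser nor a known spider is a candidate for the simulator.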

 

Also, I checked the .htaccess file that's currently on my server. It has some site-specific lines in it. Should I add all those lines into this replacement .htaccess file?

 

The .htaccess file needs to be in the 'web root' path, which is simply the path that is:

 

http://yourwebsite.com

 

So, to answer your question, yes, you would need to 'consolidate' the old .htaccess contents, plus the 'new' (this contrib) .htaccess contents, to create a new .htaccess.

 

You will need to be careful about the order of the commands though. :-"

 

Peter

Link to comment
Share on other sites

Unfortunately I couldn't find anything that looked like a spider in my web log. I'll try it again and download my log immediately afterwards to see if it's on top.

 

Anyhow, what about these lines in the .htaccess file:

 

# This is used with Apache WebServers
#
# For this to work, you must include the parameter 'Options' to
# the AllowOverride configuration
#
# Example:
#
# <Directory "/usr/local/apache/htdocs">
#  AllowOverride Options
# </Directory>

 

Do I need to bother with that?

Link to comment
Share on other sites

After yet more messing around...

 

I see that the .htaccess file in my /catalog/ directory is the same as the bottom half of your .htaccess contribution. You say to place your .htaccess file in the root directory, though, not in the folder which contains OSC. Is this right? The .htaccess file in my main root directory (not where OSC is installed) is totally different.

 

As I said before, I've never dealt with .htaccess files before. From how it appears so far, your contribution should be in the OSC directory, and it looks like you based your contribution on the .htaccess file that resides there after an OSC install. So I'm just making sure I shouldn't put this contribution in my "www.mysite.com/catalog" directory instead of my root directory, as this would save me having to splice my root .htaccess file (which is totally different) with the contribution one.

Link to comment
Share on other sites

Lines in .htaccess beginning with # are comments - they can be removed without effect.

 

You can have nested .htaccess files, it just slows down each page load a bit.

Link to comment
Share on other sites

anybody?

Link to comment
Share on other sites

.htaccess files can be in subfolders - they are used for references in that folder and any underneath. It's ok to have multiple .htaccess files, though it does slow down processing a bit.

Link to comment
Share on other sites

OK, here's my situation. I'd greatly appreciate any help so that I can finish this up.

 

The .htaccess file in my root (www.mydomain.com) directory contains a bunch of other stuff that I'm iffy about merging with this contribution. The directory that my OSC resides in (www.mydomain.com/catalog) ALSO has a .htaccess file, which exactly matches the bottom half of the .htaccess that comes with this contribution. At this point I would rather add the contribution to my /catalog/ directory and not have to worry about botching anything up, unless there's some ill effect this will cause that somebody knows about.

 

I tried this contribution in my "www.mydomain.com/catalog/" directory. The problem is that the .htaccess from this contribution then not only removes the SID, it also removes the "/catalog/" part of the URL, and I get a 404 returned. Is there a way to modify this?

 

Also, a question to Peter: did you make the .htaccess file in this contribution specifically for installs where OSC resides in the root directory, since I see you based it on the stock OSC .htaccess file? Maybe it would be helpful to include a modified .htaccess file for people who have OSC installed in a different directory and have them install the contribution there?

 

Anyhow, if anybody who's familiar with .htaccess files can make this NOT strip the "/catalog/" from my URLs, only the SID, I'd greatly appreciate it. :thumbsup:

Link to comment
Share on other sites

OK, semi-figured it out:

 

RewriteBase /catalog

RewriteBase /catalog/

 

Both of those work; is one preferred?
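
As for why the "/catalog/" part vanished before: in a per-directory .htaccess, mod_rewrite strips the directory prefix before matching and then puts the RewriteBase in front of a relative substitution, and the contribution's file sets "RewriteBase /". The Python sketch below is a deliberately simplified model of that prefixing, not Apache's actual code, and the helper name is made up.

```python
import posixpath


def redirect_path(rewrite_base, relative_target):
    """Simplified model: prefix a relative substitution with RewriteBase."""
    return posixpath.join(rewrite_base, relative_target)


# The .htaccess lives in /catalog/, so the rule sees "index.php" (directory prefix stripped).
print(redirect_path("/", "index.php"))          # /index.php  (the /catalog/ part is lost)
print(redirect_path("/catalog", "index.php"))   # /catalog/index.php
print(redirect_path("/catalog/", "index.php"))  # /catalog/index.php
```

In this model both spellings of the base give the same path, which matches what I saw on the server.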

 

Also, when I test using it to strip the SID from my browser:

RewriteCond %{HTTP_USER_AGENT} !(msie) [NC]

it gives a page error if I go to any page that is SSL. Now, I realize it doesn't matter for testing, but I'm wondering if that will cause any issues when it happens to the spiders. Do they even access secure pages? Using this contribution will return them an error (404, I think). Should I care? Thanks guys

Link to comment
Share on other sites

  • 2 weeks later...

Hey, thanks for all the help (or lack of it in some cases). My search engine links lost the SIDs. The only side effect I noticed is that I went from first page ranking to... not ranked. If anybody knows why that might be, I'm willing to listen.

 

Should I just leave this modded .htaccess file in forever or eventually swap it back with the original? Does it hurt anything to leave it in?

Link to comment
Share on other sites

  • 2 months later...

Hi,

 

I have a few msn.com results that show the osCsid, so I have done the following.

 

I took my current htaccess and at the very end I pasted the one in the contribution.

 

I have pasted my new htaccess file below. Could someone please take a look at it and tell me if this should fix my issue with MSN and osCsid?

 

Thank you for your time. It is very much appreciated. :)

 

My new htaccess looks like this:

 

# $Id: .htaccess,v 1.3 2003/06/12 10:53:20 hpdl Exp $
#
# This is used with Apache WebServers
#
# For this to work, you must include the parameter 'Options' to
# the AllowOverride configuration
#
# Example:
#
# <Directory "/usr/local/apache/htdocs">
# AllowOverride Options
# </Directory>
#
# 'All' with also work. (This configuration is in the
# apache/conf/httpd.conf file)

# The following makes adjustments to the SSL protocol for Internet
# Explorer browsers

<IfModule mod_setenvif.c>
<IfDefine SSL>
SetEnvIf User-Agent ".*MSIE.*" \
nokeepalive ssl-unclean-shutdown \
downgrade-1.0 force-response-1.0
</IfDefine>
</IfModule>

# Fix certain PHP values
# (commented out by default to prevent errors occuring on certain
# servers)

#<IfModule mod_php4.c>
# php_value session.use_trans_sid 0
# php_value register_globals 1
#</IfModule>
#
#
#
# Spider Fix Added below
#
#
#
# $Id: .htaccess,v 1.3 2003/06/12 10:53:20 hpdl Exp $

# Set some options
Options -Indexes
Options FollowSymLinks

RewriteEngine on
RewriteBase /
#
# Skip the next two rewriterules if NOT a spider
RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]
RewriteRule .* - [s=2]
#
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&osCsid=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&osCsid=[0-9a-z]+$|^osCsid=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]

#
# This is used with Apache WebServers
#
# For this to work, you must include the parameter 'Options' to
# the AllowOverride configuration
#
# Example:
#
# <Directory "/usr/local/apache/htdocs">
# AllowOverride Options
# </Directory>
#
# 'All' with also work. (This configuration is in the
# apache/conf/httpd.conf file)

# The following makes adjustments to the SSL protocol for Internet
# Explorer browsers

<IfModule mod_setenvif.c>
<IfDefine SSL>
SetEnvIf User-Agent ".*MSIE.*" \
nokeepalive ssl-unclean-shutdown \
downgrade-1.0 force-response-1.0
</IfDefine>
</IfModule>

# Fix certain PHP values
# (commented out by default to prevent errors occuring on certain
# servers)

#<IfModule mod_php4.c>
# php_value session.use_trans_sid 0
# php_value register_globals 1
#</IfModule>
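
For anyone wanting to see what the two query-string conditions in the file above do to a given URL, here is a rough Python model of them. It is a sketch under assumptions: it uses Python's re module instead of Apache's regex engine, it ignores the spider gate and the redirect itself, and the function name is made up.

```python
import re

# The two RewriteCond patterns from the file above, both used with [NC].
BOTH_SIDES = re.compile(r"^(.+)&osCsid=[0-9a-z]+&(.+)$", re.IGNORECASE)
ONE_SIDE = re.compile(r"^(.+)&osCsid=[0-9a-z]+$|^osCsid=[0-9a-z]+&?(.*)$", re.IGNORECASE)


def rewritten_query(query):
    """Return the query string this model would redirect to, or None if no rule matches."""
    match = BOTH_SIDES.match(query)
    if match:
        return match.group(1) + "&" + match.group(2)  # substitution uses %1&%2
    match = ONE_SIDE.match(query)
    if match:
        return match.group(1) or ""  # substitution uses only %1
    return None


for query in ("cPath=22&osCsid=abc123&sort=2a", "cPath=22&osCsid=abc123",
              "osCsid=abc123", "osCsid=abc123&cPath=22", "cPath=22"):
    print(repr(query), "->", repr(rewritten_query(query)))
```

In this model the fourth case comes back empty: when the SID is the first parameter, the parameters after it land in the second capture group, while the rule substitutes only %1. It may be worth checking on a test path whether Apache behaves the same way on your server.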

I find the fun in everything.

Link to comment
Share on other sites

  • 1 year later...

Hi there,

I just tried to add the contents of this contribution's .htaccess file to my root .htaccess file, but I'm getting a 403/404 error. Any idea what I'm doing wrong? I tried the text at the top and bottom of the file but no luck.

 

I need this contribution as Google and MSN have managed to get some session IDs, keep visiting with them, and they are indexed on search engines. They must have got them before I knew about preventing session IDs for robots. Damn, why is that not set correctly on a default install?

 

My spiders.txt and robots.txt are up to date and I'm preventing spider sessions in admin.

 

How can I get this contribution to work?

Do I also need Enigmas Session Regeneration contribution too?

 

Thanks folks

Tiger

I'm feeling lucky today......maybe someone will answer my post!

I do try and answer a simple post when I can just to give something back.

------------------------------------------------

PM me? - I'm not for hire

Link to comment
Share on other sites

I finally got this working after much searching and am posting to help others with the same problem:

 

I just added this part of the mod to my htaccess file

# Skip the next two rewriterules if NOT a spider
RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]
RewriteRule .* - [S=2]

# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&osCSid=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&osCSid=[0-9a-z]+$|^osCSid=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]

 

I used this post removing session ID

 

 

 

Still think I need Enigmas Session Regeneration Mod though???

Cheers

Tiger

Link to comment
Share on other sites
