RE: New Spider


mugitty

I am using both spider killers, yours and the html_output version. Neither one seems to be catching inktomi. I will check that link.

 

thanks


this is the list I am using:

 

$spiders = array(
  "almaden.ibm.com", "appie", "arachnophilia", "arale", "inktomi", "araneo",
  "architext", "aretha", "ariadne", "arks", "aspider", "atn", "atomz",
  "auresys", "backrub", "bigbrother", "bjaaland", "blackwidow", "asterias2.0",
  "ahoy", "AlkalineBOT", "Anthill", "augurfind", "baiduspider", "blindekuh",
  "Bloodhound", "Ukonline", "borg-bot", "brightnet", "bspider",
  "cactvschemistryspider", "calif", "cassandra", "cgireader", "checkbot",
  "christcrawler", "churl", "cienciaficcion", "cmc", "Collective", "combine",
  "conceptbot", "CoolBot", "cosmos", "cruiser", "cusco", "cyberspyder",
  "deweb", "dienstspider", "digger", "diibot", "directhit", "dnabot",
  "download_express", "dragonbot", "dwcp", "e-collector", "ebiness", "eit",
  "elfinbot", "emacs", "emcspider", "esther", "fastcrawler", "roadrunner",
  "bannana_bot", "bdcindexer", "docomo", "fast-webcrawler", "frooglebot",
  "geobot", "googlebot", "henrythemiragorobot", "infoseek", "sidewinder",
  "lachesis", "mercator", "moget/1.0", "nationaldirectory-webspider",
  "naverrobot", "ncsa beta", "netresearchserver", "ng/1.0", "osis-project",
  "polybot", "pompos", "scooter", "inktomisearch", "seventwentyfour",
  "slurp/si", "slurp", "[email protected]", "steeler/1.3", "szukacz", "teoma",
  "turnitinbot", "vagabondo", "zao/0", "zyborg/1.0", "semanticdiscovery/0.1",
  "an-zyborg-g01.looksmart.com"
);

 

EDITED: to fit the screen. Thanks, Linda
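
For anyone wondering how a list like this gets used: the sketch below is my own guess at the general idea, not code taken from either spider killer. It assumes the check is a plain substring match of each list entry against the visitor's user agent. The function name ua_matches_spider is made up for illustration.

```php
<?php
// Hypothetical helper, not taken from either contribution:
// returns true when any entry of $spiders occurs in the user agent.
function ua_matches_spider(string $userAgent, array $spiders): bool
{
    foreach ($spiders as $needle) {
        // stripos() is case-insensitive, so mixed-case entries such as
        // "AlkalineBOT" or "CoolBot" still match.
        if ($needle !== '' && stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Shortened list for illustration; the entries come from the list above.
$spiders = array("googlebot", "inktomi", "slurp", "teoma", "CoolBot");

var_dump(ua_matches_spider("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", $spiders));
var_dump(ua_matches_spider("Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 5.0)", $spiders));
```

If your installed version lowercases the user agent and then does a case-sensitive compare, the mixed-case entries in the list would never match, so it may be worth checking. Under substring matching, very short entries such as "atn", "eit" or "cmc" could also match inside unrelated user agents.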


  • 2 weeks later...

In theory, Burt's is better than Ian's ... :shock:

 

However, Ian or Burt can better explain the differences than I can.

 

There were some situations where Ian's caused erratic behavior.


Oh ok, I thought I'd read somewhere that while Ian's in theory was better Burt's doesn't have that aforementioned prob...

 

What is this talk about user agents anyway? I vaguely understand the concept: each spider sends it to the web server at the time of the web page request? Does everyone send a user agent? Could we just make a list of user agents that people use, and block anything that doesn't use these "good" user agents?

 

(I tried looking up in google exactly how it worked but wasn't very successful)

 

Thanks

- - - -

Sometimes, ignorance is bliss.


Ian's has a problem under https connections, where Burt's does not 8)

 

Ian is aware of the issue

 

Thanks, Jeff ... I knew it was something along those lines, but I hate to state things incorrectly. 8)


A 'user agent' is simply a string the client sends with each page request to say what the 'user' is. Search engine spiders usually identify themselves there by name (googlebot, for example), so they are easy to spot.

 

there is a visitor status contribution that shows

 

Id. + Last Click + Access + IP Address + Browser Language + Site Language + Entry URI + Referer

 

the 'browser/language' field uses the user agent variable to display info on the user
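
To make the user agent idea concrete, here is a minimal sketch of reading it in PHP. The helper name current_user_agent is hypothetical. The header is optional and the client can put anything it likes in it, so not everyone sends one and it cannot be fully trusted.

```php
<?php
// Hypothetical helper: returns the User-Agent header of the current
// request, or an empty string when the client did not send one.
function current_user_agent(array $server): string
{
    return isset($server['HTTP_USER_AGENT']) ? (string) $server['HTTP_USER_AGENT'] : '';
}

// In a real page you would pass $_SERVER; fixed arrays are used here
// so the example also runs from the command line.
$browser  = array('HTTP_USER_AGENT' => 'Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 5.0)');
$noHeader = array();

echo current_user_agent($browser), "\n";
echo current_user_agent($noHeader) === '' ? "(no user agent sent)\n" : "unexpected\n";
```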


So there are too many "good" user agents?

 

I was thinking maybe one could pass parts that would be the users for sure, you know, all the browsers that are compatible with osCommerce.

 

As an example, if some browser gave off the mozilla code instead of, say, googlebot, then allow session ids.

 

What about storing the session ID in a mysql variable. Does that remove it from the hyperlink?

 

and Linda, you NEVER sleep.
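
One snag with a "good browsers only" list, sketched below under the assumption that the check is a substring match on tokens like "mozilla": many spiders also put "Mozilla" in their user agent, so they would pass the whitelist as well. The function name looks_like_browser and the token list are made up for illustration.

```php
<?php
// Hypothetical whitelist check: treat anything containing a known
// browser token as a human visitor.
function looks_like_browser(string $userAgent, array $browserTokens): bool
{
    foreach ($browserTokens as $token) {
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

$browserTokens = array('mozilla', 'opera');

$human  = 'Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 5.0)';
$spider = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

var_dump(looks_like_browser($human, $browserTokens));  // the browser passes
var_dump(looks_like_browser($spider, $browserTokens)); // the spider passes too
```

So a whitelist alone would probably not be enough; the spider list would still have to be checked first. As for storing sessions in MySQL: as far as I understand it, that only changes where the session data is kept. The ID itself still has to travel by cookie or in the link.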

- - - -

Sometimes, ignorance is bliss.


I am using Burt's SID killer. I tried the spider simulator mentioned earlier in the thread and found that it only shows the text of the index.html page. I do have an index.html page as a front door to my index.php page. When I use the URL mrsfieldsgoodies.com/index.php, it comes up with the text of that page. I have allprods; isn't that supposed to be what the spider "sees"?

These are the results I get when using mrsfieldsgoodies.com/index.php

 

Spidered Text :

Mrs Fields Goodies Top ? Catalog My Profile | Cart Contents | Checkout Categories African/African American HeritagBirds->Business and TravelCandles->Celebrate AmericaCherubs and ChildrenCountry DecorDecorative ClocksDevine InspirationDistinctive Oil BurnersDoll CollectionDolphin PlatesFar Eastern treasuresFashion WatchesFine JewelryFlowers and VasesGarden DecorGlass Creations->Glow in the DarkGolf NoveltiesGone FishinHome Decor->Hong TzeIncense and MoreJewlery Boxes and Cork SculptureKnifes and SwordsLiberty BronzeMajolica StyleMandarin IvoryMedieval LegendsMetal WorksMiniaturesMirrorsNative American HeritageNatures BeautyNautical NoveltiesNight LightsOcean AcrobaticsPhoto FramesToys and GamesUnicorn FantasyUnique MusicalsWater FountainsWhimsical WildlifeWindchimesWorks of Art Products Rooster Wall Plaque$9.95 Bill Blass Luggage Set$349.95 Clear Glass Carved Elephant$32.95 Alabastrite Praying Angel Photo Frame$12.95 Wood Candleholder with Tealights$14.95 Stained Glass Butterfly Wind Chime$12.95 Play-Doh on Keychain$4.99 Spun Glass Angel With Gold Wings$19.95 Earthworm Cat and Fishbowl With Fish$8.95 Patchwork Elephant - American Flag$12.95 Scented Pillar Designer Candle - Ginger & Lily$9.95 10K - Gold Lady's Sapphire Diamond Ring$49.95 Upcoming Products Date Expected Wood Candleholder with Tealights 06/14/2003 Ebony-Look African Mask 05/18/2003 9-Piece Ceramic Mini Cups and Saucers Set 05/10/2003 Spun Glass Sail Boat With Blue Base 05/08/2003 10-Piece Porcelain Mini Tea Set 05/05/2003 Distressed White Metal Chandelier Candle Holder 05/02/2003 Angel Capiz Tea Light Holder 04/30/2003 18" Porcelain Victorian Doll - Rebecca 04/29/2003 18" Porcelain Doll - Maria Isabel 04/29/2003 12-Function Camping Knife 04/29/2003 Login Here E-Mail address: Password:

 

Spidered Links :

http://mrsfieldsgoodies.com/index.php

http://mrsfieldsgoodies.com

http://mrsfieldsgoodies.com/index.php

https://host42.ipowerweb.com/~mrsfield//account.php

http://mrsfieldsgoodies.com/shopping_cart.php

https://host42.ipowerweb.com/~mrsfield//che...ut_shipping.php

http://mrsfieldsgoodies.com/index.php/cPath/58

http://mrsfieldsgoodies.com/index.php/cPath/55

http://mrsfieldsgoodies.com/index.php/cPath/34

http://mrsfieldsgoodies.com/index.php/cPath/72

http://mrsfieldsgoodies.com/index.php/cPath/42

http://mrsfieldsgoodies.com/index.php/cPath/64

http://mrsfieldsgoodies.com/index.php/cPath/67

http://mrsfieldsgoodies.com/index.php/cPath/38

http://mrsfieldsgoodies.com/index.php/cPath/66

http://mrsfieldsgoodies.com/index.php/cPath/21

http://mrsfieldsgoodies.com/index.php/cPath/60

http://mrsfieldsgoodies.com/index.php/cPath/78

http://mrsfieldsgoodies.com/index.php/cPath/48

http://mrsfieldsgoodies.com/index.php/cPath/26

http://mrsfieldsgoodies.com/index.php/cPath/25

http://mrsfieldsgoodies.com/index.php/cPath/31

http://mrsfieldsgoodies.com/index.php/cPath/70

http://mrsfieldsgoodies.com/index.php/cPath/62

http://mrsfieldsgoodies.com/index.php/cPath/44

http://mrsfieldsgoodies.com/index.php/cPath/41

http://mrsfieldsgoodies.com/index.php/cPath/40

http://mrsfieldsgoodies.com/index.php/cPath/75

http://mrsfieldsgoodies.com/index.php/cPath/51

http://mrsfieldsgoodies.com/index.php/cPath/46

http://mrsfieldsgoodies.com/index.php/cPath/49

http://mrsfieldsgoodies.com/index.php/cPath/35

http://mrsfieldsgoodies.com/index.php/cPath/56

http://mrsfieldsgoodies.com/index.php/cPath/69

http://mrsfieldsgoodies.com/index.php/cPath/50

http://mrsfieldsgoodies.com/index.php/cPath/45

http://mrsfieldsgoodies.com/index.php/cPath/80

http://mrsfieldsgoodies.com/index.php/cPath/32

http://mrsfieldsgoodies.com/index.php/cPath/76

http://mrsfieldsgoodies.com/index.php/cPath/52

http://mrsfieldsgoodies.com/index.php/cPath/53

http://mrsfieldsgoodies.com/index.php/cPath/39

http://mrsfieldsgoodies.com/index.php/cPath/24

http://mrsfieldsgoodies.com/index.php/cPath/54

http://mrsfieldsgoodies.com/index.php/cPath/22

http://mrsfieldsgoodies.com/index.php/cPath/27

http://mrsfieldsgoodies.com/index.php/cPath/43

http://mrsfieldsgoodies.com/index.php/cPath/59

http://mrsfieldsgoodies.com/index.php/cPath/23

http://mrsfieldsgoodies.com/index.php/cPath/33

http://mrsfieldsgoodies.com/index.php/cPath/71

http://mrsfieldsgoodies.com/index.php/cPath/61

http://mrsfieldsgoodies.com/product_info.p...roducts_id/1082

http://mrsfieldsgoodies.com/product_info.p...roducts_id/1082

http://mrsfieldsgoodies.com/product_info.p...products_id/827

http://mrsfieldsgoodies.com/product_info.p...products_id/827

http://mrsfieldsgoodies.com/product_info.p...roducts_id/1041

http://mrsfieldsgoodies.com/product_info.p...roducts_id/1041

http://mrsfieldsgoodies.com/product_info.p...products_id/943

http://mrsfieldsgoodies.com/product_info.p...products_id/943

http://mrsfieldsgoodies.com/product_info.p...products_id/113

http://mrsfieldsgoodies.com/product_info.p...products_id/113

http://mrsfieldsgoodies.com/product_info.p...products_id/670

http://mrsfieldsgoodies.com/product_info.p...products_id/670

http://mrsfieldsgoodies.com/product_info.p.../products_id/43

http://mrsfieldsgoodies.com/product_info.p.../products_id/43

http://mrsfieldsgoodies.com/product_info.p...products_id/190

http://mrsfieldsgoodies.com/product_info.p...products_id/190

http://mrsfieldsgoodies.com/product_info.p...products_id/164

http://mrsfieldsgoodies.com/product_info.p...products_id/164

http://mrsfieldsgoodies.com/product_info.p...products_id/838

http://mrsfieldsgoodies.com/product_info.p...products_id/838

http://mrsfieldsgoodies.com/product_info.p...products_id/611

http://mrsfieldsgoodies.com/product_info.p...products_id/611

http://mrsfieldsgoodies.com/product_info.p...products_id/280

http://mrsfieldsgoodies.com/product_info.p...products_id/280

http://mrsfieldsgoodies.com/product_info.p...products_id/688

http://mrsfieldsgoodies.com/product_info.p...roducts_id/1274

http://mrsfieldsgoodies.com/product_info.p...products_id/893

http://mrsfieldsgoodies.com/product_info.p...roducts_id/1196

http://mrsfieldsgoodies.com/product_info.p...products_id/894

http://mrsfieldsgoodies.com/product_info.p...products_id/254

http://mrsfieldsgoodies.com/product_info.p...products_id/472

http://mrsfieldsgoodies.com/product_info.p...products_id/945

http://mrsfieldsgoodies.com/product_info.p...products_id/946

http://mrsfieldsgoodies.com/product_info.p...products_id/980

http://mrsfieldsgoodies.com/shipping.php

http://mrsfieldsgoodies.com/privacy.php

http://mrsfieldsgoodies.com/conditions.php

http://mrsfieldsgoodies.com/contact_us.php

http://mrsfieldsgoodies.com/allprods.php

http://mrsfieldsgoodies.com/catalog_produc...with_images.php

http://mrsfieldsgoodies.com/advanced_search.php

https://www.paypal.com/xclick/business=chfi...rrency_code=EUR

http://www.oscommerce.com


  • 2 weeks later...

Hello there,

I'm using Stuart Owens's mod for session killing for spiders. I believe it's working fine. I am also using the 'User tracking with admin' mod to track visitors on my site. In the logs I found 2 to 3 entries from spiders. I found 1 entry from TEOMA.COM and 1 from inktomisearch.com.

I believe I need to update the code provided by Stuart to handle these spiders. Has anybody updated their code with the latest and most important spiders? If yes, then please share it.

 

Thanks n Regards,

Jack


  • 2 weeks later...

I went to one of the spider test sites.

 

Here is an example of the output:

 

Spidered Links :

http://mywebsite.com/allprods.php

http://mywebsite.com/advanced_search.php

http://mywebsite.com/default.php/cPath/30

http://mywebsite.com/default.php/cPath/23

http://mywebsite.com/default.php/cPath/63

http://mywebsite.com/default.php/cPath/24

http://mywebsite.com/default.php/cPath/91

http://mywebsite.com/default.php/cPath/48

http://mywebsite.com/default.php/cPath/22

http://mywebsite.com/default.php/cPath/35

http://mywebsite.com/default.php/cPath/69

http://mywebsite.com/default.php/cPath/78

http://mywebsite.com/default.php/cPath/80

https://my-secure-site.com/ssl/account.php/...a34026722bcd729

http://mywebsite.com/shopping_cart.php

https://my-secure-site.com/ssl/checkout_pay...a34026722bcd729

http://mywebsite.com/products_new.php

http://mywebsite.com/product_info.php/products_id/419

http://mywebsite.com/product_info.php/products_id/419

http://mywebsite.com/shipping.php

http://mywebsite.com/pdf.php

http://mywebsite.com/privacy.php

http://mywebsite.com/conditions.php

http://mywebsite.com/contact_us.php

http://mywebsite.com/gv_redeem.php

http://mywebsite.com/affiliate_info.php

https://my-secure-site.com/ssl/affiliate_af...a34026722bcd729

http://mywebsite.com/newsletter.php

http://mywebsite.com/default.php/cPath/22

http://mywebsite.com/default.php/cPath/63_77

http://mywebsite.com/default.php/cPath/63

http://mywebsite.com/default.php/cPath/24

http://mywebsite.com/default.php/cPath/23

http://mywebsite.com/default.php/cPath/30

http://mywebsite.com/default.php/cPath/35_90

http://mywebsite.com/default.php/cPath/22_47

http://mywebsite.com/product_info.php/cPat...products_id/248

http://mywebsite.com/default.php/cPath/55

http://mywebsite.com/default.php/cPath/48

http://mywebsite.com/default.php/cPath/22_58

http://mywebsite.com/product_info.php/cPat...products_id/176

http://mywebsite.com/default.php/cPath/69

http://mywebsite.com/default.php/cPath/35

http://mywebsite.com/default.php/cPath/22_67

http://mywebsite.com/default.php/cPath/22

 

I am using Burt's SID Killer.

 

It looks like links that go to my secure server are getting SIDs. Is there some way to stop this?

 

Best Regards, Ted
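
One possible explanation, offered as an assumption and not as a reading of Burt's code: a session cookie set for mywebsite.com is not sent to my-secure-site.com, so a link builder has to put the session ID into links that cross from one host to the other. If that branch does not consult the spider check, those links keep their SIDs. The sketch below uses a made-up function build_link and a made-up session name and value.

```php
<?php
// Hypothetical link builder; names and parameters are illustrative and
// are not the store's real tep_href_link().
function build_link(string $page, bool $toSecureHost, bool $onSecureHost,
                    bool $isSpider, string $sid): string
{
    $base = $toSecureHost ? 'https://my-secure-site.com/ssl/' : 'http://mywebsite.com/';
    $link = $base . $page;

    // The cookie set on one host is not sent to the other, so the ID
    // has to ride in the URL when the link changes host.
    $crossesHost = ($toSecureHost !== $onSecureHost);

    // Skipping the ID for spiders has to happen here as well,
    // otherwise cross-host links keep their SIDs.
    if ($crossesHost && !$isSpider && $sid !== '') {
        $link .= '?' . $sid;
    }
    return $link;
}

$sid = 'osCsid=abc123'; // made-up session name and value

echo build_link('account.php', true, false, false, $sid), "\n"; // browser: SID added
echo build_link('account.php', true, false, true,  $sid), "\n"; // spider: no SID
```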


For the last hour or so I have been reading these spider killer threads.

 

I have installed Henri's contribution, painless, thanks. I did not add Burt's mod to html_output.php as I kinda assumed that one was enough. Please let me know if you think not.

 

I then went to http://www.webconfs.com/search-engine-spid...r-simulator.php and spidered the default.php file. It does turn up a link with a SID. I traced this to the catalog link on the breadcrumb trail in the header bar. Now I don't know where to begin to check whether that has used the tep_href_link function which I guess it needs to for the spider killer to work.

 

I also went to the who's online page and it showed two guests from the appropriate IP address still there.

 

PS These threads have been a real eye opener. I didn't even think that this could be a problem. Thanks!

K

.....................................................................

When the going gets tough,

the tough get going.


  • 1 month later...
...I then went to http://www.webconfs.com/search-engine-spid...r-simulator.php and spidered the default.php file. It does turn up a link with a SID.

After struggling with this for a while, I discovered something that isn't being made very clear in a lot of posts here, probably because it's obvious to many and time-consuming to explain.

 

Depending on the spider simulator you use, even after you make the suggested changes to your html_output.php file and check it with a spider simulator, you may still see links with session IDs. Why? The idea is to disable the SID for web spiders, not web browsers, and some spider simulators don't pass along a user-agent that matches something in your list of spiders.

 

For example, the simulator mentioned above seems to mask itself as a spider in my list, since it does not show any SIDs when testing my site. However, another spider simulator showed that I do have SIDs, because it's just passing the same user-agent as the browser I'm using. In that case, I'm supposed to see the SIDs, since I want SIDs for human beings using web browsers.

 

To make sure your change works regardless of the spider simulator you're using, look at the user-agent listed when you submit a site using the SearchEngineWorld tester. You should see something like "Mozilla," "Netscape," or whatever browser you happen to be using. Enter a word from this line in lower case (e.g., "mozilla" or "netscape") into the list of spiders in your modified html_output.php file. Make sure your cookies are turned off, and then test your site again using the above link. You should see that the SIDs are now gone, which means that your modification was successful. Be sure to go back and remove the "mozilla" (or whatever) entry you just added, or else legitimate customers may have trouble using your site.

 

I'm not sure if I've clarified things here or just made them muddier, but hopefully this will help others. I know I was confused until I discovered what was going on.
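
If you would rather not edit your spider list just to test, another route is to request the page yourself with a chosen user agent and look for session IDs in the links. This is only a sketch: fetch_as and links_with_sid are names I made up, and the session name "osCsid" is an assumption, so substitute whatever your store uses.

```php
<?php
// Hypothetical test helper: fetch a page while announcing a chosen
// user agent. Needs allow_url_fopen; it is not called in this example.
function fetch_as(string $url, string $userAgent)
{
    $context = stream_context_create(array(
        'http' => array('user_agent' => $userAgent),
    ));
    return file_get_contents($url, false, $context);
}

// Return every href value that contains the session name.
// "osCsid" is assumed here; use the name your store is configured with.
function links_with_sid(string $html, string $sessionName = 'osCsid'): array
{
    preg_match_all('/href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    return array_values(array_filter($matches[1], function ($href) use ($sessionName) {
        return stripos($href, $sessionName) !== false;
    }));
}

// Canned HTML so the example runs without a network connection.
$html = '<a href="http://mywebsite.com/shipping.php">Shipping</a>'
      . '<a href="https://my-secure-site.com/ssl/account.php?osCsid=abc123">Account</a>';

print_r(links_with_sid($html));
```

Fetching the same URL once with a spider name from your list and once with a browser user agent, then comparing the two results, should show whether the SIDs disappear only for the spider.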


  • 4 weeks later...

I'm using Burt's SID killer and it's working great, but I notice that I still get SIDs in the navigation bar. They are coming from the header.php and footer.php files. I don't think they are filtered through the html_output.php file. Is anyone else getting these?


Yes, I figured this out yesterday too (inktomi came): :(

 

I initialized the spiderkiller too late.

 

I wrote in the installation guide that you should include spider_configure.php at the end of application_top.php.

This is unfortunately too late, because the links in the navigation bar have already been written by then.

The installation guide should have said:

 

- add in your application_top.php after

// include server parameters

require('includes/configure.php');

 

// Spiderkiller

require(DIR_WS_INCLUDES . 'spider_configure.php');

 

Sorry for this.

Henri
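
A toy illustration of why the include position matters, assuming the link builder consults a flag that the spider detection sets. None of these names come from the contribution; ToyState and toy_link are made up, as is the session name and value.

```php
<?php
// Toy model of the ordering problem; none of these names come from
// the contribution itself.
final class ToyState
{
    public static $spiderDetected = false; // what the link builder consults
}

function toy_link(string $page): string
{
    return ToyState::$spiderDetected ? $page : $page . '?osCsid=abc123';
}

// Link built before the detection code has run: the SID is still added.
$navLinkEarly = toy_link('index.php');

// Detection runs (this stands in for including spider_configure.php).
ToyState::$spiderDetected = true;

// Link built afterwards: no SID.
$navLinkLate = toy_link('index.php');

echo $navLinkEarly, "\n";
echo $navLinkLate, "\n";
```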


  • 3 weeks later...

Hello... I have added header controller, admin... and now Burt's SID killer...

 

This is what I get when testing on the spider test:

 

Status 200 (return error code 0)

Spider url http://www.joekilo.com

User Agent Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 5.0) 213.122.44.194

Referrer http://www.searchengineworld.com/cgi-bin/s.../sim_spider.cgi

Spider title Untitled Document

Spider meta desc No description available.

Spider meta keywords

 

It's stating "Untitled Document". Is this because I am not yet on a search engine? I did submit my URL to Google, but it's not showing there yet either. Any helpers, please?

jk
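
A guess at what is going on here: the simulator's "Spider title" most likely comes straight from the page's own title tag, not from any search engine listing, so "Untitled Document" would mean the HTML still carries a placeholder title. The sketch below shows that kind of extraction; page_title is a made-up name and the real simulator may work differently.

```php
<?php
// Hypothetical helper: pull the <title> text out of a page, roughly
// the way a spider simulator might.
function page_title(string $html): string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return '';
}

$html = '<html><head><title>Untitled Document</title></head><body></body></html>';
echo page_title($html), "\n";
```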


AHHHHHHHHHH, I have tried to install the code for the SID killer. I followed it word for word and pasted what I needed to where I needed to, and it's still popping up errors. Is there a way I can download a good html_output.php? Or can someone send me one already done?


Archived

This topic is now archived and is closed to further replies.
