Re: Question about web spiders...

From: Peter Kupfer <peter.kupfer@sbcglobal.net>
Date: Sun, 26 Jun 2005 23:14:46 -0500
Message-ID: <42BF7D36.7040302@sbcglobal.net>
To: Lachlan Hunt <lachlan.hunt@lachy.id.au>
CC: www-html@w3.org

Lachlan Hunt wrote:
> Peter Kupfer wrote:
> 
>> Lachlan Hunt wrote:
>>
>>> The correct way to control the way a spider indexes your site is to 
>>> use robots.txt, assuming the spider in question implements it.
>>
>> In a robots.txt file can you control specifically what links a spider 
>> will follow on a certain page,
> 
> No, it controls which pages on a server the spider can access.
> 
>>  or just that it won't go to a certain page.
> 
> Essentially, yes.

This is what I thought, so, as you concluded, a robots.txt won't fix my 
problem here. :(
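
The point above, that robots.txt decides which URLs a spider may fetch rather than which links it follows from a given page, can be sketched with Python's standard `urllib.robotparser`. This is only an illustration: the rules, the host `example.org`, and the spider name `ExampleSpider` are assumptions, not anything taken from the site discussed in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the path and user agent are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The answer depends only on the target URL, never on the page that linked to it.
home_allowed = parser.can_fetch("ExampleSpider", "http://example.org/index.html")
private_allowed = parser.can_fetch("ExampleSpider", "http://example.org/private/page.html")

print("home page allowed:", home_allowed)
print("private page allowed:", private_allowed)
```

Because the check takes only a user agent and a target URL, there is no way to say "do not follow this link from the home page, but do fetch its target when reached from elsewhere".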

>> I want the spider to eventually hit each subdomain, just not from the 
>> home page. Can I have it start at each subdomain index?
> 
> Then HTML is the wrong place to specify such behaviour and robots.txt is 
> probably not suitable for you either.  HTML is designed to mark up the 
> semantics of the document's content by saying *what* the content is, not 
> describe how the content should be processed by a particular UA.  Having 
> said that though, processing instructions [1] are designed to supply 
> system specific information, but I don't know how suitable they would be 
> for your particular needs.

Fair enough.

> 
> I don't understand why it matters which path is followed to reach 
> subdomains, but I think you need to find a way to configure the robot 
> itself, not try to give it instructions from within the documents it reads.

The service I use, freefind, builds a site map, and how that site map is 
displayed depends on the path the spider takes through the site.
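
As a rough illustration of why the crawl path can change a generated site map, here is a toy breadth-first crawl. It is not how freefind works internally (that is not described in this thread); the page names and link graph are hypothetical.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["sub1", "sub2"],
    "sub1": ["sub1/a", "home"],
    "sub2": ["sub2/a", "home"],
    "sub1/a": [],
    "sub2/a": [],
}


def crawl_tree(start, links):
    """Breadth-first crawl; records the page from which each page was first reached."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:  # first discovery decides the parent
                parent[target] = page
                queue.append(target)
    return parent


from_home = crawl_tree("home", LINKS)
from_sub1 = crawl_tree("sub1", LINKS)

print(from_home)
print(from_sub1)
```

Starting at `home` places both subdomain indexes under the home page, while starting at `sub1` places `home` under `sub1`, so a map drawn from these parent links would look different in each case.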

>>> nofollow was discussed quite extensively on this list when Google
>>> introduced it and the vast majority of this community rejected it.
>>
>> I tried to search the archive, but didn't see it there, why was no 
>> follow rejected?
> 
> Then you didn't look very hard.  A search for "nofollow" in the archives 
> reveals most of the thread, appearing just below the messages from this 
> thread.  For your convenience, it actually started with a message on 
> www-html-editor [2|3], with most of the followup discussion on www-html 
> [4].
> 
> [1] http://www.is-thought.co.uk/book/sgml-8.htm#PI 
> [2] http://lists.w3.org/Archives/Public/www-html-editor/2005JanMar/0010 
> [3] http://lists.w3.org/Archives/Public/www-html-editor/2005JanMar/thread#10 
> [4] http://lists.w3.org/Archives/Public/www-html/2005Jan/thread#64 

Perhaps. I searched for no follow, with a space and without quotes, and I 
got subjects like "XML tags are just a cheap rip-off of PHP tags" and 
"DC in XHTML2", and other things that were not what I wanted. I will go 
back and search for "nofollow"; it didn't occur to me to leave out the space.

Thanks!


-- 
Peter Kupfer
peschtra@yahoo.com
Received on Monday, 27 June 2005 04:14:52 GMT