IIS Index Server - Indexing my blog


Author Indexing my blog
Hollis D. Paul

2005-02-20, 6:18 pm

I am trying to index my blog at http://msmvps.com/OBTS/ using my
SharePoint portal crawler. I have created an external content source
that points to this URL. The blog consists of support sequences that I
have extracted from various newsgroups. I don't want the homepage
included because that has all of the current month's entries
concatenated, so a hit on that page doesn't narrow things down much.
Nor do I want the monthly archives, for the same reason. I just want
the individual blog entries included in the archive, so that the result
set of a search has the URL to the blog entry.

The URLs to various pages are:
home page: http://www.msmvps.com/OBTS

monthly summaries: http://www.msmvps.com/OBTS/archive/2005/02.aspx --
with the year and the month numbers being changed to fit the date.

The blog entries:
http://www.msmvps.com/OBTS/archive/...2/18/36388.aspx -- with the
numbers reflecting the date, and an individual number on the file name.
(These aren't the numbers from a real entry on my blog.)

So, in the include/exclude list I have the following:

http://www.msmvps.com/ include
http://www.msmvps.com/obts/archive/*/*/*/*.aspx include
http://www.msmvps.com/obts/archive/*.aspx exclude

When I reset the index, and did a full update, with a source content
group that has two blogs in it, I get one page from this
blog--http://www.msmvps.com/obts/ .
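
To see how the three rules above overlap, here is a rough Python sketch. It is not SharePoint's matcher. It assumes that `*` can match any run of characters including `/` (which is how Python's fnmatch behaves), that matching is case-insensitive, and that a rule with no wildcard acts as a prefix. The blog entry URL is a made-up example with hypothetical numbers.

```python
from fnmatch import fnmatchcase

# Rules exactly as listed in the post, in the same order.
RULES = [
    ("http://www.msmvps.com/", "include"),
    ("http://www.msmvps.com/obts/archive/*/*/*/*.aspx", "include"),
    ("http://www.msmvps.com/obts/archive/*.aspx", "exclude"),
]

def matching_rules(url, rules=RULES):
    """Return every (pattern, action) pair that matches url.

    Assumptions (not confirmed SharePoint behaviour): case-insensitive
    matching, '*' may span '/', and a pattern without '*' is a prefix.
    """
    url = url.lower()
    hits = []
    for pattern, action in rules:
        p = pattern.lower()
        if "*" in p:
            matched = fnmatchcase(url, p)
        else:
            matched = url.startswith(p)
        if matched:
            hits.append((pattern, action))
    return hits

HOME = "http://www.msmvps.com/obts/"
MONTHLY = "http://www.msmvps.com/OBTS/archive/2005/02.aspx"
# Hypothetical entry URL, numbers invented for illustration only.
ENTRY = "http://www.msmvps.com/obts/archive/2005/01/15/12345.aspx"

for url in (HOME, MONTHLY, ENTRY):
    print(url, [action for _, action in matching_rules(url)])
```

Under those assumptions an individual entry matches the exclude rule as well as both include rules, so the outcome depends on how the crawler resolves conflicting rules.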

Others on the server have been given a JavaScript snippet that sends
the search string out to Google, and Google can find the blog entry
pages. What do I need to do to get SharePoint to find them?

Hollis D. Paul [MVP - Outlook]
Mukilteo, WA USA


Hollis D. Paul

2005-02-20, 6:18 pm

In article
<VA.00001f8b.012fa805@obts-outlookdev.outlookbythesound.mukwoods>,
Hollis D. Paul wrote:
> http://www.msmvps.com/obts/archive/*/*/*/*.aspx include
>

I just noticed a note in the Admin Guide, on the page about
including/excluding content, to the effect that aspx pages are not
included by default. How do I change this for this particular content
source?

Aspx is included as an indexed file type on that page in the WSS
central administration pages.

Hollis D. Paul [MVP - Outlook]
Hollis@outlookbythesound.com
Mukilteo, WA USA


Hilary Cotter

2005-02-20, 6:18 pm

Hi Hollis

I think this is more of a Sharepoint question.

Hilary



Hollis D. Paul

2005-02-20, 6:18 pm

In article <OQgiF3tFFHA.4052@TK2MSFTNGP14.phx.gbl>, Hilary Cotter
wrote:
> I think this is more of a Sharepoint question.
>

Yes. Once I noted that aspx was excluded by default, the game
here was up. I'll post back when I find out.

Hollis D. Paul [MVP - Outlook]
Hollis@outhousebythesound.com
Mukilteo, WA USA


Hollis D. Paul

2005-02-23, 6:00 pm

In article <OQgiF3tFFHA.4052@TK2MSFTNGP14.phx.gbl>, Hilary Cotter
wrote:
> I think this is more of a Sharepoint question.
>

Well, I haven't been able to get SharePoint Search to crawl the site
yet, but I have been able to put a Google search control in the
announcement area. Works like a charm. More importantly, I have the
Google search control on my workstation's home page, and it will bring
back search result sets from the blog.

Surely, if Google can crawl the pages, so can SharePoint! One just
needs to tell it with the right inflection?

Hollis D. Paul [MVP - Outlook]
Hollis@outhousebythesound.com
Mukilteo, WA USA


real_xetrov

2005-03-07, 7:09 pm

have you tried changing the order of the rules?

maybe try putting it in this order:

http://www.msmvps.com include
http://www.msmvps.com/obts/archive/*.aspx exclude
http://www.msmvps.com/obts/archive/*/*/*/*.aspx include

this is similar to the default config where the root "/" is excluded, but pages within site are included
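
Why the order might matter can be sketched in Python. This is only an illustration under assumptions, not SharePoint's documented behaviour: it assumes `*` can span `/`, matching is case-insensitive, a rule without a wildcard is a prefix, and the crawler resolves conflicts either by first match or by last match. The entry URL uses invented numbers.

```python
from fnmatch import fnmatchcase

def matches(url, pattern):
    # Assumed semantics: case-insensitive, '*' spans '/', no '*' means prefix.
    url, pattern = url.lower(), pattern.lower()
    if "*" not in pattern:
        return url.startswith(pattern)
    return fnmatchcase(url, pattern)

def decide(url, rules, policy):
    """Return the action of the first or last matching rule, or None."""
    hits = [action for pattern, action in rules if matches(url, pattern)]
    if not hits:
        return None
    return hits[0] if policy == "first" else hits[-1]

ORIGINAL = [
    ("http://www.msmvps.com/", "include"),
    ("http://www.msmvps.com/obts/archive/*/*/*/*.aspx", "include"),
    ("http://www.msmvps.com/obts/archive/*.aspx", "exclude"),
]
REORDERED = [
    ("http://www.msmvps.com", "include"),
    ("http://www.msmvps.com/obts/archive/*.aspx", "exclude"),
    ("http://www.msmvps.com/obts/archive/*/*/*/*.aspx", "include"),
]

MONTHLY = "http://www.msmvps.com/OBTS/archive/2005/02.aspx"
# Hypothetical entry URL, numbers invented for illustration only.
ENTRY = "http://www.msmvps.com/obts/archive/2005/01/15/12345.aspx"

for name, rules in (("original", ORIGINAL), ("reordered", REORDERED)):
    for policy in ("first", "last"):
        print(name, policy,
              "entry:", decide(ENTRY, rules, policy),
              "monthly:", decide(MONTHLY, rules, policy))
```

In this toy model only the last-match policy is sensitive to the reordering. If the real crawler uses some other policy, changing the order would make no difference.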


Hollis D. Paul

2005-03-08, 2:48 am

In article <real_xetrov.1lk0p2@mail.webservertalk.com>, Real_xetrov
wrote:
> maybe try putting it in this order:
>
> http://www.msmvps.com include
> http://www.msmvps.com/obts/archive/*.aspx exclude
> http://www.msmvps.com/obts/archive/*/*/*/*.aspx include
>

Alas, this did not work, either.

Hollis D. Paul [MVP - Outlook]
Hollis@outhousebythesound.com
Mukilteo, WA USA


Hollis D. Paul

2005-03-24, 5:52 pm

In article <OQgiF3tFFHA.4052@TK2MSFTNGP14.phx.gbl>, Hilary Cotter
wrote:
> I think this is more of a Sharepoint question.
>

Turns out that it really isn't. Below is my latest, partially
successful attempt, made after the MSDN helper gave up. What values
should I set the page depth and hop limit to, to remain within my blog?

I did do some testing afterward, and my pages are indeed there, but
really hidden in all the other stuff.

*************************************
Wrong conclusion!! Specifically in light of the initial statement that
both Google and MSN were searching the blog.

So, to try something different, I deleted the source and re-created it
as http://www.msmvps.com/OBTS/ . This is the same as the original
definition except there is a final slash.

Then I changed my rules to:
www.msmvps.com ; included
http://www.msmvps.com/obts/* ; included
http://www.msmvps.com/obts/archive/*/*/*/*.aspx ; included
http://www.msmvps.com/* ; exclude
http://www.msmvps.com/obts/archive/*.aspx ; exclude

That didn't get me any pages, so I went into the source properties,
unchecked the "crawl all pages on this site" option, changed it to a
custom selection, and put in a page depth of 10.

That didn't help, so I changed the hop limit to 2. Below you will see
the end of the gatherer log before I managed to uncheck all the logging
options. As you can see, it is going all over. So, how should those two
parameters be set so that I just get the pages on my blog? It was still
indexing, and the page count had grown to 7092 by the time the stop
took effect.

So, what should those two parameters be set at to restrict the crawl to
my blog?
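
As a way to reason about the two settings, here is a toy Python crawl over a made-up link graph. It assumes that "page depth" counts links followed from the start address and that "hop limit" counts how many times the crawl may move to a different host. Those meanings, the `crawl` function, and the link graph are assumptions for illustration, not SharePoint internals.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: which pages link to which.
LINKS = {
    "http://www.msmvps.com/obts/": [
        "http://www.msmvps.com/obts/archive/2005/02.aspx",
        "http://www.datalan.com/",
    ],
    "http://www.msmvps.com/obts/archive/2005/02.aspx": [
        # Invented entry URL, for illustration only.
        "http://www.msmvps.com/obts/archive/2005/01/15/12345.aspx",
    ],
    "http://www.datalan.com/": ["http://www.parallelspace.com/"],
}

def crawl(start, max_depth, max_hops):
    """Breadth-first crawl bounded by link depth and host changes."""
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, depth so far, hops so far)
    while queue:
        url, depth, hops = queue.popleft()
        for link in LINKS.get(url, []):
            changed_host = urlsplit(link).hostname != urlsplit(url).hostname
            new_depth = depth + 1
            new_hops = hops + int(changed_host)
            if link in seen or new_depth > max_depth or new_hops > max_hops:
                continue
            seen.add(link)
            queue.append((link, new_depth, new_hops))
    return seen

START = "http://www.msmvps.com/obts/"
print(sorted(crawl(START, max_depth=10, max_hops=2)))  # wanders off-site
print(sorted(crawl(START, max_depth=10, max_hops=0)))  # stays on one host
```

In this model a hop limit of 2 lets the crawl reach sites two hosts away from the blog, while a hop limit of 0 keeps it on the starting host regardless of page depth. Whether the real settings behave the same way would need checking against the product documentation.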

Gatherer log at end of logging:

3/23/2005 3:36:57 PM Add http://www.datalan.com
The address has been redirected to http://www.datalan.com/

3/23/2005 3:36:57 PM Add http://www.datalan.com
Done

3/23/2005 3:36:57 PM Add http://www.parallelspace.com
The address has been redirected to http://www.parallelspace.com/

3/23/2005 3:36:57 PM Add http://www.parallelspace.com
Done (The document contains invalid utf-8 encoded characters)

3/23/2005 3:36:57 PM Add
http://blog.u2u.info/DottextWeb/pat...ve/2004/10.aspx
Done

3/23/2005 3:36:57 PM Add
http://blog.seattlepi.nwsource.com/...ves/004519.html
Done

3/23/2005 3:36:57 PM Add
http://support.microsoft.com/Person...MyProducts.aspx
Links from this address were excluded because the page contains a
META NAME="ROBOTS" tag



3/23/2005 3:36:56 PM Add
http://google.blognewschannel.com/wp-login.php
The address was excluded because its file extension is restricted
in the file type rules.
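
A quick way to see how far the crawl strayed is to tally host names in the gatherer log. The Python sketch below assumes the layout shown above: a timestamp followed by "Add", with the address either on the same line or on the next one. The `hosts_in_log` helper is hypothetical, and the sample text is copied from the log excerpt in this post.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Timestamp, the word "Add", then an optional address on the same line.
ENTRY = re.compile(r"^\d+/\d+/\d+ \d+:\d+:\d+ [AP]M Add\s*(\S*)$")

SAMPLE = """\
3/23/2005 3:36:57 PM Add http://www.datalan.com
Done

3/23/2005 3:36:57 PM Add
http://blog.u2u.info/DottextWeb/pat...ve/2004/10.aspx
Done

3/23/2005 3:36:56 PM Add
http://google.blognewschannel.com/wp-login.php
The address was excluded because its file extension is restricted
in the file type rules.
"""

def hosts_in_log(text):
    """Count log entries per host name."""
    counts = Counter()
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        m = ENTRY.match(line)
        if not m:
            continue
        # If the address is not on the "Add" line, take the next line.
        url = m.group(1) or (lines[i + 1] if i + 1 < len(lines) else "")
        host = urlsplit(url).hostname
        if host:
            counts[host] += 1
    return counts

for host, n in hosts_in_log(SAMPLE).most_common():
    print(n, host)
```

Any host other than www.msmvps.com in the tally is a page the crawl picked up outside the blog.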

*************************************

Hollis D. Paul [MVP - Outlook]
Hollis@outhousebythesound.com
Mukilteo, WA USA



Copyright 2003 - 2008 webservertalk.com