Reality Tech

SharePoint insights, real world experience

Limiting Search Crawling to a subsite

I had an interesting challenge: I was asked to limit Search Crawling to a single subsite.  The underlying issue was that a great deal of security in this farm was implemented via Audiences, which is not a secure method of locking down content. Audiences target documents and items to users, but don’t prevent a user from actually accessing those documents or items.  Search Content Sources expect nice, simple Web Application URLs to crawl.  So how best to restrict crawling to a subsite?

The simple answer is to set up the Content Source to crawl the whole Web Application, then add Crawl Rules to exclude everything else.  Only two rules are needed:

  1. Include: List the site to include, such as http://sharepoint/sites/site1/site2*.*
    Note the *.* at the end, which ensures all sub-content is crawled.  As the first crawl rule, this one takes precedence over the next.
    When you test the crawl rule, a trailing * on its own appears to capture all content, but at crawl time only *.* captures content with a file extension.
  2. Exclude: List everything else: http://*.*
    This will exclude anything not captured in the first rule.
  3. If you have a content source that includes people (sps3://sharepoint), be sure to use a wildcard on the protocol as well.
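
To make the ordering concrete, here is a minimal Python sketch of "first matching rule wins" evaluation. It is only a model: the function name `evaluate` and the use of Python's glob-style `fnmatchcase` are my assumptions for illustration, not SharePoint's actual matcher, and it does not reproduce the crawl-time difference between * and *.* described above.

```python
from fnmatch import fnmatchcase

# Ordered (action, pattern) pairs, mirroring the two rules above.
# The first rule whose pattern matches decides the outcome.
RULES = [
    ("include", "http://sharepoint/sites/site1/site2*.*"),
    ("exclude", "http://*.*"),
]

def evaluate(url, rules=RULES):
    """Return the action of the first matching rule, or None if no rule matches.

    Hypothetical helper: matching is plain glob matching, done
    case-insensitively by lowercasing both the URL and the pattern.
    """
    candidate = url.lower()
    for action, pattern in rules:
        if fnmatchcase(candidate, pattern.lower()):
            return action
    return None

print(evaluate("http://sharepoint/sites/site1/site2/docs/report.docx"))  # include
print(evaluate("http://sharepoint/sites/other/page.aspx"))               # exclude

# Reversing the order lets the broad exclude rule win first.
print(evaluate("http://sharepoint/sites/site1/site2/docs/report.docx",
               list(reversed(RULES))))                                   # exclude
```

In this model, swapping the two rules excludes the subsite as well, which is why the include rule has to come first. Use the rule tester in Central Administration to confirm how SharePoint itself evaluates a given URL.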

Voila!

2 responses to “Limiting Search Crawling to a subsite”

  1. Matthew Lamb July 12, 2012 at 5:27 am

    This rule didn’t seem to work for me : http://*.*
    When I test it using a URL from the main site collection (e.g. http://intranet/something), it doesn’t identify it, so it doesn’t get excluded.
    I think you have to specify more of the URL: http://intranet/*.*

    • Joel Plaut July 12, 2012 at 5:38 am

      For me, the host header wasn’t required. Please do check the sequence of the rules: crawl rules are evaluated in the sequence you specify. On the right-hand side you can change the sequence. Then, at the top, try a sample URL to see how the rules evaluate.
