Introduction
This document highlights possible causes of poor indexing performance on SharePoint 2007 farms and provides options and recommendations for further tuning.
Content Host - This is the server that hosts/stores the content that the indexer is crawling. For example, if a content source crawls a SharePoint site, the content host would be the web front-end server that hosts the site.
MSSDMN.EXE - When a crawl is running, this process is visible in Task Manager on the indexer. This process is called the "Search Daemon". When a crawl is started, this process is responsible for connecting to the content host (using the appropriate protocol handler and IFilter), requesting content from the host, and crawling that content. The Search Daemon has the biggest impact on the indexer in terms of resource utilization.
MSSearch.exe - Once the MSSDMN.EXE process is done crawling the content, it passes the crawled content on to MSSearch.exe (this process also runs on the indexer and is visible in Task Manager during a crawl). MSSearch.exe does two things: it writes the crawled content index to the disk of the indexer, and it passes the metadata properties of documents discovered during the crawl to the back-end database. Unlike the crawled content index, which is stored on the physical disk of the indexer, crawled metadata properties are stored in the database.
SQL Server (Search Database) - The search database stores information such as the current status of crawls and the metadata properties of documents and list items that are discovered during the crawl.
There are a number of reasons why crawls could be taking longer than expected.
The indexer has to request content from the content host in order to crawl it. If the host is responding slowly, no matter how powerful the indexer is, the crawl will run slowly.
Collect performance counter data from the indexer. Performance data should be collected for at least 10% of the estimated crawl duration.
Monitor the following performance counters:
· \\Office Server Search Gatherer\Threads Accessing Network
· \\Office Server Search Gatherer\Idle Threads
The "Threads Accessing Network"
counter shows the number of threads on the indexer that are waiting on the
content host to return the requested content. A consistently high value of this
counter indicates that the crawl is starved by a "hungry host".
The indexer also uses the database during the indexing process. If the database server becomes too slow to respond, crawl durations will suffer. To see if the crawl is starved by a slow or busy database server, use the following performance counters:
· \\Office Server Search Archival Plugin\Active Docs in First Queue
· \\Office Server Search Archival Plugin\Active Docs in Second Queue
Once MSSDMN.EXE is done crawling content, it passes it to MSSearch.exe, which then writes the index to disk and inserts the metadata properties into the database. MSSearch.exe queues all the documents whose metadata properties need to be inserted into the database. Consistently high values of these two counters indicate that MSSearch.exe has requested the metadata properties to be inserted into the database, but the requests are sitting in the queue because the database is too busy. Ideally, you should see the number of documents in the first queue increasing and then dropping to 0 at regular intervals (every few minutes).
Tuning
If there are consistently high values of the two performance counters mentioned above and the database server is dedicated to MOSS, consider optimization at the application level, such as enabling the disk-based (BLOB) cache (http://office.microsoft.com/en-us/sharepoint-server-help/configure-disk-based-cache-settings-HA010176284.aspx).
On the indexer, monitor the performance counter \\Process\% Processor Time with the mssdmn.exe instance selected.
If mssdmn.exe is consistently at high CPU utilization, consider changing the indexer performance level from "Maximum" to "Partly Reduced". If there are other bottlenecks (memory, disk, etc.), analyze the performance data to reach a conclusion. This article explains how you can determine potential bottlenecks on your server.
See Central Administration > Operations > Services on Server > Office SharePoint Server Search Service Settings.
Start with a crawler impact rule to "request 64 documents at a time" in order to maximize the number of threads the crawler can use to index content, with the ultimate goal of increasing crawl speed.
Resource usage on the indexer, dedicated front-end, and search database boxes should be monitored, and if the crawler generates too much activity, the crawler impact rule should be tuned by decreasing the parallelism.
· Reduced: Total number of threads = number of processors, Max Threads/host = number of processors
· Partly Reduced: Total number of threads = 4 times the number of processors, Max Threads/host = 16 times the number of processors
· Maximum: Total number of threads = 4 times the number of processors, Max Threads/host = 16 times the number of processors (threads are created at HIGH priority)
NOTE: It is a good idea to open Perfmon and look at the gatherer statistics while indexing. There is a statistic called Performance Level that reflects the actual level the indexer is running at, where 5 = maximum and 3 = reduced. Even if you set everything to maximum, the indexer may decide to run at a reduced level.
Another aspect of the crawling process is that when the filter daemons (mssdmn.exe) use too much memory, they are automatically terminated and restarted. There is, of course, a wind-up time when this happens, which can slow down your crawl. The default quota is quite low (around 100 MB), so it is easy to trip when filtering large files. You can, and should, increase the memory allocation by adjusting the following registry values (a service restart is required for the changes to take effect):
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager
· FolderHighPriority: Represents the number of high-priority folders that can be processed at one time. If this is too high, the cache in the daemons will constantly run out of space. If this is too low, the crawl will be throttled waiting for more items to process.
· FilterProcessMemoryQuota: Represents how much memory can be consumed by the search daemon process before it gets killed by the crawler. The out-of-box default was chosen based on 4 GB of memory on the indexer. If the indexer has more RAM, you can increase this value to cache more data during the crawl.
· DedicatedFilterProcessMemoryQuota: Same as FilterProcessMemoryQuota, except that this quota applies to the single-threaded (dedicated) filter daemons.
As an example, if the indexer box is 64-bit with 16 GB of RAM, the following values have been tested successfully:
FolderHighPriority: 500
FilterProcessMemoryQuota: 208857600
DedicatedFilterProcessMemoryQuota: 208857600
Review the information architecture of
the content to be indexed.
Identify one subset of content where
most of the changes happen (“fresh” content), and another subset that is mostly
static (archived or older content).
Configuring more than one content source
to target the “fresh” content areas separately from the “static” areas will
provide more flexibility for crawl scheduling.
Multiple content sources will also mitigate the impact of a long-running crawler operation (like a full crawl) in terms of the latency for fresh content to appear in search results: you can selectively activate crawling on the desired content sources only, postpone less important crawl activities to off-peak hours, and so on.
SharePoint gives priority to the first running crawl, so if you are already indexing one system, it will hold up the indexing of a second and increase crawl times.
- Solution: Schedule your crawls so there is no overlap. Full crawls take the longest, so run those exclusively.
The Adobe PDF IFilter can only filter one file at a time, which slows crawls down, and it has a high reject rate for newer PDFs.
- Solution: Use a commercial PDF IFilter, e.g. Foxit.
There is a registry setting that controls the number of times a file is retried on error. Because the default is 100, this can severely slow down incremental crawls. The retry count can be adjusted with this value:
- HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager: set DeleteOnErrorInterval = 4 (Decimal)
Ensure that you have at least 2 GB of free memory available before your crawl even starts, and that you have at least two physical processors available.
If you execute a full crawl that overlaps with normal business hours, it could have an adverse effect on SharePoint users if the crawl server is also used for other services such as the query or WFE roles. Central Administration on the application server will be impacted, and if the query role has not been moved to the WFEs, the Search Center results page will also perform badly during crawl operations (even an incremental crawl can have this impact if a lot of content has changed since the last crawl).
Run a 64-bit OS.
You should exclude certain folders from antivirus scanning. If you do not exclude these folders, you can experience many unexpected issues.
You may have to configure the antivirus software to
exclude the "Drive:\Program Files\Microsoft Office Servers"
folder from antivirus scanning for SharePoint Server 2007. If you do not want
to exclude the whole "Microsoft Office Servers" folder from antivirus
scanning, you can exclude only the following folders:
· Drive:\Program Files\Microsoft Office Servers\12.0\Data (This folder is used for the indexing process. If the index files are configured to reside in a different folder, you also have to exclude that location.)
· Drive:\Program Files\Microsoft Office Servers\12.0\Logs
· Drive:\Program Files\Microsoft Office Servers\12.0\Bin
Note If you have Microsoft Office SharePoint Server 2007, these folders should be excluded in addition to the folders listed in the WSS 3.0 section below.
Note When you install SharePoint
Server 2007 or when you apply a hotfix to an existing installation of
SharePoint Server 2007, you may have to disable the real-time option of the
antivirus program. Or, you may have to exclude the Drive:\Windows\Temp
folder from antivirus scanning if it is required.
You may have to configure the antivirus software to
exclude the following folders and subfolders from antivirus scanning.
Note The Drive placeholder represents the drive letter in which
you have installed Windows SharePoint Services 3.0 or SharePoint Server 2007.
Typically, this drive letter is C.
· Drive:\Program Files\Common Files\Microsoft Shared\Web Server Extensions
If you do not want to exclude the whole "Web Server Extensions" folder from antivirus scanning, you can exclude only the following two folders:
o Drive:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\Logs
o Drive:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\Data\Applications
Note The Applications folder must be excluded only if the computer is
running the Windows SharePoint Services Search service. If the folder that
contains the index file is located in some other location, you must also
exclude that folder.
· Drive:\Windows\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files
Note If you are running a 64-bit version of Windows, you should also exclude the following directory:
o Drive:\Windows\Microsoft.NET\Framework64\v2.0.50727\Temporary ASP.NET Files
· Drive:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config
· Drive:\Windows\Temp\WebTempDir
Note The WebTempDir folder is a replacement for the FrontPageTempDir folder.
· Drive:\Documents and Settings\<the account that the search service is running as>\Local Settings\Temp\
· With 64-bit Windows Server 2008, this will be Drive:\Users\<the account that the search service is running as>\AppData\Local\Temp\
Note The search account creates a folder in the "gthrsvc Temp" folder to which it periodically needs to write.
· Drive:\WINDOWS\system32\LogFiles
· With 64-bit Windows Server 2008, this will be Drive:\Windows\Syswow64\LogFiles
Note If you use a specific account for SharePoint services or application pool identities, you may also have to exclude the following folders:
o Drive:\Documents and Settings\ServiceAccount\Local Settings\Application Data
o With 64-bit Windows Server 2008, this will be Drive:\Users\ServiceAccount\AppData\Local
o Drive:\Documents and Settings\ServiceAccount\Local Settings\Temp
o With 64-bit Windows Server 2008, this will be Drive:\Users\ServiceAccount\AppData\Local\Temp
· Drive:\Documents and Settings\Default User\Local Settings\Temp
· With 64-bit Windows Server 2008, this will be Drive:\Users\Default\AppData\Local\Temp
Plan the SQL Server configuration in order to scale to large numbers of items in the index (5 million+). Consider separating the following:
o Indexes for the SharedServices1_Search database
o Temp and system databases/tables
o Transaction log for the SharedServices1_Search database
o Table content for the SharedServices1_Search database
Host the corresponding files on different sets of disks, to keep the crawl and query loads segregated and minimize I/O contention. Note, however, that it makes no sense to split the tables if you are not able to physically host the two filegroups on different sets of disks. A sketch of adding a separate filegroup follows.
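As a minimal sketch of this layout (the database name SharedServices1_Search_DB, the filegroup name CrawlFG, the logical file name, the E: path, and the sizes are all illustrative assumptions), a secondary filegroup backed by a file on a separate set of disks could be added like this:

-- Sketch only: add a filegroup on a separate volume so that selected tables and
-- indexes can be hosted away from the primary data files.
-- All names, the E:\ path, and the sizes below are assumptions; adjust them to the environment.
ALTER DATABASE [SharedServices1_Search_DB]
    ADD FILEGROUP [CrawlFG]
GO
ALTER DATABASE [SharedServices1_Search_DB]
    ADD FILE
    (
        NAME = N'SharedServices1_Search_Crawl',
        FILENAME = N'E:\MSSQL\Data\SharedServices1_Search_Crawl.ndf',
        SIZE = 10GB,
        FILEGROWTH = 10%
    )
    TO FILEGROUP [CrawlFG]
GO

Selected indexes can then be rebuilt onto the new filegroup (for example with CREATE INDEX ... WITH (DROP_EXISTING = ON) ON [CrawlFG]) so that crawl-heavy and query-heavy objects end up on different sets of disks.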
The following query can be used on SQL Server 2005 or
higher to obtain all the indexes with a fragmentation level higher than 10%:
USE [SharedServices1_Search_DB]   -- the SSP search database; adjust the name to your environment

DECLARE @currentDbId int
SELECT @currentDbId = DB_ID()

SELECT DISTINCT
    i.name,
    st.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats (@currentDbId, NULL, NULL, NULL, 'SAMPLED') st
    INNER JOIN sys.indexes AS i
        ON st.object_id = i.object_id
        AND st.index_id = i.index_id
WHERE st.avg_fragmentation_in_percent > 10
Indexes returned by the above query should be defragmented. Rebuild the indexes on the SharedServices1_Search database - especially the indexes on the MSSDocProps table (a sketch follows).
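As a minimal sketch (the database name and the dbo schema are assumptions; MSSDocProps is the table called out above), the rebuild could look like this:

-- Sketch: rebuild all indexes on the MSSDocProps table of the SSP search database.
-- The database name and dbo schema are assumptions. For moderate fragmentation
-- (roughly 10-30%), ALTER INDEX ... REORGANIZE is a lighter-weight alternative.
USE [SharedServices1_Search_DB]
GO
ALTER INDEX ALL ON [dbo].[MSSDocProps] REBUILD
GO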
SQL Server maintenance plans should
be configured with the following guidelines:
Search Database
· Check database integrity using the 'DBCC CHECKDB WITH PHYSICAL_ONLY' syntax to reduce the overhead of the command (see the sketch after this list). This should be run on a weekly basis during off-peak hours. Any error returned by DBCC should be analyzed and resolved proactively. The full 'DBCC CHECKDB' command should be run at a lower frequency (e.g. once per month) to provide deeper analysis.
· Do not shrink the Search database.
· Index defragmentation should be executed following the recommendation above.
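A minimal sketch of the weekly integrity check described above (the database name is an assumption):

-- Sketch: lightweight weekly integrity check of the SSP search database;
-- run the full DBCC CHECKDB (without PHYSICAL_ONLY) less frequently, e.g. monthly.
DBCC CHECKDB (N'SharedServices1_Search_DB') WITH PHYSICAL_ONLY
GO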
Content Databases
· Check Database Integrity
o Include indexes
· Shrink Database
o Shrink database when it goes beyond: maximum expected size of your content database + 20%
o Amount of free space to remain after shrink: 10%
o Return freed space to operating system
· Reorganize Index
o Compact large objects
o Change free space per page percentage to: 70%
· Maintenance Cleanup Task
Avoid autogrowth events for content databases by pre-setting the size to the maximum expected size (the ALTER DATABASE … MODIFY FILE … SIZE property). Configure the autogrowth values to a fixed percentage (e.g. 10%) instead of a fixed amount of space, as sketched below.
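As a hedged sketch (the database name WSS_Content, the logical file name, and the 100 GB target size are illustrative assumptions; the 10% growth value follows the guideline above):

-- Sketch: pre-size a content database data file to its expected maximum size
-- and set autogrowth to a fixed percentage instead of a fixed amount of space.
-- Database name, logical file name, and target size are assumptions.
ALTER DATABASE [WSS_Content]
    MODIFY FILE (NAME = N'WSS_Content', SIZE = 100GB, FILEGROWTH = 10%)
GO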
Establish a baseline for crawl performance. Schedule crawls such
that the number of crawls running at the same time does not exceed the
"healthy baseline".
To maintain an enterprise environment where SharePoint is a business-critical application, the correct hardware needs to be in place.
Reasons for a search services administrator
to do a full crawl include:
· One or more hotfixes or service packs were installed on servers in the farm. See the instructions for the hotfix or service pack for more information.
· An SSP administrator added a new managed property.
· To re-index ASPX pages on Windows SharePoint Services 3.0 or Office
SharePoint Server 2007 sites.
The crawler cannot discover when
ASPX pages on Windows SharePoint Services 3.0 or Office SharePoint Server 2007
sites have changed. Because of this, incremental crawls do not re-index views
or home pages when individual list items are deleted. We recommend that you
periodically do full crawls of sites that contain ASPX files to ensure that
these pages are re-indexed.
· To detect security changes that were made on a file share after the
last full crawl of the file share.
· To resolve consecutive incremental crawl failures. In rare cases, if
an incremental crawl fails one hundred consecutive times at any level in a
repository, the index server removes the affected content from the index.
· Crawl rules have been added, deleted, or modified.
· The search services administrator has created one or more server
name mappings.
· The account assigned to the default content access account or crawl
rule has changed.
· An SSP administrator stopped the previous crawl.
· A content database was restored from backup.
If you are running the
Infrastructure Update for Microsoft Office Servers, you can use the restore
operation of the stsadm command-line tool to change whether a content database
restore causes a full crawl.
· A farm administrator has detached and reattached a content database.
· A full crawl of the site has never been done.
· The change log does not contain entries for the addresses that are
being crawled. Without entries in the change log for the items being crawled,
incremental crawls cannot occur.
· To repair a corrupted index.
Depending upon the severity of the corruption, the system might attempt to perform a full crawl if corruption is detected in the index.