Saturday, 23 March 2013

SHAREPOINT INDEXING PERFORMANCE TUNING TIPS


Introduction

This document highlights possible causes of poor indexing performance on SharePoint 2007 farms and provides options and recommendations for further tuning. The following components are involved in the crawl process:

Content Host - This is the server that hosts/stores the content that the indexer is crawling. For example, if a content source crawls a SharePoint site, the content host would be the web front end server that hosts the site.
MSSDMN.EXE - When a crawl is running, this process is visible in Task Manager on the indexer. This process is called the "Search Daemon". When a crawl starts, it is responsible for connecting to the content host (using the protocol handler and IFilter), requesting content from the host, and crawling that content. The Search Daemon has the biggest impact on the indexer in terms of resource utilization.
MSSearch.exe - Once MSSDMN.EXE is done crawling content, it passes the crawled content to MSSearch.exe (this process also runs on the indexer and is visible in Task Manager during a crawl). MSSearch.exe does two things: it writes the crawled content index to the disk of the indexer, and it passes the metadata properties of documents discovered during the crawl to the back-end database. Unlike the content index, which is stored on the physical disk of the indexer, crawled metadata properties are stored in the database.
SQL Server (Search Database) - The search database stores information such as the current status of crawls and the metadata properties of documents and list items discovered during the crawl.
There are a number of reasons why crawls could be taking longer than expected.
The indexer has to request content from the content host in order to crawl it. If the host responds slowly, then no matter how powerful the indexer is, the crawl will run slowly.
Collect performance counter data from the indexer for at least 10% of the estimated crawl duration.
Monitor the following performance counters:
·    \Office Server Search Gatherer\Threads Accessing Network
·    \Office Server Search Gatherer\Idle Threads
The "Threads Accessing Network" counter shows the number of threads on the indexer that are waiting on the content host to return the requested content. A consistently high value of this counter indicates that the crawl is starved by a "hungry host".
The indexer also uses the database during the indexing process. If the database server becomes too slow to respond, crawl durations will suffer. To see whether the crawl is starved by a slow or busy database server, use the following performance counters:
·    \Office Server Search Archival Plugin\Active Docs in First Queue
·    \Office Server Search Archival Plugin\Active Docs in Second Queue
Once MSSDMN.EXE is done crawling content, it passes it to MSSearch.exe, which then writes the index to disk and inserts the metadata properties into the database.
MSSearch.exe queues all the documents whose metadata properties need to be inserted into the database.
Consistently high values of these two counters indicate that MSSearch.exe has requested that metadata properties be inserted into the database, but the requests are sitting in the queue because the database is too busy. Ideally, you should see the number of documents in the first queue increase and then drop to 0 at regular intervals (every few minutes).
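If the queue counters point at the database, a quick way to confirm is to look for blocking and long waits on the SQL Server instance that hosts the search database. The following is a minimal sketch using standard SQL Server 2005+ DMVs; the one-second wait threshold is an arbitrary starting point, not a documented limit:

-- List requests that are currently blocked or have been waiting for a
-- while, to confirm whether SQL Server (rather than the indexer) is the
-- bottleneck during a crawl.
SELECT
    r.session_id,
    r.blocking_session_id,            -- non-zero means this request is blocked
    r.wait_type,
    r.wait_time,                      -- milliseconds spent on the current wait
    DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
   OR r.wait_time > 1000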
Tuning

If the two counters above are consistently high and the database server is dedicated to MOSS, consider optimization at the application level, such as the disk-based BLOB cache, to reduce the load that the web front ends place on SQL Server.
On the indexer, monitor the performance counter \Process\% Processor Time with mssdmn.exe as the selected instance.
If mssdmn.exe is consistently at high CPU utilization, consider changing the indexer performance level from "Maximum" to "Partly Reduced". If there are other bottlenecks (memory, disk, SQL, etc.), analyze the collected performance data to determine where the constraint lies.
The indexer performance level is set in Central Administration > Operations > Services on Server > Office SharePoint Server Search Service Settings.
Start with a crawler impact rule to “request 64 documents at a time” in order to maximize the number of threads the crawler can use to index content, with the ultimate goal of increasing the crawl speed.
Monitor resource usage on the indexer, the dedicated front end, and the search database server; if the crawler generates too much load, tune the crawler impact rule by decreasing the parallelism.

·         Reduced: total number of threads = number of processors; max threads/host = number of processors
·         Partly Reduced: total number of threads = 4 times the number of processors; max threads/host = 16 times the number of processors
·         Maximum: total number of threads = 4 times the number of processors; max threads/host = 16 times the number of processors (threads are created at HIGH priority)

NOTE: It is a good idea to open Perfmon and look at the gatherer statistics while indexing. There is a counter called Performance Level that reflects the actual level the indexer is running at, where 5 = maximum and 3 = reduced. Even if you set everything to Maximum, the indexer may decide to run at a reduced level.

Another aspect of the crawling process is that when the filter daemons (mssdmn.exe) use too much memory, they are automatically terminated and restarted. There is of course a wind-up time when this happens, which can slow down your crawling. The default setting is pretty low (around 100 MB), so it is easy to trip when filtering large files. You can and should increase the memory allocation by adjusting the following registry values (a service restart is required for the changes to take effect):
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager
·         FolderHighPriority: the number of high-priority folders that can be processed at one time. If this is too high, the cache in the daemons will constantly run out of space; if it is too low, the crawl will be throttled waiting for more items to process.
·         FilterProcessMemoryQuota: how much memory the search daemon process can consume before it is killed by the crawler. The out-of-the-box default was chosen based on 4 GB of memory on the indexer; with more RAM, this value can be increased to cache more data during the crawl.
·         DedicatedFilterProcessMemoryQuota: the same as FilterProcessMemoryQuota, except this is the quota for the single-threaded daemons.

As an example, if the indexer box is 64-bit with 16 GB of RAM, the following values have been tested successfully:

FolderHighPriority: set to 500
FilterProcessMemoryQuota: set to 208857600
DedicatedFilterProcessMemoryQuota: set to 208857600

Review the information architecture of the content to be indexed.
Identify one subset of content where most of the changes happen (“fresh” content), and another subset that is mostly static (archived or older content).
Configuring more than one content source to target the “fresh” content areas separately from the “static” areas will provide more flexibility for crawl scheduling.
Multiple content sources also mitigate the impact of a long-running crawler operation (such as a full crawl) on the latency with which fresh content appears in search results: you can selectively activate crawling on only the desired content sources, postpone less important crawl activity to off-peak hours, and so on.
SharePoint gives priority to the first running crawl, so if you are already indexing one system, it will hold up the indexing of a second and increase crawl times.
  • Solution: Schedule your crawl times so there is no overlap. Full crawls take the longest, so run those exclusively.

The Adobe PDF IFilter can only filter one file at a time, which slows crawls down, and it has a high reject rate for newer PDFs.
  • Solution: Use a retail PDF IFilter, e.g. Foxit.
There is a registry setting that controls the number of times a file is retried on error. The default is 100, which can severely slow down incremental crawls. The retry count can be adjusted under this key:
  • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager: set DeleteOnErrorInterval = 4 (decimal)
Ensure that you have at least 2 GB of free memory available before your crawl even starts, and that you have at least two physical processors available.
If a full crawl overlaps with normal business hours, it can adversely affect SharePoint users when the crawl server also hosts other roles such as query or WFE. Central Administration on the application server will be impacted, and if the query role has not been moved to the WFEs, the search center results page will also perform badly during crawl operations. Even an incremental crawl can have this impact if a lot of content has changed since the last crawl.
Run a 64-bit OS.
You should exclude certain folders from antivirus scanning. If you do not exclude these folders, you can experience many unexpected issues.
You may have to configure the antivirus software to exclude the "Drive:\Program Files\Microsoft Office Servers" folder from antivirus scanning for SharePoint Server 2007. If you do not want to exclude the whole "Microsoft Office Servers" folder from antivirus scanning, you can exclude only the following folders:
·         Drive:\Program Files\Microsoft Office Servers\12.0\Data. (This folder is used for the indexing process. If the Index files are configured to reside in a different folder, you also have to exclude that location.)
·         Drive:\Program Files\Microsoft Office Servers\12.0\Logs
·         Drive:\Program Files\Microsoft Office Servers\12.0\Bin
·         Any location you choose to store Disk-based BLOB cache. For example: C:\blobcache. For more information on BLOB cache, see: http://office.microsoft.com/en-us/sharepoint-server-help/configure-disk-based-cache-settings-HA010176284.aspx
Note If you have Microsoft Office SharePoint Server 2007, these folders should be excluded in addition to the folders listed in the WSS 3.0 section below.

Note When you install SharePoint Server 2007 or apply a hotfix to an existing installation, you may have to disable the real-time option of the antivirus program, or you may have to exclude the Drive:\Windows\Temp folder from antivirus scanning.
You may have to configure the antivirus software to exclude the following folders and subfolders from antivirus scanning.

Note The Drive placeholder represents the drive on which you have installed Windows SharePoint Services 3.0 or SharePoint Server 2007. Typically, this is drive C.
·         Drive:\Program Files\Common Files\Microsoft Shared\Web Server Extensions

If you do not want to exclude the whole "Web Server Extensions" folder from antivirus scanning, you can exclude only the following two folders:
o    Drive:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\Logs
o    Drive:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\Data\Applications

Note The Applications folder must be excluded only if the computer is running the Windows SharePoint Services Search service. If the folder that contains the index file is in another location, you must also exclude that folder.
·         Drive:\Windows\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files

Note If you are running a 64-bit version of Windows, you should also exclude the following directory:
o    Drive:\Windows\Microsoft.NET\Framework64\v2.0.50727\Temporary ASP.NET Files
·         Drive:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config
·         Drive:\Windows\Temp\WebTempDir

Note The WebTempDir folder is a replacement for the FrontPageTempDir folder.
·         Drive:\Documents and Settings\<account that the search service runs as>\Local Settings\Temp\
·         On 64-bit Windows Server 2008, this will be Drive:\Users\<account that the search service runs as>\AppData\Local\Temp\

Note The search account creates a "gthrsvc" folder under this Temp folder, to which it periodically needs to write.
·         Drive:\WINDOWS\system32\LogFiles
·         On 64-bit Windows Server 2008, this will be Drive:\Windows\Syswow64\LogFiles

Note If you use a specific account for SharePoint services or application pool identities, you may also have to exclude the following folders:
o    Drive:\Documents and Settings\ServiceAccount\Local Settings\Application Data
o    On 64-bit Windows Server 2008, this will be Drive:\Users\ServiceAccount\AppData\Local
o    Drive:\Documents and Settings\ServiceAccount\Local Settings\Temp
o    On 64-bit Windows Server 2008, this will be Drive:\Users\ServiceAccount\AppData\Local\Temp
·         Drive:\Documents and Settings\Default User\Local Settings\Temp
·         On 64-bit Windows Server 2008, this will be Drive:\Users\Default\AppData\Local\Temp

Plan the SQL Server configuration to scale to large numbers of items in the index (5 million+). Pay attention to:
o   Indexes for the SharedServices1_Search_DB database
o   Temp and system databases/tables
o   The transaction log for SharedServices1_Search_DB
o   Table content for SharedServices1_Search_DB

Host the corresponding files on different sets of disks, to keep the crawl and query loads segregated and minimize I/O contention.

See details in “SQL File groups and Search” on the Enterprise Search blog at http://blogs.msdn.com/enterprisesearch/archive/2008/09/16/sql-file-groups-and-search.aspx.
Note however that it makes no sense to split the tables if you are not able to physically host the two filegroups on different sets of disks.
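As an illustration of the general technique (a sketch only, not the exact steps from the blog post above; the database name, filegroup name, and file path are assumptions), adding a second filegroup on a separate physical disk looks like this:

-- Hypothetical sketch: create a second filegroup on a separate disk and
-- add a data file to it. All names and paths below are examples only.
ALTER DATABASE [SharedServices1_Search_DB]
    ADD FILEGROUP [SearchCrawlFG]

ALTER DATABASE [SharedServices1_Search_DB]
    ADD FILE (
        NAME = N'SharedServices1_Search_Crawl',
        FILENAME = N'E:\SQLData\SharedServices1_Search_Crawl.ndf',
        SIZE = 10GB,
        FILEGROWTH = 10%
    ) TO FILEGROUP [SearchCrawlFG]

Indexes or tables can then be moved onto the new filegroup (for example with CREATE INDEX ... WITH (DROP_EXISTING = ON) ON [SearchCrawlFG]) so that crawl and query I/O hit different sets of disks.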


The following query can be used on SQL Server 2005 or higher to obtain all the indexes with a fragmentation level higher than 10%:

USE [SharedServices1_Search_DB]  -- the SSP search database; adjust the name to your environment

DECLARE @currentDbId int
SELECT @currentDbId = DB_ID()

SELECT DISTINCT
      i.name,
      st.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats (@currentDbId, NULL, NULL, NULL, 'SAMPLED') st
INNER JOIN sys.indexes AS i
      ON st.object_id = i.object_id
     AND st.index_id = i.index_id
WHERE st.avg_fragmentation_in_percent > 10
  AND i.name IS NOT NULL
Indexes resulting from the above query should be defragmented.
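As a sketch of the defragmentation itself (the index and table names below are placeholders, not objects from the search database):

-- Hypothetical example; substitute the index/table names returned by the
-- query above. A common rule of thumb: REORGANIZE between roughly 10% and
-- 30% fragmentation, REBUILD above 30%.
ALTER INDEX [IX_Example] ON [dbo].[ExampleTable] REORGANIZE
-- ...or, for heavily fragmented indexes:
ALTER INDEX [IX_Example] ON [dbo].[ExampleTable] REBUILD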

Rebuild the indexes on the SharedServices1_Search_DB database, especially the indexes on the MSSDocProps table.
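For example (a minimal sketch, assuming the default dbo schema and the default SSP database name; run during off-peak hours):

-- Rebuild all indexes on the MSSDocProps table of the search database.
USE [SharedServices1_Search_DB]
ALTER INDEX ALL ON [dbo].[MSSDocProps] REBUILD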
See also “Database Maintenance for Microsoft SharePoint Products and Technologies” at http://go.microsoft.com/fwlink/?LinkId=111531&clcid=0x409 and “SQL Index defrag and maintenance tasks for Search” on the Enterprise Search Blog at http://blogs.msdn.com/enterprisesearch/archive/2008/09/02/sql-index-defrag-and-maintenance-tasks-for-search.aspx.



SQL Server maintenance plans should be configured with the following guidelines:
Search Database
·         Check database integrity using the 'DBCC CHECKDB WITH PHYSICAL_ONLY' syntax to reduce the overhead of the command. This should be run on a weekly basis during off-peak hours, and any error returned from DBCC should be analyzed and resolved proactively. The full 'DBCC CHECKDB' command should be run at a lower frequency (e.g. once per month) to provide a deeper analysis; see the sketch after this list.
·         Do not shrink the Search database.
·         Index defragmentation should be executed following the recommendation above.
Content Databases
·         Check Database Integrity
o    Include indexes
·         Shrink Database
o    Shrink database when it grows beyond: the maximum expected size of your content database + 20%
o    Amount of free space to remain after shrink: 10%
o    Return freed space to operating system
·         Reorganize Index
o    Compact large objects
o    Change free space per page percentage to: 70%
·         Maintenance Cleanup Task
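As referenced above, the integrity checks are plain DBCC commands; the database name here is an assumption based on the default SSP naming:

-- Weekly check with reduced overhead (physical structures only).
DBCC CHECKDB (N'SharedServices1_Search_DB') WITH PHYSICAL_ONLY

-- Monthly full check for deeper analysis.
DBCC CHECKDB (N'SharedServices1_Search_DB')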



Avoid auto-growth events for content databases by pre-setting the size to the maximum expected size (the ALTER DATABASE … MODIFY FILE … SIZE property). Configure the auto-growth values to a fixed percentage (e.g. 10%) instead of a fixed amount of space.
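A minimal sketch of both settings (the database and logical file names are assumptions; MODIFY FILE changes one file property per statement):

-- Pre-size the data file to its expected maximum, then set 10% growth.
ALTER DATABASE [WSS_Content]
    MODIFY FILE (NAME = N'WSS_Content', SIZE = 100GB)

ALTER DATABASE [WSS_Content]
    MODIFY FILE (NAME = N'WSS_Content', FILEGROWTH = 10%)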


Establish a baseline for crawl performance, and schedule crawls so that the number of crawls running at the same time does not exceed that healthy baseline.
To maintain an enterprise environment where SharePoint is a business-critical application, the correct hardware needs to be in place. See:

http://technet.microsoft.com/en-us/library/cc262574.aspx (Estimate performance and capacity requirements for search environments)
http://technet.microsoft.com/en-us/library/cc850696.aspx (Best practices for Search in Office SharePoint Server)





Reasons for a search services administrator to do a full crawl include:
·      One or more hotfixes or service packs were installed on servers in the farm. See the instructions for the hotfix or service pack for more information.
·      An SSP administrator added a new managed property.
·      To re-index ASPX pages on Windows SharePoint Services 3.0 or Office SharePoint Server 2007 sites.
The crawler cannot discover when ASPX pages on Windows SharePoint Services 3.0 or Office SharePoint Server 2007 sites have changed. Because of this, incremental crawls do not re-index views or home pages when individual list items are deleted. We recommend that you periodically do full crawls of sites that contain ASPX files to ensure that these pages are re-indexed.
·      To detect security changes that were made on a file share after the last full crawl of the file share.
·      To resolve consecutive incremental crawl failures. In rare cases, if an incremental crawl fails one hundred consecutive times at any level in a repository, the index server removes the affected content from the index.
·      Crawl rules have been added, deleted, or modified.
·      To repair a corrupted index. Depending on the severity of the corruption, the system might itself attempt a full crawl if corruption is detected.
·      The search services administrator has created one or more server name mappings.
·      The account assigned to the default content access account or crawl rule has changed.
·      An SSP administrator stopped the previous crawl.
·      A content database was restored from backup.
If you are running the Infrastructure Update for Microsoft Office Servers, you can use the restore operation of the stsadm command-line tool to change whether a content database restore causes a full crawl.
·      A farm administrator has detached and reattached a content database.
·      A full crawl of the site has never been done.
·      The change log does not contain entries for the addresses that are being crawled. Without entries in the change log for the items being crawled, incremental crawls cannot occur.

1 comment:

  1. You can also change the TempPath value from the C: drive to D: (which you have formatted with 64K clusters). You also need to change the UseSystemTemp value to 0 for this to take effect.
