
SharePoint Search across the Globe

Several of my global clients have approached me over the last few weeks in some stage of planning or implementing a global search solution. So I wanted to take a few moments to talk through global search configuration: the general perceptions we have, the research on how users process options, the technology limitations – and the available approaches. The goal here is to be a primer for a conversation about how to create a search configuration that works for a global enterprise.

Single Relevance

There’s only one way to get a single relevance ranking across all pieces of content – and that is to have the same farm do the indexing for every piece of content. Because relevance is based on many factors – including how popular various words are across the corpus – the net effect is that if you want everything in exactly the right relevance order, you’ll have to do all of the indexing from a single farm. (To address the obvious question from my SharePoint readers, neither FAST nor SharePoint 2013 resolves this problem.)

So if you want to accomplish the utopian goal of having all search results for the entire globe in a single relevance-ordered list, one farm is going to have to index everything, and you’ll have to have one massive search index. This means you’ll have to plan on bringing everything across the wire – and that’s where the problems begin.

Search Indexing

In SharePoint (and this mostly applies to all search engines), the crawler component indexes all of the content by loading it locally (through a protocol handler in the case of SharePoint), breaking it into meaningful text (via an IFilter), and finally recording it into the search database. This is a very intensive process, and by its very nature it requires that all of the bits for a file travel across the network from the source server to the SharePoint server doing the indexing. Generally speaking this isn’t an issue for local servers, because most local networks are largely idle – there’s not an abundance of traffic on them, so any additional traffic caused by indexing isn’t that big of a deal. However, the story is very different on a wide area network.

In a WAN, most of the segments are significantly slower than their LAN counterparts. Consider that a typical LAN segment is 1 Gbps while a typical WAN connection is at best measured in megabits. Take a generous example of a 30 Mbps connection: the LAN is still roughly 33 times faster. For smaller locations that might be running on 1.544 Mbps (T1) connections, the multiplier is much larger (~650). This level of difference is monumental. Also consider that many WAN connections already run at 80% utilization during the day.

Consider for a moment that if you want to bring every bit of a 500 GB database across a 1.544 Mbps connection, it will take about a month – and that’s not counting protocol overhead or inefficiency. The problem becomes apparent when you need to do a full crawl, or when you need to do a content index reset.
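The arithmetic behind that estimate is worth making explicit. Here’s a quick sketch using the link speeds from above; it ignores protocol overhead and crawl processing time, so real-world numbers will be worse:

```python
# Back-of-the-envelope transfer-time estimate for a full crawl.
# Protocol overhead and crawl processing are ignored, so real
# times will only be worse than these.

def transfer_days(size_gb: float, link_mbps: float) -> float:
    """Days to move size_gb (decimal GB) over a link_mbps link at 100% use."""
    bits = size_gb * 8 * 10**9           # decimal gigabytes -> bits
    seconds = bits / (link_mbps * 10**6)
    return seconds / 86_400              # seconds per day

# 500 GB content database over a 1.544 Mbps (T1) WAN link:
print(f"{transfer_days(500, 1.544):.0f} days")       # ~30 days - about a month

# The same database over a 1 Gbps LAN segment:
print(f"{transfer_days(500, 1000) * 24:.1f} hours")  # ~1.1 hours
```

The same data that takes about a month to crawl over a T1 moves in about an hour on the LAN – which is why indexing locally and federating the queries is so much more attractive than crawling over the WAN.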

Normally, the indexing process looks for new content, reads it, and indexes it. That generally isn’t a big deal. We read and consume much more content than we create, so maybe 1% of the information in the index changes in a given day – in practical terms it’s really much less than this. Pulling one percent of the data across the wire isn’t that hard. If you’re doing incremental crawls every hour or so, you’ll probably complete each one before the next kicks off. (Generally speaking, in my SharePoint environments incremental indexing takes about 15 minutes every hour.) However, occasionally your search index becomes “corrupt”. I don’t mean that in the “the world is going to end” kind of way – just that an entry won’t have the right information. In most cases you won’t know that the data is wrong; the item just won’t be returned in search results. The answer to this is to periodically run a full crawl to recrawl all the content.

While the full crawl is running, incremental crawls can’t run. As a result, while the indexer is recrawling all of the content, recently changed content isn’t being indexed. Users will perceive the index to be out of date – because it will be. If it takes a month to do a complete crawl of the content, then the search index may be as much as a month out of date. Generally speaking, that’s not going to be useful to users.


You will schedule full crawls on a periodic basis – sometimes monthly, sometimes quarterly. Very rarely, however, you’ll have a search event that leads to you needing to reset the content index. In that case the entire index is deleted and then a full crawl begins. This is worse than a regular full crawl because the index won’t just be out of date – it will be incomplete.

In short, the amount of data that has to be pulled across the wire to have a single search index is simply not practical. It’s a much lower data requirement to pass user queries along to regionally deployed servers and aggregate the results on one page.

One Global Deployment

Some organizations have addressed this concern with a single global deployment of SharePoint – and certainly this does resolve the issue of a single set of search results but at the expense of everyday performance for the remote regions. I’ve recommended single global deployments for some organizations because of their needs – and regional deployments for other situations. The assumption I’m making in this post is that your environment has regional farms to minimize latency between the users and their data.

Federated Search Web Parts

Out of the box there is a federated search web part. This web part passes the page’s query to a remote OpenSearch 1.0/1.1-compliant server and displays the results. By default it is configured to connect to Microsoft’s Bing search engine, but you can connect it to other search engines as well – including other SharePoint farms in different regions of the globe. The good news is that this allows users to issue a single search and get back results from multiple sources; however, there are some technical limitations, some of which may be problematic.
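To make the mechanics concrete: an OpenSearch location is essentially a URL template containing a {searchTerms} placeholder. The web part substitutes the user’s query into the template and renders the RSS/Atom results that come back. A minimal sketch of that substitution in Python – the farm hostname is a hypothetical example, and the exact results path varies by SharePoint version and configuration:

```python
# Sketch of how an OpenSearch federated location works: the user's query
# is substituted into a URL template, and the endpoint returns results
# as RSS/Atom for the web part to render.
from urllib.parse import quote

def build_opensearch_url(template: str, search_terms: str) -> str:
    """Substitute the user's query into an OpenSearch URL template."""
    return template.replace("{searchTerms}", quote(search_terms))

# Hypothetical regional farm exposing its search results as RSS:
template = "http://emea.contoso.example/_layouts/srchrss.aspx?k={searchTerms}"
url = build_opensearch_url(template, "expense policy")
print(url)
# http://emea.contoso.example/_layouts/srchrss.aspx?k=expense%20policy
```

The key point is that the remote farm does its own crawling and ranking locally; only the small query and the result list cross the WAN.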

Server Based Requests

While it’s not technically required by the specifications, the implementation SharePoint includes has the Federated Search Web Parts processing the remote queries on the server – not on the client. That means the server must be able to connect to all of the locations you want to use for federated search. In practical terms this may not be difficult, but most folks frown on their servers having unfettered access to the Internet. As a result, having the servers run the federated searches may mean some firewall and/or proxy server changes.

The good news here is that federated search locations must be configured in the search service – so you’ll know exactly which servers need to be reachable from the host SharePoint farm. The bad news is that if you’re making requests to other farms in your environment, you’ll need a way to pass user authentication from one server to another – and in internal situations that’s handled by Kerberos.

Kerberos

All too many implementations I go into don’t have Kerberos set up as an authentication protocol – or, more frequently, their clients are authenticating with NTLM rather than Kerberos for a variety of legitimate and illegitimate reasons. Let me start by saying that Kerberos, when correctly implemented, will help the performance of your site, so outside of the conversation about delegating authentication it’s a good thing to implement in your environment.

Despite the relative fear in the market about setting up and using Kerberos, it really is as simple as setting SharePoint/IIS to use it (Negotiate), setting the service principal name (SPN) of the URL used to access the service on the service account, and setting the service account up for delegation. In truth, that’s it. It’s not magic – however, it is hard to debug, and as a result most people give up on setting it up. Fear of Kerberos and what’s required to set it up correctly falls into what I would consider an illegitimate reason.

There is a legitimate reason why you might not be able to use Kerberos. Kerberos is mutual authentication: it requires that the workstation be trusted, which means it has to be a domain-joined PC. If you’ve got a large contingent of staff without domain-joined machines, you’ll find that Kerberos won’t work for you.

Kerberos is required for one server to pass along the identity of a user to another server. This trusted delegation of the user’s identity isn’t supported by NTLM (or NTLMv2). In our search case, SharePoint trims the search results to only those a user can see – and thus the remote servers being queried need the identity of the user making the request. This is a problem if the authentication is being done via NTLM, because that authentication can’t be passed along – and as a result you won’t get any results. So in order to use the out-of-the-box federated search web parts against another SharePoint farm, you must have Kerberos set up and configured correctly.

Roll Your Own

Of course, just because the out-of-the-box web parts use a server-side approach to querying the remote search engine – and therefore need Kerberos for security trimming to work – doesn’t mean that you have to use them. It’s certainly possible to write your own JavaScript-based web part that issues the query from the client side, so the client transmits its own authentication to the remote server. However, as a practical matter this is more challenging than it first appears because of the transformation of results through XSLT. In my experience, clients haven’t opted to build their own federated web parts.

User Experience

From a user experience perspective, the first thing users will notice when using the federated search web parts is that the results are in different “buckets” – and they’re unlikely to like this. As discussed at the start of this post, there’s not much that can be done to resolve this from a technical perspective without creating larger issues of how “fresh” the index is. So while admittedly this isn’t our preference from a user experience perspective, there aren’t great answers for resolving it.

Before dismissing this entirely, I need to say that some folks have decided to live with the fact that relevance won’t be exactly right: they comingle the results and deal with the set of issues that arise from that, including how to manage paging and what to do about faceted search refinement – that is, the property-value selection typically on the left-hand side of the page. When you’re pulling from multiple sources, you have to aggregate these refiners and manage paging yourself. This turns out to be a non-trivial exercise, and one that doesn’t appear to improve the situation much.
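To illustrate just the easiest piece of that work, merging refiner counts from multiple regional result sets is a simple aggregation. This is a sketch with made-up refiner data; reconciling relevance scores and paging across sources is the genuinely hard part that it deliberately leaves out:

```python
# Sketch of the refiner-merging problem: each regional farm returns its
# own facet counts, and a comingled results page has to combine them.
from collections import Counter

def merge_refiners(*regional_refiners: dict) -> dict:
    """Sum facet value counts per refiner across regional result sets."""
    merged: dict = {}
    for refiners in regional_refiners:
        for refiner, counts in refiners.items():
            merged.setdefault(refiner, Counter()).update(counts)
    return merged

# Hypothetical refiner data from two regional farms:
emea = {"FileType": {"docx": 40, "pdf": 12}, "Author": {"Smith": 3}}
apac = {"FileType": {"docx": 5, "xlsx": 9}}

merged = merge_refiners(emea, apac)
print(merged["FileType"].most_common())
# [('docx', 45), ('pdf', 12), ('xlsx', 9)]
```

And this is only the counting; deciding how to interleave and page two result lists whose relevance scores aren’t comparable is where the real effort goes.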

Hick’s Law

One of the most often misused “laws” in user experience design is Hick’s Law. It states, basically, that a user can find what they’re looking for faster from one long list than from two smaller lists. (This is a gross oversimplification; follow the link for more details.) The key is that this oversimplification ignores two requirements. First, the user must understand the ordering of the list. Second, they must know what they’re looking for – that is, they have to know the exact language being used. In the case of search, neither of these requirements is met: the ordering is non-obvious, and the exact title of the result is rarely known by the user who is searching.

What this means is that although intuitively we “know” that having all the results in a single list will be better, the research doesn’t support this position. In fact, some of the research quoted by Barry Schwartz in The Paradox of Choice indicates that meaningful partitioning can be very valuable in reducing anxiety and improving performance. I’m not advocating that you break up search results you can get together – rather, I’m saying that comingled results may be of less value than we perceive them to be.

Refiners and Paging

One of the challenges with the federated search user experience is that the facets are driven off of the primary results, so results from other geographies won’t show in the refiners list. Nor is there paging on the federated search web parts. As a result, the federated results web parts should be viewed as “teasers” inviting users to take the highly relevant results or to click over to the other geography to refine their searches. The federated search web part includes a “more…” link to reach the federated source’s own results page. Ideally the look and feel – and global navigation – between the search locations will be similar, so as not to be a jarring experience for users.

Putting it Together

Having a single set of results may not be feasible from a technology standpoint today; however, with careful consideration of how users search and how they view search results, you can build easy-to-consume experiences. A model where users have regional deployments for their search needs – providing some geographic division between results while minimizing the total number of places they need to go – can help users find what they’re looking for quickly and easily.