Implementing Search Using Sitecore & Lucene – Part II

In the last post, I discussed aggregating data needed for search in a custom Lucene index. In this post, I’ll review how I implemented the query logic.

Date Conversion Exception

As soon as I started querying the Lucene index, I started getting exceptions. Sitecore and Lucene were not happy with how my datetime data was stored in the index. I added a custom IndexFieldDateTimeValueConverter to handle the values that failed to convert.


public class IndexFieldDateTimeValueConverter : Sitecore.ContentSearch.Converters.IndexFieldDateTimeValueConverter
{
    public override object ConvertFrom(ITypeDescriptorContext context, System.Globalization.CultureInfo culture, object value)
    {
        try
        {
             return base.ConvertFrom(context, culture, value);
        }
        catch (Exception)
        {
             string fieldValue = value as string;

             DateTime dReturn;

             // Fall back to the compact date-only format before giving up
             if (DateTime.TryParseExact(fieldValue, "yyyyMMdd", culture, DateTimeStyles.None, out dReturn))
                 return dReturn;

             // Rethrow the original exception without resetting its stack trace
             throw;
        }
    }
}
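
The fallback branch can be tried on its own, outside Sitecore. The sketch below is a hypothetical helper (DateFallback and TryParseCompactDate are illustrative names, not part of Sitecore or the project) that isolates the yyyyMMdd parse so you can see which raw values it accepts. It assumes the invariant culture, whereas the converter above uses whatever culture Sitecore passes in.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of Sitecore or the project.
public static class DateFallback
{
    // Returns true only when raw is an exact 8-digit yyyyMMdd date.
    public static bool TryParseCompactDate(string raw, out DateTime parsed)
    {
        return DateTime.TryParseExact(
            raw,
            "yyyyMMdd",
            CultureInfo.InvariantCulture, // assumption: culture-independent input
            DateTimeStyles.None,
            out parsed);
    }
}
```

A value such as "20160315" parses, while "2016-03-15" or a null value does not, which is the case where the converter rethrows the original exception.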

I created an include file to register the class and had to give its name a leading z (zCustomIndexValueConverters.config) so that it loaded after the Sitecore Content Search Lucene include files.


<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
    <sitecore>
        <contentSearch>
            <indexConfigurations>
                <defaultLuceneIndexConfiguration>
                    <!-- DateTimeConverter -->
                    <indexFieldStorageValueFormatter type="Sitecore.ContentSearch.LuceneProvider.Converters.LuceneIndexFieldStorageValueFormatter, Sitecore.ContentSearch.LuceneProvider">
                        <converters hint="raw:AddConverter">
                            <converter handlesType="System.DateTime">
                                <patch:attribute name="typeConverter">Someproject.ContentSearch.Converters.IndexFieldDateTimeValueConverter, Someproject</patch:attribute>
                            </converter>
                        </converters>
                    </indexFieldStorageValueFormatter>
                </defaultLuceneIndexConfiguration>
            </indexConfigurations>
        </contentSearch>
    </sitecore>
</configuration>

I then applied the attribute to my data properties.

[TypeConverter(typeof(IndexFieldDateTimeValueConverter))]
 public virtual DateTime MetadataDate { get; set; }

Am I proud of myself? Nope. Did this work? Yep.

POCO and SearchResultItem

I needed classes to store the search results and facet data. I created four classes.

Facet Classes

The facet classes are fairly straightforward. The FacetValue class and the SearchFacet class are POCOs (Plain Old CLR Objects) that store the facet data and return it to the presentation layer.

[Serializable]
[DataContract(Name = "FacetValue")]
public class FacetValue
{
    [DataMember(Name = "Value")]
    public string Value { get; set; }

    [DataMember(Name = "FacetCount")]
    public int FacetCount { get; set; }
}

[Serializable]
[DataContract(Name = "SearchFacet")]
public class SearchFacet
{
    private List<FacetValue> _values;

    public SearchFacet()
    {
        _values = new List<FacetValue>();
    }

    [DataMember(Name = "FacetName")]
    public string FacetName { get; set; }

    [DataMember(Name = "Values")]
    public List<FacetValue> Values
    {
        get { return _values; }
        set { _values = value; }
    }
}

Search Results & Search Entity Classes

The search entity class stores all of the search result data we want to return to the presentation layer, as well as the properties used in the Where and Filter clauses. I inherited from the Sitecore SearchResultItem class and then hid the data I did not want to return to the presentation layer, both for security and to avoid bloating the JSON. The search results class is the container for everything.

[Serializable]
[DataContract(Name = "SiteSearchEntity")]
public class SiteSearchEntity : SearchResultItem 
{
    [TypeConverter(typeof(IndexFieldIDValueConverter))]
    [IndexField("_id")]
    public Guid Id { get; set; }

    [DataMember(Name = "ComputedUrl")]
    [IndexField("LinkProviderUrl")]
    public virtual string ComputedUrl { get; set; }

    [DataMember(Name = "ComputedMetaTitle")]
    [IndexField("Title")]
    public virtual string ComputedMetaTitle { get; set; }

    [DataMember(Name = "ComputedMetaDescription")]
    [IndexField("Description")]
    public virtual string ComputedMetaDescription { get; set; }

    [IgnoreDataMember]
    [IndexField("Keywords")]
    public virtual string ComputedKeywords { get; set; }

    [DataMember(Name = "ComputedDocumentDate")]
    [IndexField("DocumentDate")]
    public virtual DateTime ComputedDocumentDate { get; set; }

    [DataMember(Name = "ComputedCategory")]
    [IndexField("ComputedCategory")]
    public virtual List<string> ComputedCategory { get; set; }

    [DataMember(Name = "ComputedImageUrl")]
    [IndexField("ImageURL")]
    public virtual string ComputedImageUrl { get; set; }

    [DataMember(Name = "ComputedSearchUrl")]
    [IndexField("SearchURL")]
    public virtual string ComputedSearchUrl { get; set; }

    #region Hide Some Data Members
    [IgnoreDataMember]
    public new string Version { get; set; }

    [IgnoreDataMember]
    [IndexField("_group")]
    [TypeConverter(typeof(IndexFieldIDValueConverter))]
    public new ID ItemId { get; set; }

    [IgnoreDataMember]
    [IndexField("_uniqueid")]
    [TypeConverter(typeof(IndexFieldItemUriValueConverter))]
    [XmlIgnore]
    public new ItemUri Uri { get; set; }

    [IgnoreDataMember]
    [IndexField("_templatename")]
    public new string TemplateName { get; set; }

    [IgnoreDataMember]
    [IndexField("_template")]
    [TypeConverter(typeof(IndexFieldIDValueConverter))]
    public new ID TemplateId { get; set; }

    [IgnoreDataMember]
    [IndexField("__semantics")]
    [TypeConverter(typeof(IndexFieldEnumerableConverter))]
    public new IEnumerable<ID> Semantics { get; set; }

    [IgnoreDataMember]
    [IndexField("_fullpath")]
    public new string Path { get; set; }

    [IgnoreDataMember]
    [IndexField("_path")]
    [TypeConverter(typeof(IndexFieldEnumerableConverter))]
    public new IEnumerable<ID> Paths { get; set; }

    [IgnoreDataMember]
    [IndexField("_name")]
    public new string Name { get; set; }

    [IgnoreDataMember]
    [IndexField("_language")]
    public new string Language { get; set; }

    [IgnoreDataMember]
    [IndexField("__smallcreateddate")]
    public new DateTime CreatedDate { get; set; }

    [IgnoreDataMember]
    [IndexField("_content")]
    public new string Content { get; set; }

    [IgnoreDataMember]
    [IndexField("parsedcreatedby")]
    public new string CreatedBy { get; set; }

    [IgnoreDataMember]
    [IndexField("__smallupdateddate")]
    public new DateTime Updated { get; set; }

    [IgnoreDataMember]
    [IndexField("parsedupdatedby")]
    public new string UpdatedBy { get; set; }

    [IgnoreDataMember]
    [IndexField("_datasource")]
    public new string Datasource { get; set; }

    [IgnoreDataMember]
    [IndexField("_database")]
    public new string DatabaseName { get; set; }

    [IgnoreDataMember]
    [IndexField("_parent")]
    public new ID Parent { get; set; }

    [IgnoreDataMember]
    [IndexField("urllink")]
    public new string Url { get; set; }

    #endregion

    #region Work Around for Facets with Spaces
    [IndexField("CategoryFacet")]
    public virtual List<string> CategoryFacet { get; set; }
    #endregion
}

[DataContract(Name = "SearchResults")]
[Serializable]
public class SerializableSearchResults
{
    List<SiteSearchEntity> _entities = new List<SiteSearchEntity>();
    List<SearchFacet> _facets = new List<SearchFacet>();

    [DataMember(Name = "TotalCount")]
    public int TotalCount { get; set; }

    [DataMember(Name = "SearchTerm")]
    public string SearchTerm { get; set; }

    [DataMember(Name = "entities")]
    public List<SiteSearchEntity> entities
    {
        get { return _entities; }
        set { _entities = value; }
    }

    [DataMember(Name = "facets")]
    public List<SearchFacet> facets
    {
        get { return _facets; }
        set { _facets = value; }
    }
}

Search Logic & the Predicate Builder

Now for the fun part. How do we build a search algorithm that returns accurate results? I modeled my work after Matt Burke’s blog post. I converted the search term into an array of strings, splitting it on the space character. I then built the filter and term predicates separately and joined them together. The Sitecore PredicateBuilder makes working with Lucene fairly easy. The search algorithm below is a simplified version of what I implemented.

public static SerializableSearchResults GetSearchResultsLucene(string searchTerm, int page, Dictionary<string, string> facets)
{
    SerializableSearchResults oReturn = new SerializableSearchResults();
    string sDatabase = "web";
    ISitecoreService service;

    ISearchIndex searchIndex = ContentSearchManager.GetIndex("sitesearch_web");

    SearchResults<SiteSearchEntity> results = null;
    IQueryable<SiteSearchEntity> query = null;

    service = new SitecoreService(sDatabase);

    using (IProviderSearchContext searchContext = searchIndex.CreateSearchContext())
    {
        // Parse search term into a collection of strings
        string[] terms = searchTerm.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        Expression<Func<SiteSearchEntity, bool>> filterPredicate = PredicateBuilder.True<SiteSearchEntity>();

        // Get facets
        string category = GetFacetValue("category", facets);

        // Build facets clauses
        if (!String.IsNullOrEmpty(category))
            filterPredicate = filterPredicate.And(se => se.CategoryFacet.Contains(category));

        Expression<Func<SiteSearchEntity, bool>> termPredicate = PredicateBuilder.False<SiteSearchEntity>();

        foreach (string term in terms)
        {
            // Boost is applied to each clause inside its lambda
            termPredicate = termPredicate
                .Or(p => p.ComputedMetaTitle.Contains(searchTerm).Boost(5.0f))
                .Or(p => p.ComputedMetaTitle.Like(term, 0.75f).Boost(2.0f))
                .Or(p => p.ComputedMetaDescription.Contains(searchTerm).Boost(3.0f))
                .Or(p => p.ComputedMetaDescription.Like(term, 0.75f).Boost(2.0f))
                .Or(p => p.ComputedKeywords.Contains(searchTerm).Boost(2.5f))
                .Or(p => p.ComputedKeywords.Like(term, 0.75f).Boost(1.5f));
        }

        Expression<Func<SiteSearchEntity, bool>> fullPredicate = filterPredicate.And(termPredicate);

        query = searchContext.GetQueryable<SiteSearchEntity>().Where(fullPredicate);

        FacetResults searchFacets = searchContext.GetQueryable<SiteSearchEntity>().Filter(fullPredicate).FacetOn(x => x.CategoryFacet).GetFacets();

        query = query.Page(page - 1, 20);
        results = query.GetResults();
        oReturn.entities = results.Hits.Select(hit => hit.Document).ToList();

        foreach(SiteSearchEntity s in oReturn.entities)
        {
            service.Map(s);
        }

        oReturn.SearchTerm = searchTerm.ToLower();
        oReturn.TotalCount = results.TotalSearchResults;

        oReturn.facets = GetFacetResults(searchFacets);
    }

    return oReturn;
}

private static string GetFacetValue(string FacetName, Dictionary<string, string> facets)
{
    string sReturn = String.Empty;

    if (facets.ContainsKey(FacetName))
        sReturn = facets[FacetName];

    return sReturn;
}
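
The way the filter and term predicates combine can be illustrated without Sitecore. The sketch below is a stand-in, not the PredicateBuilder itself: it uses plain Func delegates over a made-up Doc class (Doc, PredicateSketch and Build are all hypothetical names), starts the filter at "true" and the term predicate at "false", and joins them the same way. It does no fuzzy matching or boosting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in document type for illustration only; not the SiteSearchEntity class.
public class Doc
{
    public string Title { get; set; }
    public List<string> Categories { get; set; }
}

public static class PredicateSketch
{
    // Builds (category filter) AND (term1 OR term2 OR ...) from plain delegates.
    public static Func<Doc, bool> Build(string searchTerm, string category)
    {
        string[] terms = searchTerm.ToLower()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // The filter starts as "true" so that adding clauses narrows it.
        Func<Doc, bool> filter = d => true;
        if (!string.IsNullOrEmpty(category))
        {
            filter = d => d.Categories.Contains(category);
        }

        // The term predicate starts as "false" so that each Or widens it.
        Func<Doc, bool> anyTerm = d => false;
        foreach (string term in terms)
        {
            Func<Doc, bool> previous = anyTerm;
            string current = term;
            anyTerm = d => previous(d) || d.Title.ToLower().Contains(current);
        }

        return d => filter(d) && anyTerm(d);
    }
}
```

One consequence of starting the term predicate at "false" is that a blank search term matches nothing, since no Or clause is ever added.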

The method below loads the facet data into the POCO facet objects.

private static List<SearchFacet> GetFacetResults(FacetResults results)
{
    List<SearchFacet> f = new List<SearchFacet>();
    foreach (FacetCategory fc in results.Categories)
    {
        SearchFacet sf = new SearchFacet();
        sf.FacetName = fc.Name;
        foreach(Sitecore.ContentSearch.Linq.FacetValue fv in fc.Values)
        {
            sf.Values.Add(new Someproject.Models.Search.FacetValue() { FacetCount = fv.AggregateCount, Value = fv.Name });
        }
        f.Add(sf);
    }
    return f;
}

Paginated search results and the faceted breakdown of the search results are neatly packaged and are ready to be serialized into JSON for the presentation layer.
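
As a rough illustration of that last step, the sketch below serializes a trimmed-down stand-in class with DataContractJsonSerializer. This is an assumption: any serializer that honors the DataContract attributes would behave similarly. HitSketch and JsonSketch are hypothetical names, not classes from the project.

```csharp
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

// Trimmed-down stand-in for the result classes; names are illustrative.
[DataContract(Name = "Hit")]
public class HitSketch
{
    [DataMember(Name = "Title")]
    public string Title { get; set; }

    // Excluded from the JSON, like the hidden members on the search entity.
    [IgnoreDataMember]
    public string InternalPath { get; set; }
}

public static class JsonSketch
{
    // Serializes the object to a JSON string using the data contract attributes.
    public static string ToJson(HitSketch hit)
    {
        var serializer = new DataContractJsonSerializer(typeof(HitSketch));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, hit);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

Only the members marked with DataMember reach the output, which is what keeps the hidden SearchResultItem properties out of the response.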

Conclusion

The Sitecore PredicateBuilder and the Content Search LINQ interface make building a site search solution very manageable. Computed fields let you store anything you need in the Lucene index.

Implementing Search Using Sitecore & Lucene – Part I

Here is an overview of how I recently implemented search for a web site built using Sitecore.  I did not have the option to design the site templates from scratch.  Instead, I inherited a messy template inheritance structure with some inconsistencies in the design. I also had no budget, so Coveo was not an option.


Index Creation

I started by creating a custom index to use for search.  I didn’t like the idea of tacking a large number of computed fields onto the default Sitecore indexes.

The search index needed to include items from the entire content tree as well as the media library.  I added two crawler location definitions.


<locations hint="list:AddCrawler">
    <crawler type="Sitecore.ContentSearch.ExcludeItemCrawler, XL.Website">
        <Database>master</Database>
        <Root>/sitecore/content</Root>
    </crawler>
</locations>
<locations hint="list:AddCrawler">
    <crawler type="Sitecore.ContentSearch.ExcludeItemCrawler, XL.Website">
        <Database>master</Database>
        <Root>/sitecore/media library</Root>
    </crawler>
</locations>

In the index configuration, I defined all of the templates that I needed to include in the index.


<include hint="list:IncludeTemplate">
    <Product>{272C2195-AFE6-47CC-9707-BC8FB1909BE4}</Product>
    <Article>{ABAEACC9-AC67-4A0D-B0DE-54982D2D3246}</Article>
    ...
</include>

I also defined any common fields that I would need in the index to search on, as well as display in the UI.


<fieldMap type="Sitecore.ContentSearch.FieldMap, Sitecore.ContentSearch">
    <fieldNames hint="raw:AddFieldByFieldName">
        <field fieldName="MetadataTitle" storageType="YES" indexType="UNTOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
            <analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer, Sitecore.ContentSearch.LuceneProvider" />
        </field>
        <field fieldName="MetadataKeywords" storageType="YES" indexType="UNTOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
            <analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer, Sitecore.ContentSearch.LuceneProvider" />
        </field>
    </fieldNames>
</fieldMap>

Computed Fields

So, what about the data that is not stored consistently across the site?  You can aggregate the needed data using a computed field.

public class MetadataDescriptionField : IComputedIndexField
{
   public string FieldName { get; set; }
   public string ReturnType { get; set; }

   public object ComputeFieldValue(IIndexable indexable)
   {
      Assert.ArgumentNotNull(indexable, "indexable");
      var indexableItem = indexable as SitecoreIndexableItem;

      if (indexableItem == null)
      {
         Log.Warn(string.Format("{0} : unsupported IIndexable type : {1}", this, indexable.GetType()), this);
         return null;
      }

      string sDescription = String.Empty;
      // IsDerived is the extension method defined below; it takes only the template ID
      if (indexableItem.Item.IsDerived(new Sitecore.Data.ID("some-template-guid")))
      {
         Sitecore.Data.Fields.Field stringField = indexableItem.Item.Fields["Somefieldname"];

         if (stringField != null)
         {
            sDescription = stringField.Value;
         }
      }
      else
      {
          // Handle other templates ...
      }
      return sDescription;
   }
}

public static class ItemExtensions
{
        public static bool IsDerived([NotNull] this Item item, [NotNull] ID templateId)
        {
            return TemplateManager.GetTemplate(item).IsDerived(templateId);
        }
}

The computed fields get added to the custom Lucene index configuration.


<fields hint="raw:AddComputedIndexField">
    <field fieldName="Description" storageType="YES" indexType="UNTOKENIZED">Someproject.Indexes.Computed.MetadataDescriptionField, Somenamespace</field>
    ...
</fields>

I added a variety of computed fields. Some of them used the LinkManager to store URLs to pages, or the MediaManager to store URLs for images, as needed by the presentation layer. In addition, Multilist fields needed to be converted into a usable format so that they could be used for faceting.

Computed Fields for Facets

If you are using facet values in your presentation layer, rather than GUIDs, you will need to convert your Multilist fields into tokenized facet value data in the Lucene search index.

public class CategoryField : IComputedIndexField
{
   public string FieldName { get; set; }
   public string ReturnType { get; set; }

   public object ComputeFieldValue(IIndexable indexable)
   {
       Assert.ArgumentNotNull(indexable, "indexable");
       var indexableItem = indexable as SitecoreIndexableItem;

       if (indexableItem == null)
       {
           Log.Warn(string.Format("{0} : unsupported IIndexable type : {1}", this, indexable.GetType()), this);
          return null;
       }

       List<string> sReturn = new List<string>();

       if (indexableItem.Item != null)
       {
           // IsDerived is the extension method from ItemExtensions
           if (indexableItem.Item.IsDerived(new Sitecore.Data.ID("some-template-guid")))
           {
               Sitecore.Data.Fields.MultilistField multilistField = indexableItem.Item.Fields["somefieldname"];
               if (multilistField != null)
               {
                  sReturn = HelperClass.GetListValues(multilistField, "someotherfieldname");
               }
               else
               {
                // ...
               }
           }
       }

       // Return on every code path, even when nothing matched
       return sReturn;
   }
}

public class HelperClass
{
   public static List<string> GetListValues(MultilistField multiListField, string fieldName)
   {
       List<string> results = new List<string>();
       if (multiListField == null) { return results; }

       foreach (Sitecore.Data.ID sitecoreID in multiListField.TargetIDs)
       {
           Item sitecoreItem = SitecoreHelper.GetItem(sitecoreID.ToString());
           string result = SitecoreHelper.GetFieldValue(sitecoreItem, fieldName);
           if (!string.IsNullOrWhiteSpace(result))
           {
               results.Add(result);
           }
       }
       return results;
   }
}

<fields hint="raw:AddComputedIndexField">
    <field fieldName="Description" storageType="YES" indexType="UNTOKENIZED">Someproject.Indexes.Computed.MetadataDescriptionField, Somenamespace</field>
    <field fieldName="ComputedCategory" storageType="YES" indexType="TOKENIZED">Someproject.Indexes.Computed.CategoryField, Somenamespace</field>
    ...
</fields>

Tokenized versus Untokenized

An important setting in the index field configuration is indexType. An untokenized field is indexed as a single term, so a query has to match the whole value. A tokenized field is broken up into separate terms by the analyzer.
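
To make the difference concrete, here is a minimal sketch that mimics the two settings with plain string handling. It is only an approximation: real Lucene analyzers do more than split on whitespace, and IndexTypeSketch, AsSingleTerm and AsTokens are hypothetical names that exist neither in Lucene nor in Sitecore.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-in for how a field value turns into index terms.
public static class IndexTypeSketch
{
    // UNTOKENIZED stand-in: the whole value becomes one lowercased term.
    public static List<string> AsSingleTerm(string value)
    {
        return new List<string> { value.ToLowerInvariant() };
    }

    // TOKENIZED stand-in: the value is split on whitespace into separate terms.
    public static List<string> AsTokens(string value)
    {
        return value.ToLowerInvariant()
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
    }
}
```

For a value such as "Annual Reports", AsSingleTerm yields the single term "annual reports", while AsTokens yields "annual" and "reports" as two independent terms.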

Facet Values Containing Spaces

One major gotcha that I encountered was facet values that contain spaces. Because the facet fields are tokenized, the spaces in the facet values wreaked havoc with the facet result sets. I found an excellent blog post by Ryan Bailey, referencing a solution provided by Martina Welander.

Adding the computed facet fields to the fieldMap section solved the problem.


<fieldMap type="Sitecore.ContentSearch.FieldMap, Sitecore.ContentSearch">
    <fieldNames hint="raw:AddFieldByFieldName">
        ...
        <field fieldName="ComputedCategory" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
            <analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer, Sitecore.ContentSearch.LuceneProvider" />
        </field>
        ...
    </fieldNames>
</fieldMap>

Dynamically Excluding Content

One of the requirements for this site was to be able to explicitly hide content from the site search based on an item field value. The most direct solution would be to filter the results at query time. I chose instead to move this requirement into a custom crawler, because the search and faceting logic was already complex and I did not want to complicate it further.

public class ExcludeItemCrawler : SitecoreItemCrawler
    {
        protected override bool IsExcludedFromIndex(SitecoreIndexableItem indexable, bool checkLocation = false)
        {
            bool isExcluded = base.IsExcludedFromIndex(indexable, checkLocation);

            if (isExcluded)
                return true;

            Item item = (Item)indexable;

            // If it's a wildcard item, keep it out of the index
            if (item.Name == "*")
                return true;

            // Several complex checks
            if (somecondition)
                return true;
            ...
            return false;
        }
    }

<locations hint="list:AddCrawler">
    <crawler type="Somenamespace.ContentSearch.ExcludeItemCrawler, Somenamespace">
        <Database>master</Database>
        <Root>/sitecore/content</Root>
    </crawler>
</locations>
<locations hint="list:AddCrawler">
    <crawler type="Somenamespace.ContentSearch.ExcludeItemCrawler, Somenamespace">
        <Database>master</Database>
        <Root>/sitecore/media library</Root>
    </crawler>
</locations>

OK, so now that we have the data that we want in the Lucene index, I’ll talk about how to query it in Part II.