Friday, November 12, 2010

Sitecore Searcher and Advanced Database Crawler


Hi there,
Today I am proud to announce a preview release of a component that extends the standard Sitecore search mechanisms, specifically the relatively “new” Sitecore.Search namespace introduced in 6.0, and provides easy-to-use search query APIs. If you are not sure what I am talking about, check out this recently published document on SDN and also Ivan’s blog posts about it.


As an example, I walk through five generic data extraction scenarios and show how this component can help in daily development work while still leveraging standard Sitecore functionality.
The session is broken down into demos and code walkthroughs. If you are not excited about the code, you can skip those parts.
There is no documentation available yet, but I do plan to write a blog series covering the recorded scenarios in more detail.
The code has been published to the Shared Source Library. You can check it out.
Please let me know how it works for you; I appreciate any kind of feedback.
A shortened video is posted on YouTube.

64 comments:

TimWard said...

This video is spectacular. This will make Sitecore Developers so much more aware of what is possible with the Sitecore Search namespace. Well done!

- Tim Ward

Alex Shyba said...

Thanks Tim! This has been my favorite project for exactly the reason you mentioned.

aussieviking said...

Found this while I was trying to track down some information about custom searching in Sitecore. Thank you for sharing this awesome extension library!

I have one question regarding search possibilities. When you configure a search index you can specify which templates to include and also which fields to include. What I am looking for is a way to set a boost or weight value on individual fields and use that when searching for items by relevance. Say, for example, I have a template with three fields (name, description and keywords), and I rate them as relevant in that order. This would mean that if my search term exists in the name field of an item, then that item is returned before items with hits in the description or keywords fields. I know you can set a boost value when configuring the index, but as I understand it, that would only be useful if you're comparing results from two separate indexes.

Thoughts?

k said...

Hey Alex, great library. I was able to get it working quickly. I was also able to copy your examples out of the video. May I suggest making the source for your examples available as well, or just putting them in the scripts folder of the library? Just as a note to others who may be new to Lucene: this is a new way to configure, build and access your index.

step 1: make sure you're on at least Sitecore 6.2 rev. 101105
step 2: from /App_Config/Include/Sitecore.SharedSource.Search.config in the shared source library, add the three sections (Engines, Engines.HistoryEngine.Storage, Index) to the web.config in your app.
step 3: from the library, copy /scripts/RebuildDatabaseCrawlers.aspx and its .cs file into your app.
step 4: hit the page and build the index
step 5: build a new page/form and handle the form values by creating a SearchParam object and a Searcher object. Then call GetItems(searchParam) on the searcher.

I think you did a lot of nice work handling each specific field type and supporting ranges. But kudos most of all for returning only the latest item version.

Alex Shyba said...

Thanks K, I will make sure to include these steps in the documentation.

Gulle said...

Hi Alex,

Very nice!
Looking forward to your technical blog series.

Koen

Mark Ursino said...

Not trying to put on any pressure, but is there a plan for when official documentation will be released on this or is that on the back-burner? I'm doing my own documentation for this within our own project that uses it but I wasn't sure when to really expect the real docs.

Thanks!

Alex Shyba said...

Mark - I will do my best to get something started, but my hope was to engage community members like yourself more. Since the code is open, it is fairly straightforward to figure out what goes where. My hope was that the 40-minute screencast would be sufficient to get going with this.

Eugene Novikov said...

Great presentation Alex!
I am looking for advice related to crawling/indexing in Sitecore.
We have a multi-language web site in Sitecore 6.0. Our existing crawler (based on the Lucene API) was written by another company and runs once per day as a scheduled application, but it takes more than 1 hour and a lot of resources.
The question is: can we use the internal Sitecore Lucene indexing/crawling mechanism in version 6.0 for our multi-language web site, or can you suggest other solutions?

Eugene

Alex Shyba said...

Hello Eugene,

Thanks!

This is exactly what you should be leveraging the built-in Sitecore Search for. If the multilingual aspect is very important, it is highly recommended to upgrade to 6.2 Update-5, which contains some critical fixes for versioned content.

Also, check what kind of search you are using: new or old.
http://sitecoreblog.alexshyba.com/2011/02/8-reasons-to-use-new-search-in-sitecore.html

"Old" search is slower and less efficient.

1 hour sounds like way too much for a full rebuild unless we are talking about millions of content items.

You should also be able to leverage incremental index update as opposed to full rebuild.

Email me if you need help with all this: AS -> sitecore.net

Eugene Novikov said...

Thanks Alex!
I read the blog post you mentioned, but this is where I am confused. As I understand it, this search relates to indexing internal Sitecore items stored in the database to improve search in the Sitecore Desktop, and I understand this can be done incrementally and quickly when a developer adds a new item.
My question relates to indexing the content of all our web pages (we have a web site for each language). Our custom-built crawler/indexer runs every night, requests every page and indexes it using the Lucene IndexWriter. So if a developer or content editor updates some content, they have to wait for the next run of our indexer/crawler; only after that can external users see the new content in search results.
Sorry for the long and messy explanation.
Could this be done in a newer version of Sitecore without any external crawler, with good performance for a multi-language web site?

Alex Shyba said...

Eugene,

>>> Our custom build crawler/indexer run every night and requesting every page and indexing it by using Lucene IndexWriter...

So you are indexing the published html pages with Lucene?

If the pages you need to index are Sitecore-managed, you can actually have them indexed by the built-in Sitecore database crawler.

If those pages are external to Sitecore, then it's a different story. Please clarify.

Feel free to email me, that could be more efficient.

James said...

Great stuff. Everything is working perfectly except when I run a FullTextQuery for something like "matt caine". I am getting results back that have no "matt" or "caine" in them, just words like "matter". Since there is no "caine" in these documents, I am assuming the terms are combined with OR instead of AND. Can you tell me how I can fix this?

Christian said...

Hi Alex,

It really looks amazing. I was trying to download code at http://trac.sitecore.net/AdvancedDatabaseCrawler but I didn't find any downloadable package. What am I doing wrong?

Alex Shyba said...

Christian,

There is no downloadable package, just the source code. Simply check out the whole project from the source control and build ;-)

Eugene Novikov said...

Alex, following your recommendation we upgraded our development server from 6.0 to the latest release and are trying to implement web search based on the Sitecore Lucene module, but there is a puzzle I am still trying to solve. All our web pages share a lot of renderings that build the different menus on the pages, so when a user searches for a word that exists in a menu, the search returns many pages where the word is not part of the content but only part of the menu.
I built our new index with "ExcludeTemplate", but it does not solve the problem.
I found a suggestion, but I do not like it:

http://wiki.evident.nl/Sitecore Exclude content from Search Server crawler.ashx

Can you recommend some solution?

Elisabeth Pagels said...

This is a great module! Good work.

A question:

How can we update just a part of the index rather than the whole index? We need to update a specific item and its subitems without having to reindex everything. Is that possible? Any help is appreciated.

Kasper

Alex Shyba said...

Hi Kasper,

Unfortunately, it is not easy to force an update programmatically. I can only speculate in terms of why, but the main idea is that you configure the index and let Sitecore be fully responsible for the index maintenance. As soon as an item is updated (saved in master or published to web), the index will be updated for you.

So if you are trying to trigger that on a web database index, the easiest way is to republish the item with all subitems. Sitecore will take care of the update automatically if configured.

If it's the master database index, you can try to programmatically crawl the content tree and issue item save via API. That should put an entry into the history table and thus execute search index update operation.

Hope this helps,
-alex

Sampath said...

Hello Alex,

This is a great article. I am pretty new to Sitecore and wanted to implement this on the website I am developing. However, after spending almost a day on it I have ended up nowhere and still cannot get it working. I would really appreciate it if any of you could help me out with this.

1) Do I need to install the Lucene Search module on the site for this to work? http://trac.sitecore.net/LuceneSearch

2) How do I configure the web.config to add the indexes as you mentioned in your example? Should I just copy the Sitecore.SharedSource.Search.config file from your example and somehow add a reference to it in my web.config? If yes, can you tell me how to do this?
Otherwise, can you tell me another way to get this working?

I would really appreciate it if you could help me out with this.

Sampath said...

Hello Alex,

I am trying to set this up for my new website. I am new to Sitecore, so I am not sure what I am doing wrong, and I have already spent a day trying to get this to work. Can you please help me here?

I have a couple of questions:

1) Do I need to install the Lucene Search module for this to work? http://trac.sitecore.net/LuceneSearch/

2) I have tried all the steps "K" mentioned in his comment, but I think I am doing something wrong with the web.config part. How do I add these sections to my site? I copied the Sitecore.SharedSource.Search.config file to my App_Config folder; now how do I add a reference to it in my web.config?

I get this error when I browse to RebuildDatabaseCrawlers.aspx: Could not resolve type name: Sitecore.SharedSource.Search.Crawlers.AdvancedDatabaseCrawler,Sitecore.SharedSource.Search (method: Sitecore.Configuration.Factory.CreateType(XmlNode configNode, String[] parameters, Boolean assert)).

Please advise.

GW said...

Hi

I'd like to develop a geocoded/proximity 'find my nearest' type of search. I see that this is possible with Lucene.net directly, so I am assuming that it's possible via Sitecore as well.

Can you provide any hints and tips on how to achieve this functionality?

Thanks

spradlinb said...

Just a quick question. In all the other trac modules I've been able to download code from a specific "download" link directly on the trac page... all I see on the Advanced Database Crawler page is a link to this documentation and an Index link. How do I download the crawler? Is there some other way to get code like this?

Thanks!

hdvti said...

Great video! Great information!

Martijn Bos said...

Hi Alex, I have used the Advanced Database Crawler during my latest project. The demo pages provide a lot of help. I noticed that using the language selection dropdown doesn't return results if the language is defined with a locale (i.e. nl-NL). Any ideas to why this is?

Alex Shyba said...

Hi Martijn,

I think you will need to publish the language definition item to the web database for the dropdown to pick it up. The demo pages use the web database as the context database by default.

-alex

Marcin Dzięgielewski said...

There is a bug in FilteredFields method of AdvancedDatabaseCrawler.

else if (HasFieldIncludes)
{
    foreach (var includeFieldId in from p in FieldFilter where p.Value select p)
    {
        filteredFields.Add(item.Fields[ID.Parse(includeFieldId)]);
    }
}

includeFieldId is a key/value pair, not a string. Its Key should be used instead:

    filteredFields.Add(item.Fields[ID.Parse(includeFieldId.Key)]);


ID format exception is thrown when using field filtering.
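
A minimal, self-contained sketch of the fix described above, assuming FieldFilter behaves like a dictionary that maps a field ID string to an include flag (that shape is inferred from the snippet, not confirmed against the library source). The class name, method name and sample GUIDs are hypothetical, and Guid.Parse stands in for Sitecore's ID.Parse so the example runs without Sitecore:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FieldFilterDemo
{
    // Assumption: the filter maps a field ID string to an "include" flag.
    public static List<Guid> IncludedFieldIds(IDictionary<string, bool> fieldFilter)
    {
        return fieldFilter
            .Where(pair => pair.Value)             // keep only entries flagged for inclusion
            .Select(pair => Guid.Parse(pair.Key))  // parse the key, not the whole pair
            .ToList();
    }

    public static void Main()
    {
        // Sample GUIDs are placeholders, not real Sitecore field IDs.
        var filter = new Dictionary<string, bool>
        {
            { "{11111111-1111-1111-1111-111111111111}", true },
            { "{22222222-2222-2222-2222-222222222222}", false }
        };

        foreach (var id in IncludedFieldIds(filter))
        {
            Console.WriteLine(id.ToString("B"));
        }
    }
}
```

Selecting the key before parsing avoids handing the whole key/value pair to the parser, which is what produces the format exception.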

Alex Shyba said...

Hi Marcin,

thanks, I have actually checked in the fix for this yesterday.

http://trac.sitecore.net/AdvancedDatabaseCrawler/browser/Branches/v2/

-alex

barygoodcoder said...

Hey Alex, I am a little confused about why you explicitly attach the _database field in your query parser. I built a custom media parser to parse PDFs, so I figured I would write the database to the _database field. However, during testing, if I rebuild the index from the Sitecore Index Viewer, SearchHelper.ContextDB reports "master" even though it is in fact crawling web. So none of my custom crawled entries are returned when I query. I'll just rebuild the crawlers through the web page, where "web" is reported as the right context; I am just wondering why you have it there.

andreasordell said...

Thank you Alex for great work!

I keep running in to a strange problem. Every now and then the indexing fails with the following error in log file:

Lucene.Net.Store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@c:\\data\indexes\\write.lock at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout)

Ever seen this behavior?

Bary said...

Okay, still curious about the database being added to the query, but my problem was not with that. I was adding custom fields with my PDF crawler. I had them camel-cased, and somewhere along the way they were lower-cased. Lucene field names are case-sensitive, it appears.

christophe said...

Hi Alex

I just downloaded the files and tried to build Sitecore.SharedSource.Search.sln. It throws some errors: it is looking for "LookupSourceSetField.cs", "ModfiedSystemTemplateField.cs" and "SuplicateFieldNamesField.cs", which cannot be found. It is looking for them in the directory "Branches\v2\Sitecore.SharedSource.SearchCrawler.DynamicFields\Templates" and, sure enough, they are not there. I downloaded the code a couple of times. Any ideas?

Thanks

Christophe

Alex Shyba said...

Hi Christophe,

I've trimmed the solution a bit, specifically removed most of the dynamic fields as they are not critical. I left just a few to demonstrate how such functionality can be used.

Just tested the latest codebase, it should build fine.

-alex

Alex Shyba said...

Hi barygoodcoder,

The _database parameter was added in order to support multiple "locations" within single index. One location can be pointing to "master", another to "web", so you need to have a way to filter out the documents based on the context database. If you do not specify it within SearchParam, it will always default to either Context.ContentDatabase or Context.Database. In other words, it depends on from where you actually execute the code.

Hope it clarifies things a bit.

-alex

chaturangar said...

Hi Alex,
Your solution is great. We have used it on our system.
But, we are facing a small problem when trying to trim the index.
We set to false.
i.e. false

Then, we added a field to section.

i.e.

{7F019E60-7F78-4163-9388-282AF9918AFF}


Then, when we try to rebuild the index using indexViewer tool, it returns an error, saying given GUID format is wrong.

Do you have any advice on this ?

Thanks and Best Regards,
Chaturanga

Sean Holmesby said...

Hi Alex,
Great work with the module. I was wondering if there is a known issue with sorting? I have implemented this on Sitecore 6.5 (Update 3), and found that the sorting by field option doesn't seem to work.

I tried using the same setup as Brian Pederson's 'Latest News' blog (http://briancaos.wordpress.com/2011/10/13/get-latest-news-using-sitecore-advanceddatabasecrawler-lucene-index/), as well as trying the search demo page, and found that my items would always return in the same order, no matter what field I said to sort it by.

I even found I was getting the same order with 'reverse' being set to true or false.

Any ideas?

Cheers,
Sean

sumith said...

This is a great post and makes implementing search very simple. Thanks for your effort; it was very helpful.

I used this post along with
the updated code http://trac.sitecore.net/AdvancedDatabaseCrawler/browser/Branches/v2/
post from
http://briancaos.wordpress.com/tag/advanceddatabasecrawler/

and http://sitecoreug.org/events/Searching%20Sitecore%20with%20Lucene video here http://www.youtube.com/watch?feature=player_embedded&v=utGMKTG-r7U

to get search up and working in few hours.

There are a few questions that are still not clear.
From aussie.. : rate them as relevant in that order.
From James : except for when I do a FullTextQuery equal to something like "matt caine"

Can you provide examples/demo pages for search relevance and multi-keyword search, along the lines of Google search operators?

Александр said...

Very good Alex. You have a point.

Kyle said...

Alex, this is an amazing module. Thanks for your work on this. I just had one question, and it may be more about how indexing works at a lower level, but I thought I'd ask anyway. When using the FieldValueSearch, I want to search for exact values in a single-line text field; however, it appears that all substrings are matching as well. For example, a search for the value "sun" would return any items with values "sun", "sunday", "sundae", etc. I tried changing the field to untokenized, but this didn't seem to help.

Tohams said...

I'm trying to do a search that excludes 2 templates...not every time, but when the user does a specific search. If I do one searchparam with the fulltextsearch and other appropriate attributes, I get all the results. But trying to add 2 more searchparams to the collection with the template ID's to ignore and QueryOccurence.MustNot doesn't seem to work. What am I doing wrong?

Thanks!

ghettoiam said...

Hey Alex,

Your work here is great. Quick question about SVN however.

You have your release on trunk but also have two other branches:

http://svn.sitecore.net/AdvancedDatabaseCrawler/Branches/v2-LST

http://svn.sitecore.net/AdvancedDatabaseCrawler/Branches/v2

I noticed they have more projects than the "trunk" does and to my mind make more sense.

Are they okay to use? What is their status? Many thanks!

Bary said...

Darn Tohams, I have just run into the same problem. Did you figure it out?

brian @ BrainJocks said...

Alex,

We implemented this today and it appears to work so far :) We've added a dynamic field and can see it in the index, etc. A couple of questions:

1. Which branch to use? We are using Sitecore 6.5.111230. Is the 2.0 LST branch still in development or what is the status?

2. is the custom rebuild index screen required for use or can we omit that override?

Thanks so much for the great work !!

Alex Shyba said...

Hi Brian,

You can use the v2 branch. It is stable; I just did not have time to move it into the trunk.

trac.sitecore.net/AdvancedDatabaseCrawler/browser/Branches/v2

Will github it soon, very soon ;-)

2. The override for the search index rebuild wizard is not required.

brian @ BrainJocks said...

Alex - to your knowledge has anyone developed an extension to the source property of Sitecore fields (for example treelist or treelistex) that will use a lucene query instead of Sitecore query or fast query?

I can see adding something like "lucene:" or "index name:" as a prefix to issue a database query....crazy?

Ivan said...

Hi Alex, this module is our favourite for Sitecore. We just have one question: is it possible to exclude items under a certain path? E.g. if we crawl everything under /sitecore/home, but we want to exclude one of the branches, say /sitecore/home/references.

Alex Shyba said...

Hi Ivan,

The crawler does not support path exclusions. In your example, instead of rooting at /home, you can define multiple locations within an index, such as /home/news and /home/events. This will exclude /home/references.

Another way to go is to customize the IsMatch method; it is protected virtual, so you are good. This could be a very easy addition.
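
A rough sketch of the path check such a customization might rely on. The IsMatch signature is not shown in this thread, so it is not reproduced here; PathExclusionDemo and IsExcluded are hypothetical names, and the helper works on plain path strings (for example, the value of item.Paths.FullPath) so that it runs without Sitecore:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PathExclusionDemo
{
    // Hypothetical helper: true when itemPath is an excluded root or lies beneath one.
    public static bool IsExcluded(string itemPath, IEnumerable<string> excludedRoots)
    {
        return excludedRoots.Any(root =>
        {
            var normalized = root.TrimEnd('/');
            return itemPath.Equals(normalized, StringComparison.OrdinalIgnoreCase)
                // The trailing slash keeps sibling items such as "/references2" from matching.
                || itemPath.StartsWith(normalized + "/", StringComparison.OrdinalIgnoreCase);
        });
    }

    public static void Main()
    {
        var excluded = new[] { "/sitecore/home/references" };

        Console.WriteLine(IsExcluded("/sitecore/home/references/doc1", excluded)); // True
        Console.WriteLine(IsExcluded("/sitecore/home/references2", excluded));     // False
        Console.WriteLine(IsExcluded("/sitecore/home/news", excluded));            // False
    }
}
```

An overridden crawler method could then skip any item for which this check returns true. Where exactly the check belongs depends on the crawler version in use.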

Tohams said...

I've gotten and compiled the version in GitHub. But I'm having a problem. It seems that I can't get a pipe delimited list of template ID's in the TemplateIds field to work. If I use a single Template ID, everything is fine...for that single ID. But once I add any more GUIDs delimited by pipes, I get no results. What could I be doing wrong?

Alex Shyba said...

Tohams,

You gotta add another XML def. for the second filter:

guid1
guid2

Each tag name has to be unique.

Ivan said...

Thanks Alex, I modified the AddItem method in crawler to exclude the path, and it works quite well so far.

Tohams said...

Alex, sorry...I don't think I was clear. I know in the config each ID needs a node. I'm actually talking about code.

var searchParam = new SearchParam()
{
    Database = Sitecore.Context.Database.Name,
    Language = Sitecore.Context.Language.Name,
    TemplateIds = "{F05D8987-0241-48A1-A3FE-8F32872867C9}"//|{3333EF02-EA66-4FEA-8A9B-B82A72C34756}|{91B4744D-B60F-4FA8-A420-E32E4519A1D6}"
};

If I remove the quote and uncomment the other 2 GUIDs, I get no results. Whereas with the one GUID, I do get results.

I realize I can probably work around this by setting up an additional search index with just those 3 guids, but when I'm done, the actual guids will be dynamic.

Thanks for your help, as always!

Natalie said...

Hi Alex,

I have this module installed and working great. Now I am adding a new language (Japanese) to the website, making the site multilingual. The language is set through "Sitecore.Context.Language.Name" as "ja-JP", but no results are returned. I have rebuilt the indexes but am not sure what else I can do.

Thanks!

Tohams said...

I wasn't clear in what I was asking. I understand the different tags for each guid in the config. But my question was actually about from the code side of things.

This code does _not_ return anything:

var searchParam = new SearchParam
{
    Database = Sitecore.Context.Database.Name,
    Language = Sitecore.Context.Language.Name,
    TemplateIds = "{F05D8987-0241-48A1-A3FE-8F32872867C9}|{3333EF02-EA66-4FEA-8A9B-B82A72C34756}|{91B4744D-B60F-4FA8-A420-E32E4519A1D6}"
};

However, this code _does_ return results:

var searchParam = new SearchParam
{
    Database = Sitecore.Context.Database.Name,
    Language = Sitecore.Context.Language.Name,
    TemplateIds = "{F05D8987-0241-48A1-A3FE-8F32872867C9}"
};

My understanding is that the GUIDs should be pipe delimited, so I'm not sure why it doesn't work.

Thanks!

Alex Shyba said...

Tohams and Natalie,

Could you please post these questions as issues on GitHub?

https://github.com/sitecorian/SitecoreSearchContrib/issues

Thanks!

Tohams said...

In the config file, I have the following:

true

However, when I use the Shared Source Index Viewer to look at my index, "__display name" is not indexed. I've even tried to explicitly add it to the index:


{B5E02AD9-D56F-4C41-A065-A133DB87BDEB}


Ultimately, I'm trying to do a "starts with" query on the __display name for predictive search.

var fieldParam = new FieldSearchParam()
{
    Database = Sitecore.Context.Database.Name,
    Language = Sitecore.Context.Language.Name,
    Condition = QueryOccurance.Must,
    FieldName = "__display name",
    FieldValue = query
};

Just trying to get results from that back before I figure out how to do a "starts with" query on that field.

Alex Shyba said...

Tohams,

Have you tried setting Partial=true?

Tohams said...

Yes. I think the problem is that "__display name" isn't being indexed. My previous post didn't work because I had some HTML in it. :P

I have [IndexAllFields]true[/IndexAllFields] in the config.

Even when I use the Index Viewer to look, there's no values in "__display name". Any guesses as to what I'm missing?

Also, is partial a *query* wildcard search? I need specifically query* since it's for type-ahead/predictive search.

Tohams said...

OK...I figured it out. __Display Name is a Text field, which is deprecated. So in the .config that ships with Advanced Database Crawler, it's not indexed. As soon as I set storageType="YES" for the text fieldtype, it's indexed.

Alex Shyba said...

The field does not have to be marked as stored in order to be searchable. You just don't see it in the index. I will try it locally, and will let you know how it goes.

nourestani said...

If you want to index the properties of an item then here is a solution

http://nourestani.wordpress.com/2012/10/15/indexing-sitecore-item-properties-in-lucene/

nourestani said...

If you are interested in how to index PDF content with Lucene AdvancedDatabaseCrawler in Sitecore, the code is at the link below.

http://nourestani.wordpress.com/2012/10/29/how-to-index-pdf-content-with-lucene-advanceddatabasecrawler-in-sitecore/

Paramveer Singh said...

Nice Post Alex,

I have two sites: a main site and a microsite of the main site (it has some cloned items). I need to implement a search that fetches results from both sites, with duplicate results removed.

Please guide me on how to implement this feature.

Thanks

natashajaz said...

Hi Alex, I am using AdvancedDatabaseCrawler for numeric range search. It runs, but I found that it gives me incorrect results.
I want to search a numeric range using Lucene. If I provide a price range like $1 to $5, it should return the products within that price range. Lucene is able to search the range, but it returns irrelevant results.
For example, if I search for products in the price range 1 to 5, Lucene also returns a product with a price of 11.30. This may be because the Lucene search string looks like "price:[1 to 5]", which actually matches a record with the value "11.30": the first character in "11.30" is '1', and that is between '1' and '5', so the whole string "11.30" is treated as being between "1" and "5".
I want Lucene to return only products with a price between 1 and 5, not 11.30.

To search the range I have used a range query as follows:

public static string FormatNumber(int number)
{
    return FormatNumber((double)(number));
}

protected void AddNumericRangeQuery(BooleanQuery query, NumericRangeField range, BooleanClause.Occur occurance)
{
    Term lowerTerm = new Term(range.FieldName, FormatNumber(range.Start));
    Term upperTerm = new Term(range.FieldName, FormatNumber(range.End));
    RangeQuery query1 = new RangeQuery(lowerTerm, upperTerm, true);
    query.Add(query1, occurance);
}

With AdvancedDatabaseCrawler, the AddNumericRangeQuery() function formats the input range (e.g. FormatNumber(range.Start) and FormatNumber(range.End)) as

public static string FormatNumber(double number)
{
    return number.ToString().PadLeft(int.MaxValue.ToString().ToCharArray().Count(), '0');
}

so that the Lucene search string looks like "price:[0000000001 to 0000000005]" (note: approximate left padding). That is why Lucene is not able to find the results.
When I change this function to

public static string FormatNumber(double number)
{
    return number.ToString();
}

it works for me and I get results, but they are irrelevant, as I explained above.
For your reference, in my content tree:
Field name: Price
Field values: 10, 25, 45, etc.
That is why Lucene cannot match these field values when I pad the start and end of the range.

Can you please help me out with the numeric range? Please reply as soon as possible. Thank you, it's urgent.
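
The behaviour described above can be reproduced without Lucene. As far as I know, a term range query of this kind compares terms as strings, so the sketch below uses ordinal string comparison as a stand-in. PaddedNumberDemo, Pad and InRange are hypothetical names, and the fixed-width format is an assumption for illustration, not the padding the library uses:

```csharp
using System;
using System.Globalization;

public static class PaddedNumberDemo
{
    // Hypothetical helper: fixed-width, zero-padded text so that string order matches numeric order.
    public static string Pad(decimal value)
    {
        return value.ToString("0000000000.00", CultureInfo.InvariantCulture);
    }

    // Inclusive range test using ordinal string comparison, as a stand-in for a term range query.
    public static bool InRange(string term, string lower, string upper)
    {
        return string.CompareOrdinal(term, lower) >= 0
            && string.CompareOrdinal(term, upper) <= 0;
    }

    public static void Main()
    {
        // Unpadded: "11.30" sorts between "1" and "5" as text.
        Console.WriteLine(InRange("11.30", "1", "5"));              // True

        // Padded on both sides: 11.30 falls outside 1..5, 3.50 falls inside.
        Console.WriteLine(InRange(Pad(11.30m), Pad(1m), Pad(5m)));  // False
        Console.WriteLine(InRange(Pad(3.50m), Pad(1m), Pad(5m)));   // True

        // Padded bounds against an unpadded stored value: no match.
        Console.WriteLine(InRange("3.50", Pad(1m), Pad(5m)));       // False
    }
}
```

The last case mirrors the situation in this comment: padding the query bounds only helps when the indexed values were written with the same padding, which would mean rebuilding the index after changing how the field is stored.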

Mariella Scofield said...

In ADC we have the concept of RelatedIds. How does that translate to search in Sitecore 7, if I want to filter by related IDs?

thanks,
Mariella

Alex Shyba said...

It is based on the same _links field in 7. It may be disabled by default in config, I'd double check that. But it should be pretty much OOTB in 7.