Friday, November 12, 2010

Sitecore Searcher and Advanced Database Crawler


Hi there,

Today I am proud to announce a preview release of a component that extends the standard Sitecore searching mechanisms, specifically the relatively “new” Sitecore.Search namespace introduced in 6.0, and provides easy search querying APIs. If you are not sure what I am talking about, check out this recently published document on SDN and Ivan’s blog posts about it.

As an example, I walk through five generic data extraction scenarios and show how this component can help in daily development work while still leveraging standard Sitecore functionality.

The session is broken down into demos and code walkthroughs. If you are not excited about the code, you can skip those parts.

There is no documentation available yet, but I plan to write a blog series covering the recorded scenarios in more detail.

The code has been published to the Shared Source Library. You can check it out.
Please let me know how it works; I appreciate any kind of feedback.

A shortened video is posted on YouTube.

36 comments:

TimWard said...

This video is spectacular. This will make Sitecore Developers so much more aware of what is possible with the Sitecore Search namespace. Well done!

- Tim Ward

Alex Shyba said...

Thanks Tim! This has been my favorite project for exactly the reason you mentioned.

aussieviking said...

Found this while I was trying to track down some information about custom searching in Sitecore. Thank you for sharing this awesome extension library!

I have one question regarding search possibilities. When you configure a search index you can specify which templates to include and also which fields to include. What I am looking for is a way to set a boost or weight value on individual fields and use that when searching for items by relevance. Say, for example, I have a template with three fields (name, description and keywords), and I rate them as relevant in that order. This would mean that if my search term exists in the name field of an item, then that item is returned before items with hits in the description or keywords fields. I know you can set a boost value when configuring the index but, as I understand it, that would only be useful if you're comparing results from two separate indexes.

Thoughts?

k said...

Hey Alex, great library. I was able to get it working quickly, and I was also able to copy your examples from the video. May I suggest making the source for your examples available as well, or just putting them in the scripts folder of the library? As a note to others who may be new to Lucene: this is a new way to configure, build and access your index.

step 1: make sure you're on at least Sitecore 6.2 rev. 101105
step 2: from the shared source library, take /App_Config/Include/Sitecore.SharedSource.Search.config and add its three sections (Engines, Engines.HistoryEngine.Storage, Index) to the Web.config in your app.
step 3: from the library, copy /scripts/RebuildDatabaseCrawlers.aspx and .cs into your app.
step 4: hit the page and build the index
step 5: build a new page/form and handle the form values by creating a SearchParam object and a Searcher object, then call GetItems(searchParam) on the searcher.

I think you did a lot of nice work handling each specific field type and supporting ranges, and most especially returning only the latest item version. Kudos.

Alex Shyba said...

Thanks K, I will make sure to include these steps in the documentation.

Gulle said...

Hi Alex,

Very nice!
Looking forward to your technical blog series.

Koen

Mark Ursino said...

Not trying to put on any pressure, but is there a plan for when official documentation will be released on this or is that on the back-burner? I'm doing my own documentation for this within our own project that uses it but I wasn't sure when to really expect the real docs.

Thanks!

Alex Shyba said...

Mark - I will do my best to get something started, but my hope was to engage community members like yourself more. Since the code is open, it is fairly straightforward to figure out what goes where. My hope was that the 40-minute screencast would be sufficient to get going with this.

Eugene Novikov said...

Great presentation Alex!
I am looking for advice related to crawling/indexing in Sitecore.
We have a multi-language web site on Sitecore 6.0, and our existing crawler (based on the Lucene API) was written by another company. It runs once per day as a scheduled application, but it takes more than 1 hour and a lot of resources.
The question is: can we use Sitecore's internal Lucene indexing/crawling mechanism in version 6.0 for our multi-language web site, or can you suggest other solutions?

Eugene

Alex Shyba said...

Hello Eugene,

Thanks!

This is exactly what you should be leveraging the inbuilt Sitecore Search for. If the multilingual aspect is very important, I highly recommend upgrading to 6.2 Update-5, which contains some critical fixes for versioned content.

Also, check what kind of search you are using: new or old.
http://sitecoreblog.alexshyba.com/2011/02/8-reasons-to-use-new-search-in-sitecore.html

"Old" search is slower and less efficient.

1 hour sounds like way too much for a full rebuild unless we are talking about millions of content items.

You should also be able to leverage incremental index updates as opposed to a full rebuild.

Email me if you need help with all this: AS -> sitecore.net

Eugene Novikov said...

Thanks Alex!
I read the blog post you mentioned, but here is my confusion. As I understand it, this search relates to indexing the internal Sitecore items stored in the database to improve search in the Sitecore Desktop, and I understand this can be done incrementally and quickly when a developer adds a new item.
My question relates to indexing the content of all our web pages (we have a web site for each language). Our custom-built crawler/indexer runs every night, requests every page and indexes it using the Lucene IndexWriter. So if a developer or content editor updates some content, they have to wait for the next run of our indexer/crawler; only after that can external users see the new content in search results.
Sorry for the long and messy explanation.
Could this be done in a new version of Sitecore, without any external crawler, with good performance for a multi-language web site?

Alex Shyba said...

Eugene,

>>> Our custom build crawler/indexer run every night and requesting every page and indexing it by using Lucene IndexWriter...

So you are indexing the published HTML pages with Lucene?

If the pages you need to index are managed by Sitecore, you can actually have them indexed by the inbuilt Sitecore database crawler.

If those pages are external to Sitecore, then it's a different story. Please clarify.

Feel free to email me, that could be more efficient.

James said...

Great stuff. Everything is working perfectly except when I run a FullTextQuery for something like "matt caine". I get results back that contain neither matt nor caine, just words like matter. Since there is no caine in these documents, I am assuming that the terms are combined with OR instead of AND. Can you tell me how I can fix this?

Christian said...

Hi Alex,

It really looks amazing. I was trying to download the code at http://trac.sitecore.net/AdvancedDatabaseCrawler but I didn't find any downloadable package. What am I doing wrong?

Alex Shyba said...

Christian,

There is no downloadable package, just the source code. Simply check out the whole project from the source control and build ;-)

Eugene Novikov said...

Alex, following your recommendation we upgraded our development server from 6.0 to the latest release and are trying to implement web search based on the Sitecore Lucene module, but there is a puzzle I am still trying to solve. All our web pages share a lot of renderings that build the different menus on the pages, so when a user searches for a word that exists in a menu, the search returns many pages where the word is not part of the content, only part of the menu.
I built our new index with "ExcludeTemplate", but it does not solve the problem.
I found one suggestion, but I do not like it:

http://wiki.evident.nl/Sitecore Exclude content from Search Server crawler.ashx

Can you recommend some solution?

Elisabeth Pagels said...

This is a great module! Good work.

A question:

How can we update just a part of the index and not the whole index? We need to update a specific item and its subitems without having to rebuild the whole index. Is that possible? Any help is appreciated.

Kasper

Alex Shyba said...

Hi Kasper,

Unfortunately, it is not easy to force an update programmatically. I can only speculate in terms of why, but the main idea is that you configure the index and let Sitecore be fully responsible for the index maintenance. As soon as an item is updated (saved in master or published to web), the index will be updated for you.

So if you are trying to trigger that on a web database index, the easiest way is to republish the item with all subitems. Sitecore will take care of the update automatically if configured.

If it's the master database index, you can try to programmatically crawl the content tree and issue an item save via the API. That should put an entry into the history table and thus trigger the search index update operation.

Hope this helps,
-alex

Sampath said...

Hello Alex,

This is a great article. I am pretty new to Sitecore and wanted to implement this on the website I am developing. However, after spending almost a day I have ended up nowhere and still cannot get this working. I would really appreciate it if any of you could help me out with this.

1) Do I need to install the Lucene Search module on the site for this to work? http://trac.sitecore.net/LuceneSearch

2) How do I configure the web.config to add the indexes as you mentioned in your example? Should I just copy the Sitecore.SharedSource.Search.config file from your example and somehow add a reference to it in my web.config? If yes, can you tell me how to do this?
Otherwise, can you tell me another way to get this working?

I would really appreciate it if you could help me out with this.

Sampath said...

Hello Alex,

I am trying to set this up for my new website. I am new to Sitecore, so I am not sure what I am doing wrong, and I have already spent a day trying to get this to work. Can you please help me here?

I have a couple of questions:

1) Do I need to install the Lucene Search module for this to work? http://trac.sitecore.net/LuceneSearch/

2) I have tried all the steps "K" mentioned in his comment, but I think I am doing something wrong with the web.config part. How do I add these sections to my site? I copied the Sitecore.SharedSource.Search.config file to my app_config folder; how do I now add a reference to it in my web.config?

I get this error when I try to browse to RebuildDatabaseCrawlers.aspx: Could not resolve type name: Sitecore.SharedSource.Search.Crawlers.AdvancedDatabaseCrawler,Sitecore.SharedSource.Search (method: Sitecore.Configuration.Factory.CreateType(XmlNode configNode, String[] parameters, Boolean assert)).

Please advise.

GW said...

Hi

I'd like to develop a geocoded/proximity 'find my nearest' type of search. I see that this is possible with Lucene.net directly, so I am assuming that it's possible via Sitecore also.

Can you provide any hints and tips on how to achieve this functionality?

Thanks

spradlinb said...

Just a quick question. In all the other trac modules I've been able to download code from a specific "download" link directly on the trac page... all I see on the Advanced Database Crawler page is a link to this documentation and an Index link. How do I download the crawler? Is there some other way to get code like this?

Thanks!

hdvti said...

Great video! Great information!

Martijn Bos said...

Hi Alex, I used the Advanced Database Crawler on my latest project. The demo pages provide a lot of help. I noticed that the language selection dropdown doesn't return results if the language is defined with a locale (e.g. nl-NL). Any idea why this is?

Alex Shyba said...

Hi Martijn,

I think you will need to publish the language definition item to the web database for the dropdown to pick it up. The demo pages use the web database as the context database by default.

-alex

Marcin Dzięgielewski said...

There is a bug in the FilteredFields method of AdvancedDatabaseCrawler.

    else if (HasFieldIncludes)
    {
        foreach (var includeFieldId in from p in FieldFilter where p.Value select p)
        {
            filteredFields.Add(item.Fields[ID.Parse(includeFieldId)]);
        }
    }

includeFieldId is a key/value object, not a string. Its Key should be used instead:

    filteredFields.Add(item.Fields[ID.Parse(includeFieldId.Key)]);

As it stands, an ID format exception is thrown when using field filtering.
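
The mix-up described above can be reproduced outside Sitecore. The following is a minimal, self-contained sketch, assuming FieldFilter behaves like a Dictionary<string, bool> keyed by field ID string (an assumption; the crawler's real declaration is not shown here). The class and method names are hypothetical, the GUIDs are arbitrary, and Guid.Parse stands in for Sitecore's ID.Parse.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FieldFilterSketch
{
    // Hypothetical stand-in for the crawler's FieldFilter:
    // field ID string -> true (include) or false (exclude).
    public static List<Guid> IncludedFieldIds(IDictionary<string, bool> fieldFilter)
    {
        // Enumerating a dictionary yields KeyValuePair<string, bool>,
        // so the ID string has to be read from .Key before parsing.
        return (from p in fieldFilter
                where p.Value
                select Guid.Parse(p.Key)).ToList();
    }

    public static void Main()
    {
        // Arbitrary example IDs, not real Sitecore field IDs.
        var fieldFilter = new Dictionary<string, bool>
        {
            { "{0A1B2C3D-1111-2222-3333-444455556666}", true },
            { "{0A1B2C3D-7777-8888-9999-AAAABBBBCCCC}", false }
        };

        foreach (var id in IncludedFieldIds(fieldFilter))
        {
            Console.WriteLine(id);
        }
    }
}
```

For context, KeyValuePair<string, bool>.ToString() produces text of the form [key, value], which is not a valid GUID string. That would be consistent with the format exception reported above if the pair ends up being converted to a string before parsing, though that part is a guess.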

Alex Shyba said...

Hi Marcin,

Thanks, I actually checked in the fix for this yesterday.

http://trac.sitecore.net/AdvancedDatabaseCrawler/browser/Branches/v2/

-alex

barygoodcoder said...

Hey Alex, I am a little confused about why you explicitly attach the _database field in your query parser. I built a custom media parser to parse PDFs, so I figured I would write the database to the _database field. However, during testing, if I rebuild the index from the Sitecore Index Viewer, SearchHelper.ContextDB reports "master" even though it is in fact crawling web, so none of my custom crawled entries are returned when I query. I'll just rebuild the crawlers through the web page, where "web" is reported as the right context; I am just wondering why you have it there.

andreasordell said...

Thank you Alex for great work!

I keep running in to a strange problem. Every now and then the indexing fails with the following error in log file:

Lucene.Net.Store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@c:\\data\indexes\\write.lock at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout)

Ever seen this behavior?

Bary said...

Okay, I am still curious about the database being added to the query, but my problem was not with that. I was adding custom fields with my PDF crawler. I had them camel-cased, and somewhere along the way they were lower-cased. Lucene field names are case-sensitive, it appears.
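
To illustrate this kind of mismatch without involving Lucene itself, here is a minimal sketch in which a plain dictionary with ordinal (case-sensitive) keys stands in for a document's field names. The field name "pdfTitle", the class, and the NormalizeFieldName helper are all hypothetical; lower-casing on both the write side and the read side is just one possible convention, not necessarily what the crawler does.

```csharp
using System;
using System.Collections.Generic;

public static class FieldNameCaseSketch
{
    // Hypothetical convention: always lower-case field names before use.
    public static string NormalizeFieldName(string name)
    {
        return name.ToLowerInvariant();
    }

    // Looks up a field by exact, case-sensitive name; returns null when absent.
    public static string GetField(IDictionary<string, string> doc, string name)
    {
        string value;
        return doc.TryGetValue(name, out value) ? value : null;
    }

    public static void Main()
    {
        // StringComparer.Ordinal keeps keys case-sensitive, standing in for index field names.
        var doc = new Dictionary<string, string>(StringComparer.Ordinal);
        doc[NormalizeFieldName("pdfTitle")] = "Annual report";

        // The camel-cased name misses; the normalized name matches.
        Console.WriteLine(GetField(doc, "pdfTitle") ?? "(no match)");
        Console.WriteLine(GetField(doc, NormalizeFieldName("pdfTitle")) ?? "(no match)");
    }
}
```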

christophe said...

Hi Alex

I just downloaded the files and tried to build Sitecore.SharedSource.Search.sln, and it throws some errors. It is looking for "LookupSourceSetField.cs", "ModfiedSystemTemplateField.cs" and "SuplicateFieldNamesField.cs", which cannot be found. It is looking for them in the directory "Branches\v2\Sitecore.SharedSource.SearchCrawler.DynamicFields\Templates" and, sure enough, they are not there. I have downloaded the code a couple of times. Any ideas?

Thanks

Christophe

Alex Shyba said...

Hi Christophe,

I've trimmed the solution a bit, specifically removed most of the dynamic fields as they are not critical. I left just a few to demonstrate how such functionality can be used.

Just tested the latest codebase, it should build fine.

-alex

Alex Shyba said...

Hi barygoodcoder,

The _database parameter was added in order to support multiple "locations" within single index. One location can be pointing to "master", another to "web", so you need to have a way to filter out the documents based on the context database. If you do not specify it within SearchParam, it will always default to either Context.ContentDatabase or Context.Database. In other words, it depends on from where you actually execute the code.
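
A rough sketch of that fallback, not the module's actual code: plain strings stand in for the database names, the class and method names are hypothetical, and the order (explicit value, then content database, then context database) is an assumption drawn from the description above.

```csharp
using System;

public static class DatabaseFallbackSketch
{
    // All three arguments are plain strings standing in for database names;
    // the real module works with Sitecore objects, not strings.
    public static string ResolveDatabaseName(string requested, string contentDatabase, string contextDatabase)
    {
        // An explicit value on the search parameters wins.
        if (!string.IsNullOrEmpty(requested))
        {
            return requested;
        }

        // Assumed order: content database first, then the context database.
        return contentDatabase ?? contextDatabase;
    }

    public static void Main()
    {
        Console.WriteLine(ResolveDatabaseName("web", "master", "core"));
        Console.WriteLine(ResolveDatabaseName(null, "master", "core"));
        Console.WriteLine(ResolveDatabaseName(null, null, "web"));
    }
}
```

Under these assumptions, code run without an explicit database from a place where the content database is "master" would filter on "master", while the same code run where only the context database is set to "web" would filter on "web".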

Hope it clarifies things a bit.

-alex

chaturangar said...

Hi Alex,
Your solution is great. We have used it on our system.
But, we are facing a small problem when trying to trim the index.
We set to false.
i.e. false

Then, we added a field to section.

i.e.

{7F019E60-7F78-4163-9388-282AF9918AFF}


Then, when we try to rebuild the index using indexViewer tool, it returns an error, saying given GUID format is wrong.

Do you have any advice on this ?

Thanks and Best Regards,
Chaturanga

Sean Holmesby said...

Hi Alex,
Great work with the module. I was wondering if there is a known issue with sorting? I have implemented this on Sitecore 6.5 (Update 3) and found that the sort-by-field option doesn't seem to work.

I tried using the same setup as Brian Pederson's 'Latest News' blog post (http://briancaos.wordpress.com/2011/10/13/get-latest-news-using-sitecore-advanceddatabasecrawler-lucene-index/), as well as the search demo page, and found that my items were always returned in the same order, no matter which field I sorted by.

I even got the same order with 'reverse' set to true or false.

Any ideas?

Cheers,
Sean

sumith said...

This is a great post and makes implementing search very simple. Thanks for your effort; it was very helpful.

I used this post along with:
the updated code: http://trac.sitecore.net/AdvancedDatabaseCrawler/browser/Branches/v2/
the posts at http://briancaos.wordpress.com/tag/advanceddatabasecrawler/
http://sitecoreug.org/events/Searching%20Sitecore%20with%20Lucene (video here: http://www.youtube.com/watch?feature=player_embedded&v=utGMKTG-r7U)

to get search up and working in a few hours.

A few questions are still not clear:
From aussie..: rate them as relevant in that order.
From James: except for when I do a FullTextQuery equal to something like "matt caine"

Can you provide examples/demo pages for search relevance and multi-keyword search, just like Google search operators?