Lucene as a data repository


The issue of user-driven entity extensibility came up on the Castle users mailing list, and a very interesting discussion has started. The underlying problem is fairly well known: we want to let our end users extend the schema.

The scenarios that I usually think of are about extending the static schema of the application, like adding a CustomerExternalNumber field to the Customer entity, or adding a custom MyOwnEntity entity. This can be solved in a number of ways, from meta tables to a schema that looks like this:

I am usually suspicious of such methods, and would generally prefer to go with the option of simply extending the schema at runtime by adding additional tables for the user extensions.
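A minimal sketch of the "additional tables at runtime" option, using Python's built-in sqlite3 module as a stand-in for the real database. The `add_extension_table` helper, the `_ext` naming convention, and the allowed column types are assumptions made for illustration; they are not taken from the mailing list discussion.

```python
import re
import sqlite3

# Identifiers cannot be bound as SQL parameters, so validate them strictly.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_TYPES = {"text", "integer", "real"}


def add_extension_table(conn, entity, columns):
    """Create <entity>_ext, holding user-defined columns 1:1 with <entity>."""
    for name in (entity, *columns):
        if not _IDENT.match(name):
            raise ValueError(f"bad identifier: {name!r}")
    for sql_type in columns.values():
        if sql_type not in _TYPES:
            raise ValueError(f"unsupported type: {sql_type!r}")
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    conn.execute(
        f"create table {entity}_ext ("
        f"id integer primary key references {entity}(id), {cols})"
    )


conn = sqlite3.connect(":memory:")
conn.execute("create table Customer (id integer primary key, name text)")

# The end user asks for a CustomerExternalNumber field on Customer.
add_extension_table(conn, "Customer", {"CustomerExternalNumber": "text"})

conn.execute("insert into Customer values (1, 'Acme')")
conn.execute("insert into Customer_ext values (1, 'EXT-42')")
row = conn.execute(
    "select c.name, e.CustomerExternalNumber "
    "from Customer c join Customer_ext e on e.id = c.id"
).fetchone()
```

The static Customer table is left untouched; the user's fields live in a side table that is joined in when needed.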

The issue that came up on the list was quite different: the need was to extend each entity instance. Take bug tracking, for instance. We need to allow users to add different fields to each bug, each user can define their own fields, and we then need to allow searching on those extra fields.

Lucene came up as a way to store those extra fields, and then I had a light bulb moment. Lucene, by its nature, is a good place to store semi-structured data. The basic unit of storage in Lucene is the Document, and a document is composed of a set of fields, each of which can be indexed, stored, or both. Hibernate Search (and NHibernate Search) uses this ability to allow us to store entity information in Lucene, which means that we can retrieve information directly from Lucene, hitting the DB only for the missing information.
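To make the document model concrete, here is a toy in-memory sketch of "a document is a set of fields, which can be indexed, stored or both". It is not Lucene's API: the `Field`, `Document`, and `TinyIndex` classes are hypothetical stand-ins, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import defaultdict
from dataclasses import dataclass, field as dc_field


@dataclass
class Field:
    name: str
    value: str
    stored: bool = True    # the value can be read back from a hit
    indexed: bool = True   # the value can be searched


@dataclass
class Document:
    fields: list = dc_field(default_factory=list)

    def add(self, name, value, stored=True, indexed=True):
        self.fields.append(Field(name, value, stored, indexed))
        return self


class TinyIndex:
    """Toy stand-in for an index; illustrative only, not Lucene's real API."""

    def __init__(self):
        self._stored = []                  # doc id -> {field name: value}
        self._postings = defaultdict(set)  # (field name, term) -> doc ids

    def add(self, doc):
        doc_id = len(self._stored)
        self._stored.append({f.name: f.value for f in doc.fields if f.stored})
        for f in doc.fields:
            if f.indexed:
                for term in f.value.lower().split():
                    self._postings[(f.name, term)].add(doc_id)
        return doc_id

    def search(self, name, term):
        ids = sorted(self._postings.get((name, term.lower()), ()))
        return [self._stored[i] for i in ids]


index = TinyIndex()
# Two bugs with different extra fields: no shared schema is required.
index.add(Document().add("title", "Crash on save").add("customer", "Acme"))
index.add(
    Document()
    .add("title", "Slow search")
    .add("browser", "Firefox")
    .add("notes", "internal only", stored=True, indexed=False)
)

hits = index.search("title", "crash")
```

The `notes` field is stored but not indexed, so it comes back with a hit found through another field but cannot itself be searched.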

Extending this idea to also allow extra information in the Lucene store is a fairly natural extension, and extremely interesting to me. It means that I can give my users what they want (full extensibility) while keeping things very simple & clean from my point of view. Searching is built in, and easy enough that you can give the users the ability to do direct queries against that. In fact, you can even use NHibernate Search to allow even better scaling of the searching capabilities.

Reporting is also easy enough: you pull the data out and into your entities, and report off of that. If you want to do something more generic, it is very easy to turn the results of a Lucene query into a DataSet, which you can then hand to the reporting engine.
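The generic reporting path can be sketched as flattening hits with differing fields into one table, which is roughly what filling a DataSet would involve. The `to_table` helper and the sample hits below are hypothetical; a real implementation would read the stored fields from the search results.

```python
def to_table(docs):
    """Flatten dicts with differing keys into (columns, rows)."""
    columns = []
    for doc in docs:
        for name in doc:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    # Fields a document does not have become None (a null cell).
    rows = [[doc.get(name) for name in columns] for doc in docs]
    return columns, rows


# Stand-ins for the stored fields of two search hits.
hits = [
    {"id": "1", "title": "Crash on save", "customer": "Acme"},
    {"id": "2", "title": "Slow search", "browser": "Firefox"},
]
columns, rows = to_table(hits)
```

The column set is the union of all field names seen, so user-defined fields show up in the report without any schema change.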

Exciting idea.


Tags:

Development

Comments

16 Oct 2007
21:57 PM

Tuna Toksoz

Wow, great idea. I have to take a look at dotlucene (lucene.net). I am looking forward to trying it soon.

16 Oct 2007
22:29 PM

Rik Hemsley

This is indeed a great idea - I hadn't spotted the potential of Lucene for this kind of thing when I looked at it. Now I'm thinking of perhaps using it instead of full-text indexing for something...

16 Oct 2007
23:53 PM

Stefan Wenig

NHibernate can do that? Wow...

Is Lucene.net robust enough to be used as primary storage for data? I always think of full-text indexes as something that tends to get corrupted and needs to be rebuilt from a transactional data source, but maybe that's just from experience with - well, you get the idea.

Speaking of transactional sources, there's no transactional integrity between your RDBMS and your extensible data if you store the latter in Lucene.net, or is there?

17 Oct 2007
00:16 AM

Ayende Rahien

Stefan,

Yes, NH can do that.

There is no transactional integrity between the two, but the DB is the master, so that is fine.

I would want to run some tests before I would commit to making it the primary data source, and I would probably want to keep NH around as the primary and make this the extensible source instead.

17 Oct 2007
01:24 AM

Pete w

I'm in on this one. I was recently working up the idea for an extensible document repository system...

17 Oct 2007
04:16 AM

Tuna Toksoz

looking forward to the results.

dotlucene can be run on mssql, right?

17 Oct 2007
06:25 AM

krzysztof@kozmic.pl (Krzysztof Koźmic)

OK, but how about the performance of this approach? Would it not be overkill for a high-traffic website?

Krzysztof Koźmic

17 Oct 2007
06:35 AM

Ayende Rahien

Lucene is built to be very scalable; you can distribute it, etc.

17 Oct 2007
07:17 AM

Markus Zywitza

I cannot identify the need for such a solution currently. If I stick with the bug tracking example, I'd just use a bug table and a bugProperties table in a 1:n relation.

If you need a more generic solution, I propose a simple Property table using AR/NH "Any" to relate to the entities and a "HasMany" collection on the Entity side.

The value field must be varchar and serialized through an NH custom type. This might be too simple for complex object values, but sufficient for storing atomic information.

This allows me to search and index both property names and values.

17 Oct 2007
07:24 AM

Ayende Rahien

Markus,

Yes, that works for a simple scenario, but what happens when you have 100 entities, and you can add 5 fields to an instance?

This also loses you the ability to do things such as searching for date ranges, etc.

17 Oct 2007
07:45 AM

Markus Zywitza

Ayende,

I think you misunderstood me. My Property table would be like that:

create table Properties (
    id int primary key,
    entityType varchar(50) not null,
    entityId int not null,
    propName varchar(50) not null,
    -- Variant 1
    genericValue varchar(1000),
    -- Variant 2
    valueType char(2), -- selects one of the columns below
    stringValue varchar(1000),
    decimalValue decimal,
    dateValue datetime
    -- etc.
)

Variant 2 is an extension that allows using date ranges etc. Using sql_variant might be possible, but I didn't have a closer look at it yet.

17 Oct 2007
07:55 AM

Ayende Rahien

Markus,

Assume that I extend my entity to include StartDate, DueDate, CompletionDate.

Now I want all the bugs that started last year and weren't finished:

In Lucene it is something on the order of* "startdate:[20060101 TO 20070101] AND completiondate:null"

Now formulate it as a SQL query.

* Not sure if Lucene allows comparing terms, though, so I don't know if something like completiondate > duedate is possible.
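A minimal sketch of why dates written as 20060101 suit a term range query: as far as I know, classic Lucene range queries compare terms as plain strings, so a fixed-width, zero-padded yyyyMMdd form sorts the same way as the dates themselves. The code below imitates that comparison in plain Python rather than calling Lucene, and it models "completiondate:null" as the field simply being absent, which is an assumption. All names are hypothetical.

```python
from datetime import date


def date_term(d):
    """Encode a date as a fixed-width, lexicographically sortable term."""
    return d.strftime("%Y%m%d")


def in_range(term, low, high):
    # Inclusive range using plain string comparison, like [low TO high].
    return low <= term <= high


bugs = [
    {"id": 1, "startdate": date_term(date(2006, 3, 14))},
    {"id": 2, "startdate": date_term(date(2006, 11, 2)),
     "completiondate": date_term(date(2007, 1, 20))},
    {"id": 3, "startdate": date_term(date(2007, 5, 9))},
]

# Bugs started in the range and never completed (no completiondate field).
open_from_2006 = [
    bug["id"]
    for bug in bugs
    if in_range(bug["startdate"], "20060101", "20070101")
    and "completiondate" not in bug
]
```

The zero padding matters: without it, string order and date order would disagree.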

17 Oct 2007
08:37 AM

Markus Zywitza

select * from bug
where
    id in (
        -- subselect 1
        select entityId from Properties
        where
            entityType = 'bug' and
            propName = 'startdate' and
            valueType = 'dt' and -- DateTime
            dateValue >= '2006-01-01' and
            dateValue < '2007-01-01')
    and id not in (
        -- subselect 2
        select entityId from Properties
        where
            entityType = 'bug' and
            propName = 'completiondate'
        -- end of subselect 2
    )

Each of the expressions translates to a subquery. A null value means that it's simply not stored as a property. Comparing terms is possible; that translates to a nested subquery.

Yes, it is complex and not performant, but it allows me to store all my data in a single database, which means far fewer headaches in administration.

This model can also be extended to allow extensions by type, if propName is replaced by a reference to another table that defines the possible extensions per entityType.
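For anyone who wants to try the property-table approach above, here is a small runnable sketch using Python's built-in sqlite3 module. SQLite stands in for the real database, dates are stored as ISO-8601 text (an assumption, since SQLite has no datetime column type), the table is named Properties as in the earlier DDL, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bug (id integer primary key, title text not null);
create table Properties (
    id integer primary key,
    entityType text not null,
    entityId integer not null,
    propName text not null,
    valueType text,
    stringValue text,
    decimalValue numeric,
    dateValue text  -- ISO-8601 text, so string comparison follows date order
);
""")

conn.executemany(
    "insert into bug (id, title) values (?, ?)",
    [(1, "Crash on save"), (2, "Slow search"), (3, "Typo in footer")],
)
conn.executemany(
    "insert into Properties (entityType, entityId, propName, valueType, dateValue) "
    "values ('bug', ?, ?, 'dt', ?)",
    [
        (1, "startdate", "2006-03-14"),
        (2, "startdate", "2006-11-02"),
        (2, "completiondate", "2007-01-20"),
        (3, "startdate", "2007-05-09"),
    ],
)

# Bugs started in 2006 that have no completiondate property at all.
query = """
select id from bug
where id in (
    select entityId from Properties
    where entityType = 'bug' and propName = 'startdate'
      and valueType = 'dt'
      and dateValue >= '2006-01-01' and dateValue < '2007-01-01')
and id not in (
    select entityId from Properties
    where entityType = 'bug' and propName = 'completiondate')
order by id
"""
open_bugs = [row[0] for row in conn.execute(query)]
```

Each extra condition adds another subquery against the same table, which is the complexity cost acknowledged above.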

17 Oct 2007
09:22 AM

Stefan Wenig

As for transactional integrity, though, I was talking about the primary data source, not the source of primary data. So if you have data stored in Lucene ONLY, the Lucene repository had better never need to be rebuilt. If you make a change to both primary (fixed-schema) and secondary (dynamic) data, there's no guarantee that the change is atomic, so, in a large-scale environment, inconsistencies will happen.

I'm excited to hear that NH integrates with Lucene.net though. Have to look into that one soon.

PS: I sent you an email early last week, and tried again on monday. Could you check your spam-folder? Thanks!

17 Oct 2007
09:45 AM

Dan Bunea

Hi all,

Now that I revealed a secret we have been using for a while, I'd like to make a very small contribution to the castle project, with a small project called ActiveDocument.

I opened the discussion about it first in: http://groups.google.com/group/castle-project-users/browse_thread/thread/d73e1d00ee9d7fe4/#, where you, Ayende, asked me:

Dan,

where are you keeping the data, then?

The answer was revealed last night, with the post about

Basically, I built a few classes on top of Lucene.Net, which can do:

dynamic properties and search without problems

[Test]
public void Save()
{
    ActiveDocument product = new ActiveDocument("Product");
    product["Name"] = "CMS20";
    product.Save();

    ActiveDocument product2 = new ActiveDocument("Product");
    product2["Name"] = "Taia";
    product2["Category"] = "Software innovation";
    product2.Save();

    ActiveDocument[] allSoftware = ActiveDocument.Query("Category:Software*");
    Assert.AreEqual(1, allSoftware.Length);
    Assert.AreEqual("Taia", allSoftware[0]["Name"]);
    Assert.AreEqual("Product", allSoftware[0]["type"]);
}

relations:

[Test]
public void TestManyToManyRelations()
{
    ActiveDocument category = new ActiveDocument("Category");
    category["Name"] = "Software";
    category.Save();

    ActiveDocument category2 = new ActiveDocument("Category");
    category2["Name"] = "Sad and Cheap";
    category2.Save();

    ActiveDocument pf = new ActiveDocument("ProductFamily");
    pf["Name"] = "nada";
    pf.Save();

    ActiveDocument[] allCateg = ActiveDocument.Query("type:Category");
    Assert.AreEqual(2, allCateg.Length);

    ActiveDocument product2 = new ActiveDocument("Product");
    product2["Name"] = "Taia";
    product2.AddRelated("Categories", category);
    product2.AddRelated("Categories", category2);
    product2.AddRelated("ProductFamilies", pf);
    product2.Create();

    ActiveDocument[] relatedCategories = product2.FindRelated("Categories");
    Assert.AreEqual(2, relatedCategories.Length);
    ActiveDocument[] relatedPF = product2.FindRelated("ProductFamilies");
    Assert.AreEqual(1, relatedPF.Length);

    // and after it is loaded
    ActiveDocument productAfter = ActiveDocument.Find(product2["id"]);
    relatedCategories = productAfter.FindRelated("Categories");
    Assert.AreEqual(2, relatedCategories.Length);
    relatedPF = productAfter.FindRelated("ProductFamilies");
    Assert.AreEqual(1, relatedPF.Length);
}

It also has internationalisation, multiple-value fields (like tags), and sorting, and the next versions will probably have validation and customizations (maybe with the PostSharp AOP engine, http://www.postsharp.org/).

At the moment the code is a little too specific to our CMS (http://www.eptala.ro/tb.htm), but in the next few days I will publish the code for everyone to test and see.

Thanks,

17 Oct 2007
09:46 AM

Edwin de Jonge

Great idea, I had a similar idea some years ago:

In a research project a couple of years ago, we used Lucene as data storage for Topic Maps (XTM) (semantic networks).

It worked very well, precisely because of the flexibility of Lucene: Topic maps can define topic types, which we stored as different fields in Lucene.

17 Oct 2007
14:47 PM

pete w

Dan, this is really cool stuff. I still have some reservations...

As I understand it, CouchDB is recommended for the persistence of semi-structured data, but it isn't a relational database. Using CouchDB to store semi-structured objects smells like an anti-pattern...

Perusing the FAQs on the CouchDB website confirmed this: it was not designed as an OO persistence layer.

Nonetheless, the desire for persistence of semi-structured objects remains...

I'm trying to find out why you would be interested in using Lucene as a persistence layer and what advantage this would have over CouchDB.

17 Oct 2007
15:35 PM

Tuna Toksoz

Well, I may use an OODBMS for such a complex scenario, but the problem with an OODBMS is that it can be very, very slow.

18 Oct 2007
12:54 PM

Chris Ortman

I was toying with this idea about 4 months ago. My biggest concern was around actually updating the index and keeping that fast / scalable. Lucene is great for reads, not so sure about writes.

I think what you're really talking about building, though, is CouchDB.

22 Oct 2007
10:15 AM

Dan Bunea

Hi,

I've published the source code at: http://danbunea.blogspot.com/2007/10/lucene-indexes-as-agile-databases.html

Thanks,

Dan

PS: CouchDB is only JavaScript. What if I need a desktop app?


Oren Eini

CEO of RavenDB