Lucene as a data repository
The issue of user-driven entity extensibility came up in the Castle users mailing list, and a very interesting discussion has started. The underlying problem is fairly well known: we want to allow our end users to extend the schema.
The scenarios that I usually think of are about extending the static schema of the application, such as adding a CustomerExternalNumber field to the Customer entity, or adding a custom MyOwnEntity entity. This can be solved in a number of ways, from meta tables a schema that looks like this:
I am usually suspicious of such methods, and would generally prefer to go with the option of simply extending the schema at runtime by adding additional tables for the user extensions.
The issue that came up on the list was quite different: the need was to extend each entity instance. Take bug tracking, for instance. We need to allow the user to add different fields to each bug. Then we need to allow searching on those extra fields, and each user can define their own fields.
Lucene came up as a way to store those extra fields, and then I had a light bulb moment. Lucene, by its nature, is a good place to store semi-structured data. The basic unit of storage in Lucene is the Document, and a document is composed of a set of fields, each of which can be indexed, stored, or both. Hibernate Search (and NHibernate Search) uses this ability to allow us to store entity information in Lucene, which means that we can retrieve information directly from Lucene, hitting the DB only for the missing information.
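To make the indexed/stored distinction concrete, here is a minimal sketch in Python. It is not Lucene's API: ToyIndex and Field are hypothetical names invented for this illustration, and the "index" is just an in-memory dictionary doing exact-match lookups. It only shows that each document carries its own set of fields, and that a field can be searchable, retrievable, or both.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a Lucene field, not the real class.
    name: str
    value: str
    indexed: bool = True   # searchable through the inverted index
    stored: bool = True    # returned with the document on retrieval


class ToyIndex:
    """In-memory stand-in for a Lucene index (illustration only)."""

    def __init__(self):
        self._stored = []                   # doc id -> {name: value} of stored fields
        self._postings = defaultdict(set)   # (name, value) -> set of doc ids

    def add(self, fields):
        doc_id = len(self._stored)
        self._stored.append({f.name: f.value for f in fields if f.stored})
        for f in fields:
            if f.indexed:
                self._postings[(f.name, f.value.lower())].add(doc_id)
        return doc_id

    def search(self, name, value):
        # Exact match on one field; real Lucene analyzes and tokenizes text.
        ids = sorted(self._postings.get((name, value.lower()), ()))
        return [self._stored[i] for i in ids]


index = ToyIndex()
index.add([
    Field("id", "1", indexed=False),           # stored only
    Field("title", "Crash on save"),
    Field("browser", "Firefox"),               # extra field only this bug has
])
index.add([
    Field("id", "2", indexed=False),
    Field("title", "Typo in footer"),
    Field("customer", "Acme", stored=False),   # searchable, but not returned
])

firefox_bugs = index.search("browser", "firefox")
acme_bugs = index.search("customer", "Acme")
```

The two bugs have different field sets, and nothing had to be declared up front, which is the property that makes this attractive for per-instance extensions.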
Extending this idea to also allow extra information in the Lucene store is a fairly natural extension, and extremely interesting to me. It means that I can give my users what they want (full extensibility) while keeping things very simple & clean from my point of view. Searching is built in, and easy enough that you can give the users the ability to do direct queries against that. In fact, you can even use NHibernate Search to allow even better scaling of the searching capabilities.
Reporting is also easy enough: you pull the data out and into your entities, and report off of that. If you want to do something more generic, it is very easy to turn the results of a Lucene query into a DataSet, which you can then hand to the reporting engine.
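The generic reporting path boils down to flattening documents with different field sets into one table. A rough sketch of that step in Python follows; to_table is a hypothetical helper, the hits are hard-coded dictionaries standing in for search results, and a real version would fill a DataSet rather than plain lists.

```python
def to_table(documents):
    """Flatten documents with differing fields into (columns, rows)."""
    columns = []
    for doc in documents:
        for name in doc:
            if name not in columns:
                columns.append(name)   # keep first-seen column order
    # Fields a document does not have become None (a null cell).
    rows = [[doc.get(name) for name in columns] for doc in documents]
    return columns, rows


# Stand-ins for search hits; each hit has its own extra fields.
hits = [
    {"id": "1", "title": "Crash on save", "browser": "Firefox"},
    {"id": "2", "title": "Typo in footer", "severity": "low"},
]

columns, rows = to_table(hits)
```

The column list is the union of all field names seen, so the reporting engine gets a regular table even though no two documents need to agree on their fields.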
Exciting idea.
Comments
Wow, great idea. I have to take a look at dotlucene(lucene.net). I am looking forward to trying it soon.
This is indeed a great idea - I hadn't spotted the potential of Lucene for this kind of thing when I looked at it. Now I'm thinking of perhaps using it instead of full-text indexing for something...
NHibernate can do that? Wow...
Is Lucene.net robust enough to be used as a primary storage for data? I always think of full text indexes as something that tends to get corrupted and needs to be rebuilt from a transactional data source, but maybe that's just from experience with - well, you get the idea.
Speaking of transactional sources, there's no transactional integrity between your RDBMS and your extensible data if you store the latter in Lucene.net, or is there?
Stefan,
Yes, NH can do that.
There is no transactional integrity between the two, but the DB is the master, so that is fine.
I would want to run some tests before I would commit to making it the primary data source, and I would probably want to keep NH around as the primary and make this the extensible source instead.
I'm in on this one. I was recently working up the idea for an extensible document repository system...
looking forward to the results.
dotlucene can be run on mssql, right?
Ok, but what about the performance of this approach? For a high-traffic website, would it not be overkill?
Krzysztof Koźmic,
Lucene is built to be very scalable, you can distribute it etc.
I cannot identify the need for such a solution currently. If I stick with the bug tracking example, I'd just use a bug table and a bugProperties table in a 1:n relation.
If you need a more generic solution, I propose a simple Property table using AR/NH "Any" to relate to the entities and a "HasMany" collection on the Entity side.
The value field must be varchar and serialized through an NH custom type. This might be too simple for complex object values, but sufficient for storing atomic information.
This allows me to search and index both property names and values.
Markus,
Yes, that works for a simple scenario, but what happens when you have 100 entities, and you can add 5 fields to an instance?
You also lose the ability to do such things as search for date ranges, etc.
Ayende,
I think you misunderstood me. My Property table would look like this:
create table Properties (
    id int primary key,
    entityType varchar(50) not null,
    entityId int not null,
    propName varchar(50) not null,
    -- Variant 1
    genericValue varchar(1000),
    -- Variant 2
    valueType char(2), -- selects one of the columns below
    stringValue varchar(1000),
    decimalValue decimal,
    dateValue datetime
    -- etc.
)
Variant 2 is an extension that allows using date ranges etc. Using sql_variant might be possible, but I didn't have a closer look at it yet.
Markus,
Assume that I extend my entity to include StartDate, DueDate, CompletionDate.
Now I want all the bugs that started last year and weren't finished:
In Lucene it is something on the order of "startdate:[20060101 TO 20070101] AND completiondate:null"
Now formulate it as a SQL query.
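The range clause in the Lucene query above works on text terms, which is why the dates are written as fixed-width yyyymmdd strings: for that format, string order and chronological order agree. A small Python sketch of the idea, under stated assumptions: to_term and in_range are hypothetical helpers, the bugs are made-up sample data, and a missing completiondate is modelled simply as an absent field rather than a literal "null" term.

```python
from datetime import date


def to_term(d):
    """Encode a date as a fixed-width yyyymmdd string."""
    return d.strftime("%Y%m%d")


def in_range(term, lower, upper):
    """Inclusive range check on string terms, like field:[lower TO upper]."""
    return lower <= term <= upper


# Made-up sample data: each bug only carries the fields it actually has.
bugs = [
    {"id": 1, "startdate": to_term(date(2005, 12, 31))},
    {"id": 2, "startdate": to_term(date(2006, 3, 15)),
     "completiondate": to_term(date(2006, 4, 1))},
    {"id": 3, "startdate": to_term(date(2006, 11, 2))},
    {"id": 4, "startdate": to_term(date(2007, 1, 2))},
]

# Started within the range and never completed.
open_2006 = [
    b["id"] for b in bugs
    if in_range(b["startdate"], "20060101", "20070101")
    and "completiondate" not in b
]
```

The comparison is plain string comparison throughout, so the same trick fails for formats such as d/m/yyyy, where the text order no longer matches the calendar.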
select * from bug
where
    id in (
        -- subselect 1
        select entityId from Properties
        where
            entityType = 'bug' and
            propName = 'startdate' and
            valueType = 'dt' and -- DateTime
            dateValue >= '2006-01-01' and
            dateValue < '2007-01-01')
    and id not in (
        -- subselect 2
        select entityId from Properties
        where
            entityType = 'bug' and
            propName = 'completiondate'
        -- end of subselect 2
    )
Each of the expressions translates to a subquery. A null value means that it's simply not stored as a property. Comparing terms is possible; that translates to a nested subquery.
Yes, it is complex and not performant, but it allows me to store all my data in a single database, which means far fewer headaches in administration.
This model can also be extended to allow extensions by type, if propName is replaced by a reference to another table that defines the possible extensions per entityType.
As for transactional integrity though, I was talking about the primary data source, not the source of primary data. So if you have data stored in Lucene ONLY, the Lucene repository had better never need to be rebuilt. If you make a change to both primary (fixed-schema) and secondary (dynamic) data, there's no guarantee that the change is atomic, so, in a large-scale environment, inconsistencies will happen.
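For anyone who wants to try the property-table query above, here is a rough, runnable sketch using Python's built-in sqlite3 module. The assumptions are mine: SQLite instead of the original database, only the Variant 2 columns, dates kept as ISO yyyy-mm-dd text, and three made-up bugs as sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table bug (id integer primary key, title text not null);
    create table Properties (
        id integer primary key,
        entityType text not null,
        entityId integer not null,
        propName text not null,
        valueType text,
        stringValue text,
        decimalValue numeric,
        dateValue text   -- ISO yyyy-mm-dd strings, so text comparison works
    );
    insert into bug (id, title) values
        (1, 'started 2005'),
        (2, 'started and finished 2006'),
        (3, 'started 2006, still open');
    insert into Properties (entityType, entityId, propName, valueType, dateValue) values
        ('bug', 1, 'startdate', 'dt', '2005-06-01'),
        ('bug', 2, 'startdate', 'dt', '2006-03-15'),
        ('bug', 2, 'completiondate', 'dt', '2006-04-01'),
        ('bug', 3, 'startdate', 'dt', '2006-11-02');
""")

# Bugs started in 2006 that have no completiondate property at all.
open_bugs = conn.execute("""
    select id, title from bug
    where id in (
        select entityId from Properties
        where entityType = 'bug' and propName = 'startdate' and valueType = 'dt'
          and dateValue >= '2006-01-01' and dateValue < '2007-01-01')
      and id not in (
        select entityId from Properties
        where entityType = 'bug' and propName = 'completiondate')
    order by id
""").fetchall()
```

Each extra condition on a dynamic field adds another subquery against the same table, which is where the complexity mentioned above comes from.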
I'm excited to hear that NH integrates with Lucene.net though. Have to look into that one soon.
PS: I sent you an email early last week, and tried again on Monday. Could you check your spam folder? Thanks!
Hi all,
Now that I have revealed a secret we have been using for a while, I'd like to make a very small contribution to the Castle project, with a small project called ActiveDocument.
I opened the discussion about it first in: http://groups.google.com/group/castle-project-users/browse_thread/thread/d73e1d00ee9d7fe4/#, where you, Ayende, asked me:
Dan,
where are you keeping the data, then?
The answer was revealed last night, with the post about
Basically, I built a few classes on top of Lucene.Net, which can do:
[Test]
public void Save()
{
    ActiveDocument product = new ActiveDocument("Product");
    product["Name"] = "CMS20";
    product.Save();

    ActiveDocument product2 = new ActiveDocument("Product");
    product2["Name"] = "Taia";
    product2["Category"] = "Software innovation";
    product2.Save();

    ActiveDocument[] allSoftware = ActiveDocument.Query("Category:Software*");
    Assert.AreEqual(1, allSoftware.Length);
    Assert.AreEqual("Taia", allSoftware[0]["Name"]);
    Assert.AreEqual("Product", allSoftware[0]["type"]);
}

[Test]
public void TestManyToManyRelations()
{
    ActiveDocument category = new ActiveDocument("Category");
    category["Name"] = "Software";
    category.Save();

    ActiveDocument category2 = new ActiveDocument("Category");
    category2["Name"] = "Sad and Cheap";
    category2.Save();

    ActiveDocument pf = new ActiveDocument("ProductFamily");
    pf["Name"] = "nada";
    pf.Save();

    ActiveDocument[] allCateg = ActiveDocument.Query("type:Category");
    Assert.AreEqual(2, allCateg.Length);

    ActiveDocument product2 = new ActiveDocument("Product");
    product2["Name"] = "Taia";
    product2.AddRelated("Categories", category);
    product2.AddRelated("Categories", category2);
    product2.AddRelated("ProductFamilies", pf);
    product2.Create();

    ActiveDocument[] relatedCategories = product2.FindRelated("Categories");
    Assert.AreEqual(2, relatedCategories.Length);
    ActiveDocument[] relatedPF = product2.FindRelated("ProductFamilies");
    Assert.AreEqual(1, relatedPF.Length);

    // and after it is loaded
    ActiveDocument productAfter = ActiveDocument.Find(product2["id"]);
    relatedCategories = productAfter.FindRelated("Categories");
    Assert.AreEqual(2, relatedCategories.Length);
    relatedPF = productAfter.FindRelated("ProductFamilies");
    Assert.AreEqual(1, relatedPF.Length);
}
It also has internationalisation, multiple value fields (like tags), and sorting, and the next versions will probably add validation and customizations (maybe with the PostSharp AOP engine, http://www.postsharp.org/).
At the moment the code is a little too specific to our CMS: http://www.eptala.ro/tb.htm but in the next few days I will publish the code for everyone to test and see.
Thanks,
Great idea, I had a similar idea some years ago:
In a research project a couple of years ago, we used Lucene as data storage for Topic Maps (XTM) (semantic networks).
It worked very well, precisely because of the flexibility of Lucene: Topic maps can define topic types, which we stored as different fields in Lucene.
Dan, this is really cool stuff. I still have some reservations...
As I understand it, CouchDB is recommended for the persistence of semi-structured data, but it isn't a relational database. Using CouchDB to store semi-structured objects smells like an anti-pattern...
Perusing the FAQs on the CouchDB website confirmed this: it was not designed as an OO persistence layer.
Nonetheless, the desire for persistence of semi-structured objects remains...
I'm trying to find out why you would be interested in using Lucene as a persistence layer and what advantage this would have over CouchDB.
Well, I might use an OODBMS for such a complex scenario, but the problem with an OODBMS is that it can be very, very slow.
I was toying with this idea about 4 months ago. My biggest concern was around actually updating the index and keeping that fast / scalable. Lucene is great for reads, not so sure about writes.
I think what you're really talking about building, though, is CouchDB.
Hi,
I've published the source code at: http://danbunea.blogspot.com/2007/10/lucene-indexes-as-agile-databases.html
Thanks,
Dan
PS: CouchDB is only JavaScript. What if I need a desktop app?