FLOSS Moling with RavenDB
There is the FLOSS Mole data set, which provide a lot of interesting information about open source projects. As I am always interested in testing RavenDB with different data sets, I decided that this would be a great opportunity to do that, and get some additional information about how things are working as well.
The data is provided in a number of formats, but most of them aren’t really easy to access. SQL statements and raw text files that I assume to be tab separated, but I couldn’t really figure out quickly.
I decided that this would be a great example of actually migrating content from a SQL System to a RavenDB System. The first thing to do was to install MySQL, as that seems to be the easiest way to get the data out. (As a note, MySQL Workbench is really not what I would call nice.)
The data looks like this, this is the Google Code projects, and you can also see that a lot of the data is driven from the notion of a project.
I explored the data a bit, and I came to the conclusion that this is pretty simple stuff, overall. There are a few many to one associations, but all of them were capped (the max was 20 or so).
That meant, in turn, that we had a really simple work to do for the import process. I started by creating the actual model which we will use to save to RavenDB:
The rest was just a matter of reading from MySQL and writing to RavenDB. I chose to use Peta Poco for the SQL access, because it is the easiest. The following code sucks. It is written with the assumption that I know what the data sizes are, that the cost of making so many queries (roughly a 1,500,000 queries) is acceptable, etc.
using (var docStore = new DocumentStore { ConnectionStringName = "RavenDB" }.Initialize()) using (var db = new PetaPoco.Database("MySQL")) using (var bulk = docStore.BulkInsert()) { foreach (var prj in db.Query<dynamic>("select * from gc_projects").ToList()) { string name = prj.proj_name; bulk.Store(new Project { Name = name, CodeLicense = prj.code_license, CodeUrl = prj.code_url, ContentLicense = prj.content_license, ContentUrl = prj.content_url, Description = prj.project_description, Summary = prj.project_summary, Labels = db.Query<string>("select label from gc_project_labels where proj_name = @0", name) .ToList(), Blogs = db.Query<dynamic>("select * from gc_project_blogs where proj_name = @0", name) .Select(x => new Blog { Link = x.blog_link, Title = x.blog_title }) .ToList(), Groups = db.Query<dynamic>("select * from gc_project_groups where proj_name = @0", name) .Select(x => new Group { Name = x.group_name, Url = x.group_url }) .ToList(), Links = db.Query<dynamic>("select * from gc_project_links where proj_name = @0", name) .Select(x => new Link { Url = x.link, Title = x.link_title }) .ToList(), People = db.Query<dynamic>("select * from gc_project_people where proj_name = @0", name) .Select(x => new Person { Name = x.person_name, Role = x.role, UserId = x.user_id }) .ToList(), }); } }
But, it does the work, and it was simple to write. Using this code, I was able to insert 299,949 projects in just under 13 minutes. Most of the time went to making those 1.5 million queries to the db, by the way.
Everything is cool, and it is quite nice. On the next post, I’ll talk about why I wanted a new dataset. Don’t worry, it is going to be cool.
Comments
Agreed, the workbench app is bad. I only use the "legacy" admin and query apps.
You mentioned the many to one relationships were capped at 20. Out of curiosity, does raven support fixed length arrays on models?
Flukus, Yes, it does. Although the capping wasn't physical, it was just that the biggest collection they had was 20 items in size.
it would probably run faster if all the queries were under the same transaction: using (var scope = db.GetTransaction())
Comment preview