Monday 25 January 2016

Crate Snow Sprint: Day 0 (I Need Help)

Now for something a little different: Crate. Thanks to the kind sponsorship of Crate.io, I am attending the annual Snow Sprint. This is an event that has been in existence for many years (certainly over a decade); originally it was a get-together for Zope/Plone developers. Those of you with very long memories might remember that I used to be part of the Plone community and even worked for Zope Europe Association at one point. The Plone Snow Sprint is still a "thing". But, with former Zope/Plone developers involved in the company/community, the Crate Snow Sprint is also a "thing".

Getting To Know Crate

Crate is a high-performance, distributed database. It is very easy to deploy (think: "the database storage for Docker") and designed to manage very large datasets (Clusters of 100s of nodes? Why not!? Petabytes of data? Hells yeah! Hundreds of billions of rows? Come get some.)

From the marketing blurb:

Crate has been designed to be a highly distributed high performing database. Before Crate organizations had to compromise on performance if they wanted to keep the ease of use benefits of using SQL stores, or move to a No-SQL store and deal with the complexities of the query languages and rewriting their code. With Crate you get the best of both worlds: the No-SQL performance you require and the SQL syntax you want. Crate can be used for a variety of use cases, from a classic but scalable SQL database, to advanced usage incorporating full text search, geo shape and analytics support.

Big Data, of course, means we are not talking "classic" SQL. There is no foreign key support, for example; distributed joins simply do not scale well. Instead, all related data just gets fired into the same table. What you lose in space efficiency you gain in speed. Lots of speed.
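To make the denormalisation idea concrete, here is a minimal sketch in Python. Instead of a "projects" table referenced by foreign key, every commit row carries the project metadata it needs. All field names here are hypothetical, not the schema of any real system:

```python
# Flatten a project record and its commits into self-contained rows,
# repeating the project metadata per commit instead of joining on a
# foreign key. Wasteful on space, cheap to query at scale.

def denormalise(project, commits):
    """Build one self-contained row per commit."""
    return [
        {
            "project_name": project["name"],
            "project_url": project["url"],
            "commit_sha": c["sha"],
            "author": c["author"],
            "timestamp": c["timestamp"],
        }
        for c in commits
    ]

project = {"name": "demo", "url": "https://example.org/demo.git"}
commits = [
    {"sha": "a1b2c3", "author": "paul", "timestamp": "2016-01-24T10:00:00"},
    {"sha": "d4e5f6", "author": "paul", "timestamp": "2016-01-25T09:30:00"},
]
rows = denormalise(project, commits)
# Every row now answers queries on its own, no join required.
```

Each row can then be fired straight into a single Crate table.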

Why Do You Care About Crate, Paul?

The Big Data space is full of incredible technology and is evolving fast. With the growth in Big Data we have seen a growth in technologies to enable it; containers are a huge part of this. If you are not clued-in on containers, go take a look at what Wikipedia has to say about them. Until now containerisation (that's a word, right?) has really been focused on application deployment. Crate is a prime contender as the storage layer for containers, allowing extremely easy deployments in order to grow clusters arbitrarily. As I said, rolling out hundreds of nodes in a Crate deployment is really not hard to do.

So why do I care? Who here remembers the SQO-OSS project? The premise of this project was simple: create a system to measure the quality of a piece of Free Software, where "quality" was defined by any arbitrary collection of metrics by the user. The system maintained clones of SVN repositories and regularly ensured they were up-to-date. Metric scripts would then be run against these repositories and an SQL store would keep the results which could then be viewed through a web client.

This project, like almost every EC-funded project I have ever worked on, was a successful failure. We built exactly what we intended to build and helped develop knowledge on software quality (metrics), as an extraordinary number of publications attest. I say "failure", however, because the architecture of the system we built was heavily tied to the hardware purchased for the project (i.e. one seriously fat SPARC server from Sun). Arbitrary scaling of the datastore, processing and front-end? Never taken into consideration. Ironically, the backbone of the tool we developed was Equinox, which could have enabled such an architecture. We really did not make the best use of that technology, however.

The... result.... was....... very........... slow.

Why do I care about Crate? Because I want to fix ^ that problem. In short: I want to create an arbitrarily scalable solution for metric processing, result caching and visualisation, delivering results in a timely fashion.

So What Do You Have In Mind, Paul?

I want to create a solution that is arbitrarily scalable to the needs of those using it. To that end I envisage four discrete components that can be scaled according to need:

  • Front-end nodes using Python + Flask.

REST and web front-ends for data input, retrieval and visualisation. A typical deployment will not need many of these, I guess. However, a public deployment of the system for a popular Free Software project might well need extra oomph here. In the case of data retrieval, data will be grabbed from the results cache or, if the result is not available yet, a metric processing node will be triggered.

  • Data entry nodes.

These would simply be responsible for gathering the metadata from repositories (probably just git, at first) and entering it into the (potentially very large, but hardly "Big Data") table in the data store. Python + SQLAlchemy here. These would need to be scaled up with the number of projects being analysed.

  • Metric processing nodes.

These will be scripts that process data from the metadata store and stash the result in the results cache. I mostly envisage Python for basic metrics, or C/C++ libraries for anything heavyweight, with dispy for process distribution.

  • Storage nodes.

Err... Crate.
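To show how these four components might fit together, here is a minimal sketch of the front-end behaviour described above: serve a metric from the results cache if it is there, otherwise trigger a metric processing node and tell the caller to come back. This is a toy, not the real system: the dict stands in for the Crate results table, the list stands in for dispatch to metric nodes, and all names are hypothetical.

```python
# Toy model of a front-end node's cache-or-trigger logic.

results_cache = {}   # stand-in for the Crate results table
pending_jobs = []    # stand-in for dispatch to metric processing nodes

def get_metric(project, metric):
    """Return a cached result, or queue the computation and say 'pending'."""
    key = (project, metric)
    if key in results_cache:
        return {"status": "ok", "value": results_cache[key]}
    pending_jobs.append(key)        # a metric processing node picks this up
    return {"status": "pending"}    # client polls again later

# First request misses the cache and queues a job...
first = get_metric("demo", "daily_commit_count")
# ...a metric node later writes the result back to the cache...
results_cache[("demo", "daily_commit_count")] = 42
# ...and the next request is served straight from the cache.
second = get_metric("demo", "daily_commit_count")
```

In the real thing the front-end would be a Flask view and the trigger would go over the network, but the split of responsibilities is the same.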

Where Am I At?

Well, it has been a long time since I did any "real" Python programming. Or any programming, for that matter. So I am starting from the very beginning and getting my head around the technology stack that I am envisioning.

As part of my work at the Snow Sprint, yesterday I started work on basic code for data entry using Python and SQLAlchemy, which works very nicely with Crate. The fruits of my labour can be found on GitHub (promise not to laugh!). By the end of today (Day 1) I hope to have a nice pipeline: git repo -> metadata -> data entry node -> Crate. Then tomorrow I will implement a basic metric (daily commit count?) and my work for the Snow Sprint will be done. There is no documentation tucked away in there yet; I will sort that out sometime after the sprint.
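The daily commit count metric mentioned above is simple enough to sketch now: group commit timestamps by calendar day. In the real pipeline the timestamps would be pulled out of the Crate metadata table; here they are inline ISO strings just to show the shape of the computation.

```python
# Sketch of a "daily commit count" metric: commits per calendar day.
from collections import Counter
from datetime import datetime

def daily_commit_count(timestamps):
    """Count commits per calendar day from ISO-8601 timestamp strings."""
    days = (datetime.strptime(t, "%Y-%m-%dT%H:%M:%S").date() for t in timestamps)
    return Counter(day.isoformat() for day in days)

counts = daily_commit_count([
    "2016-01-24T10:00:00",
    "2016-01-24T15:30:00",
    "2016-01-25T09:00:00",
])
# counts == Counter({"2016-01-24": 2, "2016-01-25": 1})
```

A metric processing node would run something like this and stash `counts` in the results cache for the front-end to pick up.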

For me the next step will be to really nail down the basic architecture, ensuring ease of deployment and scaling. Once I have my head around that, building the actual v0.1 system will not be too hard, I think. At least not for very simple metrics.

Call For Help

At the moment the system I am building is nothing more than an interesting demo for Crate (it really showcases simplicity rather than, say, scalability, since this will never be a real Big Data application). However, as someone who cares about software/community/developer metrics, I would love to see this mini-project turn into something real. If you have an interest in helping me develop this into a real tool, something that developers could really benefit from, I would be very happy to hear from you and get this thing moving. I have not picked a license yet, because I want to engage the whole project team in that discussion, if I manage to grow one! :) But, most definitely, it will be Free Software.

If you have an interest in metrics and would like to talk to me about this project, feel free to reach me by any means sensible (all my social media/email links are at the top of the page). Alternatively, if you are there, feel free to grab me for a chat at either the FLOSS Community Metrics Meeting or FOSDEM later this week.

Colophon:

The header image on this page is a photo I took during the setup of the snow sprint. As you might imagine, it is not enough (or sensible) for us to make use of the router provided in the chalet. Lots of cabling everywhere for our own network, of course! I will post more photos later.
