Tuesday, 26 January 2016

Crate Snow Sprint: Day 1 (Stashing Git Metadata)

[If you haven't done so already, take a quick look at what I wrote yesterday about the Crate Snow Sprint].

So yesterday was the first "real" day of the Snow Sprint. I used the time to start implementing a very rudimentary metric processing tool. To recap: I want to build a service-oriented system with arbitrarily scalable components. In the cloud. (Bingo!). Whilst this is largely a demonstration of Crate and an opportunity for me to learn about that technology, it is a serious project that I would like to see grow into something meaningful.

So here I write about the nonsense code I wrote yesterday and what I have managed to achieve in prototyping my intended solution.

Bitten By Python

Many moons have passed since I was last paid to program in Python. The language has changed in some crucial ways since then, but not so much that it is hard to update your thinking from Python 2. I always find Python a real joy to work with, so updating my own thinking really was not a chore.

As well as refamiliarising myself with Python, my aim for yesterday was to get my head around the basics of certain technologies I was planning to use as part of this project, namely Flask and SQLAlchemy.

What Am I Building?

If you have yet to go back and read yesterday's blog post, now is a good time. Yesterday I envisaged a system with 4 distinct component services, each independently scalable:

  • Crate: Used as storage for Git metadata (who committed, what, when and in which project) and metric results (which metric gave what result, for which project and at what time). Depending on the number of projects and metrics we could potentially get close to "big data" territory. The reason for opting for Crate is the scaling: a system like this is going to be inserting/retrieving data concurrently in great volume. The clustering of Crate will really help with this.
  • Metric Services: Invoked according to how often they need to be run (typically hourly, daily or weekly). These services will grab metadata from Crate in order to run a metric. Simple metrics could be implemented in Python or, for more computationally intensive metrics, implemented in C with Python bindings. These services are not constantly processing. They should only be run if a user of the system queries for metric result data that is not in the results cache.
  • Git Services: Invoked according to how timely the data for any given project must be. These services clone the project repo, run git log, parse the output and then stash the metadata of each commit in Crate before killing the clone.
  • Frontend Services: These provide a REST API and web interface for manipulating the whole setup.
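To make the Git Services step a little more concrete, here is a minimal sketch of the "run git log, parse the output" part. It is not the code from my repo: the format string, the field names and the function names are all assumptions made up for illustration. The idea is to ask git for one record per commit with unambiguous separators, then split that into dictionaries.

```python
import subprocess

# Hypothetical layout: fields separated by 0x1f, commits terminated by 0x1e.
# %H = hash, %an = author name, %ae = author email, %aI = ISO 8601 author date,
# %s = subject line.
GIT_LOG_FORMAT = "%H%x1f%an%x1f%ae%x1f%aI%x1f%s%x1e"
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"
FIELDS = ("sha", "author_name", "author_email", "authored_at", "subject")


def read_git_log(repo_path):
    """Run `git log` inside an existing clone and return the raw output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:" + GIT_LOG_FORMAT],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_git_log(raw, project):
    """Turn raw `git log` output into one dict per commit."""
    commits = []
    for record in raw.split(RECORD_SEP):
        # git puts a newline between commits; strip only that, because
        # Python treats the separator characters themselves as whitespace.
        record = record.strip("\n")
        if not record:
            continue
        values = record.split(FIELD_SEP)
        if len(values) != len(FIELDS):
            continue  # skip malformed records rather than guess
        commit = dict(zip(FIELDS, values))
        commit["project"] = project
        commits.append(commit)
    return commits
```

Something like parse_git_log(read_git_log("/path/to/clone"), "metre") would then give a list of commits ready to be sent on for storage. Control characters are used as separators because commit subjects can contain almost anything else.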

Well... The Crate part of this is easy enough. Crate exists and works. And is extremely simple to deploy (read: "plays extremely well with Docker"). Yesterday my aim was to create a pipeline for getting metadata from Git into Crate.

A Few Things To Show Off

Firstly, you can go and grab my code on Github: https://github.com/therealpadams/metre

What you will find in there:

  • create_commits.py: A script for invoking git-log, parsing the output and then calling the REST API for inserting the metadata.
  • models.py: Contains all the classes for mapping in SQLAlchemy as well as creating the table. In this case just one table, for storing commit metadata.
  • pipelines.py: Provides the functions called by the REST API. At this point I have simple functions for inserting commits and closing the transaction.
  • requirements.txt: Should be familiar to anyone who has worked with Python virtualenv and pip... contains a list of the dependencies to be installed.
  • urls.py: Provides the REST API using Flask. At this moment it provides one simple function for receiving log metadata and stashing it in Crate.
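For anyone who wants the flavour of "insert commits, then close the transaction" without opening the repo, here is a rough sketch. Big caveat: this is not what pipelines.py does. The real code goes through SQLAlchemy and talks to Crate; the sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere, and the table name, column names and function name are invented for illustration.

```python
import sqlite3

# Hypothetical single table for commit metadata (stand-in schema).
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS commits (
    sha TEXT PRIMARY KEY,
    project TEXT,
    author_name TEXT,
    authored_at TEXT,
    subject TEXT
)
"""

INSERT_COMMIT = """
INSERT INTO commits (sha, project, author_name, authored_at, subject)
VALUES (?, ?, ?, ?, ?)
"""


def insert_commits(connection, commits):
    """Insert a batch of commit dicts and commit the transaction."""
    rows = [
        (c["sha"], c["project"], c["author_name"], c["authored_at"], c["subject"])
        for c in commits
    ]
    cursor = connection.cursor()
    cursor.execute(CREATE_TABLE)
    cursor.executemany(INSERT_COMMIT, rows)
    connection.commit()  # close the transaction once the whole batch is in
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    example = [{
        "sha": "abc123",
        "project": "metre",
        "author_name": "Ada",
        "authored_at": "2016-01-25T10:00:00+01:00",
        "subject": "First commit",
    }]
    print(insert_commits(conn, example))
```

Batching the rows and committing once keeps the number of round trips down, which matters more as the number of projects grows.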

Right now, you can run the urls script and fire in data from the create_commits script and it will fail. I will fix this all before the end of the day. If all goes well, by the end of the day there will be a simple metric (commits per day?) script in the repo, too.
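The metric script does not exist yet, but a commits-per-day count is simple enough to sketch. This assumes each commit is a dictionary with an ISO 8601 timestamp under a hypothetical "authored_at" key; the function name is also made up.

```python
from collections import Counter
from datetime import datetime


def commits_per_day(commits):
    """Count commits per calendar day, keyed by YYYY-MM-DD."""
    counts = Counter()
    for commit in commits:
        # The day is taken in the author's own timezone, as recorded by git.
        day = datetime.fromisoformat(commit["authored_at"]).date()
        counts[day.isoformat()] += 1
    return dict(sorted(counts.items()))
```

In the real system the counting would probably be pushed down into a GROUP BY query in Crate rather than done in Python, but the result has the same shape: one number per project per day.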

This is far far far from great code. At the moment my aim is simply to learn the technologies I am working with and have a play with the overall pipeline for the data. After this playground edition is complete, I will stash it somewhere by itself in the repo and go about hacking "the real thing". Although this is not likely to happen until after FOSDEM.

Colophon:

Another header photo taken at the Crate Snow Sprint. Again, thanks to Crate.io for sponsoring my trip to Austria.

Monday, 25 January 2016

Crate Snow Sprint: Day 0 (I Need Help)

Now for something a little different: Crate. Thanks to the kind sponsorship of Crate.io, I am attending the annual Snow Sprint. This is an event that has been in existence for many years (certainly over a decade); originally it was a get-together for Zope/Plone developers. Those of you with very long memories might remember that I used to be part of the Plone community and even worked for Zope Europe Association at one point. The Plone Snow Sprint is still a "thing". But, with former Zope/Plone developers involved in the company/community, the Crate Snow Sprint is also a "thing".

Getting To Know Crate

Crate is a high-performance, distributed database. Very easy to deploy (think: "the database storage for Docker") and designed to help manage subsets of very large datasets (Clusters of 100s of nodes? Why not!? Petabytes of data? Hells yeah! Hundreds of billions of rows? Come get some.)

From the marketing blurb:

Crate has been designed to be a highly distributed high performing database. Before Crate organizations had to compromise on performance if they wanted to keep the ease of use benefits of using SQL stores, or move to a No-SQL store and deal with the complexities of the query languages and rewriting their code. With Crate you get the best of both worlds: the No-SQL performance you require and the SQL syntax you want. Crate can be used for a variety of use cases, from a classic but scalable SQL database, to advanced usage incorporating full text search, geo shape and analytics support.

Big Data, of course, means we are not talking "classic" SQL. No foreign key support, for example; these just do not scale very well. Instead, all related data should just get fired into the same table. What you lose in space efficiency you gain in speed. Lots of speed.

Why Do You Care About Crate, Paul?

The Big Data space is full of incredible technology and is evolving fast. With the growth in Big Data we have seen a growth in technologies to enable Big Data; containers are a huge part of this. If you are not clued-in on containers, go take a look at what Wikipedia has to say on it. Until now containerisation (that's a word, right?) has really been focused on application deployment. Crate is the primary contender as the storage for containers, allowing extremely easy deployments in order to arbitrarily grow the clusters. As I said, rolling out 100s of nodes in a Crate deployment is really not hard to do.

So why do I care? Who here remembers the SQO-OSS project? The premise of this project was simple: create a system to measure the quality of a piece of Free Software, where "quality" was defined by any arbitrary collection of metrics by the user. The system maintained clones of SVN repositories and regularly ensured they were up-to-date. Metric scripts would then be run against these repositories and an SQL store would keep the results which could then be viewed through a web client.

This project, like almost every EC-funded project I have ever worked on, was a successful failure. We built exactly what we intended to build and helped develop knowledge on software quality (metrics), for which we had an extraordinary number of publications to prove it. I say "failure", however, because the architecture of the system we built was heavily tied to the hardware purchased for the project (i.e. one seriously fat SPARC server from Sun). Arbitrary scaling of the datastore, processing and front-end? Never taken into consideration. Ironically, the backbone of the tool we developed was Equinox, which could have enabled such an architecture. We really did not make best use of that technology, however.

The... result.... was....... very........... slow.

Why do I care about Crate? Because I want to fix ^ that problem. In short: I want to create an arbitrarily scalable solution for metric processing, result caching and visualisation in a timely fashion.

So What Do You Have In Mind, Paul?

I want to create a solution that is arbitrarily scalable to the needs of those using it. To that end I envisage four discrete components that can be scaled according to need:

  • Front-end nodes using Python + Flask.

REST and web front-ends for data input, retrieval and visualisation. A typical deployment will not need many of these, I guess. However, a public deployment of the system for a popular Free Software project might well need extra oomph here. In the case of data retrieval, data will be grabbed from the results cache or, if the result is not available yet, a metric processing node will be triggered.

  • Data entry nodes.

These would simply be responsible for gathering the metadata from (probably just git, at first) repositories and entering the data in the (potentially very large, but hardly "Big Data") table in the data storage. Python + SQLAlchemy here. These would need to be scaled up with the number of projects being analysed.

  • Metric processing nodes.

These will be scripts that process data from the metadata store and stash the result in the results cache. I mostly envisage Python for basic metrics or C/C++ libraries for anything heavyweight, with dispy for process distribution.

  • Storage nodes.

Err... Crate.
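The "grab from the results cache or, if the result is not there yet, trigger a metric processing node" behaviour of the front-end is the part most worth pinning down early, so here is a minimal sketch of it. Everything in it is a stand-in: a plain dict plays the results cache (which would really be a table in Crate), and a plain function plays the metric processing node. The names are hypothetical.

```python
def get_metric_result(cache, compute, metric, project, day):
    """Return a cached result, computing and storing it on a cache miss."""
    key = (metric, project, day)
    if key not in cache:
        # Cache miss: this is where a metric processing node would be triggered.
        cache[key] = compute(metric, project, day)
    return cache[key]


if __name__ == "__main__":
    calls = []

    def fake_metric_node(metric, project, day):
        # Stand-in for real metric processing; records that it was invoked.
        calls.append((metric, project, day))
        return 7

    results_cache = {}
    get_metric_result(results_cache, fake_metric_node, "commits_per_day", "metre", "2016-01-25")
    get_metric_result(results_cache, fake_metric_node, "commits_per_day", "metre", "2016-01-25")
    print(len(calls))  # the metric only ran for the first request
```

Keying the cache on metric, project and day means the expensive processing happens at most once per combination, no matter how many front-end nodes ask for it.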

Where Am I At?

Well, it is a long time since I did any "real" Python programming. Or any programming for that matter. So I'm starting from the very beginning and getting my head around the technology stack that I am envisioning.

As part of my work at the Snow Sprint, yesterday, I started work on some basic code for data entry using Python and SQLAlchemy, which works very nicely with Crate. The fruits of my labour can be found on Github (promise not to laugh!). By the end of today (Day 1) I hope to have a nice pipeline: git repo -> metadata -> data entry node -> crate. Then tomorrow I will implement a basic metric (daily commit count?) and my work for the Snow Sprint will be done. At the moment I have no documentation tucked away in there; I will sort this out sometime after the sprint.

For me the next step will be to really nail down the basic architecture, ensuring ease of deployment and scaling. Once I have my head around that, building of the actual v0.1 system will not be too hard, I think. At least not for very simple metrics.

Call For Help

At the moment the system I am building is nothing more than an interesting demo for Crate (it is really showcasing simplicity rather than, say, scalability, since this will never be a real BigData application). However, as someone who cares about software/community/developer metrics I would love to see this mini-project turn into something real. If you have an interest in helping me develop this into a real tool, something that developers could really benefit from, I would be very happy to hear from you and get this thing moving. I have not picked a license yet, because I want to engage the whole project team in that discussion, if I manage to grow one! :) But, most definitely, Free Software.

If you have an interest in metrics and would like to talk to me about this project, feel free to reach me by any means sensible (all my social media/email links are at the top of the page). Alternatively, if you are there, feel free to grab me for a chat at either the FLOSS Community Metrics Meeting or FOSDEM later this week.

Colophon:

The header image on this page is a photo I took during the setup of the snow sprint. As you might imagine, it is not enough (or sensible) for us to make use of the router provided in the chalet. Lots of cabling everywhere for our own network, of course! I will post more photos later.