Tuesday 26 January 2016

Crate Snow Sprint: Day 1 (Stashing Git Metadata)

[If you haven't done so already, take a quick look at what I wrote yesterday about the Crate Snow Sprint].

So yesterday was the first "real" day of the Snow Sprint. I used the time to start implementing a very rudimentary metric processing tool. To recap: I want to build a service-oriented system with arbitrarily scalable components. In the cloud. (Bingo!). Whilst this is largely a demonstration of Crate and an opportunity for me to learn about that technology, it is a serious project that I would like to see grow into something meaningful.

So here I write about the nonsense code I wrote yesterday and what I have managed to achieve in prototyping my intended solution.

Bitten By Python

Many moons have passed since I was last paid to program in Python. The language has changed in some crucial ways since then, but not so much that it is hard to update your thinking from Python 2. I always find Python a real joy to work with, so updating my own thinking really was not a chore.

As well as refamiliarising myself with Python, my aim for yesterday was to get my head around the basics of certain technologies I was planning to use as part of this project, namely Flask and SQLAlchemy.

What Am I Building?

If you have yet to go back and read yesterday's blog post, now is a good time. Yesterday I envisaged a system with 4 distinct component services, each independently scalable:

  • Crate: Used as storage for Git metadata (who committed, what, when and in which project) and metric results (which metric gave what result, for which project and at what time). Depending on the number of projects and metrics we could potentially get close to "big data" territory. The reason for opting for Crate is the scaling: a system like this is going to be inserting and retrieving data concurrently in great volume. The clustering of Crate will really help with this.
  • Metric Services: Invoked according to how often they need to be run (typically hourly, daily or weekly). These services will grab metadata from Crate in order to run a metric. Simple metrics could be implemented in Python or, for more computationally intensive metrics, implemented in C with Python bindings. These services are not constantly processing. They should only be run if a user of the system queries for metric result data that is not in the results cache.
  • Git Services: Invoked according to how timely the data for any given project must be. These services clone the project repo, run git log, parse the output and then stash the metadata of each commit in Crate before killing the clone.
  • Frontend Services: These provide a REST API and a web interface for managing the whole setup.
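
To make the Git Services step a little more concrete, here is a minimal sketch of the "run git log, parse the output" part. This is not the code from my repo: the function names, the field names and the choice of control characters as separators are all assumptions made for illustration. The format placeholders (%H, %an, %ae, %aI, %s, %x1f) are standard git-log ones.

```python
import subprocess

# ASCII unit/record separators keep fields intact even when a commit
# subject contains commas, pipes or other printable delimiters.
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"
GIT_FORMAT = "%H%x1f%an%x1f%ae%x1f%aI%x1f%s%x1e"

# Hypothetical field names; the real table columns may well differ.
FIELDS = ("sha", "author_name", "author_email", "authored_at", "subject")


def parse_git_log(raw):
    """Turn raw `git log` output produced with GIT_FORMAT into a list of dicts."""
    commits = []
    for record in raw.split(RECORD_SEP):
        # git puts a newline between commits; strip only that, because
        # Python treats "\x1f" as whitespace and a bare strip() would eat it.
        record = record.strip("\n")
        if not record:
            continue
        values = record.split(FIELD_SEP)
        if len(values) != len(FIELDS):
            continue  # skip anything that does not match the expected shape
        commits.append(dict(zip(FIELDS, values)))
    return commits


def read_git_log(repo_path):
    """Run git log inside an existing clone and return the parsed commits."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:" + GIT_FORMAT],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_git_log(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parser can be exercised with canned text, without needing a clone on disk.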

Well... The Crate part of this is easy enough. Crate exists and works. And is extremely simple to deploy (read: "plays extremely well with Docker"). Yesterday my aim was to create a pipeline for getting metadata from Git into Crate.

A Few Things To Show Off

Firstly, you can go and grab my code on GitHub: https://github.com/therealpadams/metre

What you will find in there:

  • create_commits.py: A script for invoking git-log, parsing the output and then calling the REST API for inserting the metadata.
  • models.py: Contains all the classes for mapping in SQLAlchemy as well as creating the table. In this case just one table, for storing commit metadata.
  • pipelines.py: Provides the functions called by the REST API. At this point I have simple functions for inserting commits and closing the transaction.
  • requirements.txt: Should be familiar to anyone who has worked with Python virtualenv and pip... contains a list of the dependencies to be installed.
  • urls.py: Provides the REST API using Flask. At this moment it provides one simple function for receiving log metadata and stashing it in Crate.
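
For a rough idea of how a script like create_commits.py might hand one parsed commit to the REST API, here is a sketch using only the standard library. The endpoint URL, the JSON payload shape and the function names are assumptions for illustration and are not taken from the repo.

```python
import json
import urllib.request

# Hypothetical endpoint: the route actually exposed by urls.py may differ.
API_URL = "http://localhost:5000/commits"


def build_commit_request(commit, url=API_URL):
    """Build (but do not send) a JSON POST request for one commit dict."""
    body = json.dumps(commit).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def post_commit(commit, url=API_URL):
    """Send the commit to the API and return the HTTP status code."""
    with urllib.request.urlopen(build_commit_request(commit, url)) as response:
        return response.status
```

Splitting "build" from "send" makes it possible to inspect the request without a running server.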

Right now, you can run the urls script and fire in data from the create_commits script and it will fail. I will fix this all before the end of the day. If all goes well, by the end of the day there will be a simple metric (commits per day?) script in the repo, too.
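
A commits-per-day metric could be as small as the sketch below. It assumes each commit is a dict with an ISO 8601 timestamp under a hypothetical "authored_at" key, and it counts days in each commit's own UTC offset rather than normalising to a single time zone. In the real system the input would come from Crate rather than an in-memory list.

```python
from collections import Counter
from datetime import datetime


def commits_per_day(commits):
    """Count commits per calendar day, keyed by ISO date string."""
    counts = Counter()
    for commit in commits:
        # "authored_at" is an assumed field name holding an ISO 8601 timestamp.
        authored = datetime.fromisoformat(commit["authored_at"])
        counts[authored.date().isoformat()] += 1
    # Return the days in chronological order.
    return dict(sorted(counts.items()))
```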

This is far, far, far from great code. At the moment my aim is simply to learn the technologies I am working with and have a play with the overall pipeline for the data. After this playground edition is complete, I will stash it somewhere by itself in the repo and go about hacking "the real thing". Although this is not likely to happen until after FOSDEM.

Colophon:

Another header photo taken at the Crate Snow Sprint. Again, thanks to Crate.io for sponsoring my trip to Austria.
