Sunday 9 November 2014

Parsing GIT Logs With Files

So things have been a little quiet on the GIT log analysis front. Various out-of-band activities were in my way but, more importantly, so was a directly related technical issue... Just how do you get file data out of GIT logs?

Not As Easy As It Sounds

The problem is this... In the good ol' days I was parsing SVN log files that were XML-formatted; svn log --xml. I also needed the file data to be included in that for many of my analysis tools; svn log -v --xml. Typical output looked something like this:

<?xml version="1.0" encoding="UTF-8"?>
<log>
  <logentry revision="58903">
    <author>padams</author>
    <date>2014-11-04T15:22:30.364435Z</date>
    <paths>
      <path action="A">/Some/Path/To/A/File</path>
    </paths>
    <msg>New WMD contract</msg>
  </logentry>
</log>
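As a rough illustration of how little code that format needs, here is a minimal sketch using Python's standard xml.etree.ElementTree. The function name parse_svn_log and the dictionary keys are my own choices for illustration, not part of any existing tool:

```python
import xml.etree.ElementTree as ET

# Sample copied from the svn log -v --xml output shown above.
SVN_LOG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<log>
  <logentry revision="58903">
    <author>padams</author>
    <date>2014-11-04T15:22:30.364435Z</date>
    <paths>
      <path action="A">/Some/Path/To/A/File</path>
    </paths>
    <msg>New WMD contract</msg>
  </logentry>
</log>
"""


def parse_svn_log(xml_text):
    """Return one dict per <logentry> (hypothetical helper, for illustration)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for entry in root.findall("logentry"):
        entries.append({
            "revision": entry.get("revision"),
            "author": entry.findtext("author"),
            "date": entry.findtext("date"),
            "message": entry.findtext("msg"),
            # Every <path> under <paths> is one file touched by the revision.
            "files": [path.text for path in entry.findall("paths/path")],
        })
    return entries


entries = parse_svn_log(SVN_LOG_XML)
print(entries[0]["author"], entries[0]["files"])
```
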

Nice and easy to parse. GIT does not provide any such XML output. It does provide a --format option where you can specify the output format, but that works for everything except the files. So with GIT I can get all the information I want, just not in one handy format, and I need to XMLify it myself if I want to make use of my existing parser. Here is git log --name-only:

commit f2fe64b234c31c703c998cacdc1c3cff43f4e05f
Author: Lamarque V. Souza <lamarque@kde.org>
Date: Fri Nov 7 22:37:56 2014 -0200

    Doxygen configuration for http://api.kde.org/.

doc/api/Doxyfile.local

The Approach

I sat down with Ade at Academy and we came to the conclusion that actually the best way to get this file data out of GIT was to use libgit. Holy moly. No, just no. Academy was some time ago and that was the last time I had a chance to look into this. A little bit of Googling around led me to this post which did everything I needed apart from handle the file data. In fact, I really could not find any information on extracting the file data out of a GIT log. Well, thankfully, the modification was easy. So many thanks to the author of that blog post and here is my recipe!

import pprint
import subprocess

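GIT_COMMIT_FIELDS = ['id', 'author_name', 'author_email', 'date', 'message', 'files']
GIT_LOG_FORMAT = ['%H', '%an', '%ae', '%ad', '%s']
# %x1e (record separator) starts each commit and %x1f (unit separator) ends
# each field, so the file list is whatever follows the last %x1f.
GIT_LOG_FORMAT = '%x1e' + '%x1f'.join(GIT_LOG_FORMAT) + '%x1f'

# An argument list needs no shell quoting; decoding makes this run on Python 3 too.
log = subprocess.check_output(
    ['git', 'log', '--name-only', '--format=' + GIT_LOG_FORMAT]).decode('utf-8', 'replace')
log = log.strip('\n\x1e').split('\x1e')
# Strip newlines only: a bare strip() on text also removes a trailing \x1f,
# which would drop the 'files' field of commits that touch no files.
log = [row.strip('\n').split('\x1f') for row in log]
log = [dict(zip(GIT_COMMIT_FIELDS, row)) for row in log]

for row in log:
    # Discard blank separator lines; commits without files get an empty list.
    row['files'] = [name for name in row['files'].split('\n') if name]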

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(log)

At the end of that short script we are left with a list of Python dictionaries, one per commit, which can easily be processed within Python or output as XML for use by my other scripts. Hurrah!
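To see how the two delimiters carve up the output without needing a repository to hand, here is a small sketch that applies the same splitting to a hand-written string. The function name parse_git_log and the sample commits are made up for illustration, and the sample only approximates what git prints. Each record is stripped of newlines only, because under Python 3 a bare strip() would also swallow a trailing \x1f:

```python
FIELDS = ['id', 'author_name', 'author_email', 'date', 'message', 'files']
RS, US = '\x1e', '\x1f'  # record separator (%x1e) and unit separator (%x1f)


def parse_git_log(raw):
    """Split delimiter-formatted log text into a list of dicts (illustrative helper)."""
    commits = []
    for record in raw.strip('\n' + RS).split(RS):
        # Strip newlines only, so an empty trailing 'files' field survives.
        commit = dict(zip(FIELDS, record.strip('\n').split(US)))
        commit['files'] = [name for name in commit['files'].split('\n') if name]
        commits.append(commit)
    return commits


# Invented sample: the first commit touches two files, the second touches none.
SAMPLE = (
    RS + US.join(['abc123', 'Alice', 'alice@example.org',
                  'Fri Nov 7 22:37:56 2014 -0200', 'Add docs']) + US
    + '\n\ndoc/a.txt\ndoc/b.txt\n'
    + RS + US.join(['def456', 'Bob', 'bob@example.org',
                    'Thu Nov 6 10:00:00 2014 +0000', 'Merge branch']) + US
    + '\n'
)

commits = parse_git_log(SAMPLE)
for commit in commits:
    print(commit['id'], commit['files'])
```

Keeping the parsing in a function like this also makes it easy to check against canned input before pointing it at a real repository.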

Simple solution, but it took a while to get to it. Using the --format option to create sensible delimiters between log entries and their fields had already occurred to me and I had tried things like:

GIT_LOG_FORMAT = '<logentry revision="%H"><author>%an.......<paths>...'

The problem with this, of course... How do I insert the closing </paths> or the individual <path></path> tags? The files are not one of the items that can be formatted using git log --format!

Thanks to the other blog post, I realised that the XMLification should really be a separate step after extracting the required data.
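For completeness, that separate XMLification step might look something like the sketch below. It assumes commit dictionaries shaped like the ones the script above produces, with 'files' already split into a list. The function name commits_to_xml is hypothetical, and the element names simply mirror the SVN output shown earlier:

```python
import xml.etree.ElementTree as ET


def commits_to_xml(commits):
    """Build SVN-style log XML from commit dicts (hypothetical helper)."""
    root = ET.Element('log')
    for commit in commits:
        entry = ET.SubElement(root, 'logentry', revision=commit['id'])
        ET.SubElement(entry, 'author').text = commit['author_name']
        ET.SubElement(entry, 'date').text = commit['date']
        paths = ET.SubElement(entry, 'paths')
        # One <path> per file, which --format alone cannot produce.
        for name in commit['files']:
            ET.SubElement(paths, 'path').text = name
        ET.SubElement(entry, 'msg').text = commit['message']
    return ET.tostring(root, encoding='unicode')


# Sample commit taken from the git log --name-only output shown earlier.
commits = [{
    'id': 'f2fe64b234c31c703c998cacdc1c3cff43f4e05f',
    'author_name': 'Lamarque V. Souza',
    'author_email': 'lamarque@kde.org',
    'date': 'Fri Nov 7 22:37:56 2014 -0200',
    'message': 'Doxygen configuration for http://api.kde.org/.',
    'files': ['doc/api/Doxyfile.local'],
}]

xml_text = commits_to_xml(commits)
print(xml_text)
```

Two differences from the SVN output remain in this sketch: the date is passed through exactly as git printed it rather than in SVN's ISO form, and there is no action attribute on each path, since --name-only reports file names only.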
