Just a quick announcement that I've created Guile Scheme bindings for libmagic: guile-magic. The top-level API is trivial:

scheme@(guile-user)> (use-modules (magic))
scheme@(guile-user)> (magic-file-type "/usr/bin/file")
$2 = "ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/, for GNU/Linux 3.2.0, BuildID[sha1]=2b26928f841d92afa31613c2c916a3abc96bbed8, stripped"

but the Scheme programmer has access to nearly everything provided by magic.h, e.g.:

  (let ((m (make-magic-set #:opts (list magic-symlinks magic-compress)
                           #:magic "my-cool-magic-file"
                           #:params '((magic-param-indir-max . 32)))))
    (format #t "~a\n" (magic-file m some-file))
    (format #t "~a\n" (magic-buffer m some-bytevector)))
;; just drop `m' on the floor; it will be cleaned up automatically

I hope someone finds this useful; you can download it from GitHub or from my distributions page.

libmagic is the core library for the venerable file command, and I learned a lot about it on this project. Thing is, it's showing its age.

The library, not the command-line program, returns results as human-readable text (as ASCII-encoded US English, to boot). A library may in limited circumstances reasonably include text intended for human readers (in log messages, say), but in general, and certainly when it comes to reporting its primary results, it should return data intended for processing by a program. Representing that data as text, and localizing it for the particular reader, is the responsibility of the application using the library.

I suppose one could argue that it's not producing general-purpose text, but rather technical abbreviations that wouldn't be localized in any event, but still. As an application developer trying to determine a file's type, I shouldn't have to scan a string to figure out that I've got an ELF shared object: that should be an enumerated value of some kind.
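To make the complaint concrete, here is roughly the shape of result I have in mind, as a minimal sketch in Rust. Every name in it (FileType, ElfKind, describe, is_elf_shared_object) is hypothetical; none of this exists in libmagic or guile-magic, and a real design would need far more variants.

```rust
// Hypothetical result types: nothing here is part of libmagic or guile-magic.
#[derive(Debug, Clone, PartialEq, Eq)]
enum ElfKind {
    Executable,
    SharedObject,
    Relocatable,
    Core,
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum FileType {
    Elf {
        bits: u8,
        little_endian: bool,
        kind: ElfKind,
        stripped: bool,
    },
    Text {
        encoding: String,
    },
    Unknown,
}

// The caller asks a typed question instead of scanning a string for "ELF".
fn is_elf_shared_object(t: &FileType) -> bool {
    matches!(t, FileType::Elf { kind: ElfKind::SharedObject, .. })
}

// Rendering for humans (and any localization) lives in the application.
fn describe(t: &FileType) -> String {
    match t {
        FileType::Elf { bits, kind: ElfKind::SharedObject, .. } => {
            format!("{}-bit ELF shared object", bits)
        }
        FileType::Elf { bits, .. } => format!("{}-bit ELF file", bits),
        FileType::Text { encoding } => format!("{} text", encoding),
        FileType::Unknown => String::from("unknown"),
    }
}
```

With something like this, the library hands back data and the application decides whether, and in what language, to turn it into text.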

The API is C, with all that that entails. In particular, we're back to the nineties-era "alloc-handle/use-handle/close-handle" programming idiom that inevitably produces resource leaks. Worse, magic_setparam takes a const void* that in reality has to point to a size_t, the size of which of course varies depending on your platform; if you as the caller get that wrong, congratulations: you just SEGV'd.
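For contrast, here is a sketch of how a language with destructors and real types sidesteps both problems. It is Rust and makes no libmagic calls at all: MagicSet, open, and set_indir_max are hypothetical names, and the shared counter merely stands in for a C handle so that the automatic release is observable.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Hypothetical stand-in for a C handle; `live` counts handles still open.
struct MagicSet {
    indir_max: usize,
    live: Rc<Cell<usize>>,
}

impl MagicSet {
    // "alloc-handle": acquiring the resource is tied to creating the value.
    fn open(live: Rc<Cell<usize>>) -> MagicSet {
        live.set(live.get() + 1);
        // 0 is a placeholder initial value, not libmagic's actual default.
        MagicSet { indir_max: 0, live }
    }

    // The parameter is a typed usize, not a void pointer the caller has to
    // aim at an integer of the right width.
    fn set_indir_max(&mut self, n: usize) {
        self.indir_max = n;
    }
}

// "close-handle": runs automatically when the value goes out of scope, so
// there is no close call for the caller to forget.
impl Drop for MagicSet {
    fn drop(&mut self) {
        self.live.set(self.live.get() - 1);
    }
}
```

Passing the wrong kind of value to set_indir_max is a compile error rather than a crash at run time, and the handle is released on every exit path from the enclosing scope, early returns included.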

What I'm leading up to is this: is it time for a re-implementation of libmagic (and file, for that matter) in a modern language? I'm picturing a backwards-compatible implementation in Rust along the lines of ripgrep or dust ("fusty" [file + rusty] anyone?). It would not be a weekend project like this one: parsing & evaluating the arbitrary magic checks looks formidable. Nor am I aware of any general dissatisfaction other than my own. For now, something for my "someday/maybe" list.

10/19/20 06:46
