Here at Numerate, a lot of our data is stored in standard relational databases, but we also deal with a significant amount of semi-structured data. This data comes from various sources, such as lab results, data files from outside vendors, and our internal databases. We often need to rapidly prototype new tools and ideas, but because we’re still experimenting with this data and trying to understand it, we frequently don’t yet know how to organize it.
Unix pipes and tools like perl, grep, awk, R, and Octave are incredibly useful for dealing with unstructured data consisting of simple commonly used types (e.g., integers, floating point numbers, strings, associative arrays), but we frequently also have more complex data that we want to analyze (e.g., molecules, proteins, virtual assays). When using Unix pipes for ad-hoc processing, we observed that we were writing lots of small, one-off, single-use Java apps and creating lots of temporary files. Many of the apps were just a few lines of Java, and provided minimal functionality, but were nonetheless necessary. This was especially painful for users in production environments as they had to submit patches to get these one-off apps into releases!
What we realized we wanted was a simple way to expose our Java libraries and APIs to Unix pipes, and thus Nubs (NUmerate Bean Shell) was born.
Nubs has a feature set heavily inspired by (read: stolen from) the Perl command-line binary. The scripting language is Beanshell rather than Perl, but most of the behavior and command-line options are identical. There is also a distributed version that lets us move particularly complicated processing onto our in-house cluster or our AWS cloud cluster. In many fields, the overhead of distributing the work would exceed the cost of evaluating the script itself, but because many of our workloads are compute-bound, not data-bound, distribution makes sense.
While our particular implementation of Nubs is closed-source (for now… stay tuned), it was remarkably easy to create. The current implementation consists of a few hundred lines of Java code, most of which is concerned with command-line parsing and validation. Our main regret with Nubs is that we didn’t create it earlier.
Some implementation tips for anyone thinking about rolling their own version:
- To prevent runaway processes, it is useful to install a signal handler for SIGPIPE that terminates the program when the output pipe is disconnected.
- Passing complex, aggregate data types is difficult to do in a Unix pipeline. As a simple, pragmatic hack, we encode such objects using Java Serialization, followed by hex-encoding. We call this serhex and have implemented two Beanshell custom commands, serhex and deserhex, to handle interconversion.
- The original Beanshell project has been inactive since 2005, but the beanshell2 fork has many fixes and improvements that you may find useful.
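The serhex idea from the tips above is easy to sketch in plain Java. The following is a minimal illustration, not the Nubs implementation: the class name SerHex and the method signatures are assumptions made for this example, and the real serhex and deserhex are Beanshell custom commands whose details are not shown in this post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper: Java Serialization followed by hex-encoding, so that
// one object fits on one line of text in a Unix pipeline.
final class SerHex {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Serialize the object, then encode each byte as two lowercase hex digits.
    static String serhex(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        byte[] raw = bytes.toByteArray();
        StringBuilder sb = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    // Decode the hex digits back to bytes, then deserialize the object.
    static Object deserhex(String hex) throws IOException, ClassNotFoundException {
        String s = hex.trim();
        if (s.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] raw = new byte[s.length() / 2];
        for (int i = 0; i < raw.length; i++) {
            int hi = Character.digit(s.charAt(2 * i), 16);
            int lo = Character.digit(s.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("non-hex character near index " + (2 * i));
            }
            raw[i] = (byte) ((hi << 4) | lo);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String encoded = serhex("benzene");          // any Serializable works
        System.out.println(encoded);                 // a single line of hex digits
        System.out.println(deserhex(encoded));       // round-trips to the original value
    }
}
```

Because the encoded form contains only hex digits, it is safe to combine with tab-delimited fields and tools like sort and head. Note that Java deserialization should only be applied to input you trust.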
Here’s a sample to give you a feel for how it’s used:
nubs -e 'for (mol : readMols("input.sdf")) println(serhex(mol));' |
nubs -p -e 'LINE + "\t" + molWeight(deserhex(LINE))' |
sort -k 2 -n -r | head -n 10 |
nubs -n -a -e 'appendMol("output.sdf", SPLIT[0])'
The first line executes a small script defined on the command line (indicated by the -e flag) which reads molecules out of an input file and prints the serhex’ed version of each molecule to stdout. Beanshell allows us to use the mol variable without declaring its type, which will be determined at runtime.
The next line appends a tab and the molecular weight to each line. Here, the -p flag indicates that Nubs should read from stdin and apply the script to each line (binding it to the variable LINE), printing the result to stdout.
The third line is standard Unix pipes goodness, sorting the lines by molecular weight and taking the 10 heaviest.
And finally, the last line extracts the serhex’ed molecules and writes them to an output file. The -n flag is similar to the -p flag, but no output is emitted to stdout. The -a flag, used alongside it, splits each input LINE (by default on TAB delimiters) and stores the resulting array in SPLIT.
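For anyone rolling their own version, the per-line behavior of the -p, -n, and -a flags described above can be sketched as a small loop. This is an illustration under stated assumptions, not the Nubs source: the class LineLoop, its run method, and the use of a Java lambda in place of a Beanshell script are all hypothetical stand-ins. As a simple alternative to a signal handler, the sketch detects a closed output pipe by calling PrintStream.checkError(), since PrintStream swallows write errors instead of throwing them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.Reader;
import java.util.function.BiFunction;

// Hypothetical sketch of a Perl-style line loop. A real tool would evaluate a
// Beanshell script per line; here a BiFunction stands in for the script.
final class LineLoop {
    // printResult models -p (print each result) versus -n (print nothing).
    // autoSplit models -a (split LINE on TAB and expose the array as SPLIT).
    static void run(Reader input, PrintStream out, boolean printResult, boolean autoSplit,
                    BiFunction<String, String[], Object> script) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        String line;
        while ((line = reader.readLine()) != null) {
            // A limit of -1 keeps trailing empty fields instead of dropping them.
            String[] split = autoSplit ? line.split("\t", -1) : new String[0];
            Object result = script.apply(line, split);
            if (printResult) {
                out.println(result);
                // checkError() flushes and reports whether a write has failed,
                // e.g. because a downstream "head" exited and closed the pipe.
                if (out.checkError()) {
                    return;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Roughly analogous to: nubs -p -a -e 'LINE + "\t" + SPLIT.length'
        run(new InputStreamReader(System.in), System.out, true, true,
            (LINE, SPLIT) -> LINE + "\t" + SPLIT.length);
    }
}
```

Stopping as soon as the output stream reports an error keeps an upstream stage from grinding through the rest of its input after a downstream stage such as head has already exited.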
Before Nubs, the previous task would have required us to write an absurdly specific and short Java app that would probably never get used again. Now, we can prototype the same thing with a couple lines of script and not worry about leaving vestigial code in our codebase. For those Nubs scripts that we use a lot (like the first and last lines in the sample above), we eventually convert them into production-quality, optimized applications. As a result, we’ve compiled a small toolbox of serhex-enabled Java applications and Beanshell custom commands that have enabled some really complex operations with only a couple lines of code. But that’ll have to wait for another post… In the meantime, we hope we’ve inspired you to try your hand at mixing Java and Unix pipes and we’d love to hear about what you come up with!

