Nubs

Here at Numerate, a lot of our data is stored in standard relational databases, but we also find ourselves dealing with a significant amount of semi-structured data.  This data comes from various sources, such as lab results, data files from outside vendors, our internal databases, etc. We often find ourselves needing to rapidly prototype new tools and ideas, but because we’re still experimenting with and trying to understand this data, we often don’t yet know how to organize it.

Unix pipes and tools like perl, grep, awk, R, and Octave are incredibly useful for dealing with unstructured data consisting of simple commonly used types (e.g., integers, floating point numbers, strings, associative arrays), but we frequently also have more complex data that we want to analyze (e.g., molecules, proteins, virtual assays). When using Unix pipes for ad-hoc processing, we observed that we were writing lots of small, one-off, single-use Java apps and creating lots of temporary files. Many of the apps were just a few lines of Java, and provided minimal functionality, but were nonetheless necessary. This was especially painful for users in production environments as they had to submit patches to get these one-off apps into releases!

What we realized we wanted was a simple way to expose our Java libraries and APIs to Unix pipes, and thus Nubs was born (NUmerate Bean Shell).

Nubs has a feature set heavily inspired by (read: stolen from) the Perl command-line binary. Unlike the Perl program, the scripting language is Beanshell, not Perl, but most of the behavior and command-line options are identical. There is a distributed version that lets us move particularly complicated processing onto our in-house cluster or AWS cloud cluster.  In many fields, the overhead of distribution would be higher than the cost of evaluating the script, but because many of our workloads are compute-bound, not data-bound, distribution makes sense.

While our particular implementation of Nubs is closed-source (for now… stay tuned), it was remarkably easy to create.  The current implementation consists of a few hundred lines of Java code, most of which is concerned with command-line parsing and validation.  Our main regret with Nubs is that we didn’t create it earlier.

Some implementation tips for anyone thinking about rolling their own version:

  • To prevent runaway processes, it is useful to install a signal handler for SIGPIPE that terminates the program when the output pipe is disconnected.
  • Passing complex, aggregate data types is difficult to do in a Unix pipeline. As a simple, pragmatic hack, we encode such objects using Java Serialization, followed by hex-encoding. We call this serhex and have implemented two Beanshell custom commands, serhex and deserhex, to handle interconversion.
  • The original Beanshell project has been inactive since 2005, but the beanshell2 fork has many fixes and improvements that you may find useful.
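The hex-encoding half of the serhex idea can be tried from the shell without any Java. The sketch below is not Nubs code: hexenc and hexdec are hypothetical helper names, and they work on arbitrary bytes rather than on Java-serialized objects. It only illustrates why hex helps: a payload full of tabs and newlines becomes a single token that line-oriented tools like sort and head pass along untouched.

```bash
#!/bin/bash
# Hypothetical helpers for illustration only (not part of Nubs).

# Encode all of stdin as one line of lowercase hex.
hexenc() {
  od -An -v -tx1 | tr -d ' \n'   # dump bytes as hex, strip od's spacing
  echo                           # terminate the line for downstream tools
}

# Decode one line of hex from stdin back into the original bytes.
hexdec() {
  local hex i
  IFS= read -r hex
  for ((i = 0; i < ${#hex}; i += 2)); do
    printf "\\x${hex:i:2}"       # emit one byte per pair of hex digits
  done
}

# Usage:
#   printf 'a\tb\n' | hexenc            # prints 6109620a
#   printf 'a\tb\n' | hexenc | hexdec   # prints the original bytes
```

Hex doubles the size of the payload, which is the price of keeping every record on one line with no delimiter characters inside it.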

Here’s a sample to give you a feel for how it’s used:

nubs -e 'for (mol : readMols("input.sdf")) println(serhex(mol));' |
  nubs -p -e 'LINE + "\t" + molWeight(deserhex(LINE))' |
    sort -k 2 -n -r | head -n 10 |
      nubs -n -a -e 'appendMol("output.sdf", SPLIT[0])'

The first line executes a small script defined on the command line (indicated by the -e flag) which reads molecules out of an input file and prints the serhex’ed version of each molecule to stdout. Beanshell allows us to use the mol variable without declaring its type, which will be determined at runtime.

The next line appends a tab and the molecular weight to each line. Here, the -p flag indicates that Nubs should read from stdin and apply the script to each line (binding it to the variable LINE), printing the result to stdout.

The third line is standard Unix pipes goodness, sorting the lines by molecular weight and taking the 10 heaviest.

And finally, the last line extracts the serhex’ed molecules and writes them to an output file. The -n flag is similar to the -p flag, but no output is emitted to stdout. In concert, the -a flag splits the input LINE (by default using TAB delimiters) and stores the resultant array in SPLIT.

Before Nubs, the previous task would have required us to write an absurdly specific and short Java app that would probably never get used again. Now, we can prototype the same thing with a couple lines of script and not worry about leaving vestigial code in our codebase. For those Nubs scripts that we use a lot (like the first and last lines in the sample above), we eventually convert them into production-quality, optimized applications. As a result, we’ve compiled a small toolbox of serhex-enabled Java applications and Beanshell custom commands that have enabled some really complex operations with only a couple lines of code. But that’ll have to wait for another post… In the meantime, we hope we’ve inspired you to try your hand at  mixing Java and Unix pipes and we’d love to hear about what you come up with!

Bash Goodies: Turbocharging your History

Everybody knows you can get a list of recently run commands in your bash shell using the history command. And it’s commonplace to use grep to search through that output to find some commands of interest. But doing so is cumbersome:  you may have to issue a bunch of different grep commands before you can find the command you were looking for, and — once found — you have to copy and paste that command from the output to the command line (or use the dangerous ‘!’ operator). And if it’s been a while, there’s a good chance the command has fallen off the end of your history file. Or, if that command was issued in another open shell session, it will be invisible to you. These are all problems we address in this Bash Goodies installment!

1. Use reverse-i-search to power through your history.

Hit CTRL-R at the bash command prompt to start a reverse-i-search through your history file. As you type each character, bash will interactively display the last command in your history which contained what you’ve typed so far. Hit CTRL-R again to take the cursor to the yet-previous instance of that substring (whether in the current command or a previous one), repeating as necessary. You can use backspace to edit the search string at any point, or hit CTRL-C if you’d like to exit out of the search. If you see the command you want to reissue, you can just hit ENTER. Or, you can hit the left or right arrow key to begin editing the command so you can issue a modified version. (And if you want to get fancy, you can use CTRL-K and CTRL-Y, killing and yanking, respectively, to combine bits and pieces of different commands.) Once you’ve gotten the hang of reverse-i-search you’ll probably find it much more powerful and convenient than grepping or (gasp) scrolling through your history with up and down arrows.

2. Make your history file big.

It doesn’t usually pay to be stingy with your history file. At some point you’ll regret not being able to summon up that magic one-liner for counting the number of roman numerals in all your text files created on odd-numbered days.  By default, bash keeps around only your most recent 500 commands. We’ve been storing the last 10000 with no ill effects, using these additions to our .bashrc files:

export HISTSIZE=10000
export HISTFILESIZE=10000

HISTSIZE sets the number of commands to save during any one session, and HISTFILESIZE sets the number of commands to save between sessions. It’s best that they be equal.

3. Instantaneously share history between bash sessions.

If you’re doing some work across multiple open terminals, it’s easy to lose track of which commands were entered where. This makes your history files terminal-dependent, and often less useful. But if you add these lines to your .bashrc file, your history will be synchronized across sessions every time you enter a command!

shopt -s histappend
export PROMPT_COMMAND="history -a; history -n; $PROMPT_COMMAND"

The first line configures bash to append to rather than overwrite your history file. The second says that every time bash generates a fresh command prompt, it should append the last run command to the history file, and load any new commands written to that file (from other shells) into the current history list.  Since these commands run when the command prompt is set, you’ll need to enter a command (or just hit enter) in the current shell to see new additions. (If you’d rather refresh the history on demand for old-school up-arrow behavior, remove the history -n part of that line and run it manually when desired.)

Another nice aspect of updating your history file after each command is that if you have to kill a shell without logging out (or you log in remotely before exiting that shell), you’ll still have access to your history.
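One wrinkle with the PROMPT_COMMAND line above is that re-sourcing your .bashrc prepends the history commands again each time. A minimal sketch of a guard against that follows. The helper name prepend_prompt_command is our own invention for this example, and it assumes PROMPT_COMMAND is a plain string (newer bash versions also allow an array).

```bash
# Hypothetical helper: prepend a command to PROMPT_COMMAND only if it is
# not already in there, so sourcing .bashrc twice is harmless.
prepend_prompt_command() {
  local cmd=$1
  case "${PROMPT_COMMAND:-}" in
    *"$cmd"*) ;;   # already present: leave PROMPT_COMMAND alone
    *) PROMPT_COMMAND="$cmd${PROMPT_COMMAND:+; $PROMPT_COMMAND}" ;;
  esac
}

shopt -s histappend
prepend_prompt_command 'history -a; history -n'
```

The ${PROMPT_COMMAND:+; $PROMPT_COMMAND} expansion adds the separator only when there was something there already, so an empty PROMPT_COMMAND does not end up with a dangling semicolon.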

4. Bonus: Add timestamps to your history.

If you add this to your .bashrc file, your history file will contain the date and time that each command was issued:

export HISTTIMEFORMAT="%F %T "
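HISTTIMEFORMAT is a strftime format string: %F is the ISO date (YYYY-MM-DD) and %T is the 24-hour time (HH:MM:SS). If you want to try a different format before committing it to your .bashrc, bash's own printf understands the same codes. The sketch below assumes bash 4.2 or later, and preview_histtimeformat is a hypothetical helper name.

```bash
# Preview a candidate HISTTIMEFORMAT with bash's printf %(...)T.
# The optional second argument is seconds since the epoch; -1 means "now".
preview_histtimeformat() {
  local fmt=$1 when=${2:--1}
  printf "%(${fmt})T\n" "$when"
}

# Show the current time the way the history command would prefix it.
preview_histtimeformat "%F %T "
```

In the history file itself, bash records each timestamp as a comment line holding seconds since the epoch, so changing the format later also changes how older entries are displayed.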

…Use Wisely

Of course, relying on your history file should never take the place of good documentation and scripting of your workflow. And one must be careful with a turbocharged history — it’s easy to reverse-i-search your way to a dangerous command and accidentally hit enter. But hopefully the above tips, which we’ve been using to good effect here at Numerate, will save you some time and hassle next time you realize that the “one-off” command you issued last week is going to come in handy today.  As they say, “Those who cannot remember the past are condemned to repeat it (probably with a mistake thrown in)!”

Bash Goodies: Running N of M Jobs at a Time

This is the first post in a series in which we explore some useful tricks for getting the most out of our favorite shell here at Numerate, bash. Here we look at how a bash while loop, process substitution, and the make command can be combined to yield a one-liner capable of running a list of piped-in commands, a user-specified number at a time, until they’re all done. This can be useful when you have a long list of memory- and CPU-intensive jobs you want run on a multi-core machine: running one at a time would be too slow, running all at once would thrash memory and put too much load on the machine, and breaking into fixed-size chunks would waste CPU as the batches finished up in a staggered fashion.

The command is useful and fun to dissect, but it won’t win any readability contests. Here it is in all its migraine-inducing, stomach-churning glory:

(while read line; do echo -e "$((++i)):\n\t$line"; done; echo all: $(seq 1 $i)) \
    | make -B -j $1 -f <(cat -) all

If we plop this into a bash script called run_n.sh (be sure to start it with #!/bin/bash, since /bin/sh isn’t guaranteed to support process substitution), we can run each line in commands.txt, five at a time, until all the commands are done, with:

cat commands.txt | run_n.sh 5

Let’s pick apart the script one pipe at a time:

while read line;

Reads each line of input and stores it into the line variable.

do echo -e "$((++i)):\n\t$line";

Spit out an integer, starting at one and incrementing for each input line, followed by a colon, a newline, a tab, and the input line. (Yes, C-style incrementing works!) This gets our input into Makefile format, with target names given by integers.

done;

Terminate the loop.

echo all: $(seq 1 $i)

Spit out the all target and make it depend on all the other targets. (Unfortunately make doesn’t appear to have a built-in ‘all’ target.) The $(command) syntax is just another, more readable and easily nestable version of `command`. And seq is a handy little command for generating a sequence of integers, which here correspond to the names of all our targets.
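To see exactly what make will be fed, the generator half of the one-liner can be run on its own. This is the same loop wrapped in a function. The name gen_makefile is ours, used only for this sketch, and i is initialized explicitly so the function can be called more than once.

```bash
#!/bin/bash
# Print the Makefile text that the one-liner builds from stdin.
gen_makefile() {
  local i=0 line
  while read line; do
    echo -e "$((++i)):\n\t$line"   # target "N:", then a tab-indented recipe
  done
  echo all: $(seq 1 $i)            # "all" depends on every numbered target
}

printf 'sleep 2; echo a\nsleep 2; echo b\n' | gen_makefile
# Prints (recipe lines start with a real tab):
#   1:
#       sleep 2; echo a
#   2:
#       sleep 2; echo b
#   all: 1 2
```

Because the numbered targets have no dependencies on each other, make is free to run as many of them in parallel as it is allowed to.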

The output of these three commands is grouped by parentheses and piped into:

make -B -j $1 -f <(cat -) all

The -B flag tells make to unconditionally make all targets, so we won’t skip executing a command even if a file with the target name (e.g., “1”) already exists. The -j flag tells it to run $1 (the argument to our script) processes at a time. The -f parameter expects an explicitly named Makefile, but instead of writing a temporary one and having to delete it afterwards, we make use of bash process substitution. This tricks make into thinking that its input is coming from a file while in reality it’s coming from executing the parentheses-enclosed command. In this case, that command is just cat -, which just copies stdin to stdout. You can invoke process substitution using <(command) or >(command) wherever a file is expected for input or output, respectively.

The final argument tells make to build target all, which will run all the commands. Here we send in some sleeping and echoing, with various values to the -j parameter, to see our command in action:

# Run 1 command at a time. make prints out each command just before execution.
$ for i in $(seq 1 4); do echo "sleep 2; echo $i"; done | run_n.sh 1
sleep 2; echo 1
1
sleep 2; echo 2
2
sleep 2; echo 3
3
sleep 2; echo 4
4

# Run 2 commands at a time.
$ for i in $(seq 1 4); do echo "sleep 2; echo $i"; done | run_n.sh 2
sleep 2; echo 1
sleep 2; echo 2
1
2
sleep 2; echo 3
sleep 2; echo 4
3
4

# Run 10 commands at a time.
$ for i in $(seq 1 4); do echo "sleep 2; echo $i"; done | run_n.sh 10
sleep 2; echo 1
sleep 2; echo 2
sleep 2; echo 3
sleep 2; echo 4
1
2
3
4

Whew, that was a trip! Hopefully you found this little script educational and useful. Is there an easier way to do this? If you know of one, please send it along!
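One candidate for an easier way, offered as a sketch rather than a drop-in replacement: xargs can cap the number of concurrent processes with its -P option. Note that -P and -0 are extensions found in GNU and BSD xargs, not part of POSIX, so check your platform. The function name run_n_xargs is hypothetical.

```bash
#!/bin/bash
# Run each line of stdin as a shell command, at most $1 at a time.
run_n_xargs() {
  local jobs=${1:?usage: run_n_xargs N}
  # Turn newlines into NULs so xargs does not reinterpret quotes or
  # backslashes, then hand each line to its own `sh -c`.
  tr '\n' '\0' | xargs -0 -n 1 -P "$jobs" sh -c
}

# Usage:
#   cat commands.txt | run_n_xargs 5
```

Unlike the make version, this does not echo each command before running it, and commands containing a dollar sign need no special escaping, since make is no longer interpreting them.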

FedEx Day at Numerate

A few weeks ago we had our first Fedex Day. It was a great success.

Fedex Day is an offshoot of Google’s “20% time”, named by Atlassian, and popularized by Daniel Pink’s excellent book Drive. In a Fedex Day, developers take time off from normal activities and spend a day working on out-of-the-box projects, then deliver the results at the end of the day. The point of Fedex Day is to foster creativity and innovation in the team, work on pet projects that people are excited about, and have some fun.

Our version of Fedex Day went like this: Thursday afternoon we gathered to share ideas and form teams. We encouraged people to collaborate with others with whom they hadn’t worked before. Our only guideline was the projects had to benefit the company in some way. We moved computers around to co-locate teams. Pizza and Red Bull for dinner. Many people worked late into the night and came in early the next day. At 4pm on Friday, everyone demoed their results to the rest of the company.

There was an explosion of creativity resulting in some impressive projects; it was great to see what people could do in just one day. Many of the projects will become part of our internal drug design application. Others are the first step in longer-term research or development projects; showing them off raised their visibility and increased their chances of future funding. In addition, the forced focus of a 24-hour window led to some valuable insights about teamwork, our development process, and the direction of our application.

The time we spent on Fedex Day was well worth it, and we will likely repeat the experience in the future. We encourage others to try it!

Java and Scientific Computing

Java is not a language typically associated with scientific computing. Historically, Java’s performance was not competitive with languages like C and C++, but its performance has steadily improved over the past 15 years to the point where it is comparable to C/C++ in many domains.

Still, Java has its weaknesses when it comes to scientific computing. The most obvious of these revolve around Java’s lack of support for numerical methods. For example, there is not a standard set of APIs for BLAS and LAPACK, no native support for complex numbers, and the generics system does not support primitive (int, double, etc.) types.

Despite these limitations, we’ve found Java to be a great language for developing scientific software.  Java is a stable, reliable platform, with a huge built-in runtime library and a massive number of third party libraries, both commercial and open source.  When writing hundreds of thousands (or millions) of lines of code, reliability, maintainability, and comprehensibility become paramount. In our experience, writing 99% of our code in Java and calling out to native code for inner-loop numerical algorithms provides a great power-to-weight ratio.

In this blog, we’d like to share with you some of our solutions for getting the maximum scientific computing performance out of Java. We also hope to learn from you! We’ve been doing this for years, and still come across libraries and packages that we wish we’d known about earlier. Hopefully, with your help, we can keep this field moving forward.

Amazon and Numerate write case study on cloud computing

Amazon Web Services (AWS) just released a case study we wrote with them on our use of their cloud services. Our business is truly enabled by AWS, so we were delighted for them to release this case study.

Intra-EC2 latency

As we’ve mentioned before, we use EC2 to handle a fair amount of our computing needs.  For many of our application workloads it’s a perfect fit, but we’ve found it to be less than perfect for workloads that are sensitive to inter-node latency.

For example, we have an application which makes tens of millions of small RPC requests to hundreds or sometimes thousands of slave nodes.  The RPC calls themselves only take a few dozen milliseconds, so network latencies above the one millisecond range have a significant impact on performance.

There have been quite a few blog posts recently discussing EC2 network latency (this one from Alan Williamson provides a good summary), and we thought we would add our numbers to the mix.  In particular, the numbers we’re curious about are network round-trip times between nodes within a single availability zone.  Measuring inter-availability-zone and inter-region latencies would be interesting as well, but we wouldn’t expect to see “better” performance in those cases anyway.

The best data we could find on the subject came from amistrongeryet’s blog post, specifically this data set reporting ping (ICMP) round-trip times.  The most important thing to notice when looking at this data is that both the median and mean latencies are deceptively low.  For our particular algorithm, each batch of RPC calls is only as fast as its slowest individual RPC call, so far more relevant are numbers like the 99th percentile and maximum latencies.  As you can see, the “fat tail” of latencies leads to extremely large latencies at the 99th percentile;  ~30ms in this case!  We’ve seen some experiments done where people look at as few as 10 ping times before making conclusions, and that leads to completely incorrect results when the distribution has such a fat tail.

However, according to RightScale’s Thorsten von Eicken (as reported in The Register) ICMP packets are not a reliable measure of network latencies on EC2, so we decided to measure relative latencies by instrumenting our actual code and comparing round-trip times on EC2 with round-trip times in our own data center.

For the experiment, we used High-CPU Extra Large instances to mitigate any “noisy neighbor” (Williamson’s term) issues.  We then measured the total time from the initiation of each RPC call until the RPC call concluded.  This means that the numbers include the overhead of things like message serialization, thread synchronization, etc.  However, a High-CPU Extra Large instance is roughly equivalent to the computing power of our in-house cluster nodes, so we believe the relative performance differences are dominated by network latency differences, not CPU performance.  It should also be noted that all of the experiments were performed within a single availability zone in EC2-east.
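For anyone wanting to run the same kind of comparison, here is a minimal sketch of summarizing a latency log by its tail rather than its middle. It is not the analysis code we used: tail_stats is a hypothetical name, the input is assumed to be one round-trip time per line, and the nearest-rank percentile definition is simply a convenient choice.

```bash
#!/bin/bash
# Print the median, 99th percentile, and maximum of the numbers on stdin.
tail_stats() {
  sort -n | awk '
    # Nearest-rank percentile: index of the smallest sample with at least
    # p percent of all samples at or below it, i.e. ceil(p * n / 100).
    function rank(p, n,   r) {
      r = int((p * n + 99) / 100)
      return r < 1 ? 1 : r
    }
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      printf "p50=%s p99=%s max=%s\n", v[rank(50, NR)], v[rank(99, NR)], v[NR]
    }
  '
}

# Nine fast samples and one slow one: the median hides the outlier.
printf '1\n1\n1\n1\n1\n1\n1\n1\n1\n30\n' | tail_stats
# prints: p50=1 p99=30 max=30
```

With only ten samples the 99th percentile is just the maximum, which is another way of seeing why a handful of pings says little about a fat-tailed distribution.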

The data is as follows:

As you can see, in 90% of the cases EC2 latencies are on par with our in-house latencies, just a small constant factor (2X) higher.  However, above that 90th percentile, EC2’s performance falls off a cliff.  At many percentiles, we see latencies that are nearly an order of magnitude higher than we see in-house.

These numbers all seem to be fairly consistent with those we’ve seen from others like amistrongeryet, as well, despite the fact that their numbers are derived from ICMP traffic.  So, while we’d love to have found that the ICMP-based numbers were misleading, it looks like our best hope for EC2 in the near future is to focus on algorithms which are much less sensitive to the fat tail of latencies.

PS: None of these experiments were done using Amazon’s new Cluster Compute Instances, where we’d expect performance to meet (or exceed) the performance we’ve seen in-house. The folks at BioTeam have done some research of their own on this front. However, cluster instances cost about 40% more than the High-CPU instances we tested with (normalized by “EC2 Compute Units”), so users will have to weigh the increased efficiency of the cluster instances against their higher cost. Yet another good reason to focus on latency-insensitive algorithms!

Securing EC2 and S3

Our computational infrastructure requirements are simple by design.  We require processors that are as fast as possible with a minimum of a gigabyte of memory per core.  For the majority of the jobs we run, our network requirements are very minimal.  Such computational requirements make EC2 (raw compute) and S3 (raw storage) perfect for our use.  Our security model is as simple as our infrastructure requirements: be as secure as possible.  We do this using four key security tenets: isolation, access control, encryption, and monitoring/logging.

Isolation is the hardest thing to control in the cloud.  The only thing one can do is make a best attempt.  On EC2 there is compelling evidence that the hardware is currently based on 8 core machines.  We therefore only use 8 core high-CPU XL instances.  This lowers the risk of a side-channel attack and has the added advantage of getting the most out of the network card.  On S3 isolation from other S3 users is impossible, but the isolation principle can be applied to reduce the spread of possible security breaches.  Using separate buckets for each internal project/customer with appropriate changes in ACLs and encryption is the best one can do to isolate data on S3.

Access control on EC2 can be divided into two categories: access control from the outside and access control from the inside.  Much of the access control we have implemented can be found in the currently-beta offering of VPC, with the exception of S3 access.  Once access to S3 is introduced we will most likely move to using VPC, but until then our set-up is as follows.  Locally our engineering network is isolated from the internet from a user’s point of view for security reasons.  To make our EC2 cluster available locally in the secured environment we have set up a Linux VM that NATs traffic back and forth between a virtual network on EC2 and the local network.  The virtual network is created using OpenVPN.  The primary OpenVPN server is the head AMI running on EC2.  All other AMIs and the local NATing VM are OpenVPN clients.  Both NATing and access security on the local VM are controlled using a set of iptables rules.  Both the local iptables rules and the OpenVPN connections are updated automatically via custom scripts when a new cluster AMI is launched or brought down by the cluster administrator.  On EC2, each of the AMIs has a set of iptables rules as well that are dynamically set up on launch to allow communication with the head AMI.  The iptables rules restrict access for users of the cluster to the cluster AMIs and nothing else.  They also restrict external access from everywhere except cluster AMIs and the secure engineering network.  These iptables rules are used in addition to the standard AWS security group rules, which add an additional layer of access control, but only from the outside.  AWS security group rules do not restrict outgoing traffic.

I mentioned that we allow access to S3, but how do we do this given the access restrictions we have set forth?  We have a second specially designed AMI running at all times that is a proxy to S3.  This AMI is the only AMI in the cluster with access to S3 and users are restricted from accessing it directly.  The AMI is running Apache with mod_proxy and mod_perl enabled.  All connections pass through the proxy, with both access control and credentials handled by a custom PerlFixupHandler.  The handler Perl script restricts access to predefined buckets and appends the appropriate access credentials to the headers of the S3 requests.  With this setup the S3 credentials only sit on this highly secured AMI and are not available to the average user.  This prevents the possibility of someone taking the keys and accessing the data while outside the company network.  One might point out that an attack against this hardened AMI could expose the credentials.  This should be caught by the monitoring layer.  In addition, the data is also encrypted before being sent to S3.  Thus a successful attack would require cracking or circumventing two encryption systems to access the data.

Encryption is used wherever possible and practical on EC2 and S3.  All data stored on S3 is encrypted twice.  The encryption happens at the point of origin and again on S3, so the data is also encrypted in transit.  We use JetS3t to handle interactions with S3.  It has the nice features of being usable with our proxy setup and having built-in encryption functionality.  In addition to data encryption, all communication between EC2 AMIs and between EC2 and our local gateway is encrypted using OpenVPN.  The use of OpenVPN has the added advantage of having all communication on a virtualized network.  This makes security simpler and reduces the need to inform each AMI of the addition to or elimination from the cluster of other AMIs.  On the AMIs the filesystems are encrypted and an encrypted swap is created on boot in the scratch space provided under /mnt.  The disk encryption is somewhat secondary because we have designed our code, as mentioned above, to fit in hardware memory and not to hit the local disk.  As an added security measure we have set the core dump size to zero.  With such measures it would be rare for us to actually need the disk encryption except to protect log files.
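Of the measures above, disabling core dumps is the simplest to show. A minimal sketch using the standard shell builtin follows. Where the line lives (a job launcher script, a login profile) is an assumption made for the example, not a description of our configuration.

```bash
# Set the core-file size limit to zero for this shell and everything it
# launches, so a crashing process cannot write its memory image to disk.
ulimit -c 0

# Confirm the limit now in effect (prints 0).
ulimit -c
```

Without -S or -H, ulimit sets both the soft and the hard limit, so an unprivileged process started from this shell cannot raise it again.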

Monitoring will be examined in a later post.

Numerate’s Philosophy

Previously I briefly introduced Numerate and the purpose for this blog. Here I will describe our technical philosophy in a little more detail.

Numerate’s fundamental technical philosophy begins with three tenets:

  • Quantity has a quality of its own
  • Use every piece of available data
  • Encode the physics, chemistry and biology you know and make up the knowledge gaps with statistics, rather than with assumptions

With this philosophy in mind we have built a computational platform to design small molecule drugs. We begin by determining the criteria that a given drug must meet in order to be an effective treatment for any given disease. We do this in collaboration with our partners in the biotech and pharma industry based on their biological and medicinal insights. Once these criteria have been defined we develop accurate predictive models of each of the criteria and search very large spaces of compounds in order to identify molecules predicted likely to meet those criteria. The technological challenges therefore revolve around  constructing accurate predictive models, and searching large spaces of compounds.

Constructing Predictive Models

Numerate’s predictive models are primarily statistical; that is, we use modern machine learning techniques to construct predictive models of molecular properties. We begin by collecting and curating all of the available data for any given molecular property. We then apply machine learning techniques to analyze this data and produce a statistical model that can predict the given property for previously untested molecules given only their molecular drawings. Our approach routinely yields computational models that are as accurate as a laboratory experiment! Clearly, this is no trivial task and one of the aims of this blog will be to explore in detail some of the challenges that we have addressed.

Searching Large Spaces of Compounds

Once our statistical models are built we apply them to very large spaces of compounds to identify compounds predicted to meet the design criteria. These spaces typically contain ~100 billion compounds and are designed on a problem by problem basis to capture all available human intuition regarding a set of drug design criteria. The spaces are encoded combinatorially and must be explored in order to identify potential drugs. The computational challenges around this problem are substantial and we plan to explore in detail the challenges that we have addressed here, as well.

Why this blog?

Numerate, Inc. is a Bay Area startup developing a computational platform for the design of small molecule drugs. We use this platform to deliver novel, potent, and selective lead candidates to our partners in the pharmaceuticals and biotechnology industry. The purpose of this site is to discuss some of the more interesting technical challenges that we have observed and addressed during the ongoing development of our platform.

We made some significant engineering decisions early on that have shaped our efforts since. For example, all of our development is in Java and has been since late 2000. This decision has raised a lot of eyebrows over the years – Java is not exactly a traditional language for scientific computing, and had substantial performance issues in 2000. However, we have found that the benefits in terms of developer productivity have more than made up for reductions in computational efficiency and that as the quality of JVMs has improved the performance gaps have become negligible.

We also use large-scale compute resources. We constructed several large clusters (100 nodes in 2001, and 600 nodes in 2004), but started our migration to the cloud in late 2005. Since then we have run many large jobs on Amazon EC2. We were early to cloud compute, and our scientific compute workloads are different from those at which the Map-Reduce framework is aimed.

Over the next few posts we will provide a brief introduction to Numerate’s technology platform in order to frame future posts. In future posts, we will present some of our experiences and observations on Java as a language for scientific compute and how we use large scale compute and the cloud to address “synchronous”, latency-sensitive, high compute-density, scientific compute problems. Beyond that our technology team will use this blog to discuss our perspectives on other challenges we have faced, solved, or endured.