Numerati http://www.numerati.com Numerate's tech blog Tue, 26 Jun 2012 16:54:34 +0000 en hourly 1 http://wordpress.org/?v=3.3.2 Reading Large Result Sets with Hibernate and MySQL http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/ http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/#comments Tue, 26 Jun 2012 16:52:46 +0000 sean http://www.numerati.com/?p=411 Our production MySQL database contains almost 1TB of data and occasionally we need to read millions of rows for processing. There are a few ways of doing this, but we’ve found that streaming is the most flexible and efficient. Below I will describe the options we’ve tried and how to set up a streaming query in Hibernate, which we use as our O/R mapping framework.

By default, Hibernate (through the MySQL JDBC driver) fetches the entire result set into memory. For most queries, this is perfectly reasonable, but beyond around 50K rows, depending upon the row size, performance begins to degrade noticeably. This is discussed in the MySQL JDBC implementation notes under the ResultSet section.

What are the alternatives?

One option is to batch the result sets using offset and limit pairs and process the sets serially. The limit value minimizes the amount of memory required to store each result set. Unfortunately, MySQL does not treat large offset values efficiently by default and will still read all the rows prior to an offset value. It is common to see a query with an offset above 100,000 take over 20 times longer than an offset of zero! There is a trick to improve the efficiency of this query by instead doing the offset on a covering index, then self-joining against the matched rows. The index can be used to find the relevant set of row identifiers, then the full rows can be read from disk. For example, the query:

    SELECT id, name
    FROM users
    WHERE state = 'CA'
    ORDER BY id
    LIMIT 10000,50

would be rewritten as:

    SELECT users.id, users.name
    FROM users INNER JOIN (
        SELECT id
        FROM users
        WHERE state = 'CA'
        ORDER BY id
        LIMIT 10000,50
    ) AS page USING (id)
    ORDER BY users.id

This is a nice trick to have in your toolbox, but it may not be usable with certain schemas or queries. For instance, this would not work with a query that joins on another table. In addition, the Hibernate Query Language (HQL) does not support joining against a subselect in the FROM clause, so you would need to use a Hibernate native SQL query.

A second option is to again batch the results, but convert the limit and offset into a sorted range query on an indexed field. For example, each batch stores the largest primary key identifier seen so far and the subsequent batch query returns rows greater than that identifier. (A timestamp field could also be used.)

    SELECT id, name
    FROM users
    WHERE id > 10000
    AND state = 'CA'
    ORDER BY id
    LIMIT 50

This option does not suffer from the issues of the first, but it also may not be usable with certain schemas. Remember that MySQL stops using the remaining fields of a multi-part index after the first range condition, so the range field should be the last field of the index; for the query above, that means an index on (state, id).
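The batching loop for this second option can be sketched as follows. This is a minimal in-memory illustration rather than our production code: `fetchBatch` is a hypothetical stand-in for running a range query of the form `WHERE id > ? ORDER BY id LIMIT ?`, and the sorted list of ids stands in for the indexed column of the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeysetPaging {
    // Hypothetical stand-in for:
    //   SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?
    // `sortedIds` plays the role of the indexed id column.
    static List<Long> fetchBatch(List<Long> sortedIds, long lastSeenId, int limit) {
        List<Long> batch = new ArrayList<>();
        for (long id : sortedIds) {
            if (id > lastSeenId) {
                batch.add(id);
                if (batch.size() == limit) {
                    break;
                }
            }
        }
        return batch;
    }

    // Reads every row in batches, remembering the largest id seen so far
    // instead of an ever-growing offset.
    static List<Long> readAll(List<Long> sortedIds, int batchSize) {
        List<Long> processed = new ArrayList<>();
        long lastSeenId = Long.MIN_VALUE;
        while (true) {
            List<Long> batch = fetchBatch(sortedIds, lastSeenId, batchSize);
            if (batch.isEmpty()) {
                break; // no rows left beyond lastSeenId
            }
            processed.addAll(batch); // "process" the batch
            lastSeenId = batch.get(batch.size() - 1);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(3L, 7L, 8L, 15L, 42L);
        System.out.println(readAll(ids, 2)); // prints [3, 7, 8, 15, 42]
    }
}
```

Because each batch starts from the last identifier seen rather than from a row count, gaps in the id sequence do not matter and no batch has to skip over rows that were already read.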

Finally, a third option is to use a database cursor and stream the results one row at a time. As with the other options, there are a few caveats: 1) a connection must be held open as long as the statement cursor is active, 2) no other queries can be run on that connection, and 3) locks held by the query are not released until the statement or transaction is closed. What are the implications for a production system? Because a cursor ties up a connection until it is finished, you will need to support at least as many concurrent connections as you expect active cursors. Also add to this the number of connections needed to support any other concurrent queries. This can be easily configured in a connection pool (you’re using one, right?). For example, if you are using c3p0 with Hibernate, you would set the max_size property in your hibernate.cfg.xml configuration file:

    <property name="hibernate.c3p0.max_size">20</property>

The last caveat means that the statement locks will not be released until all of the streaming rows have been read and processed. So, if there are any other connections trying to access the same rows concurrently, they may be blocked until those locks are released (if they are exclusive locks). If you can live with these caveats, you can use streaming to speed up any query that returns a large result set.

Setting up streaming

Let’s take a look at how you configure a streaming query. The aforementioned documentation says:

    To enable this functionality, create a Statement instance in the following manner:
    stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                    java.sql.ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);

This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:

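    import org.hibernate.Query;
    import org.hibernate.ScrollMode;
    import org.hibernate.ScrollableResults;

    // hql holds the HQL query string
    Query query = session.createQuery(hql);
    query.setReadOnly(true);
    // MIN_VALUE gives hint to JDBC driver to stream results
    query.setFetchSize(Integer.MIN_VALUE);
    ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
    try {
        // iterate over results
        while (results.next()) {
            Object[] row = results.get();
            // process row then release reference
            // you may need to evict() the entity or clear() the session as well
        }
    } finally {
        // always release the cursor so the connection can be reused
        results.close();
    }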

This allows you to stream over the result set; however, Hibernate will still cache the loaded entities in the Session, so you’ll need to call session.clear() every so often (preceded by session.flush() if you have pending changes), since flush() alone does not release the cached entities. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand.

That’s all it takes to get streaming working in Hibernate! For one of our smaller queries, which returns about 110,000 rows, just changing to streaming results in about a 35% speed improvement, though of course your mileage may vary. For other questions regarding MySQL performance, I’d highly recommend High Performance MySQL by Schwartz et al. It is an invaluable reference for understanding and boosting MySQL performance and, in fact, was where I found tips for the first two options.

Evaluating ligand-based algorithms: Noise http://www.numerati.com/2012/06/19/evaluating-ligand-based-algorithms-noise/ http://www.numerati.com/2012/06/19/evaluating-ligand-based-algorithms-noise/#comments Wed, 20 Jun 2012 05:49:29 +0000 nigel http://www.numerati.com/?p=447 We as a community are acutely aware that the data we have with which to build models can be extremely noisy. There are two approaches to dealing with this. First, only use consistent, low-noise sets of data. Second, develop noise-robust fitting procedures. I don’t intend to really argue for one of these here, except to note that I believe we have to take the second approach and leverage all available data.

With regard to evaluating algorithms, the first approach does not require much further discussion here. The second approach is more interesting.

The correct way to approach developing noise-robust algorithms is to begin by assessing the typical nature and quantity of noise on our problem of interest. We then develop techniques to deal with these particular kinds of noise. The reason this is the correct approach is that general robustness to arbitrary noise is hard to achieve, very hard. We always want to leverage our knowledge and understanding of the problem at hand to make it easier. Dealing with noise is an issue where this approach can provide significant leverage.

However, if we have developed algorithms to deal with the kinds of idiosyncratic noise that arise in our data, we can’t evaluate them on data that doesn’t present similar kinds of noise. Suppose for example that we have two components to our fitting procedure: a learning algorithm A, and a noise robustness module B. Now suppose we use a set of data to evaluate the relative performance of A alone against A combined with B (AB). Clearly, if the evaluation data is noise free, or exhibits different kinds of noise than those for which B was developed we should not expect AB to perform better. In fact, we can reasonably expect it to perform worse.

More generally, it is not reasonable to compare algorithms perfected for noise-free data versus algorithms perfected for noisy data on noise-free data. We must decide a priori which case is most relevant to the problem at hand and evaluate that case.

(Note that when I refer to evaluation data I mean both the training and testing data used for the evaluation.)

Evaluating ligand-based algorithms http://www.numerati.com/2012/06/04/evaluating-ligand-based-algorithms/ http://www.numerati.com/2012/06/04/evaluating-ligand-based-algorithms/#comments Tue, 05 Jun 2012 03:36:11 +0000 nigel http://www.numerati.com/?p=407 What is the best way to evaluate or compare ligand-based algorithms? I’ve found myself spending quite a lot of time on this question recently. It’s both interesting and hard. And most of what is done in the computational chemistry literature isn’t right.

So, I’m going to write a series of blog posts aiming to clarify the issues.

Generally speaking there are three main axes:

1) What is the problem we want to solve? In other words, what are we going to use our predictions for?

2) What are the key features of the data available to us for building and evaluating models?

3) What are the appropriate statistical techniques/tests for assessing the results?

I’m going to address some of these issues in detail in later posts. Here I’d just like to say why each of these is important.

First, understanding how you’re going to use your model is key. It’s essential to understand that any model you build is going to make mistakes. It’s going to have errors. The key to problem definition then is deciding where to put those errors, or alternatively which errors to penalize and how much. For example, suppose you’re fitting atom centered partial charges for a force-field based on quantum data. You could minimize mean squared error, you could minimize the mean absolute error, or you could minimize the maximum deviation. You will get vastly different results in these different cases. Furthermore, the appropriate algorithm, representation, and data for fitting each objective function will be different. So, if what you really care about is maximum deviation, it would be a mistake to evaluate alternative fitting algorithms by measuring mean squared error.
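As a toy illustration of how much the choice of objective matters (far simpler than charge fitting, and with made-up numbers), consider the simplest possible model: a single constant fitted to a handful of values. The constant that minimizes mean squared error is the mean, the one that minimizes mean absolute error is the median, and the one that minimizes the maximum deviation is the midrange. The class and method names below are purely illustrative.

```java
import java.util.Arrays;

final class ObjectiveDemo {
    // Constant minimizing mean squared error: the arithmetic mean.
    static double bestForSquaredError(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Constant minimizing mean absolute error: a median.
    static double bestForAbsoluteError(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // Constant minimizing the maximum deviation: the midrange.
    static double bestForMaxDeviation(double[] xs) {
        double min = xs[0];
        double max = xs[0];
        for (double x : xs) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        return (min + max) / 2.0;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 100}; // made-up values with one outlier
        System.out.println(bestForSquaredError(data));  // prints 22.0
        System.out.println(bestForAbsoluteError(data)); // prints 3.0
        System.out.println(bestForMaxDeviation(data));  // prints 50.5
    }
}
```

The same five numbers yield fitted values of 22.0, 3.0, and 50.5, so a procedure tuned for one objective can look poor when scored by another.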

Second, the distributional and noise properties of your data and of the underlying problem are critical to the effectiveness of algorithms and models. Therefore the experimental design needs to carefully reflect these properties. For example, if the underlying problem exhibits a particular kind of bias then the key determiner of effectiveness may be the ability to deal with that bias. If the evaluation experiment does not exhibit similar biases it will be incapable of differentiating appropriately between algorithms along this key axis.

Finally, different test designs will yield performance measures with different distributional properties, and so the appropriate statistical tests to determine effectiveness will differ. For example, Spearman’s Rank Correlation on a sample is a biased estimate of the true correlation while the same is not true for Kendall’s tau (under appropriate assumptions). This must be taken into account.

Designing appropriate experiments and evaluating them correctly is truly hard. I don’t believe that our current state of knowledge or tools allows us to solve this problem fully. I do believe, though, that standard approaches make significant errors, and that these errors are sufficient to make many of the conclusions drawn incorrect. We will have real trouble advancing the field of computational chemistry until we address these shortcomings.

Nubs http://www.numerati.com/2011/08/17/nubs/ http://www.numerati.com/2011/08/17/nubs/#comments Wed, 17 Aug 2011 17:36:23 +0000 Jessen http://www.numerati.com/?p=9 Here at Numerate, a lot of our data is stored in standard relational databases, but we also find ourselves dealing with a significant amount of semi-structured data.  This data comes from various sources, such as lab results, data files from outside vendors, our internal databases, etc. We often find ourselves needing to rapidly prototype new tools and ideas, but because we’re still experimenting with and trying to understand this data, we often don’t yet know how to organize it.

Unix pipes and tools like perl, grep, awk, R, and Octave are incredibly useful for dealing with unstructured data consisting of simple commonly used types (e.g., integers, floating point numbers, strings, associative arrays), but we frequently also have more complex data that we want to analyze (e.g., molecules, proteins, virtual assays). When using Unix pipes for ad-hoc processing, we observed that we were writing lots of small, one-off, single-use Java apps and creating lots of temporary files. Many of the apps were just a few lines of Java, and provided minimal functionality, but were nonetheless necessary. This was especially painful for users in production environments as they had to submit patches to get these one-off apps into releases!

What we realized we wanted was a simple way to expose our Java libraries and APIs to Unix pipes, and thus Nubs was born (NUmerate Bean Shell).

Nubs has a feature set heavily inspired by (read: stolen from) the Perl command-line binary. Unlike the Perl program, the scripting language is Beanshell, not Perl, but most of the behavior and command-line options are identical. There is a distributed version that lets us move particularly complicated processing on to our in-house cluster or AWS cloud cluster.  In many fields, the overhead of distribution is higher than the script evaluation, but because many of our workloads are compute-bound, not data-bound, distribution makes sense.

While our particular implementation of Nubs is closed-source (for now… stay tuned), it was remarkably easy to create.  The current implementation consists of a few hundred lines of Java code, most of which is concerned with command-line parsing and validation.  Our main regret with Nubs is that we didn’t create it earlier.

Some implementation tips for anyone thinking about rolling their own version:

  • To prevent runaway processes, it is useful to install a signal handler for SIGPIPE that terminates the program when the output pipe is disconnected.
  • Passing complex, aggregate data types is difficult to do in a Unix pipeline. As a simple, pragmatic hack, we encode such objects using Java Serialization, followed by hex-encoding. We call this serhex and have implemented two Beanshell custom commands, serhex and deserhex, to handle interconversion.
  • The original Beanshell project has been inactive since 2005, but the beanshell2 fork has many fixes and improvements that you may find useful.
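The serhex idea from the tips above can be sketched in a few lines of plain Java. This is a minimal illustration, not our actual implementation: the class name and the unchecked-exception handling are assumptions made for the example; only the technique (Java Serialization followed by hex-encoding) comes from the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class Serhex {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Java-serialize the object, then hex-encode the bytes so the result is a
    // single token with no tabs or newlines, safe to pass through a pipe.
    static String serhex(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] raw = bytes.toByteArray();
        StringBuilder sb = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    // Reverse of serhex: hex-decode, then Java-deserialize.
    static Object deserhex(String hex) {
        byte[] raw = new byte[hex.length() / 2];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = serhex("benzene");
        System.out.println(encoded.startsWith("aced")); // serialization magic bytes; prints true
        System.out.println(deserhex(encoded));          // prints benzene
    }
}
```

Hex-encoding doubles the size of the payload, which is the price of staying line-oriented. As with any use of Java Serialization, only deserialize data that comes from a source you trust.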

Here’s a sample to give you a feel for how it’s used:

nubs -e 'for (mol : readMols("input.sdf")) println(serhex(mol));' |
  nubs -p -e 'LINE + "\t" + molWeight(deserhex(LINE))' |
    sort -k 2 -n -r | head -n 10 |
      nubs -n -a -e 'appendMol("output.sdf", SPLIT[0])'

The first line executes a small script defined on the command line (indicated by the -e flag) which reads molecules out of an input file and prints the serhex’ed version of each molecule to stdout. Beanshell allows us to use the mol variable without declaring its type, which will be determined at runtime.

The next line appends a tab and the molecular weight to each line. Here, the -p flag indicates that Nubs should read from stdin and apply the script to each line (binding it to the variable LINE), printing the result to stdout.

The third line is standard Unix pipes goodness, sorting the lines by molecular weight and taking the 10 heaviest.

And finally, the last line extracts the serhex’ed molecules and writes them to an output file. The -n flag is similar to the -p flag, but no output is emitted to stdout. In concert, the -a flag splits the input LINE (by default using TAB delimiters) and stores the resultant array in SPLIT.

Before Nubs, the previous task would have required us to write an absurdly specific and short Java app that would probably never get used again. Now, we can prototype the same thing with a couple lines of script and not worry about leaving vestigial code in our codebase. For those Nubs scripts that we use a lot (like the first and last lines in the sample above), we eventually convert them into production-quality, optimized applications. As a result, we’ve compiled a small toolbox of serhex-enabled Java applications and Beanshell custom commands that have enabled some really complex operations with only a couple lines of code. But that’ll have to wait for another post… In the meantime, we hope we’ve inspired you to try your hand at  mixing Java and Unix pipes and we’d love to hear about what you come up with!

Bash Goodies: Turbocharging your History http://www.numerati.com/2011/08/03/bash-goodies-turbocharging-your-history/ http://www.numerati.com/2011/08/03/bash-goodies-turbocharging-your-history/#comments Wed, 03 Aug 2011 23:31:52 +0000 Brad http://www.numerati.com/?p=258 Everybody knows you can get a list of recently run commands in your bash shell using the history command. And it’s commonplace to use grep to search through that output to find some commands of interest. But doing so is cumbersome:  you may have to issue a bunch of different grep commands before you can find the command you were looking for, and — once found — you have to copy and paste that command from the output to the command line (or use the dangerous ‘!’ operator). And if it’s been a while, there’s a good chance the command has fallen off the end of your history file. Or, if that command was issued in another open shell session, it will be invisible to you. These are all problems we address in this Bash Goodies installment!

1. Use reverse-i-search to power through your history.

Hit CTRL-R at the bash command prompt to start a reverse-i-search through your history file. As you type each character, bash will interactively display the last command in your history which contained what you’ve typed so far. Hit CTRL-R again to take the cursor to the yet-previous instance of that substring (whether in the current command or a previous one), repeating as necessary. You can use backspace to edit the search string at any point, or hit CTRL-C if you’d like to exit out of the search. If you see the command you want to reissue, you can just hit ENTER. Or, you can hit the left or right arrow key to begin editing the command so you can issue a modified version. (And if you want to get fancy, you can use CTRL-K and CTRL-Y, killing and yanking, respectively, to combine bits and pieces of different commands.) Once you’ve gotten the hang of reverse-i-search you’ll probably find it much more powerful and convenient than grepping or (gasp) scrolling through your history with up and down arrows.

2. Make your history file big.

It doesn’t usually pay to be stingy with your history file. At some point you’ll regret not being able to summon up that magic one-liner for counting the number of roman numerals in all your text files created on odd-numbered days.  By default, bash keeps around only your most recent 500 commands. We’ve been storing the last 10000 with no ill effects, using these additions to our .bashrc files:

export HISTSIZE=10000
export HISTFILESIZE=10000

HISTSIZE sets the number of commands to save during any one session, and HISTFILESIZE sets the number of commands to save between sessions. It’s best that they be equal.

3. Instantaneously share history between bash sessions.

If you’re doing some work across multiple open terminals, it’s easy to lose track of which commands were entered where. This makes your history files terminal-dependent, and often less useful. But if you add these lines to your .bashrc file, your history will be synchronized across sessions every time you enter a command!

shopt -s histappend
export PROMPT_COMMAND="history -a; history -n; $PROMPT_COMMAND"

The first line configures bash to append to rather than overwrite your history file. The second says that every time bash generates a fresh command prompt, it should append the last run command to the history file, and load any new commands written to that file (from other shells) into the current history list.  Since these commands run when the command prompt is set, you’ll need to enter a command (or just hit enter) in the current shell to see new additions. (If you’d rather refresh the history on demand for old-school up-arrow behavior, remove the history -n part of that line and run it manually when desired.)

Another nice aspect of updating your history file after each command is that if you have to kill a shell without logging out (or you log in remotely before exiting that shell), you’ll still have access to your history.

4. Bonus: Add timestamps to your history.

If you add this to your .bashrc file, your history file will contain the date and time that each command was issued:

export HISTTIMEFORMAT="%F %T "

…Use Wisely

Of course, relying on your history file should never take the place of good documentation and scripting of your workflow. And one must be careful with a turbocharged history — it’s easy to reverse-i-search your way to a dangerous command and accidentally hit enter. But hopefully the above tips, which we’ve been using to good effect here at Numerate, will save you some time and hassle next time you realize that the “one-off” command you issued last week is going to come in handy today.  As they say, “Those who cannot remember the past are condemned to repeat it (probably with a mistake thrown in)!”

Bash Goodies: Running N of M Jobs at a Time http://www.numerati.com/2011/07/21/bash-goodies-running-n-jobs/ http://www.numerati.com/2011/07/21/bash-goodies-running-n-jobs/#comments Fri, 22 Jul 2011 02:04:35 +0000 Brad http://www.numerati.com/?p=218 This is the first post in a series in which we explore some useful tricks for getting the most out of our favorite shell here at Numerate, bash. Here we look at how a bash while loop, process substitution, and the make command can be combined to yield a one-liner capable of running a list of piped-in commands, a user-specified number at a time, until they’re all done. This can be useful when you have a long list of memory- and CPU-intensive jobs you want run on a multi-core machine: running one at a time would be too slow, running all at once would thrash memory and put too much load on the machine, and breaking into fixed-size chunks would waste CPU as the batches finished up in a staggered fashion.

The command is useful and fun to dissect, but it won’t win any readability contests. Here it is in all its migraine-inducing, stomach-churning glory:

(while read line; do echo -e "$((++i)):\n\t$line"; done; echo all: $(seq 1 $i)) \
    | make -B -j $1 -f <(cat -) all

If we plop this into a bash script called run_n.sh (be sure to start it with #!/bin/bash, since /bin/sh doesn’t support process substitution), we can run each line in commands.txt, five at a time, until all the commands are done, with:

cat commands.txt | run_n.sh 5

Let’s pick apart the script one pipe at a time:

while read line;

Reads each line of input and stores it into the line variable.

do echo -e "$((++i)):\n\t$line";

Spit out an integer, starting at one and incrementing for each input line, followed by a colon, a newline, a tab, and the input line. (Yes, C-style incrementing works!) This gets our input into Makefile format, with target names given by integers.

done;

Terminate the loop.

echo all: $(seq 1 $i)

Spit out the all target and make it depend on all the other targets. (Unfortunately make doesn’t appear to have a built-in ‘all’ target.) The $(command) syntax is just another, more readable and easily nestable version of `command`. And seq is a handy little command for generating a sequence of integers, which here correspond to the names of all our targets.

The output of these three commands is grouped by parentheses and piped into:

make -B -j $1 -f <(cat -) all

The -B flag tells make to unconditionally make all targets, so we won’t skip executing a command even if a file with the target name (e.g., “1”) already exists. The -j flag tells it to run $1 (the argument to our script) processes at a time. The -f parameter expects an explicitly named Makefile, but instead of writing a temporary one and having to delete it afterwards, we make use of bash process substitution. This tricks make into thinking that its input is coming from a file while in reality it’s coming from executing the parentheses-enclosed command. In this case, that command is just cat -, which just copies stdin to stdout. You can invoke process substitution using <(command) or >(command) wherever a file is expected for input or output, respectively.

The final argument tells make to build target all, which will run all the commands. Here we send in some sleeping and echoing, with various values to the -j parameter, to see our command in action:

# Run 1 command at a time. make prints out each command just before execution.
$ for i in $(seq 1 4); do echo "sleep 2; echo $i"; done | run_n.sh 1
sleep 2; echo 1
1
sleep 2; echo 2
2
sleep 2; echo 3
3
sleep 2; echo 4
4

# Run 2 commands at a time.
$ for i in $(seq 1 4); do echo "sleep 2; echo $i"; done | run_n.sh 2
sleep 2; echo 1
sleep 2; echo 2
1
2
sleep 2; echo 3
sleep 2; echo 4
3
4

# Run 10 commands at a time.
$ for i in $(seq 1 4); do echo "sleep 2; echo $i"; done | run_n.sh 10
sleep 2; echo 1
sleep 2; echo 2
sleep 2; echo 3
sleep 2; echo 4
1
2
3
4

Whew, that was a trip! Hopefully you found this little script educational and useful. Is there an easier way to do this? If you know of one, please send it along!

FedEx Day at Numerate http://www.numerati.com/2011/03/03/fedex-day-at-numerate/ http://www.numerati.com/2011/03/03/fedex-day-at-numerate/#comments Fri, 04 Mar 2011 00:54:29 +0000 rick http://www.numerati.com/?p=205 A few weeks ago we had our first Fedex Day. It was a great success.

Fedex Day is an offshoot of Google’s “20% time”, named by Atlassian, and popularized by Daniel Pink’s excellent book Drive. In a Fedex Day, developers take time off from normal activities and spend a day working on out-of-the-box projects, then deliver the results at the end of the day. The point of Fedex Day is to foster creativity and innovation in the team, work on pet projects that people are excited about, and have some fun.

Our version of Fedex Day went like this: Thursday afternoon we gathered to share ideas and form teams. We encouraged people to collaborate with others with whom they hadn’t worked before. Our only guideline was the projects had to benefit the company in some way. We moved computers around to co-locate teams. Pizza and Red Bull for dinner. Many people worked late into the night and came in early the next day. At 4pm on Friday, everyone demoed their results to the rest of the company.

There was an explosion of creativity resulting in some impressive projects; it was great to see what people could do in just one day. Many of the projects will become part of our internal drug design application. Others are the first step in longer-term research or development projects; showing them off raised their visibility and increased their chances of future funding. In addition, the forced focus of a 24-hour window led to some valuable insights about teamwork, our development process, and the direction of our application.

The time we spent on Fedex Day was well worth it, and we will likely repeat the experience in the future. We encourage others to try it!

Java and Scientific Computing http://www.numerati.com/2010/11/29/java-and-scientific-computing/ http://www.numerati.com/2010/11/29/java-and-scientific-computing/#comments Mon, 29 Nov 2010 22:48:23 +0000 pat http://www.numerati.com/?p=12 Java is not a language typically associated with scientific computing. Historically, Java’s performance was not competitive with languages like C and C++, but its performance has steadily improved over the past 15 years to the point where it is comparable to C/C++ in many domains.

Still, Java has its weaknesses when it comes to scientific computing. The most obvious of these revolve around Java’s lack of support for numerical methods. For example, there is not a standard set of APIs for BLAS and LAPACK, no native support for complex numbers, and the generics system does not support primitive (int, double, etc.) types.

Despite these limitations, we’ve found Java to be a great language for developing scientific software.  Java is a stable, reliable platform, with a huge built-in runtime library and a massive number of third party libraries, both commercial and open source.  When writing hundreds of thousands (or millions) of lines of code, reliability, maintainability, and comprehensibility become paramount. In our experience, writing 99% of our code in Java and calling out to native code for inner-loop numerical algorithms provides a great power-to-weight ratio.

In this blog, we’d like to share with you some of our solutions for getting the maximum scientific computing performance out of Java. We also hope to learn from you! We’ve been doing this for years, and still come across libraries and packages that we wish we’d known about earlier. Hopefully, with your help, we can keep this field moving forward.

Amazon and Numerate write case study on cloud computing http://www.numerati.com/2010/11/12/amazon-and-numerate-write-case-study-on-cloud-computing/ http://www.numerati.com/2010/11/12/amazon-and-numerate-write-case-study-on-cloud-computing/#comments Fri, 12 Nov 2010 15:51:59 +0000 brandon http://www.numerati.com/?p=192 Amazon Web Service (AWS) just released a case study we wrote with them on our use of their cloud services. Our business is truly enabled by AWS, so we were delighted for them to release this case study.

Intra-EC2 latency http://www.numerati.com/2010/10/04/intra-ec2-latency/ http://www.numerati.com/2010/10/04/intra-ec2-latency/#comments Tue, 05 Oct 2010 01:21:27 +0000 pat http://www.numerati.com/?p=10 As we’ve mentioned before, we use EC2 to handle a fair amount of our computing needs.  For many of our application workloads it’s a perfect fit, but we’ve found it to be less than perfect for workloads that are sensitive to inter-node latency.

For example, we have an application which makes tens of millions of small RPC requests to hundreds or sometimes thousands of slave nodes.  The RPC calls themselves only take a few dozen milliseconds, so network latencies above the one millisecond range have a significant impact on performance.

There have been quite a few blog posts recently discussing EC2 network latency (this one from Alan Williamson provides a good summary), and we thought we would add our numbers to the mix.  In particular, the numbers we’re curious about are network round-trip times between nodes within a single availability zone.  Measuring inter-availability-zone and inter-region latencies would be interesting as well, but we wouldn’t expect to see “better” performance in those cases anyway.

The best data we could find on the subject came from amistrongeryet’s blog post, specifically this data set reporting ping (ICMP) round-trip times.  The most important thing to notice when looking at this data is that both the median and mean latencies are deceptively low.  For our particular algorithm, each batch of RPC calls is only as fast as its slowest individual RPC call, so far more relevant are numbers like the 99th percentile and maximum latencies.  As you can see, the “fat tail” of latencies leads to extremely large latencies at the 99th percentile;  ~30ms in this case!  We’ve seen some experiments done where people look at as few as 10 ping times before making conclusions, and that leads to completely incorrect results when the distribution has such a fat tail.
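To see why the tail matters more than the median for batch-style workloads, here is a rough back-of-the-envelope model. It is an illustration under assumed numbers, not a measurement: suppose each RPC call independently lands beyond a given latency percentile with probability q, and a batch consists of N calls (N = 100 is a hypothetical batch size chosen for the example). The probability that at least one call in the batch lands in the tail is then:

```latex
P(\text{at least one tail call}) = 1 - (1 - q)^{N},
\qquad
q = 0.01,\; N = 100 \;\Rightarrow\; 1 - 0.99^{100} \approx 0.63
```

Under these assumptions, roughly two out of three batches would be held up by a call at or beyond the 99th percentile, even though 99% of individual calls are fast.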

However, according to RightScale’s Thorsten von Eicken (as reported in The Register) ICMP packets are not a reliable measure of network latencies on EC2, so we decided to measure relative latencies by instrumenting our actual code and comparing round-trip times on EC2 with round-trip times in our own data center.

For the experiment, we used High-CPU Extra Large instances to mitigate any “noisy neighbor” (Williamson’s term) issues.  We then measured the total time from the initiation of each RPC call until the RPC call concluded.  This means that the numbers include the overhead of things like message serialization, thread synchronization, etc.  However, a High-CPU Extra Large instance is roughly equivalent to the computing power of our in-house cluster nodes, so we believe the relative performance differences are dominated by network latency differences, not CPU performance.  It should also be noted that all of the experiments were performed within a single availability zone in EC2-east.
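The kind of instrumentation described above can be sketched as follows. This is a simplified illustration, not our actual measurement code: `timeNanos` wraps an arbitrary call (a trivial stand-in workload here, where the real experiment wrapped RPC calls), and `percentile` uses the nearest-rank definition, which is one of several reasonable choices.

```java
import java.util.Arrays;

final class LatencyStats {
    // Wall-clock duration of one call in nanoseconds.
    // `call` is a placeholder for issuing an RPC and waiting for its reply.
    static long timeNanos(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return System.nanoTime() - start;
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = new long[1000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = timeNanos(() -> Math.sqrt(12345.678)); // stand-in workload
        }
        // Report the tail as well as the median; the maximum is p = 100.
        for (double p : new double[] {50, 90, 99, 100}) {
            System.out.println("p" + p + " = " + percentile(samples, p) + " ns");
        }
    }
}
```

Collecting every sample and sorting at the end keeps the timing path itself cheap, at the cost of holding all samples in memory.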

The data is as follows:

As you can see, in 90% of the cases EC2 latencies are on par with our in-house latencies; just a small constant factor (2X) higher.  However, above that 90th percentile, EC2’s performance falls off a cliff.  At many percentiles, we see latencies that are nearly an order of magnitude higher than we see in-house.

These numbers all seem to be fairly consistent with those we’ve seen from others like amistrongeryet, as well, despite the fact that their numbers are derived from ICMP traffic.  So, while we’d love to have found that the ICMP-based numbers were misleading, it looks like our best hope for EC2 in the near future is to focus on algorithms which are much less sensitive to the fat tail of latencies.

PS: None of these experiments were done using Amazon’s new Cluster Compute Instances, where we’d expect performance to meet (or exceed) the performance we’ve seen in-house. The folks at BioTeam have done some research of their own on this front. However, cluster instances cost about 40% more than the High-CPU instances we tested with (normalized by “EC2 Compute Units”), so users will have to evaluate the tradeoffs between the increased efficiency but higher cost of using the cluster instances. Yet another good reason to focus on latency-insensitive algorithms!
