Reading Large Result Sets with Hibernate and MySQL

Our production MySQL database contains almost 1TB of data and occasionally we need to read millions of rows for processing. There are a few ways of doing this, but we’ve found that streaming is the most flexible and efficient. Below I will describe the options we’ve tried and how to set up a streaming query in Hibernate, which we use as our O/R mapping framework.

By default, Hibernate (through the MySQL JDBC driver) fetches the entire result set into memory. For most queries this is perfectly reasonable, but beyond around 50K rows, depending upon the row size, performance begins to degrade noticeably. This is discussed in the MySQL JDBC implementation notes under the ResultSet section.

What are the alternatives?

One option is to batch the result sets using offset and limit pairs and process the batches serially. The limit value bounds the amount of memory required to store each result set. Unfortunately, MySQL does not handle large offset values efficiently by default: it still reads all of the rows prior to the offset. It is common to see a query with an offset above 100,000 take over 20 times longer than the same query with an offset of zero! The trick to improving the efficiency of such a query is to apply the offset to a covering index instead, then self-join against the matched rows: the index alone finds the relevant set of row identifiers, and only the matching full rows are read from disk. For example, the query:

    SELECT id, name
    FROM users
    WHERE state = 'CA'
    ORDER BY id
    LIMIT 10000,50

would be rewritten as:

    SELECT users.id, users.name
    FROM users INNER JOIN (
        SELECT id
        FROM users
        WHERE state = 'CA'
        ORDER BY id
        LIMIT 10000,50
    ) AS ids USING (id)

This is a nice trick to have in your toolbox, but it may not be usable with certain schemas or queries. For instance, it would not work with a query that joins on another table. In addition, the Hibernate Query Language (HQL) does not support joining against a subquery (a derived table) in the FROM clause, so you would need to use a Hibernate native SQL query.
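
For illustration, here is a minimal sketch of what that native SQL query might look like through the Hibernate Session (Hibernate 3.x API); the table and column names are the ones from the example above, and the offset and batch size are placeholder values:

    // Sketch: the covering-index rewrite as a Hibernate native SQL query.
    // Table/column names come from the example above; the offset and
    // batch size are placeholder values.
    List<Object[]> rows = session.createSQLQuery(
            "SELECT users.id, users.name" +
            " FROM users INNER JOIN (" +
            "   SELECT id FROM users WHERE state = :state" +
            "   ORDER BY id LIMIT :offset, :batch" +
            " ) AS ids USING (id)")
        .setString("state", "CA")
        .setInteger("offset", 10000)
        .setInteger("batch", 50)
        .list();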

A second option is to again batch the results, but convert the limit and offset into a sorted range query on an indexed field. For example, each batch stores the largest primary key identifier seen so far and the subsequent batch query returns rows greater than that identifier. (A timestamp field could also be used.)

    SELECT id, name
    FROM users
    WHERE id > 10000
    AND state = 'CA'
    ORDER BY id
    LIMIT 50

This option does not suffer from the issues of the first, but it too may not be usable with certain schemas. Remember, MySQL can only use a multi-part index up to and including the first range condition, so the field the range applies to should be the last field of the index that the query uses.
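
To make the batching concrete, here is a rough sketch of the range-query loop; the batch size and the row handling are hypothetical, so adapt them to your own schema:

    // Sketch: range-based batching on the primary key. The batch size
    // and row processing below are placeholders.
    long lastId = 0;
    int batchSize = 1000;
    while (true) {
        List<Object[]> batch = session.createSQLQuery(
                "SELECT id, name FROM users" +
                " WHERE id > :lastId AND state = :state" +
                " ORDER BY id LIMIT :batchSize")
            .setLong("lastId", lastId)
            .setString("state", "CA")
            .setInteger("batchSize", batchSize)
            .list();
        if (batch.isEmpty()) {
            break;  // no rows left
        }
        for (Object[] row : batch) {
            // process row[0] (id) and row[1] (name) here
        }
        // remember the largest id seen; the next query resumes after it
        lastId = ((Number) batch.get(batch.size() - 1)[0]).longValue();
    }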

Finally, a third option is to use a database cursor and stream the results one row at a time. As with the other options, there are a few caveats: 1) a connection must be held open as long as the statement cursor is active, 2) no other queries can be run on that connection, and 3) locks held by the query are not released until the statement or transaction is closed. What are the implications for a production system? Because a cursor ties up a connection until it is finished, you will need to support at least as many concurrent connections as you expect active cursors, plus the number of connections needed to support any other concurrent queries. This is easily configured in a connection pool (you’re using one, right?). For example, if you are using c3p0 with Hibernate, you would set the max_size property in your hibernate.cfg.xml configuration file:

    <property name="hibernate.c3p0.max_size">20</property>

The last caveat means that the statement’s locks will not be released until all of the streaming rows have been read and processed. So if the query takes any exclusive locks, other connections trying to access the same rows concurrently may block until those locks are released. If you can live with these caveats, you can use streaming to speed up any query that returns a large result set.

Setting up streaming

Let’s take a look at how you configure a streaming query. The aforementioned documentation says:

    To enable this functionality, create a Statement instance in the following manner:
    stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                    java.sql.ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);

This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:

    Query query = session.createQuery(hql);  // hql is your HQL query string
    query.setReadOnly(true);
    // MIN_VALUE gives a hint to the JDBC driver to stream the results
    query.setFetchSize(Integer.MIN_VALUE);
    ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
    // iterate over the results
    while (results.next()) {
        Object row = results.get(0);  // get() without an index returns Object[]
        // process the row, then release your reference to it
        // you may need to flush() and clear() the session periodically
    }
    results.close();

This allows you to stream over the result set; however, Hibernate will still cache the results in the Session, so you’ll need to call session.flush() and session.clear() every so often to release them. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand.
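
For the read-only case, a rough sketch with a StatelessSession might look like the following (assuming a mapped User entity with a state property, which is hypothetical here); since a StatelessSession has no persistence context, there is nothing to flush or clear, but lazy loading and cascades will not work:

    // Sketch: streaming with a StatelessSession (hypothetical User entity).
    StatelessSession session = sessionFactory.openStatelessSession();
    try {
        Query query = session.createQuery("from User u where u.state = :state");
        query.setString("state", "CA");
        // MIN_VALUE hints to the MySQL JDBC driver to stream the results
        query.setFetchSize(Integer.MIN_VALUE);
        ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
        try {
            while (results.next()) {
                User user = (User) results.get(0);
                // process user; nothing is cached, so no flush()/clear() needed
            }
        } finally {
            results.close();
        }
    } finally {
        session.close();
    }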

That’s all it takes to get streaming working in Hibernate! For one of our smaller queries, which returns about 110,000 rows, simply switching to streaming yielded about a 35% speed improvement, though of course your mileage may vary. For other questions regarding MySQL performance, I’d highly recommend High Performance MySQL by Schwartz et al. It is an invaluable reference for understanding and boosting MySQL performance and, in fact, is where I found the tips for the first two options.
