Thursday, June 21, 2012

JPA2 is very inflexible with eager/lazy fetching

This must be a common situation with JPA2, but seems poorly catered for. I feel like I must be missing something obvious. It's amazingly hard to override lazy fetching of properties on a per-query basis, forcing eager fetching of normally lazily fetched properties.

There doesn't appear be any standard API to control whether fetching of a given relationship or property is done eagerly or lazily in a particular query. Nor is there a standard hint for setHint(...) - EclipseLink offers some limited control, Hibernate has no hint equivalent to its own Criteria.setFetchMode( ). JPQL/HQL's left join fetch and the (somewhat broken) Criteria equivalent scale very poorly to more than a couple of properties or to nested lazy properties, and don't permit other fetch strategies like subselect or SELECT fetching to be used.

Tell me I'm wrong and there's some facility in JPA for this.

Read on before replying with "just use left join fetch". I wish it were that simple, and not just because a join fetch isn't always an appropriate strategy.

A fixed global fetching mode per-property is extremely inflexible

Imagine: You have an entity that's fairly central in your relationship graph - it refers to or has references from several different things, and is only a couple of steps away from most entities. Let's call it Customer.

You need to load this entity frequently when accessing other entities, so you want to minimise the cost of accessing it. Consequently, most of its 1:m and 1:1* relationships are defined as lazily fetched in the entity mappings using appropriate OneToMany(fetch= FetchType.LAZY, ...), @ManyToOne(fetch=FetchType.LAZY, ...) and OneToOne(fetch=FetchType.LAZY, ...) annotations. Everything seems to work OK.

Then you need to efficiently fetch the entity and some subset of its lazy relationships. Say you want the Customer entity and its Contact and Address records, using the usual example terms. Or maybe you have a (possibly materialized) summary view of customer data you need to fetch alongside the customer. You don't want to fetch that info all the time, it'd be hopelessly inefficient to have every Customer referenced everywhere in your app pull in lots of data that's unnecessary 99% of the time. That's why the relationships are lazy.

Force loading via lazy loading proxy access


With appropriate entity class weaving (performed by default under most circumstances) lazy relationships are transparently loaded on access**. That means that you can iterate over your query results and touch the properties of interest - with collection properties, using .size(), and with 1:1 proxy properties, using any accessor. If the collections load thin proxy objects you might need to hit an accessor on the collection contents to load them too. This all seems very clunky and verbose - which it is, but there's a bigger problem. The JPA implementation doesn't know in advance what you're going to ask for, so it's issuing hundreds of tiny little SELECT statements that're hammering your database. Worse, if you're in READ COMMITTED isolation or outside a transaction, there's no guarantee the lazy loaded data has a consistent view of the database; entities might've been changed or deleted since the original query. This isn't good, so relying on lazy loading proxying isn't going to be the answer to preload significant amounts of data.

At best, this approach will produce classic ORM "n+1 queries" behaviour - or for nested relationships, approximately "1 + n + m*n" (where m is the number of sub-entities per n). This gets nonlinear fast and will cause your database admin to come knocking on your door ... with an axe.

This approach also only works if the EntityManager your entities came from remains open while you're traversing the graph touching properties.

Use JPQL / HQL left join fetch


Another possible answer is to force a left join fetch. JPQL's "left join fetch" functionality is inherited from Hibernate's HQL along with most of JPQL. When used, it asks the persistence provider to produce a query that selects fields from the joined entity via a LEFT OUTER JOIN and ensure that they're turned into loaded entities. That sounds great, but:

  • EntityA left join fetch EntityB can be an incredibly inefficient way to fetch EntityA. If EntityA is big and/or there are lots of EntityB entities per EntityA the fact that each field of each EntityA must be repeated in the database resultset for each instance of EntityB returned really starts to bite. Imagine how much worse it gets if you add "left join fetch EntityC". This approach scales horrifyingly badly. Not only does the database have to repeat all this data and send it down the wire, but the ORM has to de-duplicate it at the other end!
  • It muddles query semantics with fetching. The query doesn't care about EntityB, it's only included because we expect to need it later. Ugly.
  • It can't be used with a simple EntityManager.find(...), you must use proxy object loading or a separate query
  • You can't use fetch strategies other than left outer join fetching with left join fetch

Use the Criteria API's "Root.fetch(...)"


Theoretically the Criteria API has an equivalent to JPQL's left join fetch using Root.fetch(...). In practice it's hard to get right; it's easy to get a second unwanted join or an error like "query specified join fetching, but the owner of the fetched association was not present in the select list".

If you're working with a criteria API query created somewhere else where you can't change the acceptable results, I struggle to see any way to ask the JPA implementation to eagerly fetch certain properties for you.

Root.fetch(...) also suffers from all the limitations mentioned for JPQL's left join fetch.

Use a query hint to specify the fetching strategy


When using Hibernate's native Critria API, the API provides a setFetchMode(...) method that lets you specify on a per-property basis whether properties should be eagerly or lazily fetched and which fetching strategy should be used. It doesn't offer control of SELECT batch sizes this way, but that can be set globaly. It doesn't matter, because there is no equivalent in JPA2's Criteria API.

EclipseLink offers extensions offers hints for batching and fetch strategy control, eg query.setHint("eclipselink.batch", "e.address") and query.setHint("eclipselink.join-fetch", "e.address");. There is no standard interface to these features, and (as far as I've been able to determine) no equivalent functionality in the hints supported by Hibernate.

Build a new query to manually fetch the desired properties after the fact


If you're able to run more queries against the same entity manager before the entities are detached - which is not always the case - you might be able to build and run additional queries to fetch the desired properties. This often involves repeating work the first query did and can be pretty cumbersome, but it works.

Fetching and fetch strategies are vital for performance


Fetching and batching strategies are important. It's often immensely faster to SELECT a bunch of entities, then issue a couple more queries to SELECT related entities. The secondary queries might use the same predicates etc as the main query and a different select-list, or they might just SELECT entities using SELECT ... FROM entityTable WHERE id IN (...). Either way has approaches, and neither of them requires the ORM layer to read and de-duplicate massive amounts of data.

A dependence on join fetches also prevents JPA implementations from effectively using new features like PostgreSQL's record_to_json feature, which would allow the ORM to fetch entity graphs as easily parsed, duplication-free JSON data structures. A "JSON" fetch strategy would be very hot, but right now you're lucky if you can get you JPA implementation to do an eager left join fetch.

Proper fetch strategies are vital for correctness too


When asking around about these issues, I often received advice to just use EclipseLink, especially from Vaadin folks. It's pointed out that EclipseLink produces proxies for lazily loaded objects that can use a new EntityManager to load the object if the entity is detached from the original EntityManager.

It's scary that people are relying on that.

Why?

Unless optimistic locking is used it throws all consistency guarantees provided by the database completely out the window. The lazy entity is loaded in a different transaction and may not reflect the same database state that the original entity did. Optimistic locking will protect you from some simple errors, but it's still really easy to get burned.

It doesn't help that it's really slow - it's the ultimate way to force everything through n+1 queries and hammer your database with masses of tiny little queries, each of which makes a round trip through a lot of expensive layers.

Hibernate's lazy loading throws an exception rather than background-loading the entities for good reasons, not just because of internal architectural limitations. It shouldn't be necessary to use this crutch.

What's the answer?


Frankly, you tell me. I'm stumped. In simple cases a JPQL left join fetch does the job, but for many situations there doesn't seem to be a good answer. The Internet is full of questions about how to eagerly load on a per-query basis in JPA, and I haven't seen any good answers.

This urgently needs addressing in the JPA 2.1 spec and for Java EE 7. Support is needed for:

  • Per-query, per-property control over fetch mode, where normally-lazy properties can be fetched eagerly or normally-eager properties can be set to lazy. Dot-notation traversal to nested properties ("a.b.c") should be supported.
  • Control over fetch strategy with the same granularity or at least on a per-query basis, so eager fetching of sub-entities can be done using SELECT, SUBSELECT, JOIN, or (extensibly) other yet-to-be-defined methods like JSON or GRAPH fetching from smart databases
  • Some influence over fetch strategy parameters like batch sizes



* Why lazily fetch 1:1 relationships? Because the're a 1:1 relationship not direct composition for a reason. You probably want to control when the other side of the relationship gets fetched; maybe it's big or expensive or has different locking rules or changes a lot. Or maybe it's just vital that the main entity be as lightweight as possible because it's loaded a LOT.

** Hibernate (4.1, at least) won't lazily load OneToOne relationships by default, because it doesn't have a container to use as a proxy, and it can't generate a proxy for an object that might be null. It'll only lazily load the relationship if its mapped optional=false. The Hibernate @LazyToOne(...) annotation can be used to hint to Hibernate about how you want it loaded, but can't force Hibernate to proxy a possibly-null relationship.

6 comments:

  1. Your concern for lack of configurability in access pattern within JPA is a valid one. JPA has not properly adressed the real life scenrio where access graph of a root object can vary per use case/instance basis. Both FETCH JOIN and LAZY/EAGER fetch type have obvious limitatios.

    It may be useful to learn about advanced supprt for configurable FetchPlan which can cater to your need [1].
    You are welcome to post further queries in OpenJPA mailing list [2], if you want to explore FetchPlan support in OpenJPA.

    Regards --

    Pinaki Poddar
    Chair, Apache OpenJPA Project
    http://openjpa.apache.org

    [1] http://openjpa.apache.org/docs/latest/ref_guide_fetch.html
    [2] http://openjpa.208410.n2.nabble.com/

    ReplyDelete
    Replies
    1. Thanks Pinaki. I've been using similar facilities in EclipseLink for a while now. I haven't had any occasion to use OpenJPA yet, so I didn't realise it had similar features. Thanks.

      I had a quick look at the OpenJPA FAQ (http://openjpa.apache.org/faq.html) but didn't see any sort of "Why OpenJPA?" entry. Maybe it's worth adding something like that, noting some of the pluses and minuses of OpenJPA for people trying to select persistence providers.

      Delete
  2. Thanks for the post, I believe you hit the nail right on the head. I am facing similar problems and am turning to native SQL since the performance on queries is horrible with JPA (because of lazy-loaded entities). Next stop NoSQL databases.

    ReplyDelete
  3. Amen, brother. I am on my 3rd project using JPA and struggling with it on a daily basis for doing common operations, precisely due to the problems that you highlight. The situation where you have A ->> B -> C and need to fetch A is very common and by default JPA can issue hundreds of queries, killing performance.

    This is such an obvious problem that I have searched in vain for good support in either the JPA spec or in the individual implementations, but surprisingly it does not seem well addressed. EclipseLink is beginning to add features that help such as query hints and dynamically configurable entities, but the basic problem that the out-of-the-box functionality is not performant is really egregious.

    I ask interview candidates who say they have worked on a JPA project if they have had any performance problems and what did they do about them, but I rarely get good answers. I think a lot of people are working on toy databases without enough data to matter, and/or don't have very high performance goals. I work primarily in enterprise software where hundreds of tables and hundreds of thousands of rows of data are the norm, and trying to use JPA is a constant struggle of case-by-case workarounds. On one project we finally just ripped Hibernate out of the system and replaced it with JDBC and stored procedures so that we could have better control over the querying.

    I am not sure why this is not an urgent issue in the JPA community, because I believe the technology is essentially unusable for anything but the smallest of systems.

    ReplyDelete
  4. Another amen from me! In the ideal world we could specify which fields should be selected and let JPA figure out how to do this efficiently without N+1 selects, perhaps providing some hints here and there for join/subselect/batch select (in the even more ideal JPA could automatically determine the best strategy).

    In my case I also have a Customer object and a Reservation object with different use cases that need to fetch different parts of the object graph. OpenJPA has already done a great job with FetchPlans that can be created on a use case basis. EclipseLink now also has similar features and Hibernate has fetch profiles.

    I feel that all of the JPA implementations are only a few steps away from this ideal, but reaching it somehow this doesn't seem to be of high priority for them.

    To illustrate: suppose we have an eager bidirectional Order @ManyToOne Customer and we wish to select orders by date (i.e. we also fetch their customer and are interested in all orders of these customers).
    - OpenJPA uses an N+1 select for customer.orders for each order in the query. It makes it's own decisions on the fetch strategy, this is not configurable. (https://issues.apache.org/jira/browse/OPENJPA-2296)
    - Hibernate: can be solved by annotating customer.order using @Fetch(FetchMode.SUBSELECT) or @BatchSize. However these strategies are not yet available for fetch profiles (https://hibernate.onjira.com/browse/HHH-4048).
    - EclipseLink has a similar annotation but this doesn't work for bidirectional (please correct me if I'm wrong).

    ReplyDelete
  5. @Randy I have a similar experience with large enterprise application. In one of my previous projects we ran into performance issues [which was evident from the projected database sizes being in TB's]. Since we did not have control over the stack we used JPA and tried every bit to optimize for performance but soon hit the wall.

    Thanks for the article. Nicely done.

    ReplyDelete

Captchas suck. Bots suck more. Sorry.