Summary

Optimize complex queries on the server for efficient execution.

Motivation

We introduced a Criteria Builder, that lets people build complex queries to run entirely on the server. This allows the client to avoid multiple network calls and post filtering. Since the queries are run on the server, we can do some intelligent optimization to make sure we don't do any more work than necessary to come up with the correct result.

Use Cases

I want to write complex queries and I want them to return results as quickly as possible.

Effects on Public API

There will be no affects on the public API. This change will be "under the hood".

Effects on Thrift API

There will be no affects on the Thrift API.

Optimization Techniques

Given a query represented as an AST, how do you analyze the tree and come up with a plan to do the least amount of work necessary to compute the correct result?

Both branches of an ORTree must be evaluated
If one branch of an ANDTree has an empty result set, then we can reduce the entire tree to an empty result set without evaluating the other branch
If one branch of an ANDTree has few enough results, then it might be faster to evaluate other branch by checking to see if the results in the first branch equal the criteria for the second branch.
- For example, if the criteria A has 5 results, then for each of those results, see if they each meet criteria B instead of finding all the results that meet criteria B and taking the intersection of A and B.
- This would be faster if all the results of A are in memory or smaller than B
A leaf that is in memory is faster to evaluate than one that is not in memory

Set<Long> resA = null;
Set<Long> resB = null;
if(AND){
	if(A in memory and A is present state){
		resA = get result set for A
	}
 
	if(B in memory and B is present state && (resA != null && !resA.isEmpty())){
		sizeB = get result set for B
	}
 
	if((resA != null && resA.isEmpty()) || (resB != null && resB.isEmpty())){
		return empty set
	}
	else{
		resA = resA == null ? get result set for A : resA;
		resB = resB == null ? get result set for B : resB
		return Sets.intersection(TCollections.smallerBetween(resA, resB), TCollections.largerBetween(resA, resB);
	}
}
else { //OR
	resA = get result set for A
	resB = get result set for B
	return Sets.union(TCollections.smallerBetween(resA, resB), TCollections.largerBetween(resA, resB));
}
 
 
// Calculating cost

Important Questions and Semantics

Should we record stats like index size and memory presence in order to help with optimization?
- If we do record stats, should we guarantee that they are always accurate and up-to-date or should we only guarantee that they are up-to-date within a certain interval and therefore they are only suitable to use as hints?
- Are stats persistent?

Implementation Plan

Task	Notes
Convert Criteria to AST
Convert AST to optimal execution plan

Concourse

Query Optimization