Summary
Allow users to programmatically build complex find queries (e.g. conjunctive, disjunctive, combination, etc).
Motivation
Right now, there is no support for building complex find queries. Users can do this manually by doing a bunch of separate queries and post filtering the results, but this isn't an ideal long-term SOP. If this is supported in the product, then it lowers the barrier to entry and it gives us the opportunity to make those complex queries more efficient (i.e. we can cache certain sub queries, do query planning, etc).
Use Cases
- I want to find records that match all of several criteria (AND)
- I want to find records that match one of several criteria (OR)
- I want to find records that match all of several criteria and/or one of several criteria (AND/OR)
- I want to group different criteria in my query (parenthesis, order of operations)
Effects on Public API
The following methods will be added to the Public API and will therefore have a client-side implementation:
- public Set<Long> find(Criteria criteria)
The following classes will be added to the Public API:
- Criteria
- CriteriaBuilder (or Criteria.Builder)
Effects on Thrift API
- Add a TCriteria struct (this must be wrapped by the Criteria object in the public api)
- Must deprecated or change the current #find() method
- Implement #find() method or new method that takes a Criteria as a param (i.e. set<i64> find0(data.Criteria criteria))
- Make sure that ConcourseServer takes simple queries and turns them to a Criteria and sends them to the appropriate thrift ConcourseService method.
Important Questions and Semantics
- Should the appropriate processing and post-filtering for a complex find be done client or server side?
- We should probably do this on the server so that these queries are performant. I'm really concerned about all the network overhead of the client sending multiple separate queries over the wire.
- If this is done on the client, then complex queries are somewhat analogous to a compound operation
- This means that our results can become invalid if a spurious write changes the data in between network calls
- Example: Assume my criteria is find all records where "name = 'John' AND age = 23". If the client has to do separate calls and post filter the results then its possible a record that matches the first query can be changed to no longer match the first query after the result is returned and the client moves onto the second query.
- This means that our results can become invalid if a spurious write changes the data in between network calls
- If this is done on the server, the complex query is analogous to a repeatable atomic operation
- Should we deprecate the #find methods that take simple parameters?
- No: Builders are inherently not user friendly so we should not force the user to use one of a simple query
- Yes: Overloaded methods that take different kinds of parameters is bad API design
- Why a programatic builder as opposed to a grammar parser?
- Don't trust raw string input from users. We can regulate things a lot more if this is done programmatically with a builder.
- This project may form the basis of a larger database language (e.g. Concourse Action Language (CAL))
- This means that we need to write this feature in a way that can easily be leveraged and extended in the future
Building Process
- User creates Criteria from builder.
- The client goes through each symbol in the Criteria and converts them to the appropriate TSymbol, building a TCriteria along the way
- All sub criteria is expanded and prepended/appended with the appropriate parenthesis symbols
- The server receives the TCriteria, goes through the list of TSymbols and converts them back to Symbols
- The server does the shunting yard alg on the list of Symbols.
Context Free Grammar
We need to build a CFG that can control the state of the builder and also parse hand written queries. The grammar must support the following cases:
- Simple queries: key operator value
- Conjunctive queries: key operator value AND key operator value
- Disjunctive queries: key1 operator value1 OR key2 operator value2
- Combined queries: key1 operator value1 AND key2 operator value2 OR key3 operator value3
- Grouped queries: (key1 operator value1 AND key2 operator value2) OR key3 operator value3
S → SS S → (S) S → () S → key KEY KEY → operator OPERATOR OPERATOR → value VALUE OPERATOR → value VALUE → value VALUE VALUE → value VALUE → and AND VALUE → or OR AND → S OR → S
Psuedocode
TSymbolType
A representation for an enum that declares the type of a TSymbol.
enum TSymbolType { CONJUNCTION = 1, KEY = 2, VALUE = 3, PARENTHESIS = 4 }
TSymbol
A representation for a Symbol that can be passed over the wire via Thrift. Once passed over the wire, the server uses information about the symbol type to parse the string representation of the symbol to an actual object.
struct TSymbol { 1:required TSymbolType type; 2:required string symbol; }
TCriteria
A representation for a Criteria that can be passed over the wire via Thrift. Once passed over the write, the server goes through the list of TSymbols and converts them to actual Symbol objects which can then be used in the shunting-yard algorithm.
struct TCriteria { 1:required list<TSymbol> symbols }
Criteria
A wrapper around an in-order collection of symbols. A Criteria is generated from the completion of a state machine that comprises a builder.
// The client must convert a Criteria to a TCriteria and while doing so and encountering an embedded Criteria, it should add a open paren then add all the symbols from the // sub criteria and finally add a closing paren. Then the server will take the TCriteria and make a Criteria that has no sub Criteria objects but has parenthesis in the list // of symbols. public class Criteria implements Symbol { public static Builder builder() { return new StartState(new Criteria()); } private final List<Symbol> symbols = Lists.newArrayList(); protected Criteria(); protected void addSymbol(Symbol symbol); }
State
The base class and marker for any valid state for the builder. Each state is passed the current Criteria and holds a reference. For any method called from the state, a symbol is added to the Criteria or the Criteria is returned.
abstract class State { Criteria criteria; }
StartState
The state that marks the beginning of a new Criteria.
class StartState extends State { StartState group(Criteria criteria); KeyState key(String key); Criteria build(); }
KeyState
The state that expects the next symbol/token to be an operator specification.
class KeyState extends State { OperatorState operator(Operator operator); }
OperatorState
The state that expects the next symbol/token to be a value specification.
class OperatorState extends State { ValueState value(Object value); }
ValueState
The state that expects the current symbol/token to be the last or the next symbol/token to be value or conjunction specification.
class ValueState extends State { ValueState value(Object value); StartState and(); StartState or(); Criteria build(); }
Symbol
A marker interface to indicate an object that can be included in a Criteria
interface Symbol {} //marker interface
Each subclass implementation of the Symbol interface should have some public static method that allows the creation of an Object in that class from a string parameter.
ConjunctionSymbol
An enum that represents the possible conjunction symbols.
enum ConjunctionSymbol implements Symbol { AND, OR; }
KeySymbol
A symbol that represents a "key" in a Criteria.
class KeySymbol implements Symbol { public String getKey(); }
OperatorSymbol
A symbol that represents an "operator" in a Criteria.
class OperatorSymbol implements Symbol { public Operator getOperator(); }
ValueSymbol
A symbol that represents a "value" in a Criteria.
class ValueSymbol implements Symbol { // When the State creates this symbol it must convert the object to a TObject so that it can be passed over the wire via Thrift public TObject getValue(); }
ParenthesisSymbol (only defined on the server)
A symbol that represents a parenthesis in a Criteria. Note that the builder doesn't actually need to use these since Criteria objects are embedded to signify groups.
enum ParenthesisSymbol implements Symbol { LEFT, RIGHT; }
Order of Operations
Given validly formed grouped criteria, we must process everything with the correct order of operations. For this, we should use the Shunting-yard algorithm. I've written this algorithm in PHP so we should port it to Java.
/** * Breakdown a valid selection criteria using the Shunting-yard algorithin * @see http://en.wikipedia.org/wiki/Shunting-yard_algorithm * @param string $criteria * @return array $criteria in reverse polish notation {@link http://en.wikipedia.org/wiki/Reverse_Polish_notation} * @since 1.1.0 */ private static function breakdownSelectionCriteria($criteria) { $rangeSeparator = "*\*"; $spaceSeperator = "++\++"; $_criteria = $criteria; $stack = array(); $queue = array(); $criteria = str_replace("(", " ( ", $criteria); $criteria = str_replace(")", " ) ", $criteria); $criteria = preg_replace("/(range) ([0-9]+) ([0-9]+)/i", "$1$2$rangeSeparator$3", $criteria); $criteria = preg_replace("/'([a-zA-Z0-9]+)([\s])+([a-zA-Z0-9]+)'[\s]*/i", "$1$spaceSeperator$3", $criteria); foreach (self::$_SELECTION_CRITERIA_OPERATORS as $operator) { $criteria = str_ireplace(" $operator", $operator, $criteria); $criteria = str_ireplace("$operator ", $operator, $criteria); } $tokens = array_filter(explode(" ", $criteria)); foreach ($tokens as $token) { $token = str_replace($rangeSeparator, " ", $token); $token = str_replace($spaceSeperator, " ", $token); if (strcasecmp($token, "and") == 0 || strcasecmp($token, "or") == 0) { while (!empty($stack)) { $topOfStack = $topOfStack = $stack[count($stack) - 1]; if (strcasecmp($token, "or") == 0 && (strcasecmp($topOfStack, "or") == 0 || strcasecmp($topOfStack, "and") == 0)) { $queue[] = array_pop($stack); } else { break; } } array_push($stack, $token); } else if (strcasecmp($token, "(") == 0) { array_push($stack, $token); } else if (strcasecmp($token, ")") == 0) { $foundLeftParen = false; while (!empty($stack)) { $topOfStack = $stack[count($stack) - 1]; if (strcasecmp($topOfStack, "(") == 0) { $foundLeftParen = true; break; } else { $queue[] = array_pop($stack); } } if (!$foundLeftParen) { throw new \Exception("Syntax error in criteria $_criteria. Mismatched parenthesis"); } array_pop($stack); } else { $queue[] = $token; } } while (!empty($stack)) { $topOfStack = $stack[count($stack) - 1]; if (strcasecmp($topOfStack, ")") == 0 || strcasecmp($topOfStack, "(") == 0) { throw new \Exception("Syntax error in criteria $_criteria. Mismatched parenthesis"); } $queue[] = array_pop($stack); } return $queue; }
Implementation Plan
Task | Description | Notes |
---|---|---|
Define Criteria grammar | Implement the rules of the grammar, define all the appropriate symbols, and define all the appropriate operations | |
Port Shunting-yard to Java | Given a collection of symbols, implement the algorithm that returns those symbols in reverse polish notation. | |
Implement Criteria builder | Implement a builder that allows the programatic creation of a Criteria | |
Edit Public API | Add the appropriate methods to the public api |