Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Summary

Allow users to programmatically build complex find queries (e.g. conjunctive, disjunctive, combination, etc).

Motivation

Right now, there is no support for building complex find queries. Users can do this manually by doing a bunch of separate queries and post filtering the results, but this isn't an ideal long-term SOP. If this is supported in the product, then it lowers the barrier to entry and it gives us the opportunity to make those complex queries more efficient (i.e. we can cache certain sub queries, do query planning, etc).

Use Cases

  • I want to find records that match all of several criteria (AND)
  • I want to find records that match one of several criteria (OR)
  • I want to find records that match all of several criteria and/or one of several criteria (AND/OR)
  • I want to group different criteria in my query (parenthesis, order of operations)

Effects on Public API

The following methods will be added to the Public API and will therefore have a client-side implementation:

  • public Set<Long> find(Criteria criteria)

The following classes will be added to the Public API:

  • Criteria
  • CriteriaBuilder (or Criteria.Builder)

Effects on Thrift API

  • Add a TCriteria struct (this must be wrapped by the Criteria object in the public api)
  • Must deprecated or change the current #find() method
  • Implement #find() method or new method that takes a Criteria as a param (i.e. set<i64> find0(data.Criteria criteria))
  • Make sure that ConcourseServer takes simple queries and turns them to a Criteria and sends them to the appropriate thrift ConcourseService method.

Important Questions and Semantics

  1. Should the appropriate processing and post-filtering for a complex find be done client or server side?
    1. We should probably do this on the server so that these queries are performant. I'm really concerned about all the network overhead of the client sending multiple separate queries over the wire.
    2. If this is done on the client, then complex queries are somewhat analogous to a compound operation
      1. This means that our results can become invalid if a spurious write changes the data in between network calls
        1. Example: Assume my criteria is find all records where "name = 'John' AND age = 23". If the client has to do separate calls and post filter the results then its possible a record that matches the first query can be changed to no longer match the first query after the result is returned and the client moves onto the second query.
    3. If this is done on the server, the complex query is analogous to a repeatable atomic operation
  2. Should we deprecate the #find methods that take simple parameters?
    1. No: Builders are inherently not user friendly so we should not force the user to use one of a simple query
    2. Yes: Overloaded methods that take different kinds of parameters is bad API design
  3. Why a programatic builder as opposed to a grammar parser?
    1. Don't trust raw string input from users. We can regulate things a lot more if this is done programmatically with a builder.
  4. This project may form the basis of a larger database language (e.g. Concourse Action Language (CAL))
    1. This means that we need to write this feature in a way that can easily be leveraged and extended in the future
  5. Will this builder be usable in CaSH?

Building Process

  1. User creates Criteria from builder.
  2. The client goes through each symbol in the Criteria and converts them to the appropriate TSymbol, building a TCriteria along the way
    1. All sub criteria is expanded and prepended/appended with the appropriate parenthesis symbols
  3. The server receives the TCriteria, goes through the list of TSymbols and converts them back to Symbols
  4. The server does the shunting yard alg on the list of Symbols.

Context Free Grammar

We need to build a CFG that can control the state of the builder and also parse hand written queries. The grammar must support the following cases:

  • Simple queries: key operator value
  • Conjunctive queries: key operator value AND key operator value
  • Disjunctive queries: key1 operator value1 OR key2 operator value2
  • Combined queries: key1 operator value1 AND key2 operator value2 OR key3 operator value3
  • Grouped queries: (key1 operator value1 AND key2 operator value2) OR key3 operator value3
Code Block
S → SS
S → (S)
S → ()
S → key KEY
KEY → operator OPERATOR
OPERATOR → value VALUE
OPERATOR → value
VALUE → value VALUE
VALUE → value
VALUE → and AND
VALUE → or OR
AND → S
OR → S

Psuedocode

TSymbolType

A representation for an enum that declares the type of a TSymbol.

Code Block
titledata.thrift
enum TSymbolType {
	CONJUNCTION = 1,
	KEY = 2,
	VALUE = 3,
	PARENTHESIS = 4
}

TSymbol

A representation for a Symbol that can be passed over the wire via Thrift. Once passed over the wire, the server uses information about the symbol type to parse the string representation of the symbol to an actual object.

Code Block
titledata.thrift
struct TSymbol {
	1:required TSymbolType type; 
	2:required string symbol;
}

TCriteria

A representation for a Criteria that can be passed over the wire via Thrift. Once passed over the write, the server goes through the list of TSymbols and converts them to actual Symbol objects which can then be used in the shunting-yard algorithm.

Code Block
titledata.thrift
struct TCriteria {
	1:required list<TSymbol> symbols
}

Criteria

A wrapper around an in-order collection of symbols. A Criteria is generated from the completion of a state machine that comprises a builder.

Code Block
// The client must convert a Criteria to a TCriteria and while doing so and encountering an embedded Criteria, it should add a open paren then add all the symbols from the
// sub criteria and finally add a closing paren. Then the server will take the TCriteria and make a Criteria that has no sub Criteria objects but has parenthesis in the list
// of symbols.
public class Criteria implements Symbol {
	public static Builder builder() {
		return new StartState(new Criteria());
	}
	private final List<Symbol> symbols = Lists.newArrayList();
	protected Criteria();
	protected void addSymbol(Symbol symbol);  
}

State

The base class and marker for any valid state for the builder. Each state is passed the current Criteria and holds a reference. For any method called from the state, a symbol is added to the Criteria or the Criteria is returned.

Code Block
abstract class State {
	Criteria criteria;
} 

StartState

The state that marks the beginning of a new Criteria.

Code Block
class StartState extends State {
	StartState group(Criteria criteria);
	KeyState key(String key);
	Criteria build();
}

KeyState

The state that expects the next symbol/token to be an operator specification.

Code Block
class KeyState extends State {
	OperatorState operator(Operator operator);
}

OperatorState

The state that expects the next symbol/token to be a value specification.

Code Block
class OperatorState extends State {
	ValueState value(Object value);
}

ValueState

The state that expects the current symbol/token to be the last or the next symbol/token to be value or conjunction specification.

Code Block
class ValueState extends State {
	ValueState value(Object value);
	StartState and();
	StartState or();
	Criteria build();
}

Symbol

A marker interface to indicate an object that can be included in a Criteria

Code Block
interface Symbol {} //marker interface
Note

Each subclass implementation of the Symbol interface should have some public static method that allows the creation of an Object in that class from a string parameter.

ConjunctionSymbol

An enum that represents the possible conjunction symbols.

Code Block
enum ConjunctionSymbol implements Symbol {
	AND, OR;
}

KeySymbol

A symbol that represents a "key" in a Criteria.

Code Block
class KeySymbol implements Symbol {
	public String getKey();
}

OperatorSymbol

A symbol that represents an "operator" in a Criteria.

Code Block
class OperatorSymbol implements Symbol {
	public Operator getOperator();
}

 

ValueSymbol

A symbol that represents a "value" in a Criteria.

Code Block
class ValueSymbol implements Symbol {	
	// When the State creates this symbol it must convert the object to a TObject so that it can be passed over the wire via Thrift
	public TObject getValue();
}

ParenthesisSymbol (only defined on the server)

A symbol that represents a parenthesis in a Criteria. Note that the builder doesn't actually need to use these since Criteria objects are embedded to signify groups.

Code Block
enum ParenthesisSymbol implements Symbol {
	LEFT, RIGHT;
}

Order of Operations

Given validly formed grouped criteria, we must process everything with the correct order of operations. For this, we should use the Shunting-yard algorithm. I've written this algorithm in PHP so we should port it to Java.

Code Block
	/**
     * Breakdown a valid selection criteria using the Shunting-yard algorithin
     * @see http://en.wikipedia.org/wiki/Shunting-yard_algorithm
     * @param string $criteria
     * @return array $criteria in reverse polish notation {@link http://en.wikipedia.org/wiki/Reverse_Polish_notation}
     * @since 1.1.0
     */
    private static function breakdownSelectionCriteria($criteria) {
        $rangeSeparator = "*\*";
        $spaceSeperator = "++\++";
        $_criteria = $criteria;
        $stack = array();
        $queue = array();
        $criteria = str_replace("(", " ( ", $criteria);
        $criteria = str_replace(")", " ) ", $criteria);
        $criteria = preg_replace("/(range) ([0-9]+) ([0-9]+)/i",
                "$1$2$rangeSeparator$3", $criteria);
        $criteria = preg_replace("/'([a-zA-Z0-9]+)([\s])+([a-zA-Z0-9]+)'[\s]*/i",
                "$1$spaceSeperator$3", $criteria);
        foreach (self::$_SELECTION_CRITERIA_OPERATORS as $operator) {
            $criteria = str_ireplace(" $operator", $operator, $criteria);
            $criteria = str_ireplace("$operator ", $operator, $criteria);
        }
        $tokens = array_filter(explode(" ", $criteria));
        foreach ($tokens as $token) {
            $token = str_replace($rangeSeparator, " ", $token);
            $token = str_replace($spaceSeperator, " ", $token);
            if (strcasecmp($token, "and") == 0 || strcasecmp($token, "or") == 0) {
                while (!empty($stack)) {
                    $topOfStack = $topOfStack = $stack[count($stack) - 1];
                    if (strcasecmp($token, "or") == 0 && (strcasecmp($topOfStack,
                                    "or") == 0 || strcasecmp($topOfStack, "and")
                            == 0)) {
                        $queue[] = array_pop($stack);
                    }
                    else {
                        break;
                    }
                }
                array_push($stack, $token);
            }
            else if (strcasecmp($token, "(") == 0) {
                array_push($stack, $token);
            }
            else if (strcasecmp($token, ")") == 0) {
                $foundLeftParen = false;
                while (!empty($stack)) {
                    $topOfStack = $stack[count($stack) - 1];
                    if (strcasecmp($topOfStack, "(") == 0) {
                        $foundLeftParen = true;
                        break;
                    }
                    else {
                        $queue[] = array_pop($stack);
                    }
                }
                if (!$foundLeftParen) {
                    throw new \Exception("Syntax error in criteria $_criteria. Mismatched parenthesis");
                }
                array_pop($stack);
            }
            else {
                $queue[] = $token;
            }
        }
        while (!empty($stack)) {
            $topOfStack = $stack[count($stack) - 1];
            if (strcasecmp($topOfStack, ")") == 0 || strcasecmp($topOfStack, "(")
                    == 0) {
                throw new \Exception("Syntax error in criteria $_criteria. Mismatched parenthesis");
            }
            $queue[] = array_pop($stack);
        }
        return $queue;
    }

Implementation Plan

TaskDescriptionNotes
Define Criteria grammarImplement the rules of the grammar, define all the appropriate symbols, and define all the appropriate operations(tick)
Port Shunting-yard to JavaGiven a collection of symbols, implement the algorithm that returns those symbols in reverse polish notation.(tick)
Implement Criteria builderImplement a builder that allows the programatic creation of a Criteria(tick)
Edit Public APIAdd the appropriate methods to the public api(tick)
Integrate with CaSH (tick)