Implementing UNION Clause In Neo4j Cypher

Nov 13, 2025 by Admin

Hey guys! Let's dive into adding support for the UNION clause to a Rust parser for Neo4j's Cypher query language. This is a significant undertaking, so buckle up as we walk through the pieces involved, from abstract syntax tree (AST) modifications to validation and testing. We'll break it down into manageable chunks to make it easier to digest.

1. AST (ast.rs)

First, we need to represent the UNION clause in our Abstract Syntax Tree (AST). This involves defining new data structures that can hold the information related to UNION operations. Our primary goal here is to ensure that the AST accurately reflects the structure of a Cypher query involving UNION. This representation will be crucial for subsequent stages like parsing, validation, and query execution.

Adding UnionClause struct: We'll start by adding a UnionClause struct. This struct will encapsulate the left and right queries that are combined by the UNION operation. Think of it as a container that holds two separate query trees, each representing a complete Cypher query. The struct should look something like this:
```
struct UnionClause {
    left_query: Query,
    right_query: Query,
}
```
The left_query and right_query fields will each hold a complete Query object, representing the two queries being combined. This allows us to maintain the full structure of each query while associating them within the UNION clause.
Adding UnionAllClause struct (or use a flag): Next, we need to handle the UNION ALL variant, which is slightly different from UNION. The main difference is that UNION ALL does not remove duplicate rows from the result set, while UNION does. We have two options here: we can either create a separate UnionAllClause struct, or we can add a flag to the UnionClause struct to indicate whether it's a UNION or a UNION ALL operation. Let's explore both options:
- Separate UnionAllClause struct:
```
struct UnionAllClause {
    left_query: Query,
    right_query: Query,
}
```
  This approach keeps the two variants separate, which can make the code more readable and easier to maintain. However, it also introduces some duplication, as both structs will have the same fields.
- Flag in UnionClause struct:
```
struct UnionClause {
    left_query: Query,
    right_query: Query,
    all: bool, // True if it's UNION ALL, false if it's UNION
}
```
  This approach avoids duplication, but it can make the code slightly more complex, as we need to check the all flag whenever we're dealing with a UnionClause. For simplicity, let's go with the flag approach for now.
Considering a different Query structure: Finally, we need to consider whether the existing Query structure is sufficient to handle UNION operations. Since UNION combines two queries, we might need to modify the Query struct to accommodate multiple queries combined by UNION clauses. This might involve adding a vector of UnionClause objects to the Query struct.
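To make the difference behind the all flag concrete, here is a minimal sketch of what an executor might do with it once both sides have produced rows. The Row alias and the combine_rows function are made up for illustration; they are not part of the parser described here or of Neo4j itself, and a real engine would not necessarily preserve row order the way this does.

```rust
use std::collections::HashSet;

/// Hypothetical row type: one String per returned column.
type Row = Vec<String>;

/// Combines the rows of two queries.
/// `all == true` models UNION ALL (every row is kept);
/// `all == false` models UNION (only the first copy of each row is kept).
fn combine_rows(left: Vec<Row>, right: Vec<Row>, all: bool) -> Vec<Row> {
    let combined = left.into_iter().chain(right);
    if all {
        return combined.collect();
    }
    // `insert` returns false for a row that was already seen,
    // so the filter drops every duplicate after the first.
    let mut seen: HashSet<Row> = HashSet::new();
    let unique: Vec<Row> = combined.filter(|row| seen.insert(row.clone())).collect();
    unique
}
```

The only thing the AST needs to carry for this decision is the single bool, which is one argument in favour of the flag approach.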

2. Clause enum (clauses.rs)

Now that we have defined the UnionClause struct, we need to integrate it into our Clause enum. The Clause enum represents the different types of clauses that can appear in a Cypher query. By adding Union variants to this enum, we enable our parser to recognize and handle UNION clauses.

Adding Union(UnionClause) variant: We'll add a Union variant to the Clause enum, which will hold a UnionClause object. This allows us to represent a UNION clause as a distinct type of clause in our query.
```
enum Clause {
    // Other clause variants...
    Union(UnionClause),
}
```
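Adding UnionAll(UnionClause) variant (or use a flag): Similarly, we'll add a UnionAll variant to the Clause enum to represent the UNION ALL clause. Since UnionClause already carries the all flag, this variant is strictly redundant, and the Union variant alone would be enough. We add it anyway so that match sites read clearly, but that means the parser has to keep the two in sync: all should be true only inside UnionAll.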
```
enum Clause {
    // Other clause variants...
    Union(UnionClause),
    UnionAll(UnionClause),
}
```
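Here is a small sketch of how downstream code might read these variants, and of a sanity check that the variant and the all flag agree. The Query placeholder, the Return variant, and the two helper functions are assumptions made so the example compiles on its own; they are not taken from the real clauses.rs.

```rust
/// Placeholder for a parsed query; the real struct would hold clauses.
#[derive(Debug, Clone, PartialEq)]
struct Query {
    clauses: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct UnionClause {
    left_query: Query,
    right_query: Query,
    all: bool,
}

#[derive(Debug, Clone, PartialEq)]
enum Clause {
    Return(String), // stand-in for the other clause variants
    Union(UnionClause),
    UnionAll(UnionClause),
}

/// Some(true) for UNION ALL, Some(false) for UNION, None for anything else.
fn keeps_duplicates(clause: &Clause) -> Option<bool> {
    match clause {
        Clause::Union(_) => Some(false),
        Clause::UnionAll(_) => Some(true),
        _ => None,
    }
}

/// Checks that the variant and the `all` flag tell the same story.
fn flag_matches_variant(clause: &Clause) -> bool {
    match clause {
        Clause::Union(u) => !u.all,
        Clause::UnionAll(u) => u.all,
        _ => true,
    }
}
```

A check like flag_matches_variant is only needed because the same information is stored twice. Dropping either the flag or the extra variant would make it unnecessary.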

3. Parser functions (clauses.rs)

Next up, we need to implement the parser functions that will actually recognize and parse UNION and UNION ALL clauses in a Cypher query. These functions will be responsible for extracting the left and right queries from the input string and creating the corresponding UnionClause objects.

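Adding union_clause() parser function: This function will parse a UNION clause. Because UNION sits between two complete queries, it should expect a complete query, then the keyword UNION, then a second complete query.
```
use nom::{bytes::complete::tag, IResult};

fn union_clause(input: &str) -> IResult<&str, Clause> {
    // Parse the left query. Note: parse_query() must not begin by calling
    // union_clause() again, or the two would recurse without consuming input.
    let (input, left_query) = parse_query(input)?;

    // Parse the 'UNION' keyword that separates the two queries
    // (assumes parse_query() has consumed any whitespace before it)
    let (input, _) = tag("UNION")(input)?;

    // Parse the right query
    let (input, right_query) = parse_query(input)?;

    // Create a UnionClause object
    let union_clause = UnionClause {
        left_query,
        right_query,
        all: false, // It's a UNION, not UNION ALL
    };

    // Return the Union clause
    Ok((input, Clause::Union(union_clause)))
}
```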

Adding union_all_clause() parser function: This function will parse a UNION ALL clause. It should expect a complete query, then the keywords UNION ALL, then a second complete query.
```
fn union_all_clause(input: &str) -> IResult<&str, Clause> {
    // Parse the left query (same recursion caveat as for UNION:
    // parse_query() must not begin by calling this function again)
    let (input, left_query) = parse_query(input)?;

    // Parse the 'UNION ALL' keywords that separate the two queries
    // (this tag only matches a single space between the two words)
    let (input, _) = tag("UNION ALL")(input)?;

    // Parse the right query
    let (input, right_query) = parse_query(input)?;

    // Create a UnionClause object
    let union_clause = UnionClause {
        left_query,
        right_query,
        all: true, // It's a UNION ALL
    };

    // Return the UnionAll clause
    Ok((input, Clause::UnionAll(union_clause)))
}
```

Parsing two complete queries separated by UNION: The parser functions need to be able to handle two complete queries separated by the UNION or UNION ALL keywords. This means that each query must have its own RETURN clause, as UNION can only appear between complete queries.
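To illustrate the "two complete queries separated by a keyword" shape without pulling in the rest of the parser, here is a standard-library-only sketch that splits a query string at its first UNION or UNION ALL. The UnionSplit struct and both functions are hypothetical helpers, not part of the parser above. The sketch deliberately ignores string literals and comments that happen to contain the word UNION, and for a chain of several UNIONs it leaves everything after the first keyword in the right-hand side.

```rust
/// Result of splitting a query string at its first UNION keyword.
#[derive(Debug, PartialEq)]
struct UnionSplit<'a> {
    left: &'a str,
    right: &'a str,
    all: bool,
}

/// Strips `keyword` (case-insensitive) from the start of `input`, provided it
/// is followed by whitespace or by the end of the input.
fn strip_keyword<'a>(input: &'a str, keyword: &str) -> Option<&'a str> {
    let head = input.get(..keyword.len())?;
    let tail = &input[keyword.len()..];
    let at_boundary = tail.is_empty() || tail.starts_with(char::is_whitespace);
    if head.eq_ignore_ascii_case(keyword) && at_boundary {
        Some(tail)
    } else {
        None
    }
}

/// Splits `input` at the first standalone UNION keyword.
/// Returns None if there is no such keyword or if either side is empty.
fn split_on_union(input: &str) -> Option<UnionSplit<'_>> {
    // ASCII upper-casing keeps byte offsets identical to `input`.
    let upper = input.to_ascii_uppercase();
    let mut search_from = 0;
    while let Some(offset) = upper[search_from..].find("UNION") {
        let start = search_from + offset;
        let end = start + "UNION".len();
        // Require whitespace on both sides so that e.g. `:Reunion` is skipped.
        let before_ok = upper[..start].ends_with(char::is_whitespace);
        let after_ok = upper[end..].starts_with(char::is_whitespace);
        if before_ok && after_ok {
            let rest = input[end..].trim_start();
            let (all, right) = match strip_keyword(rest, "ALL") {
                Some(after_all) => (true, after_all),
                None => (false, rest),
            };
            let (left, right) = (input[..start].trim(), right.trim());
            if left.is_empty() || right.is_empty() {
                return None;
            }
            return Some(UnionSplit { left, right, all });
        }
        search_from = end;
    }
    None
}
```

Each side of the split would then be handed to the ordinary single-query parser, which also sidesteps the question of a union parser calling itself on its own left-hand side.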

4. Clause dispatcher (clauses.rs)

The clause dispatcher is responsible for routing the input string to the appropriate parser function based on the keywords it encounters. We need to update the clause dispatcher to recognize UNION and UNION ALL keywords and call the corresponding parser functions.

Adding UNION to clause() function's alt() parser: We'll add union_clause() to the alt() parser in the clause() function. This will allow the parser to recognize the UNION keyword and call the union_clause() function to parse the clause.
```
fn clause(input: &str) -> IResult<&str, Clause> {
    alt((
        // ...other clause parsers...
        union_clause,
    ))(input)
}
```
Adding UNION ALL to clause() function's alt() parser: Similarly, we'll add union_all_clause() to the alt() parser in the clause() function. This will allow the parser to recognize the UNION ALL keywords and call the union_all_clause() function to parse the clause. Since alt() tries its alternatives in order, we list union_all_clause first so that the longer keyword gets the first chance to match.
```
fn clause(input: &str) -> IResult<&str, Clause> {
    alt((
        // ...other clause parsers...
        union_all_clause,
        union_clause,
    ))(input)
}
```
```
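The order of alternatives matters because UNION is a prefix of UNION ALL. The following sketch shows the effect with a bare first-match-wins keyword lookup. The UnionKeyword enum and first_match function are invented for this illustration and use only the standard library. nom's alt() does move on to the next alternative when an earlier one fails with a recoverable error, so a full clause parser may still recover, but trying the longer keyword first avoids relying on that.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum UnionKeyword {
    Union,
    UnionAll,
}

/// Tries each (keyword, kind) pair in order and returns the first one that
/// is a prefix of `input`, together with the remaining input.
fn first_match<'a>(
    input: &'a str,
    alternatives: &[(&str, UnionKeyword)],
) -> Option<(UnionKeyword, &'a str)> {
    alternatives
        .iter()
        .find_map(|(keyword, kind)| input.strip_prefix(*keyword).map(|rest| (*kind, rest)))
}
```

With the shorter keyword listed first, the input "UNION ALL RETURN 2" is classified as a plain UNION and the leftover input starts with " ALL", which the query parser would then have to reject. With the longer keyword first, it is classified as UNION ALL and the leftover input is just " RETURN 2".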

5. Query assembly (clauses.rs)

Query assembly is the process of constructing the final query object from the parsed clauses. We need to handle UNION clauses during query assembly to ensure that the final query object accurately represents the UNION operation.

Handling UNION in parse_query(): We need to modify the parse_query() function to handle UNION clauses. This might involve creating a different Query structure that can accommodate multiple queries combined by UNION clauses.
UNION combines two queries, so Query struct might need unions: Vec<UnionClause>: As mentioned earlier, we might need to add a vector of UnionClause objects to the Query struct so that a single query can carry more than one UNION operation, as in a chain like A UNION B UNION ALL C.
```
struct Query {
    // Other query elements...
    unions: Vec<UnionClause>,
}
```
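One possible way to assemble a chain is sketched below. Note that it is a variation on the structure above, not the same thing: instead of UnionClause with its left and right queries, it uses a hypothetical UnionBranch that stores only the flag and the query that follows the keyword, so the middle query of a chain is not stored twice. The clause lists are plain strings here purely as placeholders for parsed clauses.

```rust
/// Hypothetical: one `UNION [ALL] <query>` step in a chain.
#[derive(Debug, Clone, PartialEq)]
struct UnionBranch {
    all: bool,
    query: Query,
}

#[derive(Debug, Clone, PartialEq)]
struct Query {
    /// Placeholder for the parsed clauses of the first query.
    clauses: Vec<String>,
    /// Queries chained onto the first one, in source order.
    unions: Vec<UnionBranch>,
}

/// Builds a Query from the first query's clauses plus each following
/// (is_union_all, clauses) pair.
fn assemble_query(first: Vec<String>, rest: Vec<(bool, Vec<String>)>) -> Query {
    let unions = rest
        .into_iter()
        .map(|(all, clauses)| UnionBranch {
            all,
            query: Query { clauses, unions: Vec::new() },
        })
        .collect();
    Query { clauses: first, unions }
}
```

A query without any UNION simply ends up with an empty unions vector, so existing code that only looks at clauses keeps working.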

6. Clause order validation (clauses.rs)

Clause order validation ensures that the clauses in a Cypher query appear in the correct order. We need to add UNION to the clause order validation state machine to ensure that UNION clauses are used correctly.

Adding UNION to validate_clause_order() state machine: We'll update the validate_clause_order() function to include UNION in the state machine. This will allow the validator to check that UNION clauses appear in the correct order with respect to other clauses.
UNION must connect two complete queries (both must have RETURN): The validator must ensure that UNION clauses connect two complete queries, meaning that both the left and right queries must have a RETURN clause.
UNION can only appear between complete queries: The validator must also ensure that UNION clauses only appear between complete queries and not in the middle of a query.
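The two rules above can be sketched as a small state machine over clause names. This is an assumption about how such a check could look, not the actual validate_clause_order() implementation: the function name, the use of plain string names, and the error messages are all made up for the example.

```rust
/// Checks that every UNION / UNION ALL sits between two queries that each
/// end with RETURN. `clause_names` is the flat sequence of clause names
/// in source order, e.g. ["MATCH", "RETURN", "UNION", "MATCH", "RETURN"].
fn validate_union_placement(clause_names: &[&str]) -> Result<(), String> {
    // The last clause seen in the current branch, or None at a branch start.
    let mut last_in_branch: Option<&str> = None;
    let mut seen_union = false;

    for (index, name) in clause_names.iter().enumerate() {
        if *name == "UNION" || *name == "UNION ALL" {
            if last_in_branch != Some("RETURN") {
                return Err(format!(
                    "{} at position {} must follow a query that ends with RETURN",
                    name, index
                ));
            }
            seen_union = true;
            last_in_branch = None; // a new branch starts after the keyword
        } else {
            last_in_branch = Some(*name);
        }
    }

    // The branch after the last UNION must be complete as well.
    if seen_union && last_in_branch != Some("RETURN") {
        return Err("the query after the last UNION must end with RETURN".to_string());
    }
    Ok(())
}
```

Queries without any UNION pass straight through, so this check can run alongside whatever ordering rules already exist for the other clauses.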

7. Clause name function (clauses.rs)

The clause name function is used to get the name of a clause for debugging and error reporting purposes. We need to add UNION and UNION ALL to the clause name function so that we can easily identify these clauses.
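As a rough idea of what such a function could look like, here is a minimal sketch. The function name clause_name, the returned strings, and the trimmed-down Clause and UnionClause types are all assumptions made for the example; the real enum has more variants and the real UnionClause also holds the two queries.

```rust
/// Placeholder payload for this sketch only.
struct UnionClause {
    all: bool,
}

enum Clause {
    Return, // stand-in for the other clause variants
    Union(UnionClause),
    UnionAll(UnionClause),
}

/// Returns a human-readable name for use in debug output and error messages.
fn clause_name(clause: &Clause) -> &'static str {
    match clause {
        Clause::Return => "RETURN",
        Clause::Union(_) => "UNION",
        Clause::UnionAll(_) => "UNION ALL",
    }
}
```

Because the match is exhaustive, the compiler will flag this function as soon as a new variant is added to Clause without a name.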

**Adding `Clause::Union(_) =>