Query domain-specific language (DSL)

Join query

The join filter enables the filtering of one set of documents (the target) with another one (the source) based on shared field values. It accepts the following parameters:

type

The type of join algorithm to use. Valid values are BROADCAST_JOIN, HASH_JOIN, INDEX_JOIN, or ROUTING_JOIN. If this parameter is not specified, the query planner will automatically select the optimal one. For more information, see Configuring joins by type.

indices

The index names for the child set. Multiple indices can be specified using the Elasticsearch syntax. Defaults to all indices.

on

An array of two elements that specifies the field paths for the join keys in the parent and the child set, respectively. The metadata of the fields is validated prior to the query execution, see field metadata.

request

The search request that is used to compute the set of documents of the child set before performing the join.

Example

In this example, we will join all the documents from target_index with the documents of source_index using the HASH_JOIN algorithm. The query first filters documents from source_index and of type type with the query { "terms" : { "tag" : [ "aaa" ] } }. It then retrieves the ids of the documents from the field id specified by the parameter on. The list of ids is then used as filter and applied on the field foreign_key of the documents from target_index.

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/target_index/_search' 'd '
{
  "query" : {
    "join" : {
      "type": "HASH_JOIN",
      "indices" : ["source_index"],
      "on" : ["foreign_key", "id"],
      "request" : { (1)
        "query" : {
          "terms" : {
            "tag" : [ "aaa" ]
          }
        }
      }
    }
  }
}
'
1 The search request that will be used to filter out the source set (i.e. source_index)

Field Metadata

The fields metadata are accessed to check for their capabilities. The field capabilities contain the following information:

type

The data type of the field.

searchable

A property indicating whether or not the field can be used for search.

aggregatable

A property indicating whether or not the field can be aggregated on.

Both the parent and child fields must be of the same type. For hash and broadcast joins, all join fields must be aggregatable. For index join and routing join, parent field must be searchable, and the child field must be aggregatable.

Suppose we have index1 with two fields, one keyword (indexed as searchable and aggregatable) and the other long (not indexed as searchable and aggregatable), then we will have the example result below:

GET /siren/index1/_field_caps
{
  "indices": [
    "index1"
  ],
  "fields": {
    "comments.author.keyword": {
      "keyword": {
        "type": "keyword",
        "searchable": true,
        "aggregatable": true
      }
    },
    "comments.number": {
      "long": {
        "type": "long",
        "searchable": false,
        "aggregatable": false
      }
    }
  }
}

A field with doc values and defined as not indexed (index: false) has a metadata searchable true. However, the use of such a doc-values-only field is not supported as parent field in index join queries.

Metadata field _id

The metadata field _id is supported as a join key field in semi-join queries.

Example

Consider the following documents from two indices, company and people:

$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "company", "_type" : "company", "_id" : "CoAcme" } }
{ "id": 1, "name" : "Acme", "ceo": "peo1" }
{ "index" : { "_index" : "company", "_type" : "company", "_id" : "CoBueno" } }
{ "id": 2, "name" : "Bueno" }
{ "index" : { "_index" : "company", "_type" : "company", "_id" : "CoArk" } }
{ "id": 3, "name" : "Ark" }

{ "index" : { "_index" : "people", "_type" : "person", "_id" : "peo1" } }
{ "id" : 1, "name" : "Alice", "employed_by" : "CoAcme" }
{ "index" : { "_index" : "people", "_type" : "person", "_id" : "peo2" } }
{ "id" : 2, "name" : "Bob", "employed_by" : "CoBueno" }
{ "index" : { "_index" : "people", "_type" : "person", "_id" : "peo3" } }
{ "id" : 3, "name" : "Carol", "employed_by" : "CoAcme" }
'

Suppose that the two indices are joined in order to retrieve a list of companies using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["_id", "employed_by"],  (1)
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
      }
    }
}'
1 The metadata field _id of the index company is used as the left join key field

The response should contain two hits, as follows:

{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "CoBueno",
        "_score" : 1.0
      },
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "CoAcme",
        "_score" : 1.0
      }
    ]
  }
}

Suppose that the two indices are joined in order to retrieve a list of companies using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["ceo", "_id"],  (1)
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
      }
    }
}'
1 The metadata field _id of the index people is used as the right join key field

The response should contain one hit, as follows:

{
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "CoAcme",
        "_score" : 1.0
      }
    ]
  }
}

Scoring Capabilities

The join filter returns a constant score. Therefore, the scores of the matching documents from the child set do not affect the scores of the matching documents from the parent set. However, one can project the document’s score from the child set and customize the scoring of the documents from the parent set with a script score function.

Project

When joining a child set with a parent set, the fields from the child set may be projected to the parent set. The projected fields and associated values are mapped to the matching documents of the parent set.

A projection is defined in the request body search of the join clause using the parameter project. The project parameter accepts an array of field’s objects, each one defining a field to project.

The projected fields from a child set are accessible in the scope of the parent’s request. One can refer to a projected field in a project context or in a script context such as in a script field, a script-based sort, and so on.

The projection of the metadata field _id is not supported. When the _id field is used as the join key on the child side, no other field can be projected.

Example

Consider the following documents from two indices, company and people:

$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "company", "_type" : "company", "_id" : "1" } }
{ "id": 1, "name" : "Acme" }
{ "index" : { "_index" : "company", "_type" : "company", "_id" : "2" } }
{ "id": 2, "name" : "Bueno" }

{ "index" : { "_index" : "people", "_type" : "person", "_id" : "1" } }
{ "id" : 1, "name" : "Alice", "age" : 31, "gender" : "Female", "employed_by" : 1 }
{ "index" : { "_index" : "people", "_type" : "person", "_id" : "2" } }
{ "id" : 2, "name" : "Bob", "age" : 42, "gender" : "Male", "employed_by" : 2 }
{ "index" : { "_index" : "people", "_type" : "person", "_id" : "3" } }
{ "id" : 3, "name" : "Carol", "age" : 26, "gender" : "Female", "employed_by" : 1 }
'

Suppose that the two indices are joined in order to retrieve a list of companies with the ages of all their respective employees using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["id", "employed_by"],
        "request" : {
          "project" : [
            { "field" : { "name" : "age", "alias" : "employee_age" } }  (1)
          ],
          "query" : {
            "match_all" : {}
          }
        }
      }
    },
    "script_fields" : {
      "employees_age" : {
        "script" : "doc.employee_age"                                    (2)
      }
    }
}'
1 Project the field age from index people as employee_age
2 Return a script field employees_age for each hit with the associated projected values

The response should contain two hits, one for each company, with the script field employees_age as follows:

{
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "2",
        "_score" : 0.0,
        "fields" : {
          "employees_age" : [
            42
          ]
        }
      },
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "1",
        "_score" : 0.0,
        "fields" : {
          "employees_age" : [
            26,
            31
          ]
        }
      }
    ]
  }
}

Field

A standard field object specifies the projection of a field from a set. It is composed of the following parameters:

name

The name of a field from a child set to project.

alias

An alias name to give to the projected field. It is not possible to have multiple fields with identical names in the same set scope as this leads to ambiguity. It is therefore important to carefully select alias names to avoid such ambiguity.

{
  "field" : {
    "name" : "age",              (1)
    "alias" : "employee_age"     (2)
  }
}
1 The name of the field to project
2 An alias for the field name

Runtime field

The runtime field mapping defined in the search request allows to create a field that exists only as part of the query. This new field can be projected like any other field using script fields.

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["id", "employed_by"],
        "request" : {
            "runtime_mappings": {
                "rt_field": {
                  "type": "long",
                  "script": {
                    "source": "<put your script here>"                   (1)
                  }
                }
              },
              "project" : [
                { "field" : { "name" : "rt_field" } }                    (2)
              ],
              "query" : {
                "match_all" : {}
              }
        }
      }
    },
    "script_fields" : {
      "fetched_rt_field" : {
        "script" : "doc.rt_field"                                         (3)
      }
    }
}'
1 The runtime field defined in the request
2 To project the new field
3 To fetch the new field into the response if needed

Runtime fields can also be used within an aggregation or, as in the following example as part of a join.

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["rt_field_company", "rt_field_people"],         (1)
        "request" : {
            "runtime_mappings": {
                "rt_field_people": {
                  "type": "long",
                  "script": {
                    "source": "<put your script here>"          (2)
                  }
                }
              },
              "query" : {
                "match_all" : {}
              }
        }
      }
    },
    "runtime_mappings": {
        "rt_field_company": {
          "type": "long",
          "script": {
            "source": "<put your script here>"                   (3)
          }
        }
      }
}'
1 Join on two runtime fields defined in the request
2 Runtime field for people index
3 Runtime field for company index

Document Score

The score of a matching document from a set may be projected using a standard field object using the special field name _score.

{
  "field" : {
    "name" : "_score",
    "alias" : "employee_score"
  }
}

Retrieving a projected field

A script field may be used to retrieve the values of a projected field for each hit, as shown in the previous example. The projected field is accessed using the doc values API. In the example, the projected field employee_age is accessed using the syntax doc.employee_age.

Sorting based on a projected field

A script-based sorting method can be used to sort the hits based on the values of a projected field, for example:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
     "join" : {
       "indices" : ["people"],
       "on" : ["id", "employed_by"],
       "request" : {
          "project" : [
            { "field" : { "name" : "age", "alias" : "employee_age" } }
          ],
         "query" : {
           "match_all" : {}
         }
       }
     }
   },
   "sort": [
    {
      "_script": {
        "script": {
          "lang": "painless",
          "source": "int sum = 0; for (value in doc.employee_age) { sum += value } return sum;"
        },
        "type": "number",
        "order": "desc"
      }
    }
  ]
}'

The response should contain two hits, one for each company, sorted by the sum of their employees age as follows:

{
  "hits" : {
    "total" : 2,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "id" : 1,
          "name" : "Acme"
        },
        "sort" : [
          57.0
        ]
      },
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "id" : 2,
          "name" : "Bueno"
        },
        "sort" : [
          42.0
        ]
      }
    ]
  }
}

Scoring based on a projected field

A script-based scoring method can be used to customize the scoring based on the values of a projected field. For example, we can project the score of the matching documents from the child set and aggregate them into the parent document as follows:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": {
        "join": {
          "indices": [ "people" ],
          "on": [ "id", "employed_by" ],
          "request": {
            "project" : [
              { "field" : { "name" : "_score", "alias" : "child_score" } }
            ],
            "query": {
              "match_all": {}
            }
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "float sum = 0; for (value in doc.child_score) { sum += value } return sum;"
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  }
}'

The response should contain two hits, one for each company, sorted by the sum of their child scores as follows:

{
  "hits" : {
    "total" : 2,
    "max_score" : 2.0,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "1",
        "_score" : 2.0,
        "_source" : {
          "id" : 1,
          "name" : "Acme"
        }
      },
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "id" : 2,
          "name" : "Bueno"
        }
      }
    ]
  }
}

Aggregating based on a projected field

A script can be used to access and aggregate the values of a projected field. For example, we can project the values of the field gender of the matching documents from the people index and aggregate the documents from the company index by using these values as follows:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
 "query" : {
    "join" : {
      "indices" : ["people"],
      "on" : ["id", "employed_by"],
      "request" : {
        "project" : [
          { "field" : { "name" : "gender.keyword", "alias" : "employee_gender" } }
        ],
        "query" : {
          "match_all" : {}
        }
      }
    }
  },
  "aggs": {
    "count_by_gender": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc.employee_gender"
        }
      }
    }
  }
}'

The response should contain an aggregation result count_by_gender with two buckets Female and Male as follows:

{
  "aggregations": {
    "count_by_gender": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Female",
          "doc_count": 1
        },
        {
          "key": "Male",
          "doc_count": 1
        }
      ]
    }
  }
}

Known limitations

  • A join query involving one or more runtime fields is currently supported using HASH_JOIN or BROADCAST_JOIN algorithms. The INDEX_JOIN and ROUTING_JOIN algorithms only supports joins with the runtime field on the child set.

  • Regarding fields of nested type:

    • The join filter within a nested query is currently not supported.

    • A nested query within a join filter is supported if and only if the join key does not refer to a field of the nested object.

    • The projection of fields of nested data type is currently not supported.

  • A has_child or has_parent query in a join query is supported, but a join query in a has_child or has_parent query is currently not supported.

  • The _id field is supported as a join key field with concrete indices, but its projection as a secondary field is not supported. When the _id field is used as the join key on the child side, no other field can be projected.

  • The _id field is not supported as join key field with virtual indices.

  • The projection of fields in a join with virtual indices isn’t currently supported. However, if both virtual indices are on the same remote cluster with the Federate plugin installed, then it is possible.

Limitations specific to the ROUTING_JOIN

  • The ROUTING_JOIN cannot be selected when joining on something other than the parent set’s _id.

  • The ROUTING_JOIN must not be selected when the parent set’s indices use custom routing as this can result in incorrect joins.

    • When using custom routing, the best practice is to declare the routing value as required in the index mapping, as per Elasticsearch’s documentation.

    • To follow this best practice from Elasticsearch will also prevent that incorrect join results are generated inadvertently by Federate.

      • If this best practice is followed, Federate will prompt the user with a meaningful error message when she selects the ROUTING JOIN algorithm despite the custom routing on the parent set’s indices.

      • If this best practice is followed, Federate’s query planner will never select the ROUTING JOIN in case of custom routing on the parent set’s indices.

  • The ROUTING_JOIN cannot be selected when the parent set’s indices are referred by some alias that uses custom routing for indexing (see Alias routing).