Joining your data

Join based on the metadata field _id

The metadata field _id is supported as a join key field in semi-join queries.

Example

Consider the following documents from two indices, company and people:

$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "company", "_type" : "company", "_id" : "CoAcme" } }
{ "id": 1, "name" : "Acme", "ceo": "peo1" }
{ "index" : { "_index" : "company", "_type" : "company", "_id" : "CoBueno" } }
{ "id": 2, "name" : "Bueno" }
{ "index" : { "_index" : "company", "_type" : "company", "_id" : "CoArk" } }
{ "id": 3, "name" : "Ark" }

{ "index" : { "_index" : "people", "_type" : "person", "_id" : "peo1" } }
{ "id" : 1, "name" : "Alice", "employed_by" : "CoAcme" }
{ "index" : { "_index" : "people", "_type" : "person", "_id" : "peo2" } }
{ "id" : 2, "name" : "Bob", "employed_by" : "CoBueno" }
{ "index" : { "_index" : "people", "_type" : "person", "_id" : "peo3" } }
{ "id" : 3, "name" : "Carol", "employed_by" : "CoAcme" }
'

Suppose that the two indices are joined in order to retrieve a list of companies using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["_id", "employed_by"],  (1)
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
      }
    }
}'

1	The metadata field `_id` of the index `company` is used as the left join key field

The response should contain two hits, as follows:

{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "CoBueno",
        "_score" : 1.0
      },
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "CoAcme",
        "_score" : 1.0
      }
    ]
  }
}

Suppose that the two indices are joined in order to retrieve a list of companies using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["ceo", "_id"],  (1)
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
      }
    }
}'

1	The metadata field `_id` of the index `people` is used as the right join key field

The response should contain one hit, as follows:

{
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "CoAcme",
        "_score" : 1.0
      }
    ]
  }
}

Join based on a runtime field

Runtime fields can also be used as part of a join. They can be as part of the parent index or the child index, as shown in the following example:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["rt_field_company", "rt_field_people"],         (1)
        "request" : {
            "runtime_mappings": {
                "rt_field_people": {
                  "type": "long",
                  "script": {
                    "source": "<put your script here>"          (2)
                  }
                }
              },
              "query" : {
                "match_all" : {}
              }
        }
      }
    },
    "runtime_mappings": {
        "rt_field_company": {
          "type": "long",
          "script": {
            "source": "<put your script here>"                   (3)
          }
        }
      }
}'

1	Join on two runtime fields defined in the request
2	Runtime field for people index
3	Runtime field for company index

In this example, the join operation is performed on two runtime fields: rt_field_company and rt_field_people. The runtime mappings are defined within the request for both the parent (company) and child (people) indices. Replace <put your script here> with the appropriate Painless script to compute the desired values for each runtime field.

Join based on a composite key

Runtime fields can be used to create composite keys, that can be used to join on multiple conditions. For example, if you have first_name and last_name fields, you often want to join on both fields at the same time. You can use runtime fields to concatenate multiple fields, and use this composite key as a join key. Another use case example is if you have fields for IP address and network port, you can create a composite key <ip_address>:<port>.

Like in the previous section, they can be defined for the parent index, the child index, or both.

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
  "query" : {
    "join" : {
      "indices" : ["people"],
      "on" : ["ceo_full_name", "rt_full_name"],     (1)
      "request" : {
        "runtime_mappings": {
          "rt_full_name": {                         (2)
            "type": "keyword",
            "script": {
              "source": "emit(doc['first_name'].value + ' ' + doc['last_name'].value)"
            }
          }
        },
        "query" : {
          "match_all" : {}
        }
      }
    }
  }
}'

1	Join between an indexed field and a runtime field
2	Runtime field for the `people` index, containing the concatenation of multiple fields

In this example, a join operation is performed between the ceo_full_name field of the company index and a runtime field rt_full_name. The runtime field is a composite key: a concatenation of the first_name and last_name of employees in the people index. This allows to join indices even when they don’t have directly corresponding keys.

Retrieving a projected field

A script field may be used to retrieve the values of a projected field for each hit, as shown in the following example.

Consider the following documents from two indices, company and people:

$ curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "company", "_id" : "1" } }
{ "id": 1, "name" : "Acme" }
{ "index" : { "_index" : "company", "_id" : "2" } }
{ "id": 2, "name" : "Bueno" }

{ "index" : { "_index" : "people", "_id" : "1" } }
{ "id" : 1, "name" : "Alice", "age" : 31, "gender" : "Female", "employed_by" : 1 }
{ "index" : { "_index" : "people", "_id" : "2" } }
{ "id" : 2, "name" : "Bob", "age" : 42, "gender" : "Male", "employed_by" : 2 }
{ "index" : { "_index" : "people", "_id" : "3" } }
{ "id" : 3, "name" : "Carol", "age" : 26, "gender" : "Female", "employed_by" : 1 }
'

Suppose that the two indices are joined in order to retrieve a list of companies with the ages of all their respective employees using the following request:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["id", "employed_by"],
        "request" : {
          "project" : [
            { "field" : { "name" : "age", "alias" : "employees_age" } }  (1)
          ],
          "query" : {
            "match_all" : {}
          }
        }
      }
    },
    "fields" : ["employees_age"]                                       (2)
}'

1	Project the field `age` from index `people` as `employees_age`
2	Return the projected field `employees_age`

The response should contain two hits, one for each company, with the script field employees_age as follows:

{
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "2",
        "_score" : 0.0,
        "fields" : {
          "employees_age" : [
            42
          ]
        }
      },
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "1",
        "_score" : 0.0,
        "fields" : {
          "employees_age" : [
            26,
            31
          ]
        }
      }
    ]
  }
}

Retrieving a runtime field

The runtime field mapping defined in the search request allows creating a field that exists only as part of the query. This new field can be projected like any other field using script fields.

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["people"],
        "on" : ["id", "employed_by"],
        "request" : {
            "runtime_mappings": {
                "rt_field": {
                  "type": "long",
                  "script": {
                    "source": "<put your script here>"                   (1)
                  }
                }
              },
              "project" : [
                { "field" : { "name" : "rt_field" } }                    (2)
              ],
              "query" : {
                "match_all" : {}
              }
        }
      }
    },
    "fields" : ["rt_field"]                                              (3)
}'

1	The runtime field defined in the request
2	To project the new field
3	To fetch the new field into the response if needed

In this example, a runtime field named rt_field is created with a Painless script. Replace <put your script here> with the appropriate script to compute the desired value for the runtime field. The new runtime field is projected and retrieved into the response using the fields parameter.

Filter using a projected field

Runtime fields can also be used to filter on projected data.

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
  "runtime_mappings": {
    "employee_count": {
      "type": "long",
      "script": "emit(doc.employees.length)"                          (1)
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "join": {
            "indices": ["people"],
            "on": ["id", "employed_by"],
            "request": {
              "project": [
                {
                  "field": {"name": "id", "alias": "employees"}       (2)
                }
              ],
              "query": {
                "match_all": {}
              }
            }
          }
        }
      ],
      "filter": [
        {
          "range" : {
            "employee_count" : { "gt" : 1 }                           (3)
          }
        }
      ]
    }
  }
}'

1	The runtime field used to collect the projected fields
2	To project the `id` field for all matching documents in the `people` index
3	To filter the documents in the `company` index, using the value from the runtime field

In this example, the runtime field employee_count collects all the projected values from the people index, and returns the number of values. This runtime field is then used in the filter clause to return only the companies that have more than 1 employee.

Sorting based on a projected field

A script-based sorting method can be used to sort the hits based on the values of a projected field, for example:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
   "query" : {
     "join" : {
       "indices" : ["people"],
       "on" : ["id", "employed_by"],
       "request" : {
          "project" : [
            { "field" : { "name" : "age", "alias" : "employee_age" } }
          ],
         "query" : {
           "match_all" : {}
         }
       }
     }
   },
   "runtime_mappings": {
    "employee_ages": {
      "type": "long",
      "script": "int sum = 0; for (value in doc.employee_age) { sum += value } emit(sum);"
     }
   },
   "sort": [
    {
      "employee_ages": "desc"
    }
  ]
}'

The response should contain two hits, one for each company, sorted by the sum of their employees age as follows:

{
  "hits" : {
    "total" : 2,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "id" : 1,
          "name" : "Acme"
        },
        "sort" : [
          57.0
        ]
      },
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "id" : 2,
          "name" : "Bueno"
        },
        "sort" : [
          42.0
        ]
      }
    ]
  }
}

In this example, the join operation is performed between the company and people indices, and the projected field employee_age is created. The script-based sorting method is used to sort the hits based on the sum of employees' ages in the employee_ages runtime field. The search results are sorted in descending order, with the company having the highest sum of employees' ages appearing first in the search results.

Spatial join using simple fields

Federate supports spatial joins by applying inequalities between one or more fields. For example, documents with a date field in the parent index can be joined with the child index where the date is between an initial_date field and an end_date field of the child index. This translates to the Boolean expression date > initial_date && date < end_date.

This spatial join can be expressed as follows:

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
  "query" : {
    "join" : {
      "indices" : ["child_index"],
      "on" :
        {
          "must": [ (1)
            {"gt": {"fields": ["date", "initial_date"] }}, (2)
            {"lt": {"fields": ["date", "end_date"] }} (3)
          ],
        }
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
    }
  }
}
'

1	The join is the conjunction of 2 conditions.
2	Match documents where the `date` is greater than `initial_date`.
3	In addition, this matches documents where the `date` is less than `end_date`.

In this example, a spatial join is performed between the parent_index and child_index based on the date, initial_date, and end_date fields. The join condition requires the parent index’s date field to be greater than the child index’s initial_date field and less than the child index’s end_date field.

Instead of combining GT(E) and LT(E) conditions to join on ranges, you can use range data types and their specialized operators: contains, within, and intersect. The use of range data types natively supports open bounded fields and it can simplify your model. For example, a range data field considers a range to be open-ended if its less than value is undefined for an indexed document.

Spatial join using range fields

Federate supports spatial joins by joining on range fields. For example, documents with a date field in the parent index can be joined with the child index when the date is within retention_dates field of the child index; where retention_dates is of type date_range.

The previous join can be expressed as follows:

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
  "query" : {
    "join" : {
      "indices" : ["child_index"],
      "on" :
        {
          "must": [ (1)
            {"within": {"fields": ["date", "retention_dates"] }} (2)
          ],
        }
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
    }
  }
}
'

1	The join using range field data types.
2	Match documents where the `date` is between the ranges defined by `retention_dates`.

In this example, a spatial join is performed between the parent_index and child_index based on the date and retention_dates fields. The join condition requires the parent index’s date field to be between the range defined by the retention_dates field.

If the range data type field is located on the parent_index, use the contains operator instead.

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
  "query" : {
    "join" : {
      "indices" : ["child_index"],
      "on" :
        {
          "must": [ (1)
            {"contains": {"fields": ["retention_dates", "date"] }} (2)
          ],
        }
        "request" : {
          "query" : {
            "match_all" : {}
          }
        }
    }
  }
}
'

1	The join using range field data types.
2	Match documents where a set defined by the range in `retention_dates` contains the `date` value.

For information about accepted data types in a spatial join, see Requirements for a join and the known limitations section for spatial joins.

Scoring Capabilities

By default, the join filter returns a constant score. Therefore, the scores of the matching documents from the child set do not affect the scores of the matching documents from the parent set. However, you can project the document’s score from the child set and customize the scoring of the documents from the parent set with a script score function.

Document Score

The score of a matching document from a set may be projected using a standard field object using the special field name _score.

{
  "field" : {
    "name" : "_score",
    "alias" : "employee_score"
  }
}

Scoring based on a projected field

A script-based scoring method can be used to customize the scoring based on the values of a projected field. For example, you can project the score of the matching documents from the child set and aggregate them into the parent document as follows:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": {
        "join": {
          "indices": [ "people" ],
          "on": [ "id", "employed_by" ],
          "request": {
            "project" : [
              { "field" : { "name" : "_score", "alias" : "child_score" } }
            ],
            "query": {
              "match_all": {}
            }
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "float sum = 0; for (value in doc.child_score) { sum += value } return sum;"
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  }
}'

The response should contain two hits, one for each company, sorted by the sum of their child scores as follows:

{
  "hits" : {
    "total" : 2,
    "max_score" : 2.0,
    "hits" : [
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "1",
        "_score" : 2.0,
        "_source" : {
          "id" : 1,
          "name" : "Acme"
        }
      },
      {
        "_index" : "company",
        "_type" : "company",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "id" : 2,
          "name" : "Bueno"
        }
      }
    ]
  }
}

Aggregating based on a projected field

A script can be used to access and aggregate the values of a projected field. For example, you can project the values of the field age of the matching documents from the people index and aggregate the documents from the company index by using these values as follows:

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
 "query" : {
    "join" : {
      "indices" : ["people"],
      "on" : ["id", "employed_by"],
      "request" : {
        "project" : [
          { "field" : { "name" : "age", "alias" : "employee_age" } }
        ],
        "query" : {
          "match_all" : {}
        }
      }
    }
  },
  "runtime_mappings": {
    "employee_ages": {
      "type": "long",
      "script": "for (value in doc.employee_age) { emit(value) }"
    }
  },
  "aggs": {
    "by_company": {
      "terms": {
        "field": "name.keyword"
      },
      "aggs": {
        "average_employee_age": {
          "avg": {
            "field": "employee_ages"
          }
        }
      }
    }
  }
}'

The response contains an aggregation result with a bucket for each company, and a sub-aggregation average_employee_age corresponding to the average age of employees in this company.

{
  "aggregations": {
    "by_company": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Acme",
          "doc_count": 1,
          "average_employee_age": {
            "value": 28.5
          }
        },
        {
          "key": "Bueno",
          "doc_count": 1,
          "average_employee_age": {
            "value": 42.0
          }
        }
      ]
    }
  }
}

In this example, the query joins the company index with the people index using the id and employed_by fields. The query projects the age field from the people index using an alias employee_age. The aggregation is performed using a runtime field employee_ages, which iterates through the projected employee_age field. The documents are first aggregated by company name, then a sub-aggregation average_employee_age computes the average age of employees in each bucket.