Introduction

The Siren Vanguard plugin is a plugin for Elasticsearch that adds join capabilities across indices. Currently, one type of join is implemented, which we call "Search Join", but our objective is to add more types of joins in the future.

The Search Join is essentially a (left) semi-join between two sets of documents based on a common attribute, where the result contains only the attributes of one of the joined sets of documents. This join is used to filter one document set based on a second document set. It is equivalent to the EXISTS() operator in SQL.
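To make the semi-join semantics concrete, here is a minimal in-memory sketch in Python. It does not reflect how the plugin is implemented; the function name semi_join and the sample documents are hypothetical and only illustrate that the result keeps documents (and attributes) from the left-hand set alone.

```python
from typing import Any, Dict, Iterable, List


def _values(doc: Dict[str, Any], field: str) -> List[Any]:
    """Return the values of `field` as a list (single or multi-valued)."""
    value = doc.get(field)
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple, set)) else [value]


def semi_join(
    left: Iterable[Dict[str, Any]],
    left_field: str,
    right: Iterable[Dict[str, Any]],
    right_field: str,
) -> List[Dict[str, Any]]:
    """Keep the left documents that share at least one key with the right set."""
    # Collect every join key present in the right-hand set.
    keys = {v for doc in right for v in _values(doc, right_field)}
    # A left document is kept as-is; nothing from the right set is copied over.
    return [
        doc for doc in left
        if any(v in keys for v in _values(doc, left_field))
    ]


# Hypothetical sample data, loosely modelled on articles and companies.
left_docs = [
    {"title": "A", "mentions": ["1", "2"]},
    {"title": "B", "mentions": []},
    {"title": "C", "mentions": ["4"]},
]
right_docs = [{"id": "2", "name": "Orient Technologies"}]

result = semi_join(left_docs, "mentions", right_docs, "id")
print([doc["title"] for doc in result])  # prints ['A']
```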

The Search Join supports joins on both numerical and textual fields, but the joined attributes must be of the same type. You can also freely combine and nest multiple Search Joins using boolean operators (conjunction, disjunction, negation) to create complex query plans. It is fully integrated with the Elasticsearch API and is compatible with distributed environments.

How Does It Compare With Parent-Child

The Search Join is similar in nature to the Parent-Child feature of Elasticsearch: both perform a join at query time. However, there are important differences between them:

  • With Parent-Child, the parent document and all of its children must live on the same shard, which limits its flexibility. The Search Join removes this constraint and is therefore more flexible: it can join documents across shards and across indices.

  • Thanks to the data locality of the Parent-Child model, its joins are faster and more scalable. The Search Join, on the contrary, needs to transfer data across the network to compute joins across shards, which limits its scalability and performance.

There is no "one size fits all" solution to this problem, and you need to understand your requirements to choose the proper solution. As a basic rule, if your data model and data relationships are purely hierarchical (or can be mapped to a purely hierarchical model), then the Parent-Child model might be more appropriate. If, on the contrary, you need to query both directions of a data relationship, then the Search Join might be more appropriate.

On Which Data Model It Operates

The most important requirement for the Search Join is to have a common shared attribute between two indices. For example, let’s take a simple relational data model composed of two tables, Articles and Companies, and of one junction table ArticlesMentionCompanies to encode the many-to-many relationships between them.

This model can be mapped to two Elasticsearch indices, Articles and Companies. An article document will have a multi-valued field mentions with the unique identifiers of the companies mentioned in the article. In other words, the field mentions is a foreign key in the Articles table that refers to the primary key of the Companies table.

It should be straightforward for someone to write an SQL statement to flatten and map relationships into a single multi-valued field. We can see that, compared to a traditional database model where a junction table is necessary, the model is simplified by leveraging multi-valued fields.
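As a rough illustration of that flattening step, the Python sketch below folds junction-table rows into a multi-valued mentions field. The row layout (article_id, company_id) and the function name flatten_mentions are assumptions made for this example, not part of the plugin.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def flatten_mentions(
    articles: Iterable[Dict[str, str]],
    junction_rows: Iterable[Tuple[str, str]],
) -> List[Dict[str, object]]:
    """Attach to each article the list of company ids it mentions."""
    mentions = defaultdict(list)
    for article_id, company_id in junction_rows:
        mentions[article_id].append(company_id)
    # Articles without any junction row get an empty list.
    return [
        {**article, "mentions": mentions.get(article["id"], [])}
        for article in articles
    ]


# Hypothetical rows standing in for the Articles and
# ArticlesMentionCompanies tables.
article_rows = [
    {"id": "1", "title": "The NoSQL database glut"},
    {"id": "2", "title": "Graph Databases Seen Connecting the Dots"},
]
junction = [("1", "1"), ("1", "2")]

flattened = flatten_mentions(article_rows, junction)
print(flattened[0]["mentions"])  # prints ['1', '2']
```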

Getting Started

In this short guide, you will learn how you can quickly install the Siren Vanguard plugin in Elasticsearch, load two collections of documents inter-connected by a common attribute, and execute a relational query across the two collections within the Elasticsearch environment.

Prerequisites

This guide requires that you have downloaded and installed the Elasticsearch 5.5.2 distribution on your computer. If you do not have an Elasticsearch distribution, you can run the following commands:

$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.2.zip
$ unzip elasticsearch-5.5.2.zip
$ cd elasticsearch-5.5.2

Installing the Siren Vanguard Plugin

Before starting Elasticsearch, you have to install the Siren Vanguard plugin. Assuming that you are in your Elasticsearch installation directory, you can run the following command:

$ ./bin/elasticsearch-plugin install file:///PATH-TO-SIREN-VANGUARD-PLUGIN/siren-vanguard-5.5.2-plugin.zip
-> Downloading file:///PATH-TO-SIREN-VANGUARD-PLUGIN/siren-vanguard-5.5.2-plugin.zip
[=================================================] 100%  
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission accessDeclaredMembers
* java.lang.RuntimePermission createClassLoader
* java.lang.reflect.ReflectPermission suppressAccessChecks
* java.security.SecurityPermission insertProvider.BC
* java.security.SecurityPermission putProviderProperty.BC
* java.util.PropertyPermission * read,write
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed siren-vanguard

In case you want to remove the plugin, you can run the following command:

$ bin/elasticsearch-plugin remove siren-vanguard

-> Removing siren-vanguard...
Removed siren-vanguard

Starting Elasticsearch

To launch Elasticsearch, run the following command:

$ ./bin/elasticsearch

In the output, you should see a line like the following which indicates that the Siren Vanguard plugin is installed and running:

[2017-04-11T10:42:02,209][INFO ][o.e.p.PluginsService     ] [etZuTTn] loaded plugin [siren-vanguard]

Loading Some Relational Data

We will use a simple synthetic dataset for the purpose of this demo. The dataset consists of two collections of documents: Articles and Companies. An article is connected to a company with the attribute mentions. Articles will be loaded into the articles index and companies into the companies index. To load the dataset, run the following commands:

$ curl -XPUT 'http://localhost:9200/articles'
$ curl -XPUT 'http://localhost:9200/articles/_mapping/article' -d '
{
  "properties": {
    "mentions": {
      "type": "keyword"
    }
  }
}
'
$ curl -XPUT 'http://localhost:9200/companies'
$ curl -XPUT 'http://localhost:9200/companies/_mapping/company' -d '
{
  "properties": {
    "id": {
      "type": "keyword"
    }
  }
}
'

$ curl -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "1" } }
{ "title" : "The NoSQL database glut", "mentions" : ["1", "2"] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "2" } }
{ "title" : "Graph Databases Seen Connecting the Dots", "mentions" : [] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "3" } }
{ "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "4" } }
{ "title" : "MapR ships Apache Drill", "mentions" : ["4"] }

{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "1" } }
{ "id": "1", "name" : "Elastic" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "2" } }
{ "id": "2", "name" : "Orient Technologies" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "3" } }
{ "id": "3", "name" : "Cloudera" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "4" } }
{ "id": "4", "name" : "MapR" }
'

{
  "took" : 8,
  "errors" : false,
  "items" : [ {
    "index" : {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "1",
      "_version" : 3,
      "status" : 200
    }
  },
  ...
}
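If you would rather generate the bulk request body from a program than type it by hand, a sketch along the following lines could be used. The helper name bulk_body is hypothetical; it relies only on what is visible in the command above (each document is preceded by an index action line) and on the fact that the body is newline-delimited JSON whose last line must end with a newline.

```python
import json
from typing import Any, Dict, Iterable, Tuple


def bulk_body(
    index: str,
    doc_type: str,
    docs: Iterable[Tuple[str, Dict[str, Any]]],
) -> str:
    """Build a newline-delimited _bulk body of index actions."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))  # action line
        lines.append(json.dumps(source))  # document line
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"


company_docs = [
    ("1", {"id": "1", "name": "Elastic"}),
    ("2", {"id": "2", "name": "Orient Technologies"}),
]
body = bulk_body("companies", "company", company_docs)
print(body, end="")
```

The resulting string can be passed as the request body of the _bulk call shown above.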

Relational Querying of the Data

We will now show you how to execute a relational query across the two indices. For example, we would like to retrieve all the articles that mention companies whose name matches orient. This relational query can be decomposed into two search queries: the first one finds all the companies whose name matches orient, and the second one filters out all articles that do not mention a company from the first result set. The Siren Vanguard plugin introduces a new Elasticsearch filter, named join, that lets you define such a query plan, and a new search API, /siren/[INDICES]/_search, that lets you execute it. Below is the command to run the relational query:

$ curl -XGET 'http://localhost:9200/siren/articles/_search?pretty' -d '{
  "query" : {
    "join" : {                      (1)
      "indices" : ["companies"],    (2)
      "on" : ["mentions", "id"],    (3)
      "request" : {                 (4)
        "query" : {
          "term" : {
            "name" : "orient"
          }
        }
      }
    }
  }
}'
  1. The join query clause

  2. The source indices (i.e., companies)

  3. The clause specifying the paths for join keys in both source and target indices

  4. The search request that will be used to filter the companies

The command should return the following response with two search hits:

{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{ "title" : "The NoSQL database glut", "mentions" : ["1", "2"] }
    }, {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{ "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] }
    } ]
  }
}

You can also reverse the order of the join, and query for all the companies that are mentioned in articles whose title matches nosql:

$ curl -XGET 'http://localhost:9200/siren/companies/_search?pretty' -d '{
  "query" : {
    "join" : {
      "indices" : ["articles"],
      "on" : ["id", "mentions"],
      "request" : {
        "query" : {
          "term" : {
            "title" : "nosql"
          }
        }
      }
    }
  }
}'

The command should return the following response with three search hits:

{
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "companies",
      "_type" : "company",
      "_id" : "4",
      "_score" : 1.0,
      "_source":{ "id": "4", "name" : "MapR" }
    }, {
      "_index" : "companies",
      "_type" : "company",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{ "id": "1", "name" : "Elastic" }
    }, {
      "_index" : "companies",
      "_type" : "company",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{ "id": "2", "name" : "Orient Technologies" }
    } ]
  }
}
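The introduction mentions that Search Joins can be combined with boolean operators. The Python sketch below builds such a request body, intended for /siren/articles/_search, that asks for articles mentioning a company matching orient but no company matching mapr. It assumes, without confirmation from this guide, that join clauses can be placed directly inside the filter and must_not clauses of a standard Elasticsearch bool query; verify this against your plugin version before relying on it.

```python
import json

# Join selecting articles that mention a company whose name matches "orient".
mentions_orient = {
    "join": {
        "indices": ["companies"],
        "on": ["mentions", "id"],
        "request": {"query": {"term": {"name": "orient"}}},
    }
}

# Join selecting articles that mention a company whose name matches "mapr".
mentions_mapr = {
    "join": {
        "indices": ["companies"],
        "on": ["mentions", "id"],
        "request": {"query": {"term": {"name": "mapr"}}},
    }
}

# Assumption: join clauses are accepted inside a regular bool query.
combined = {
    "query": {
        "bool": {
            "filter": [mentions_orient],    # conjunction
            "must_not": [mentions_mapr],    # negation
        }
    }
}

print(json.dumps(combined, indent=2))
```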

Vanguard API

Search API

This plugin introduces two new search actions: /siren/[INDICES]/_search, which replaces the /[INDICES]/_search action, and /siren/[INDICES]/_msearch, which replaces the /[INDICES]/_msearch action. Both actions are extensions of the original Elasticsearch actions and therefore support the same API. You must use these actions with the join query clause, as the join query clause is not supported by the original Elasticsearch actions.

Parameters

  • join: the name of the join query clause

  • indices: the index names that will be joined with the source indices (optional, defaults to all indices).

  • types: the index types that will be joined with the source indices (optional, defaults to all types).

  • on: an array specifying the paths for join keys in both source and target indices

  • request: the search request that will be used to filter documents before performing the join

Example

In this example, we will join all the documents from index1 with the documents of index2. The query first filters documents from index2 and of type type with the query { "terms" : { "tag" : [ "aaa" ] } }. It then retrieves the ids of the documents from the field id specified by the parameter on. The list of ids is then used as a filter and applied on the field foreign_key of the documents from index1.

    {
      "join" : {
        "indices" : ["index2"],
        "types" : ["type"],
        "on" : ["foreign_key", "id"],
        "request" : {
          "query" : {
            "terms" : {
              "tag" : [ "aaa" ]
            }
          }
        }
      }
    }
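When join clauses are generated programmatically, a small helper can keep the parameters listed above in one place. The following Python sketch is a convenience of our own, not part of the plugin: the function name build_join is hypothetical, and the check that on holds exactly two paths is an assumption drawn from the examples in this guide.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_join(
    on: Sequence[str],
    request: Dict[str, Any],
    indices: Optional[Sequence[str]] = None,
    types: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Assemble a join query clause from its parameters."""
    # Assumption: one path for each side of the join, as in the examples.
    if len(on) != 2:
        raise ValueError("'on' must contain exactly two field paths")
    clause: Dict[str, Any] = {"on": list(on), "request": request}
    # indices and types are optional, so they are only set when given.
    if indices is not None:
        clause["indices"] = list(indices)
    if types is not None:
        clause["types"] = list(types)
    return {"join": clause}


example_join = build_join(
    on=["foreign_key", "id"],
    request={"query": {"terms": {"tag": ["aaa"]}}},
    indices=["index2"],
    types=["type"],
)
```

Called with the values of the example above, the helper yields the same clause.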

Response Format

The response returned by Vanguard’s search API is identical to the response returned by Elasticsearch’s search API.

Performance Considerations

Numeric vs String Attributes

Joining numeric attributes is more efficient than joining string attributes. If you are planning to join attributes of type string, we recommend generating a murmur hash of the string value at indexing time, storing it in a new attribute, and using this new attribute for the join. Such an index-time data transformation can easily be done using Logstash’s fingerprint plugin.
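If you index documents from your own code rather than through Logstash, the same idea can be sketched in Python. The block below is a pure-Python rendering of the 32-bit MurmurHash3 function, written here only for illustration. The choice of the 32-bit variant, the seed 0, the UTF-8 encoding and the field name name_hash are all assumptions, and the resulting values are not guaranteed to match those produced by Logstash’s fingerprint plugin.

```python
from typing import Any, Dict

_MASK = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & _MASK


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Compute the 32-bit MurmurHash3 (x86 variant) of `data`."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & _MASK
    nblocks = len(data) // 4
    # Body: mix the input four bytes at a time (little-endian).
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & _MASK
        k = _rotl32(k, 15)
        k = (k * c2) & _MASK
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & _MASK
    # Tail: mix the remaining one to three bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & _MASK
        k = _rotl32(k, 15)
        k = (k * c2) & _MASK
        h ^= k
    # Finalization: force all bits of the hash to avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & _MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & _MASK
    h ^= h >> 16
    return h


def add_hash(doc: Dict[str, Any], field: str, target: str) -> Dict[str, Any]:
    """Return a copy of `doc` with a numeric hash of `doc[field]` in `target`."""
    hashed = dict(doc)
    hashed[target] = murmur3_32(str(doc[field]).encode("utf-8"))
    return hashed


company = add_hash({"name": "Orient Technologies"}, "name", "name_hash")
```

Keep in mind that a 32-bit hash can map two different strings to the same value, so joins on the hashed attribute may occasionally match unrelated documents.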

Cache Settings

By default, Elasticsearch does not cache queries immediately and does not cache queries at all on small segments (from Lucene’s LRUQueryCache: segments with fewer than 10k documents or less than 3% of the total number of documents in the index). To ensure optimal performance, the following Elasticsearch settings must be set:

  • Index level settings:

 index.queries.cache.enabled: true
 index.queries.cache.everything: true

  • Node level settings:

 indices.queries.cache.all_segments: true

License API for Siren Vanguard

Vanguard includes a license manager service and a set of REST commands to register, verify and delete a Siren license.

Without a valid license, Vanguard logs a message at every request to notify you that the current license is invalid.

Usage

Let’s assume you have a Siren license named license.sig. You can upload and register this license in Elasticsearch using the command:

$ curl -XPUT -T "license.sig" 'http://localhost:9200/_siren/license'
---
acknowledged: true

You can then check the status of the license using the command:

$ curl -XGET 'http://localhost:9200/_siren/license'
{
  "license" : {
    "content" : {
      "valid-date" : "2016-05-16",
      "issue-date" : "2016-04-15",
      "max-nodes" : "12"
    },
    "isValid" : true
  }
}

To delete a license from Elasticsearch, you can use the command:

$ curl -XDELETE 'http://localhost:9200/_siren/license'
{"acknowledged":true}