Introduction

The Siren Federate plugin is a plugin for Elasticsearch that extends Elasticsearch with (1) a federation layer to query external databases with the Elasticsearch API and (2) join capabilities across indices and external databases.

Once configured, an external database is mapped to a “Virtual Index” in Elasticsearch. Elasticsearch’s requests on a virtual index, such as Get Mapping or Search, are intercepted by the Siren Federate plugin, translated into the external database dialect and executed against the external database. This enables Siren Investigate to create and display dashboards for data located both in Elasticsearch indices and external databases.

The Siren Federate plugin also extends the Elasticsearch DSL with a join query clause which enables the user to execute a join between indices. The join capabilities are implemented on top of a in-memory distributed computing layer which scales with the number of nodes available in the cluster.

The current join capabilities is currently limited to a (left) semi-join between two set of documents based on a common attribute, where the result only contains the attributes of one of the joined set of documents. This join is used to filter one document set based on a second document set, hence its name. It is equivalent to the EXISTS() operator in SQL. Joins on both numerical and textual fields are supported, but the joined attributes must be of the same type. You can also freely combine and nest multiple joins using boolean operators (conjunction, disjunction, negation) to create complex query plans. It is fully integrated with the Elasticsearch API and is compatible with distributed environments.

Architecture Overview

Our core requirements for Siren Federate are:

  • Low latency, real time interactive response – Siren Federate is designed to power ad hoc interactive, read only queries such as Siren Investigate.

  • Implement a full featured relational algebra, capable of being extended for more advanced join conditions, operations and statistical optimizations.

  • Flexible distributed computational framework

  • Horizontal scaling of fully distributed operations, leveraging all the cluster memory.

  • Federated – capable of working on data that is not inside the cluster, for example via JDBC connections.

Siren Federate is based on the following high level architecture concepts:

  • A coordinator node which is in charge of the query parsing, query planning and query execution. We are leveraging the Apache Calcite engine to create a logical plan of the query, optimise the logical plan and execute a physical plan.

  • A set of worker processes that are in charge of executing the physical operations. Depending on the type of physical operation, a worker process is spawned on a per node or per shard basis.

  • An in-memory distributed file system that is used by the worker nodes to exchange data, with a compact columnar data representation optimized for analytical data processing, zero copy and zero data serialisation.

How Does Siren Federate Join Compare With Parent-Child

The Siren Federate join is similar in nature to the Parent-Child feature of Elasticsearch: they perform a join at query-time. However, there are important differences between them:

  • The parent document and all of its children must live on the same shard, which limits its flexibility. The Siren Federate join removes this constraint and is therefore more flexible: it allows to join documents across shards and across indices.

  • Thanks to the data locality of the Parent-Child model, joins are faster and more scalable. The Siren Federate join on the contrary needs to transfer data across the network to compute joins across shards, limiting its scalability and performance.

There is no “one size fits all” solution to this problem, and you need to understand your requirements to choose the proper solution. As a basic rule, if your data model and data relationships are purely hierarchical (or can be mapped to a purely hierarchical model), then the Parent-Child model might be more appropriate. If on the contrary you need to query both directions of a data relationship, then the Siren Federate join might be more appropriate.

On Which Data Model It Operates

The most important requirement for executing a join is to have a common shared attribute between two indices. For example, let’s take a simple relational data model composed of two tables, Articles and Companies, and of one junction table ArticlesMentionCompanies to encode the many-to-many relationships between them.

This model can be mapped to two Elasticsearch indices, Articles and Companies. An article document will have a multi-valued field mentions with the unique identifiers of the companies mentioned in the article. In other words, the field mentions is a foreign key in the Articles table that refers to the primary key of the Companies table.

It should be straightforward for someone to write an SQL statement to flatten and map relationships into a single multi-valued field. We can see that, compared to a traditional database model where a junction table is necessary, the model is simplified by leveraging multi-valued fields.

Getting Started

In this short guide, you will learn how you can quickly install the Siren Federate plugin in Elasticsearch, load two collections of documents inter-connected by a common attribute, and execute a relational query across the two collections within the Elasticsearch environment.

Prerequisites

This guide requires that you have downloaded and installed the Elasticsearch 5.6.7 distribution on your computer. If you do not have an Elasticsearch distribution, you can run the following commands:

$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.7.zip
$ unzip elasticsearch-5.6.7.zip
$ cd elasticsearch-5.6.7

Installing the Siren Federate Plugin

Before starting Elasticsearch, you have to install the Siren Federate plugin. Assuming that you are in your Elasticsearch installation directory, you can run the following command:

$ ./bin/elasticsearch-plugin install file:///PATH-TO-SIREN-FEDERATE-PLUGIN/siren-federate-5.6.7-10.0.0-beta-1-plugin.zip
-> Downloading file:///PATH-TO-SIREN-FEDERATE-PLUGIN/siren-federate-5.6.7-10.0.0-beta-1-plugin.zip
[=================================================] 100%  
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.io.FilePermission cloudera.properties read
* java.io.FilePermission simba.properties read
* java.lang.RuntimePermission accessClassInPackage.sun.misc
* java.lang.RuntimePermission accessClassInPackage.sun.misc.*
* java.lang.RuntimePermission accessClassInPackage.sun.security.provider
* java.lang.RuntimePermission accessDeclaredMembers
* java.lang.RuntimePermission createClassLoader
* java.lang.RuntimePermission getClassLoader
...
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed siren-federate

In case you want to remove the plugin, you can run the following command:

$ bin/elasticsearch-plugin remove siren-federate

-> Removing siren-federate...
Removed siren-federate

Starting Elasticsearch

To launch Elasticsearch, run the following command:

$ ./bin/elasticsearch

In the output, you should see a line like the following which indicates that the Siren Federate plugin is installed and running:

[2017-04-11T10:42:02,209][INFO ][o.e.p.PluginsService     ] [etZuTTn] loaded plugin [siren-federate]

Loading Some Relational Data

We will use a simple synthetic dataset for the purpose of this demo. The dataset consists of two collections of documents: Articles and Companies. An article is connected to a company with the attribute mentions. Articles will be loaded into the articles index and companies in the companies index. To load the dataset, run the following command:

$ curl -XPUT 'http://localhost:9200/articles'
$ curl -XPUT 'http://localhost:9200/articles/_mapping/article' -d '
{
  "properties": {
    "mentions": {
      "type": "keyword"
    }
  }
}
'
$ curl -XPUT 'http://localhost:9200/companies'
$ curl -XPUT 'http://localhost:9200/companies/_mapping/company' -d '
{
  "properties": {
    "id": {
      "type": "keyword"
    }
  }
}
'

$ curl -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "1" } }
{ "title" : "The NoSQL database glut", "mentions" : ["1", "2"] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "2" } }
{ "title" : "Graph Databases Seen Connecting the Dots", "mentions" : [] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "3" } }
{ "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "4" } }
{ "title" : "MapR ships Apache Drill", "mentions" : ["4"] }

{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "1" } }
{ "id": "1", "name" : "Elastic" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "2" } }
{ "id": "2", "name" : "Orient Technologies" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "3" } }
{ "id": "3", "name" : "Cloudera" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "4" } }
{ "id": "4", "name" : "MapR" }
'

{
  "took" : 8,
  "errors" : false,
  "items" : [ {
    "index" : {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "1",
      "_version" : 3,
      "status" : 200
    }
  },
  ...
}

Relational Querying of the Data

We will now show you how to execute a relational query across the two indices. For example, we would like to retrieve all the articles that mention companies whose name matches orient. This relational query can be decomposed in two search queries: the first one to find all the companies whose name matches orient, and a second query to filter out all articles that do not mention a company from the first result set. The Siren Federate plugin introduces a new Elasticsearch’s filter, named join, that allows to define such a query plan and a new search API _search that allows to execute this query plan. Below is the command to run the relational query:

$ curl -XGET 'http://localhost:9200/siren/articles/_search?pretty' -d '{
   "query" : {
      "join" : {                      (1)
        "indices" : ["companies"],    (2)
        "on" : ["mentions", "id"],    (3)
        "request" : {                 (4)
          "query" : {
            "term" : {
              "name" : "orient"
            }
          }
        }
      }
    }
}'
  1. The join query clause

  2. The source indices (i.e., companies)

  3. The clause specifying the paths for join keys in both source and target indices

  4. The search request that will be used to filter out companies

The command should return you the following response with two search hits:

{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{ "title" : "The NoSQL database glut", "mentions" : ["1", "2"] }
    }, {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{ "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] }
    } ]
  }
}

You can also reverse the order of the join, and query for all the companies that are mentioned in articles whose title matches nosql:

$ curl -XGET 'http://localhost:9200/siren/companies/_search?pretty' -d '{
   "query" : {
      "join" : {
        "indices" : ["articles"],
        "on": ["id", "mentions"],
        "request" : {
          "query" : {
            "term" : {
              "title" : "nosql"
            }
          }
        }
      }
    }
}'

The command should return you the following response with three search hits:

{
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "companies",
      "_type" : "company",
      "_id" : "4",
      "_score" : 1.0,
      "_source":{ "id": "4", "name" : "MapR" }
    }, {
      "_index" : "companies",
      "_type" : "company",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{ "id": "1", "name" : "Elastic" }
    }, {
      "_index" : "companies",
      "_type" : "company",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{ "id": "2", "name" : "Orient Technologies" }
    } ]
  }
}

Federate API

Node-level Settings

siren.planner.volcano.enable

Use the volcano statistical optimizer when selecting the join algorithms. Defaults to true.

siren.planner.volcano.join

Defines which distributed join algorithm to be selected when optimizing a request. Valid values are either HASH_JOIN or MERGE_JOIN, case-insensitive. Defaults to HASH_JOIN.

siren.planner.volcano.use_query

Use contextual queries when computing statistics. If false, computed statistics are effectively "global" to the index. Defaults to true.

Search API

This plugin introduces two new search actions, /siren/[INDICES]/_search that replaces the /[INDICES]/_search action, and /siren/[INDICES]/_msearch that replaces the /[INDICES]/_msearch action. Both actions are extensions of the original Elasticsearch actions and therefore supports the same API. One must use these actions with the join query clause, as the join query clause is not supported by the original Elasticsearch actions.

Search Request Parameters

debug

Enable query planner debug. Default to false.

task_timeout

A task timeout, bounding a task to be executed within the specified time value (in milliseconds) and bail with the values accumulated up to that point when expired. Default to no timeout.

Query Syntax

  • join: the name of the join query clause

  • type: the type of the join. Valid values are either BROADCAST_JOIN, HASH_JOIN or MERGE_JOIN.

  • indices: the index names that will be joined with the source indices (optional, default to all indices).

  • types: the index types that will be joined with the source indices (optional, default to all types).

  • on: an array specifying the paths for join keys in both source and target indices

  • request: the search request that will be used to filter out documents before performing the join

Example

In this example, we will join all the documents from index1 with the documents of index2. The query first filters documents from index2 and of type type with the query { "terms" : { "tag" : [ "aaa" ] } }. It then retrieves the ids of the documents from the field id specified by the parameter on. The list of ids is then used as filter and applied on the field foreign_key of the documents from index1.

GET /siren/index1/_search
{
  "join" : {
    "type": "HASH_JOIN",
    "indices" : ["index2"],
    "types" : ["type"],
    "on" : ["foreign_key", "id"],
    "request" : {
      "query" : {
        "terms" : {
          "tag" : [ "aaa" ]
        }
      }
    }
  }
}

Response Format

The response returned by the Federate’s search API is identical to the response returned by Elasticsearch’s search API.

Performance Considerations

Join Types

Siren Federate includes different join strategies: “Broadcast Join”, “Hash Join” and “Merge Join”. Each one has its pros and cons and the optimal performance will depends on the scenario. By default, the Siren Federate planner will try to automatically pick the best strategy, but it might be best in certain scenarios to pick manually one of the strategies.

The Broadcast Join is best when filtering a large index with a small set of documents. The Hash Join and Merge Join are fully distributed and are designed to handle large joins. They both scales horizontally (based on the number of nodes) and vertically (based on the number of cpu cores). Currently, the Hash Join usually performs better in many scenarios compared to the Merge Join.

Siren Federate provides two fully distributed join algorithms: the Hash Join and the Sort-Merge Join. Each one is designed for leveraging multi-core architecture. This is achieved by creating many small data partitions during the Project phase. Each node of the cluster will receive a number of partitions that are dependent of the number of cpus. Partitions are independent from each other and can be processed independently by a different join worker thread. During the join phase, each worker thread will join tuples from one partition. The number of join worker threads scales automatically with the number of cpu cores available.

The Hash Join is performed in two phases: build and probe. The build phase creates a in-memory hash table of one of the relation in the partition. The probe phase then scans the second relation and probes the hash table to find the matching tuples.

The Sort-Merge Join instead requires a sort phase of the two relations during the project phase. It then performs a linear scan over the two sorted relations to find the matching tuples.

Compared to the Hash Join, the Sort-Merge Join does not require additional memory since it does not have to build a in-memory hash table. However, it requires a sort operation to be executed during the project phase. It is in fact trading cpu for memory.

Numeric vs String Attributes

Joining numeric attributes is more efficient than joining string attributes. If you are planning to join attributes of type string, we recommend to generate a murmur hash of the string value at indexing time into a new attribute, and use this new attribute for the join. Such index-time data transformation can be easily done using Logstash’s fingerprint plugin.

Tuple Collector Settings

Tuple Collectors are sending batches of tuples of fixed size. The size of a batch has an impact on the performance. Smaller batches will take less memory but will increase cpu times on the receiver side since it will have to reconstruct a tuple collection from many small batches (especially for sorted tuple collection). By default, the size of a batch of tuple is set to 1048576 tuples (which represents 8mb for a column of long datatype). The size can be configured using the setting key siren.io.tuple.collector.batch_size with a integer value representing the maximum number of tuples in a batch.

License API for Siren Federate

Federate includes a license manager service and a set of rest commands to register, verify and delete a Siren’s license.

Without a valid license, Federate will log a message to notify that the current license is invalid at every request.

Usage

Let’s assume you have a Siren license named license.sig. You can upload and register this license in Elasticsearch using the command:

$ curl -XPUT -T "license.sig" 'http://localhost:9200/_siren/license'
---
acknowledged: true

You can then check the status of the license using the command:

$ curl -XGET 'http://localhost:9200/_siren/license'
{
  "license" : {
    "content" : {
      "valid-date" : "2016-05-16",
      "issue-date" : "2016-04-15",
      "max-nodes" : "12"
    },
    "isValid" : true
  }
}

To delete a license from Elasticsearch, you can use the command:

$ curl -XDELETE 'http://localhost:9200/_siren/license'
{"acknowledged":true}

Connecting to JDBC datasources

The Siren Federate plugin provides the capability to query data in remote datasources through the Elasticsearch API by mapping tables to virtual indices.

The plugin stores its configuration in two Elasticsearch indices:

  • .siren-federate-datasources: used to store the JDBC configuration parameters of remote datasources.

  • .siren-federate-indices: used to store the configuration parameters of virtual indices.

Datasources and virtual indices can be managed using the REST API or the user interface available in Siren Investigate.

These indices are created automatically when required.

Settings

In order to send queries to virtual indices the Elasticsearch cluster must contain at least one node enabled to issue queries over JDBC; it is advised to use a coordinating only node for this role, although this is not a requirement for testing purposes.

JDBC node settings

In order to enable JDBC on a node where the Siren Federate plugin is installed, add the following setting to elasticsearch.yml:

node.attr.connector.jdbc: true

Then, create a directory named jdbc-drivers inside the configuration directory of the node (e.g. elasticsearch/config or /etc/elasticsearch).

Finally, copy the JDBC driver for your remote datasource and its dependencies to the jdbc-drivers directory created above and restart the node; see the [siren-federate-jdbc-compat] section for a list of compatible drivers and dependencies.

Common configuration settings

Encryption

JDBC passwords are encrypted by default using a predefined 128 bit AES key; before creating datasources, it is advised to generate a custom key by running the keygen.sh script included in the siren-federate plugin directory as follows:

bash plugins/siren-federate/tools/keygen.sh -s 128

The command will output a random base64 key; it is also possible to generate keys longer than 128 bit if your JVM supports it.

To use the custom key, the following parameters must be set in elasticsearch.yml on master nodes and on all the JDBC nodes:

  • siren.connector.encryption.enabled: true by default, can be set to false to disable JDBC password encryption.

  • siren.connector.encryption.secret_key: a base64 encoded AES key used to encrypt JDBC passwords.

Example elasticsearch.yml settings for a master node with a custom encryption key:

siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="

Example elasticsearch.yml settings for a JDBC node with a custom encryption key:

siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="
node.attr.connector.jdbc: true

Restart the nodes after changing the configuration to apply the settings.

Cluster wide settings

The following parameters can be set in elasticsearch.yml on JDBC nodes or by using the Elasticsearch cluster update settings API:

  • siren.connector.siren.timeout.connection: the maximum amount of seconds to wait when establishing or acquiring a JDBC connection (30 by default).

  • siren.connector.timeout.query: the maximum execution time for JDBC queries, in seconds (30 by default).

  • siren.connector.enable_union_aggregations: true by default, can be set to false to disable the use of unions in nested aggregations.

  • siren.connector.query.max_result_rows: the maximum number of rows that will be retrieved from a resultset when performing a join across datasources. Defaults to 50000.

  • siren.connector.query.max_bucket_queries: the maximum number of JDBC queries that will be generated to compute aggregation buckets. Defaults to 500.

Additional node settings

The following settings can be used to tune query processing on JDBC enabled nodes:

  • siren.connector.pool.size: the number of threads that will be allocated to process the execution of queries to remote datasources; by default it is set to int((number of available_processors * 3) / 2) + 1.

  • siren.connector.pool.queue: the maximum number of requests that should be queued if all the threads are busy. Defaults to 40.

Authentication

The Federate server role

If your cluster is protected by Search Guard or Elastic X-Pack, it is required to define a role with access to the Federate indices and internal operations and to create a user with this role.

For interoperability with these plugins, whenever a virtual index is created the Federate plugin creates a concrete Elasticsearch index with the same name as the virtual index; when starting up, the Federate plugin will check for missing concrete indices and will attempt to create them automatically.

Sample Search Guard role definition:

federateserver:
  cluster:
    - "indices:admin/aliases"
  indices:
    ?siren-federate-datasources:
      '*':
        - ALL
    ?siren-federate-indices:
      '*':
        - ALL
    ?siren-federate-target:
      '*':
        - ALL

Sample X-Pack role definition:

{
  "cluster": [
    "monitor",
    "cluster:admin/siren/connector"
  ],
  "indices" : [
    {
      "names" : [ "*" ],
      "privileges" : [ "create_index", "indices:data/read/get", "indices:admin/siren/connector" ]
    },
    {
      "names" : [ ".siren-federate-*" ],
      "privileges" : [ "all", "indices:admin/siren/connector" ]
    }
  ]
}

Then create a user with that role e.g. a user called federateserver.

Example elasticsearch.yml settings for a master node in a cluster with authentication and federateserver user:

siren.connector.username: federateserver
siren.connector.password: password
siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="

Example elasticsearch.yml settings for a JDBC node in a cluster with authentication and federateserver user:

siren.connector.username: federateserver
siren.connector.password: password
siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="
node.attr.connector.jdbc: true

Restart the nodes after setting the appropriate configuration parameters.

Administrative role

In order to manage datasources and virtual indices, it is required to grant the cluster:admin/siren/connector/* permissions at the cluster level.

In addition, the user must have the indices:admin/siren/connector/ and indices:data/siren/connector/ permissions on all the index names that he’s allowed to define, in addition to create, write, read and search permissions.

Write permissions are required because when a virtual index is defined the plugin will create a concrete Elasticsearch index with the same name for interoperability with authentication plugins, unless such an index already exists.

Example Search Guard role allowed to manage virtual indices starting with db-:

sirenadmin:
  cluster:
    - SIREN_CLUSTER
    - cluster:admin/plugin/siren/license/put
    - cluster:admin/plugin/siren/license/get
    - cluster:admin/siren/connector/*
  indices:
    'db-*':
      '*':
        - SIREN_READWRITE
        - indices:admin/create
        - indices:admin/siren/connector/*
    '*':
      '*':
        - SIREN_COMPOSITE

Example X-Pack role allowed to manage virtual indices starting with db-:

{
  "cluster": [
    "cluster:admin/siren/connector"
    "cluster:admin/plugin/siren/license",
    "cluster:siren/internal",
    "manage"
  ],
  "indices" : [
    {
      "names" : [ "*" ],
      "privileges" : [ "indices:siren/mplan" ]
    },
    {
      "names" : [ "db-*" ],
      "privileges" : [
        "read",
        "create_index",
        "view_index_metadata",
        "indices:data/siren",
        "indices:siren",
        "indices:admin/version/get",
        "indices:admin/get",
        "indices:admin/siren/connector"
      ]
    }
  ]
}
Search role

In order to search virtual indices, users must have the indices:data/siren/connector/* permissions on these indices in addition to standard standard read and search permissions.

Example Search Guard role allowed to search virtual indices starting with db-:

sirenuser:
  cluster:
    - SIREN_CLUSTER
  indices:
    '*':
      '*':
        SIREN_COMPOSITE
    'db-*':
      '*':
        - SIREN_READONLY
        - indices:data/siren/connector/*

Example X-Pack role allowed to search virtual indices starting with db-:

{
  "cluster": [
    "cluster:admin/plugin/siren/license/get",
    "cluster:siren/internal"
  ],
  "indices" : [
    {
      "names" : [ "*" ],
      "privileges" : [ "indices:siren/mplan" ]
    },
    {
      "names" : [ "db-*" ],
      "privileges" : [
        "read",
        "view_index_metadata",
        "indices:data/siren",
        "indices:siren",
        "indices:admin/version/get",
        "indices:admin/get"
      ]
    }
  ]
}

JDBC driver installation and compatibility

The JDBC driver for your remote datasource and its dependencies must be copied to the jdbc-drivers subdirectory inside the configuration directory of JDBC nodes (e.g. elasticsearch/config/jdbc-drivers).

It is not required nor recommended to copy these drivers to nodes which are not enabled to execute queries.

Table 1. List of supported JDBC drivers
Name JDBC class Notes

PostgreSQL

org.postgresql.Driver

Download the latest JDBC 4.2 driver from https://jdbc.postgresql.org/download.html and copy the postgresql-<version>.jar file to the jdbc-drivers directory.

MySQL

com.mysql.jdbc.Driver

Download the latest GA release from https://dev.mysql.com/downloads/connector/j/, extract it, then copy mysql-connector-java-<version>.jar to the jdbc-drivers plugin directory.

When writing the JDBC connection string, set the useLegacyDatetimeCode parameter to false to avoid issues when converting timestamps.

Microsoft SQL Server 2014 or greater

com.microsoft.sqlserver.jdbc.SQLServerDriver

Download sqljdbc_<version>_enu.tar.gz from https://www.microsoft.com/en-us/download/details.aspx?id=55539, extract it, then copy mssql-jdbc-<version>.jre8.jar to the jdbc-drivers directory.

Sybase ASE 15.7+

com.sybase.jdbc4.jdbc.SybDriver

OR

net.sourceforge.jtds.jdbc.Driver

To use the FreeTDS driver, download the latest version from https://sourceforge.net/projects/jtds/files/, extract it, then copy jtds-<version>.jar to the jdbc-drivers directory.

To use the jConnect driver, copy jConnect-<version>.jar from your ASE directory to the jdbc-drivers directory.

Oracle 12c+

oracle.jdbc.OracleDriver

Download the latest ojdbc8.jar from http://www.oracle.com/technetwork/database/features/jdbc/jdbc-ucp-122-3110062.html and copy it to the jdbc-drivers plugin directory.

Presto

com.facebook.presto.jdbc.PrestoDriver

Download the latest JDBC driver from https://prestodb.io/docs/current/installation/jdbc.html and copy it to the jdbc-drivers plugin directory.

Spark SQL 2.2+

com.simba.spark.jdbc41.Driver

The Magnitude JDBC driver for Spark can be purchased at https://www.simba.com/product/spark-drivers-with-sql-connector/; once downloaded, extract the bundle, then extract the JDBC 4.1 archive and copy the following jars to the jdbc-drivers plugin directory:

SparkJDBC41.jar

commons-codec-<version>.jar

hive_metastore.jar

hive_service.jar

libfb303-<version>.jar

libthrift-<version>.jar

ql.jar

TCLIServiceClient.jar

zookeeper-<version>.jar

In addition, copy your license file to the jdbc-drivers plugin directory.

Dremio

com.dremio.jdbc.Driver

Download the jar at https://download.siren.io/dremio-jdbc-driver-1.4.4-201801230630490666-6d69d32.jar and copy it to the jdbc-drivers plugin directory.

Impala

com.cloudera.impala.jdbc41.Driver

Download the latest JDBC bundle from https://www.cloudera.com/downloads/connectors/impala/jdbc/2-5-42.html, extract the bundle, then extract the JDBC 4.1 archive and copy the following jars to the jdbc-drivers plugin directory:

ImpalaJDBC41.jar

commons-codec-<version>.jar

hive_metastore.jar

hive_service.jar

libfb303-<version>.jar

libthrift-<version>.jar

ql.jar

TCLIServiceClient.jar

zookeeper-<version>.jar

Restart the JDBC node after copying the drivers.

API

Datasource management

The endpoint for datasource management is at /_siren/connector/datasources.

Datasource creation and modification

A datasource with a specific id can be updated by issuing a PUT request as follows:

PUT /_siren/connector/datasource/<id>
{
  "jdbc": {
    "username": "username",
    "password": "password",
    "driver": "com.db.Driver",
    "url": "jdbc:db://localhost:5432/default"
  }
}

Body parameters:

  • jdbc: the JDBC configuration of the datasource.

JDBC configuration parameters:

  • driver: the class name of the JDBC driver.

  • url: the JDBC url of the datasource.

  • username: the username that will be passed to the JDBC driver when getting a connection (optional).

  • password: the password that will be passed to the JDBC driver when getting a connection (optional).

  • timezone: if date and timestamp fields are stored in a timezone other than UTC, specifying this parameter will instruct the plugin to convert dates and times to/from the specified timezone when performing queries and retrieving results.

Datasource deletion

TO delete a datasource, issue a DELETE request as follows:

DELETE /_siren/connector/datasource/<id>
Datasource listing

To list the datasources configured in the system, issue a GET request as follows:

GET /_siren/connector/datasource/_search
Datasource validation

To validate the connection to a datasource, issue a POST request as follows:

POST /_siren/connector/datasource/<id>/_validate

Virtual index management

Virtual index creation and modification

A virtual index with a specific id can be updated by issuing a PUT request as follows:

PUT /_siren/connector/index/<id>
{
  "datasource": "ds",
  "catalog": "catalog",
  "schema": "schema",
  "resource": "table",
  "key": "id"
}

The id of a virtual index must be a valid lowercase Elasticsearch index name; it is recommended to start virtual indices with a common prefix to simplify handling of permissions.

Body parameters:

  • datasource: the id of an existing datasource.

  • resource: the name of a table or view on the remote datasource.

  • key: the name of a unique column; if a virtual index has no primary key it will be possible to perform aggregations, however queries that expect a reproducible unique identifier will not be possible.

  • catalog and schema: the catalog and schema containing the table specified in the resource parameter; these are usually required only if the connection does not specify a default catalog or schema.

Virtual index deletion

To delete a virtual index, issue a DELETE request as follows:

DELETE /_siren/connector/index/<id>

When a virtual index is deleted, the corresponding concrete index is not deleted automatically.

Virtual index listing

To list the virtual indices configured in the system, issue a GET request as follows:

GET /_siren/connector/index/_search

Operations on virtual indices

The plugin supports the following operations on virtual indices:

  • get mapping

  • get field capabilities

  • search

  • msearch

  • get

  • mget

Search requests involving a mixture of virtual and normal Elasticsearch indices (e.g. when using a wildcard) are not supported and will be rejected; it is however possible to issue msearch requests containing requests on normal Elasticsearch indices and virtual indices.

When creating a virtual index, the plugin will create an empty Elasticsearch index for interoperability with Search Guard and X-Pack; if an Elasticsearch index with the same name as the virtual index already exists and it is not empty, the virtual index creation will fail.

When deleting a virtual index, the corresponding Elasticsearch index will not be removed.

Type conversion

The plugin converts JDBC types to their closest Elasticsearch equivalent:

  • String types are handled as keyword fields.

  • Boolean types are handled as boolean fields.

  • Date and timestamp are handled as date fields.

  • Integer types are handled as long fields.

  • Floating point types are handled as double fields.

Complex JDBC types which are not recognized by the plugin are skipped during query processing and resultset fetching.

Supported search queries

The plugin supports the following queries:

  • match_all

  • term

  • terms

  • range

  • exists

  • prefix

  • wildcard

  • ids

  • bool

At this time the plugin provides no support for datasource specific full text search functions, so all these queries will work as if they were issued against keyword fields.

Supported aggregations

Currently the plugin provides support for the following aggregations:

Metric:

  • Average

  • Cardinality

  • Max

  • Min

  • Sum

Bucket:

  • Date histogram

  • Histogram

  • Date range

  • Range

  • Terms

  • Filters

Only terms aggregations can be nested inside a parent bucket aggregation.

Known Limitations

  • Cross backend join currently supports only integer keys.

  • Cross backend support has very different scalability according to the direction of the Join, a join which involves sending IDs to a remote system will be possibly hundreds of times less scalable (e.g. thousands vs millions) to one where the keys are fetched from a remote system.

  • Only terms aggregations can be nested inside a parent bucket aggregation.

  • The missing parameter in bucket aggregations is not supported.

  • Scripted fields are not supported.

  • When issuing queries containing string comparisons, the plugin does not force a specific collation, if a table behind a virtual indices uses a case insensitive collation, string comparisons will be case insensitive.

  • Wildcards on virtual index names are not supported by any API; a wildcard search will silently ignore virtual indices.

  • Currently cross cluster searches on virtual indices are not supported.

Troubleshooting

Cannot reconnect to datasource by hostname after DNS update

When the Java security manager is enabled, the JVM will cache name resolutions indefinitely; if the system you’re connecting to uses round-robin DNS or the IP address of the system changes frequently, you will need to modify the following Java Security Policy properties:

  • networkaddress.cache.ttl: the number of seconds to cache a successful DNS lookup. Defaults to -1 (forever).

  • networkaddress.cache.negative.ttl: the number of seconds to cache an unsuccessful DNS lookup. Defaults to 10, set to 0 to avoid caching.