Introduction
The Siren Federate plugin is a plugin for Elasticsearch that extends Elasticsearch with (1) a federation layer to query external databases with the Elasticsearch API and (2) join capabilities across indices and external databases.
Once configured, an external database is mapped to a “Virtual Index” in Elasticsearch. Elasticsearch’s requests on a
virtual index, such as Get Mapping
or Search
, are intercepted by the Siren Federate plugin, translated into the
external database dialect and executed against the external database. This enables Siren Investigate to create and
display dashboards for data located both in Elasticsearch indices and external databases.
The Siren Federate plugin also extends the Elasticsearch DSL with a join
query clause which enables the user to
execute a join between indices. The join capabilities are implemented on top of a in-memory distributed computing layer
which scales with the number of nodes available in the cluster.
The current join capabilities is currently limited to a (left) semi-join between two set of documents
based on a common attribute, where the result only contains the attributes of one of the joined set of documents.
This join is used to filter one document set based on a second document set, hence its name. It is equivalent
to the EXISTS()
operator in SQL. Joins on both numerical and textual fields are supported, but the joined attributes must be of the
same type. You can also freely combine and nest multiple joins using boolean operators (conjunction,
disjunction, negation) to create complex query plans. It is fully integrated with the Elasticsearch API and is
compatible with distributed environments.
Architecture Overview
Our core requirements for Siren Federate are:
-
Low latency, real time interactive response – Siren Federate is designed to power ad hoc interactive, read only queries such as Siren Investigate.
-
Implement a full featured relational algebra, capable of being extended for more advanced join conditions, operations and statistical optimizations.
-
Flexible distributed computational framework
-
Horizontal scaling of fully distributed operations, leveraging all the cluster memory.
-
Federated – capable of working on data that is not inside the cluster, for example via JDBC connections.
Siren Federate is based on the following high level architecture concepts:
-
A coordinator node which is in charge of the query parsing, query planning and query execution. We are leveraging the Apache Calcite engine to create a logical plan of the query, optimise the logical plan and execute a physical plan.
-
A set of worker processes that are in charge of executing the physical operations. Depending on the type of physical operation, a worker process is spawned on a per node or per shard basis.
-
An in-memory distributed file system that is used by the worker nodes to exchange data, with a compact columnar data representation optimized for analytical data processing, zero copy and zero data serialisation.
How Does Siren Federate Join Compare With Parent-Child
The Siren Federate join is similar in nature to the Parent-Child feature of Elasticsearch: they perform a join at query-time. However, there are important differences between them:
-
The parent document and all of its children must live on the same shard, which limits its flexibility. The Siren Federate join removes this constraint and is therefore more flexible: it allows to join documents across shards and across indices.
-
Thanks to the data locality of the Parent-Child model, joins are faster and more scalable. The Siren Federate join on the contrary needs to transfer data across the network to compute joins across shards, limiting its scalability and performance.
There is no “one size fits all” solution to this problem, and you need to understand your requirements to choose the proper solution. As a basic rule, if your data model and data relationships are purely hierarchical (or can be mapped to a purely hierarchical model), then the Parent-Child model might be more appropriate. If on the contrary you need to query both directions of a data relationship, then the Siren Federate join might be more appropriate.
On Which Data Model It Operates
The most important requirement for executing a join is to have a common shared attribute between two indices.
For example, let’s take a simple relational data model composed of two tables, Articles
and Companies
, and of one
junction table ArticlesMentionCompanies
to encode the many-to-many relationships between them.
This model can be mapped to two Elasticsearch indices, Articles
and Companies
. An article document will have
a multi-valued field mentions
with the unique identifiers of the companies mentioned in the article.
In other words, the field mentions
is a foreign key in the Articles
table that refers to the primary key of
the Companies
table.
It should be straightforward for someone to write an SQL statement to flatten and map relationships into a single multi-valued field. We can see that, compared to a traditional database model where a junction table is necessary, the model is simplified by leveraging multi-valued fields.
Getting Started
In this short guide, you will learn how you can quickly install the Siren Federate plugin in Elasticsearch, load two collections of documents inter-connected by a common attribute, and execute a relational query across the two collections within the Elasticsearch environment.
Prerequisites
This guide requires that you have downloaded and installed the Elasticsearch 5.6.4 distribution on your computer. If you do not have an Elasticsearch distribution, you can run the following commands:
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.4.zip
$ unzip elasticsearch-5.6.4.zip
$ cd elasticsearch-5.6.4
Installing the Siren Federate Plugin
Before starting Elasticsearch, you have to install the Siren Federate plugin. Assuming that you are in your Elasticsearch installation directory, you can run the following command:
$ ./bin/elasticsearch-plugin install file:///PATH-TO-SIREN-FEDERATE-PLUGIN/siren-federate-5.6.4-10.0.2-plugin.zip
-> Downloading file:///PATH-TO-SIREN-FEDERATE-PLUGIN/siren-federate-5.6.4-10.0.2-plugin.zip
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.io.FilePermission cloudera.properties read
* java.io.FilePermission simba.properties read
* java.lang.RuntimePermission accessClassInPackage.sun.misc
* java.lang.RuntimePermission accessClassInPackage.sun.misc.*
* java.lang.RuntimePermission accessClassInPackage.sun.security.provider
* java.lang.RuntimePermission accessDeclaredMembers
* java.lang.RuntimePermission createClassLoader
* java.lang.RuntimePermission getClassLoader
...
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.
Continue with installation? [y/N]y
-> Installed siren-federate
In case you want to remove the plugin, you can run the following command:
$ bin/elasticsearch-plugin remove siren-federate
-> Removing siren-federate...
Removed siren-federate
Starting Elasticsearch
To launch Elasticsearch, run the following command:
$ ./bin/elasticsearch
In the output, you should see a line like the following which indicates that the Siren Federate plugin is installed and running:
[2017-04-11T10:42:02,209][INFO ][o.e.p.PluginsService ] [etZuTTn] loaded plugin [siren-federate]
Loading Some Relational Data
We will use a simple synthetic dataset for the purpose of this demo. The dataset consists of two collections
of documents: Articles and Companies. An article is connected to a company with the attribute mentions
.
Articles will be loaded into the articles
index and companies in the companies
index. To load the dataset, run
the following command:
$ curl -XPUT 'http://localhost:9200/articles'
$ curl -XPUT 'http://localhost:9200/articles/_mapping/article' -d '
{
"properties": {
"mentions": {
"type": "keyword"
}
}
}
'
$ curl -XPUT 'http://localhost:9200/companies'
$ curl -XPUT 'http://localhost:9200/companies/_mapping/company' -d '
{
"properties": {
"id": {
"type": "keyword"
}
}
}
'
$ curl -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "1" } }
{ "title" : "The NoSQL database glut", "mentions" : ["1", "2"] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "2" } }
{ "title" : "Graph Databases Seen Connecting the Dots", "mentions" : [] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "3" } }
{ "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] }
{ "index" : { "_index" : "articles", "_type" : "article", "_id" : "4" } }
{ "title" : "MapR ships Apache Drill", "mentions" : ["4"] }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "1" } }
{ "id": "1", "name" : "Elastic" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "2" } }
{ "id": "2", "name" : "Orient Technologies" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "3" } }
{ "id": "3", "name" : "Cloudera" }
{ "index" : { "_index" : "companies", "_type" : "company", "_id" : "4" } }
{ "id": "4", "name" : "MapR" }
'
{
"took" : 8,
"errors" : false,
"items" : [ {
"index" : {
"_index" : "articles",
"_type" : "article",
"_id" : "1",
"_version" : 3,
"status" : 200
}
},
...
}
Relational Querying of the Data
We will now show you how to execute a relational query across the two indices. For example, we would like
to retrieve all the articles that mention companies whose name matches orient
. This relational query can be decomposed in
two search queries: the first one to find all the companies whose name matches orient
, and a second
query to filter out all articles that do not mention a company from the first result set. The Siren Federate plugin
introduces a new Elasticsearch’s filter, named join
, that allows to
define such a query plan and a new search API _search
that allows to execute this query plan.
Below is the command to run the relational query:
$ curl -XGET 'http://localhost:9200/siren/articles/_search?pretty' -d '{
"query" : {
"join" : { (1)
"indices" : ["companies"], (2)
"on" : ["mentions", "id"], (3)
"request" : { (4)
"query" : {
"term" : {
"name" : "orient"
}
}
}
}
}
}'
-
The
join
query clause -
The source indices (i.e.,
companies
) -
The clause specifying the paths for join keys in both source and target indices
-
The search request that will be used to filter out companies
The command should return you the following response with two search hits:
{
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_index" : "articles",
"_type" : "article",
"_id" : "1",
"_score" : 1.0,
"_source":{ "title" : "The NoSQL database glut", "mentions" : ["1", "2"] }
}, {
"_index" : "articles",
"_type" : "article",
"_id" : "3",
"_score" : 1.0,
"_source":{ "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] }
} ]
}
}
You can also reverse the order of the join, and query for all the companies that are mentioned
in articles whose title matches nosql
:
$ curl -XGET 'http://localhost:9200/siren/companies/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["articles"],
"on": ["id", "mentions"],
"request" : {
"query" : {
"term" : {
"title" : "nosql"
}
}
}
}
}
}'
The command should return you the following response with three search hits:
{
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "companies",
"_type" : "company",
"_id" : "4",
"_score" : 1.0,
"_source":{ "id": "4", "name" : "MapR" }
}, {
"_index" : "companies",
"_type" : "company",
"_id" : "1",
"_score" : 1.0,
"_source":{ "id": "1", "name" : "Elastic" }
}, {
"_index" : "companies",
"_type" : "company",
"_id" : "2",
"_score" : 1.0,
"_source":{ "id": "2", "name" : "Orient Technologies" }
} ]
}
}
Federate API
Node-level Settings
|
Use the volcano statistical optimizer when selecting the join algorithms. Defaults to |
|
Defines which distributed join algorithm to be selected when optimizing a request. Valid values are either |
|
Use contextual queries when computing statistics. If |
Search API
This plugin introduces two new search actions, /siren/[INDICES]/_search
that replaces the /[INDICES]/_search
action,
and /siren/[INDICES]/_msearch
that replaces the /[INDICES]/_msearch
action. Both actions are extensions of the original
Elasticsearch actions and therefore supports the same API. One must use these actions with the join
query clause,
as the join
query clause is not supported by the original Elasticsearch actions.
Search Request Parameters
|
Enable query planner debug. Default to |
|
A task timeout, bounding a task to be executed within the specified time value (in milliseconds) and bail with the values accumulated up to that point when expired. Default to no timeout. |
Query Syntax
-
join
: the name of the join query clause -
type
: the type of the join. Valid values are eitherBROADCAST_JOIN
,HASH_JOIN
orMERGE_JOIN
. -
indices
: the index names that will be joined with the source indices (optional, default to all indices). -
types
: the index types that will be joined with the source indices (optional, default to all types). -
on
: an array specifying the paths for join keys in both source and target indices -
request
: the search request that will be used to filter out documents before performing the join
Example
In this example, we will join all the documents from index1
with the documents of index2
.
The query first filters documents from index2
and of type type
with the query
{ "terms" : { "tag" : [ "aaa" ] } }
. It then retrieves the ids of the documents from the field id
specified by the parameter on
. The list of ids is then used as filter and applied on the field
foreign_key
of the documents from index1
.
GET /siren/index1/_search
{
"join" : {
"type": "HASH_JOIN",
"indices" : ["index2"],
"types" : ["type"],
"on" : ["foreign_key", "id"],
"request" : {
"query" : {
"terms" : {
"tag" : [ "aaa" ]
}
}
}
}
}
Response Format
The response returned by the Federate’s search API is identical to the response returned by Elasticsearch’s search API.
Performance Considerations
Join Types
Siren Federate includes different join strategies: “Broadcast Join”, “Hash Join” and “Merge Join”. Each one has its pros and cons and the optimal performance will depends on the scenario. By default, the Siren Federate planner will try to automatically pick the best strategy, but it might be best in certain scenarios to pick manually one of the strategies.
The Broadcast Join is best when filtering a large index with a small set of documents. The Hash Join and Merge Join are fully distributed and are designed to handle large joins. They both scales horizontally (based on the number of nodes) and vertically (based on the number of cpu cores). Currently, the Hash Join usually performs better in many scenarios compared to the Merge Join.
Siren Federate provides two fully distributed join algorithms: the Hash Join and the Sort-Merge Join. Each one is designed for leveraging multi-core architecture. This is achieved by creating many small data partitions during the Project phase. Each node of the cluster will receive a number of partitions that are dependent of the number of cpus. Partitions are independent from each other and can be processed independently by a different join worker thread. During the join phase, each worker thread will join tuples from one partition. The number of join worker threads scales automatically with the number of cpu cores available.
The Hash Join is performed in two phases: build and probe. The build phase creates a in-memory hash table of one of the relation in the partition. The probe phase then scans the second relation and probes the hash table to find the matching tuples.
The Sort-Merge Join instead requires a sort phase of the two relations during the project phase. It then performs a linear scan over the two sorted relations to find the matching tuples.
Compared to the Hash Join, the Sort-Merge Join does not require additional memory since it does not have to build a in-memory hash table. However, it requires a sort operation to be executed during the project phase. It is in fact trading cpu for memory.
Numeric vs String Attributes
Joining numeric attributes is more efficient than joining string attributes. If you are planning to join attributes of
type string
, we recommend to generate a murmur hash of the string value at indexing time into a new attribute, and use this new attribute for the join.
Such index-time data transformation can be easily done using Logstash’s fingerprint
plugin.
Tuple Collector Settings
Tuple Collectors are sending batches of tuples of fixed size. The size of a batch
has an impact on the performance. Smaller batches will take less memory but will increase
cpu times on the receiver side since it will have to reconstruct a tuple collection from many
small batches (especially for sorted tuple collection). By default, the size of a batch of tuple is
set to 1048576 tuples (which represents 8mb for a column of long datatype). The size can be configured
using the setting key siren.io.tuple.collector.batch_size
with a integer value representing the
maximum number of tuples in a batch.
License API for Siren Federate
Federate includes a license manager service and a set of rest commands to register, verify and delete a Siren’s license.
Without a valid license, Federate will log a message to notify that the current license is invalid at every request.
Usage
Let’s assume you have a Siren license named license.sig
. You can upload and register this license in
Elasticsearch using the command:
$ curl -XPUT -T "license.sig" 'http://localhost:9200/_siren/license'
---
acknowledged: true
You can then check the status of the license using the command:
$ curl -XGET 'http://localhost:9200/_siren/license'
{
"license" : {
"content" : {
"valid-date" : "2016-05-16",
"issue-date" : "2016-04-15",
"max-nodes" : "12"
},
"isValid" : true
}
}
To delete a license from Elasticsearch, you can use the command:
$ curl -XDELETE 'http://localhost:9200/_siren/license'
{"acknowledged":true}
Connecting to JDBC datasources
The Siren Federate plugin provides the capability to query data in remote datasources through the Elasticsearch API by mapping tables to virtual indices.
The plugin stores its configuration in two Elasticsearch indices:
-
.siren-federate-datasources
: used to store the JDBC configuration parameters of remote datasources. -
.siren-federate-indices
: used to store the configuration parameters of virtual indices.
Datasources and virtual indices can be managed using the REST API or the user interface available in Siren Investigate.
These indices are created automatically when required.
Settings
In order to send queries to virtual indices the Elasticsearch cluster must contain at least one node enabled to issue queries over JDBC; it is advised to use a coordinating only node for this role, although this is not a requirement for testing purposes.
JDBC node settings
In order to enable JDBC on a node where the Siren Federate plugin is installed,
add the following setting to elasticsearch.yml
:
node.attr.connector.jdbc: true
Then, create a directory named jdbc-drivers
inside the configuration directory
of the node (e.g. elasticsearch/config
or /etc/elasticsearch
).
Finally, copy the JDBC driver for your remote datasource and its dependencies
to the jdbc-drivers
directory created above and restart the node; see the
[siren-federate-jdbc-compat] section for a list of compatible drivers and
dependencies.
Common configuration settings
Encryption
JDBC passwords are encrypted by default using a predefined 128 bit AES key;
before creating datasources, it is advised to generate a custom key by running
the keygen.sh
script included in the siren-federate
plugin directory as
follows:
bash plugins/siren-federate/tools/keygen.sh -s 128
The command will output a random base64 key; it is also possible to generate keys longer than 128 bit if your JVM supports it.
To use the custom key, the following parameters must be set in
elasticsearch.yml
on master nodes and on all the JDBC nodes:
-
siren.connector.encryption.enabled
:true
by default, can be set tofalse
to disable JDBC password encryption. -
siren.connector.encryption.secret_key
: a base64 encoded AES key used to encrypt JDBC passwords.
Example elasticsearch.yml
settings for a master node with a custom encryption
key:
siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="
Example elasticsearch.yml
settings for a JDBC node with a custom encryption
key:
siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="
node.attr.connector.jdbc: true
Restart the nodes after changing the configuration to apply the settings.
Cluster wide settings
The following parameters can be set in elasticsearch.yml
on JDBC nodes
or by using the Elasticsearch cluster update settings API:
-
siren.connector.siren.timeout.connection
: the maximum amount of seconds to wait when establishing or acquiring a JDBC connection (30
by default). -
siren.connector.timeout.query
: the maximum execution time for JDBC queries, in seconds (30
by default). -
siren.connector.enable_union_aggregations
:true
by default, can be set to false to disable the use of unions in nested aggregations. -
siren.connector.query.max_result_rows
: the maximum number of rows that will be retrieved from a resultset when performing a join across datasources. Defaults to50000
. -
siren.connector.query.max_bucket_queries
: the maximum number of JDBC queries that will be generated to compute aggregation buckets. Defaults to500
.
Additional node settings
The following settings can be used to tune query processing on JDBC enabled nodes:
-
siren.connector.pool.size
: the number of threads that will be allocated to process the execution of queries to remote datasources; by default it is set toint((number of available_processors * 3) / 2) + 1
. -
siren.connector.pool.queue
: the maximum number of requests that should be queued if all the threads are busy. Defaults to40
.
Authentication
The Federate server role
If your cluster is protected by Search Guard or Elastic X-Pack, it is required to define a role with access to the Federate indices and internal operations and to create a user with this role.
For interoperability with these plugins, whenever a virtual index is created the Federate plugin creates a concrete Elasticsearch index with the same name as the virtual index; when starting up, the Federate plugin will check for missing concrete indices and will attempt to create them automatically.
Sample Search Guard role definition:
federateserver:
cluster:
- "indices:admin/aliases"
indices:
?siren-federate-datasources:
'*':
- ALL
?siren-federate-indices:
'*':
- ALL
?siren-federate-target:
'*':
- ALL
Sample X-Pack role definition:
{
"cluster": [
"monitor",
"cluster:admin/siren/connector"
],
"indices" : [
{
"names" : [ "*" ],
"privileges" : [ "create_index", "indices:data/read/get", "indices:admin/siren/connector" ]
},
{
"names" : [ ".siren-federate-*" ],
"privileges" : [ "all", "indices:admin/siren/connector" ]
}
]
}
Then create a user with that role e.g. a user called federateserver
.
Example elasticsearch.yml
settings for a master node in a cluster with
authentication and federateserver
user:
siren.connector.username: federateserver
siren.connector.password: password
siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="
Example elasticsearch.yml
settings for a JDBC node in a cluster with
authentication and federateserver
user:
siren.connector.username: federateserver
siren.connector.password: password
siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="
node.attr.connector.jdbc: true
Restart the nodes after setting the appropriate configuration parameters.
Administrative role
In order to manage datasources and virtual indices, it is required to
grant the cluster:admin/siren/connector/*
permissions at the cluster
level.
In addition, the user must have the indices:admin/siren/connector/
and
indices:data/siren/connector/
permissions on all the index names that he’s
allowed to define, in addition to create, write, read and search permissions.
Write permissions are required because when a virtual index is defined the plugin will create a concrete Elasticsearch index with the same name for interoperability with authentication plugins, unless such an index already exists.
Example Search Guard role allowed to manage virtual indices starting
with db-
:
sirenadmin:
cluster:
- SIREN_CLUSTER
- cluster:admin/plugin/siren/license/put
- cluster:admin/plugin/siren/license/get
- cluster:admin/siren/connector/*
indices:
'db-*':
'*':
- SIREN_READWRITE
- indices:admin/create
- indices:admin/siren/connector/*
'*':
'*':
- SIREN_COMPOSITE
Example X-Pack role allowed to manage virtual indices starting with db-
:
{
"cluster": [
"cluster:admin/siren/connector"
"cluster:admin/plugin/siren/license",
"cluster:siren/internal",
"manage"
],
"indices" : [
{
"names" : [ "*" ],
"privileges" : [ "indices:siren/mplan" ]
},
{
"names" : [ "db-*" ],
"privileges" : [
"read",
"create_index",
"view_index_metadata",
"indices:data/siren",
"indices:siren",
"indices:admin/version/get",
"indices:admin/get",
"indices:admin/siren/connector"
]
}
]
}
Search role
In order to search virtual indices, users must have the
indices:data/siren/connector/*
permissions on these indices in addition to
standard standard read and search permissions.
Example Search Guard role allowed to search virtual indices starting with db-
:
sirenuser:
cluster:
- SIREN_CLUSTER
indices:
'*':
'*':
SIREN_COMPOSITE
'db-*':
'*':
- SIREN_READONLY
- indices:data/siren/connector/*
Example X-Pack role allowed to search virtual indices starting with db-
:
{
"cluster": [
"cluster:admin/plugin/siren/license/get",
"cluster:siren/internal"
],
"indices" : [
{
"names" : [ "*" ],
"privileges" : [ "indices:siren/mplan" ]
},
{
"names" : [ "db-*" ],
"privileges" : [
"read",
"view_index_metadata",
"indices:data/siren",
"indices:siren",
"indices:admin/version/get",
"indices:admin/get"
]
}
]
}
JDBC driver installation and compatibility
The JDBC driver for your remote datasource and its dependencies must be copied
to the jdbc-drivers
subdirectory inside the configuration directory of JDBC
nodes (e.g. elasticsearch/config/jdbc-drivers
).
It is not required nor recommended to copy these drivers to nodes which are not enabled to execute queries.
Name | JDBC class | Notes |
---|---|---|
PostgreSQL |
org.postgresql.Driver |
Download the latest JDBC 4.2 driver from
https://jdbc.postgresql.org/download.html and copy the
|
MySQL |
com.mysql.jdbc.Driver |
Download the latest GA release from
https://dev.mysql.com/downloads/connector/j/, extract it, then copy
When writing the JDBC connection string, set the |
Microsoft SQL Server 2014 or greater |
com.microsoft.sqlserver.jdbc.SQLServerDriver |
Download |
Sybase ASE 15.7+ |
com.sybase.jdbc4.jdbc.SybDriver OR net.sourceforge.jtds.jdbc.Driver |
To use the FreeTDS driver, download the latest version from
https://sourceforge.net/projects/jtds/files/, extract it, then copy
To use the jConnect driver, copy |
Oracle 12c+ |
oracle.jdbc.OracleDriver |
Download the latest |
Presto |
com.facebook.presto.jdbc.PrestoDriver |
Download the latest JDBC driver from https://prestodb.io/docs/current/installation/jdbc.html
and copy it to the |
Spark SQL 2.2+ |
com.simba.spark.jdbc41.Driver |
The Magnitude JDBC driver for Spark can be purchased at
https://www.simba.com/product/spark-drivers-with-sql-connector/; once
downloaded, extract the bundle, then extract the JDBC 4.1 archive and copy the
following jars to the In addition, copy your license file to the |
Dremio |
com.dremio.jdbc.Driver |
Download the jar at https://download.siren.io/dremio-jdbc-driver-1.4.4-201801230630490666-6d69d32.jar
and copy it to the |
Impala |
com.cloudera.impala.jdbc41.Driver |
Download the latest JDBC bundle from
https://www.cloudera.com/downloads/connectors/impala/jdbc/2-5-42.html, extract
the bundle, then extract the JDBC 4.1 archive and copy the following jars to the
|
Restart the JDBC node after copying the drivers.
API
Datasource management
The endpoint for datasource management is at /_siren/connector/datasources
.
Datasource creation and modification
A datasource with a specific id
can be updated by issuing a PUT
request as follows:
PUT /_siren/connector/datasource/<id>
{
"jdbc": {
"username": "username",
"password": "password",
"driver": "com.db.Driver",
"url": "jdbc:db://localhost:5432/default"
}
}
Body parameters:
-
jdbc
: the JDBC configuration of the datasource.
JDBC configuration parameters:
-
driver
: the class name of the JDBC driver. -
url
: the JDBC url of the datasource. -
username
: the username that will be passed to the JDBC driver when getting a connection (optional). -
password
: the password that will be passed to the JDBC driver when getting a connection (optional). -
timezone
: if date and timestamp fields are stored in a timezone other than UTC, specifying this parameter will instruct the plugin to convert dates and times to/from the specified timezone when performing queries and retrieving results.
Datasource deletion
TO delete a datasource, issue a DELETE
request as follows:
DELETE /_siren/connector/datasource/<id>
Datasource listing
To list the datasources configured in the system, issue a GET
request as follows:
GET /_siren/connector/datasource/_search
Datasource validation
To validate the connection to a datasource, issue a POST
request as follows:
POST /_siren/connector/datasource/<id>/_validate
Virtual index management
Virtual index creation and modification
A virtual index with a specific id
can be updated by issuing a PUT
request as follows:
PUT /_siren/connector/index/<id>
{
"datasource": "ds",
"catalog": "catalog",
"schema": "schema",
"resource": "table",
"key": "id"
}
The id of a virtual index must be a valid lowercase Elasticsearch index name; it is recommended to start virtual indices with a common prefix to simplify handling of permissions.
Body parameters:
-
datasource
: the id of an existing datasource. -
resource
: the name of a table or view on the remote datasource. -
key
: the name of a unique column; if a virtual index has no primary key it will be possible to perform aggregations, however queries that expect a reproducible unique identifier will not be possible. -
catalog
andschema
: the catalog and schema containing the table specified in theresource
parameter; these are usually required only if the connection does not specify a default catalog or schema.
Virtual index deletion
To delete a virtual index, issue a DELETE
request as follows:
DELETE /_siren/connector/index/<id>
When a virtual index is deleted, the corresponding concrete index is not deleted automatically.
Virtual index listing
To list the virtual indices configured in the system, issue a GET
request as follows:
GET /_siren/connector/index/_search
Operations on virtual indices
The plugin supports the following operations on virtual indices:
-
get mapping
-
get field capabilities
-
search
-
msearch
-
get
-
mget
Search requests involving a mixture of virtual and normal Elasticsearch indices (e.g. when using a wildcard) are not supported and will be rejected; it is however possible to issue msearch requests containing requests on normal Elasticsearch indices and virtual indices.
When creating a virtual index, the plugin will create an empty Elasticsearch index for interoperability with Search Guard and X-Pack; if an Elasticsearch index with the same name as the virtual index already exists and it is not empty, the virtual index creation will fail.
When deleting a virtual index, the corresponding Elasticsearch index will not be removed.
Type conversion
The plugin converts JDBC types to their closest Elasticsearch equivalent:
-
String types are handled as
keyword
fields. -
Boolean types are handled as
boolean
fields. -
Date and timestamp are handled as
date
fields. -
Integer types are handled as
long
fields. -
Floating point types are handled as
double
fields.
Complex JDBC types which are not recognized by the plugin are skipped during query processing and resultset fetching.
Supported search queries
The plugin supports the following queries:
-
match_all
-
term
-
terms
-
range
-
exists
-
prefix
-
wildcard
-
ids
-
bool
At this time the plugin provides no support for datasource specific full text
search functions, so all these queries will work as if they were issued against
keyword
fields.
Supported aggregations
Currently the plugin provides support for the following aggregations:
Metric:
-
Average
-
Cardinality
-
Max
-
Min
-
Sum
Bucket:
-
Date histogram
-
Histogram
-
Date range
-
Range
-
Terms
-
Filters
Only terms aggregations can be nested inside a parent bucket aggregation.
Known Limitations
-
Cross backend join currently supports only integer keys.
-
Cross backend support has very different scalability according to the direction of the Join, a join which involves sending IDs to a remote system will be possibly hundreds of times less scalable (e.g. thousands vs millions) to one where the keys are fetched from a remote system.
-
Only terms aggregations can be nested inside a parent bucket aggregation.
-
The
missing
parameter in bucket aggregations is not supported. -
Scripted fields are not supported.
-
When issuing queries containing string comparisons, the plugin does not force a specific collation, if a table behind a virtual indices uses a case insensitive collation, string comparisons will be case insensitive.
-
Wildcards on virtual index names are not supported by any API; a wildcard search will silently ignore virtual indices.
-
Currently cross cluster searches on virtual indices are not supported.
Troubleshooting
Cannot reconnect to datasource by hostname after DNS update
When the Java security manager is enabled, the JVM will cache name resolutions indefinitely; if the system you’re connecting to uses round-robin DNS or the IP address of the system changes frequently, you will need to modify the following Java Security Policy properties:
-
networkaddress.cache.ttl
: the number of seconds to cache a successful DNS lookup. Defaults to-1
(forever). -
networkaddress.cache.negative.ttl
: the number of seconds to cache an unsuccessful DNS lookup. Defaults to10
, set to0
to avoid caching.