Connecting to Remote Datasources

The Siren Federate plugin provides the capability to query data in remote datasources through the Elasticsearch API by mapping tables to virtual indices.

The plugin stores its configuration in two Elasticsearch indices:

  • .siren-federate-datasources: used to store the JDBC configuration parameters of remote datasources.

  • .siren-federate-indices: used to store the configuration parameters of virtual indices.

Other indices are also used for different features: - .siren-federate-ingestions: used to store the ingestions configurations. - .siren-federate-joblogs: used to store logs of ingestion jobs.

Datasources and virtual indices can be managed using the REST API or the user interface available in Siren Investigate.

These indices are created automatically when required.

Connecting to JDBC datasources

Siren Federate provides the capability to query data from a remote JDBC databases.

Settings

In order to send queries to virtual indices the Elasticsearch cluster must contain at least one node enabled to issue queries over JDBC; it is advised to use a coordinating only node for this role, although this is not a requirement for testing purposes.

JDBC node settings

In order to enable JDBC on a node where the Siren Federate plugin is installed, add the following setting to elasticsearch.yml:

node.attr.connector.jdbc: true

Then, create a directory named jdbc-drivers inside the configuration directory of the node (e.g. elasticsearch/config or /etc/elasticsearch).

Finally, copy the JDBC driver for your remote datasource and its dependencies to the jdbc-drivers directory created above and restart the node; see the JDBC driver installation and compatibility section for a list of compatible drivers and dependencies.

Common configuration settings

Encryption

JDBC passwords are encrypted by default using a predefined 128 bit AES key; before creating datasources, it is advised to generate a custom key by running the keygen.sh script included in the siren-federate plugin directory as follows:

bash plugins/siren-federate/tools/keygen.sh -s 128

The command will output a random base64 key; it is also possible to generate keys longer than 128 bit if your JVM supports it.

To use the custom key, the following parameters must be set in elasticsearch.yml on master nodes and on all the JDBC nodes:

  • siren.connector.encryption.enabled: true by default, can be set to false to disable JDBC password encryption.

  • siren.connector.encryption.secret_key: a base64 encoded AES key used to encrypt JDBC passwords.

Example elasticsearch.yml settings for a master node with a custom encryption key:

siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="

Example elasticsearch.yml settings for a JDBC node with a custom encryption key:

siren.connector.encryption.secret_key: "1zxtIE6/EkAKap+5OsPWRw=="
node.attr.connector.jdbc: true

Restart the nodes after changing the configuration to apply the settings.

JDBC driver installation and compatibility

The JDBC driver for your remote datasource and its dependencies must be copied to the jdbc-drivers subdirectory inside the configuration directory of JDBC nodes (e.g. elasticsearch/config/jdbc-drivers).

It is not required nor recommended to copy these drivers to nodes which are not enabled to execute queries.

You may create a sub-directory within jdbc-drivers to store a driver and to provide a custom security policy file for this driver. It is recommended to use this approach for drivers that come in multiple jars. A custom security policy file enables the definition of driver-specific permissions. The custom security policy file must be named security.policy and must be located inside the driver sub-directory. The following variables can be used within the policy file:

  • codebase.federate.common: Path to the directory storing the security.policy (defaults to jdbc-drivers if the default drivers-security.policy file is used)

  • codebase.federate.${jar_name}: Path to a driver JAR file. Here, ${jar_name} refers to the name of the JAR file that is stored in the directory where the security.policy file is located. If the default drivers-security.policy file is used, then the system uses JAR files that are stored in the jdbc-drivers directory. For example:

grant codeBase "${codebase.federate.postgresql-42.2.5.jar}" {
  // Permissions for postgresql-42.2.5.jar
}

If a security.policy is placed in the main jdbc-drivers directory, then it overrides the default drivers-security.policy

Restart the JDBC node after copying the drivers.

Table 1. List of supported JDBC drivers
Name JDBC class Notes

PostgreSQL

org.postgresql.Driver

Download the latest JDBC 4.2 driver from https://jdbc.postgresql.org/download.html and copy the postgresql-<version>.jar file to the jdbc-drivers directory.

MySQL

com.mysql.jdbc.Driver

Download the latest GA release from https://dev.mysql.com/downloads/connector/j/, extract it, then copy mysql-connector-java-<version>.jar to the jdbc-drivers plugin directory.

When writing the JDBC connection string, set the useLegacyDatetimeCode parameter to false to avoid issues when converting timestamps.

Oracle 12c+

oracle.jdbc.OracleDriver

Download the latest ojdbc8.jar from http://www.oracle.com/technetwork/database/features/jdbc/jdbc-ucp-122-3110062.html and copy it to the jdbc-drivers plugin directory.

Spark SQL 2.2+

com.simba.spark.jdbc41.Driver

The Magnitude JDBC driver for Spark can be purchased at https://www.simba.com/product/spark-drivers-with-sql-connector/; once downloaded, extract the bundle, then extract the JDBC 4.1 archive and copy the following jars to the jdbc-drivers plugin directory:

SparkJDBC41.jar

commons-codec-<version>.jar

hive_metastore.jar

hive_service.jar

libfb303-<version>.jar

libthrift-<version>.jar

ql.jar

TCLIServiceClient.jar

zookeeper-<version>.jar

In addition, copy your license file to the jdbc-drivers plugin directory.

Neo4j

org.neo4j.jdbc.http.HttpDriver

Download the driver from https://mvnrepository.com/artifact/org.neo4j/neo4j-jdbc-driver/3.4.0 and copy the jar to the jdbc-drivers directory.

The following JDBC drivers are used by our customers but not officially supported. They are used as is.

Table 2. List of JDBC drivers used as is.
Name JDBC class Notes

Microsoft SQL Server 2014 or greater

com.microsoft.sqlserver.jdbc.SQLServerDriver

Download sqljdbc_<version>_enu.tar.gz from https://www.microsoft.com/en-us/download/details.aspx?id=55539, extract it, then copy mssql-jdbc-<version>.jre8.jar to the jdbc-drivers directory.

Sybase ASE 15.7+

com.sybase.jdbc4.jdbc.SybDriver

OR

net.sourceforge.jtds.jdbc.Driver

To use the FreeTDS driver, download the latest version from https://sourceforge.net/projects/jtds/files/, extract it, then copy jtds-<version>.jar to the jdbc-drivers directory.

To use the jConnect driver, copy jConnect-<version>.jar from your ASE directory to the jdbc-drivers directory.

Presto

com.facebook.presto.jdbc.PrestoDriver

Download the latest JDBC driver from https://prestodb.io/docs/current/installation/jdbc.html and copy it to the jdbc-drivers plugin directory.

Dremio

com.dremio.jdbc.Driver

Download the jar at https://download.siren.io/dremio-jdbc-driver-1.4.4-201801230630490666-6d69d32.jar and copy it to the jdbc-drivers plugin directory.

Impala

com.cloudera.impala.jdbc41.Driver

Download the latest JDBC bundle from https://www.cloudera.com/downloads/connectors/impala/jdbc/2-5-42.html, extract the bundle, then extract the JDBC 4.1 archive and copy the following jars to the jdbc-drivers plugin directory:

ImpalaJDBC41.jar

commons-codec-<version>.jar

hive_metastore.jar

hive_service.jar

libfb303-<version>.jar

libthrift-<version>.jar

ql.jar

TCLIServiceClient.jar

zookeeper-<version>.jar

Operations on virtual indices

The plugin supports the following operations on virtual indices:

  • get mapping

  • get field capabilities

  • search

  • msearch

  • get

  • mget

Search requests involving a mixture of virtual and normal Elasticsearch indices (e.g. when using a wildcard) are not supported and will be rejected; it is however possible to issue msearch requests containing requests on normal Elasticsearch indices and virtual indices.

When creating a virtual index, the plugin will create an empty Elasticsearch index for interoperability with Search Guard and Elastic Stack Security; if an Elasticsearch index with the same name as the virtual index already exists and it is not empty, the virtual index creation will fail.

When deleting a virtual index, the corresponding Elasticsearch index will not be removed.

Example of a search:

GET /siren/twitter/_search
{
  "query": {
    "match": {
      "user": "siren"
    }
  }
}

The API returns the following response:

{
  "took": 150,
  "timed_out": false,
  "hits": {
    "total" : {
        "value": 1,
        "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "twitter",
        "_id": "0",
        "_score": 2,
        "_source": {
          "user": "siren",
          "date": "2019-11-16T12:12:12",
          "message": "trying out Siren",
          "likes": 0
        }
      }
    ]
  }
}

Type conversion

The plugin converts JDBC types to their closest Elasticsearch equivalent:

  • String types are handled as keyword fields.

  • Boolean types are handled as boolean fields.

  • Date and timestamp are handled as date fields.

  • Integer types are handled as long fields.

  • Floating point types are handled as double fields.

Complex JDBC types which are not recognized by the plugin are skipped during query processing and resultset fetching.

Supported search queries

The plugin supports the following queries:

  • match_all

  • term

  • terms

  • range

  • exists

  • prefix

  • wildcard

  • ids

  • bool

At this time the plugin provides no support for datasource specific full text search functions, so all these queries will work as if they were issued against keyword fields.

Supported aggregations

Currently the plugin provides support for the following aggregations:

Metric:

  • Average

  • Cardinality

  • Max

  • Min

  • Sum

Bucket:

  • Date histogram

  • Histogram

  • Date range

  • Range

  • Terms

  • Filters

Only terms aggregations can be nested inside a parent bucket aggregation.

Troubleshooting

Cannot reconnect to datasource by hostname after DNS update

When the Java security manager is enabled, the JVM will cache name resolutions indefinitely; if the system you’re connecting to uses round-robin DNS or the IP address of the system changes frequently, you will need to modify the following Java Security Policy properties:

  • networkaddress.cache.ttl: the number of seconds to cache a successful DNS lookup. Defaults to -1 (forever).

  • networkaddress.cache.negative.ttl: the number of seconds to cache an unsuccessful DNS lookup. Defaults to 10, set to 0 to avoid caching.

Connecting to Remote Elasticsearch Clusters

Siren Federate provides the capability to query data from an Elasticsearch remote cluster through the Remote Clusters Module and the Federate Connector APIs.

NOTE: This connector, unlike the JDBC connector, supports wildcard index patterns.

Configuring the Remote Cluster

To send queries from a cluster (let’s call it the coordinator) to remote Elasticsearch clusters, the remote clusters must be configured as described in Configuring remote clusters.

The Siren Federate plugin has to be installed on the remote clusters.

This example shows how to set up the remote Elasticsearch clusters:

curl -X PUT http://localhost:9200/_cluster/settings -H 'Content-type: application/json' -d '
{
    "persistent": {
        "cluster": {
            "remote": {
                "remotefederate": {
                    "seeds": [
                        "127.0.0.1:9330"
                    ]
                }
            }
        }
    }
}
'

Configuring the Datasource

A datasource must first be defined as an alias to the remote cluster. Datasources are created in the coordinator cluster using the Federate Connector APIs.

curl -X PUT http://localhost:9200/_siren/connector/datasource/remotefederateds -H 'Content-type: application/json' -d '
  {
    "elastic": {
      "alias": "remotefederate"
    }
  }
  '

Configuring the Virtual Index

Let’s assume our remote cluster remotefederate has indices called logs-2019.01, logs-2019.02, …​, logs-2019.12.

Using a Wildcard Index Pattern

Let’s define a virtual index on the coordinator cluster that matches the wildcard index pattern logs-* using the Federate Virtual Index API:

curl -X PUT http://localhost:9200/_siren/connector/index/logsvi -H 'Content-type: application/json' -d '
{
  "datasource": "remotefederateds",
  "resource": "logs-*",
  "key": "_id"
}
'

Assuming the coordinator cluster has an index called machines which contains information on IP addresses on machines of interest, and that we would like to find out about the logs associated to these machines, you can execute the following Federate JOIN query to do so:

curl -X GET http://localhost:9200/siren/logsvi/_search -H 'Content-Type: application/json' -d '
{
    "query": {
        "join": {
            "indices": [
                "machines"
            ],
            "on": [
                "logs_ip_hash",
                "machines_ip_hash"
            ],
            "request": {
                "query": {
                    "match_all": {

                    }
                }
            }
        }
    }
}
'

logs_ip_hash is the IP field in the index logsvi and machines_ip_hash is the IP field in the index machines.

The API returns the following response:

{
  "took": 150,
  "timed_out": false,
  "hits": {
    "total" : {
        "value": 1,
        "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "logs-2019-11-12",
        "_id": "0",
        "_score": 2,
        "_source": {
          "date": "2019-11-12T12:12:12",
          "message": "trying out Siren"
        }
      }
    ]
  }
}

Known limitations

In order to take advantage of Federate with a remote cluster, at the moment a coordinator Federate cluster must run 7.3.2-19.0 up and the remote cluster must run Federate version from 7.3.2-19.0 up.

Search Guard Compatibility

The connector is compatible with Search Guard. One can define Search Guard users with roles to secure the remote clusters and the coordinator cluster.

Each cluster must have the same user that has permissions to access the cluster datasources, indices and virtual indices in order to properly execute Federate search requests on remote clusters.

Using curl and a Search Guard user called admin, the command would start like this:

curl -k -uadmin:password -X PUT https://localhost:9200/<some API request> ...

More information is available on the Search Guard website.

Known Limitations

Limitations for all the connectors

  • Cross backend join currently supports only integer keys.

  • Cross backend support has very different scalability according to the direction of the Join, a join which involves sending IDs to a remote system will be possibly hundreds of times less scalable (e.g. thousands vs millions) to one where the keys are fetched from a remote system.

  • Currently cross cluster searches on virtual indices are not supported.

Limitations for the JDBC Connector

  • Wildcards on virtual index names are not supported by any API; a wildcard search will silently ignore virtual indices.

  • Comma-separated lists of index patterns which target virtual indices are not supported.

  • Document-level security and field-level security are currently not supported.

  • Only terms aggregations can be nested inside a parent bucket aggregation.

  • The missing parameter in bucket aggregations is not supported.

  • Scripted fields are not supported.

  • When issuing queries containing string comparisons, the plugin does not force a specific collation, if a table behind a virtual indices uses a case insensitive collation, string comparisons will be case insensitive.

  • Complex types are supported when their property types are scalar (text, numbers, boolean) or collections (list, map).

  • Arrays of complex type are supported if the complex type meets the previous requirement.