Connecting to remote datasources
What is a remote datasource?
The Siren Federate plugin provides the capability to query data that is stored in external databases. The databases are connected by using the Elasticsearch API and by mapping database tables to virtual indices.
An analyst who is using Siren Investigate can then create dashboards by directly querying the remote datasources and displaying the resulting data alongside Elasticsearch data.
Two types of remote datasource can be configured; JDBC datasources and remote Elasticsearch clusters.
What is a virtual index?
After you connect remote datasources in Siren Platform, you create virtual indices to map the external database tables.
With virtual indices, data is never copied. Instead, queries that are sent to virtual indices are translated to the native dialect of the database system and back again in real time.
Datasources and virtual indices can be managed by using the REST API.
How do you query virtual indices?
Virtual indices can be queried by using one of the following APIs:
-
Standard Elasticsearch API: Allows you to query a virtual index as if it were a standard Elasticsearch index. However, note that this method does not support joins.
-
Siren Federate API: Allows you to use the join query clause on virtual indices. The Siren Federate query planner pushes down to the remote datasources the computation of query operators such as filters, aggregations, and joins.
Siren Federate supports sophisticated join capabilities across both real indices and virtual indices (federated).
There are three kinds of join operations:
-
Joins involving indices within the same external datasource: In this case, Siren Federate will simply push down the joins in the native back-end language, for example, SQL. The performance and scalability depends on the back-end system that Siren Federate is connected to.
-
Cross back-end joins (external datasource to external datasource, external datasource to Elasticsearch): The scalability of this operation is, in its current version, quite high from an external datasource to Elasticsearch, while limited in the opposite direction. Improvements are planned in future versions.
-
Joins across indices that are within the same Elasticsearch cluster. These are extremely scalable. Siren Federate augments existing Elasticsearch installations with an in-memory distributed computational layer. Search operations are pushed down to the Elasticsearch indices and then search results are distributed across the available Elasticsearch nodes for distributed join computation. This enables horizontal scaling, which leverages the entire cluster’s CPUs and memory.
Settings for remote datasources
The Siren Federate plugin stores the datasource configuration in two Elasticsearch indices:
-
.siren-federate-datasources
: The index that is used to store the JDBC configuration parameters of remote datasources. -
.siren-federate-indices
: The index that is used to store the configuration parameters of virtual indices.
Other indices are also used for different features:
-
.siren-federate-ingestions
: The index that is used to store the ingestion configurations. -
.siren-federate-joblogs
: The index that is used to store logs of ingestion jobs.
The Siren Federate Connector module supports the datasource node configuration settings. For more information, see the Connector module.
Operations on virtual indices
The Siren Federate plugin supports the following operations on virtual indices:
-
get mapping
-
get field capabilities
-
search
-
msearch
-
get
-
mget
Search requests that involve a mixture of virtual and normal Elasticsearch indices, for example, when using a wildcard, are not supported and will be rejected. It is, however, possible to issue |
Elasticsearch index for interoperability with Search Guard and Elastic Stack Security. If an Elasticsearch index with the same name as the virtual index already exists and it is not empty, the virtual index creation wilElasticsearch index is not removed. |
Known limitations with configuring remote datasources
The following limitations exist for all connectors:
-
A cross back-end join is limited to semi-join.
-
A cross back-end join supports only integer keys.
-
Cross back-end support has very different scalability according to the direction of the join. A join that involves sending IDs to a remote system can potentially be hundreds of times less scalable, to one where the keys are fetched from a remote system.
-
Cross cluster searches on virtual indices are not supported.