The Siren Federate plugin extends Elasticsearch with (1) a federation layer to query external databases with the Elasticsearch API and (2) distributed join capabilities across indices and external databases.
Siren Federate provides a module, called “Connector”, which transparently maps external database systems
to “Virtual Indices” in Elasticsearch. Requests to the Elasticsearch APIs, such as the
Search APIs, are
intercepted by the Connector module. These requests are translated to the external database dialect and executed
against the external database. This enables Siren Investigate to create and
display dashboards for data located in external databases as if they were Elasticsearch’s indices.
Siren Federate extends the Elasticsearch DSL with a
join query clause which enables a user to
execute a join between indices (being virtual or not). The join capabilities are implemented on top of an in-memory
distributed computing layer which scales with the number of nodes available in the cluster.
The current join capabilities is currently limited to a (left) semi-join between two set of documents
based on a common attribute, where the result only contains the attributes of one of the joined set of documents.
This join is used to filter one set of documents with a second document set. It is equivalent
EXISTS() operator in SQL. Joins on both numerical and textual fields are supported, but the joined attributes must be of the
same type. You can also freely combine and nest multiple joins using boolean operators (conjunction,
disjunction, negation) to create complex query plans. It is fully integrated with the Elasticsearch API and is
compatible with distributed environments.
The Siren Federate join is similar in nature to the Parent-Child feature of Elasticsearch: they perform a join at query-time. However, there are important differences between them:
The parent document and all of its children must live on the same shard, which limits its flexibility. The Siren Federate join removes this constraint and is therefore more flexible: it allows to join documents across shards and across indices.
Thanks to the data locality of the Parent-Child model, joins are faster and more scalable. The Siren Federate join on the contrary needs to transfer data across the network to compute joins across shards, limiting its scalability and performance.
There is no “one size fits all” solution to this problem, and you need to understand your requirements to choose the proper solution. As a basic rule, if your data model and data relationships are purely hierarchical (or can be mapped to a purely hierarchical model), then the Parent-Child model might be more appropriate. If on the contrary you need to query both directions of a data relationship, then the Siren Federate join might be more appropriate.
The most important requirement for executing a join is to have a common shared attribute between two indices.
For example, let’s take a simple relational data model composed of two tables,
Companies, and of one
ArticlesMentionCompanies to encode the many-to-many relationships between them.
This model can be mapped to two Elasticsearch indices,
Companies. An article document will have
a multi-valued field
mentions with the unique identifiers of the companies mentioned in the article.
In other words, the field
mentions is a foreign key in the
Articles table that refers to the primary key of
It should be straightforward for someone to write an SQL statement to flatten and map relationships into a single multi-valued field. We can see that, compared to a traditional database model where a junction table is necessary, the model is simplified by leveraging multi-valued fields.