Siren Federate offers three join strategies: the hash join, the broadcast join and the index join.
Each strategy has advantages and disadvantages, and choosing the right one can help to optimize system performance. For more information, see Configuring joins by type.
Joining numeric attributes is more efficient than joining string attributes. Elasticsearch uses a sorted-set doc values data structure to encode per-document values for a keyword field type. This data structure is not optimal for scanning a large number of keyword values. When possible, Siren Federate falls back to a more efficient strategy that is based on a scan of the inverted index, but this strategy has its own limitations.
If you are planning to join attributes of type keyword, we recommend generating a murmur hash of the string value at indexing time, storing it in a new numeric attribute, and using this new attribute for the join. Such index-time data transformation
can be easily done using the Fingerprint ingest processor, the Murmur3 mapper plugin, or the Logstash’s
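As a rough illustration of the recommendation above, the following Python sketch derives a numeric join key from a string value before a document is indexed. It implements the 32-bit x86 variant of MurmurHash3 in pure Python. The helper names (murmur3_32, add_join_key) and the field names are hypothetical. The values it produces are not guaranteed to match those generated by the tools listed above, so the same method should be used on both sides of a join. A 32-bit hash can also collide for distinct strings.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 32-bit MurmurHash3 (x86 variant) of data."""
    mask = 0xFFFFFFFF
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def rotl(x: int, r: int) -> int:
        # Rotate a 32-bit value left by r bits.
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    n_blocks = len(data) // 4

    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (rotl((k * c1) & mask, 15) * c2) & mask
        h ^= k
        h = (rotl(h, 13) * 5 + 0xE6546B64) & mask

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (rotl((k * c1) & mask, 15) * c2) & mask
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def add_join_key(doc: dict, source_field: str, target_field: str) -> dict:
    """Return a copy of doc with a numeric hash of a string field added."""
    enriched = dict(doc)
    enriched[target_field] = murmur3_32(doc[source_field].encode("utf-8"))
    return enriched


# Hypothetical document and field names, for illustration only.
doc = add_join_key({"email": "a@example.com"}, "email", "email_hash")
```

The enriched document keeps the original keyword attribute and adds the numeric attribute, which can then be mapped as a numeric type and used in the join.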
Collected tuples are transferred in one or more packets. The size of a packet affects performance. Smaller packets use less memory but increase CPU time on the receiver side, because the receiver has to reconstruct a tuple collection from many small packets (especially for a sorted tuple collection). By default, the packet size is set to 8 MB (which represents 1,048,576 tuples for a column of long datatype). The size can be configured using the setting key siren.io.pipeline.max_packet_size with a value representing the maximum size (in bytes) of a packet.
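The relationship between packet size and tuple count can be checked with a few lines of arithmetic. The sketch below assumes a single column of 8-byte long values and ignores any per-packet overhead, which the text above does not describe. The function names are hypothetical.

```python
import math

DEFAULT_MAX_PACKET_SIZE = 8 * 1024 * 1024  # 8 MB expressed in bytes
LONG_SIZE_BYTES = 8                        # size of one long value


def tuples_per_packet(packet_size_bytes: int = DEFAULT_MAX_PACKET_SIZE) -> int:
    """Number of single-column long tuples that fit in one packet."""
    return packet_size_bytes // LONG_SIZE_BYTES


def packets_needed(total_tuples: int,
                   packet_size_bytes: int = DEFAULT_MAX_PACKET_SIZE) -> int:
    """Number of packets required to transfer total_tuples tuples."""
    return math.ceil(total_tuples / tuples_per_packet(packet_size_bytes))


print(tuples_per_packet())                        # 1048576
print(packets_needed(10_000_000))                 # 10
print(packets_needed(10_000_000, 1024 * 1024))    # 77
```

Under these assumptions, transferring 10,000,000 long values takes 10 packets at the default size and 77 packets at 1 MB, which illustrates the extra reconstruction work that smaller packets place on the receiver.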
For more information, see
To optimize cache utilization, Elasticsearch recommends using the preference parameter, which controls the shard copies on which the search is executed.
By default, Elasticsearch selects from the available shard copies in an unspecified order, taking the allocation awareness and adaptive replica selection configuration into account. However, it may sometimes be desirable to try to route certain searches to certain sets of shard copies. For example, the preference parameter could be set to a custom string value like a session or user ID. This is particularly important in Siren Federate, because it helps to make better use of the join query cache.
For more information, see Tune for search speed.
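A minimal sketch of how a client might attach a session-based preference value to a search request is shown below. It only builds the request URL and does not contact a cluster. The function name, the host, the index name, and the siren endpoint prefix are assumptions made for illustration. The check on the leading underscore reflects the Elasticsearch rule that custom preference strings must not start with that character.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, index: str, session_id: str,
                     endpoint_prefix: str = "siren") -> str:
    """Build a search URL that routes one session to the same shard copies.

    endpoint_prefix is an assumed path prefix for the Federate search API.
    """
    if session_id.startswith("_"):
        # Values starting with "_" are reserved for built-in preferences.
        raise ValueError("custom preference values must not start with '_'")
    query = urlencode({"preference": session_id})
    return f"{base_url.rstrip('/')}/{endpoint_prefix}/{index}/_search?{query}"


# Hypothetical host, index, and session identifier.
url = build_search_url("http://localhost:9200", "orders", "session-42")
print(url)  # http://localhost:9200/siren/orders/_search?preference=session-42
```

Reusing the same preference value for every request in a session means that repeated joins are likely to reach the same shard copies, where cached results may already be available.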
The search-project task parallelizes its work by using a single worker per index segment. Therefore, exercise caution when considering a force-merge of an index.
Force-merging an index into a single segment degrades the search-project task’s performance, because the task can no longer parallelize its work.