Performance Considerations
Join types
Siren Federate offers three join strategies: the hash join, the broadcast join and the index join.
Each strategy has advantages and disadvantages, and choosing the right one for a given join can help to optimize system performance. For more information, see Configuring joins by type.
Numeric versus string attributes
Joining numeric attributes is more efficient than joining string attributes. If you plan to join attributes of type string, we recommend generating a murmur hash of the string value at indexing time, storing it in a new attribute, and using this new attribute for the join. Such an index-time data transformation can easily be done using Logstash’s fingerprint plugin.
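As an illustration of the transformation described above, here is a minimal Python sketch that computes a 32-bit MurmurHash3 value for a string and stores it in a new numeric attribute before the document is indexed. The helper name add_join_key and the attribute names are hypothetical. This is a pure-Python sketch of the general technique and not the Logstash plugin's own code, so if some documents are hashed by Logstash and others by another tool, verify that both produce identical values for the same input.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(value: int, shift: int) -> int:
    """Rotate a 32-bit integer left by `shift` bits."""
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def _mix_block(k: int) -> int:
    """Scramble one block of up to 4 bytes (MurmurHash3 x86 32-bit constants)."""
    k = (k * 0xCC9E2D51) & MASK32
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 32-bit MurmurHash3 (x86 variant) of `data`."""
    h = seed & MASK32
    length = len(data)
    body_end = length - (length % 4)

    # Process the input four bytes at a time, little-endian.
    for i in range(0, body_end, 4):
        h ^= _mix_block(int.from_bytes(data[i:i + 4], "little"))
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Fold in the remaining one to three bytes, if any.
    tail = data[body_end:]
    if tail:
        h ^= _mix_block(int.from_bytes(tail, "little"))

    # Final avalanche step.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def add_join_key(doc: dict, source: str, target: str) -> dict:
    """Return a copy of `doc` with a numeric hash of doc[source] under `target`.

    Hypothetical helper: the attribute names are chosen by the caller.
    """
    enriched = dict(doc)
    enriched[target] = murmur3_32(str(doc[source]).encode("utf-8"))
    return enriched


# Hypothetical document and attribute names, for illustration only.
document = add_join_key({"company": "Acme Corp"}, "company", "company_hash")
print(document["company_hash"])
```

A 32-bit hash can collide for distinct strings, so a join on the hashed attribute may produce a small number of false matches on large datasets.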
Tuple collector settings
Tuple collectors send tuples in batches of a fixed size. The batch size affects performance: smaller batches use less memory but increase CPU time on the receiver side, because the receiver has to reconstruct a tuple collection from many small batches (especially for sorted tuple collections). By default, the batch size is set to 1,048,576 tuples (which represents 8 MB for a column of the long datatype). The size can be configured using the setting key siren.io.tuple.collector.batch_size with an integer value representing the maximum number of tuples in a batch.
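The trade-off can be estimated with simple arithmetic. The following Python sketch assumes 8 bytes per long value and ignores any serialization overhead. The function names and the 10,000,000-tuple workload are hypothetical and only serve to show how batch size relates to memory per batch and to the number of batches the receiver must reassemble.

```python
BYTES_PER_LONG = 8            # a long value occupies 8 bytes
DEFAULT_BATCH_SIZE = 1_048_576


def batch_memory_bytes(batch_size: int, columns: int = 1) -> int:
    """Approximate payload size of one full batch of long columns."""
    return batch_size * columns * BYTES_PER_LONG


def batch_count(total_tuples: int, batch_size: int) -> int:
    """Number of batches needed to ship `total_tuples` (ceiling division)."""
    return -(-total_tuples // batch_size)


# Default setting: one column of longs.
print(batch_memory_bytes(DEFAULT_BATCH_SIZE) / 2**20)   # 8.0 (MiB)

# Hypothetical workload of 10,000,000 tuples.
print(batch_count(10_000_000, DEFAULT_BATCH_SIZE))      # 10
print(batch_count(10_000_000, 65_536))                  # 153
```

In this example, reducing the batch size from the default to 65,536 tuples cuts the payload of each batch from 8 MiB to 0.5 MiB, but the receiver has to merge 153 batches instead of 10.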
Using the preference parameter for search requests
To optimize cache utilization, Elasticsearch recommends using the preference parameter, which controls the shard copies on which the search is executed.
By default, Elasticsearch selects from the available shard copies in an unspecified order, taking the allocation awareness and adaptive replica selection configuration into account. However, it may sometimes be desirable to route certain searches to certain sets of shard copies. For example, the preference parameter could be set to a custom string value such as a session or user ID. This is particularly important in Siren Federate, because it makes better use of the join query cache.
For more information, see Tune for search speed.
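As a minimal sketch of how a client might set the parameter, the following Python function builds a search URL that carries a per-session preference value. The host, index name, and session identifier are hypothetical, and the /siren/ path prefix is an assumption about how Federate search requests are addressed in your deployment. The function rejects values starting with an underscore, because Elasticsearch reserves that prefix for its built-in preference values.

```python
from urllib.parse import quote, urlencode


def search_url(base_url: str, index: str, session_id: str,
               federate: bool = True) -> str:
    """Build a _search URL that pins a session to the same shard copies.

    Assumption: Federate searches are sent to /siren/<index>/_search;
    pass federate=False to target the plain Elasticsearch endpoint.
    """
    if not session_id or session_id.startswith("_"):
        # Custom preference strings must not start with "_".
        raise ValueError("preference must be a non-empty string not starting with '_'")
    prefix = "/siren" if federate else ""
    path = f"{prefix}/{quote(index, safe='')}/_search"
    query = urlencode({"preference": session_id})
    return f"{base_url.rstrip('/')}{path}?{query}"


# Hypothetical host, index and session identifier.
print(search_url("http://localhost:9200", "articles", "user-42"))
```

Reusing the same value for every request in a user session keeps those requests on the same shard copies, which is what allows cached results to be reused.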
Caution when force-merging indices into a single segment
The search-project task parallelizes its work by using a single worker per index segment. Therefore, exercise caution when considering a force-merge of an index. Force-merging an index into a single segment degrades the performance of the search-project task, because it can no longer parallelize its work.
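One way to apply this caution is to guard force-merge requests in tooling. The Python sketch below builds a URL for the Elasticsearch force merge API with its max_num_segments parameter, and refuses a target of fewer than two segments. The function name, host, and index name are hypothetical, and the two-segment minimum is an illustrative policy, not a Siren recommendation. Note that Elasticsearch applies max_num_segments to each shard.

```python
from urllib.parse import quote, urlencode


def forcemerge_url(base_url: str, index: str, max_num_segments: int) -> str:
    """Build a _forcemerge URL that keeps more than one segment.

    Illustrative policy: reject a single-segment target, since the
    search-project task uses one worker per segment.
    """
    if max_num_segments < 2:
        raise ValueError(
            "merging to a single segment would leave search-project with one worker"
        )
    path = f"/{quote(index, safe='')}/_forcemerge"
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}{path}?{query}"


# Hypothetical host and index name; the request itself would be a POST.
print(forcemerge_url("http://localhost:9200", "articles", 4))
```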