Siren Platform User Guide

Modules

Planner

The planner module is responsible in parsing a (multi) search request and generating a logical model. This logical model is then optimized by leveraging the rule-based Hep engine and statistical Volcano engine from Apache Calcite. The outcome is a physical query plan, which is then executed. The physical query plan is a Directed Acyclic Graph workflow composed of individual computing steps. The workflow is executed as a Job and the individual computing steps are executed as`Tasks`. We can therefore map one (multi) search request to a single job.

siren.planner.pool.job.size

Control the maximum number of concurrent jobs being executed per node. Defaults to 1.

siren.planner.pool.job.queue_size

Control the size of the queue for pending jobs per node. Defaults to 100.

siren.planner.pool.tasks_per_job.size

Control the maximum number of concurrent tasks being executed per job. Defaults to 3.

siren.planner.volcano.enable

Enable or disable the Volcano statistical engine to select the most appropriate join algorithms. Defaults to true.

siren.planner.volcano.join

Defines which distributed join algorithm to be selected when optimizing a request. Valid values are either HASH_JOIN or MERGE_JOIN, case-insensitive. Defaults to HASH_JOIN.

siren.planner.volcano.use_query

Use contextual queries when computing statistics. If false, computed statistics are effectively "global" to the index. Defaults to false.

siren.planner.volcano.cache.enable

Enable or disable a caching layer over Elasticsearch requests sent during query optimizations in order to gather statistics. Defaults to true.

siren.planner.volcano.cache.refresh_interval

The minimum interval time for refreshing the cached response of a statistics-gathering request. The time unit is in minutes and defaults to 60 minutes.

siren.planner.volcano.cache.maximum_size

The maximum number of requests response that can be cached. Defaults to 1000000.

Memory

The memory module is responsible in allocating and managing chunks of off-heap memory. The allocated memory is managed in a hierarchical model. The root allocator is managing the memory allocation on a node level, and can have one or more job allocators. A job allocator is created for each job (that is, a Siren Federate search request) and is managing the memory allocation on a job level. A job can have one or more task allocators. A task allocator is created for each task of a job and is managing the memory allocation on a task level. Each allocator specifies a limit for how much off-heap memory it can use.

siren.memory.root.limit

Limit in bytes for the root allocator. Defaults to 750MB.

siren.memory.job.limit

Limit in bytes for the job allocator. Defaults to siren.memory.root.limit.

siren.memory.task.limit

Limit in bytes for the task allocator. Defaults to siren.memory.job.limit.

By default, the job limit is equal to the root limit, and the task limit is equal to the job limit. This enables a simplified configuration for most common scenarios where only the root limit has to be configured. For more advanced scenarios, e.g., with multiple concurrent users, one might want to tune the job and task limit to avoid having a user executing a query that will consume all the available off-heap memory on the root level, leaving no memory for the queries executed by other users.

Typically, you should never give more than half of the remaining OS memory to the siren root allocator, to leave some memory for the OS cache and cater for Netty’s memory management overhead. For example, if Elasticsearch is configured with a 32GB heap on a machine with 64GB of ram, this leaves 32GB to the OS. The maximum limit that one could set for the root allocator should be 16GB.

IO

The IO module is responsible in encoding, decoding and shuffling data across the nodes in the cluster.

Tuple collector

This module introduces the concept of Tuple Collectors which are responsible in collecting tuples created by a`SearchProject` or Join task and shuffling them across the shards or nodes in the cluster.

Tuples collected will be transferred in one or more packets. The size of a packet has an impact on the resources. Small packets will take less memory but will increase CPU times on the receiver side since it will have to reconstruct a tuple collection from many small packets. Large packets will reduce CPU usage on the receiver side, but at the cost of higher memory usage on the collector side and longer network transfer latency. The size of a packet can be configured with the following setting:

siren.io.tuple.collector.packet_size

The number of tuples in a packet. The packet size must be a power of 2. Defaults to 2^20 tuples.

When using the Hash Join or Merge Join algorithm, the collector will use a hash partitioner strategy to create small data partitions. Creating multiple small data partitions helps in parallelizing the join computation, as each worker thread for the join task will be able to pick and join one partition independently of the others. Setting the number of data partitions per node to 1 will cancel any parallelization. The number of data partitions per node can be configured with the following setting:

siren.io.tuple.collector.hash.partitions_per_node

The number of partitions per node. The number of partitions must be a power of 2. Defaults to 32.

Thread pools

Siren Federate introduces new thread pools:

federate.planner

For the query planner operations. Thread pool type is scaling.

federate.data

For the data operations (create, upload, delete). Thread pool type is scaling.

federate.task.worker

For task worker threads. Thread pool type is fixed_auto_queue_size with a size of max((# of available_processors) - 1, 1), and initial queue_size of 1000.

federate.connector.query

For connector query operations. Thread pool type is fixed_auto_queue_size with a size of int((# of available_processors * 3) / 2) + 1, and initial queue_size of 1000.

Connector

The Federate Connector module supports the following node configuration settings:

siren.connector.datasources.index

The index in which Federate will store datasource configurations.

siren.connector.query.max_result_rows

The maximum number of rows returned when executing a query on a remote datasource. Defaults to 10.

Search results

    No results found