Siren Platform User Guide

IO

The IO module is responsible for encoding, decoding, and shuffling data across the nodes in the cluster.

Tuple Collector

This module introduces the concept of Tuple Collectors, which are responsible for collecting the tuples created by a SearchProject or Join task and shuffling them across the shards or nodes in the cluster.

Collected tuples are transferred in one or more packets. The packet size affects resource usage. Small packets use less memory but increase CPU time on the receiver side, because the receiver has to reconstruct a tuple collection from many small packets. Large packets reduce CPU usage on the receiver side, but at the cost of higher memory usage on the collector side and longer network transfer latency. The size of a packet can be configured with the following setting:

siren.io.tuple.collector.packet_size
The number of tuples in a packet. The packet size must be a power of 2. Defaults to 2^20 (1,048,576) tuples.
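
To get a feel for the trade-off, the following minimal sketch estimates how many packets a collector would send for a given number of tuples. It is an illustration only, not part of the platform: the helper names `is_power_of_two` and `packets_needed` are hypothetical, and the sketch assumes that every packet except the last is filled completely.

```python
DEFAULT_PACKET_SIZE = 2 ** 20  # default stated for siren.io.tuple.collector.packet_size


def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of 2 (hypothetical helper)."""
    return n > 0 and (n & (n - 1)) == 0


def packets_needed(num_tuples: int, packet_size: int = DEFAULT_PACKET_SIZE) -> int:
    """Estimate the number of packets for num_tuples (hypothetical helper).

    Assumption: all packets are full except possibly the last one.
    """
    if not is_power_of_two(packet_size):
        raise ValueError("packet_size must be a power of 2")
    if num_tuples < 0:
        raise ValueError("num_tuples must not be negative")
    # Integer ceiling division.
    return (num_tuples + packet_size - 1) // packet_size


if __name__ == "__main__":
    for size in (2 ** 16, 2 ** 20):
        print(size, packets_needed(5_000_000, size))
```

Under these assumptions, 5,000,000 tuples need 5 packets at the default size and 77 packets at 2^16 tuples per packet, which is the kind of receiver-side reassembly overhead described above.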

When using the Hash Join or Merge Join algorithm, the collector uses a hash partitioner strategy to create small data partitions. Creating multiple small data partitions helps parallelize the join computation, because each worker thread of the join task can pick a partition and join it independently of the others. Setting the number of data partitions per node to 1 disables this parallelization. The number of data partitions per node can be configured with the following setting:

siren.io.tuple.collector.hash.partitions_per_node
The number of partitions per node. The number of partitions must be a power of 2. Defaults to 32.
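
The idea behind hash partitioning can be sketched as follows. This is not the platform's implementation: the function `partition_tuples`, its parameters, and the use of Python's built-in `hash` on integer join keys are assumptions made for illustration. The sketch also shows one common reason for a power-of-2 constraint, namely that the partition index can be computed with a bit mask; whether the platform relies on this is not stated in this guide.

```python
from collections import defaultdict


def partition_tuples(tuples, key_index, num_partitions=32):
    """Group tuples into partitions by hashing the join key (hypothetical helper).

    Tuples with the same join key always land in the same partition, so each
    partition can be joined independently of the others.
    """
    if num_partitions <= 0 or num_partitions & (num_partitions - 1):
        raise ValueError("num_partitions must be a power of 2")
    mask = num_partitions - 1  # valid only because num_partitions is a power of 2
    partitions = defaultdict(list)
    for row in tuples:
        partitions[hash(row[key_index]) & mask].append(row)
    return partitions


if __name__ == "__main__":
    # Integer join keys in the first column, an arbitrary payload in the second.
    rows = [(i % 100, f"row-{i}") for i in range(1000)]
    parts = partition_tuples(rows, key_index=0, num_partitions=32)
    print(len(parts), "partitions,", sum(len(p) for p in parts.values()), "tuples")
```

With `num_partitions=1` every tuple falls into partition 0, so a single worker has to process the whole input, which corresponds to the loss of parallelization described above.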