Siren Platform User Guide

Topic Clustering

Experimental feature

The Topic Clustering visualization performs significance and clustering analysis on full-text fields. While similar to the Tags Cloud visualization, Topic Clustering highlights significant terms (topics) whose frequency in documents increases when current filters and search queries are applied.

tc_1.png

The Topic Clustering panel is divided into separate cells, each representing a term. A cell's size represents the number of documents its term matches, while its color represents the term's relevance to the current queries and filters.

The visualization can also cluster together mutually significant terms, forming groups that denote 'areas of interest' of the text corpus (large set of structured texts).

Interaction

You can interact with Topic Clustering using a mouse or touch:

  • Pan the view by click-and-drag/tap-and-drag

  • Zoom in or out using the mouse wheel/pinch-zoom gestures

  • Zoom out to initial view and close all cells with the ESC key

  • Open a cell with a double-click/tap

  • Close a cell by double-clicking its header

Expanding and collapsing cells

Double-click/tap a cell to open it; this loads its significant sub-terms and displays them recursively. It also puts the cell term in a white header at the top of the cell.

tc_2.png

Once loaded, cells are automatically opened and closed depending on your current level of zoom. However, you can still open a cell explicitly by double-clicking on it. Conversely, you can close it by double-clicking on the white cell header.

tc_3.png

Loaded cells open/close automatically as you zoom in/out.

Tooltip

Hovering over a cell displays details such as the number of documents it represents and its relevance/significance score:

tc_4.png

Tooltip legend

  1. Path: Hierarchical position of the cell

  2. Relevance Score: The term’s relevance/significance score with respect to current search query, along with filters and ancestor cells

  3. Size (total): Matching documents compared to all searched/filtered documents

  4. Size (parent): Matching documents compared to those in the parent cell

  5. Parent Coverage: Union of all documents matched by this and sibling cells compared to those in the parent cell.

    You can apply either a single term or a full cluster as a filter by clicking the corresponding button in the tooltip. This is useful for bringing an interesting area of the corpus into focus once you have identified it.

  6. Filter cluster: Click to apply a dashboard filter matching any of the cluster terms.

  7. Filter term: Click to apply a dashboard filter for the cell term only
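The three ratio entries in the tooltip legend can be illustrated with plain set arithmetic. The sketch below is only a reading of the legend text, not the visualization's actual code; the function name `tooltip_sizes` and the document-ID sets are hypothetical.

```python
def tooltip_sizes(cell_docs, sibling_docs, parent_docs, total_docs):
    """Return (size_total, size_parent, parent_coverage) as fractions.

    cell_docs    -- set of document IDs matched by the hovered cell
    sibling_docs -- list of sets, one per sibling cell
    parent_docs  -- set of document IDs matched by the parent cell
    total_docs   -- set of all searched/filtered document IDs
    """
    # Size (total): cell documents compared to all searched/filtered documents.
    size_total = len(cell_docs) / len(total_docs)
    # Size (parent): cell documents compared to the parent cell's documents.
    size_parent = len(cell_docs & parent_docs) / len(parent_docs)
    # Parent Coverage: union of this cell and its siblings, compared to the parent.
    covered = cell_docs.union(*sibling_docs) & parent_docs
    parent_coverage = len(covered) / len(parent_docs)
    return size_total, size_parent, parent_coverage


# Made-up example: 100 filtered documents, 40 of them in the parent cell.
total = set(range(100))
parent = set(range(40))
cell = set(range(10))
siblings = [set(range(5, 20)), set(range(30, 35))]

print(tooltip_sizes(cell, siblings, parent, total))
```

Because sibling cells can overlap, Parent Coverage is computed on the union of documents rather than by summing the individual sizes.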

Live Filter

Clicking a cell selects it. In a dashboard, this automatically applies a live filter matching the cell's term to other visualizations in that dashboard.

The live filter does not apply to the visualization itself, or other Topic Clusterings in the same dashboard, which retain their UI state.

For example, you can pair a Topic Clustering with a Record Table next to it. Selecting a cell updates the table and displays associated document samples.

tc_5.png

Setting up the visualization

Data Tab

The only required input is the text field to operate on, which must be set before the visualization can render.

tc_6.png

Terms extraction can work in Plain mode or Clustering mode:

  • In Plain mode, no clustering is performed, and only a flat selection of the most significant terms is displayed.

  • In Clustering mode, significant topics are put in clusters based on their mutual significance. This mode can trade off some high-significance plain terms in favor of filling up the clusters.

You can select a different extraction mode when retrieving terms at the initial (root topics) level, and when expanding cells (sub-topics) with a double-click/tap.

Changing the Chart Type option to Square renders the visualization as a more traditional square treemap:

tc_7.png

The following parameters control terms generation:

  • Target Topics Count: The desired number of terms to calculate and display.

  • Ignore Large Topics: Sets the relative size threshold (as a percentage) above which terms are ignored, since very large terms can be considered trivial.

  • Maximum Clusters Size: The maximum number of terms a cluster can have.

  • Per-shard documents (thousands): Restricts document analysis to the specified number of documents per shard, chosen among the best matches for the current search query and filters. Use it to skip the worst-matching documents and speed up analysis.
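To give a feel for how parameters like these could map onto Elasticsearch, here is a minimal sketch that builds a request body combining the `sampler` aggregation (which keeps only the top-scoring documents per shard) with a `significant_terms` aggregation. That Topic Clustering issues a request of this shape is an assumption; the function name, the aggregation names `best_matching` and `topics`, and the example field `body` are hypothetical.

```python
import json


def build_topics_request(field, query, target_topics=20, per_shard_docs_thousands=10):
    """Build an Elasticsearch search body for sampled significant terms.

    Assumption: this only mirrors the documented parameters; it is not
    necessarily the request the visualization sends.
    """
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "query": query,
        "aggs": {
            "best_matching": {
                # Analyze only the best-matching documents on each shard.
                "sampler": {"shard_size": per_shard_docs_thousands * 1000},
                "aggs": {
                    "topics": {
                        # Terms that are unusually frequent in the sample.
                        "significant_terms": {"field": field, "size": target_topics}
                    }
                },
            }
        },
    }


body = build_topics_request("body", {"match": {"body": "fraud"}}, 15, 5)
print(json.dumps(body, indent=2))
```

Lowering the per-shard sample shrinks the foreground set that significance is computed on, which is where the speed gain comes from.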

Stop-Words Tab

Text field values often contain undesirable or irrelevant terms that should be filtered out; these are called stop-words.

Stop-words are best applied at index time, using the appropriate Elasticsearch analyzers. Check the Elasticsearch stop-words documentation for further details. However, some undesirable words will inevitably slip past the indexing phase, and some words may be undesirable only in the context of a particular visualization.
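As a rough illustration of the index-time approach, the sketch below builds index settings that define a custom analyzer with Elasticsearch's `stop` token filter and assign it to a text field. The analyzer name `text_no_stopwords`, the filter name `english_stop`, and the field name `body` are placeholders, and the typeless `mappings` layout assumes Elasticsearch 7 or later.

```python
import json

# Request body for creating an index (PUT /<index-name>).
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Built-in English stop-word list; a custom list of words
                # can be supplied instead.
                "english_stop": {"type": "stop", "stopwords": "_english_"}
            },
            "analyzer": {
                "text_no_stopwords": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stop"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical full-text field analyzed without stop-words.
            "body": {"type": "text", "analyzer": "text_no_stopwords"}
        }
    },
}

print(json.dumps(index_body, indent=2))
```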

You can provide an additional list of stop-words to be filtered by the visualization itself. The list is configured as separate lines in the Stop Words tab; each line is a separate stop-word.

tc_8.png

Regular expressions are supported as stop-words, which is useful, for example, to remove all numbers. However, regular expression stop-words incur a performance penalty. A single regular expression is enough to incur it, because it forces all stop-words to be combined into one regular expression, separated by | (pipe) alternations.
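The performance trade-off can be sketched as follows. This is an illustration of the general technique, not the visualization's implementation: with plain words only, a set lookup suffices, while a single regular expression forces every entry into one pipe-separated pattern. The function name and the split into two input lists are assumptions made for the example.

```python
import re


def build_stop_word_matcher(plain_words, regex_patterns):
    """Return a function that tells whether a term is a stop-word."""
    if not regex_patterns:
        # Fast path: exact lookup in a set.
        stop_set = set(plain_words)
        return lambda term: term in stop_set

    # Slow path: one regex entry forces all entries into a single
    # alternation. Plain words are escaped so they match literally.
    alternatives = [re.escape(word) for word in plain_words] + list(regex_patterns)
    combined = re.compile("|".join(f"(?:{alt})" for alt in alternatives))
    return lambda term: combined.fullmatch(term) is not None


is_stop = build_stop_word_matcher(["the", "and"], [r"\d+"])
for term in ["the", "2024", "theory", "fraud"]:
    print(term, is_stop(term))
```

Matching against the whole term (`fullmatch`) keeps a stop-word such as "the" from also removing longer terms such as "theory".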

Appearance Tab

Cell colors are associated with the relevance/significance score calculated for each term, while cell size relates to the number of documents it matches.

You can change the colors displayed by adjusting two colors representing the extremes of the color palette.

Tip

It is good practice to set the Low Relevance color to a low saturation (paler) value of the selected color, and set the High Relevance color to a high saturation (more intense) value of the same color.

tc_9.png

You can also fine-tune some of the aspects of the rendered chart using the following options.

  • Cell Gradients: controls the rendering mode of cell backgrounds. When enabled, cells will be rendered with a nice-looking smooth-colored gradient. You can disable this option to render cells using a flat color, which is faster and arguably less distracting.

  • Colored Cell Headers: controls coloring of cell headers. When enabled, opened cell headers will be rendered with a smooth gradient based on the opened cell’s relevance score. When disabled, the header will be rendered with a flat white color, to separate it from child cells.

  • Coverage information: shows or hides the Parent Coverage section in the tooltip.

Limitations

String Analysis

The Topic Clustering visualization only applies to the Elasticsearch text datatype, which undergoes string analysis transformations such as tokenization and stemming at data ingest.

This means that it is not applicable to fields found in JDBC backends, which do not support string analysis out of the box.

Fielddata

As with Tags Cloud, fielddata support must be enabled on a text field for Topic Clustering to work.

Enabling fielddata can result in high memory usage on the Elasticsearch cluster, so refer to the official Elasticsearch guide for more information on enabling fielddata appropriately.
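For reference, fielddata is switched on per field through the index mapping. The sketch below builds the body of an update-mapping request (PUT /<index-name>/_mapping); the helper name and the field name `body` are placeholders, and the memory caveat above still applies.

```python
import json


def fielddata_mapping(field_name):
    """Build a mapping update that enables fielddata on a text field."""
    return {
        "properties": {
            field_name: {
                "type": "text",
                # Loads the field's terms into heap memory for aggregations.
                "fielddata": True,
            }
        }
    }


print(json.dumps(fielddata_mapping("body"), indent=2))
```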

Additional Notes

Without a search query or filter, it is not possible to establish a foreground/background set pair, so there is nothing to define significance against.

In these cases, the visualization adopts alternative relevance score functions:

  • In Plain mode, each term is scored according to its matching document count normalized by the total document count. This is similar to selecting the largest terms in the field.

  • In Clustering mode, each term is scored according to an average over its own significant sub-terms (the significant terms found when it is used as a filter).
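The two fallback functions can be read as the simple formulas below. This is a sketch of one plausible interpretation: the function names are hypothetical, and treating the clustered fallback as the arithmetic mean of the sub-terms' scores (with 0.0 when there are none) is an assumption.

```python
def plain_score(term_doc_count, total_doc_count):
    """Fallback score without clustering: the term's share of all documents."""
    return term_doc_count / total_doc_count


def clustered_score(subterm_scores):
    """Fallback score with clustering: mean score of the significant sub-terms.

    subterm_scores holds the scores of the terms found significant when
    the term itself is used as a filter.
    """
    if not subterm_scores:
        return 0.0  # assumption: a term without sub-terms scores zero
    return sum(subterm_scores) / len(subterm_scores)


print(plain_score(250, 1000))
print(clustered_score([0.5, 0.25, 0.75]))
```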