Siren Platform User Guide

Topic Clustering

Beta feature

The Topic Clustering visualization performs significance and clustering analysis on full-text fields. While similar to the Tags Cloud visualization, Topic Clustering highlights significant terms (topics) whose frequency in documents increases when current filters and search queries are applied.

tc1.png

The Topic Clustering panel is divided into separate cells, each representing a topic. The size of a cell represents the number of documents its topic matches, while its color represents the topic's relevance to the current queries and filters.

The visualization can also cluster together mutually significant topics, forming groups that denote 'areas of interest' of the text corpus (large set of structured texts).

Interaction

You can interact with Topic Clustering using mouse or touch:

  • Pan the view by click-and-drag/tap-and-drag

  • Zoom in or out using the mouse wheel/pinch-zoom gestures

  • Zoom out to initial view and close all cells with the ESC key

  • Open a cell with a double-click/tap

  • Close a cell by double-clicking its header

Expanding and collapsing cells

Double-click/tap a cell to open it; this loads its significant subtopics and displays them recursively. The opened cell’s topic will be displayed in a white header at the top of the cell.

tc2.png

Once loaded, cells are automatically opened and closed depending on your current level of zoom. However, you can still open a cell explicitly by double-clicking on it. Conversely, you can close it by double-clicking on the white cell header.

tc3.png

Loaded cells open/close automatically as you zoom in/out.

Tooltip

Hovering over a cell displays details such as the number of documents it represents and its relevance/significance score:

tc4.png

Tooltip legend

  • Title: The topic represented by the cell, with associated number of documents and percentage of current dashboard documents

  • Relevance Score: The topic’s relevance/significance score with respect to current dashboard filters and ancestor cells

  • Size In Parent: The number of matching documents compared to the number in the parent cell

  • Tiles Coverage: The share of a tile’s documents represented by the union of all its sub-tiles

  • Filter corpus just on <topic>: You can apply a topic as a dashboard filter by clicking this button. This is useful for focusing on an interesting area of the corpus once it has been identified.

  • Extract sub-topics: Opens a cell to display its sub-topics. Equivalent to a double-click on the cell.
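The Size In Parent and Tiles Coverage figures in the legend above can be read as simple set ratios. The following is a minimal sketch of that reading, assuming each cell is described by the set of document IDs it matches. The function names and the toy IDs are hypothetical and are not part of the product; the visualization's own computation may differ.

```python
def size_in_parent(cell_docs: set, parent_docs: set) -> float:
    """Fraction of the parent's documents that also match the cell."""
    if not parent_docs:
        return 0.0
    return len(cell_docs & parent_docs) / len(parent_docs)


def tiles_coverage(tile_docs: set, sub_tiles: list) -> float:
    """Fraction of a tile's documents matched by at least one sub-tile."""
    if not tile_docs:
        return 0.0
    # Union of all sub-tile document sets, restricted to the tile itself.
    covered = set().union(*sub_tiles) & tile_docs
    return len(covered) / len(tile_docs)


# Toy data (hypothetical document IDs).
parent = set(range(1, 11))      # 10 documents in the parent cell
cell = {1, 2, 3, 4}             # 4 of them match this cell's topic
subs = [{1, 2}, {2, 3}]         # two overlapping sub-tiles

print(size_in_parent(cell, parent))   # 0.4
print(tiles_coverage(cell, subs))     # 0.75
```

Because sub-tiles can overlap, the coverage is computed on the union of their documents rather than on the sum of their sizes.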

Live Filter

Clicking a cell selects it. In a dashboard, this automatically applies a live filter matching the cell's topic to other visualizations in that dashboard.

The live filter does not apply to the visualization itself, or other Topic Clusterings in the same dashboard, which retain their UI state.

For example, you can pair a Topic Clustering with a Record Table next to it. Selecting a cell updates the table and displays associated document samples.

tc5.png

Setting up the visualization

Data Tab

The only required input is the text field to operate on, which must be set before the visualization can render.

Tip

If you can’t see the field you want, check the Management > Data Model page to make sure that it’s aggregatable. You can make a text field aggregatable by enabling the fielddata mapping property for the field, as explained in the Elasticsearch documentation. Remember to refresh the Fields list in the Data Model page to reflect the changes.
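As an illustration of the fielddata property mentioned in the tip, the sketch below builds a mapping body that enables fielddata on a text field. The field name `description` is a hypothetical placeholder, and the exact endpoint used to submit the body depends on your Elasticsearch version, so treat this as a starting point and check the Elasticsearch documentation before applying it.

```python
import json


def fielddata_mapping(field_name: str) -> dict:
    """Build a mapping body that enables fielddata on one text field."""
    return {
        "properties": {
            field_name: {
                "type": "text",
                "fielddata": True,  # makes the analyzed text field aggregatable
            }
        }
    }


# "description" is a placeholder field name, not one defined by the product.
body = fielddata_mapping("description")
print(json.dumps(body, indent=2))
```

The printed JSON is the kind of body that the Elasticsearch update mapping API expects for an existing index.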

tc6.png

Topic extraction is performed using one of the available algorithms:

  • Plain Significant Terms: extracts a flat selection of the most significant terms. No clustering is performed, and only single words are displayed.

  • Clustering on significant terms: significant terms are extracted and put in clusters based on their mutual significance. This mode can trade off some high-significance plain terms in favor of filling up the clusters. This is still limited to single words.

  • Plain significant phrases: uses proximity queries to extract a flat selection (no clustering) of significant phrases of the text corpus. The resulting phrases can be optionally merged with significant words from the Plain Significant Terms algorithm.

  • Clustering on significant phrases: same as the previous one, but it also clusters generated topics based on their mutual significance.

You can select a different extraction mode when retrieving terms at the initial (root topics) level, and when expanding cells (sub-topics) with a double-click/tap.

tc7.png

Changing the Chart Type option to Square renders the visualization as a more traditional square treemap:

tc8.png

The following parameters control terms generation:

  • Target Topics Count: The desired number of terms to calculate and display.

  • Ignore Large Topics: Specifies the relative size, as a percentage, above which terms are ignored, as very large terms can be considered trivial.

  • Maximum Clusters Size: The maximum number of terms a cluster can have.

  • Per-shard documents (thousands): Restricts document analysis to the specified number of documents per shard, chosen from the best matches for the current search query and filters. This can be used to skip analysis of the worst-matching documents and gain speed.

  • Mix with single-word topics: For phrase-based algorithms, you can also include single-word topics in the displayed results.
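To make the interplay of Target Topics Count and Ignore Large Topics more concrete, here is a minimal sketch under two assumptions that the guide does not confirm: that a term's "size" is its share of the analyzed documents, and that the remaining terms are ranked by score. The function name, the tuple layout, and the sample values are hypothetical.

```python
def select_topics(candidates, total_docs, target_count, ignore_above_pct):
    """Pick up to target_count terms, skipping overly large ones.

    candidates: list of (term, doc_count, score) tuples.
    ignore_above_pct: terms matching more than this percentage of
    total_docs are dropped (assumed meaning of "Ignore Large Topics").
    """
    kept = [
        c for c in candidates
        if 100.0 * c[1] / total_docs <= ignore_above_pct
    ]
    # Highest score first (assumed ranking criterion).
    kept.sort(key=lambda c: c[2], reverse=True)
    return kept[:target_count]


# Hypothetical sample data.
sample = [
    ("the", 950, 0.1),    # matches 95% of documents: trivial
    ("solar", 120, 0.9),
    ("wind", 80, 0.7),
    ("coal", 40, 0.3),
]
chosen = select_topics(sample, total_docs=1000, target_count=2, ignore_above_pct=50)
print([term for term, _, _ in chosen])  # ['solar', 'wind']
```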

Stop-Words Tab

Text field values often contain undesirable or irrelevant terms that should be filtered out; these are called stop-words.

Stop-words are best applied at index-time using the appropriate Elasticsearch analyzer support. Check the Elasticsearch stop-words documentation for further details. However, some undesirable words will inevitably slip past the indexing phase. Some words may also be undesirable only in the context of a particular visualization.

You can provide an additional list of stop-words to be filtered by the visualization itself. The list is configured as separate lines in the Stop Words tab; each line is a separate stop-word.

tc9.png

Regular expressions are supported as stop-words, which can be useful, for example, to remove all numbers. However, using regular expression stop-words incurs a performance penalty. A single regular expression in the list is enough to incur this penalty, because it forces all stop-words to be combined into one regular expression, separated by | (pipe) alternations.
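The pipe-joined form described above can be sketched as follows. This is an illustration only: it uses Python's `re` module, whose syntax differs in places from the regular expression dialect used by Elasticsearch, and the function names and sample words are hypothetical.

```python
import re


def build_stop_word_pattern(stop_words):
    """Join every entry into one alternation, e.g. '(?:the)|(?:[0-9]+)'.

    Once one entry is a regular expression, every entry (plain words
    included) is treated as a pattern in the combined expression.
    """
    return "|".join(f"(?:{word})" for word in stop_words)


def filter_terms(terms, stop_words):
    """Drop terms that fully match any stop-word or stop-pattern."""
    pattern = re.compile(build_stop_word_pattern(stop_words))
    return [term for term in terms if not pattern.fullmatch(term)]


# One stop-word per line in the Stop Words tab; the last one is a regex.
stop_words = ["the", "and", "[0-9]+"]
terms = ["the", "solar", "2019", "and", "wind", "42nd"]

print(build_stop_word_pattern(stop_words))  # (?:the)|(?:and)|(?:[0-9]+)
print(filter_terms(terms, stop_words))      # ['solar', 'wind', '42nd']
```

Matching every term against a combined pattern is typically slower than an exact lookup in a list of words, which is one way to picture the performance penalty.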

Appearance Tab

Cell colors are associated with the relevance/significance score calculated for each term, while cell size relates to the number of documents it matches.

You can change the colors displayed by adjusting two colors representing the extremes of the color palette.

Tip

It is good practice to set the Low Relevance color to a low saturation (paler) value of the selected color, and set the High Relevance color to a high saturation (more intense) value of the same color.
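One way to picture how two extreme colors can span a palette is linear interpolation between them. The sketch below assumes a relevance score already normalized to the range 0 to 1 and blends the two colors channel by channel. The function names and the sample colors are hypothetical, and the visualization may compute its palette differently.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def relevance_color(score, low_hex, high_hex):
    """Blend two colors: score 0 gives low_hex, score 1 gives high_hex."""
    t = min(max(score, 0.0), 1.0)  # clamp the score to [0, 1]
    low, high = hex_to_rgb(low_hex), hex_to_rgb(high_hex)
    mixed = (round(lo + (hi - lo) * t) for lo, hi in zip(low, high))
    return "#{:02x}{:02x}{:02x}".format(*mixed)


# A pale blue for low relevance and an intense blue for high relevance.
LOW, HIGH = "#cce5ff", "#004c99"
print(relevance_color(0.0, LOW, HIGH))  # #cce5ff
print(relevance_color(1.0, LOW, HIGH))  # #004c99
```

Choosing a pale and an intense value of the same hue, as the tip suggests, keeps intermediate colors in the same color family.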

tc10.png

You can also fine-tune some of the aspects of the rendered chart using the following options.

  • Cell Gradients: controls the rendering mode of cell backgrounds. When enabled, cells will be rendered with a nice-looking smooth-colored gradient. You can disable this option to render cells using a flat color, which is faster and arguably less distracting.

  • Colored Cell Headers: controls coloring of cell headers. When enabled, opened cell headers will be rendered with a smooth gradient based on the opened cell’s relevance score. When disabled, the header will be rendered with a flat white color, to separate it from child cells.

  • Add “Other Topics” tile: controls the presence of a tile representing documents not covered by the current selection of topics. You can expand it to look for more topics at any given level of the topics hierarchy.

  • Add “Empty/No Text” tile: controls the presence of a tile representing documents that have no terms in the selected text field. This includes cases where there is no text, the text is an empty string, or the text is only composed of stop-words. This tile cannot be expanded, and only appears at the root level of the topics hierarchy.

Limitations

String Analysis

The Topic Clustering visualization only applies to the Elasticsearch text datatype, which undergoes string analysis transformations like tokenization and stemming at data ingest.

This means that it is not applicable to fields found in JDBC backends, which do not support string analysis out of the box.

Fielddata

As with Tags Cloud, fielddata support must be enabled on a text field for Topic Clustering to work.

Enabling fielddata can result in high memory usage on the Elasticsearch cluster, so refer to the official Elasticsearch guide for more information on enabling fielddata appropriately.

Additional Notes

How is the significance score calculated?

Significant topics are extracted using the built-in significant terms aggregation available in Elasticsearch, with the default JLH scoring function. In simple terms, the generated score represents the increase in a term's frequency when moving from a background query to a foreground query.
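For intuition, the JLH heuristic multiplies the absolute change in a term's frequency by its relative change between the background and foreground sets. The sketch below reproduces that formula in plain Python. The function name, parameter names, and sample counts are hypothetical, and the real score is computed inside Elasticsearch, not by this code.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """JLH-style score: absolute change times relative change.

    fg_count / fg_total: documents containing the term in the foreground set.
    bg_count / bg_total: documents containing the term in the background set.
    """
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg_pct = fg_count / fg_total
    bg_pct = bg_count / bg_total
    if fg_pct <= bg_pct:
        return 0.0  # the term is not more frequent in the foreground
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)


# The term appears in 1% of background documents but 20% of foreground ones.
print(jlh_score(fg_count=40, fg_total=200, bg_count=100, bg_total=10_000))  # roughly 3.8
```

A term that is equally common in both sets scores 0, which is why very frequent but uninformative words do not rank as significant.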

At the root level of the Topic Clustering visualization, the background query is given by the current dashboard’s time filters only, and the foreground query adds all the active dashboard filters to that. Opening a topic moves the current foreground query to the background, and the opened topic defines the new foreground query.

What if there is no search query or filter?
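The way the background and foreground shift as cells are opened can be sketched with queries modeled as plain lists of filter clauses. This is a simplified illustration: the function names and filter strings are hypothetical, and it assumes that the new foreground is the previous foreground narrowed by the opened topic, which the guide does not state explicitly.

```python
def root_queries(time_filter, dashboard_filters):
    """Root level: background is the time filter only; foreground adds the rest."""
    background = [time_filter]
    foreground = [time_filter, *dashboard_filters]
    return background, foreground


def open_topic(foreground, topic):
    """Opening a topic: the old foreground becomes the new background.

    Assumption: the new foreground is the old one narrowed by the topic.
    """
    new_background = list(foreground)
    new_foreground = [*foreground, topic]
    return new_background, new_foreground


bg, fg = root_queries("last 30 days", ["country: IE", "type: article"])
print(bg)  # ['last 30 days']
print(fg)  # ['last 30 days', 'country: IE', 'type: article']

bg, fg = open_topic(fg, "topic: solar")
print(bg)  # ['last 30 days', 'country: IE', 'type: article']
print(fg)  # ['last 30 days', 'country: IE', 'type: article', 'topic: solar']
```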

Without a search query or filter, it is not possible to establish a foreground/background set pair, so there is nothing to define significance against.

In these cases, the visualization adopts alternative relevance score functions:

  • In Plain topic extraction modes, each topic is scored according to its matching documents count normalized by the total documents count. This amounts to selecting the largest topics in the field.

  • In Clustered topic extraction modes, each topic is scored according to an average of its own significant sub-topics (the significant topics found when it is used as a filter).
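The two fallback scores above can be sketched as follows. The function names and sample numbers are hypothetical, and the clustered variant assumes a simple arithmetic mean, since the guide only says "an average".

```python
def plain_fallback_score(doc_count, total_docs):
    """Plain modes: matching documents normalized by the total count."""
    return doc_count / total_docs if total_docs else 0.0


def clustered_fallback_score(sub_topic_scores):
    """Clustered modes: mean of the topic's significant sub-topic scores.

    Assumption: an unweighted arithmetic mean.
    """
    if not sub_topic_scores:
        return 0.0
    return sum(sub_topic_scores) / len(sub_topic_scores)


print(plain_fallback_score(250, 1000))              # 0.25
print(clustered_fallback_score([0.2, 0.4, 0.6]))    # close to 0.4
```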