Datasource reflection pipelines
Pipelines may be used to enrich documents before they are indexed to Elasticsearch.
Examples:
To split a string on the delimiter "|" into a list of substrings and, if the initial string does not exist, set the target field to an empty string:
{ "description": "_description", "processors": [ { "split": { "on_failure": [ { "set": { "field": "parents", "value": "" } } ], "field": "parents", "separator": "\\|" } } ] }
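The behaviour of this pipeline can be approximated outside Elasticsearch for quick experimentation. The following is a minimal Python sketch, not the Elasticsearch implementation: the function name split_with_fallback and its parameters are hypothetical, and edge cases such as trailing empty substrings may differ from the real split processor.

```python
import re

def split_with_fallback(doc, field="parents", separator=r"\|", fallback=""):
    """Rough emulation of a split processor with a set-on-failure handler.

    Hypothetical helper for illustration only; it is not part of any
    Elasticsearch client library.
    """
    value = doc.get(field)
    if isinstance(value, str):
        # Normal case: split the string on the regex separator.
        doc[field] = re.split(separator, value)
    else:
        # Missing or null field: mimic on_failure by writing the fallback.
        doc[field] = fallback
    return doc

with_value = split_with_fallback({"parents": "a|b|c"})
without_value = split_with_fallback({"title": "no parents here"})
```

Here with_value carries a list in the parents field, while without_value carries the empty-string fallback.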
To accomplish a similar goal, but also convert each substring to a long and, if the initial field has no value, set the target field to -1 on failure:
{ "description": "_description", "processors": [ { "split": { "on_failure": [ { "set": { "field": "parents", "value": -1 } } ], "field": "parents", "separator": "\\|" } }, { "convert": { "field": "parents", "type": "long" } } ] }
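The intended effect of splitting and then converting can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the Elasticsearch implementation: the function name split_to_longs is hypothetical, and it assumes the conversion step runs after the split step on the same field.

```python
import re

def split_to_longs(doc, field="parents", separator=r"\|", fallback=-1):
    """Rough emulation of split (with a -1 fallback) followed by convert.

    Hypothetical helper for illustration only.
    """
    value = doc.get(field)
    if isinstance(value, str):
        # Split succeeded: we now hold a list of substrings.
        parts = re.split(separator, value)
    else:
        # Split failed (missing or null field): use the fallback value.
        parts = fallback
    if isinstance(parts, list):
        # Convert every element; a non-numeric substring raises ValueError,
        # just as an unconvertible value would fail the pipeline.
        doc[field] = [int(p) for p in parts]
    else:
        doc[field] = int(parts)
    return doc

converted = split_to_longs({"parents": "12|34|56"})
defaulted = split_to_longs({})
```

Note that a non-numeric substring is not covered by the fallback in this sketch; only a missing or null field is.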
To enrich documents from a web service: the processor below takes the value of the Abstract field (path syntax) and posts the request {"text": abstract_value}
to http://35.189.96.185/bio. If the Abstract field is null, an empty string is sent instead (input_default). The JSON response object is used as the value of a new field, Abstract_text_mined_entities, at the top level ($) of the document:
{ "json-ws": { "resource_name": "siren-nlp", "method": "post", "url": "http://35.189.96.185/bio", "input_map": { "$.Abstract": "text" }, "output_map": { "Abstract_text_mined_entities": "$" }, "input_default": { "text": "''" } } }
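The data flow described for this processor can be sketched in Python without any network access. This is only an illustration of the mapping, not how the json-ws processor is implemented: the function enrich and the call_service parameter are hypothetical, and the web service is replaced by a stand-in callable.

```python
def enrich(doc, call_service):
    """Sketch of the described flow: Abstract in, entities field out.

    call_service is a hypothetical callable standing in for the POST
    request; it receives the payload dict and returns the decoded JSON
    response.
    """
    abstract = doc.get("Abstract")
    # A null or missing Abstract is replaced by an empty string,
    # mirroring the role of input_default.
    payload = {"text": abstract if abstract is not None else ""}
    # The whole response object becomes the value of the new field.
    doc["Abstract_text_mined_entities"] = call_service(payload)
    return doc

def fake_service(payload):
    # Stand-in for the real service: echoes the length of the text.
    return {"length": len(payload["text"])}

enriched = enrich({"Abstract": "BRCA1 mutations"}, fake_service)
empty = enrich({"Abstract": None}, fake_service)
```

Injecting the service as a callable keeps the sketch testable offline; a real client would issue the POST request in its place.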
To extract the text between the first set of parentheses in the Title field and store it in a new field, Patent_ID:
{ "script": { "source": "def f = ctx['Title']; if (f != null) { def m = /\\((.*?)\\)/.matcher(f); if (m.find()) { ctx.Patent_ID = m.group(1); } }" } }
Note
You need to enable regex in the elasticsearch.yml file: script.painless.regex.enabled: true
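The regular expression used by the script processor can be tried out in Python before it is deployed. The sketch below is an approximation for testing the pattern only: the function name extract_patent_id is hypothetical, and Python's re module is assumed to behave like the Painless (Java) regex engine for this simple pattern.

```python
import re

def extract_patent_id(doc):
    """Copy the text inside the first pair of parentheses in Title
    into Patent_ID. Hypothetical helper for illustration only.
    """
    title = doc.get("Title")
    if title is not None:
        # Non-greedy group: stops at the first closing parenthesis.
        match = re.search(r"\((.*?)\)", title)
        if match:
            doc["Patent_ID"] = match.group(1)
    return doc

with_id = extract_patent_id({"Title": "Widget assembly (US1234567) and method (B2)"})
without_id = extract_patent_id({"Title": "Widget assembly"})
```

When the title contains no parentheses, the document is left unchanged and no Patent_ID field is created.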