Siren Platform User Guide

Using complex normalized databases

It is often the case, especially in relational databases that the data is highly normalized. In these situations, to increase the value for the user it makes a lot of sense to create semi-denormalized views. Typically, the right level of abstraction is the entity level, in other words, creating views which reflect useful representations of the entities in the domain. As an example of this, here you can see how our demo distribution was created. While the user sees only four indexes, which represent the entities which make sense in the investment domain (articles, companies, investments, and investors), the original data is much more normalized as per the following structure.

From a SQL database using Logstash

The indices in the Siren Platform demonstration distribution have been populated by running four Logstash configurations over the SQLite database in siren-investigate/crunchbase.db.

The database has the following schema:

Index setup

Before loading data, we need to setup indices and mappings; for example, let’s create an index called company-minimal in the Elasticsearch cluster at http://localhost:9220.

Create the index by running the following command in a terminal window:

curl -X PUT http://localhost:9220/company-minimal -H "Content-Type: application/json"-H "Content-Type: application/json"

If the index is created correctly, Elasticsearch will return the following response:

{"acknowledged":true}

If you want to destroy the index and start from scratch, execute the following command:

curl -X DELETE http://localhost:9220/company-minimal -H "Content-Type: application/json"-H "Content-Type: application/json"

Mapping definition

Mappings allow the user to configure how documents are stored in the index. For example, they allow you to define how fields are matched by the search engine and set their type (string, dates, numbers, locations and so on).

For detailed documentation about indices and mappings refer to the Elasticsearch Reference.

Let’s define a simple mapping to describe a company. The mapping will define the following fields:

• id: the id of the company in the SQLite database
• name: the name of the company
• description: a description of the company
• homepage: the URL of the company homepage
• number_of_employees: the number of employees
• location: the geographical coordinates of the company

Open a text editor and paste the following text:

{
"CompanyMinimal": {
"properties": {
"id": {
"type": "keyword"
},
"number_of_employees": {
"type": "long"
},
"name": {
"type": "text"
},
"description": {
"type": "text"
},
"homepage": {
"type": "keyword"
},
"location": {
"type": "geo_point"
}
}
}
}

CompanyMinimal is the name of the mapping; properties contains the options for each field.

Save the file to demo/example/CompanyMinimal.mapping inside the folder where you extracted the demonstration distribution.

To apply the mapping, execute the following command:

curl -X PUT "http://localhost:9220/company-minimal/_mapping/CompanyMinimal" -H "Content-Type: application/json" -d "@demo/example/CompanyMinimal.mapping"

If the mapping is created correctly, Elasticsearch will return the following response:

{"acknowledged":true}

SQL query definition

To extract the values that will be loaded to the index by Logstash, we need to write a SQL query. Open a text editor and paste the following one:

SELECT id,
label AS name,
description,
homepage_url as homepage,
number_of_employees,
CASE WHEN lat IS NULL THEN
NULL
ELSE
lat || ', ' || lng
END AS location
FROM company
LEFT JOIN company_geolocation ON company.id = company_geolocation.companyid

Save the file to demo/example/company-minimal.sql inside the folder where you extracted the demonstration distribution.

Logstash configuration

We now need to write a Logstash configuration to process the records returned by the query and populate the company-minimal index.

Note

Support for SQL databases is provided by the Logstash jdbc input plugin; You must download logstash to the demo/example folder and install the required plugin

Open a text editor and paste the following:

input {
jdbc {
jdbc_driver_library => "sqlitejdbc-v056.jar"
jdbc_driver_class => "org.sqlite.JDBC"
jdbc_connection_string => "jdbc:sqlite:crunchbase.db"
jdbc_user => ""
statement_filepath => "company-minimal.sql"
jdbc_paging_enabled => true
jdbc_page_size => 10000
}
}

filter {
mutate {
remove_field => ["@timestamp", "@version"]
}
}

output {
elasticsearch {
hosts => "localhost:9220"
manage_template => false
action => "index"
index => "company-minimal"
document_type => "CompanyMinimal"
}
}

The statement_filepath parameter specifies the path to the file containing the SQL query; the jdbc_* parameters set the database connection string and authentication options.

The mutate filter is configured to remove default Logstash fields which are not needed in the destination index.

The output section specifies the destination index; manage_template is set to false as the index mapping has been explicitly defined in the previous steps.

Save the file to demo/example/company-minimal.conf

Copy the SQLite database to demo/example/crunchbase.db, then go to the demo/example folder and run the following command:

cd demo/example
logstash/bin/logstash -f company-minimal.conf

Logstash will execute the query and populate the index.

Browsing the index in Siren Investigate

Open http://localhost:5606 in your browser, then go toManagementData Model.

Deselect Index contains time-based events, then enter company-minimal in the Index name or pattern field:

Click Create to create the index reference, then go to the Discover tab and select company-minimal in the dark gray box:

Click the right arrow at the beginning of each row to expand it and see all the loaded fields:

Script to load the demonstration data

The complete demonstration data loading process can be repeated by running the demo/sql/bin/index_crunchbase_sqlite.sh script. The script performs the following actions:

• Creates a copy of the database in the folder containing Logstash configurations
• Creates the indices article, company, investor and investment
• Sets the mappings for each index
• Runs the Logstash configuration for each index

The Logstash configurations and Elasticsearch mappings are available in the demo/sql/crunchbase/conf/logstash_sqlite folder.