Siren Platform User Guide

Data sampling guidelines

Data used for model training should be representative of the dataset as a whole, covering the full range of values and the patterns that are typical of it. This maximizes the accuracy of detection and prediction models. Carefully consider the sampling resolution and the time range of the training data, balancing representativeness against the efficiency of training.

Bucket size

The bucket size is the resolution at which your data is sampled, both during training and for detections/predictions. The bucket size should be at least as large as the typical interval at which your data is logged. This means that if data is logged at hourly intervals, the minimum bucket size should be one hour. If you choose a smaller value, many buckets will contain no data and the model will waste computational effort learning trends that are not useful or that do not really exist.

In many cases, choosing a larger bucket size preserves the characteristics of the data while reducing the amount used for training, which speeds up model creation. Use the preview graph in the model creation screen as an indicator of a good bucket size and time range.

As a practical example, suppose you want to train a model on a year of data that was logged at one-second intervals. A bucket size of 1 second yields 31.5 million points, whereas a bucket size of 5 minutes reduces this to roughly 105,000 points while still preserving the broader patterns in the data.
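
To make the arithmetic concrete, the following sketch (plain Python, not a Siren API call) computes the number of training points produced by a few candidate bucket sizes over one year of one-second data:

# Rough point counts for one year of data logged at one-second intervals.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000

for label, bucket_seconds in [("1 second", 1), ("1 minute", 60),
                              ("5 minutes", 300), ("1 hour", 3600)]:
    points = SECONDS_PER_YEAR // bucket_seconds
    print(f"bucket size {label}: {points:,} points")

# bucket size 1 second: 31,536,000 points
# bucket size 1 minute: 525,600 points
# bucket size 5 minutes: 105,120 points
# bucket size 1 hour: 8,760 points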

Seasonality

When training a model, ensure that the training range captures the seasonal changes within your data (seasonal variations recur at regular intervals of less than a year). For example, if your data shows different patterns on weekdays and weekends, the training range should span several weeks. If your data changes between summer and winter, use multiple years of data to capture this trend accurately.
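
As a rule of thumb, the training range should span several full cycles of the longest seasonal period in your data. The sketch below (plain Python; the factor of four cycles is an assumption chosen for illustration, not a Siren requirement) turns that rule into a minimum range:

from datetime import timedelta

def minimum_training_range(seasonal_period: timedelta, cycles: int = 4) -> timedelta:
    # Suggested minimum training range: several full seasonal cycles.
    return cycles * seasonal_period

print(minimum_training_range(timedelta(weeks=1)))    # weekday/weekend pattern -> 28 days
print(minimum_training_range(timedelta(days=365)))   # summer/winter pattern -> 1460 days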

False positive anomalies

An anomaly detection model may mark as anomalies some data points that you consider normal. To rectify this, retrain the model, making sure to include the data that it incorrectly tagged as anomalous.

Use cases

Coastal temperature

An oceanographer with a dataset of daily coastal sea temperature measurements at a depth of 15 m wants to predict changes in the temperature signal. A model trained on the last two decades of data (roughly 7,300 daily points) captures the ongoing changes in sea temperature and therefore allows more accurate prediction.
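
For scale, the point count quoted above is plain arithmetic (a sketch, not a Siren calculation):

# Two decades of daily measurements at a one-day bucket size.
years = 20
points = years * 365          # ignoring leap days
print(points)                 # 7300 training points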

Web server response time

Anomaly detection models work very well for server logs. An example of this is response logs for a web server:

{
  "time": "17/May/2018:08:05:32 +0000",
  "request": "GET /downloads/product_1 HTTP/1.1",
  "response": 304,
  "response_time": 13,
  "remote_ip": "93.180.71.3",
  "agent": "Debian APT-HTTP/1.3 (0.8.16~exp12ubuntu10.21)",
  "bytes": 0,
}

A useful metric here is the response time of the server, expressed in milliseconds. This value rises and falls with the number of requests the web server is handling at any one moment, which in turn varies with the time of day and the day of the week. With the Machine Learning application, we can create a model that understands the seasonality of the response time and indicates when it is behaving unexpectedly; in this case, when the server is handling requests unusually slowly.

For this scenario, you would create an anomaly detection model for the saved search that contains the logs, using the maximum of response_time as the metric. Depending on how many requests the web server receives, the bucket size might be somewhere between 5 seconds (250k documents) and 1 minute (20k documents). Select several weeks of data, which is enough to see the variations over weekends.
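
As an illustration of the per-bucket metric such a model analyses, the sketch below runs the equivalent Elasticsearch aggregation directly: the maximum response_time in fixed one-minute buckets. The index name web-logs, the local Elasticsearch URL, and the assumption that the time field is mapped as a date are hypothetical; in practice, Siren builds the model for you from the saved search.

import requests

query = {
    "size": 0,
    "aggs": {
        "per_minute": {
            "date_histogram": {"field": "time", "fixed_interval": "1m"},
            "aggs": {"max_response_time": {"max": {"field": "response_time"}}},
        }
    },
}

# Query the hypothetical web-logs index for the slowest response in each minute.
resp = requests.post("http://localhost:9200/web-logs/_search", json=query)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["per_minute"]["buckets"]:
    print(bucket["key_as_string"], bucket["max_response_time"]["value"])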