kMeansMiniBatch (Stream Processor)

Performs K-Means clustering on a streaming data set. Data points can be of any dimension, and the dimensionality is inferred from the number of feature parameters. All data points processed in a single query must have the same dimensionality. Euclidean distance is used as the distance metric. The algorithm resembles mini-batch K-Means (see "Web-Scale K-Means Clustering" by D. Sculley, Google, Inc.).
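To make the clustering step concrete, the following Python sketch runs K-Means iterations over a single batch of points, stopping when a maximum number of iterations is reached or the centroids no longer change. It is a minimal illustration under assumptions, not the extension's implementation: the names `euclidean` and `cluster_batch` are hypothetical, and the initial centroids are simply passed in by the caller.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimensionality."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_batch(
    batch: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
    max_iterations: int = 50,
) -> List[List[float]]:
    """Hypothetical helper: refine centroids on one batch (illustrative only)."""
    current = [list(map(float, c)) for c in centroids]
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        groups: List[list] = [[] for _ in current]
        for point in batch:
            nearest = min(range(len(current)), key=lambda i: euclidean(point, current[i]))
            groups[nearest].append(point)
        # Move each centroid to the mean of its group; an empty group keeps its centroid.
        updated = [
            [sum(axis) / len(group) for axis in zip(*group)] if group else centroid
            for group, centroid in zip(groups, current)
        ]
        if updated == current:  # centroids did not change, so stop early
            break
        current = updated
    return current
```

The dimensionality is never passed explicitly here; it follows from the length of each point, which mirrors how the stream processor infers it from the number of feature arguments.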

Syntax

streamingml:kMeansMiniBatch(<INT> no.of.clusters, <DOUBLE|FLOAT|INT|LONG> model.feature, <DOUBLE|FLOAT|INT|LONG> ...)
streamingml:kMeansMiniBatch(<INT> no.of.clusters, <DOUBLE> decay.rate, <DOUBLE|FLOAT|INT|LONG> model.feature, <DOUBLE|FLOAT|INT|LONG> ...)
streamingml:kMeansMiniBatch(<INT> no.of.clusters, <INT> maximum.iterations, <DOUBLE|FLOAT|INT|LONG> model.feature, <DOUBLE|FLOAT|INT|LONG> ...)
streamingml:kMeansMiniBatch(<INT> no.of.clusters, <INT> no.of.events.to.retrain, <DOUBLE|FLOAT|INT|LONG> model.feature, <DOUBLE|FLOAT|INT|LONG> ...)
streamingml:kMeansMiniBatch(<INT> no.of.clusters, <DOUBLE> decay.rate, <INT> maximum.iterations, <DOUBLE|FLOAT|INT|LONG> model.feature, <DOUBLE|FLOAT|INT|LONG> ...)
streamingml:kMeansMiniBatch(<INT> no.of.clusters, <DOUBLE> decay.rate, <INT> no.of.events.to.retrain, <DOUBLE|FLOAT|INT|LONG> model.feature, <DOUBLE|FLOAT|INT|LONG> ...)
streamingml:kMeansMiniBatch(<INT> no.of.clusters, <INT> maximum.iterations, <INT> no.of.events.to.retrain, <DOUBLE|FLOAT|INT|LONG> model.feature, <DOUBLE|FLOAT|INT|LONG> ...)
streamingml:kMeansMiniBatch(<INT> no.of.clusters, <DOUBLE> decay.rate, <INT> maximum.iterations, <INT> no.of.events.to.retrain, <DOUBLE|FLOAT|INT|LONG> model.feature, <DOUBLE|FLOAT|INT|LONG> ...)

Query Parameters

| Name | Description | Default Value | Possible Data Types | Optional | Dynamic |
|---|---|---|---|---|---|
| no.of.clusters | The assumed number of natural clusters in the data set. | | INT | No | No |
| decay.rate | The decay rate of old data relative to new data. The value must be in [0, 1]: 0 means only old data is used, and 1 means only new data is used. | 0.01 | DOUBLE | Yes | No |
| maximum.iterations | The maximum number of iterations. The process iterates until this number is reached or the centroids stop changing. | 50 | INT | Yes | No |
| no.of.events.to.retrain | The number of events after which the cluster centers are recalculated. | 20 | INT | Yes | No |
| model.feature | A variable-length argument. Depending on the dimensionality of the data points, the coordinates are received as features, one along each axis. | | DOUBLE, FLOAT, INT, LONG | No | Yes |
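One way to read the decay.rate parameter is as the weight given to a newly computed centroid when it is blended with the existing one. The Python sketch below illustrates that reading only. The function name `blend_centroid` and the linear-blend formula are assumptions chosen to match the documented endpoints (0 keeps only old data, 1 keeps only new data); the extension's actual update rule may differ.

```python
from typing import List, Sequence


def blend_centroid(
    old: Sequence[float], new: Sequence[float], decay_rate: float = 0.01
) -> List[float]:
    """Hypothetical helper: blend an old centroid with a newly computed one.

    Assumed rule: (1 - decay_rate) * old + decay_rate * new.
    """
    if not 0.0 <= decay_rate <= 1.0:
        raise ValueError("decay_rate must be in [0, 1]")
    if len(old) != len(new):
        raise ValueError("centroids must have the same dimensionality")
    return [(1.0 - decay_rate) * o + decay_rate * n for o, n in zip(old, new)]
```

Under this assumption, the default of 0.01 would move each centroid only slightly toward the latest batch, so the model changes slowly as new events arrive.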

Extra Return Attributes

| Name | Description | Possible Types |
|---|---|---|
| euclideanDistanceToClosestCentroid | The Euclidean distance between the current data point and the closest centroid. | DOUBLE |
| closestCentroidCoordinate | A variable-length attribute. Depending on the dimensionality (d), the attributes closestCentroidCoordinate1 through closestCentroidCoordinate<d> are returned, giving the d-dimensional coordinates of the centroid in the model that is closest to the current event. This is the prediction result and represents the cluster to which the current event belongs. | DOUBLE |
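The two return attributes correspond to a nearest-centroid lookup. The Python sketch below shows how such a lookup can be computed for one event. The function name `closest_centroid` and the tuple-based return shape are assumptions for illustration and do not reflect the extension's internal code.

```python
import math
from typing import Sequence, Tuple


def closest_centroid(
    point: Sequence[float], centroids: Sequence[Sequence[float]]
) -> Tuple[float, Tuple[float, ...]]:
    """Hypothetical helper: return (distance, coordinates) of the nearest centroid."""

    def distance(c: Sequence[float]) -> float:
        # Euclidean distance between the event and one centroid.
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(point, c)))

    best = min(centroids, key=distance)
    return distance(best), tuple(best)
```

For a two-dimensional event, the first element of the result plays the role of euclideanDistanceToClosestCentroid, and the two coordinates play the roles of closestCentroidCoordinate1 and closestCentroidCoordinate2.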

Example 1

CREATE STREAM InputStream (x double, y double);
@info(name = 'query1')
insert into OutputStream
select closestCentroidCoordinate1, closestCentroidCoordinate2, x, y
from InputStream#streamingml:kMeansMiniBatch(2, 0.2, 10, 20, x, y);

In this example, the user specifies all three hyperparameters: decay.rate is 0.2, maximum.iterations is 10, and no.of.events.to.retrain is 20. The first 20 events are consumed to build the model, and prediction starts from the 21st event.

Example 2

CREATE STREAM InputStream (x double, y double);
@info(name = 'query1')
insert into OutputStream
select closestCentroidCoordinate1, closestCentroidCoordinate2, x, y
from InputStream#streamingml:kMeansMiniBatch(2, x, y);

In this example, the user has not specified any hyperparameters, so the default values are used (decay.rate is 0.01, maximum.iterations is 50, and no.of.events.to.retrain is 20).