Practices Part3
Week 8: Data Stream Mining
In this practical, we use the stream R package for analysing stream data. Please install the stream package to complete the practical.
I. Creating a data stream
We firstly create a generator to generate stream data points that will belong to one of three clusters (k = 3). Each data point will have 2 dimensions (d = 2). The data points will follow a Gaussian distribution with 5% noise. When a new data point is requested from this data generator, a cluster will be chosen randomly using the probability weights in p.
library("stream")
stream <- DSD_Gaussians(k = 3, d = 2, noise = .05, p = c(.5, .3, .1))
stream
Generate 5 data points using the generator.
p <- get_points(stream, n = 5)
p
- Use the option class = TRUE to see which cluster each data point belongs to. Note that noise data points (5%) do not belong to any cluster, so their class label is NA.
p <- get_points(stream, n = 10, class = TRUE)
p
- Plot 500 points from the data stream
plot(stream, n=500)
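To make the generator's behaviour concrete, the following base R sketch mimics the idea described above: pick a cluster using the weights in p, draw a Gaussian point around that cluster's centre, and replace a fraction of points with uniform noise labelled NA. The function simulate_points, the centre coordinates and the standard deviation are hypothetical choices for illustration; this is not how DSD_Gaussians is implemented internally.

```r
# Hypothetical helper: draw n points from a mixture of Gaussian clusters.
# centers: k x d matrix of cluster centres (assumed values for illustration)
# p:       cluster weights; sample() rescales them, so they need not sum to 1
# noise:   fraction of points replaced by uniform noise with class NA
simulate_points <- function(n, centers, p, sd = 0.05, noise = 0.05) {
  k <- nrow(centers)
  d <- ncol(centers)
  cls <- sample(k, n, replace = TRUE, prob = p)       # choose a cluster per point
  pts <- centers[cls, , drop = FALSE] +
    matrix(rnorm(n * d, sd = sd), n, d)               # Gaussian scatter around the centre
  is_noise <- runif(n) < noise
  pts[is_noise, ] <- runif(sum(is_noise) * d)         # noise points are uniform in [0, 1]
  cls[is_noise] <- NA                                 # noise has no class label
  data.frame(pts, class = cls)
}

set.seed(1)
centers <- matrix(c(0.2, 0.2, 0.8, 0.3, 0.5, 0.8), ncol = 2, byrow = TRUE)
pts <- simulate_points(1000, centers, p = c(.5, .3, .1))
table(pts$class, useNA = "ifany")
```

The table should show the first cluster receiving the most points and the third the fewest, with a small number of NA entries for the noise.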
II. Reading and writing data streams
- Write the created stream with 100 data points to a file called data.csv
write_stream(stream, "data.csv", n = 100, sep = ",")
- Read back the data.csv file to R.
stream_data = DSD_ReadStream("data.csv")
- Note that no data is read from the file into stream_data until we call get_points
get_points(stream_data, n=5)
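The same "read only on request" behaviour can be sketched in base R with an open file connection. The helper read_chunk and the temporary file are hypothetical and only illustrate the idea; they are not part of the stream package.

```r
# Write a small CSV, then read it in chunks through an open connection.
path <- tempfile(fileext = ".csv")
write.table(data.frame(x = 1:10, y = (1:10) / 10), path,
            sep = ",", row.names = FALSE, col.names = FALSE)

con <- file(path, open = "r")        # opening the connection reads nothing yet

# Hypothetical helper: read the next n rows from the connection.
read_chunk <- function(con, n) {
  lines <- readLines(con, n = n)     # consumes only the next n lines
  read.csv(text = lines, header = FALSE)
}

first  <- read_chunk(con, 3)         # rows 1 to 3
second <- read_chunk(con, 3)         # rows 4 to 6: the connection keeps its position
close(con)
```

Each call advances through the file, much as repeated calls to get_points advance through a stream.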
III. Reservoir Sampling
- Create a stream with 3 clusters and 5% noise
stream <- DSD_Gaussians(k = 3, d = 2, noise = .05, p = c(.5, .3, .1))
- Create a reservoir sampler that keeps a sample of 20 points from the stream
sample <- DSAggregate_Sample(k = 20)
- Update the data for sample using 500 data points from stream
update(sample, stream, 500)
sample
- Get the data from sample
get_points(sample)
- Plot the data points in sample
plot(get_points(sample))
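For intuition, reservoir sampling itself can be sketched in a few lines of base R. This is the classic unbiased variant (often called Algorithm R) applied to a vector that stands in for the stream; the function name reservoir_sample is hypothetical, and the package's own implementation may differ in detail.

```r
# Reservoir sampling: keep a uniform random sample of k items from a
# stream that is processed one item at a time.
reservoir_sample <- function(stream_values, k) {
  n <- length(stream_values)
  reservoir <- stream_values[seq_len(min(k, n))]   # fill with the first k items
  if (n > k) {
    for (i in (k + 1):n) {
      j <- sample.int(i, 1)                        # uniform position in 1..i
      if (j <= k) reservoir[j] <- stream_values[i] # item i is kept with probability k / i
    }
  }
  reservoir
}

set.seed(42)
s <- reservoir_sample(1:500, k = 20)
length(s)
```

Every item in the stream ends up in the reservoir with the same probability, and the memory use stays fixed at k items no matter how long the stream is.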
IV. Data Stream Clustering
- We first prepare the clustering algorithm. We use DSC_DStream, which implements the D-Stream algorithm (Tu and Chen 2009). D-Stream assigns points to cells in a grid. For this example we use a grid size of 0.1.
dstream <- DSC_DStream(gridsize = .1, Cm = 1.2)
dstream
- The clusters are currently empty, but they are ready to get data points from the stream.
update(dstream, stream, n = 500)
dstream
plot(dstream, stream)
- There are a number of micro-clusters. We can get the centers of the micro-clusters using:
head(get_centers(dstream))
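The grid step can be illustrated with a small base R sketch: each point is mapped to the cell that contains it, and the points per cell are counted. This covers only the cell assignment; it leaves out the density decay and the density thresholds (such as Cm) that D-Stream uses, and the function assign_cells is a hypothetical name.

```r
# Map 2-d points to grid cells of width gridsize and count points per cell.
assign_cells <- function(points, gridsize) {
  cells <- floor(as.matrix(points) / gridsize)     # cell index in each dimension
  key <- apply(cells, 1, paste, collapse = ":")    # one label per cell, e.g. "1:1"
  table(key)
}

pts <- rbind(c(0.12, 0.18), c(0.15, 0.11), c(0.52, 0.49), c(0.95, 0.05))
counts <- assign_cells(pts, gridsize = 0.1)
counts
```

Here the first two points fall into the same cell, so that cell has a count of 2 while the other two cells have a count of 1.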
Week 10: Data Stream Mining
I. Evaluation of data stream clustering
Internal evaluation measures:
- “average.between” Average distance between clusters
- “average.within” Average distance within clusters
- “max.diameter” Maximum cluster diameter
- “entropy” entropy of the distribution of cluster memberships
External evaluation measures:
- “precision” and “recall”:
- Precision=TP/(TP+FP)
- Recall=TP/(TP+FN)
- “purity”: Average purity of clusters. The purity of each cluster is the proportion of the points of the majority true group assigned to it.
- “Euclidean”: Euclidean dissimilarity of the memberships
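As a small worked example of the external measures, the sketch below computes precision, recall and purity for a toy labelling in base R. It assumes the pair-counting reading of TP, FP and FN (a pair of points is a true positive if both the clustering and the ground truth put them together); whether the stream package computes these measures in exactly this way is an assumption. The function names and the toy labels are hypothetical.

```r
# Pair-counting precision and recall for a clustering.
# truth and pred are label vectors of the same length.
pair_precision_recall <- function(truth, pred) {
  pairs <- combn(length(truth), 2)                    # all pairs of points
  same_truth <- truth[pairs[1, ]] == truth[pairs[2, ]]
  same_pred  <- pred[pairs[1, ]]  == pred[pairs[2, ]]
  tp <- sum(same_pred & same_truth)                   # together in both
  fp <- sum(same_pred & !same_truth)                  # together only in the clustering
  fn <- sum(!same_pred & same_truth)                  # together only in the truth
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# Average over clusters of the share taken by the majority true group.
purity <- function(truth, pred) {
  tab <- table(pred, truth)
  mean(apply(tab, 1, max) / rowSums(tab))
}

truth <- c(1, 1, 1, 2, 2, 2)
pred  <- c(1, 1, 2, 2, 2, 2)
pr <- pair_precision_recall(truth, pred)   # precision 4/7, recall 2/3
pu <- purity(truth, pred)                  # (1 + 0.75) / 2 = 0.875
```

In this toy case the clustering groups 7 pairs together, 4 of them correctly, while the ground truth contains 6 same-group pairs, which gives the values noted in the comments.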
library("stream")
stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
- Use reservoir sampling to keep a sample of 100 data points, then use K-means to group the sample into 4 clusters.
Reservoir_Kmeans = DSC_TwoStage(micro = DSC_Sample(k = 100), macro = DSC_Kmeans(k = 4))
update(Reservoir_Kmeans, stream, n=500)
Reservoir_Kmeans
plot(Reservoir_Kmeans, stream)
evaluate_static(Reservoir_Kmeans, stream, measure = c("average.between", "precision", "recall"), n = 500)
- Use the sliding window method instead of reservoir sampling in the above example and compare the precision and recall of the two methods.
Hint: replace the micro-clustering step with DSC_Window(horizon = 100), as in the code that follows.
Window_Kmeans = DSC_TwoStage(micro = DSC_Window(horizon = 100), macro = DSC_Kmeans(k = 4))
update(Window_Kmeans, stream, n=500)
Window_Kmeans
plot(Window_Kmeans, stream)
evaluate_static(Window_Kmeans, stream, measure = c("average.between", "precision", "recall"), n = 500)
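The difference between the two summaries is easy to see in a base R sketch of a sliding window: it always holds the most recent horizon values and forgets everything older, whereas a reservoir keeps a sample of the whole history. The constructor make_window is a hypothetical helper, not part of the stream package.

```r
# Keep only the most recent `horizon` values seen so far.
make_window <- function(horizon) {
  buffer <- numeric(0)
  list(
    update = function(x) {
      buffer <<- tail(c(buffer, x), horizon)   # append, then drop the oldest values
      invisible(NULL)
    },
    get = function() buffer
  )
}

w <- make_window(horizon = 100)
w$update(1:500)
range(w$get())   # 401 500
```

After 500 values have passed, only values 401 to 500 remain in the window.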
II. Concept Drift
Concept drift means that the data-generating process changes over time, so the statistical properties of the data change as well. A good data stream mining algorithm should be able to deal with concept drift. In the stream package, DSD_Benchmark(1) is an example data stream that contains concept drift. To show the drift, we request 250 data points from the stream four times and plot them. To fast-forward in the stream, we request 1400 points between the plots and discard them. The code below produces 4 figures showing the data at different points in time.
stream <- DSD_Benchmark(1)
stream
for (i in 1:4) {
  plot(stream, 250, xlim = c(0, 1), ylim = c(0, 1))
  tmp <- get_points(stream, n = 1400)
}
We can use the animation package to show the drift as an animation:
reset_stream(stream)
animate_data(stream, n = 10000, horizon = 100, xlim = c(0, 1), ylim = c(0, 1))
library("animation")
animation::ani.options(interval = .1)
ani.replay()
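A minimal numeric illustration of why drift matters: in the base R sketch below, a one-dimensional stream drifts steadily from 0 to 1 (without noise, for clarity). A summary over the whole history lags far behind the current state, while a summary over a recent window stays close to it. The drift pattern and the window length of 100 are assumptions chosen for the example and are unrelated to DSD_Benchmark(1).

```r
# A 1-d stream whose value drifts linearly from 0 to 1.
n <- 1000
x <- seq(0, 1, length.out = n)

mean_all    <- mean(x)              # summary over every point seen so far
mean_window <- mean(tail(x, 100))   # summary over the last 100 points only
current     <- x[n]                 # where the process is now

c(all = mean_all, window = mean_window, current = current)
```

The windowed mean is much closer to the current value than the overall mean, which is the motivation for windowed or biased summaries when the stream drifts.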
III. Evaluation of data stream clustering with concept drift
- Using Reservoir sampling and K-means
stream = DSD_Benchmark(1)
Reservoir_Kmeans = DSC_TwoStage(micro = DSC_Sample(k = 100, biased = TRUE), macro = DSC_Kmeans(k = 2))
update(Reservoir_Kmeans, stream, n = 500)
plot(Reservoir_Kmeans, stream)
evaluate_stream(Reservoir_Kmeans, stream, measure = c("precision", "recall"), n = 5000, horizon = 100)
- Evaluate the sliding window + K-means clustering
# Sliding window + K-means clustering
Window_Kmeans = DSC_TwoStage(micro = DSC_Window(horizon = 100), macro = DSC_Kmeans(k = 2))
update(Window_Kmeans, stream, n=500)
Window_Kmeans
plot(Window_Kmeans, stream)
evaluate_stream(Window_Kmeans, stream, measure = c("precision", "recall"), n = 5000, horizon = 100)