Commit

Fixed lots of code.
mhahsler committed Jun 8, 2019
1 parent 2de94e5 commit a0e678a
Showing 27 changed files with 154 additions and 337 deletions.
5 changes: 3 additions & 2 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: stream
Version: 1.3-0.1
Date: 2018-xx-xx
Version: 1.3-1
Date: 2019-06-07
Title: Infrastructure for Data Stream Mining
Authors@R: c(person("Michael", "Hahsler", role = c("aut", "cre", "cph"),
email = "mhahsler@lyle.smu.edu"),
@@ -17,3 +17,4 @@ URL: https://github.com/mhahsler/stream
BugReports: https://github.com/mhahsler/stream/issues
LinkingTo: Rcpp, BH
License: GPL-3
RoxygenNote: 6.1.1
12 changes: 10 additions & 2 deletions NEWS.md
@@ -1,26 +1,34 @@
# stream 1.3-0.1 (xx/xx/18)
# stream 1.3-1 (06/07/19)

## New Features
* Added DSC_evoStream and DSC_EA. Code by Matthias Carnein.

## Changes
* Package animation is now only suggested since it requires package magick
which may need the imagemagick++ libraries installed.

# stream 1.3-0 (05/31/18)

## New Features
* Added DSC_BIRCH. Code and Interface by Dennis Assenmacher and Matthias Carnein.
* Added DSC_BICO. Code by Hendrik Fichtenberger, Marc Gille, Melanie Schmidt,
Chris Schwiegelshohn, Christian Sohler and Interface provided by Matthias
Carnein and Dennis Assenmacher.
* animate_cluster: noise now accepts "class" or "exclude" ("ignore" is deprecated).

## Bug Fixes
* DSD_ReadCSV: Fixed bug with streams that have no class/cluster label
(reported by Matthias Carnein).
* animate_cluster: noise now accepts "class" or "exclude" ("ignore" is deprecated).

# stream 1.2-4 (02/25/17)

## Bug Fixes
* Use dbFetch in DSD_ReadDB (new version of RSQLite).
* Register native C routines.

# stream 1.2-3 (08/07/16)

## Bug Fixes
* fixed saveDSC for DBStream.
* fixed handling of data with d=1 (reported by Ilana Lichtenstein).
* plot now automatically determines if the data supports a class attribute.
17 changes: 0 additions & 17 deletions R/DSC_BICO.R
@@ -17,23 +17,6 @@
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


#' BICO - Fast computation of k-means coresets in a data stream
#'
#' BICO maintains a tree which is inspired by the clustering tree of BIRCH,
#' a SIGMOD Test of Time award-winning clustering algorithm.
#' Each node in the tree represents a subset of these points. Instead of
#' storing all points as individual objects, only the number of points,
#' the sum and the squared sum of the subset's points are stored as key features
#' of each subset. Points are inserted into exactly one node.
#'
#' In this implementation, the nearest neighbour search on the first level
#' of the tree is sped up by projecting all points to random 1-d subspaces.
#' The first estimation of the optimal clustering cost is computed in a
#' buffer phase at the beginning of the algorithm.
#'
#' This implementation interfaces the original C++ implementation available here: \url{http://ls2-www.cs.tu-dortmund.de/grav/de/bico}.
#' For micro-clustering, the algorithm computes the coreset of the stream. Reclustering is performed by using the \code{kmeans++} algorithm on the coreset.

DSC_BICO <- function(k=5, space=10, p=10, iterations=10) {

BICO <- BICO_R$new(k, space, p, iterations)
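
The documentation removed in this hunk describes each tree node as storing only the number of points, the sum, and the squared sum of its subset. The following is a minimal, self-contained sketch of such a summary. The `ClusteringFeature` type and its members are hypothetical names chosen for illustration; this is not the CluE/BICO data structure used by the package.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical summary of a subset of points (illustration only):
// point count, per-dimension sum, and sum of squared norms.
struct ClusteringFeature {
    std::size_t n = 0;              // number of points in the subset
    std::vector<double> linearSum;  // per-dimension sum of the points
    double squaredSum = 0.0;        // sum of squared Euclidean norms

    explicit ClusteringFeature(std::size_t dim) : linearSum(dim, 0.0) {}

    // Add one point by updating the three summary statistics.
    void insert(const std::vector<double>& point) {
        if (point.size() != linearSum.size()) {
            throw std::invalid_argument("dimension mismatch");
        }
        for (std::size_t i = 0; i < point.size(); ++i) {
            linearSum[i] += point[i];
            squaredSum += point[i] * point[i];
        }
        ++n;
    }

    // Centroid of the subset, recovered from the sum and the count.
    std::vector<double> centroid() const {
        if (n == 0) {
            throw std::logic_error("empty subset has no centroid");
        }
        std::vector<double> c(linearSum.size());
        for (std::size_t i = 0; i < c.size(); ++i) {
            c[i] = linearSum[i] / static_cast<double>(n);
        }
        return c;
    }

    // Sum of squared distances to the centroid:
    // squaredSum - ||linearSum||^2 / n.
    double cost() const {
        if (n == 0) {
            return 0.0;
        }
        double norm2 = 0.0;
        for (double s : linearSum) {
            norm2 += s * s;
        }
        return squaredSum - norm2 / static_cast<double>(n);
    }
};
```

Because the three statistics are additive, the cost of a subset can be computed without keeping the individual points, which is what makes this kind of summary suitable for streams.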
27 changes: 6 additions & 21 deletions R/DSC_EA.R
@@ -53,7 +53,8 @@
#' reset_stream(stream)
#' plot(two, stream, type="both")
#'
#' ## if we have time, evaluate additional generations. This can be called at any time, also between observations.
#' ## if we have time, evaluate additional generations. This can be
#' ## called at any time, also between observations.
#' two$macro_dsc$RObj$recluster(2000)
#'
#' ## plot improved result
@@ -85,30 +86,14 @@ DSC_EA <- function(k, generations=2000, crossoverRate=.8, mutationRate=.001, pop
}



#' Reference Class EA_R
#'
#' Reference class used for Reclustering using an evolutionary algorithm
#'
#' @field crossoverRate cross-over rate for the evolutionary algorithm
#' @field mutationRate mutation rate for the evolutionary algorithm
#' @field populationSize number of solutions that the evolutionary algorithm maintains
#' @field k number of macro-clusters
#' @field generations number of EA generations performed during reclustering
#' @field data micro-clusters to recluster
#' @field weights weights of the micro-clusters
#' @field C exposed C class
#'
#' @author Matthias Carnein \email{matthias.carnein@@uni-muenster.de}
#'
EA_R <- setRefClass("EA",
fields = list(
crossoverRate = "numeric",
mutationRate = "numeric",
crossoverRate = "numeric",
mutationRate = "numeric",
populationSize = "integer",
k = "integer",
k = "integer",
data = "data.frame",
weights = "numeric",
weights = "numeric",
generations = "integer",
C = "ANY"
),
27 changes: 8 additions & 19 deletions R/DSC_evoStream.R
@@ -45,13 +45,13 @@
#' @references Carnein M. and Trautmann H. (2018), "evoStream - Evolutionary Stream Clustering Utilizing Idle Times", Big Data Research.
#'
#' @examples
#' stream <- DSD_Memory(DSD_Gaussians(k = 3, d = 2), 1000)
#' stream <- DSD_Memory(DSD_Gaussians(k = 3, d = 2), 500)
#'
#' ## init evoStream
#' evoStream <- DSC_evoStream(r=0.05, k=3, incrementalGenerations=1, reclusterGenerations=1000)
#' evoStream <- DSC_evoStream(r=0.05, k=3, incrementalGenerations=1, reclusterGenerations=500)
#'
#' ## insert observations
#' update(evoStream, stream, n = 1000)
#' update(evoStream, stream, n = 500)
#'
#' ## micro clusters
#' get_centers(evoStream, type="micro")
@@ -69,9 +69,11 @@
#' reset_stream(stream)
#' plot(evoStream, stream, type = "both")
#'
#' ## if we have time, evaluate additional generations. This can be called at any time, also between observations.
#' ## by default, 1 generation is evaluated after each observation and 1000 generations during reclustering (parameters)
#' evoStream$RObj$recluster(2000)
#' ## if we have time, evaluate additional generations.
#' ## This can be called at any time, also between observations.
#' ## by default, 1 generation is evaluated after each observation and
#' ## 1000 generations during reclustering but we set it here to 500
#' evoStream$RObj$recluster(500)
#'
#' ## plot improved result
#' reset_stream(stream)
@@ -94,19 +96,6 @@ DSC_evoStream <- function(r, lambda=0.001, tgap=100, k=2, crossoverRate=.8, muta
}







#' Reference Class evoStream_R
#'
#' Reference class mostly used to expose the C class object
#'
#' @field C exposed C class
#'
#' @author Matthias Carnein \email{matthias.carnein@@uni-muenster.de}
#'
evoStream_R <- setRefClass("evoStream_R", fields = list(
C ="ANY"
))
11 changes: 2 additions & 9 deletions inst/CITATION
@@ -17,13 +17,6 @@ bibentry(bibtype = "Article",
volume = "76",
number = "14",
pages = "1--50",
doi = "10.18637/jss.v076.i14",

header = "To cite stream in publications use:",
textVersion =
paste("Michael Hahsler, Matthew Bolanos, John Forrest (2017).",
"Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R.",
"Journal of Statistical Software, 76(14), 1-50.",
"doi:10.18637/jss.v076.i14")
)
doi = "10.18637/jss.v076.i14"
)

4 changes: 0 additions & 4 deletions man/DSC_BICO.Rd
@@ -28,16 +28,13 @@ In this implementation, the nearest neighbour search on the first level
of the tree is sped up by projecting all points to random 1-d subspaces.
The first estimation of the optimal clustering cost is computed in a
buffer phase at the beginning of the algorithm.
This implementation interfaces the original C++ implementation available here: \url{http://ls2-www.cs.tu-dortmund.de/grav/de/bico}.
For micro-clustering, the algorithm computes the coreset of the stream. Reclustering is performed by using the \code{kmeans++} algorithm on the coreset.
}
\examples{
stream <- DSD_Gaussians(k = 3, d = 2)
BICO <- DSC_BICO(k = 3, p = 10, space = 100, iterations = 10)
update(BICO, stream, n = 500)
plot(BICO,stream, type = "both")
}
\references{
@@ -47,7 +44,6 @@ Hendrik Fichtenberger, Marc Gille, Melanie Schmidt, Chris Schwiegelshohn, Christ
R-Interface:
Matthias Carnein (\email{Matthias.Carnein@uni-muenster.de}),
Dennis Assenmacher.
C-Implementation:
Hendrik Fichtenberger,
Marc Gille,
3 changes: 2 additions & 1 deletion man/DSC_EA.Rd

22 changes: 11 additions & 11 deletions man/DSC_TwoStage.Rd
@@ -5,21 +5,21 @@
single process.}

\usage{
DSC_TwoStage(micro, macro)
DSC_TwoStage(micro, macro)
}

\arguments{
\item{micro}{Clustering algorithm for online stage (\code{DSC_micro})}
\item{macro}{Clustering algorithm for offline stage (\code{DSC_macro})}
\item{micro}{Clustering algorithm used in the online stage (\code{DSC_micro})}
\item{macro}{Clustering algorithm used for reclustering in the offline stage (\code{DSC_macro})}
}

\details{
\code{update()} runs the micro-clustering stage and if centers/weights are
requested the reclustering is automatically performed.
\code{update()} runs the micro-clustering stage and only when macro cluster
centers/weights are requested, then the offline stage reclustering is automatically performed.
}

\value{
An object of class \code{DSC_TwoStage} (subclass of \code{DSC}, \code{DSC_Macro}).
An object of class \code{DSC_TwoStage} (subclass of \code{DSC}, \code{DSC_Macro}).
}

%\references{ }
@@ -39,13 +39,13 @@ stream <- DSD_Gaussians(k=3)
# Create a clustering process that uses a window for the online stage and
# k-means for the offline stage (reclustering)
win_km <- DSC_TwoStage(
micro=DSC_Window(horizon=100),
micro=DSC_Window(horizon=100),
macro=DSC_Kmeans(k=3)
)
)
win_km
update(win_km, stream, 200)

update(win_km, stream, 200)
win_km
plot(win_km, stream, type="both")
plot(win_km, stream, type="both")
evaluate(win_km, stream, assign="macro")
}
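
The revised details text for `DSC_TwoStage` says that `update()` only runs the online stage and that reclustering happens when macro-cluster centers or weights are requested. One way to picture this lazy behaviour is a staleness flag, similar in spirit to the `upToDate` member visible in `src/BICO.cpp`. The sketch below is an assumption-laden illustration: `LazyTwoStage`, its 1-d data, and the mean used as a stand-in for macro-clustering are all hypothetical and not part of the package.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical two-stage clusterer over 1-d observations (illustration only).
class LazyTwoStage {
public:
    // Online stage: record the observation and mark the macro result stale.
    void update(double x) {
        micro_.push_back(x);
        upToDate_ = false;
    }

    // Offline stage runs only if the macro result is requested while stale.
    double macroCenter() {
        if (!upToDate_) {
            recluster();
        }
        return center_;
    }

    // Number of times the offline stage has actually run.
    std::size_t reclusterCount() const { return reclusterCount_; }

private:
    // Stand-in for a real macro-clustering step: the mean of stored values.
    void recluster() {
        double sum = 0.0;
        for (double v : micro_) {
            sum += v;
        }
        center_ = micro_.empty() ? 0.0 : sum / static_cast<double>(micro_.size());
        upToDate_ = true;
        ++reclusterCount_;
    }

    std::vector<double> micro_;
    double center_ = 0.0;
    bool upToDate_ = false;
    std::size_t reclusterCount_ = 0;
};
```

With this design, many calls to `update()` cost nothing in the offline stage, and repeated requests for the macro result reuse the cached value until new data arrives.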
14 changes: 8 additions & 6 deletions man/DSC_evoStream.Rd

39 changes: 21 additions & 18 deletions src/BICO.cpp
@@ -1,30 +1,30 @@
#include <Rcpp.h>
using namespace Rcpp;

#include <iostream>
#include <sstream>
#include <fstream>
#include <random>
#include <ctime>
#include <time.h>
//#include <iostream>
//#include <sstream>
//#include <fstream>
//#include <random>
//#include <ctime>
//#include <time.h>

#include <boost/algorithm/string.hpp>
//#include <boost/algorithm/string.hpp>

#include "BICO/l2metric.h"
#include "BICO/squaredl2metric.h"
#include "BICO/point.h"
#include "BICO/pointweightmodifier.h"
#include "BICO/bico.h"
#include "BICO/randomness.h"
#include "BICO/randomgenerator.h"
//#include "BICO/randomness.h"
//#include "BICO/randomgenerator.h"
#include "BICO/proxysolution.h"
#include "BICO/pointcentroid.h"
#include "BICO/pointweightmodifier.h"
#include "BICO/realspaceprovider.h"

#include "BICO/master.h"
#include <stdio.h>
#include <stdlib.h>
//#include <stdio.h>
//#include <stdlib.h>


// data structure adapted to Rcpp NumericMatrix and IntegerVector
@@ -115,21 +115,24 @@ class BICO {
}


void cluster(Rcpp::NumericMatrix data){
void cluster(Rcpp::NumericMatrix data) {

// initialize
if(bico==NULL){
this->d = data.ncol();
// parameter n is not used, therefore pass dummy 0 value
this->bico = new CluE::Bico<CluE::Point>(this->d, 0, this->k, this->p, this->space, &this->metric, &this->modifier);
}else{
if(d != data.ncol()) {
Rf_error("Dimensions of new data do not match the current BICO clustering.");
}
}

this->upToDate=false;

time_t starttime;

// time_t starttime;
// Randomness::initialize(seed);
time(&starttime);
// time(&starttime);
int n = data.nrow();

for(int row=0; row < n; row++){
@@ -162,12 +165,12 @@ class BICO {

this->micro = micro;
this->microWeight = microWeight;
}
}




void recluster(){
void recluster() {

if(this->micro.nrow() == 0){
return;
@@ -206,7 +209,7 @@ class BICO {
this->macro = macro;
this->macroWeight = macroWeight;
this->assignment = assignment;
}
}


Rcpp::NumericMatrix get_microclusters(){
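
The main behavioural change in `cluster()` in this file is the new check that later batches have the same number of columns as the first one. The pattern can be sketched without Rcpp as follows. `DimensionGuard` and its members are hypothetical names, and `std::invalid_argument` stands in for the `Rf_error` call used in the actual code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical guard (illustration only): the first non-empty batch fixes
// the dimensionality, and every later batch must match it.
class DimensionGuard {
public:
    void cluster(const std::vector<std::vector<double>>& rows) {
        if (rows.empty()) {
            return;
        }
        const std::size_t cols = rows.front().size();
        if (!initialized_) {
            // First batch: remember the dimensionality.
            d_ = cols;
            initialized_ = true;
        } else if (d_ != cols) {
            // Later batch with a different width: reject before any update.
            throw std::invalid_argument(
                "Dimensions of new data do not match the current clustering.");
        }
        seen_ += rows.size();
    }

    std::size_t dimension() const { return d_; }
    std::size_t seen() const { return seen_; }

private:
    bool initialized_ = false;
    std::size_t d_ = 0;
    std::size_t seen_ = 0;
};
```

Rejecting the batch before touching any state keeps the existing clustering intact when a caller passes data of the wrong width.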