Commit 3790be3

init

LiuBoyang93 committed May 29, 2021
1 parent 816a62f commit 3790be3
Showing 15 changed files with 2,280 additions and 3 deletions.
167 changes: 167 additions & 0 deletions .gitignore
@@ -0,0 +1,167 @@
## Suppose your main tex file is EpicPaper.tex in the root folder,
## which generates EpicPaper.pdf in the root folder, then add the following line.
EpicPaper.pdf

## Many Mac users will need this as well
# MacOSX Directory meta info
.DS_Store

## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt

## Intermediate documents:
*.dvi
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf

## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.brf
*.run.xml

## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync

## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa

# achemso
acs-*.bib

# amsthm
*.thm

# beamer
*.nav
*.snm
*.vrb

# cprotect
*.cpt

# (e)ledmac/(e)ledpar
*.end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R

# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls

# gnuplottex
*-gnuplottex-*

# hyperref
*.brf

# knitr
*-concordance.tex
*.tikz
*-tikzDictionary

# listings
*.lol

# makeidx
*.idx
*.ilg
*.ind
*.ist

# minitoc
*.maf
*.mtc
*.mtc[0-9]
*.mtc[1-9][0-9]

# minted
_minted*
*.pyg

# morewrites
*.mw

# mylatexformat
*.fmt

# nomencl
*.nlo

# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd

# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/

# pdfcomment
*.upa
*.upb

# pythontex
*.pytxcode
pythontex-files-*/

# Texpad
.texpadtmp

# TikZ & PGF
*.dpth
*.md5
*.auxlock

# todonotes
*.tdo

# xindy
*.xdy

# xypic precompiled matrices
*.xyc

# WinEdt
*.bak
*.sav

# endfloat
*.ttt
*.fff

# Latexian
TSWLatexianTemp*
66 changes: 63 additions & 3 deletions README.md
@@ -1,3 +1,63 @@
# RCA
Implementation for paper: RCA: A Deep Collaborative Autoencoder Approach for Anomaly Detection
Code will be released soon.
# RCA: A Deep Collaborative Autoencoder Approach for Anomaly Detection
This is the official implementation of Robust Collaborative Autoencoders (RCA).

## Paper Abstract
Unsupervised anomaly detection plays a crucial role in many critical applications.
Driven by the success of deep learning, recent years have witnessed a growing interest in applying deep neural networks (DNNs) to anomaly detection problems. A common approach is to use autoencoders to learn a feature representation for the normal observations in the data. The reconstruction error of the autoencoder is then used as an outlier score to detect the anomalies. However, due to the high complexity brought about by the over-parameterization of DNNs, the reconstruction error of the anomalies could also be small, which hampers the effectiveness of these methods. To alleviate this problem, we propose a robust framework using collaborative autoencoders to jointly identify normal observations in the data while learning their feature representation. We investigate the theoretical properties of the framework and empirically show its outstanding performance compared to other DNN-based methods. Our experimental results also show the resiliency of the framework to missing values compared to other baseline methods.
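
The abstract's scoring idea, reconstruction error as the outlier score, can be sketched as follows. As noted in the comparison below, RCA additionally keeps dropout active at evaluation and averages several stochastic passes; the function name `anomaly_score` and the number of passes are hypothetical, not the repository's API:

```python
import torch

def anomaly_score(ae, x, n_passes=10):
    # Keep dropout active at evaluation time and average the per-sample
    # reconstruction errors over several stochastic forward passes.
    ae.train()  # train mode only to enable dropout; no parameters are updated
    with torch.no_grad():
        scores = [((ae(x) - x) ** 2).mean(dim=1) for _ in range(n_passes)]
    return torch.stack(scores).mean(dim=0)  # higher score = more anomalous
```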

## RCA vs Autoencoder
1. RCA uses multiple autoencoders (we find that two are usually sufficient).
2. In each minibatch, RCA updates the model using only the samples with small reconstruction loss, while a standard AE uses every sample in the minibatch.
3. Each autoencoder in RCA passes its selected samples to the other autoencoder; see the sketch after this list.
4. RCA keeps dropout enabled during evaluation to obtain multiple anomaly scores, while a standard autoencoder uses dropout only during training.
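
The selection-and-exchange step (items 2 and 3) can be summarized in a minimal PyTorch sketch; `rca_step` and `keep_ratio` are hypothetical names, and trainRCA.py remains the authoritative implementation:

```python
import torch

def rca_step(ae1, ae2, opt1, opt2, x, keep_ratio=0.9):
    # Rank the minibatch by per-sample reconstruction error (no gradients needed).
    with torch.no_grad():
        err1 = ((ae1(x) - x) ** 2).mean(dim=1)
        err2 = ((ae2(x) - x) ** 2).mean(dim=1)
    k = int(keep_ratio * x.shape[0])
    idx1 = torch.argsort(err1)[:k]  # samples ae1 considers clean
    idx2 = torch.argsort(err2)[:k]  # samples ae2 considers clean
    # Exchange the selections: ae1 trains on ae2's picks and vice versa.
    loss1 = ((ae1(x[idx2]) - x[idx2]) ** 2).mean()
    opt1.zero_grad(); loss1.backward(); opt1.step()
    loss2 = ((ae2(x[idx1]) - x[idx1]) ** 2).mean()
    opt2.zero_grad(); loss2.backward(); opt2.step()
```

The default `keep_ratio` here is an assumption; in practice it would be tied to the (estimated) anomaly ratio of the dataset.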

## Conda Environment
We provide the conda virtual environment in environment.yml.
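
Assuming the standard conda workflow, the environment can be created with:

> conda env create -f environment.yml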

## Data
We use the ODDS datasets. The preprocessed data is in the data folder; you need to first unzip data.zip.
More details can be found on the [official page of the ODDS dataset](http://odds.cs.stonybrook.edu/).

## Examples
Run RCA on vowels:

> python3 trainRCA.py --data vowels --missing_ratio 0

Run RCA on pima:

> python3 trainRCA.py --data pima --missing_ratio 0

Run RCA on vowels with 10% missing values and mean imputation:

> python3 trainRCA.py --data vowels --missing_ratio 0.1

Run RCA with k autoencoders:

> python3 trainRCAMulti.py --data vowels --missing_ratio 0.0 --n_member k
## Hyperparameters
In unsupervised anomaly detection there is no clean validation data available for tuning hyperparameters, so we use the same hyperparameters across all datasets to show that our method does not depend heavily on hyperparameter tuning.
> batch size = 128
>
> learning rate = 3e-4 with the Adam optimizer
>
> hidden dimension = 256
>
> bottleneck dimension = 10

The network structure is in models/RCA.py. Currently, we use a 6-layer autoencoder.
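
A minimal sketch of such a 6-layer autoencoder, assuming fully connected layers with ReLU activations (the dropout rate is an assumption; models/RCA.py is the authoritative definition):

```python
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in, hidden=256, bottleneck=10, p_drop=0.2):
        super().__init__()
        # Three linear layers down to the bottleneck...
        self.encoder = nn.Sequential(
            nn.Linear(d_in, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, bottleneck),
        )
        # ...and three back up to the input dimension.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, d_in),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```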

## Baselines
We implement several baselines. Our implementations of one-class SVM, SO-GAAL, and Isolation Forest are based on the [pyod](https://github.com/yzhao062/pyod) implementations.
pyod also provides an official benchmark on the ODDS datasets, which can be found [here](https://pyod.readthedocs.io/en/latest/benchmark.html).

We implemented [DAGMM](https://openreview.net/forum?id=BJJLHbb0-) and [deep one-class SVM](http://proceedings.mlr.press/v80/ruff18a.html) ourselves. Our DAGMM implementation depends heavily on this [third-party implementation](https://github.com/danieltan07/dagmm), and we found DAGMM to be highly numerically unstable on the ODDS datasets.
For DeepSVDD, we train the autoencoder for 50 epochs as initialization.

## Acknowledgements
This research is funded by NSF-IIS 2006633, EF1638679, NSF-IIS-1749940, Office of Naval Research N00014-20-1-2382, and National Institute on Aging RF1AG072449.




Binary file added data/data.zip
59 changes: 59 additions & 0 deletions data_process.py
@@ -0,0 +1,59 @@
from six.moves import cPickle as pickle
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
import torch
from torch.utils.data import Dataset


def load_dict(filename_):
    with open(filename_, "rb") as f:
        ret_di = pickle.load(f)
    return ret_di


class RealDataset(Dataset):
    def __init__(self, path, missing_ratio):
        scaler = MinMaxScaler()

        # The preprocessed ODDS data is stored as a pickled dict {"x": ..., "y": ...}.
        data = np.load(path, allow_pickle=True)
        data = data.item()
        self.missing_ratio = missing_ratio
        self.x = data["x"]
        self.y = data["y"]

        n, d = self.x.shape
        # Randomly mask out a `missing_ratio` fraction of the entries...
        mask = np.random.rand(n, d)
        mask = (mask > self.missing_ratio).astype(float)
        if missing_ratio > 0.0:
            self.x[mask == 0] = np.nan
            # ...and fill them back in with per-feature mean imputation.
            imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
            self.x = imputer.fit_transform(self.x)
        # Scale every feature to [0, 1].
        scaler.fit(self.x)
        self.x = scaler.transform(self.x)

    def __len__(self):
        return self.x.shape[0]

    def __dim__(self):
        if len(self.x.shape) > 2:
            raise Exception("only handles single channel data")
        return self.x.shape[1]

    def __getitem__(self, idx):
        return (
            torch.from_numpy(np.array(self.x[idx, :])),
            torch.from_numpy(np.array(self.y[idx])),
        )

    def __sample__(self, num):
        # Draw `num` distinct samples at random (n avoids shadowing the built-in len).
        n = self.__len__()
        index = np.random.choice(n, num, replace=False)
        return self.__getitem__(index)

    def __anomalyratio__(self):
        return self.y.sum() / self.y.shape[0]
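
A hypothetical usage example; the file path and batch size below are assumptions based on the README, not the repository's exact invocation:

```python
from torch.utils.data import DataLoader

# Load the preprocessed vowels data with 10% simulated missing values.
dataset = RealDataset("data/vowels.npy", missing_ratio=0.1)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

print(dataset.__dim__(), dataset.__anomalyratio__())
for x, y in loader:
    pass  # feed x into the collaborative autoencoders
```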
