Commit 3790be3

init

LiuBoyang93 committed May 29, 2021
1 parent 816a62f commit 3790be3
Showing 15 changed files with 2,280 additions and 3 deletions.
167 changes: 167 additions & 0 deletions .gitignore
@@ -0,0 +1,167 @@
## Suppose your main tex file is EpicPaper.tex in the root folder,
## which generates EpicPaper.pdf in the root folder, then add the following line.
EpicPaper.pdf

## Many Mac users will need this as well
# MacOSX Directory meta info
.DS_Store

## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt

## Intermediate documents:
*.dvi
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf

## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.brf
*.run.xml

## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync

## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa

# achemso
acs-*.bib

# amsthm
*.thm

# beamer
*.nav
*.snm
*.vrb

# cprotect
*.cpt

# (e)ledmac/(e)ledpar
*.end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R

# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls

# gnuplottex
*-gnuplottex-*

# hyperref
*.brf

# knitr
*-concordance.tex
*.tikz
*-tikzDictionary

# listings
*.lol

# makeidx
*.idx
*.ilg
*.ind
*.ist

# minitoc
*.maf
*.mtc
*.mtc[0-9]
*.mtc[1-9][0-9]

# minted
_minted*
*.pyg

# morewrites
*.mw

# mylatexformat
*.fmt

# nomencl
*.nlo

# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd

# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/

# pdfcomment
*.upa
*.upb

# pythontex
*.pytxcode
pythontex-files-*/

# Texpad
.texpadtmp

# TikZ & PGF
*.dpth
*.md5
*.auxlock

# todonotes
*.tdo

# xindy
*.xdy

# xypic precompiled matrices
*.xyc

# WinEdt
*.bak
*.sav

# endfloat
*.ttt
*.fff

# Latexian
TSWLatexianTemp*
66 changes: 63 additions & 3 deletions README.md
@@ -1,3 +1,63 @@
# RCA
Implementation for paper: RCA: A Deep Collaborative Autoencoder Approach for Anomaly Detection
Code will be released soon.
# RCA: A Deep Collaborative Autoencoder Approach for Anomaly Detection
This is the official implementation of Robust Collaborative Autoencoders (RCA).

## Paper Abstract
Unsupervised anomaly detection plays a crucial role in many critical applications.
Driven by the success of deep learning, recent years have witnessed a growing interest in applying deep neural networks (DNNs) to anomaly detection problems. A common approach is to use autoencoders to learn a feature representation for the normal observations in the data. The reconstruction error of the autoencoder is then used as an outlier score to detect the anomalies. However, due to the high complexity brought about by the over-parameterization of DNNs, the reconstruction error of the anomalies could also be small, which hampers the effectiveness of these methods. To alleviate this problem, we propose a robust framework using collaborative autoencoders to jointly identify normal observations in the data while learning their feature representation. We investigate the theoretical properties of the framework and empirically show its outstanding performance compared to other DNN-based methods. Our experimental results also show the resiliency of the framework to missing values compared to other baseline methods.
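
The abstract's scoring idea, reconstruction error as the outlier score, can be sketched as follows. As noted in the comparison below, RCA additionally keeps dropout active at evaluation and averages several stochastic passes; the function name `anomaly_score` and the number of passes are hypothetical, not the repository's API:

```python
import torch

def anomaly_score(ae, x, n_passes=10):
    # Keep dropout active at evaluation time and average the per-sample
    # reconstruction errors over several stochastic forward passes.
    ae.train()  # train mode only to enable dropout; no parameters are updated
    with torch.no_grad():
        scores = [((ae(x) - x) ** 2).mean(dim=1) for _ in range(n_passes)]
    return torch.stack(scores).mean(dim=0)  # higher score = more anomalous
```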

## RCA vs Autoencoder
1. RCA uses multiple autoencoders (we find that two are usually sufficient).
2. In each minibatch, RCA updates the model using only the samples with small reconstruction loss, while a standard AE uses every sample in the minibatch.
3. Each autoencoder in RCA passes its selected samples to the other autoencoder; see the sketch after this list.
4. RCA keeps dropout enabled during evaluation to obtain multiple anomaly scores, while a standard autoencoder uses dropout only during training.
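
The selection-and-exchange step (items 2 and 3) can be summarized in a minimal PyTorch sketch; `rca_step` and `keep_ratio` are hypothetical names, and trainRCA.py remains the authoritative implementation:

```python
import torch

def rca_step(ae1, ae2, opt1, opt2, x, keep_ratio=0.9):
    # Rank the minibatch by per-sample reconstruction error (no gradients needed).
    with torch.no_grad():
        err1 = ((ae1(x) - x) ** 2).mean(dim=1)
        err2 = ((ae2(x) - x) ** 2).mean(dim=1)
    k = int(keep_ratio * x.shape[0])
    idx1 = torch.argsort(err1)[:k]  # samples ae1 considers clean
    idx2 = torch.argsort(err2)[:k]  # samples ae2 considers clean
    # Exchange the selections: ae1 trains on ae2's picks and vice versa.
    loss1 = ((ae1(x[idx2]) - x[idx2]) ** 2).mean()
    opt1.zero_grad(); loss1.backward(); opt1.step()
    loss2 = ((ae2(x[idx1]) - x[idx1]) ** 2).mean()
    opt2.zero_grad(); loss2.backward(); opt2.step()
```

The default `keep_ratio` here is an assumption; in practice it would be tied to the (estimated) anomaly ratio of the dataset.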

## Conda Environment
We provide the conda virtual environment in environment.yml.
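
Assuming the standard conda workflow, the environment can be created with:

> conda env create -f environment.yml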

## Data
We use the ODDS datasets. The preprocessed data is in the data folder; you need to first unzip data.zip.
More details can be found on the [official page of the ODDS dataset](http://odds.cs.stonybrook.edu/).

## Examples
Run RCA on vowels:

> python3 trainRCA.py --data vowels --missing_ratio 0

Run RCA on pima:

> python3 trainRCA.py --data pima --missing_ratio 0

Run RCA on vowels with 10% missing values and mean imputation:

> python3 trainRCA.py --data vowels --missing_ratio 0.1

Run RCA with k autoencoders:

> python3 trainRCAMulti.py --data vowels --missing_ratio 0.0 --n_member k
## Hyperparameters
In unsupervised anomaly detection there is no clean validation data available for tuning hyperparameters, so we use the same hyperparameters across all datasets to show that our method does not depend heavily on hyperparameter tuning.
> batch size = 128
>
> learning rate = 3e-4 with the Adam optimizer
>
> hidden dimension = 256
>
> bottleneck dimension = 10

The network structure is in models/RCA.py. Currently, we use a 6-layer autoencoder.
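
A minimal sketch of such a 6-layer autoencoder, assuming fully connected layers with ReLU activations (the dropout rate is an assumption; models/RCA.py is the authoritative definition):

```python
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in, hidden=256, bottleneck=10, p_drop=0.2):
        super().__init__()
        # Three linear layers down to the bottleneck...
        self.encoder = nn.Sequential(
            nn.Linear(d_in, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, bottleneck),
        )
        # ...and three back up to the input dimension.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, d_in),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```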

## Baselines
We implement several baselines. Our implementations of one-class SVM, SO-GAAL, and Isolation Forest are based on the [pyod](https://github.com/yzhao062/pyod) implementations.
pyod also provides an official benchmark on the ODDS datasets, which can be found [here](https://pyod.readthedocs.io/en/latest/benchmark.html).

We implemented [DAGMM](https://openreview.net/forum?id=BJJLHbb0-) and [deep one-class SVM](http://proceedings.mlr.press/v80/ruff18a.html) ourselves. Our DAGMM implementation depends heavily on this [third-party implementation](https://github.com/danieltan07/dagmm), and we found DAGMM to be highly numerically unstable on the ODDS datasets.
For DeepSVDD, we train the autoencoder for 50 epochs as initialization.

## Acknowledgements
This research is funded by NSF-IIS 2006633, EF1638679, NSF-IIS-1749940, Office of Naval Research N00014-20-1-2382, and National Institute on Aging RF1AG072449.




Binary file added data/data.zip
59 changes: 59 additions & 0 deletions data_process.py
@@ -0,0 +1,59 @@
from six.moves import cPickle as pickle
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
import torch
from torch.utils.data import Dataset


def load_dict(filename_):
    with open(filename_, "rb") as f:
        ret_di = pickle.load(f)
    return ret_di


class RealDataset(Dataset):
    def __init__(self, path, missing_ratio):
        scaler = MinMaxScaler()

        # The preprocessed ODDS data is stored as a pickled dict {"x": ..., "y": ...}.
        data = np.load(path, allow_pickle=True)
        data = data.item()
        self.missing_ratio = missing_ratio
        self.x = data["x"]
        self.y = data["y"]

        n, d = self.x.shape
        # Randomly mask out a `missing_ratio` fraction of the entries...
        mask = np.random.rand(n, d)
        mask = (mask > self.missing_ratio).astype(float)
        if missing_ratio > 0.0:
            self.x[mask == 0] = np.nan
            # ...and fill them back in with per-feature mean imputation.
            imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
            self.x = imputer.fit_transform(self.x)
        # Scale every feature to [0, 1].
        scaler.fit(self.x)
        self.x = scaler.transform(self.x)

    def __len__(self):
        return self.x.shape[0]

    def __dim__(self):
        if len(self.x.shape) > 2:
            raise Exception("only handles single channel data")
        return self.x.shape[1]

    def __getitem__(self, idx):
        return (
            torch.from_numpy(np.array(self.x[idx, :])),
            torch.from_numpy(np.array(self.y[idx])),
        )

    def __sample__(self, num):
        # Draw `num` distinct samples at random (n avoids shadowing the built-in len).
        n = self.__len__()
        index = np.random.choice(n, num, replace=False)
        return self.__getitem__(index)

    def __anomalyratio__(self):
        return self.y.sum() / self.y.shape[0]
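
A hypothetical usage example; the file path and batch size below are assumptions based on the README, not the repository's exact invocation:

```python
from torch.utils.data import DataLoader

# Load the preprocessed vowels data with 10% simulated missing values.
dataset = RealDataset("data/vowels.npy", missing_ratio=0.1)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

print(dataset.__dim__(), dataset.__anomalyratio__())
for x, y in loader:
    pass  # feed x into the collaborative autoencoders
```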
