Merge pull request #1 from weekend37/parameters
Callable functions to provide kernel parameters and arguments + docs
weekend37 authored Nov 10, 2021
2 parents e116b95 + 57053ef commit a755b0d
Showing 4 changed files with 106 additions and 7 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -28,17 +28,17 @@ Assuming you have [Scikit-Learn](https://scikit-learn.org/) already installed, y
```python
from sklearn import svm
from stringkernels.kernels import string_kernel
-model = svm.SVC(kernel=string_kernel)
+model = svm.SVC(kernel=string_kernel())
```

and the polynomial string kernel,

```python
from sklearn import svm
from stringkernels.kernels import polynomial_string_kernel
-model = svm.SVC(kernel=polynomial_string_kernel)
+model = svm.SVC(kernel=polynomial_string_kernel())
```

-See the notebook [example.ipynb](https://github.com/weekend37/string-kernels/blob/master/example.ipynb) for further demonstration of usage.
+For more information, read the [docs](https://github.com/weekend37/string-kernels/blob/master/doc/docs.md) or take a look at the notebook [example.ipynb](https://github.com/weekend37/string-kernels/blob/master/example.ipynb) for further demonstration of usage.

If you end up using this in your research we kindly ask you to cite us! :)
54 changes: 54 additions & 0 deletions doc/fig/docs.md
@@ -0,0 +1,54 @@
# Documentation

## kernels.string_kernel

**Wrapper for a singly vectorized, linear-time string kernel implementation for data matrices X and Y**
```python
Parameters
- normalize : bool, default=True
    whether the kernel output should be normalized such that max(K) <= 1
- n_jobs : int, default=None
    number of CPUs to distribute the computation over. If None, use all available CPUs.

Returns
- string_kernel_func : function
    function that takes two data matrices X and Y as arguments
    (np.ndarrays of shapes (NX, MX) and (NY, MY), where N_ is the number of samples and M_ is the sequence length)
    and returns the string kernel values for all pairs of samples in X and Y (int or float, depending on normalization)
```

**Example**

```python
from sklearn import svm
from stringkernels.kernels import string_kernel
model = svm.SVC(kernel=string_kernel(n_jobs=32))
```
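
The wrapper does not compute a kernel itself; it returns a function of `(X, Y)` with the options already bound. The sketch below illustrates that pattern with a stand-in kernel. `toy_match_kernel` and `toy_kernel` are hypothetical names, the kernel simply counts equal positions (it is not the library's string kernel), and plain lists replace `np.ndarray` so the example has no dependencies.

```python
from functools import partial

def toy_match_kernel(X, Y, normalize=True):
    """Stand-in kernel (NOT the library's): count positions where two sequences agree."""
    K = []
    for x in X:
        row = []
        for y in Y:
            matches = sum(1 for a, b in zip(x, y) if a == b)
            # Dividing by the sequence length keeps every entry <= 1.
            row.append(matches / len(x) if normalize else matches)
        K.append(row)
    return K

def toy_kernel(normalize=True):
    """Factory: bind the options now, return a function of (X, Y) only."""
    return partial(toy_match_kernel, normalize=normalize)

X = [[0, 1, 1, 0], [1, 1, 1, 1]]                  # NX = 2 samples
Y = [[0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]    # NY = 3 samples

kernel_func = toy_kernel(normalize=False)
K = kernel_func(X, Y)  # one row per sample in X, one column per sample in Y
print(K)               # [[3, 2, 2], [1, 4, 0]]
```

scikit-learn's `SVC` accepts a callable as `kernel` and calls it with two data matrices, which is why the value returned by the factory, not the factory itself, is what gets passed.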

## kernels.polynomial_string_kernel

**Wrapper for a linear-time polynomial string kernel distance implementation for two data matrices X and Y, for a monomial with exponent p, to run across n_jobs different CPUs.**
```python
Parameters
- p : float or int, default=1.2
    exponent of the monomial to use
- normalize : bool, default=False
    whether the kernel output should be normalized such that max(K) <= 1
- n_jobs : int, default=16
    number of CPUs to distribute the computation over. If None, use all available CPUs.

Returns
- polynomial_string_kernel_func : function
    function that takes two data matrices X and Y as arguments
    (np.ndarrays of shapes (NX, MX) and (NY, MY), where N_ is the number of samples and M_ is the sequence length)
    and returns the polynomial string kernel values for all pairs of samples in X and Y (float)

```

**Example**

```python
from sklearn import svm
from stringkernels.kernels import polynomial_string_kernel
model = svm.SVC(kernel=polynomial_string_kernel(p=1.1))
```
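
The documentation describes `p` only as the exponent of a monomial. One plausible reading, which is an assumption here and is not confirmed by this diff, is that each maximal run of consecutive matching positions of length ℓ contributes ℓ^p to the kernel value. A minimal sketch under that assumption (`matching_run_lengths` and `toy_polynomial_kernel_vectors` are hypothetical helpers, not library functions):

```python
def matching_run_lengths(x, y):
    """Lengths of maximal runs of consecutive positions where x and y agree."""
    runs, current = [], 0
    for a, b in zip(x, y):
        if a == b:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def toy_polynomial_kernel_vectors(x, y, p=1.2):
    """ASSUMED scoring rule: sum of (run length) ** p over all matching runs."""
    return sum(length ** p for length in matching_run_lengths(x, y))

x = "AACGTTA"
y = "AATGTTC"
runs = matching_run_lengths(x, y)
print(runs)                                       # [2, 3]
print(toy_polynomial_kernel_vectors(x, y, p=1))   # 5
print(toy_polynomial_kernel_vectors(x, y, p=2))   # 13
```

Under this reading, `p = 1` reduces to counting matching positions, while `p > 1` rewards longer contiguous matches more than the same number of scattered ones.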
4 changes: 2 additions & 2 deletions example.ipynb
@@ -374,7 +374,7 @@
"source": [
"from stringkernels.kernels import string_kernel\n",
"\n",
-"svm_sk_model = svm.SVC(kernel=string_kernel)\n",
+"svm_sk_model = svm.SVC(kernel=string_kernel())\n",
"svm_sk_model.fit(X_train, y_train)\n",
"y_pred = svm_sk_model.predict(X_val)\n",
"svm_sk_accuracy = accuracy_score(y_val, y_pred)\n",
@@ -406,7 +406,7 @@
"source": [
"from stringkernels.kernels import polynomial_string_kernel\n",
"\n",
-"svm_psk_model = svm.SVC(kernel=polynomial_string_kernel)\n",
+"svm_psk_model = svm.SVC(kernel=polynomial_string_kernel(1.2))\n",
"svm_psk_model.fit(X_train, y_train)\n",
"y_pred = svm_psk_model.predict(X_val)\n",
"svm_psk_accuracy = accuracy_score(y_val, y_pred)\n",
49 changes: 47 additions & 2 deletions src/stringkernels/kernels.py
@@ -44,7 +44,7 @@ def string_kernel_singlethread(X,Y,normalize=True):
    """
    return np.array([string_kernel_vectorized(x,Y,normalize=normalize) for x in X])

-def string_kernel(X,Y,normalize=True,n_jobs=None):
+def string_kernel_multithread(X,Y,normalize=True,n_jobs=None):
    """
    Singly vectorized linear time string kernel implementation for data matrices X and Y with multithreading
    """
@@ -56,6 +56,27 @@ def string_kernel(X,Y,normalize=True,n_jobs=None):

    return K

def string_kernel(normalize=True,n_jobs=None):
    """
    Wrapper for a singly vectorized linear time string kernel implementation for data matrices X and Y
    -----------
    Parameters
    - normalize : bool, default=True
        whether the kernel output should be normalized such that max(K) <= 1
    - n_jobs : int, default=None
        number of CPUs to distribute the computation over. If None, use all available CPUs.
    -----------
    Returns
    - string_kernel_func : function
        function that takes two data matrices X and Y as arguments
        (np.ndarrays of shapes (NX, MX) and (NY, MY), where N_ is the number of samples and M_ is the sequence length)
        and returns the string kernel values for all pairs of samples in X and Y (int or float, depending on normalization)
    """
    if n_jobs is not None and n_jobs==1:
        return partial(string_kernel_singlethread, normalize=normalize)
    else:
        return partial(string_kernel_multithread, normalize=normalize, n_jobs=n_jobs)

## ------------------------------- Polynomial String Kernel ------------------------------- ##

def polynomial_string_kernel_vectors(x,y,p,normalize=False):
@@ -94,7 +115,7 @@ def polynomial_string_kernel_singlethread(X,Y,p=1.2,normalize=False):

    return K

-def polynomial_string_kernel(X,Y,p=1.2,n_jobs=16,normalize=False):
+def polynomial_string_kernel_multithread(X,Y,p=1.2,normalize=False,n_jobs=16):
    """
    Multithreaded linear time polynomial string kernel distance implementation for two data matrices X and Y
    for a monomial with exponent p to run across n_jobs different CPUs.
@@ -106,3 +127,27 @@ def polynomial_string_kernel(X,Y,p=1.2,n_jobs=16,normalize=False):
    K = np.array(K_list).squeeze()

    return K

def polynomial_string_kernel(p=1.2,normalize=False, n_jobs=16):
    """
    Wrapper for a linear time polynomial string kernel distance implementation for two data matrices X and Y
    for a monomial with exponent p to run across n_jobs different CPUs.
    -----------
    Parameters
    - p : float or int, default=1.2
        exponent of the monomial to use
    - normalize : bool, default=False
        whether the kernel output should be normalized such that max(K) <= 1
    - n_jobs : int, default=16
        number of CPUs to distribute the computation over. If None, use all available CPUs.
    -----------
    Returns
    - polynomial_string_kernel_func : function
        function that takes two data matrices X and Y as arguments
        (np.ndarrays of shapes (NX, MX) and (NY, MY), where N_ is the number of samples and M_ is the sequence length)
        and returns the polynomial string kernel values for all pairs of samples in X and Y (float)
    """
    if n_jobs is not None and n_jobs==1:
        return partial(polynomial_string_kernel_singlethread, p=p, normalize=normalize)
    else:
        return partial(polynomial_string_kernel_multithread, p=p, normalize=normalize, n_jobs=n_jobs)
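
Both wrappers select the single-threaded implementation when `n_jobs == 1` and the multithreaded one otherwise. The bodies of the multithreaded functions are not shown in this diff, so the following is only a sketch of one way to spread the rows of X across workers, using the standard library's `ThreadPoolExecutor`. `row_kernel` and `kernel_matrix` are hypothetical names, and the per-row kernel is a stand-in that counts matching positions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def row_kernel(x, Y):
    """Stand-in per-row kernel: matching-position counts of x against every row of Y."""
    return [sum(1 for a, b in zip(x, y) if a == b) for y in Y]

def kernel_matrix(X, Y, n_jobs=None):
    """Compute one kernel row per sample of X, optionally in parallel."""
    if n_jobs == 1:
        return [row_kernel(x, Y) for x in X]
    # max_workers=None lets the executor choose a default worker count.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map preserves input order, so rows line up with the samples in X.
        return list(pool.map(partial(row_kernel, Y=Y), X))

X = ["ACGT", "AAAA", "TTTT"]
Y = ["ACGA", "TAGT"]
serial = kernel_matrix(X, Y, n_jobs=1)
parallel = kernel_matrix(X, Y, n_jobs=2)
print(serial)              # [[3, 2], [2, 1], [0, 2]]
print(serial == parallel)  # True
```

Threads are used here only to keep the sketch dependency-free; for CPU-bound pure-Python work a process pool would be the more usual choice, and the library may well use a different mechanism.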
