feat: add levenshtein entry.
Panadestein committed Feb 23, 2025
1 parent 125a25f commit 3bdded6
Showing 2 changed files with 58 additions and 0 deletions.
4 changes: 4 additions & 0 deletions src/bqn/rollim.bqn
@@ -62,3 +62,7 @@ HL‿HS {𝕎•_timed𝕩}¨< 1e8 •rand.Range 1e3
(+´⊢-˜`⌽⌊⌈`) [0,1,0,2,1,0,1,3,2,1,2,1]

´⥊⌊⌜˜×·-⌜˜˜, ´∨×(`-⌈⊢-⌊`)⟩ {10 𝕎_timed𝕩}¨< •rand.Range˜1e4

_l ← {¯1⊑(1⊸+⥊+)○≠(⌊`⊢⌊⊏⊸»∘⊢-0∾1+⊣)˝𝔽}
T ← ⌽⊸(=⌜)_l≡=⌜⟜⌽_l
T○{@+97+𝕩•rand.Range 25}´ 1e4‿1e5
54 changes: 54 additions & 0 deletions src/rollim.org
@@ -298,6 +298,60 @@ will probably be slow). Here are two solutions, one \(O(n^2)\) and the other \(O
#+RESULTS:
: ⟨ 0.080050875 4.14558e¯5 ⟩

** Computing edit distances

The Levenshtein (or edit) [[https://en.wikipedia.org/wiki/Levenshtein_distance][distance]] measures how different two strings are: it is the minimum number of
single-character insertions, deletions, and substitutions needed to turn one string into the other.
It is defined by the following recurrence, which is the basis of dynamic programming algorithms
like Wagner–Fischer:

\begin{align*}
d_{i0} &= i, \quad d_{0j} = j, \\
d_{ij} &= \min \begin{cases} d_{i-1,j-1} + \mathbf{1}_{s_i \neq t_j} \\ d_{i-1,j} + 1 \\ d_{i,j-1} + 1 \end{cases}
\end{align*}
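
As a reference point, the recurrence can be transcribed almost literally. The following is a minimal
Python sketch of the full-table Wagner–Fischer algorithm, not the BQN implementation discussed in
this section; the function name =levenshtein= is my own choice for illustration.

#+begin_src python
def levenshtein(s, t):
    """Full-table Wagner-Fischer: d[i][j] is the distance between s[:i] and t[:j]."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary conditions: d_{i0} = i and d_{0j} = j.
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # substitution or match
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
            )
    return d[n][m]
#+end_src

The distance sits in the bottom-right corner of the table, which is the entry the function returns.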

There is an elegant implementation of a variation of the Wagner–Fischer algorithm in the BQNcrate.
It has been particularly challenging for me to understand, not because the primitives
are unclear, but because of the clever transformation it employs.
I believe that this variant can be derived by shifting the distance matrix.
Given two strings \(s\) and \(t\) of lengths \(n\) and \(m\), respectively,
we define a new distance matrix as follows:

\begin{equation*}
p_{ij} = d_{ij} + n - i + m - j
\end{equation*}

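Under this transformation, substituting \(d_{ij} = p_{ij} - (n - i) - (m - j)\) into the original
recurrence gives:

\begin{align*}
p_{i0} &= p_{0j} = m + n, \\
p_{ij} &= \min \begin{cases} p_{i-1,j-1} - (1 + \mathbf{1}_{s_i = t_j}) \\ p_{i-1,j} \\ p_{i,j-1} \end{cases}
\end{align*}

The shift vanishes at \(i = n\), \(j = m\), so \(p_{nm} = d_{nm}\) is still the edit distance.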
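
To convince myself that the shift preserves the answer, here is a minimal Python sketch of the
shifted table. It assumes the diagonal term obtained by substituting the definition of \(p_{ij}\)
into the original recurrence, namely \(p_{i-1,j-1} - (1 + \mathbf{1}_{s_i = t_j})\); the name
=shifted_levenshtein= is hypothetical.

#+begin_src python
def shifted_levenshtein(s, t):
    """Fill p[i][j] = d[i][j] + (n - i) + (m - j) directly from its own recurrence."""
    n, m = len(s), len(t)
    # Every boundary entry p_{i0} and p_{0j} equals n + m.
    p = [[n + m] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p[i][j] = min(
                p[i - 1][j - 1] - (1 + (s[i - 1] == t[j - 1])),  # diagonal step
                p[i - 1][j],  # vertical step costs nothing after the shift
                p[i][j - 1],  # horizontal step costs nothing after the shift
            )
    # The shift is zero at (n, m), so this entry is the edit distance.
    return p[n][m]
#+end_src

The vertical and horizontal steps no longer carry a cost, which seems to be what makes a plain
minimum along each row sufficient.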

The above recurrence can be easily identified in the central function of the 3-train, which is
folded over the table of costs (the table comparing the characters of the two strings). The key
observation is that each step first takes the minimum of the insertion and substitution candidates,
and a min scan over that result then accounts for the deletions, which yields a vectorized implementation.
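
The row update described above can be sketched in Python as follows. This is my own reading of the
folded function, written under two assumptions: the table of costs holds 1 where the characters are
equal, and the rows are visited in the plain forward order of \(s\) with \(t\) unreversed, which is
not the order the BQN code uses. The names =row_step= and =levenshtein_rows= are hypothetical.

#+begin_src python
from itertools import accumulate


def row_step(eq_row, prev):
    """One fold step: eq_row[j] is 1 when the characters match, prev is the previous row of p."""
    # Shift the previous row right, filling with its first element.
    shifted = [prev[0]] + prev[:-1]
    # Subtract 0 in the first column and 1 + equality everywhere else.
    costs = [0] + [1 + e for e in eq_row]
    diag = [a - c for a, c in zip(shifted, costs)]
    # Minimum with the previous row, then a running minimum along the row.
    return list(accumulate((min(a, b) for a, b in zip(prev, diag)), min))


def levenshtein_rows(s, t):
    n, m = len(s), len(t)
    row = [n + m] * (m + 1)  # initial row: m + 1 copies of n + m
    for ch in s:
        row = row_step([int(ch == c) for c in t], row)
    return row[-1]
#+end_src

The running minimum is valid only because a horizontal step is free in the shifted matrix; in the
original matrix each step to the right would add 1, so a plain min scan would not be enough.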

Now the only piece I cannot put together is the construction of the table of costs, which is done
by reversing \(t\). Since the final result in \(p_{ij}\) is located in the bottom-right corner
and we do a foldr, I would expect \(s\) to be the one reversed. Both work, though, as
the following code shows:

#+begin_src bqn :tangle ./bqn/rollim.bqn :exports both
_l ← {¯1⊑(1⊸+⥊+)○≠(⌊`⊢⌊⊏⊸»∘⊢-0∾1+⊣)˝𝔽}
T ← ⌽⊸(=⌜)_l≡=⌜⟜⌽_l
T○{@+97+𝕩•rand.Range 25}´ 1e4‿1e5
#+end_src

#+RESULTS:
: 1

My hypothesis is that this can be explained using the following properties of the Levenshtein distance:

- \(L(s,t) = L(t,s)\)
- \(L(s,t) = L(\text{rev}(s),\text{rev}(t))\)
- \(L(\text{rev}(s),t) = L(s,\text{rev}(t))\)
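
These identities can at least be checked empirically. The following is a small Python sketch that
tests them on random short strings with a two-row implementation of the standard recurrence; it is
only evidence, not a proof, and the names =lev= and =check_properties= are my own.

#+begin_src python
import random


def lev(s, t):
    """Two-row version of the standard edit distance recurrence."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def check_properties(trials=200, seed=0):
    """Return True if all three identities hold on every random pair tried."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice("abc") for _ in range(rng.randint(0, 8)))
        t = "".join(rng.choice("abc") for _ in range(rng.randint(0, 8)))
        rs, rt = s[::-1], t[::-1]
        if lev(s, t) != lev(t, s):
            return False
        if lev(s, t) != lev(rs, rt):
            return False
        if lev(rs, t) != lev(s, rt):
            return False
    return True
#+end_src

The third identity follows from the second by applying it to the pair \((\text{rev}(s), t)\).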

If you know how to do it, please let me know!

[fn:1] Almost Perfect Artifacts Improve only in Small Ways: APL is more French than English,
Alan J. Perlis (1978). From [[https://www.jsoftware.com/papers/perlis78.htm][jsoftware]]'s papers collection.