CogAlg
======

The proposed algorithm is a strictly bottom-up, (connectivity * density)-based clustering, from pixels to eternity. It's derived from my definition of general intelligence: the ability to predict from prior / adjacent (connected) input. That includes much-ballyhooed reasoning and planning: tracing and evaluating mediated connections in segmented graphs. Any prediction is an interactive projection of known patterns, hence the primary process must be pattern discovery (AKA unsupervised learning: an obfuscating negation-first term). This perspective is not novel: pattern recognition is a main focus of ML, and the core of any IQ test. The problem I have with statistical ML is the process: it ignores crucial positional and pairwise info, so the resulting patterns are indirectly centroid-based.

Pattern recognition is the default mode in Neural Nets, but they work indirectly, in a very coarse statistical fashion. A basic NN, such as a [multi-layer perceptron](https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53) or [KAN](https://towardsdatascience.com/kolmogorov-arnold-networks-kan-e317b1b4d075), performs lossy stochastic chain-rule curve fitting. Each node outputs a normalized sum of weighted inputs, then adjusts the weights in proportion to modulated similarity between input and output. In Deep Learning, this adjustment is mediated by backprop of decomposed error (inverse similarity) from the output layer. In Hebbian Learning, it's a more direct adjustment by local output/input coincidence: a binary version of their similarity. It's the same logic as in centroid-based clustering, but non-linear and fuzzy (fully connected in MLP), with a vector of centroids and multi-layer summation / credit allocation in backprop.
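
For concreteness, a minimal single-node sketch of that vertical fitting (illustrative only, not CogAlg code; the sigmoid normalization, delta-rule update, and toy data are my assumptions):

```python
# A minimal sketch of vertical fitting in a single node: normalized sum of
# weighted inputs, weights nudged by the error between output and target
# (delta rule). Illustrative only, not CogAlg code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_node(inputs, targets, lr=0.5, epochs=1000):
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=inputs.shape[1])   # randomized initial weights
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            out = sigmoid(w @ x)                      # normalized sum of weighted inputs
            error = t - out                           # inverse similarity of output vs. target
            w += lr * error * out * (1 - out) * x     # adjust weights toward the target
    return w

# toy data with a bias column, so the node can shift its threshold
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([0., 1., 1., 1.])                        # OR of the first two features
w = train_node(X, y)
print(np.round(sigmoid(X @ w), 2))                    # approaches y with enough epochs
```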

Modern ANNs combine such vertical training with lateral cross-correlation within an input vector. CNN filters are designed to converge on edge detection in initial layers. Edge detection means computing lateral gradient, by weighted pixel cross-comparison within kernels. Graph NNs embed lateral edges, representing similarity and/or difference between nodes, also produced by their cross-comparison. Popular [transformers](https://www.quantamagazine.org/researchers-glimpse-how-ai-gets-so-good-at-language-processing-20220414/) can be seen as a [variation of Graph NN](https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa). Their first step is self-attention: computing dot products between query and key vectors within the context window of an input. This is a form of cross-comparison because the dot product serves as a measure of similarity, just an unprincipled one.
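
A minimal sketch of that first step (standard scaled dot-product self-attention; the toy shapes and random projections are arbitrary, not code from this repo):

```python
# A minimal sketch of scaled dot-product self-attention; toy shapes and random
# projections are arbitrary, not code from this repo.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_head)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # pairwise dot products: similarity measure
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over the context window
    return weights @ V                                # similarity-weighted sum of values per token

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                           # 5 tokens, toy embedding size 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (5, 4)
```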

So the basic operation in both a trained CNN and self-attention is what I call cross-comparison, but the former selects for variance and the latter for similarity. I think the difference is due to the relative rarity of each in the respective target data: mostly low gradients in raw images and sparse similarities in compressed text. This rarity or surprise determines the information content of the input. But almost all text ultimately describes generalized images and objects therein, so there should be a gradual transition between the two. In my scheme, higher-level cross-comparison computes both variance and similarity, for differential clustering.
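
For illustration, a minimal sketch of such pairwise cross-comparison over a 1D sequence, deriving both a similarity and a difference per adjacent pair (the specific definitions here, match as min and variance as signed difference, are simplifying assumptions, not necessarily those used in this repo):

```python
# A minimal sketch of pairwise cross-comparison over a 1D sequence, deriving
# both similarity and variance per adjacent pair. The definitions here
# (match = min, variance = signed difference) are simplifying assumptions.
def cross_compare(pixels):
    derivatives = []
    for a, b in zip(pixels, pixels[1:]):   # laterally adjacent inputs
        match = min(a, b)                  # shared quantity: similarity
        diff = b - a                       # lateral gradient: variance
        derivatives.append((match, diff))
    return derivatives

print(cross_compare([10, 12, 12, 40, 41]))
# [(10, 2), (12, 0), (12, 28), (40, 1)]
```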

GNN, transformers, and Hinton's [Capsule Networks](https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b) all have positional embeddings, whereas I use explicit coordinates. But they are still trained through destructive backprop: randomized summation first, meaningful output-to-template comparison last. This primary summation degrades the resolution of the whole learning process, exponentially with the number of layers. Hence, a ridiculous number of backprop cycles is needed to fit hidden layers into generalized representations (patterns) of the input. Most practitioners agree that this process is not very smart; such noise-worshiping is the definition of stupidity. I think it's just a low-hanging fruit for terminally lazy evolution, and for slightly more disciplined human coding. It's also easy to parallelize, which is crucial for glacial cell-based biology.

Graceful conditional degradation requires a reversed sequence: first cross-comparison of original inputs, then summing them into match-defined clusters. That's lateral [connectivity-based clustering](https://en.wikipedia.org/wiki/Cluster_analysis#Connectivity-based_clustering_(hierarchical_clustering)), vs. vertical statistical fitting in NN. This cross-comp and clustering is recursively hierarchical, forming patterns of patterns and so on. Initial connectivity is in space-time, but feedback will reorder the input along all sufficiently predictive derived dimensions (eigenvectors). This is similar to [spectral clustering](https://en.wikipedia.org/wiki/Spectral_clustering), but the last step is still connectivity clustering, in a new frame of reference. Feedback will only adjust hyperparameters to filter future inputs: no top-down training, just bottom-up learning.
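
A minimal sketch of that reversed sequence on a 1D input, under the same simplifying assumptions as above (match as min, a fixed threshold): cross-compare adjacent inputs first, then sum consecutive inputs into match-defined clusters.

```python
# A minimal sketch of the reversed sequence on a 1D input: cross-compare
# adjacent inputs first, then sum consecutive inputs into match-defined
# clusters. Match = min and the fixed threshold are simplifying assumptions.
def cluster_by_match(pixels, threshold):
    clusters, current = [], [pixels[0]]
    for a, b in zip(pixels, pixels[1:]):
        if min(a, b) > threshold:          # strong match: extend the current cluster
            current.append(b)
        else:                              # weak match: close it, start a new one
            clusters.append(current)
            current = [b]
    clusters.append(current)
    return [(sum(c), len(c)) for c in clusters]   # summed value and span per cluster

print(cluster_by_match([10, 12, 12, 1, 40, 41], threshold=5))
# [(34, 3), (1, 1), (81, 2)]
```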

Connectivity likely represents local interactions, which may form a heterogeneous or differentiated system. Such differentiation may be functional, with internal variation contributing to whole-system stability. The most basic example is contours in images, which may represent an object's resilience to external impact and are generally more informative than fill-in areas. This contrasts with centroid-based methods, which merely group similar items. Thus connectivity clustering is intrinsically better suited to hierarchical clustering, where the stability or generality of composed structures is paramount.

Connectivity clustering is among the oldest methods in ML, but I believe my scheme is uniquely organic and scalable in complexity of discoverable patterns:
- links are valued by both similarity and variance, derived by comparison between their nodes, and potentiated by the overlap in the surround of these nodes (context).