"Wittgenstein and Hegel embedded in metric learning (VQGAN+CLIP)"


Submitted to PT-AI ‘21

Abstract:

Recent advances in AI have showcased the utility of “metric learning” methods, which seek to produce semantically meaningful vector representations of various kinds of inputs. The relation between the operations of metric learning systems and human decision-making processes may be explored by considering how ideas advanced by Wittgenstein and Hegel are manifested in metric learning schemes. Wittgenstein’s influence is seen in the machine learning paradigm of deriving meaning from use in dealing with vaguely specified tasks and concepts, whereas Hegel’s use of negation is seen in metric learning via the negative examples appearing in contrastive losses. The challenge of choosing helpful negatives in metric learning has motivated the recent adoption of approaches that might be termed “anti-Hegelian.”

Text:

At the heart of many modern algorithmic systems intended to serve as proxies for human decision-making are two elements: 1) similarity measurements and 2) procedures for optimizing these measurements for inputs regarded as similar by humans. This is achieved by casting the inputs as “vectors” (i.e., groups of numbers serving as the coordinates of a point in multidimensional space) and then optimizing a nonlinear mapping of these vectors into a “latent” space (of more or fewer dimensions than the original) in which the similarity measurement is performed. This mapping is known as “embedding,” though the word is also used to refer to groups of points in the latent space. Successful applications of this method include classification (of images, sounds, tabular data, and text), identity verification, image captioning, and generative art [2]. Embedded vectors are intended as semantically meaningful representations in their own right as well as front-ends for downstream tasks. Given their compositional nature [3], Deep Learning (DL) systems can be regarded as a succession of embeddings.
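
To make this vocabulary concrete, here is a minimal numpy sketch; the untrained weight matrix W stands in for a deep network, and the tanh nonlinearity and cosine similarity are illustrative assumptions, not a description of any particular system:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical weight matrix standing in for a trained deep network:
    W = rng.normal(size=(128, 16))         # maps 128-dim inputs to a 16-dim latent space

    def embed(x):
        """Toy nonlinear mapping ("embedding") into the latent space."""
        return np.tanh(x @ W)

    def similarity(a, b):
        """Cosine similarity: the measurement performed in the latent space."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    x1, x2 = rng.normal(size=(2, 128))     # two inputs cast as vectors
    print(similarity(embed(x1), embed(x2)))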

These systems may be deployed on large scales with vast societal implications and ethical consequences. Sensitive issues include content moderation (e.g., hate speech, fake news), recommendation (e.g., filter bubbles, radicalizing content), fairness and representation (e.g., suppression of minorities), as well as surveillance. Understanding both how these systems operate and their philosophical antecedents may help clarify the differences between machine-based results and the expectations of humans.

Embedding-based methods form a subset of Machine Learning (ML), whereby the rules for judging similarities (e.g., discriminating between category assignments) are “learned” either from a training dataset containing human-labeled conclusions (so-called “supervised” learning) or from examples generated via a set of human-approved priors about valid transformations or “augmentations” that leave the semantic content unchanged (so-called “self-supervised” or “unsupervised” learning). The ML paradigm echoes Wittgenstein’s [4] assertion that meaning is derived from use, where in this case the “use” is the training dataset. In this way, many complex and “messy” decision-making tasks such as image classification can be performed with high accuracy, whereas formal rule-based systems fail to achieve comparable performance due to the multiplicity of rules required or the vagueness of the task itself. The influence of Wittgenstein on ML systems design is widely publicized [5], and on occasion stated explicitly [6]. His notion of “family resemblance” can be viewed as a kind of mapping whereby examples of vague categories such as (to use his example) “game” or (I submit) “Artificial Intelligence” are mapped to the same term.

The supervised learning task of classifying inputs into supplied human-labeled categories amounts to a particular kind of embedding whereby each category “ideal” or “form” exists as a unit vector “pole” (known as a “one-hot” vector) along its own dimension-axis in the latent space. The training process amounts to these poles “attracting” embedded points of inputs matching the category while “repelling” points intended for other categories. The coordinates of the embedded points, after a normalization such as softmax, can then be interpreted as probabilities of belonging to each category. A significant downside to these one-hot categorical systems is that they have no means to handle categories not seen during training.
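
A minimal sketch of this one-hot setup (the logit values below are made up purely for illustration):

    import numpy as np

    def softmax(z):
        """Normalize latent coordinates into category probabilities."""
        e = np.exp(z - z.max())             # shift for numerical stability
        return e / e.sum()

    num_classes = 3
    target = np.eye(num_classes)[0]         # one-hot "pole" for category 0: [1, 0, 0]
    logits = np.array([2.0, 0.5, -1.0])     # a network's latent output for some input
    probs = softmax(logits)                 # approx. [0.79, 0.18, 0.04]
    loss = -np.sum(target * np.log(probs))  # cross-entropy "attracts" probs toward the pole
    print(probs, loss)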

An increasingly popular and powerful alternative to this design eschews using pre-ordained categories as “poles” and instead simply tries to embed perceptually similar inputs near each other. This more general method is often referred to as metric-based learning, or “metric learning” for short, or more specifically “deep embedding learning” [8]. It is the key to the plethora of “few-shot” and “zero-shot” systems recently entering the ML literature. They are “few-shot” in that they can perform classification via inference alone, by comparing the embedding-point of an unknown example to the points of potentially similar inputs. This can be done even for new categories, since no “categories” were used during training, yet such systems have achieved state-of-the-art results in image classification [9]. In these works the term “contrastive loss” often appears (e.g., [9,10]), an idea usually traced to Hadsell et al. [11], who offer this conceptual toy model: points in the latent space are imagined as being connected by springs, such that perceptually or semantically similar “positive example” pairs of points are always attracted, while dissimilar “negative example” pairs are repelled if they are within some predefined “margin” distance (akin to the margin of Support Vector Machines).
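
In code, the spring model reduces to a few lines. The sketch below follows the pairwise loss of [11]; the variable names and default margin are my own choices:

    import numpy as np

    def contrastive_pair_loss(za, zb, is_similar, margin=1.0):
        """Spring model of Hadsell et al. [11]: positive pairs are always
        attracted; negative pairs are repelled only while they sit inside
        the margin (beyond it, the "spring" exerts no force)."""
        d = np.linalg.norm(za - zb)
        if is_similar:
            return 0.5 * d**2                     # attraction at any distance
        return 0.5 * max(0.0, margin - d)**2      # repulsion only within the margin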

This use of “negative examples” parallels Hegel’s dialectical use of negation [12] as the mechanism for the iterative refinement of knowledge. A metric learning system using only negative examples would be of questionable utility: points are pushed apart until margins are satisfied and then nothing further happens; since no clustering of points for similar inputs has occurred, all inputs are regarded as (equally) dissimilar, and no “information” or representation of “knowledge” results. Metric learning’s process of optimization using both positive and negative examples, separately or together in “triplet losses,” echoes the tripartite dynamic of Hegel’s dialectic. Though the multiplicity of unhelpful or “absurd” negations receives only brief attention in Hegel (“the rose is not an elephant, the understanding is not a table” [13]), it is regarded as a major challenge in metric learning [8], since the system learns nothing from negative examples whose distances already exceed the margin. This is exacerbated by geometry: as the number of dimensions increases, so does the probability that most other points lie “far away” from any given point.
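
A sketch of the triplet form makes the “absurd negation” problem visible; the margin value here is illustrative:

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        """One positive and one negative example per anchor. A negative whose
        distance already exceeds d_pos + margin contributes zero loss and zero
        gradient: nothing is learned from "the rose is not an elephant"."""
        d_pos = np.linalg.norm(anchor - positive)
        d_neg = np.linalg.norm(anchor - negative)
        return max(0.0, d_pos - d_neg + margin)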

Given the challenges of choosing helpful negative examples, recent schemes [10,14] have adopted what we might term an “anti-Hegelian” approach of using only the attraction of “positive” (similar) examples. Of what utility is the clustering of similar points without a clear margin to distinguish dissimilar ones? It is that, via the data-augmentation transformations used to generate positive examples, the embeddings “learn” to filter out irrelevant features. Not only can the embeddings then stand in for the raw inputs (e.g., in downstream tasks), the representations formed are increasingly robust to the many perturbations humans regard as irrelevant, hopefully providing a closer analog to representation-forming processes in the human brain, despite the methodological “contradiction” with Hegel.
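
As a hedged sketch of the positive-only idea (the noise augmentation and function names are my own, and real systems [10,14] add further machinery to prevent all embeddings collapsing to a single point):

    import numpy as np

    rng = np.random.default_rng(0)

    def augment(x):
        """Stand-in augmentation: small noise assumed to be semantically
        irrelevant. Real systems use crops, color jitter, pitch shifts, etc."""
        return x + 0.01 * rng.normal(size=x.shape)

    def positive_only_loss(embed, x):
        """Attract two augmented "views" of the same input; no negatives at all.
        (The collapse-avoidance machinery of real systems, e.g. stop-gradients
        or momentum encoders, is omitted from this sketch.)"""
        z1, z2 = embed(augment(x)), embed(augment(x))
        z1, z2 = z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)
        return -float(z1 @ z2)                # more agreement => lower loss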

References

  1. Vylomova, E.; Rimell, L.; Cohn, T.; Baldwin, T. Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Berlin, Germany, 2016; pp. 1671–1682.
  2. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. arXiv:2102.12092, 2021.
  3. Bengio, Y.; Lecun, Y.; Hinton, G. Deep Learning for AI. Commun. ACM 2021, 64, 58–65, doi:10.1145/3448250.
  4. Wittgenstein, L. Philosophical Investigations; Blackwell: Oxford, UK, 1958.
  5. Goldhill, O. Google Translate Is a Manifestation of Wittgenstein’s Language Theory. Quartz 2019.
  6. Vadera, S.; Rodriguez, A.; Succar, E.; Wu, J. Using Wittgenstein’s Family Resemblance Principle to Learn Exemplars. Found. Sci. 2008, 13, 67–74, doi:10.1007/s10699-007-9119-2.
  7. Hawley, S.H. Typical (Neural-Network-Based) Classification vs. Zero-Shot, Part 1 - The Joy of 3D. Available online: https://drscotthawley.github.io/blog/posts/2021-05-04-the-joy-of-3d.html (accessed on 30 July 2021).
  8. Wu, C.-Y.; Manmatha, R.; Smola, A.J.; Krähenbühl, P. Sampling Matters in Deep Embedding Learning. arXiv:1706.07567, 2018.
  9. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021.
  10. Fonseca, E.; Ortego, D.; McGuinness, K.; O’Connor, N.E.; Serra, X. Unsupervised Contrastive Learning of Sound Event Representations. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2021.
  11. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06); 2006; Vol. 2, pp. 1735–1742.
  12. Hegel, G.W.F. Phenomenology of Spirit; Miller, A.V., Transl.; Oxford University Press: Oxford, UK, 2013; ISBN 978-0-19-824597-1.
  13. Hegel, G.W.F. The Science of Logic; Cambridge University Press: Cambridge, UK, 2015; ISBN 978-1-107-49963-8.
  14. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709, 2020.