
I. Ways of Assigning a Classification

There are many ways of obtaining classifications, both for humans and for machines. In this section we will enumerate some of them.

Note that in the generalizing statements and terminology that follow, we are performing a classification of methods of classification! Whether it is a “taxonomy” depends on whether you think that term needs to be exhaustive – covering all cases – or not. What follows is not necessarily exhaustive, but it gives broad coverage of the most common “types”.

  • Dividing
  • Aggregating / Clustering

1. Dividing

1.1 Thresholding

The most common way is to generate some numerical “score” and then apply a threshold value (or set of values), as when converting a percentage score (0 to 100) to a letter grade on the basis of a threshold for each grade.

For a single score, we can view the data as being spread out on the number line, and the threshold is a single value of the score, such that data values above the threshold are assigned to one class, and values below go to a different class.
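As a rough sketch of both cases (a single cutoff, and a set of cutoffs for letter grades), here is some Python. The cutoff values and the names CUTOFFS, GRADES, letter_grade and passes are made up for illustration; real grading schemes vary.

```python
from bisect import bisect_right

# Hypothetical grade boundaries: below 60 is F, 60-69 is D, and so on.
CUTOFFS = [60, 70, 80, 90]
GRADES = ["F", "D", "C", "B", "A"]


def letter_grade(score):
    """Map a numeric score to a letter using a set of thresholds."""
    # bisect_right counts how many cutoffs the score has reached or passed.
    return GRADES[bisect_right(CUTOFFS, score)]


def passes(score, threshold=70):
    """Single threshold: at or above goes to one class, below to the other."""
    return "pass" if score >= threshold else "fail"


print(letter_grade(85), passes(85))
print(letter_grade(59), passes(59))
```

Whether a score exactly on a cutoff counts as “above” or “below” is a choice the scheme has to make; this sketch puts it in the higher class.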

We need not spread the data along a one-dimensional line, though. We could spread out the data along a plane, and then have a line that forms the boundary between one class and another. This is still often done via a threshold value, such as in this example below (sketch), but not necessarily.

For data spread in 3 dimensions, the boundary takes the form of a plane, and in more dimensions we speak of the boundary (or “separatrix”) being a “hyperplane.”
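One common way to express a flat boundary, in any number of dimensions, is as a weighted sum of the coordinates plus an offset: points where the sum is positive fall on one side, and points where it is negative fall on the other. The sketch below assumes that linear form; the function name side_of_boundary, the class labels, and the example weights are invented for illustration.

```python
def side_of_boundary(point, weights, offset):
    """Say which side of the boundary (weights . point + offset = 0) a point is on."""
    value = sum(w * x for w, x in zip(weights, point)) + offset
    if value > 0:
        return "class A"
    if value < 0:
        return "class B"
    return "on the boundary"


# 2 dimensions: the boundary is the line x + y = 1.
line_weights, line_offset = (1.0, 1.0), -1.0
print(side_of_boundary((2.0, 2.0), line_weights, line_offset))
print(side_of_boundary((0.0, 0.0), line_weights, line_offset))

# 3 dimensions: the boundary is the plane z = 5.
plane_weights, plane_offset = (0.0, 0.0, 1.0), -5.0
print(side_of_boundary((1.0, 2.0, 7.0), plane_weights, plane_offset))
```

The same function works unchanged for longer points and weight lists, which is the sense in which a hyperplane generalizes the line and the plane.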

1.2 Ranking

This method of thresholding or boundary-drawing is not the only way of dividing. Sometimes you classify by ranking, taking only the highest-ranking score, or perhaps the 5 highest-ranking scores. We can view this as a sort of “quota,” whereby you want to pick, say, the 5 “winners” from a pool of applicants, regardless of what their scores might be. In contrast, a thresholding or boundary-drawing method would take any applicants who score at least, say, 70%, whether that means one winner, or no winners, or everyone’s a winner!
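To make the contrast concrete, here is a small sketch that applies both rules to the same pool. The applicant names, scores, and the function names top_k and at_least are hypothetical.

```python
def top_k(scores, k):
    """Ranking: take the k highest scorers, whatever their scores are."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def at_least(scores, cutoff):
    """Thresholding: take everyone at or above the cutoff, however many that is."""
    return [name for name, score in scores.items() if score >= cutoff]


# Hypothetical applicants and exam scores (percent).
applicants = {"Ana": 91, "Ben": 64, "Cy": 68, "Dee": 55, "Eli": 49, "Fay": 62}

print(top_k(applicants, 5))      # always 5 names (if the pool has at least 5)
print(at_least(applicants, 70))  # could be nobody, one person, or everybody
```

With these made-up numbers the ranking rule admits five people, four of whom scored under 70%, while the threshold rule admits only one.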

For ML classifiers, it is common to report the “Top 1 Accuracy” and the “Top 5 Accuracy”: the fraction of test examples for which the correct class is the classifier’s single highest-ranked guess, or is among its 5 highest-ranked guesses, respectively.
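A minimal sketch of how a top-k accuracy could be computed, under the usual definition that a prediction counts as correct when the true label is among the k highest-scoring classes. The function name top_k_accuracy and the toy scores are invented for illustration, and k is 1 and 2 here rather than 1 and 5 only because the toy problem has three classes.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of examples whose true label is among the k best-scoring classes."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Rank the class names for this example from highest to lowest score.
        best = sorted(scores, key=scores.get, reverse=True)[:k]
        if truth in best:
            hits += 1
    return hits / len(true_labels)


# Hypothetical classifier scores for four examples.
score_rows = [
    {"cat": 0.7, "dog": 0.2, "bird": 0.1},
    {"cat": 0.5, "dog": 0.4, "bird": 0.1},
    {"cat": 0.6, "dog": 0.3, "bird": 0.1},
    {"cat": 0.1, "dog": 0.8, "bird": 0.1},
]
true_labels = ["cat", "dog", "bird", "dog"]

print(top_k_accuracy(score_rows, true_labels, 1))
print(top_k_accuracy(score_rows, true_labels, 2))
```

Top-k accuracy can never be lower than top-1 accuracy on the same data, since allowing more guesses only adds hits.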

1.3 Closest-To

A still different way of producing a classification is using a notion of distance: Pick the category that a given data point seems “closest to,” assuming that we’ve already established the “centroids” of the various classes available. This distance-measuring is itself a form of a “similarity measure”: we pick the class that is most similar, either defined by shortest distance to something, or by smallest angle of deviation from something (we’ll talk about the “cosine similarity” later).
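Here is a sketch of both notions of “closest,” assuming the centroids are already known and that distance means ordinary Euclidean distance. The centroid values, class names, and function names are hypothetical.

```python
import math


def nearest_by_distance(point, centroids):
    """Pick the class whose centroid is the shortest Euclidean distance away."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_by_angle(point, centroids):
    """Pick the class whose centroid points in the most similar direction."""
    return max(centroids, key=lambda label: cosine_similarity(point, centroids[label]))


# Hypothetical class centroids in 2 dimensions.
centroids = {"small": (1.0, 1.0), "large": (8.0, 2.0)}
point = (3.0, 0.9)

print(nearest_by_distance(point, centroids))
print(nearest_by_angle(point, centroids))
```

The example point is chosen so that the two measures disagree: it sits nearer to one centroid but points in nearly the same direction as the other, which is a reminder that “most similar” depends on the measure chosen.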

The method of distance-similarity sounds “similar” to the notion of prototypes put forward by psychologist Eleanor Rosch in the 1970s. For Rosch, people establish categories based on some “typical example” – where by typical we mean “(most) indicative of the type” not necessarily “frequent.” The prototype need not correspond to an actual real instance; for example someone’s prototype of “bird” may be something that has a beak, and feathers, and wings, and [bird-like?] feet, but perhaps not a particular bird or even particular breed of bird.

Biologists concerned with taxonomy have a host of words to further split the hairs of meaning which “prototype” can imply. For example, the “holotype” of a species is the single physical specimen designated as the reference for the description and name of that species. See my other entry “Hall O’ Types” for more of this.

…might want to say something about Anomaly Detection, in which the classification is not based on the content of the data itself, but on how different it is from the other data, and/or how infrequent it is. This is the sort of content that is most likely to be flagged for human intervention, or outright excluded. This has consequences for any sort of “minority” content, which by definition is likely to be less similar to, and less frequent than, whatever the “majority” content is. Humans naturally do this as a means of protection (“He’s weird, I’m going to stay away”), and it is not surprising that machine systems do similar things to protect their networks and users. For example, _country guy’s team uses anomaly detection to flag potential ticket scalpers for human review, because the scalpers’ purchasing activity doesn’t match that of typical fans.
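One very simple way to flag “different from the rest” is a z-score: how many standard deviations a value lies from the mean of all the values. The sketch below uses that rule purely as an illustration; it is not a claim about how any real ticketing system works, and the data, the cutoff of 2.0, and the name flag_anomalies are all made up.

```python
from statistics import mean, pstdev


def flag_anomalies(values, z_cutoff=2.0):
    """Return the indices of values more than z_cutoff standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - centre) / spread > z_cutoff]


# Hypothetical number of tickets bought per account in one sale.
tickets_per_account = [2, 1, 3, 2, 2, 1, 40]

print(flag_anomalies(tickets_per_account))
```

Note that the rule never looks at what an account bought, only at how unlike the other accounts it is, which is exactly why unusual-but-legitimate cases get flagged too.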

II. Ways of Creating a Classification Scheme

(This should really go first, above the section on assignment =)

We can have an ordering, a hierarchy, a quota, we can have a vote, we can threshold, we can cluster, we can…do whatever it is humans do. …when this is edited, then the Prototype stuff will come naturally into play.

The type of scheme we want depends on what we want to use it for (and who is going to use it). Library classification schemes such as the Dewey Decimal System (or the Library of Congress?) were designed with the awareness that each book needs to go somewhere, that is, it must physically reside on the stacks in a particular location, and yet also to serve the physical library’s users (customers? visitors) by placing books of likely related interest near each other. [Fill in with an example? I remember something about boats and travel…?] These physical constraints are no longer felt in purely online environments, where the content of a book can be in multiple places, or have easy links connecting different “conceptual areas” (TODO: uh, I’m making up this terminology) of the online library. Library classifications, like those in Biology, are usually hierarchical, although perhaps for different reasons.

….hierarchies…

An alternative method is what I will call an ordering, an example being the geological periods such as Jurassic, Triassic, [fill in another]. Another example comes from the acoustics course I teach: Manfred Schroeder devised a set of four “frequency regions” (descriptively labeled A, B, C, and D) for how the sound in a room behaves. The boundaries between classes in an ordering may be “hard” boundaries (what mathematicians might call discontinuities) or soft boundaries showing gradual transitions (such as the boundaries of Schroeder’s B, C, and D regions). TODO: this example from my own “life” is personal, but perhaps I should fill in a “larger” example that more people will know and/or care about. ? (guessing probably yeah.)
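The difference between a hard and a soft boundary can be sketched numerically. In the illustration below, a hard boundary assigns each value wholly to one class, while a soft boundary gives a degree of membership that ramps from 0 to 1 across a transition band. The linear ramp, the function names, the “low”/“high” labels, and the numbers are all assumptions for illustration; they are not Schroeder’s actual region boundaries.

```python
def hard_region(x, boundary):
    """Hard boundary: the class flips abruptly at a single value."""
    return "low" if x < boundary else "high"


def soft_membership(x, start, end):
    """Soft boundary: degree (0.0 to 1.0) to which x belongs to the 'high' region.

    Membership ramps linearly across the transition band from start to end.
    """
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


# Hypothetical values on some ordered axis (frequency, time, score, ...).
for x in (90, 150, 210):
    print(x, hard_region(x, 150), soft_membership(x, 100, 200))
```

Under the soft rule, a value in the middle of the band is “half in” each region rather than being forced into one.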

It’s common to try to liken classes to mathematical sets, particularly if we’re dealing with groups of individual members such as people, and then we can look at intersections of sets – there is an entire field of sociology devoted to “Intersectionality”. But other classifications are based on (boundaries between continuous) regions rather than groups of individuals, such as the classification of land into countries, or [another example]. One can have continuous sets and discrete sets, surely, but… ??? lost my point, sorry. writing on:

A quota system might be like “we’re going to pick 5 people from this pool of ticket holders and call them winners” (even though they showed no higher merit than others), or it could be “we’re going to pick the top 5 applicants based on some exam” – more on that below. (This would be different, by the way, from the thresholding method of saying “we’re going to accept everyone who scores at least 70%”, also covered below.)
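The first kind of quota (winners picked with no regard to merit) amounts to a lottery, which can be sketched as below. The ticket IDs, the quota of 5, and the function name lottery_winners are hypothetical; the seed argument is only there so that a draw can be repeated.

```python
import random


def lottery_winners(pool, quota, seed=None):
    """Pick `quota` distinct winners uniformly at random, ignoring any notion of merit."""
    rng = random.Random(seed)
    return rng.sample(pool, quota)


# Hypothetical pool of 20 ticket holders: T01 ... T20.
ticket_holders = [f"T{n:02d}" for n in range(1, 21)]

winners = lottery_winners(ticket_holders, 5, seed=1)
print(winners)
```

What the lottery shares with the exam-based quota is that the number of winners is fixed in advance; only the way of filling the slots differs.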
