Paolo Donadeo — LifeLOG › Articles Labelled with “Information retrieval”

Tagging is not the right way

Yesterday I was listening to some music with the Last.fm client, when a song I particularly like started. As always, I decided to mark the song as “loved” and to tag it with something useful. If you use the Last.fm client, you know that it suggests the most common tags for the tune you want to tag. Ok, usually the list of tags includes a lot of stupid words, but this time I was surprised to see the word “gnocca” in the list.

For people not speaking Italian, “gnocca” is a coarse term referring to the vagina and, by extension, to a sexy girl: it can be translated with “pussy” in the first case and… I don’t know a suitable translation for the second meaning.

Actually, this is the worst case of tag I’ve ever found, but it’s not the first time I was disappointed with other user’s tag choice.

As a matter of fact many internet sites that use tags to classify user contents, are showing their limits. The whole paradigm of user-defined tags, well known with the term folksonomy, is based on three ideas:

it’s nearly impossible to classify contents inside a tree of categories;
associating words (tags) to contents is effective, because the user will remember the world and then, searching for the tag, he will easily be able to recover the piece of information she looks for;
as a good side effect, if many users tag the same object, the most appropriate tags will emerge and a big number of users will automatically screen unused or not relevant tags, so that other people will easily retrieve information.

While I agree with the first statement, the last two are questionable, at least. Sure it’s very difficult, if not impossible, to arrange a large and heterogeneous set of objects into a tree-shaped data structure, particularly if the set grows with time and you don’t know in what “direction” the tree’s growing. Everyone who owns a personal computer and has tried to sort out his or her “document” folder , now can understand what I mean: there isn’t a hierarchy that fits to all needs, because many documents can be correctly folded in different places at the same time.

The proposed solution is to tag documents with a chaotic cloud of words freely chosen by people, where the only valid criterion seems to be common sense or a more hazy association of ideas.

My experience is that tagging without a criterion is only another way to lose information. Using a hierarchical tree of directories (or categories) leads the user to lose documents, because people tend to forget the aspect of the document they have chosen to catalogue it. The same situation still gets worst with tags: I usually choose as many tags as I think appropriate, in the secret hope that the large number will help me in the task of searching information later on. The net effect is the proliferation of synonyms, singular and plural tags (eg: tool and tools) and completely useless words, because too much generic (eg: hardware, programming or software), or too much specialistic (eg: xgl or xen), so that about the 48% of my tags actually label only one document.

Those statistics are based on my personal experience using del.icio.us, one of the services to which I pay great attention when I choose my tags, because Internet bookmarks are very important for my job. You can download here the file containing my del.icio.us tags, ordered by frequency. More than 48% of tags is used only once, and only a 20% is used twice, so I guess that the most of my tags is completely useless. Not very good, actually.

The lesson here is that cataloguing a large quantity of information is not for free: a simple and easy way to have tons of documents well ordered, always accessible at any time under your fingertips, is an utopistic dream.

I think now it’s time to drop buzzwords like “web2.0″ (the parent of all buzzwords) and to pass to some more serious and structured ideas about information architecture. Since I don’t like to reinvent the wheel again and again and since I need something to index my documents, I decided to investigate how librarians organize the knowledge in a big library, following these two ideas:

the librarian is a very old job, and librarians can boast a thousand-year-old experience;
I have many documents on my hard disk, and these documents are very different about topics, media type and relevance, but nothing compared to the Library of Congress or other similar libraries in the world.

A quick search around led me to some readings and I discovered a whole universe of studies about information indexing. The most appealing theory I found is the faceted classification, in which multiple trees of “facets” are used to reach information. What are facets, actually?

A faceted classification system is composed by a number of categories (facets) what represent different aspects of the items we are going to classify. Each facet (aspect) is explained and developed in a tree of terms (now to be known as “foci”, or individually, “a focus”). To classify an item, therefore, you apply one or more terms from one or more facets to the item. In this way you have a multidimensional approach to the items you are indexing.

There are two main criteria developed by librarians to compose faceted classifications:

the list of facets should represent several aspects of the items to be classified, and should be “orthogonal”, as much as possible;
the tree of terms belonging to each facet should present at each node a unique criterion of division, i.e. the set of children of a node must be a partition of the whole parent node, so that the hierarchy has no overlapping terms.

Following these two principles the result will be a set of trees in which items are classified, and consequently several access points from which to start the search. Instead of a tree, the final data structure resulting from this kind of classification is a DAG (a directed acyclic graph), which provide a flexible way to organize knowledge, without being chaotic like a “tag cloud”.

An excellent example of faceted classification is the Nobel price winners page, a demo of the Flamenco Project of the Berkeley University: you can navigate through the various criteria in which Nobels are classified in an intuitive way, with a simple and effective interface.

Another example of use of facets is the Amazon jewelry page. You can reach the page going to Amazon.com and looking for “Jewelry and Watches” in the Product Categories menu.

Since I find the idea of facets very interesting, I decided to start a little experiment in this blog: Wordpress handles trees of categories, so I decided to adapt them and use the whole category system as a facets system. Is will not be perfect, because there are no facilities to navigate into the DAG, like in the Flamenco Project, and the reader cannot choose more than one focus at time, but it’s a starting point. If the result, in a few months, will be better than my experience with del.icio.us tags, I’ll probably start a software project to handle files on my computer using facets.

RSS Feed. Valid XHTML 1.1. This blog is written in Objective Caml. Design based on the work of Rodrigo Galindez.

Paolo Donadeo — LifeLOG
A blog about my life, job and thoughts

Articles Labelled with “Information retrieval”

Tagging is not the right way

More (useless) info…

Facets

Recent Posts

My friends on the net

A book is the best present…

Paolo Donadeo — LifeLOG A blog about my life, job and thoughts

Articles Labelled with “Information retrieval”

Tagging is not the right way

More (useless) info…

Facets

Recent Posts

My friends on the net

A book is the best present…

Paolo Donadeo — LifeLOG
A blog about my life, job and thoughts