Modeling a Semantically Meaningful Tag Network

“Tag everything!” It’s the rallying cry of Web 2.0. Bookmarks (Del.icio.us), pictures (Flickr), and videos (YouTube) can all be tagged.

A tagging system, often called a folksonomy, categorizes resources into more than one category based on user-suggested keywords, or tags. Unlike a traditional taxonomy, it has no hierarchy and no fixed set of categories to choose from when classifying a document. This has benefits, such as letting users cross-categorize resources, but it also has problems. Because there is no hierarchy, a resource is rarely tagged with every category that applies to it. For instance, a resource could be tagged with dog but not with animal or Dalmatian. This turns out to be a bigger problem than one might expect when trying to find resources by tag.

Some attempts have been made to come up with algorithms to suggest tags based on the aggregate tagging patterns of a large corpus and correlations between tags on a given document. See [1] for a thorough discussion of this probabilistic view by Yahoo! Inc. One problem that should be considered is that most of these methods look only at overlapping tags. That is, two tags are considered related only if some resource is tagged with both tags.

While this may be fine for suggesting tags to users on Del.icio.us, it is not necessarily ideal for linking related tags. Two tags may be related through a third tag without ever appearing together on the same resource. To examine this phenomenon, it helps to think of the tags as an undirected graph.

To construct this graph, let the tags be the nodes. An edge exists between two tags when at least one document in the corpus is tagged with both. This basic graph gives us an idea of the relationships between tags. We can look for places where triadic closure is missing, that is, where two tags each share an edge with a third tag but not with each other. Such tags may be semantically related even though they are never used together on a single resource.
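The construction above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `build_tag_graph` and `open_triads` and the tiny sample corpus are made up for this example, and a document is taken to be nothing more than a list of tag strings.

```python
from collections import defaultdict
from itertools import combinations


def build_tag_graph(documents):
    """Return an adjacency map: tag -> set of tags it co-occurs with."""
    graph = defaultdict(set)
    for tags in documents:
        unique = set(tags)
        for tag in unique:
            graph[tag]  # touch the key so single-tag documents still add a node
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def open_triads(graph):
    """Map each unlinked tag pair to the tags that bridge it.

    A pair (a, b) is reported when a and b share at least one
    neighbor but have no edge of their own.
    """
    bridges = defaultdict(set)
    for middle, neighbors in graph.items():
        for a, b in combinations(sorted(neighbors), 2):
            if b not in graph[a]:
                bridges[(a, b)].add(middle)
    return {pair: sorted(mids) for pair, mids in bridges.items()}


# Hypothetical corpus: each inner list is the tag set of one resource.
corpus = [["dog", "animal"], ["dog", "dalmatian"], ["animal", "cat"]]
graph = build_tag_graph(corpus)
for pair, via in open_triads(graph).items():
    print(pair, "are linked only through", via)
```

In this toy corpus, animal and dalmatian never appear on the same resource, yet both co-occur with dog, which is exactly the kind of gap described above.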

To extend the graph to take into account how often two tags occur together, one can imagine weighting each edge with a measure similar to TF-IDF. It would have to take into account not only the number of times the two tags occur together, but also the frequency of each tag across the entire corpus.
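The post does not pin down a formula, so the following is just one possible TF-IDF-like weighting, chosen here for illustration: the co-occurrence count of a pair multiplied by a smoothed inverse document frequency for each of the two tags. The function name `edge_weights` and the smoothing term `log(1 + N / n)` are assumptions of this sketch, not something taken from [1].

```python
import math
from collections import Counter
from itertools import combinations


def edge_weights(documents):
    """Weight each co-occurring tag pair with a TF-IDF-like score.

    Assumed formula (illustrative only):
        weight(a, b) = cooccurrences(a, b) * idf(a) * idf(b)
        idf(t)       = log(1 + N / n_t)
    where N is the number of documents and n_t is the number of
    documents carrying tag t.
    """
    n_docs = len(documents)
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in documents:
        unique = sorted(set(tags))
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    def idf(tag):
        # The "1 +" keeps the value positive even for a tag on every document.
        return math.log(1 + n_docs / tag_counts[tag])

    return {
        (a, b): count * idf(a) * idf(b)
        for (a, b), count in pair_counts.items()
    }
```

Pairs are keyed in alphabetical order, so `("animal", "dog")` and never `("dog", "animal")`. Keeping every weight strictly positive is a deliberate choice, since it means the weights can later be inverted without dividing by zero.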

Taking the inverse of the weights on the edges would let the relationships between tags be measured as a sort of “semantic distance.” With this inverse measure on the edges, the most closely related tags would have the shortest distance between them. Not only would this measure be useful for finding directly related tags, it would also help build a framework for organizing tags in a broader sense than is done right now. This broad view could lead to a better understanding of aggregate and individual tagging habits, as well as of the relationships between the resources that are tagged.
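As a rough sketch of how such a distance could be computed, the snippet below inverts each edge weight and runs Dijkstra's shortest-path algorithm over the result, so that two tags connected only through intermediate tags still get a finite distance. The function name `semantic_distance`, the input format (a dictionary from tag pairs to positive weights), and the sample weights are all assumptions made for this example.

```python
import heapq


def semantic_distance(weights, source, target):
    """Shortest-path distance between two tags, using 1 / weight per edge.

    `weights` maps (tag_a, tag_b) pairs to positive relatedness scores.
    Returns float("inf") when no path connects the two tags.
    """
    graph = {}
    for (a, b), weight in weights.items():
        step = 1.0 / weight  # strongly related tags sit close together
        graph.setdefault(a, {})[b] = step
        graph.setdefault(b, {})[a] = step

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, tag = heapq.heappop(queue)
        if tag == target:
            return dist
        if dist > best.get(tag, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, step in graph.get(tag, {}).items():
            candidate = dist + step
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")


# Hypothetical weights, not derived from any real corpus.
sample = {("animal", "dog"): 2.0, ("dalmatian", "dog"): 4.0, ("animal", "cat"): 1.0}
print(semantic_distance(sample, "dalmatian", "animal"))  # 1/4 + 1/2 = 0.75
```

Here dalmatian and animal are never tagged together, but the path through dog still places them closer to each other than dog is to cat.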

[1]: Xu, Z., et al. Towards the Semantic Web: Collaborative Tagging Suggestions. http://blog.rawsugar.com/wikka/wikka.php?wakka=Paper13

