Cleaning up the tag soup
May 18, 2006
Posted by Steve in: enterprise, tagging
In previous articles I’ve shown that “consuming” taggers prefer to use their own vocabulary, while “publishing” taggers have an incentive to match their audiences’ vocabularies. At the same time, the network benefits of social bookmarking depend on enough publishers agreeing upon a common vocabulary. Although these forces are somewhat in opposition, things work out pretty well for most topics. Equilibrium is quickly reached as everyone agrees on a few common tags.
But when it comes to new, complex, or very specific topics, tag-based solutions break down - everyone starts speaking their own language. For instance: what are the right tags to find content related to this blog? “Enterprise” is too broad; “tagging” encompasses too many applications; “delicious” is too application-specific (and mispunctuated, at that); and “social” + “bookmarking” isn’t a generally accepted term yet. Relevant searches end up requiring a complex tag intersection, which most tag search apps don’t handle very well.
This confusion results in huge variations in the way people tag things - or maybe vice-versa. What can we do to clean up the mess?
Getting past the tag mess (and making tagging palatable to a more controlled enterprise environment) starts with the recognition that for any particular piece of content, not every tag has the same value. If you want to encourage a certain subset of tags, expose tag values to publishers. Their natural desire to reach an audience means that they will gravitate towards the most valuable tags. But this doesn’t just mean showing a tagger the previous tags that people have used. Assuming “value” is based on whether this tag will help your audience find this content, then the tags that people are already searching for must be worth much more – even if fewer people are currently using those tags.
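To make the idea of tag value concrete, here is a minimal sketch in Python. It assumes we already have two sets of counts for a piece of content's topic area: how often each tag is searched for, and how often it has already been applied. The function names, the `search_weight` factor, and the sample counts are all hypothetical; the only point is that search demand is weighted more heavily than prior usage when suggesting tags to a publisher.

```python
from collections import Counter


def tag_values(search_counts, usage_counts, search_weight=3.0):
    """Score each tag: a search counts `search_weight` times as much as a prior use.

    The weighting is an illustrative assumption, not a measured value.
    """
    tags = set(search_counts) | set(usage_counts)
    return {
        tag: search_weight * search_counts.get(tag, 0) + usage_counts.get(tag, 0)
        for tag in tags
    }


def suggest_tags(search_counts, usage_counts, limit=3):
    """Return the highest-value tags to show a publisher."""
    values = tag_values(search_counts, usage_counts)
    # Sort by descending value, then alphabetically so ties are deterministic.
    ranked = sorted(values, key=lambda t: (-values[t], t))
    return ranked[:limit]


# Hypothetical counts for illustration only.
searches = Counter({"socialbookmarking": 40, "tagging": 25, "folksonomy": 5})
uses = Counter({"tagging": 30, "del.icio.us": 50, "folksonomy": 10})

print(suggest_tags(searches, uses))
```

With these made-up numbers, "socialbookmarking" ranks first even though nobody has used it as a tag yet, because it is what the audience is searching for.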
The idea of an open marketplace for keywords is already working very well — and very profitably — in a real-world application: Google’s AdWords program. Ad publishers choose the most appropriate search terms to advertise on, because they’re given very good feedback about the value of keywords with regards to their particular ad. (AdWords goes further, preventing tag spam by factoring in things like content relevance and click-through rates. These concepts may eventually help tag search too, but require a lot of control over how tags are being exposed to users.)
Note that what’s being searched for is not the only way to measure value – it’s just the most commonly available metric on the web today. In an enterprise scenario it really depends on the application in which tagging is integrated or exposed - in other words, the consumer’s day-to-day environment. For instance, bug numbers would have higher value when tagging a bug DB, while in CRM it’s probably account names that will help people find your content. The best tag valuation algorithm would factor usage data from many applications, supplement that with existing or desired corporate taxonomies, and finally integrate the resulting tags back into all those applications.
One problem is that for many kinds of newly published content, there may not yet be a lot of usage data. For instance, most conference organizers (semi-)blindly choose a canonical tag during the event, because search patterns don’t exist yet. Unfortunately, it’s really hard to guess at user behavior and it’s pretty much impossible to force it on a large scale. For every person that attended the conference, there may be 10 or 100 or even more who search for conference content after it’s over. Those people don’t know the suggested tag, so they choose their own way to search, and most likely they’re going to have to try several tag searches before they find the content – that is, if all event participants even managed to use the same tag. Depending on search behavior, there might not even be one right tag!
Some new services are extending basic tag search to address this problem. RawSugar is one great example: they’ve introduced tag clustering as a new way to browse tagged directories (specifically, blogs, but it seems like anything that can produce a well-formed feed would qualify.) I’m bullish on clustering, since we’ve successfully used it to produce automatic expertise profiles in our enterprise product. I have no doubt it will help tag search, but RawSugar faces an uphill battle. They need to get their cluster-based browsing in front of the users who are, for today, visiting applications written by competitors that are probably not amenable to integration. I think this solution may fare better in the enterprise, where app integration is the normal scenario.
There are two remaining solutions to the tag mess, assuming you’re sticking with standard methods of tag browsing: 1) re-tagging content later based on observed usage, and 2) tag equivalence (or at least tag migration.) #1 isn’t going to happen, so let’s focus on the more practical #2. What this means is that users can tag items with some original term, but if another term becomes more popular, the first term can somehow be declared equivalent to the new term, so searchers will find the intended content. Whether automatically applied or added manually, this equivalence can greatly increase the network benefit of a bookmarking application.
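As a rough illustration of tag equivalence, the sketch below keeps a table that maps an older term to the term that replaced it, and resolves both stored tags and search terms through that table. The `equivalents` table, the item names, and the conference tags are invented for the example; a real application would also have to decide who is allowed to declare an equivalence.

```python
def canonical(tag, equivalents):
    """Follow equivalence links until reaching a tag that maps to nothing newer."""
    seen = set()
    # The `seen` set only guards against looping forever on an accidental cycle.
    while tag in equivalents and tag not in seen:
        seen.add(tag)
        tag = equivalents[tag]
    return tag


def search(term, items, equivalents):
    """Return the names of items whose tags resolve to the same term as `term`."""
    target = canonical(term, equivalents)
    return sorted(
        name
        for name, tags in items.items()
        if any(canonical(t, equivalents) == target for t in tags)
    )


# Hypothetical data: two early tags were later declared equivalent to a popular one.
equivalents = {"web2con": "web2.0conf", "w2c": "web2con"}
items = {
    "keynote-notes": {"web2con"},
    "hallway-photos": {"w2c"},
    "session-slides": {"web2.0conf"},
    "unrelated-post": {"cooking"},
}

print(search("web2.0conf", items, equivalents))
print(search("w2c", items, equivalents))
```

Nothing is re-tagged here: the original tags stay on the items, and the equivalence is applied only at search time.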
Applied generally, this can even solve another problem: the disparity between individuals’ vocabularies. Already, “tag clouds” can be studied to find people with similar tagging interests. Similar analysis can be applied to the tags themselves. There’s nothing that says public tags exist in the same “vocabulary space” as the ones an individual uses to tag his items. The direct correspondence between local and global tag spaces can and should be removed and replaced with indirect {user,tag}-to-{public term} relationships that are much more powerful. [Related: Alex Barnett came to a similar conclusion in February.]
The simplest form of relationship is, of course, the direct one that results when two people use the same tag. But there’s no reason to stop there. Relationships can be found between tags that often appear on the same items. Relationships can be forced by way of translation & dictionary equivalence. And there are many more ways to improve searches once you’re past the idea that public collections only contain items tagged with that exact term. With enough indirection, everyone can tag using their own terms, but still be found in searches for globally popular terms.
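One of these relationships, tags that often appear on the same items, can be sketched in a few lines of Python. The data and the `min_count` threshold are assumptions made for the example; a real system would probably normalize by how common each tag is rather than use raw counts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(items):
    """Count how many items carry each unordered pair of tags."""
    pairs = Counter()
    for tags in items.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs


def related(tag, items, min_count=2):
    """Return (other_tag, count) pairs that co-occur with `tag` at least `min_count` times."""
    found = []
    for (a, b), count in cooccurrence(items).items():
        if count >= min_count and tag in (a, b):
            found.append((b if a == tag else a, count))
    # Strongest relationships first; alphabetical order breaks ties.
    return sorted(found, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical tagged items.
items = {
    "post1": ["tagging", "folksonomy", "enterprise"],
    "post2": ["tagging", "folksonomy"],
    "post3": ["tagging", "bookmarking"],
    "post4": ["bookmarking", "tagging", "folksonomy"],
}

print(related("tagging", items))
```

A search for "tagging" could then also surface items tagged only with "folksonomy", ranked lower than exact matches.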
(Of course, this all assumes you disagree with Clay Shirky’s ideas on the exact meanings of tags, as I do. Otherwise you might be worried about assigning tag equivalence. Still, that could be factored into how you generate relationships.)
This level of indirection should be a basic requirement of any social bookmarking application, but it appears in surprisingly few of them. Today, del.icio.us offers “your inbox”, which lets you aggregate several users’ specific tags into a single feed. That’s a one-off that doesn’t solve the problem for multiple public collections. Some bookmarking sites, such as simpy, provide new concepts like public “Groups” which go part-way towards achieving this, but it still isn’t the standard for public tagging – you have to be invited into groups before you can contribute. Overall, the previously-mentioned RawSugar team seems to be making the most strides in this area.
Solving this problem is one of the goals of our own enterprise bookmarking application, so I’ll dedicate an upcoming post to our solution. I think this is an area in which there is still incredible room to innovate!
So, to solve the tag mess:
- Gather usage data and calculate tag values
- Expose tag values to publishers
- Provide new ways to browse tags
- Indirectly associate user-specific tags to high-value public tags
Can you think of ways that these lessons could be applied to your favorite bookmarking service? Do you know of other services that are tackling this problem? I’d love to hear…
Related: tagging organization
Tags: bookmarking del.icio.us delicious enterprise social tagging

Comments»
Enterprise Tagging
Since posting the Del.icio.us Inside and Tagging behind-the-firewall posts, I’ve been pinged left right…
One of the approaches we are trying with Taglocity is to allow groupings to form naturally using ‘tag aliases’. What we are finding so far in the beta is that there are certainly a number of different ‘exposure’ levels that people want. Some tags are useful mementos for finding/searching, while others are useful to share and get across context.
We also find that ‘noun’ tags match up pretty well, or can be directly aliased, while adverbs are much harder to do, or rather more ambiguous. It is also the case that our users so far have wanted the ‘adverb’ tags to remain more private, as in they tag the content but don’t always want it to ‘travel’.
I realize this is a little cryptic and will try to write up what we are observing soon. It’s a really fascinating area.
- David
One approach to manual tag aliases would be to create “Synonym Sets” that let users add synonyms for a tag. To define a Synonym Set you’d first define an “Authoritative Tag” like “searchchampsv4”. You could then associate synonyms with that set like {“searchchamps4”, “searchchamps”, “scv4”}. If you then searched for a synonym, say “scv4”, your query would get expanded to cover the Authoritative Tag as well, “searchchampsv4” in this case. Many search engines already have built-in support for synonyms so this would be an easy add for them.
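A minimal sketch of that query expansion, in Python, might look like the following. The function names and the dictionary layout are assumptions for illustration, and the synonym set is the one from the example above; it does not describe how any particular search engine implements synonyms.

```python
def build_index(synonym_sets):
    """Map every synonym, and each authoritative tag itself, to its authoritative tag."""
    index = {}
    for authoritative, synonyms in synonym_sets.items():
        index[authoritative] = authoritative
        for synonym in synonyms:
            index[synonym] = authoritative
    return index


def expand_query(term, synonym_sets):
    """Expand a search term so it also covers its authoritative tag, if it has one."""
    authoritative = build_index(synonym_sets).get(term)
    if authoritative is None:
        # Unknown terms are searched as-is.
        return {term}
    return {term, authoritative}


# The synonym set from the example: authoritative tag -> its synonyms.
synonym_sets = {"searchchampsv4": {"searchchamps4", "searchchamps", "scv4"}}

print(sorted(expand_query("scv4", synonym_sets)))
```

Searching for "scv4" then matches items tagged either "scv4" or "searchchampsv4".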
I also wonder if this wouldn’t be a way of dealing with tag localization. For instance, would I want to find English-authored “searchchampsv4” content when I expressed the French version of that tag?
-Steve
Thanks for your comments guys!
David - curious to see some examples of how you handle the differences in noun vs. adverb tagging. I’ll keep an eye on your blog and follow your progress!
Steve - That sounds like a great way to leverage existing search functionality… In the case where terms are fairly authoritative and/or unique, it would probably work well. But it does come with some downsides. Part of the mapping I’m talking about is its handling of personal vocabularies - when I tag with “enterprise” it means something different than another user’s use of the same term. We couldn’t insert a global equivalence using “enterprise” because both of us would start seeing some garbage search results. This implies that there’s something other than just global equivalence going on. If it’s possible to express the whole set of {user,tag}-to-{public term} relationships via per-user search dictionaries, then maybe it’s more like what I’m talking about?
In any case it would be really interesting to see just how much can be expressed purely by a global dictionary. The 2nd rawsugar link I posted contains a list of potentially-equivalent terms, to start…
Oh I see… Yeah. Basically every user is going to have their own taxonomy in their head. The synonym idea is just to address the SearchChampsv4 issue Alex called out in the post you linked to. Mapping everyone else’s world view onto mine is a much tougher task. Per user synonym sets could be one approach to doing this mapping as you suggested but you’d still need some network intelligence to complete the solution.
If I tag a resource “foo” there needs to be at least one other person that tags it “foo” as well before the network can draw a conclusion that everyone else tagged the resource “bar” so for me (and the other person) “foo == bar”. This would be a much easier problem to solve (in an automated way) if you could only assign a resource a single tag but given the fact you can assign resources multiple tags this sounds like a really hard problem to me. You would actually need a bunch of people tagging it “foo” but not “bar” and a bunch of other people tagging it “bar” but not “foo” for “foo” & “bar” to rise up out of the noise as an association…
Right, what you describe is a very difficult thing to do - and luckily, not exactly how we’re doing our mapping.
Rather than trying to map every one of your tags to everyone else’s, we’re mapping from multiple local vocabularies to a single global vocabulary.
If the set of all items you’ve tagged with a particular tag can be called a “bundle” of items, then what we’re doing is relating bundles to other bundles by way of a public reference tag. I make the distinction “bundles” vs. “tags” because bundles are a superset of tags - you could, for instance, talk about a bundle of things that have been tagged with “foo” *and* “bar”, whereas a tag is just “foo”.
As mentioned in my “terminology” post, we use “folders” instead of tags at the personal level, and “channels” as the public version. Without context these might seem like strange terms but it has to do with the way they’re exposed to the user.
Anyway, our mapping is only {user,folder}->{channel}. You can probably see how if {user1,foo}->{channel} and {user2,bar}->{channel}, then user1 and user2 might be recommended each other’s tags. But that’s not the primary purpose, and there are also lots of other ways to generate recommendations.
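To illustrate the shape of a {user,folder}->{channel} mapping, here is a small Python sketch. It is not the product's implementation: the `ChannelMap` class, its method names, and the sample data are all hypothetical, and it assumes each personal folder points at no more than one channel.

```python
from collections import defaultdict


class ChannelMap:
    """Hypothetical store of {user, folder} -> {channel} links."""

    def __init__(self):
        self._channel_of = {}             # (user, folder) -> channel
        self._members = defaultdict(set)  # channel -> {(user, folder), ...}

    def link(self, user, folder, channel):
        """Point a personal folder at a public channel, replacing any earlier link."""
        key = (user, folder)
        old = self._channel_of.get(key)
        if old is not None:
            self._members[old].discard(key)
        self._channel_of[key] = channel
        self._members[channel].add(key)

    def channel_items(self, channel, folders):
        """Union of the items in every folder linked to `channel`.

        `folders` maps (user, folder) to that folder's set of items.
        """
        items = set()
        for key in self._members.get(channel, ()):
            items |= folders.get(key, set())
        return items

    def recommend_folders(self, user, folder):
        """Other users' folders that feed the same channel as this one."""
        channel = self._channel_of.get((user, folder))
        if channel is None:
            return []
        return sorted(k for k in self._members[channel] if k[0] != user)


# Two users file things under different names but feed the same channel.
mapping = ChannelMap()
mapping.link("user1", "foo", "enterprise-tagging")
mapping.link("user2", "bar", "enterprise-tagging")

folders = {("user1", "foo"): {"item-a"}, ("user2", "bar"): {"item-b", "item-c"}}

print(sorted(mapping.channel_items("enterprise-tagging", folders)))
print(mapping.recommend_folders("user1", "foo"))
```

Neither user has to adopt the other's term: the channel collects both folders, and the shared channel is what makes the recommendation possible.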
At the simplest reduction this is equivalent to del.icio.us, but it’s capable of a lot more, like your searchchampsv4 example. Of course, it all comes down to a) can it be presented in a way that makes sense to users, and b) can the more complex scenarios be enabled with reasonable performance? I’m going to be talking more about this in future posts; I hope you’ll keep reading and commenting! Thanks…
Blogging about tagging…Steve Eisner
I recently discovered A Social Life, a most excellent blog that is focused, so far, on enterprise tagging (aka internal tagging / bookmarking…), a subject that is *very* close to my heart right now. Steve Eisner, the blog author is VP of Engineer…