Making Headlines on Google News

http://blogoscoped.com/archive/2006-07-28-n49.html

In class we discussed PageRank and the hub-authority computation, and how webpage owners can exploit them to improve their placement in Google search results.  That works for most webpages, but what about news sites that want their stories to appear on Google News?  Google ranks the articles on the Google News homepage differently from its regular search results, so getting a story placed there requires a different approach.

Google News does not crawl the web looking for news articles the way regular Google search does.  It uses the opt-in method that search engines relied on before web crawlers came about.  This lets Google apply some human screening to make sure that your site is a credible news source.  Once your site is accepted into Google News, you supply a page that contains an up-to-date list of links to all of your current stories, and the Google News scanner checks it for updates every few minutes.  New articles appear on Google News within minutes.  Since all the news sources are pre-screened, they are all “Authorities,” so running a hub-authority computation on them is unlikely to help much.
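To make the "checks it for updates" step concrete, here is a minimal sketch of how a scanner might pick out new stories from a source's link page between polls. It is only an illustration: the function name `find_new_links` and the example paths are hypothetical, and nothing here reflects how Google News actually does it.

```python
def find_new_links(seen, current_links):
    """Return links not seen on earlier polls, in page order, and record them.

    seen: set of links already indexed (mutated in place).
    current_links: list of links found on the source's story page right now.
    """
    fresh = []
    for link in current_links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


# Hypothetical example: two polls of the same source page a few minutes apart.
seen = set()
first_poll = find_new_links(seen, ["/story-a", "/story-b"])
second_poll = find_new_links(seen, ["/story-c", "/story-a", "/story-b"])
```

Only `/story-c` comes back from the second poll, since the other two links were already recorded.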

One important difference between Google News and regular Google search is that Google News clusters articles by topic.  (How they do this is still a mystery, but it probably involves connected components and graph clustering algorithms, in addition to many other interesting network problems.)  On the homepage it only shows a few stories from each cluster.  Logically, it would seem that the way to make your article appear more often and not get lost in the cluster is to make your article unique enough that it won’t be clustered.  Google thought of that, however, and realized that most of their readers are not likely to be interested in stories that only one of their 4500+ news outlets reports on.  Rather, they concluded that the most interesting stories will be found in the largest clusters.  So clearly, writing obscure articles and making yourself an outlier is not a dominant strategy.  A dominant strategy would certainly be to write articles that fit into the largest cluster, but the question is how to get yourself up to the top of the cluster and not lost under the “All 2,917 news articles” link.
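Since the clustering method is a mystery, the following is only a toy sketch of the connected-components idea mentioned above. It assumes (hypothetically) that each article is reduced to a set of keywords, that two articles are linked when they share at least `min_shared` keywords, and that a cluster is a connected component of that graph. The article ids and keywords are made up.

```python
from collections import defaultdict


def cluster_articles(articles, min_shared=2):
    """Group articles into connected components of a keyword-overlap graph.

    articles: dict mapping article id -> set of keywords (hypothetical input).
    Returns a list of clusters (lists of ids), largest cluster first.
    """
    parent = {a: a for a in articles}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(articles)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            # Assumed edge rule: enough keywords in common.
            if len(articles[a] & articles[b]) >= min_shared:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for a in ids:
        clusters[find(a)].append(a)
    return sorted(clusters.values(), key=len, reverse=True)


articles = {
    "a1": {"election", "senate", "vote"},
    "a2": {"senate", "vote", "recount"},
    "a3": {"recount", "vote", "court"},
    "a4": {"storm", "coast", "flood"},
}
clusters = cluster_articles(articles)
```

Here `a1` and `a3` share only one keyword, but both overlap with `a2`, so all three land in one cluster, while `a4` is left as an outlier of size one. That is exactly the kind of lone story the homepage would tend to pass over.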

Of course, given the nature of news, it would not be in Google’s interest to leave any single article as a headline for too long, since people want the “latest news,” not something from a few days ago, or sometimes even a few hours ago.  Often the headline article on Google News changes every 15 minutes or less.  The only method that seems to guarantee you a good place in the cluster, at least temporarily, is to beat everyone else to the market.  Google News rewards articles that break the news first, so for a while, if you are the first to report a story and your cluster becomes sufficiently large to bring your story to the front page, you are basically guaranteed the headline, regardless of any other ranking criteria.  After the first 15 minutes, though, it becomes anyone’s game.  Certainly, if you publish multiple articles on a story that make it into the same cluster, your chances of making the top 3 in the cluster become higher, but Google must have a better way of deciding.

While Google hasn’t released their methods to the public (that I know of), I’d like to hazard a guess as to the method they might use.  One possibility is a variation on the hub-authority computation.  With each successive story that a news source publishes, Google learns something about the nature of the news source based on which clusters its articles end up in.  It might be anything from whether it is a local, national, or international source, to specific specializations in different types of news, to whether the news source frequently publishes press releases and AP content or actually generates unique articles.  This information would contribute to an authority score in each of these areas.  Google can then parse all of the articles in a cluster to decide, based on some predefined criteria, which authorities are needed to make a good article in this cluster.  The cluster is basically functioning as a dynamic hub.  Since articles are constantly coming out (Google News claims they index 100,000 articles a day), this algorithm can be run and the scores can be updated more or less continuously.  This, combined with some weighting for how recently the article was published, would generate a pretty good result.  So basically, in this situation, Google would function as its own hub for the hub-authority algorithm.
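Here is a rough sketch of that guess, and only of the guess: it is not Google's method. It assumes each source has a per-area authority score between 0 and 1, that the cluster (the "dynamic hub") assigns a weight to each area it needs, and that recency decays with a hypothetical half-life. All names, scores, and the 60-minute half-life are invented for illustration.

```python
def rank_cluster(articles, authority, needed_areas, now, half_life_minutes=60.0):
    """Rank the articles of one cluster by source authority times recency.

    articles: list of (article_id, source, published_minute) tuples.
    authority: dict source -> dict area -> score in [0, 1] (assumed).
    needed_areas: dict area -> weight the cluster places on that area (assumed).
    now: current time in minutes, on the same clock as published_minute.
    """
    scored = []
    for article_id, source, published in articles:
        source_scores = authority.get(source, {})
        # How well the source's authorities match what the cluster needs.
        fit = sum(w * source_scores.get(area, 0.0)
                  for area, w in needed_areas.items())
        # Exponential decay: the score halves every half_life_minutes.
        age = max(0.0, now - published)
        recency = 0.5 ** (age / half_life_minutes)
        scored.append((fit * recency, article_id))
    scored.sort(reverse=True)
    return [article_id for _, article_id in scored]


# Hypothetical sources and scores.
authority = {
    "LocalDaily": {"local": 0.9, "original": 0.8},
    "WireMirror": {"local": 0.2, "original": 0.1},
}
needed_areas = {"local": 0.5, "original": 0.5}
articles = [("x", "WireMirror", 100), ("y", "LocalDaily", 40)]
ranking = rank_cluster(articles, authority, needed_areas, now=100)
```

In this example the hour-old article from the stronger source still outranks the brand-new wire copy (0.85 × 0.5 = 0.425 versus 0.15 × 1 = 0.15). A shorter half-life would tip the balance toward whoever published most recently.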

Posted in Topics: Education
