Millions of Queries

Despite ongoing research and constant improvements to online search engines, no user in guaranteed the results he/she desires for any given search. Of course we may believe sometimes that our computers can read our minds – with advances such as AutoComplete and cookie-based recognition – but inevitably we find ourselves modifying search queries and/or scrolling through pages of resultant links to find just exactly what we are looking for.

During the year 2006, researchers at the University of Michigan began compiling histories of individual queries from the AOL.com search engine – available at http://tangra.si.umich.edu/clair/clair/qla.html . The following records were kept of about 36 million queries: an anonymous userID, the user’s verbatim text query, the link that that user ended up choosing (from the results page), and the rank of that one link. Interestingly, they began to notice that on occasion a link would appear in different ranks for the same query, or rather, for the same rank in different queries. For the most part, however, they noted that the most popular links were typically ranked 1 or 2, and that these links could be accessed from a variety of (related) textual queries.

It consequently appears that although we may believe computers can exactly interpret our needs as users, we find that when dealing specifically with search engine queries, computers are programmed to display a certain list of results, independent of the exact wishes of the user. The article lists, for example, that if I am trying to buy a car – and simply query “car” – I am likely to find links pertaining to anything in the car world. Because the computer doesn’t know which type of car I would like to buy (or even that I would like to buy a car at all) I need to modify my search, to say “Ford sedan” or “cars for sale.”

Finally, these “query sessions” are made into directed graphs with one node representing the initial query and an edge from that node to the subsequent, modified query (and potentially another edge to a further modified query, etc.) *Note that the size of a node is proportionate to the number of times it was a query.* The compactness of a group of linked nodes is referred to as a cluster, and the graph as a whole is assigned a clustering coefficient – based on the number and magnitudes of its clusters.When displayed as a directed G=(V,E) graph, the results are rather interesting: clustering is readily visible, and the graph has a calculated clustering coefficient of about 0.35. However, when a computer randomly generated an equal number of queries (and made a similar graph) the coefficient dropped to a mere 0.01 – indicating that the likelihood of humans crafting their queries similarly is actually quite high.

I enjoyed reading this article and relating it to what we’ve recently studied in 204, because the ideas of page rank and generalized second price auctions (for slots) do little to address the needs of the user. This does well to define that metaphorical line between actual and artificial intelligence, in that computers must be programmed to return results independent of users’ feelings or thoughts. However, we do find that in Michigan’s query logs, many times the method of page rank was useful – in that many users chose links in the 1 or 2 position.

Also, I thought that the transition from logs to a directed graph was quite unique (as it prompted me to recall topics from earlier in the class.) We see from this directed graph that many cycles exist, as users can begin and end at effectively any query. Also, although it seems that beginning a query with “car” would not likely lead a user to later querying, say, “flower”, it appears that this directed graph actually contains a global network, lending to the small-world principle; the average distance between nodes in the graph is only about 2.5 edges. This helps to solidify the theories of a unified Web.

Posted in Topics: Technology

Responses are currently closed, but you can trackback from your own site.

Comments are closed.



* You can follow any responses to this entry through the RSS 2.0 feed.