Friday, February 15, 2008

Highlights from Web Search and Data Mining 2008


The signal-to-noise ratio at WSDM was great. It's the only conference I've been to in a while where I had to do homework after the talks (which is a good thing).


There were some really good highlights from WSDM which I wanted to point out. I’m going to sit down with the papers and read through them this weekend.


These are the highlights in my opinion. There were other really great papers here that deserve a lot of respect; it's just that these specific papers apply directly to my work.


Crawl Ordering by Search Impact


We study how to prioritize the fetching of new pages under the objective of maximizing the quality of search results. In particular, our objective is to fetch new pages that have the most impact, where the impact of a page is equal to the number of times the page appears in the top K search results for queries, for some constant K, e.g., K = 10. Since the impact of a page depends on its relevance score for queries, which in turn depends on the page content, the main difficulty lies in estimating the impact of the page before actually fetching it. Hence, impact must be estimated based on the limited information that is available prior to fetching page content, e.g., the URL string, number of in-links, referring anchortext.
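To make the idea concrete, here's a rough sketch of what impact-ordered crawling could look like. This is my own toy heuristic, not the authors' estimator: the pre-fetch signals (in-link count, anchor-text overlap with query-log terms, URL depth) come from the abstract, but the scoring function, weights, and example URLs are invented for illustration.

```python
# Toy sketch (not the paper's estimator): rank unfetched URLs by a crude impact
# proxy built only from pre-fetch signals. Feature weights are illustrative.
import heapq
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    url: str
    in_links: int        # number of known in-links
    anchor_terms: set    # distinct terms in referring anchor text
    query_terms: set     # terms from the query log we care about

def estimated_impact(c: Candidate) -> float:
    """Heuristic proxy for how often the page might reach the top-K results."""
    depth_penalty = c.url.count("/") - 2            # deeper URLs score lower
    anchor_overlap = len(c.anchor_terms & c.query_terms)
    return math.log1p(c.in_links) + 2.0 * anchor_overlap - 0.5 * max(depth_penalty, 0)

def crawl_order(candidates):
    """Return candidates in descending order of estimated impact."""
    return heapq.nlargest(len(candidates), candidates, key=estimated_impact)

if __name__ == "__main__":
    frontier = [
        Candidate("http://example.com/news/budget", 120, {"budget", "news"}, {"budget", "tax"}),
        Candidate("http://example.com/a/b/c/d/page", 3, {"misc"}, {"budget", "tax"}),
    ]
    for c in crawl_order(frontier):
        print(f"{estimated_impact(c):6.2f}  {c.url}")
```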


Ranking Web Sites with Real User Traffic


This was the highlight of the conference for me.


We analyze the traffic-weighted Web host graph obtained from a large sample of real Web users over about seven months. A number of interesting structural properties are revealed by this complex dynamic network, some in line with the well-studied boolean link host graph and others pointing to important differences. We find that while search is directly involved in a surprisingly small fraction of user clicks, it leads to a much larger fraction of all sites visited. The temporal traffic patterns display strong regularities, with a large portion of future requests being statistically predictable by past ones. Given the importance of topological measures such as PageRank in modeling user navigation, as well as their role in ranking sites for Web search, we use the traffic data to validate the PageRank random surfing model. The ranking obtained by the actual frequency with which a site is visited by users differs significantly from that approximated by the uniform surfing/teleportation behavior modeled by PageRank, especially for the most important sites. To interpret this finding, we consider each of the fundamental assumptions underlying PageRank and show how each is violated by actual user behavior.
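For reference, the model being tested here is plain PageRank with uniform teleportation. Below is a minimal power-iteration sketch on a made-up host graph, compared against an equally made-up table of visit counts; the data is invented, but it shows the kind of ranking disagreement the paper measures with real traffic.

```python
# Rough sketch, not the paper's methodology: compute standard PageRank (uniform
# teleportation) on a toy host graph and compare its ordering against an assumed
# empirical visit-frequency ordering. The graph and traffic numbers are made up.
def pagerank(links, damping=0.85, iters=50):
    """Power iteration over an adjacency dict {host: [outlinked hosts]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {h: 1.0 / n for h in nodes}
    for _ in range(iters):
        new = {h: (1.0 - damping) / n for h in nodes}
        for h, outs in links.items():
            if outs:
                share = damping * rank[h] / len(outs)
                for dest in outs:
                    new[dest] += share
            else:  # dangling node: spread its mass uniformly
                for dest in nodes:
                    new[dest] += damping * rank[h] / n
        rank = new
    return rank

if __name__ == "__main__":
    host_graph = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"], "c.com": ["a.com"]}
    visits = {"a.com": 900, "b.com": 500, "c.com": 250}   # hypothetical traffic counts
    pr = pagerank(host_graph)
    print("PageRank order:", sorted(pr, key=pr.get, reverse=True))
    print("Traffic order: ", sorted(visits, key=visits.get, reverse=True))
```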


A Scalable Pattern Mining Approach to Web Graph Compression with Communities


This paper seems interesting, but for smaller graphs the setup overhead might not justify the implementation cost. Huffman coding and Beta codes already seem to go a long way toward an efficient graph storage representation.


A link server is a system designed to support efficient implementations of graph computations on the web graph. In this work, we present a compression scheme for the web graph specifically designed to accommodate community queries and other random access algorithms on link servers. We use a frequent pattern mining approach to extract meaningful connectivity formations. Our Virtual Node Miner achieves graph compression without sacrificing random access by generating virtual nodes from frequent itemsets in vertex adjacency lists. The mining phase guarantees scalability by bounding the pattern mining complexity to O(E log E). We facilitate global mining, relaxing the requirement for the graph to be sorted by URL, enabling discovery for both inter-domain as well as intra-domain patterns. As a consequence, the approach allows incremental graph updates. Further, it not only facilitates but can also expedite graph computations such as PageRank and local random walks by implementing them directly on the compressed graph. We demonstrate the effectiveness of the proposed approach on several publicly available large web graph data sets. Experimental results indicate that the proposed algorithm achieves a 10- to 15-fold compression on most real-world web graph data sets.
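Here's a toy illustration of the virtual-node idea as I read the abstract: when several vertices share a pattern in their adjacency lists, the pattern is stored once on a virtual node and the vertices point to it. The real system mines frequent itemsets (approximate, overlapping patterns); the sketch below cheats by only grouping exact matches.

```python
# Toy illustration (not the paper's Virtual Node Miner): if several vertices
# share the same set of out-neighbors, introduce a "virtual node" that carries
# that shared set once, and have the original vertices link to the virtual node.
from collections import defaultdict

def compress_with_virtual_nodes(adj, min_support=2):
    """adj: {node: frozenset(out-neighbors)}. Returns a new adjacency dict."""
    groups = defaultdict(list)
    for node, outs in adj.items():
        groups[outs].append(node)

    compressed = dict(adj)
    next_virtual = 0
    for outs, members in groups.items():
        if len(members) >= min_support and len(outs) > 1:
            vnode = f"V{next_virtual}"
            next_virtual += 1
            compressed[vnode] = set(outs)   # virtual node stores the pattern once
            for m in members:
                compressed[m] = {vnode}     # members now reference the pattern
    return compressed

if __name__ == "__main__":
    adj = {
        1: frozenset({10, 11, 12}),
        2: frozenset({10, 11, 12}),
        3: frozenset({10, 11, 12}),
        4: frozenset({99}),
    }
    for node, outs in sorted(compress_with_virtual_nodes(adj).items(), key=str):
        print(node, "->", sorted(outs, key=str))
```

In this tiny example the three identical adjacency lists (nine edges) collapse to three pointers plus one shared pattern, which is the same intuition behind the 10- to 15-fold compression reported in the paper.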


Preferential Behavior in Online Groups


Online communities in the form of message boards, listservs, and newsgroups continue to represent a considerable amount of the social activity on the Internet. Every year thousands of groups flourish while others decline into relative obscurity; likewise, millions of members join a new community every year, some of whom will come to manage or moderate the conversation while others simply sit on the sidelines and observe. These processes of group formation, growth, and dissolution are central in social science, and in an online venue they have ramifications for the design and development of community software.


On Ranking Controversies in Wikipedia: Models and Evaluation


Wikipedia is a very large and successful Web 2.0 example. As the number of Wikipedia articles and contributors grows at a very fast pace, there are also increasing disputes occurring among the contributors. Disputes often happen in articles with controversial content. They also occur frequently among contributors who are “aggressive” or controversial in their personalities. In this paper, we aim to identify controversial articles in Wikipedia. We propose three models, namely the Basic model and two Controversy Rank (CR) models. These models draw clues from collaboration and edit history instead of interpreting the actual articles or edited content. While the Basic model only considers the amount of disputes within an article, the two Controversy Rank models extend the former by considering the relationships between articles and contributors. We also derive enhanced versions of these models by considering the age of articles. Our experiments on a collection of 19,456 Wikipedia articles show that the Controversy Rank models can more effectively determine controversial articles compared to the Basic and other baseline models.
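The Controversy Rank models sound like a mutual-reinforcement scheme: an article is more controversial if its disputes involve controversial contributors, and a contributor is more controversial if they get into disputes on controversial articles. The sketch below is my own simplification of that loop; the dispute triples, normalization, and update rule are not taken from the paper.

```python
# Hedged sketch of the mutual-reinforcement idea I read into the Controversy
# Rank models. Everything below (weights, normalization, data) is invented.
def controversy_rank(disputes, iters=20):
    """disputes: list of (article, contributor, dispute_count) triples."""
    articles = {a for a, _, _ in disputes}
    contributors = {c for _, c, _ in disputes}
    a_score = {a: 1.0 for a in articles}
    c_score = {c: 1.0 for c in contributors}
    for _ in range(iters):
        # Articles inherit controversy from the contributors they fight with.
        new_a = {a: 0.0 for a in articles}
        for a, c, w in disputes:
            new_a[a] += w * c_score[c]
        norm = sum(new_a.values()) or 1.0
        a_score = {a: v / norm for a, v in new_a.items()}

        # Contributors inherit controversy from the articles they fight over.
        new_c = {c: 0.0 for c in contributors}
        for a, c, w in disputes:
            new_c[c] += w * a_score[a]
        norm = sum(new_c.values()) or 1.0
        c_score = {c: v / norm for c, v in new_c.items()}
    return a_score, c_score

if __name__ == "__main__":
    disputes = [("Abortion", "userX", 5), ("Abortion", "userY", 3), ("Kittens", "userY", 1)]
    art, contrib = controversy_rank(disputes)
    print(sorted(art.items(), key=lambda kv: -kv[1]))
```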


Finding High-Quality Content in Social Media


The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions (social media sites) becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high-quality content.
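In spirit this is a supervised learning problem over content and community-feedback features. Here is a minimal sketch of that setup with invented features and hand-made labels; nothing below is the paper's actual feature set or model.

```python
# Illustrative only: combine content and community-feedback signals in a simple
# classifier, in the spirit of the paper. Features and labels are invented.
from sklearn.linear_model import LogisticRegression

# Each row: [answer length, community votes, asker-selected (0/1), link count]
X = [
    [420, 12, 1, 2],
    [35,   0, 0, 0],
    [900,  7, 1, 4],
    [20,   1, 0, 0],
]
y = [1, 0, 1, 0]  # 1 = high quality, 0 = low quality (hand-labeled for the example)

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[300, 5, 0, 1]]))  # [P(low quality), P(high quality)]
```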


