Algorithmic challenges in temporal Web analytics

Marc Spaniol (GREYC, Caen)

December 15, 2015

Web-preservation organizations such as the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers such as sociologists, political scientists, media and market analysts, and experts on intellectual property.

Longitudinal data analytics – the Web of the Past – poses research challenges, but has not yet received due attention. The sheer size and content of Web archives make them relevant to analysts across a range of domains. The Internet Archive holds more than 350 billion versions of Web pages, captured over almost two decades. In my talk I will introduce several aspects that are relevant for temporal Web analytics. These include, but are not limited to, achieving archive coherence, named entity disambiguation, emerging concept identification, and knowledge linking.

Using dedicated examples, I will pinpoint the underlying research challenges from an algorithmic point of view.