Crawling AJAX

July 5, 2008 – 3:16 PM

Traditionally, a web spider system is tasked with connecting to a server, pulling down the HTML document, scanning the document for anchor links to other HTTP URLs and repeating the same process on all of the discovered URLs. Each URL represents a different state of the traditional web site. In an AJAX application, much of the page content isn’t contained in the HTML document, but is dynamically inserted by Javascript during page load. Furthermore, anchor links can trigger javascript events instead of pointing to other documents. The state of the application is defined by the series of Javascript events that were triggered after page load. The result is that the traditional spider is only able to see a small fraction of the site’s content and is unable to index any of the application’s state information.


You must be logged in to post a comment.