Why you shouldn’t trust crawlersMarch 8, 2015
A large number of sites are selectively allowing certain services (Google crawler, Facebook Open Graph crawler etc etc) to simply bypass the paywall, leaving their content open and readable. This may not seem much of an issue at first, as allowing Googlebot and others to visit your premium content is neccesary for SEO and social interactions, but the problem comes with developer tools.
Facebook Open Graph Debug Tool
Facebook provides a Debugger Platform for its Open Graph metadata system, available to anyone with a Facebook account to debug any site. It crawls your site using the same bots that are used in the live version (that provides pretty embeds in Facebook posts etc), and shows how it uses & compiles the metadata it finds. It also shows the page-source for the version of the site that the crawler got. All-in-all a very handy tool for optimizing your content for the social media-giant.
Problems arise when people decide to expose all their premium content to the crawlers, instead of just the neccesary metadata. Facebook has a small line in their documentation stating
Additionally, you also do not need to include all the URL’s regular content to our crawler, just a valid HTML document with the appropriate meta tags.
, but during my investigation the majority of the tested sites had decided to expose all the data. This could be due to a possible confusion because of a paragraph above the one mentioned, which states:
If your content requires someone to login or if you restrict access after some amount of free content has been consumed, you will need to enable access for the Facebook Crawler. This access is only used to generate previews, and Facebook will not publicly expose your private content.
Because of this, I decided to reach out to Facebook Security to get the documentation clarified, or possibly a redesign of the debug tools. Their response was that they were going to look over the documentation and clarify it, something that as of today (2015-03-08) hasn’t been done yet.
Google PageSpeed Insights
The PageSpeed Insights tools provide a small snapshot of the website rendered as desktop and mobile. While it isn’t as critical as the Open Graph Debug tools, a few sites still decided to filter and allow the Insights crawler (not in all cases, which seems to mean that Google are using Googlebots for Insights in some cases?), and spill all of its data. This can cause the content to be rendered and readable in the preview.
However I decided to take no action here, as private sites should be tested using their browser plugin instead.
Developers shouldn’t trust crawlers blindly with their premium content, as there seems to be no guarantee on who can access what. I am actively trying to contact those that I have found exposing data, but hopefully people can look over how they handle crawlers and stop giving them more than they what need.
This is not neccesarily an issue on the part of the crawler developers, but they should also help in telling content owners how much data they should give.
PS, don’t forget noarchive!