armedguy@web:/# Johan Jatko | xr.gs

A realm of thoughts, solutions, and breaking things.

Why you shouldn’t trust crawlers

March 8, 2015       

Paywalls are a common concept on digital newspapers and other content-rich sites. Often implemented to provide extra revenue where ads alone can't, they divide a userbase into paying and non-paying visitors. But how do you allow rich-content linking and search-engine crawling when your content is behind a paywall?

Paywall @ The New York Times

A large number of sites selectively allow certain services (the Google crawler, the Facebook Open Graph crawler, etc.) to simply bypass the paywall, leaving their content open and readable. This may not seem like much of an issue at first, as letting Googlebot and friends see your premium content is necessary for SEO and social interactions, but the problem comes with developer tools.

Facebook Open Graph Debug Tool

Facebook provides a Debugger for its Open Graph metadata system, available to anyone with a Facebook account, against any site. It crawls your site using the same bots that are used in production (the ones that produce the pretty embeds in Facebook posts), and shows how it uses and compiles the metadata it finds. It also shows the page source exactly as the crawler received it. All in all, a very handy tool for optimizing your content for the social media giant.

Problems arise when people expose all their premium content to the crawlers, instead of just the necessary metadata. Facebook's documentation has a small line stating:

Additionally, you also do not need to include all the URL’s regular content to our crawler, just a valid HTML document with the appropriate meta tags.

Yet during my investigation, the majority of the tested sites had decided to expose all the data. This could be down to confusion caused by the paragraph just above the one quoted, which states:

If your content requires someone to login or if you restrict access after some amount of free content has been consumed, you will need to enable access for the Facebook Crawler. This access is only used to generate previews, and Facebook will not publicly expose your private content.

Because of this, I reached out to Facebook Security to get the documentation clarified, or possibly the debug tools redesigned. Their response was that they would look over the documentation and clarify it, something that as of today (2015-03-08) hasn't been done yet.
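The safe pattern Facebook describes can be sketched as follows. This is a minimal Node.js sketch, not anyone's actual implementation: the helper names are mine, while the user-agent substrings are the ones the crawlers publicly identify themselves with.

```javascript
// Substrings of the user agents the big crawlers identify themselves with
// (facebookexternalhit/Facebot for Open Graph, Googlebot for search).
const CRAWLER_UAS = ["facebookexternalhit", "Facebot", "Googlebot"];

function isKnownCrawler(userAgent) {
  return CRAWLER_UAS.some((needle) => userAgent.includes(needle));
}

// What a crawler should get for a paywalled article: a valid HTML document
// carrying only the Open Graph meta tags -- never the premium article body.
function metadataOnlyPage(title, description, url) {
  return [
    '<!DOCTYPE html><html><head>',
    '<meta property="og:title" content="' + title + '"/>',
    '<meta property="og:description" content="' + description + '"/>',
    '<meta property="og:url" content="' + url + '"/>',
    '</head><body></body></html>',
  ].join("\n");
}
```

Anyone with a Facebook account can point the Debug Tool at your URLs, so whatever this crawler branch returns is effectively public; the article body should only ever go to authenticated, paying readers.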

Google PageSpeed Insights

The PageSpeed Insights tool renders a small snapshot of the website as desktop and mobile. While it isn't as critical as the Open Graph debugger, a few sites still whitelisted the Insights crawler (though not in all cases, which suggests Google uses regular Googlebot user agents for Insights sometimes) and spilled all their data, leaving the premium content rendered and readable in the preview.

However, I decided to take no action here, as private sites should be tested using Google's browser plugin instead.

Conclusion

Developers shouldn't blindly trust crawlers with their premium content, as there seems to be no guarantee about who can access what. I am actively trying to contact the sites I have found exposing data, but hopefully people will review how they handle crawlers and stop giving them more than what they need.

This is not necessarily an issue on the part of the crawler developers, but they could help by telling content owners how little data they actually need to provide.

PS, don't forget noarchive (<meta name="robots" content="noarchive">), or a search engine's cached copy will expose the full page anyway!

 

Fixing WordPress to work with CloudFlare’s free SSL

October 13, 2014      

CloudFlare recently rolled out Universal SSL to all customers, including those on the free plan. This lets all CloudFlare customers have a secure connection between their websites and their visitors (well, not entirely, but let's not go into that now). I personally just installed Universal SSL on my blog, and everything was fine until I cleared my cache.

No stylesheets were loading.

By default, web browsers block attempts to load active HTTP resources, such as stylesheets and scripts, on a page served over HTTPS ("mixed content"). The easiest solution for this is protocol-relative URLs (e.g. href="//example.com/style.css"), which inherit the scheme of the page, and they work really well.

BUT

WordPress by default does NOT apply protocol-relative URLs to its generated links (such as those from wp_head()), which causes issues when using an SSL-terminating reverse proxy such as CloudFlare (this also applies to HAProxy, nginx and others). Because SSL terminates at the reverse proxy, the actual webserver receives a plain HTTP request. WordPress constructs its generated URLs based on the type of request reaching the webserver itself, and therefore emits http://-prefixed URLs that get blocked when loaded by a client that connected via HTTPS.

The internal request between the reverse proxy and the webserver also causes issues if you set the "Site URL" in your WordPress settings to the https://-prefixed address. Because every request between the two is plain HTTP, WordPress will keep trying to redirect the user to https://, causing a redirect loop.

The solution

To solve this, we use information provided (hopefully!) by the reverse proxy that tells us which protocol the visitor actually used. If it was HTTPS, we tell WordPress that the current connection is HTTPS.

The magic lines to do this are as follows:

// CloudFlare and most reverse proxies report the visitor's original
// protocol in the X-Forwarded-Proto request header.
if(isset($_SERVER["HTTP_X_FORWARDED_PROTO"]) && $_SERVER["HTTP_X_FORWARDED_PROTO"] === "https") {
    $_SERVER["HTTPS"] = "on"; // makes WordPress's is_ssl() return true
}

Simply put it in your theme's functions.php file, or in any other file that runs before any URLs are printed, and WordPress should generate https://-prefixed URLs!

Webradio via Spotify and SHOUTcast

October 10, 2014      

SHOUTcast is one of the most popular software suites for web radio nowadays. But by default (read: without much hassle) the only way to broadcast audio is via Winamp and its SHOUTcast plugin. However, when ludde from Spotify created his tribute Spotiamp, he included an embedded SHOUTcast server so you could stream Spotify over SHOUTcast to Sonos devices and the like.

But this also makes it possible to set up a web radio that streams via Spotify, by using the stream relay function in the SHOUTcast DNAS server.
(The reason Spotiamp alone can't act as a SHOUTcast server is that it doesn't support more than one connected client.)

Step 1

Obviously, install Spotiamp. Then enable its SHOUTcast server on the default address.

Step 2

Install SHOUTcast DNAS from their website. Configure DNAS to your liking, but in your stream config, set the streamrelayurl/relayurl to your Spotiamp SHOUTcast URL.

(Look at your specific DNAS version for the correct way to do this)
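As a sketch, the relevant part of an sc_serv.conf could look like the following. The exact key names vary between DNAS versions (hence the streamrelayurl/relayurl note above), and the 127.0.0.1:8000 address is an assumed Spotiamp default — use whatever address Spotiamp actually reports for its server.

```
; Relay the local Spotiamp SHOUTcast server as stream 1.
; DNAS 2-style syntax; DNAS 1 uses relayurl= instead.
streamid_1=1
streamrelayurl_1=http://127.0.0.1:8000
```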

 

After this Spotiamp should be feeding the SHOUTcast DNAS server live music and a title.

 

Late Night Ventrilo – Hi Mom

August 7, 2014   

 

Security patches for CoD4 servers

August 5, 2014    

Call of Duty 4 has been a popular game for many years, and its player base has persisted even though newer Call of Duty titles have been released.

A long time ago, the CoD4 servers I was maintaining were targeted by hackers who had found a new method of becoming unbannable on Call of Duty 4 servers.
The exploit relied on the fact that cracked Call of Duty 4 servers never verify user GUIDs, and therefore allow all players, even cracked ones, to connect. Since the server skipped verifying GUIDs against a master server, anything could be sent as a GUID, while the server itself only expected a 32-character hash made up of the characters 0123456789abcdef.

This caused issues with external admin tools and with banning players (which is done by GUID), because they in turn also expected only 0123456789abcdef, while the hackers were sending all kinds of Russian/Hebrew/random characters.

Solution

To solve this, the servers had to be patched with a custom routine that validates player GUIDs against their normal format, [0-9a-z]{32}, and kicks any connecting player that doesn't match.
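The check itself is simple. Here is a sketch of the same rule in JavaScript (the real patch is x86 assembly inside the server binary, and this helper name is mine):

```javascript
// A legitimate CoD4 GUID is exactly 32 characters from [0-9a-z];
// anything else (Cyrillic, Hebrew, wrong length) gets the client dropped.
function isValidGuid(guid) {
  return /^[0-9a-z]{32}$/.test(guid);
}
```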

Luckily, CoD4 is based on Quake 3 Arena, whose source is open nowadays, so I found a pretty useless function that normally checks whether an IP is local or external, and overwrote it with my custom assembly. =)

https://gist.github.com/ArmedGuy/6ebb87a924f5833bd03e

In short, it:

  • Validates player GUIDs
  • Uses a special exception to allow CoD4 master server listing, great when you are running cracked servers
  • Includes Aluigi's buffer overflow fix for va()

 

Attachment: iw3mp.exe (3.18 MB)

 

Finally, my new website

August 3, 2014   

That took a while, but now I have a theme that I am satisfied with. yay

I will try to update it as often as I find something interesting.

RCon module for Battlefield 2

  

A small Node.js program that can be used to contact Battlefield 2 servers.

https://gist.github.com/ArmedGuy/7082803

 

PS. I am sorry that I used a Battlefield 3 image, but the Battlefield 2 ones were crap.