Submitted by rok on

Amazon Elastic MapReduce (MR) is great for analyzing web server logs. In some cases that requires deciphering the user agent string, which can be a daunting task: every browser family has many agent strings, varying with version and enabled plugins.

One of the tools that helps with matching agent strings is PHP's get_browser() function: http://us2.php.net/manual/en/function.get-browser.php . To work, it requires an external browscap.ini file and a php.ini update. The php.ini update can be done in an MR bootstrap action: http://www.zlender.net/2011/02/25/amazon-elastic-map-reduce-adding-php-e...

Another way I found is to use https://github.com/GaretJax/phpbrowscap , which removes the need to edit php.ini. It still uses the browscap.ini configuration file for all browser detection, so you keep the benefit of a regularly updated browser list.

To use external files in your mapper or reducer, Hadoop's distributed cache comes in handy: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide... . It lets you send files or archives along with the job request, and those files are then available on all instances.

Starting a step with the distributed cache is easy. Adding

--cache s3n://BUCKET/scripts/php/includes/browscap.ini#browscap.ini \
--args -cacheFile,s3n://BUCKET/scripts/php/includes/Browscap.php#Browscap.php \

to the elastic-mapreduce command makes these two files available to the mapper or reducer on all instances. Currently elastic-mapreduce has a bug that prevents adding multiple --cache options, so all but the first need to be passed as args. Both browscap.ini and Browscap.php will be available in the same directory where the mapper and reducer run.
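
Since the cached files are expected in the working directory of the mapper and reducer, a quick check at the top of the script can make a missing file fail loudly instead of producing empty output. The following is only a minimal sketch; the helper name missing_cache_files() is made up for illustration and is not part of Hadoop, elastic-mapreduce or phpbrowscap.

```php
<?php
// Hypothetical helper (not part of any library): return the names from
// $names that are not readable regular files inside $dir.
function missing_cache_files(array $names, string $dir = '.'): array
{
    $missing = [];
    foreach ($names as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        if (!is_file($path) || !is_readable($path)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Possible use at the top of a mapper, assuming the two files from the
// --cache / -cacheFile options above:
//
//   $missing = missing_cache_files(['browscap.ini', 'Browscap.php']);
//   if ($missing) {
//       fwrite(STDERR, 'Missing cached files: ' . implode(', ', $missing) . "\n");
//       exit(1);
//   }
```

Writing the message to STDERR keeps it out of the mapper's output records, so it should end up in the task logs rather than in the job results.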

To use Browscap in the mapper or reducer:

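// Browscap.php and browscap.ini both come from the distributed cache.
include_once('Browscap.php');
// '/tmp' is the directory where Browscap stores its parsed cache.
$bc = new Browscap('/tmp');
// Read the local copy of browscap.ini instead of downloading one.
$bc->localFile = 'browscap.ini';
$user_agent = $bc->getBrowser($user_agent_string_from_log);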
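
To show where $user_agent_string_from_log might come from, here is a rough sketch of the mapper side. It assumes the logs are in Apache "combined" format, where the user agent is the last double-quoted field; the post does not say which log format is used. The function names extract_user_agent() and map_line() are made up for this example, and reading the browser name from the Browser property of the getBrowser() result is an assumption based on the fields in browscap.ini.

```php
<?php
// Pull the last double-quoted field out of a log line.
// ASSUMPTION: Apache "combined" format, user agent is the final quoted field.
function extract_user_agent(string $line): ?string
{
    if (preg_match('/"([^"]*)"\s*$/', $line, $m) !== 1) {
        return null;
    }
    // Apache logs a missing header as "-".
    return ($m[1] === '' || $m[1] === '-') ? null : $m[1];
}

// Turn one log line into a streaming record "browser<TAB>1", or null when
// the line has no usable user agent. $lookup is any callable that maps a
// user agent string to a browser name.
function map_line(string $line, callable $lookup): ?string
{
    $ua = extract_user_agent($line);
    return $ua === null ? null : $lookup($ua) . "\t1";
}

// Wiring for a real mapper, assuming Browscap.php and browscap.ini were
// shipped through the distributed cache into the working directory:
//
//   include_once('Browscap.php');
//   $bc = new Browscap('/tmp');
//   $bc->localFile = 'browscap.ini';
//   $lookup = function ($ua) use ($bc) {
//       return $bc->getBrowser($ua)->Browser; // assumed property name
//   };
//   while (($line = fgets(STDIN)) !== false) {
//       $out = map_line($line, $lookup);
//       if ($out !== null) {
//           echo $out, "\n";
//       }
//   }
```

Passing the lookup in as a callable keeps the parsing logic testable on a machine without Browscap.php, and a reducer can then sum the 1s per browser name.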
