The ability to combine executables with the Hadoop MapReduce engine is a very powerful concept known as streaming. Thanks to a very active community, ample examples are available on the web that cater for a large variety of languages and implementations. Sources like this and this have also provided some R language examples that are very easy to follow.
In my case I was only required to look at the integration between the two tools. Unfortunately, I had not been able to find any detail on how to implement streaming of R on the Hortonworks Windows distribution.
Why Windows, you ask? Well, I guess the reason is the same one that led Hortonworks to consider shipping a Windows distribution in the first place. Sometimes it is just easier to reach a market in an environment that is perceived as familiar. But this may become a topic for another post someday.
In hindsight everything always appears quite straightforward. Still, I would like to briefly share my findings here to reduce the research time for anyone else who is presented with a similar challenge.
As a first requirement, R needs to be installed on every data node. The same applies to all additional R packages that are used within your application.
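One way to check this requirement is to run a small script on each data node and see which packages fail to load. The following is only a sketch: the function name missing_packages and the package list are hypothetical and need to be replaced with whatever your own map and reduce scripts actually use.

```r
# Hypothetical helper: return the packages from `pkgs` that cannot be
# loaded on this node. requireNamespace() checks availability without
# attaching the package to the search path.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Assumed package list, for illustration only.
required <- c("stats", "utils")

missing <- missing_packages(required)
if (length(missing) > 0) {
  cat("Missing on this node:", paste(missing, collapse = ", "), "\n")
} else {
  cat("All required packages are available.\n")
}
```

Running this with Rscript on every node before submitting the job can save a round of failed task attempts.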
Next, create two files containing the instructions for the map and reduce tasks in R. In the example below the files are named map.R and reduce.R.
Assuming that your data is already loaded into HDFS, issue the following command on the Hadoop command line:
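To give an idea of what these two files might contain, here is a minimal word-count sketch. It is not the code from my actual job: the function names map_line and reduce_records are made up for illustration, and the only assumption about streaming is the usual contract that each task reads lines from standard input and writes tab-separated key/value lines to standard output.

```r
# Mapper logic (would live in map.R): turn one input line into
# "word<TAB>1" records, one per word.
map_line <- function(line) {
  words <- unlist(strsplit(line, "[[:space:]]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  if (length(words) == 0) return(character(0))
  paste(words, 1L, sep = "\t")
}

# Reducer logic (would live in reduce.R): sum the counts per key
# across "key<TAB>count" records.
reduce_records <- function(records) {
  if (length(records) == 0) return(character(0))
  parts  <- strsplit(records, "\t", fixed = TRUE)
  keys   <- vapply(parts, function(p) p[1], character(1))
  counts <- vapply(parts, function(p) as.integer(p[2]), integer(1))
  totals <- tapply(counts, keys, sum)    # one total per distinct key
  paste(names(totals), totals, sep = "\t")
}

# map.R would then end with something like:
#   writeLines(unlist(lapply(readLines(file("stdin")), map_line)))
# and reduce.R with:
#   writeLines(reduce_records(readLines(file("stdin"))))
```

For simplicity the sketch reads the whole input of a task into memory. For large inputs you would rather read standard input line by line and, in the reducer, emit a total whenever the key changes, since the streaming framework delivers the reducer input sorted by key.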
hadoop jar share/hadoop/tools/lib/hadoop-streaming-22.214.171.124.1.1.0-1621.jar -files "file:///c:/Apps/map.R,file:///c:/Apps/reduce.R" -mapper "C:\R\R-3.1.0\bin\x64\Rscript.exe map.R" -reducer "C:\R\R-3.1.0\bin\x64\Rscript.exe reduce.R" -input /Logs/input -output /Logs/output
A couple of comments regarding the streaming jar options used in this command under Windows:
-files: in order to submit multiple files to the streaming process, the comma-separated list needs to be enclosed in double quotes. Access to the local file system is provided using the file:/// scheme.
-mapper and -reducer: since an R script can’t be prefaced with a hashbang on Windows, we need to provide the execution environment as part of the option. As above, the path to the Rscript executable together with the name of the R file needs to be enclosed in double quotes.