Hadoop 2.7: SequenceFile.Sorter.merge() throws fs.LocalDirAllocator$AllocatorPerContext: Disk Error Exception: org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create directory: /user/xyz/...

June 15th, 2015

Calling merge() with the two HDFS-only parameters (Path[] inFiles, Path outFile) on a SequenceFile.Sorter instance throws a "Disk Error Exception". The solutions suggested on the web are misleading: they recommend checking the available hard disk space on all cluster nodes. On the cluster I use, hard disk space is not an issue.

The problem is that the method requires a writable location on the nodes' local disk under the same path as the HDFS output path. I worked around the issue by writing the merged file to /tmp, which is writable in HDFS _and_ on the local file system. After the method completes, the file exists only in HDFS. I then move it to the desired location with fs.rename(tmpPath, destinationPath).

This might be a configuration issue: merge() accepts a _remote_ HDFS path, so it should not require the same path to exist locally.

Hadoop 2.6: using distributed cache files

June 8th, 2015

Due to API changes, the way to provide and retrieve distributed cache files changed quite a bit. In order to provide a file in the run() method, add a URI to the job:

String cacheFileName = "file:/home/xyz/...#aliasname";
job.addCacheFile(new URI(cacheFileName));

To retrieve the file, for example in the setup() method of a mapper or reducer:

URI[] uris = context.getCacheFiles();
// Only read the file if at least one cache file was registered
if (uris != null && uris.length > 0 && uris[0] != null) {
  BufferedReader lineReader = new BufferedReader(new FileReader("./aliasname"));
  // do something with it
}

The trick is that Hadoop creates a symlink named aliasname in the task's current working directory, so the actual URI retrieved from the context is not needed.
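The alias is simply the fragment part of the cache URI, so it can be read back with java.net.URI instead of being hard-coded in the mapper. The following is only a minimal sketch and not part of the Hadoop API: the class name CacheAlias, the helper aliasOf and the example path are hypothetical, and the fallback to the plain file name when no fragment is given is an assumption of this sketch.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Hadoop: derives the local symlink name from a cache URI.
class CacheAlias {

    // Returns the fragment after '#', or the last path segment if there is no fragment
    // (the fallback is an assumption of this sketch).
    static String aliasOf(String cacheUri) throws URISyntaxException {
        URI uri = new URI(cacheUri);
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws URISyntaxException {
        // The path is a made-up example.
        System.out.println(aliasOf("file:/home/xyz/lookup.txt#aliasname"));
    }
}
```

In setup(), the result could replace the hard-coded name, for example new FileReader("./" + CacheAlias.aliasOf(uris[0].toString())).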

Netbeans: Remove Main-Class entry from MANIFEST.MF in generated JAR

June 4th, 2015

When generating a JAR with the "build" target, Netbeans automatically adds a Main-Class line to the manifest if this property is set in the project properties. This is not particularly useful if the JAR contains multiple classes with main methods for Hadoop: if the main class is set in the manifest file, Hadoop ignores the main class parameter from the command line.
To get rid of this undesired feature, add the following code to build.xml right before the import of nbproject/build-impl.xml:

<propertyfile file="nbproject/project.properties" comment="My properties">
  <entry key="main.class" operation="del"/>
</propertyfile>


The code removes the property from the project.properties file, so that build-impl.xml no longer writes the Main-Class entry to the manifest.

Ubuntu 14.04: Setting up Epson XP-205 Scanner

May 23rd, 2015

Download the drivers from Epson and install the packages: http://download.ebz.epson.net/dsc/search/01/search/searchModule

  • epson-inkjet-printer-201202w_1.0.0-1lsb3.2_amd64.deb: For the printer
  • iscan-data_1.36.0-1_all.deb: For the scanner
  • iscan_2.30.1-1~usb0.1.ltdl7_amd64.deb: For the scanner

After installation, it should be possible to run iscan from the command line, but it reports that the scanner could not be found. scanimage -L lists an unknown scanner:

device `epkowa:usb:001:006' is a Epson (unknown model) flatbed scanner

Interestingly, the scanner is recognized when the command is run as root:

device `epkowa:usb:001:006' is a Epson ME-301/XP-200 Series flatbed scanner

The problem and the solution are described here: http://ubuntuforums.org/showthread.php?t=1563178

After adding the corresponding rule to /lib/udev/rules.d/40-libsane.rules and unplugging and replugging the USB cable, the scanner can be used by any non-root user as well.

Using WiFi

The easiest way to connect the printer to WiFi is WPS (WiFi Protected Setup). Start WPS on the router, then press the "Wi-Fi" button on the printer until the lights start flashing. Once the connection is established, the lights stop flashing. The printer's IP address can be found in the router's admin interface; it should be assigned statically.

In order to add the printer to Ubuntu, choose "add" in the printer dialog. The printer appears automatically under "network printers".

To use the scanner with WiFi, add the line "net IP-ADDRESS" to /etc/sane.d/epkowa.conf, and replace IP-ADDRESS with the IP of the device. Any sane-based scan program is able to use the scanner afterwards.

Running gnuplot from Java with parameters

May 19th, 2015

It took me quite a while to get a gnuplot call running from Java instead of the terminal. An example terminal call looks like:
gnuplot -e "filename='histograms/input_10000'; titlename='input_10000'; outputname='histograms/input_10000.pdf'" plot_histograms.gnuplot

The corresponding Java call looks like:

// Needs java.io.BufferedReader, java.io.File, java.io.InputStream and java.io.InputStreamReader
String[] command = {"/usr/bin/gnuplot",
"-e",
"filename='histograms/" + fileNameArr[0] + "'; titlename='" + fileNameArr[0] + "'; outputname='histograms/" + fileNameArr[0] + ".pdf'",
"plot_histograms.gnuplot"};
ProcessBuilder pb = new ProcessBuilder(command);
pb.directory(new File(mainPath));
Process process = pb.start();

// Read gnuplot's error stream so that problems become visible
InputStream is = process.getErrorStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
String line;

while ((line = br.readLine()) != null) {
  System.err.println(line);
}

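Note that the double quotes which surround the -e argument on the terminal must not be included in the Java command array: ProcessBuilder passes each array element to gnuplot as a single argument, without any shell parsing.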
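To make the quoting rule concrete, the following sketch builds the argument list with one element per argument. The class GnuplotCommand and its method build are hypothetical names; the gnuplot path, the histograms/ directory and the script name are taken from the terminal call above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: assembles the gnuplot call for one input name.
class GnuplotCommand {

    static List<String> build(String name) {
        // The single quotes belong to gnuplot's string syntax and stay;
        // the shell's double quotes around the -e argument are dropped.
        String assignments = "filename='histograms/" + name + "'; "
                + "titlename='" + name + "'; "
                + "outputname='histograms/" + name + ".pdf'";
        return Arrays.asList("/usr/bin/gnuplot", "-e", assignments, "plot_histograms.gnuplot");
    }

    public static void main(String[] args) {
        // Prints one argument per line; pass the list to new ProcessBuilder(list) to run it.
        for (String arg : build("input_10000")) {
            System.out.println(arg);
        }
    }
}
```

Printing the list before starting the process is a cheap way to check that the -e argument arrives as a single element and contains no double quotes.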