Recently I ran into a problem with Hadoop that caused DFSClient to stop responding and enter an infinite loop while displaying the “Could not complete file …” message. The Hadoop version is 0.20.1, svn revision 810220, but from the code it seems to me that this issue can occur on newer versions too.

The NameNode logs show that the file was created and blocks were assigned to it, but there is no complete message in the logs. The DataNode logs contain an exception that looks like a network connectivity issue.

I found this issue on the web, HDFS-148, and it seems that I have the same problem. My biggest problem is that I cannot reproduce the issue; it happens roughly once a month.

After some digging in the code I found the part that causes the trouble:

    
    private void completeFile() throws IOException {
      long localstart = System.currentTimeMillis();
      boolean fileComplete = false;
      while (!fileComplete) {
        fileComplete = namenode.complete(src, clientName);
        if (!fileComplete) {
          if (!clientRunning ||
                (hdfsTimeout > 0 &&
                 localstart + hdfsTimeout < System.currentTimeMillis())) {
              String msg = "Unable to close file because dfsclient " +
                            " was unable to contact the HDFS servers." +
                            " clientRunning " + clientRunning +
                            " hdfsTimeout " + hdfsTimeout;
              LOG.info(msg);
              throw new IOException(msg);
          }
          try {
            Thread.sleep(400);
            if (System.currentTimeMillis() - localstart > 5000) {
              LOG.info("Could not complete file " + src + " retrying...");
            }
          } catch (InterruptedException ie) {
          }
        }
      }
    }

So, as far as I can conclude from the logs, DFSClient entered this while loop and is constantly outputting:

LOG.info("Could not complete file " + src + " retrying...");

For some reason the file is never completed (the NameNode has no complete call in its logs, probably due to some network issue), but completeFile should throw an IOException once this condition is fulfilled:

    if (!clientRunning || (hdfsTimeout > 0 && localstart + hdfsTimeout < System.currentTimeMillis()))

By default hdfsTimeout is set to -1 and the client is running, so the piece of code that throws the exception is never executed. The code that sets hdfsTimeout in Client looks like this:

    final public static int getTimeout(Configuration conf) {
      if (!conf.getBoolean("ipc.client.ping", true)) {
        return getPingInterval(conf);
      }
      return -1;
    }
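
As a quick sanity check of that logic, here is a minimal sketch (assuming a Hadoop 0.20.x Configuration on the classpath and the standard ipc.ping.interval default of 60000 ms) that prints what Client.getTimeout returns with the ping enabled and disabled:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.ipc.Client;

    public class PingTimeoutCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Default: ipc.client.ping is true, so getTimeout returns -1
        // and completeFile() has no deadline to hit.
        System.out.println("ping enabled,  timeout = " + Client.getTimeout(conf));

        // With ping disabled, getTimeout returns the ping interval
        // (60000 ms unless ipc.ping.interval is overridden).
        conf.setBoolean("ipc.client.ping", false);
        System.out.println("ping disabled, timeout = " + Client.getTimeout(conf));
      }
    }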

I tried to read more about setting ping to false and found HADOOP-6099. I will try to play with disabling ping, but it’s hard because I can’t reproduce the issue.
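
If disabling the ping turns out to help, the flip itself is simple on the client side. Here is a minimal sketch (the path and the write are hypothetical, and I’m assuming DFSClient picks hdfsTimeout up from the same Configuration passed to FileSystem.get) of a write with ipc.client.ping set to false:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithoutPing {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable the IPC ping so Client.getTimeout returns a finite value
        // instead of -1, giving completeFile() a deadline.
        conf.setBoolean("ipc.client.ping", false);

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/ping-test.txt"));
        try {
          out.writeUTF("hello");
        } finally {
          // close() ends up in completeFile(); with a timeout it should
          // eventually throw IOException instead of looping forever.
          out.close();
        }
      }
    }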
