Monday, December 17, 2012

The Traps of String.split in Java

I just stumbled across a few peculiarities in the split method of java.lang.String.

We'll start with a simple example:
    // needs: import java.util.Arrays;
    String value = "A:B";
    String[] parts = value.split(":");
    System.out.println("Parts has " + parts.length + " elements: " + Arrays.asList(parts));
No surprises here. We get the output:
    Parts has 2 elements: [A, B]

Now let's try the empty string:
    String value = "";
    String[] parts = value.split(":");
    ...
Since there is no ":" in the empty string, the whole input string is returned as the single element. The list only looks empty because that one element is the empty string:
    Parts has 1 elements: []
 


This may easily lead you to conclude that the returned array always contains at least one element, since split() returns the input if no delimiter is found. But that is not the case:
    String value = ":";
    String[] parts = value.split(":");
    ...
This will give:

    Parts has 0 elements: []

So even though our input string is now longer, the output array has fewer elements. How come? Well, if we split ":" along ":", we get ["", ""], but split() does not include trailing empty strings by default. Here both elements are trailing empty strings, so the array returned is []. This seems counterintuitive to me.
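To make the "trailing" part concrete, here is a small sketch. The class name TrailingEmptyDemo and the helper parts() are made up for this illustration. It shows that only empty strings at the end are dropped, while empty strings at the start or in the middle are kept:

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original example.
class TrailingEmptyDemo {
    // One-argument split(): trailing empty strings are removed from the result.
    static String[] parts(String value) {
        return value.split(":");
    }

    public static void main(String[] args) {
        // ":A"   -> leading empty string is kept      (2 elements)
        // "A::B" -> empty string in the middle is kept (3 elements)
        // "A:"   -> trailing empty string is dropped   (1 element)
        // "::"   -> everything is trailing and empty   (0 elements)
        for (String value : new String[] {":A", "A::B", "A:", "::"}) {
            String[] parts = parts(value);
            System.out.println("\"" + value + "\" gives " + parts.length
                    + " elements: " + Arrays.asList(parts));
        }
    }
}
```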

The solution is to use the two-parameter version of split and pass a negative number as the second parameter (the limit):
    String value = ":";
    String[] parts = value.split(":", -1);
    ...
which will give:
    Parts has 2 elements: [, ]
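As a rough sketch of how the limit parameter changes the result, the following compares a limit of 0 (which behaves like the one-parameter version) with a negative limit. The class LimitDemo and its count() helper are names I made up for this example:

```java
// Hypothetical demo class, not part of the original example.
class LimitDemo {
    // Number of elements split() returns for the given limit.
    static int count(String value, int limit) {
        return value.split(":", limit).length;
    }

    public static void main(String[] args) {
        String value = "A::B::";
        // limit 0: same as split(":"), trailing empty strings are dropped
        System.out.println("limit  0: " + count(value, 0) + " elements");
        // negative limit: trailing empty strings are kept
        System.out.println("limit -1: " + count(value, -1) + " elements");
    }
}
```

With the input "A::B::" the first call reports 3 elements (A, an empty string, B) and the second reports 5, because the two trailing empty strings are kept.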

With that in mind, we can easily predict what the following code will give us:
    String value = "|";
    String[] parts = value.split("|");
    ...

Since this looks just like the ":"-example above, it must give us an empty array, right?
Wrong. It gives us:
    Parts has 2 elements: [, |]
Whoa, what has happened? The parameter passed to split is actually treated as a regular expression, and "|" is the or-operator of regular expressions. So we try to split the input string at every occurrence of "nothing or nothing", which boils down to splitting the input string before every character. That gives us an array containing an empty string and the pipe character. (This is the behavior up to Java 7. Since Java 8, a zero-width match at the beginning of the input no longer produces a leading empty string, so the same call returns [|].)
If you do:
    String value = "abc";
    String[] parts = value.split("|");
you'll get (on Java 8 and later the leading empty string is dropped, leaving [a, b, c]):
    Parts has 4 elements: [, a, b, c]
If you really want to split at pipe characters, you have to use split("\\|"), since the backslash escapes the regular expression meaning of the pipe character.
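If you'd rather not escape by hand, java.util.regex.Pattern.quote() turns any string into a regular expression that matches it literally. A minimal sketch, where the class PipeSplitDemo and its two helper methods are names invented for this example:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original example.
class PipeSplitDemo {
    // Escape the pipe manually with a backslash.
    static String[] splitEscaped(String value) {
        return value.split("\\|");
    }

    // Let Pattern.quote() build a literal pattern for the pipe.
    static String[] splitQuoted(String value) {
        return value.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String value = "a|b|c";
        System.out.println("escaped: " + Arrays.asList(splitEscaped(value)));
        System.out.println("quoted:  " + Arrays.asList(splitQuoted(value)));
    }
}
```

Both variants split "a|b|c" into [a, b, c]. Pattern.quote() is handy when the delimiter comes from a variable and you don't know which special characters it contains.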

To summarize, there are two caveats:
  • Use the two-parameter version of split() with a negative limit if you want to keep trailing empty strings (and thus never get an empty array).
  • Escape characters with special meaning in regular expressions (e.g. . * + | [ ] ^ $)
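
Both caveats can be wrapped up in one small helper. This is only a sketch of how I would do it; the class SplitUtil and the method splitLiteral() are hypothetical names, not something the JDK provides:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical utility class, not part of the JDK.
class SplitUtil {
    // Treats the delimiter as plain text (no regex meaning)
    // and keeps trailing empty strings via the negative limit.
    static String[] splitLiteral(String value, String delimiter) {
        return value.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        for (String value : new String[] {"A|B", "|", ""}) {
            String[] parts = splitLiteral(value, "|");
            System.out.println("Parts has " + parts.length + " elements: "
                    + Arrays.asList(parts));
        }
    }
}
```

With this helper, "A|B" gives two elements, "|" gives two empty strings, and the empty string gives one empty string, so the result always has one element more than there are delimiters in the input.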