Be careful – Java SimpleDateFormat is not always symmetric

In my job, we often have to work with free-text data storage, where our customers, and the customers of our customers, look directly at the only copy of the data that we depend on in our business logic, encoded as a string. We make pretty heavy use of configurable grammars, and our structured data often comes with reversible encoders/decoders to render it as a human-readable string (in 11 different languages).

For Java date conversion we make heavy use of Java’s SimpleDateFormat – as implemented in the Sun JDK 1.6 – which has proved pretty robust in the past. Thus I was surprised when I started seeing ParseExceptions in my unit tests, especially since the strings being parsed had originally been produced by SimpleDateFormat itself.

With SimpleDateFormat, you initialise the class with a string pattern, then use dateFormat.format(Date) to produce the encoded string, or dateFormat.parse(String) to parse a String back to a Date object. The pattern you choose may be as simple as yyyy, in which case dateFormat.format(Date) would produce something like 2013 from today’s date. When you call dateFormat.parse(String) on 2013, obviously the rest of the date data is lost. By default dateFormat fills the unknown fields of the resulting date with the Date default values, i.e. midnight on 1970-1-1 in your timezone. So if you formatted a date in 2013 with this pattern, calling parse with the same formatter would correctly give you back 2013-1-1 00:00 – any day, time or timezone information in the original date would be lost.
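A minimal sketch of that lossy round trip. The class and method names are mine, and the fixed UTC timezone and Locale.ROOT are assumptions made only to keep the output repeatable.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class YearOnlyRoundTrip {
    // Fixed zone: an assumption for repeatable output, not part of the original setup.
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Builds a Date in UTC; month is 1-based here for readability.
    static Date utcDate(int year, int month, int day, int hour, int minute) {
        Calendar cal = new GregorianCalendar(UTC, Locale.ROOT);
        cal.clear();
        cal.set(year, month - 1, day, hour, minute);
        return cal.getTime();
    }

    // Formats with "yyyy", parses the result back, and renders the parsed
    // date in full so that the lost fields become visible.
    static String roundTrip(Date original) {
        SimpleDateFormat yearOnly = new SimpleDateFormat("yyyy", Locale.ROOT);
        yearOnly.setTimeZone(UTC);
        SimpleDateFormat full = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.ROOT);
        full.setTimeZone(UTC);
        try {
            String encoded = yearOnly.format(original); // only the year is written
            Date decoded = yearOnly.parse(encoded);     // other fields fall back to defaults
            return full.format(decoded);
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 15 June 2013, 21:30 comes back as the first minute of 2013.
        System.out.println(roundTrip(utcDate(2013, 6, 15, 21, 30)));
    }
}
```

The round trip succeeds, it just is not lossless, which is the expected behaviour for a pattern that only carries the year.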

In this case, I wanted to print the departure times of Amtrak trains, using the standard travel agent date format, hmma. [1] So for a train departing at 9:30, travelAgentDate.format(..) will produce 930am. For 21:30, format will produce 930pm. And for 22:59, format will produce 1059pm. But when I call travelAgentDate.parse("1059pm"), I get a ParseException. Inspecting the source code, it’s easy to see why.
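The behaviour can be reproduced with a few lines. This is a sketch rather than our production code: the helper names are made up, and I am assuming lower-case am/pm markers set through DateFormatSymbols, plus a fixed UTC timezone and Locale.ROOT for repeatability.

```java
import java.text.DateFormatSymbols;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class HmmaAsymmetry {
    // Fixed zone and locale are assumptions made for repeatable output.
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Builds an "hmma" formatter that writes lower-case am/pm markers.
    static SimpleDateFormat travelAgentDate() {
        SimpleDateFormat format = new SimpleDateFormat("hmma", Locale.ROOT);
        DateFormatSymbols symbols = new DateFormatSymbols(Locale.ROOT);
        symbols.setAmPmStrings(new String[] {"am", "pm"});
        format.setDateFormatSymbols(symbols);
        format.setTimeZone(UTC);
        return format;
    }

    // Returns a Date on 1970-01-01 (UTC) at the given 24-hour time.
    static Date timeOfDay(int hour, int minute) {
        Calendar cal = new GregorianCalendar(UTC, Locale.ROOT);
        cal.clear();
        cal.set(1970, Calendar.JANUARY, 1, hour, minute);
        return cal.getTime();
    }

    static String encode(int hour, int minute) {
        return travelAgentDate().format(timeOfDay(hour, minute));
    }

    // True if the formatter accepts the text, false if it throws ParseException.
    static boolean parses(String text) {
        try {
            travelAgentDate().parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int[][] times = {{9, 30}, {21, 30}, {22, 59}};
        for (int[] t : times) {
            String encoded = encode(t[0], t[1]);
            System.out.println(encoded + " parses back: " + parses(encoded));
        }
    }
}
```

The two single-digit hours should parse back, while 1059pm should be rejected by the very formatter that produced it.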

The parse(..) method here parses from left to right, generally the most efficient but less robust way to parse text. When it parses the hour using h, it peeks ahead at the next element of the pattern. If that element is a literal delimiter, it reads digits until it reaches the delimiter. If it is another field, in this case the mm minute variable, it reads only as many digits as there are pattern letters and assumes everything after that belongs to the next field. For 1059pm that means h takes the 1, mm takes 05, and the am/pm marker is then expected where the 9 is. So if I had used a format like h.mma or even hhmma I would have been safe, but the variable-length h without a trailing delimiter confuses the parse method, even though the format method can produce such a string. Too bad.
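To compare the patterns mentioned above, here is a small sketch that formats every minute of a day and checks whether the result parses back to the same instant. The class name, the lower-case am/pm symbols, UTC and Locale.ROOT are all my own assumptions for the sake of a repeatable check.

```java
import java.text.DateFormatSymbols;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class PatternSymmetryCheck {
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Builds a formatter for the pattern with lower-case am/pm markers.
    static SimpleDateFormat formatter(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        DateFormatSymbols symbols = new DateFormatSymbols(Locale.ROOT);
        symbols.setAmPmStrings(new String[] {"am", "pm"});
        format.setDateFormatSymbols(symbols);
        format.setTimeZone(UTC);
        return format;
    }

    // Counts the minutes of one day for which format followed by parse
    // either throws or returns a different instant.
    static int countAsymmetricMinutes(String pattern) {
        SimpleDateFormat format = formatter(pattern);
        Calendar cal = new GregorianCalendar(UTC, Locale.ROOT);
        int failures = 0;
        for (int minuteOfDay = 0; minuteOfDay < 24 * 60; minuteOfDay++) {
            cal.clear();
            cal.set(1970, Calendar.JANUARY, 1, minuteOfDay / 60, minuteOfDay % 60);
            Date original = cal.getTime();
            try {
                Date decoded = format.parse(format.format(original));
                if (!decoded.equals(original)) {
                    failures++;
                }
            } catch (ParseException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        for (String pattern : new String[] {"hmma", "hhmma", "h.mma"}) {
            System.out.println(pattern + ": " + countAsymmetricMinutes(pattern)
                    + " of 1440 minutes do not round-trip");
        }
    }
}
```

I would expect failures only for hmma, and only for the hours that print with two digits (10, 11 and 12).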

Not having the option of changing the format (travel agencies are pretty set in their ways), and not feeling too enthused about writing my own RTL parser for this one edge case, I began to look for workarounds. After a coffee break and a chat with a colleague, we decided that we should set the parser to a format that it could use symmetrically, and then massage its input and output into that format. The two options we came up with were:

  • Use hhmma format in the parser, trim the leading 0’s after encoding it to a string, and when decoding, pad it back with 0’s to get to the right length before calling parse
  • Use h:mma format in the parser (hh:mma would also write a leading 0 that then has to be trimmed), remove the : after encoding it to a string, and when decoding, re-inject a : 4 characters from the end.

I plumped for the first option, mostly because I already had the toolkit there to pad and trim the numbers. The efficiency gain of remaining LTR with no read-ahead is not relevant in this scenario but does sound nice, although we agreed that the second option would actually be slightly more robust.
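The first option can be sketched roughly as follows. This is not the production code, which sits behind more wrapping. The class and method names are hypothetical, and the sketch assumes two-letter lower-case am/pm markers, so that the unpadded string is five characters long exactly when the hour has one digit.

```java
import java.text.DateFormatSymbols;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class PaddedTravelAgentTime {
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Builds a formatter for the pattern with lower-case am/pm markers.
    static SimpleDateFormat newFormat(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        DateFormatSymbols symbols = new DateFormatSymbols(Locale.ROOT);
        symbols.setAmPmStrings(new String[] {"am", "pm"});
        format.setDateFormatSymbols(symbols);
        format.setTimeZone(UTC);
        return format;
    }

    // Encodes with the symmetric "hhmma" pattern, then trims the leading 0.
    static String encode(Date time) {
        String padded = newFormat("hhmma").format(time); // e.g. "0930am"
        return padded.startsWith("0") ? padded.substring(1) : padded;
    }

    // Pads a five-character input back to six characters before parsing.
    static Date decode(String text) throws ParseException {
        String padded = text.length() == 5 ? "0" + text : text;
        return newFormat("hhmma").parse(padded);
    }

    // Encodes the given 24-hour time, decodes it again, and reports both.
    static String roundTrip(int hour, int minute) {
        Calendar cal = new GregorianCalendar(UTC, Locale.ROOT);
        cal.clear();
        cal.set(1970, Calendar.JANUARY, 1, hour, minute);
        String encoded = encode(cal.getTime());
        try {
            return encoded + " -> " + newFormat("HH:mm").format(decode(encoded));
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(9, 30));
        System.out.println(roundTrip(22, 59));
    }
}
```

Since hh only ever writes 01 to 12, there is at most one leading 0 to trim or restore.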

Anyway, the code is in production now with no complaints yet, and another lesson learned about the oddities of the Sun JDK!

[1] I lied about the travel agency date format above in order to keep things simple. If the time is on the hour, travel agents will skip the minute part, and simply say 10pm for 22:00. This means that the parser itself is actually supplied by a factory depending on the minute in the time (or when decoding, the length of the input string). To further complicate matters, they also forgo the m in am and pm, instead just using 10a or 10p. Neither of those details is relevant to the case at hand, except for the fact that it meant we were already wrapping SimpleDateFormat extensively to get the desired results.
