Recently, I was asked for my opinion on JSON-API as a potential standard for RESTful APIs. While I like the idea of some standardization for RESTful JSON responses, I feel that JSON API woefully misses the mark, and here’s why.

Let’s take the simple example from JSON API’s site

{
  "data": [{
    "type": "articles",
    "id": "1",
    "attributes": {
      "title": "JSON API paints my bikeshed!",
      "body": "The shortest article. Ever.",
      "created": "2015-05-22T14:56:29.000Z",
      "updated": "2015-05-22T14:56:28.000Z"
    },
    "relationships": {
      "author": {
        "data": { "id": "42", "type": "people" }
      }
    }
  }],
  "included": [{
    "type": "people",
    "id": "42",
    "attributes": {
      "name": "John",
      "age": 80,
      "gender": "male"
    }
  }]
}

and let’s take a look at how that could be briefly rewritten as regular JSON to achieve the same functionality


{
  "articles" : [{
      "id": 1,
      "title": "JSON API paints my bikeshed!",
      "body": "The shortest article. Ever.",
      "created": "2015-05-22T14:56:29.000Z",
      "updated": "2015-05-22T14:56:28.000Z",
      "author" : {
          "id" : 42,
          "name": "John",
          "age": 80,
          "gender": "male"
      }
 }]
}

Now, let’s give a usage example of the JSON API response above, to print all the article titles and author names, which I would imagine is a typical use case for data that looks like this.

for(var data : response.data){
  if(data.type == "articles"){
    print data.attributes.title;
    var author_id = data.relationships.author.data.id;
    var author_type = data.relationships.author.data.type;
  }
}
for(var inc : response.included){
  if(inc.id == author_id && inc.type == author_type){
    print inc.attributes.name;
  }
}

compared to the regular JSON, where you can print the variables simply by writing

for(var article : response.articles){
  print article.title;
  print article.author.name;
}

JSON API is the obvious poor choice here. We can see

  • 3x the implementation cost of regular JSON (12 lines vs 4 lines)
  • An O(n) scan through included to resolve the author vs an O(1) property access, with no temporary variables required
  • Increased code complexity, poor readability and higher maintenance cost vs regular JSON

Ultimately, I get the impression that JSON-API’s approach is trying to treat JSON as a message-passing format, but in doing so it misses the biggest advantage of JSON, which is that it’s an Object Notation.

The structures in JSON – numbers, booleans, strings, keys and objects – can be read natively by any modern, object-oriented programming language. A well-designed JSON response can be parsed directly into an object model by the receiving application at low cost, and then used directly by that program's business logic. This is well understood by the creators of JSON Schema and OpenAPI (formerly Swagger), which effectively add type-safe behavior to JSON's dynamic-by-default constructs.
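
To make that concrete, here's a minimal sketch – assuming Jackson as the binder, with class and field names of my own invention – showing the regular JSON response above mapping straight onto an object model:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

public class ParseExample {

  // Plain classes mirroring the "regular JSON" shape above.
  public static class Author { public int id; public String name; public int age; public String gender; }
  public static class Article {
    public int id; public String title; public String body;
    public String created; public String updated; public Author author;
  }
  public static class Response { public List<Article> articles; }

  public static void main(String[] args) throws Exception {
    String json = "{\"articles\":[{\"id\":1,\"title\":\"JSON API paints my bikeshed!\","
        + "\"body\":\"The shortest article. Ever.\",\"created\":\"2015-05-22T14:56:29.000Z\","
        + "\"updated\":\"2015-05-22T14:56:28.000Z\","
        + "\"author\":{\"id\":42,\"name\":\"John\",\"age\":80,\"gender\":\"male\"}}]}";

    Response response = new ObjectMapper().readValue(json, Response.class);
    for (Article article : response.articles) {
      System.out.println(article.title);       // JSON API paints my bikeshed!
      System.out.println(article.author.name); // John
    }
  }
}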

Thus, if you want to describe hypermedia in JSON, you should do so in context, the same way we do on the web.
Hyperlinks on the web are identified by convention – usually by underlining or text color. A similar approach is valid for JSON responses. Hypermedia is already quite identifiable in a JSON response, because it's a string starting with http:// ;) Machines can find it via the url type in a corresponding OpenAPI specification. The benefit of reinforcing it with conventions such as keywords (like href, used by the HAL specification) or a prefix (such as a leading underscore in the field name, e.g. _author) is that it adds clarity for anyone reading the code that handles the response model.
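
For instance (the field name and URL here are purely illustrative), an article could expose its author as an immediately recognisable link:

{
  "articles" : [{
    "id": 1,
    "title": "JSON API paints my bikeshed!",
    "_author": "http://example.com/people/42"
  }]
}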

I think this is the type of clarity we should be aiming for when designing standards for RESTful JSON APIs.

We need tools to help programmers name things!

Phil Karlton’s statement

There are only two hard things in Computer Science: cache invalidation and naming things.

has been immortalized in programming literature from Martin Fowler to The Codeless Code, largely because it’s true.

Cache invalidation is a well-understood problem. Students learn about it in college. Tools exist to help you figure out when to invalidate data in your cache. Naming things, on the other hand, is rarely touched on by programming courses (there are exceptions to the rule), and help is hard to find beyond basic guidance in some blog posts or PowerPoint presentations.

Even when naming is discussed, advice is usually limited to the basic conventions of a language – methods with certain names carry certain expectations (getX, setX, equals, etc.) – or to the use of nouns and verbs. Discussion of the real issue – what are the best words to describe this new functionality to my audience? – is shockingly rare.

It would be easy to give a hand-waving response to this question, like

Consider your audience. Writing code for a wide audience of programmers will limit you to using generally understood concepts and vocabulary. Writing specialist code for domain experts will allow you to use more precise and specialized words. What is the vocabulary that those interacting with your code are likely to understand?

Sure – this is great general advice, but it doesn't help me with today's problem: What do I call this variable that can contain a single date or a date range of up to 4 days?

This is a solvable general problem. Probably easier than cache invalidation.

I’m imagining a tool that’s something between Stack Overflow and Urban Dictionary. Programmers can submit words that they’ve used in their own applications, along with their meanings, and some examples of their usage in publicly available code or APIs.

Words could be ranked between general and specific, vague and precise, knowing that different words will occupy different parts of the spectrum. For example value is both general and vague, currency code is general and precise, Fare Basis is specific and precise. Developers should be able to vote words up or down depending on their experience with using them. The ideal result? A reusable corpus of defined names that programmers find useful, and a common vocabulary that would permeate a variety of programs across domains, languages and applications.

(Now I just need to find a few days to get time to code it!)

Migrating from Play Framework v2.2 to Activator v2.3

I really like the Play Framework, but they're not shy about making changes when moving to new major versions. Migrating an existing app from v2.2.x to v2.3.x can be a painful process – I've done it twice now – and there are many pitfalls along the way. Looking through articles as I go suggests I'm not the only one suffering.

It’s a good idea to read the documentation at https://www.playframework.com/documentation/2.4.x/Migration23

Here's a summary of the changes you'll need to make to move a Java project

Before you begin…

Make sure you commit your current version, and do a `play eclipse` in the old version before you start to migrate if you want IDE support. Once you start the migration, you won't be able to do this.

build.sbt

Replace

play.Project.playJavaSettings

with

lazy val root = (project in file(".")).enablePlugins(PlayJava).enablePlugins(SbtWeb)

in your build.sbt. You can leave out the SbtWeb plugin if you don’t use Play’s templating language.

If you use external APIs, you’ll also need to add javaWs to your library dependencies. Mine now looks like this:

libraryDependencies ++= Seq(
  javaJdbc,
  javaEbean,
  cache,
  javaWs
)

If your project uses LESS, you now have to explicitly indicate that you want LESS files to be compiled. This line in build.sbt will cause all .less assets to be included in the compilation

includeFilter in (Assets, LessKeys.less) := "*.less"

.java files

The WS package name has also changed from upper case to lower case, which is easy to miss in your imports. Do a search and replace on all your .java files, changing

import play.libs.WS.

to

import play.libs.ws.

Note the trailing . at the end of the line!

plugins.sbt

CoffeeScript and LESS are no longer included by default. If you use them, they need to be included in your plugins.sbt file. Annoyingly, you won't get an error if you fail to include the LESS component – your site will just look bad.

Add these lines to your plugins.sbt

addSbtPlugin("com.typesafe.sbt" % "sbt-coffeescript" % "1.0.0")

addSbtPlugin("com.typesafe.sbt" % "sbt-less" % "1.0.1")

Note the blank line between the two plug-ins. That’s not an accident, it’s required.

While you're here, you should also bump your Play sbt plugin version

addSbtPlugin("com.typesafe.play" % "sbt-plugin" % "2.3.3")

build.properties

Finally, upgrade your sbt version to match the new Play release. Set

sbt.version=0.13.5

in your build.properties

Check it out

That’s it. Run

activator clean update start

to compile and run your new version!

 

Gah! Joda DateTimeBuilder is not always symmetric either

In a previous post I complained that Java Date Format was not symmetric.

Well, it turns out that Joda-Time (which we pair with Java 7) is also not necessarily symmetric… even when the formatter is not lossy!

Try it and see (for this to work, your system's default time zone must not be UTC!)

DateTime now = new DateTime(DateTimeZone.UTC).withMillis(0);
String nowText = ISODateTimeFormat.dateTimeNoMillis().print(now);
DateTime then = ISODateTimeFormat.dateTimeNoMillis().parseDateTime(nowText);
assertEquals(now, then);

and your test will fail! It turns out that reparsing the date loses the time zone information: even though the printed nowText string is correct, the reparsing doesn't initialize then with the time zone found in the text. According to Joda, this behaviour is considered buggy but won't be changed for historical reasons (sounds like every other Java Date/Calendar package?), though it would be useful if this were clearly stated up front in the documentation. Calling

DateTime then = DateTime.parse(nowText);

will give the correct result… which looks obvious when written here, but is not necessarily so obvious when you're deep in your debugger.
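
Here's a small, self-contained sketch of the behaviour (the withOffsetParsed() and isEqual() calls are the workarounds I'd reach for, and I'm assuming the printed Z maps back to DateTimeZone.UTC):

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.ISODateTimeFormat;

public class JodaRoundTrip {
  public static void main(String[] args) {
    DateTime now = new DateTime(DateTimeZone.UTC).withMillis(0);
    String nowText = ISODateTimeFormat.dateTimeNoMillis().print(now);

    // Parsed back in the JVM's default zone: same instant, different zone.
    DateTime then = ISODateTimeFormat.dateTimeNoMillis().parseDateTime(nowText);
    System.out.println(now.equals(then));  // false on a non-UTC machine: equals() also compares the chronology/zone
    System.out.println(now.isEqual(then)); // true: isEqual() compares only the instant

    // Asking the formatter to keep the offset it parsed restores symmetry.
    DateTime kept = ISODateTimeFormat.dateTimeNoMillis().withOffsetParsed().parseDateTime(nowText);
    System.out.println(now.equals(kept));  // true, assuming "Z" maps back to DateTimeZone.UTC
  }
}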

So, another new DateTimeCalendar package for Java 9 anyone?🙂

The Importance of Using Real Data when Developing a Proof-of-Concept

When developing data-driven software, there’s a constant tension between anonymity and usefulness.

On one hand, some level of anonymity is required when using any data set to protect sensitive customer and commercial information. On the other, increasingly obfuscating data reduces its usefulness for information discovery.

When demonstrating a proof-of-concept data-driven application, it is typical to use generated, fake data. This has the obvious advantage of completely protecting customer anonymity, satisfying the commercial folks. It also placates project managers, by replacing the time required to obfuscate the data (an unknown quantity) with the time required to generate fake data (a predictable quantity). But as part of the software development team, you should avoid letting this happen.

Why?

When replacing the real with a fake, it's important to remember that the best case for this generated, fake data is that it looks and smells entirely like real data – and that this best case is unachievable. To generate completely accurate fake data, you would need a complete understanding of the domain being analysed, which is obviously an impossible demand. So generally the benchmark is set at generating "believable data."

Believable to whom?

Well, to anyone you want to sell it to, I imagine. But this is a dangerous demand. You can never know whether your data is believable until it's too late – in the same way that you can never know whether your network security is good enough until it's too late – and even then, only if someone raises an alarm.

Your team probably has good domain knowledge, and you put that knowledge into your generated data, creating what you believe to be realistic scenarios and believable correlations. You show it to the executives in your own company, whose domain knowledge is unlikely to be as strong as your team's.
Their reaction is enthusiastic – they are surprised by correlations in your data and this helps generate a positive impression of your product.

So you go and show it to potential customers, who gather executives and experts for your presentation, often putting hundreds of years of domain expertise in a single room. Of course, many of the correlations they expect to see in your fake data don't exist, because you didn't think to create them, generating immediate suspicion of your product. If you're lucky, they'll understand that the issue is due to your generated data, or at least question why they're not seeing what they expect. If you're unlucky, they will simply use it to dismiss your product, using the missing information to reinforce the reasons why they don't need software to support their jobs, or to argue that such tools should really be developed in-house. In any of these cases, you've just devalued your potential new product.

But why couldn’t this happen with obfuscated data?

In real data, patterns such as locations, times or names could be used by competitors (to whom you're going to show this proof-of-concept) to make a reasonable guess at which of your customers the data belongs to (argues the commercial team, who don't want to risk your customers' data). So your obfuscations must ensure that these elements of the data are hidden well enough that this can't happen – knowing, of course, that these domain experts might spot some pattern you missed and use it to infer the underlying customer data in any case.

During the obfuscation process it’s important to remember that while many patterns will be diligently preserved, many will be necessarily destroyed, reducing the selection of available correlations in the data. But won’t this obfuscation process – of unknown duration – reduce the quality of the data to such a point that it would be less convincing to clients than generated data (argues the product manager who wants this software to be finished ‘yesterday’)? That’s unlikely.

Data generation is an additive process, but obfuscation is a subtractive one.

When the elements to be subtracted are sensitive commercial data, it’s likely that more time and due care will be allocated to creating the data set – which is the key component of any data-driven software – than if the data was generated, because commercially sensitive data would otherwise be at risk. And because those people doing the subtraction are data-driven domain experts themselves, they will make sure to preserve the correlations they would have otherwise placed in their generated data, as best they can.

Meanwhile, because the process is subtractive rather than additive, unseen correlations in the original real dataset may live on, retaining the ability to surprise both the development team and potential customers alike. And finally, using obfuscated data, you can easily explain in advance that some correlations were removed during the obfuscation process. This means that if your potential client doesn't see a correlation they're expecting, they're more likely to blame it on the necessary obfuscation, or simply ask about it, than to silently blame it on your incompetence.

The data must be surprising

Remember that it's the undiscovered surprises in the data that drive the development of data-driven software in the first place. It's important for development teams to remember that understanding what would otherwise be surprises in the data is what makes someone a domain expert. The only way to gain this expertise is to work on real data, to the maximum extent possible.

So argue as hard as you can to never generate fake data… and if you do have to fake it, work with real data until the last possible moment.

Be careful – Java SimpleDateFormat is not always symmetric

In my job, we often have to work with free-text data storage, where our customers, and the customers of our customers, will be looking directly at the only copy of the same data that we're depending on in our business logic, encoded as a string. We make pretty heavy use of configurable grammars, and our structured data often comes with reversible encoders/decoders to render it as a human-readable string (in 11 different languages).

For Java date conversion we make heavy use of Java's SimpleDateFormat – as implemented in the Sun JDK 1.6 – which has proved pretty robust in the past. So I was surprised when I started seeing ParseExceptions in my unit tests, especially since the dates being parsed had originally been produced by SimpleDateFormat itself.

With SimpleDateFormat, you initialise the class with a string pattern, then use dateFormat.format(Date) to produce the encoded string, or dateFormat.parse(String) to parse a String back into a Date object. The pattern you choose may be as simple as yyyy – in which case dateFormat.format(Date) would produce something like 2013 from today's date. When you call dateFormat.parse(String) on 2013, obviously the rest of the date data has already been lost. By default, dateFormat will fill the unknown fields of the resulting date with the Date default values, i.e. midnight 1970-1-1 in your timezone. If you formatted today's date as 2013 with this example, calling parse with the same formatter would correctly give you back 2013-1-1 00:00 – any day, time or timezone information in the original date would be lost.
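
A quick sketch of that yyyy round trip (class name mine):

import java.text.SimpleDateFormat;
import java.util.Date;

public class YearOnlyRoundTrip {
  public static void main(String[] args) throws Exception {
    SimpleDateFormat yearOnly = new SimpleDateFormat("yyyy");
    String encoded = yearOnly.format(new Date()); // e.g. "2013"
    Date decoded = yearOnly.parse(encoded);       // Jan 1st of that year, 00:00 in the default timezone
    System.out.println(encoded + " -> " + decoded);
  }
}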

In this case, I wanted to print the time of departure of Amtrak trains, using the standard travel agent date format, hmma. [1] So for a train departing at 9:30, travelAgentDate.format(..) will produce 930am. For 21:30, format will produce 930pm. And for 22:59, format will produce 1059pm. But when I call travelAgentDate.parse("1059pm"), I get a ParseException. Inspecting the source code, it's easy to see why.

The parse(..) method here parses from left to right, generally the most efficient but least robust way to parse text. When it tries to parse the hour using h, if there's a delimiter in the pattern it will peek ahead to see if the delimiter is the next character; but if the next character is numeric, it will simply assume it belongs to the next variable, in this case the mm minute variable. So if I had used a format like h.mma or even hhmma I would have been safe, but the variable-length h string without a trailing delimiter confuses the parse method, even though the format method happily produces such a string. Too bad.
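
Here's a minimal sketch that reproduces the round-trip failure described above (class name mine):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

public class HmmaRoundTrip {
  public static void main(String[] args) {
    SimpleDateFormat travelAgentDate = new SimpleDateFormat("hmma", Locale.US);

    Calendar departure = Calendar.getInstance();
    departure.set(Calendar.HOUR_OF_DAY, 22);
    departure.set(Calendar.MINUTE, 59);

    String encoded = travelAgentDate.format(departure.getTime());
    System.out.println(encoded); // "1059PM"

    try {
      travelAgentDate.parse(encoded); // h grabs "1", mm grabs "05", leaving "9PM" for the am/pm marker
    } catch (ParseException e) {
      System.out.println("Not symmetric: " + e.getMessage());
    }
  }
}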

Not having the option of changing the format (travel agencies are pretty set in their ways), and not feeling too enthused about writing my own RTL parser for this one edge case, I began to look for workarounds. After a coffee break and a chat with  a colleague, we decided that we should set the parser to a format that it could use symmetrically, and then massage the resulting input-output into that format. The two options we came up with were

  • Use hhmma format in the parser, trim the leading 0’s after encoding it to a string, and when decoding, pad it back with 0’s to get to the right length before calling parse
  • Use hh:mma format in the parser, remove the : after encoding it to a string, and when decoding, re-inject a : 4 characters from the end.

I plumped for the first option, mostly because I already had the toolkit there to pad and trim the numbers – the efficiency gain of remaining LTR with no read-ahead is not relevant in this scenario, but it does sound nice – although we agreed that the second option would actually be slightly more robust.
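
For reference, here's a rough sketch of that first option (names are mine, and it leaves out the extra wrapping described in the footnote):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class PaddedTravelAgentTime {

  // The fixed-width hh pattern round-trips safely, because the parser
  // always consumes exactly two digits for the hour.
  private static final SimpleDateFormat FORMAT = new SimpleDateFormat("hhmma", Locale.US);

  static String encode(Date date) {
    String padded = FORMAT.format(date); // e.g. "0930AM" or "1059PM"
    return padded.startsWith("0") ? padded.substring(1) : padded; // trim to "930AM"
  }

  static Date decode(String text) throws ParseException {
    String padded = text.length() == 5 ? "0" + text : text; // pad back to "0930AM"
    return FORMAT.parse(padded);
  }
}

(SimpleDateFormat isn't thread-safe, so a shared formatter like this would need guarding in real code.)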

Anyway, the code is in production now with no complaints yet, and another lesson learned about the oddities of the Sun JVM!

[1] I lied about the travel agency date format above in order to keep things simple. If the time is on the hour, travel agents will skip the minute part and simply say 22:00 -> 10pm. This means that the parser itself is actually provided by a factory, depending on the minute in the time (or, when decoding, the length of the input string). To further complicate matters, they also forgo the m in am and pm, instead just using 10a or 10p. Neither of those details is relevant to the case at hand, except for the fact that they meant we were already wrapping SimpleDateFormat extensively to get the desired results.

Innovation starts with good APIs

(Written on September 21st, but only got around to publishing now)

I've spent much of the last two weeks, along with several of my colleagues, developing an entry for tHACK SFO 2012. I love these hack competitions – they can sometimes seem like a waste of time, but they build solid technical skills, offer an opportunity to play with some new APIs or tools, and drive that overused buzz-word, innovation.

The travel industry has an enormous variety of APIs – some are great, but most of them stink – and the APIs made available to us for tHACK really came from all ends of that spectrum. (Well, actually, to be fair, there were no real stinkers!)

One remarkable point that struck me while coding was how much easier it is to include a great API in an application than a poor one. We wrote our prototype as a Facebook app – I'd already used Facebook's API before and I knew it was good, but writing the prototype was a pointed reminder of how much better it is than any in the travel industry. Accessing the features of Facebook within our app was simple, the requests and responses were easy to understand, and the results were fast and reliable.

This made the contrast with many of the competition-offered APIs brutally stark. Many travel industry APIs are intensely frustrating to use – and it was evident during the presentation of the hacks which ones were easy to use, and which ones were trouble. There were 9 APIs provided, and 18 hacks. Of the 9 APIs provided, only 4 were used by more than one of the hacks. Of the remaining 5, 2 were used by independent competitors, 2 were used only in hacks published by the same company as published the API, and one was not used at all (1).

Here were some of the issues I encountered with those APIs – take it as a list of red flags to be noted when releasing a new API

  • Hard to connect. SOAP wrappers are hard to use and rarely necessary these days – if your API is not RESTful, there had better be a good reason for it.
  • Login to the API can be a tricky problem, but it should not require a persistent data store on my behalf. In particular, many APIs make a meal of OAuth. I know that OAuth is a difficult paradigm, but with a bit of thought and understanding, a sensible implementation is possible.
  • If your API doesn't send or receive in either XML or JSON, forget it. Simple tools are available to serialize/deserialize both XML and JSON as constructs in virtually any programming language. Almost every other data transfer mechanism will require serialization by hand, a massive time sink.
  • Your API should send and receive in the same data format. Sounds obvious, but it's often ignored. If replies are in JSON, requests should also be in JSON.
  • Sensible data models with few surprises. A hallmark of an excellent API is rarely having to refer to the documentation. Interface objects should be sensibly named and constrained. Encoded data should use clearly named enumerations, ideally with a full-text translation, not integer codes.
  • Less chit-chat. Saving data, for example, should be a one-call operation – no excuses. For a simple flow, I should be able to send or receive all the data I need in no more than 3 API calls. More calls mean that you're passing your orchestration and performance problems on to me.
  • If a given API call is not fast, it should be asynchronous. Give me an answer, or give me a token.

Providing language SDKs is not enough! Not everyone wants to program in your programming language of choice. In any case, pushing the hard work into the API means that clients of that API don't have to implement those features individually, or in a variety of SDKs.

The proof was in the results. With only a couple of exceptions, the quality of the entries in tHACK SFO was uniformly high. (Check out iPlanTrip or Concierge for starters!) However, the judges gave no rewards to the users of difficult APIs: all the winning and highly commended hacks were based on the 5 easy APIs, and all of them used either Geckgo or Vayant, which struck me as the two APIs that best combined content and ease of use. Those who took the time to figure out and use the more complex APIs had less time to produce high-quality integrations.

Finally, a shout out to Vayant, the Amadeus Hotel Team and Rezgo, whose APIs helped us produce a slick and innovative app – hopefully the people at Amadeus will let us publish tVote for you sooner rather than later.

(1) These figures come from my recollection of the event the following day. If you have official figures to confirm them, that would be nice.