Error handling in GraphQL

When we began work in earnest on the Audacious platform, picking GraphQL for the external API boundary layer was a no-brainer. Having come through the Dark Ages of trying to consume RESTful APIs with reasonable efficiency into various state management frameworks atop React (?sideload=category&embed=author, anyone?), we found GraphQL to be one of those seminal ideas that leave you wondering "why haven't we been doing it this way all along?"

However, as it turns out, while GraphQL provides something that so many prior standards had failed miserably at – a fully defined, strongly typed API that isn't a WS-*[1] level of hell to work with – it still, sadly, isn't a silver bullet. A recent GraphQL meetup hosted by our friends over at Replicated was living proof of this. Virtually everyone in attendance had seen "the light," but all were battling the same design issues when it came to practical implementation in the wild.

Going into the meetup, there was one issue I was most curious about in everyone else's implementations. Interestingly enough, this issue stood out in that there seemed to be no real consensus on a canonical, let alone a good, solution to it: error handling in mutations. So, I thought it would be interesting to share the design problem and the solution we ultimately arrived at for the Audacious API, which deals with fairly complex data input for mutations.

Now, before we dive into the issue, it's useful to establish a distinction between errors. At a high level, we can split the types of errors that occur in the interaction with an API into two: system errors and domain errors. System errors are those that occur on either the client or server end, such as an exception being thrown, or in the interaction between the two, such as a developer mistyping an argument name. Domain errors, on the other hand, are errors caused by the provided data somehow violating a domain constraint, such as user email addresses needing to be unique. The key semantic difference between the two is that, as an end user, there really is no corrective action you can take in the presence of a system error, while there likely is one in the case of a domain error. Trying to remember your old password after being told your email address is already in use is usually a good starting point.

System errors

If you've spent any time working with GraphQL, you've probably realized that the result of a GraphQL mutation or query consists of two parts: data and errors. If you were to mistype an argument name while trying to add a comment through GitHub's excellent GraphQL API, you'd receive an error along these lines:

{
  "data": null,
  "errors": [{
    "message": "InputObject 'AddCommentInput' doesn't accept argument 'bdoy'",
    "locations": [{
      "line": 2,
      "column": 39
    }]
  }]
}

Fair enough, we've made a typo. In fact, virtually all GraphQL implementations give really good descriptions of system errors like these, which makes developing against a GraphQL API a real joy. However, if a system error actually occurs in production, you would likely just log the message and show the end user a more generic error such as "An unexpected error has occurred. We apologize for screwing this up for you." This is honestly perfectly fine, as the user has no real use for (or interest in) the more detailed error. So far, so good.

The domain error problem

But what about domain errors, then? What if a required field is empty, an email is not unique, or some other domain violation occurs? We would obviously want to be able to show the end user a more descriptive error message. A simple solution, then, is to have the mutation return an error with a message to be shown to the end user:

{
  "data": {
    "createUser": null
  },
  "errors": [{
    "message": "That email address is already in use",
    "locations": [{
      "line": 5,
      "column": 8
    }]
  }]
}

This is perfectly fine for the simplest APIs where the amount of input data is limited. However, as the complexity of mutations grows, we would ideally be able to provide feedback about specific parts of the data. Say a mutation takes a list of people with their email addresses as its input; which email address, then, is the source of our troubles? And how do we distinguish between system errors like the ones above and domain errors? Worse still, what if we were to execute two mutations at the same time? How would we know which mutation the error message belonged to without manually parsing the query string?

One participant at the aforementioned meetup admitted that they had resorted to concatenating errors with a prefix and subsequently parsing the error message string in an attempt to resolve this issue. This is obviously the polar opposite of the spirit of the otherwise well-defined GraphQL specification.

The root of the problem is that the error section of the GraphQL specification seems to have been designed solely for the system error case. This is especially clear in that the official GraphQL specification only defines the message and locations properties for an error, with the former being human-readable and the latter referring to a location in the query string, not a logical location in the query.

However, while the specification only explicitly lists those two properties, it does not actually limit the number of properties that can be contained in an error. GitHub's API makes good use of this, for example, when performing an operation on a node that doesn't exist:

{
  "data": {
    "addComment": null
  },
  "errors": [{
    "message": "Could not resolve to a node with the global id of 'horse'",
    "type": "NOT_FOUND",
    "path": ["addComment"],
    "locations": [{
      "line": 2,
      "column": 3
    }]
  }]
}

As an API consumer, I can now easily discern the above error from system errors by the presence of the type parameter, and I can determine for which mutation the error occurred by the path parameter.
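On the consuming side, that check can be sketched in a few lines of Python. This is only an illustration, not GitHub's client code: the function name `split_errors` is hypothetical, and the sample response is contrived, combining the two GitHub errors shown earlier into one payload so both branches are exercised. It assumes nothing beyond the response shape above, where a domain error carries a `type` property and a `path`.

```python
from typing import Any, Dict, List, Tuple


def split_errors(response: Dict[str, Any]) -> Tuple[List[dict], List[dict]]:
    """Split a GraphQL response's errors into (domain, system) lists.

    Assumption: an error carrying a "type" property is a domain error,
    as in the GitHub example; everything else is treated as a system error.
    """
    domain: List[dict] = []
    system: List[dict] = []
    for error in response.get("errors") or []:
        (domain if "type" in error else system).append(error)
    return domain, system


# Contrived payload: a real server would not return both errors at once.
response = {
    "data": {"addComment": None},
    "errors": [
        {
            "message": "Could not resolve to a node with the global id of 'horse'",
            "type": "NOT_FOUND",
            "path": ["addComment"],
            "locations": [{"line": 2, "column": 3}],
        },
        {
            "message": "InputObject 'AddCommentInput' doesn't accept argument 'bdoy'",
            "locations": [{"line": 2, "column": 39}],
        },
    ],
}

domain_errors, system_errors = split_errors(response)

# The first path element names the mutation the error belongs to.
failed_mutations = [error["path"][0] for error in domain_errors]
```

With two mutations in one request, the same lookup on `path` tells them apart without touching the query string.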

Domain errors in the Audacious API

The solution employed by GitHub is not unique to GitHub. Graphcool – now Prisma – employs a similar design, albeit using a numeric error code. Common to both of them, however, is that they do not describe errors with regard to particular parts of the input.

To fully solve this problem, we opted to add another property to the errors that can be returned from the Audacious API: fields.

{
  "data": {
    "addContact": null
  },
  "errors": [{
    "message": "One or more fields contain invalid data",
    "code": "DATA",
    "path": ["addContact"],
    "fields": [{
      "path": ["firstName"],
      "code": "REQUIRED"
    }, {
      "path": ["emailAddresses", 1],
      "code": "INVALID"
    }]
  }]
}

fields is a list of the individual fields in the input for which there is an error, each with a path relative to the mutation and an error code. Now, when an error is returned from our GraphQL API, if it contains either a code or a set of fields, we know that we're dealing with a domain error and can display errors to the end user appropriately. You'll also notice that we decided not to include the optional locations property, as we consider it irrelevant to the use of the error.
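To show how a client might consume such an error, here is a minimal Python sketch that turns the fields list into per-field display messages. The function name `field_messages`, the `MESSAGES` table and its wording are hypothetical; the API itself only returns the codes, and a real client would supply its own (likely localized) strings.

```python
from typing import Any, Dict, Tuple

# Hypothetical display strings keyed by the field-level error code.
MESSAGES = {
    "REQUIRED": "This field is required",
    "INVALID": "This value is not valid",
}


def field_messages(error: Dict[str, Any]) -> Dict[Tuple, str]:
    """Map each field path in a domain error to a display message."""
    result: Dict[Tuple, str] = {}
    for field in error.get("fields", []):
        # Lists are unhashable, so the path is converted to a tuple key.
        key = tuple(field["path"])
        result[key] = MESSAGES.get(field["code"], "This value was rejected")
    return result


# The error object from the response above.
error = {
    "message": "One or more fields contain invalid data",
    "code": "DATA",
    "path": ["addContact"],
    "fields": [
        {"path": ["firstName"], "code": "REQUIRED"},
        {"path": ["emailAddresses", 1], "code": "INVALID"},
    ],
}

messages = field_messages(error)
```

A form can then look up `("emailAddresses", 1)` to flag the second email address specifically, rather than the whole list.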

Implementing this was fairly straightforward, with our backend being developed in Python and the GraphQL schema defined and executed with Graphene, although at the time of writing we have to run a patched version of the unreleased graphql-core library to get path information. The implementation required declaring a DomainError exception type that accepts the code and fields along with the message, and overriding graphql.error.format_error to produce the desired output when returning the error to the client. The analog in other languages and implementations should be equally straightforward; a great example is the fairly simple apollo-errors module for Apollo server.
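The shape of that implementation can be sketched roughly as follows. This is not our production code, and it deliberately avoids importing graphql-core so that it runs on its own: `LocatedError` is a hypothetical stand-in for whatever wrapper the executor puts around resolver exceptions, and the `original_error` and `path` attribute names are assumptions about that wrapper. Wiring the formatter into Graphene or graphql-core is not shown.

```python
from typing import Any, Dict, List, Optional


class DomainError(Exception):
    """Raised by a resolver when the input violates a domain constraint."""

    def __init__(self, message: str, code: str,
                 fields: Optional[List[dict]] = None) -> None:
        super().__init__(message)
        self.message = message
        self.code = code
        self.fields = fields or []


class LocatedError(Exception):
    """Hypothetical stand-in for the executor's wrapper around an exception."""

    def __init__(self, original_error: Exception, path: List[Any]) -> None:
        super().__init__(str(original_error))
        self.original_error = original_error
        self.path = path


def format_error(error: Exception) -> Dict[str, Any]:
    """Build the dict returned to the client for a single error.

    Assumption: wrapped errors expose `original_error` and `path`;
    plain exceptions are formatted with just their message.
    """
    original = getattr(error, "original_error", None) or error
    formatted: Dict[str, Any] = {"message": str(original)}
    path = getattr(error, "path", None)
    if path is not None:
        formatted["path"] = list(path)
    if isinstance(original, DomainError):
        # Only domain errors get the extra properties clients key off.
        formatted["code"] = original.code
        if original.fields:
            formatted["fields"] = original.fields
    return formatted


failure = DomainError(
    "One or more fields contain invalid data",
    code="DATA",
    fields=[{"path": ["firstName"], "code": "REQUIRED"}],
)
formatted = format_error(LocatedError(failure, ["addContact"]))
```

Keeping the domain-specific properties behind an `isinstance` check means an unexpected exception never leaks a `code` or `fields` entry, so clients can keep treating their presence as the marker of a domain error.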

The ideal world

We've settled on this design for now, but it still doesn't feel perfect. First of all, there is a slight chance that the GraphQL specification will eventually incorporate a more standardized version of what is described above, in which case there is no guarantee that the semantic meaning or, worse still, the type of a property we have added to errors will not conflict with the specification.

While that is a risk we're willing to take, given that the widespread use of this practice already makes it hard for the specification to justify conflicting changes, the bigger issue is that it doesn't quite feel GraphQL-esque. Here we have a whole data schema that's fully defined, yet our errors are not. What the perfect solution to this looks like is still up for debate. When designing a specification like GraphQL, there is a very fine line between general applicability and "opinionated," as we've come to call designs tailored more for one environment than another. That being said, having worked with IDLs like Thrift in the past, I do wish we could also fully describe our errors in GraphQL.

For now, though, the great many benefits offered by GraphQL far offset the little bit of trouble we have to go through to get properly descriptive errors in place. I honestly feel like the design we arrived at solved this quite cleanly, and we'll likely be happy with it for a while.


  1. WS-* or WS-Deathstar is a colloquial reference to the myriad Web Services standards that enterprise bodies dreamt up to make Web systems interoperable. Needless to say, they didn't succeed even one iota in this mission. A whole generation of developers still get nervous tics when the acronyms SOAP or WSDL are trotted out. ↩︎
