Thunderbird, RNP, and the Importance of a Good API

By Neal | May 6, 2021

I was recently talking to a Thunderbird developer about API design. In the course of that conversation, I expressed concerns about RNP, the new OpenPGP implementation that Thunderbird has recently started using in place of GnuPG. That person, skeptical about my assertion that RNP’s API needs improvement, asked “Isn’t it subjective what a better API is?” I’d agree that we don’t yet have good metrics to evaluate an API. But I disagree that we can’t judge APIs at all. In fact, I suspect most experienced programmers know a bad API when they see it. Further, I think we can come up with some good heuristics, which I’ll try to do based on my experience working on and with GnuPG, Sequoia, and RNP. Then, I’ll take a look at RNP’s API. Unfortunately, RNP’s API is not only easy to misuse, but also misleading, and, as such, shouldn’t yet be used in a safety-critical context. Yet Thunderbird is relied on by vulnerable people like journalists, activists, lawyers, and their communication partners, who need this protection. For me, this means that Thunderbird should reevaluate its decision to use RNP.

Note: please also see this related mail, Let’s Use GPL Libraries in Thunderbird!, which I sent to Thunderbird’s Planning Mailing List.

What Makes a Bad API?

Prior to starting the Sequoia project with Justus and Kai, the three of us worked together on GnuPG. In addition to hacking on gpg, we also spoke to and collaborated with a lot of gpg’s downstream users. People had a lot of good things to say about GnuPG.

Figure: Thumbnail from an interview with Benjamin Ismaïl from Reporters without Borders. The subtitle says: '... to protect communication with journalists'

Two criticisms of gpg’s API stood out for us. The first criticism can be distilled down to: gpg’s API is too opinionated. For instance, gpg has a keyring-centric approach. This means that it is only possible to use or examine an OpenPGP certificate if it has been imported into the keyring. But some developers only want to import a certificate after they’ve examined it. For instance, when looking up a certificate on a key server by fingerprint, it is possible to check that the returned certificate is the right one, because the URL is self-authenticating. It is possible to do this with gpg, but it requires working around gpg’s programming model. The basic idea is the following: create a temporary directory, add a configuration file, tell gpg to use the alternate directory, import the certificate there, examine the certificate, and clean up the temporary directory. That’s the official suggestion, which Justus added based on our conversations with gpg’s downstream users. Yes, it works. But the approach requires operating system-specific code, is slow, and is error-prone.

The other criticism that I heard repeatedly is that using gpg requires a lot of arcane knowledge to avoid misusing it. Or, put differently, one has to be extremely careful when using gpg’s API to not inadvertently introduce a vulnerability.

To better understand this second concern, consider the EFAIL vulnerabilities. The basic problem is around gpg’s decryption API: when decrypting a message, gpg emits the plaintext even if the input has been corrupted. gpg does return an error in that case, but some programs display the corrupted plaintext anyway. Because, why not? Surely showing part of the message is better than nothing, right? Well, the EFAIL vulnerabilities demonstrate how an attacker can use this to insert a web bug into an encrypted message, and when the user views the message, the web bug exfiltrates the message. Ouch.

So, who’s responsible for the bug? The GnuPG developers argued that the applications used gpg wrong:

MUAs are advised to consider the DECRYPTION_FAILED status code and not to show the data or at least use a proper way to display the possible corrupted mail without creating an oracle and to inform the user that the mail is fishy.

gpg signaled an error; the applications didn’t adhere to the API contract. I have to agree with the GnuPG developers, and add: gpg’s interface was (and remains) a disaster waiting to happen, because it doesn’t guide the user to do the right thing. On the contrary, the easy, seemingly helpful thing is the wrong thing to do. And, this type of API is unfortunately common in GnuPG.
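To make the contrast concrete, here is a minimal sketch of a decryption-style interface where the easy thing is also the safe thing: the caller receives either the complete, checked payload or an error, never both. Everything in it is an assumption made for illustration. The names seal, open, and IntegrityError are invented, and the one-byte checksum merely stands in for a real integrity check. It is not how gpg or Sequoia are implemented.

```rust
/// Returned when a message fails its integrity check. It carries no plaintext.
#[derive(Debug, PartialEq)]
pub struct IntegrityError;

/// Toy checksum standing in for a real integrity tag. NOT cryptography.
fn checksum(data: &[u8]) -> u8 {
    data.iter().fold(0u8, |acc, b| acc.wrapping_add(*b))
}

/// Builds a toy "protected" message: the payload followed by its checksum.
pub fn seal(payload: &[u8]) -> Vec<u8> {
    let mut msg = payload.to_vec();
    msg.push(checksum(payload));
    msg
}

/// Hands out the payload only after the whole message has been checked.
/// On failure the caller gets an error and nothing else to display.
pub fn open(msg: &[u8]) -> Result<Vec<u8>, IntegrityError> {
    let (tag, payload) = msg.split_last().ok_or(IntegrityError)?;
    if checksum(payload) != *tag {
        return Err(IntegrityError);
    }
    Ok(payload.to_vec())
}
```

With this shape, a program that ignores the error has no corrupted plaintext to show. The cost is that the whole message has to be buffered before anything is released.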

What Makes a Good API?

These two realizations (that gpg’s API is too opinionated, and that it is hard to use right) were formative for me. When we started the Sequoia project, we agreed that we wanted to avoid making similar mistakes. Based on these observations, we adopted two tests that we continue to use to guide the development of Sequoia’s API. First, in addition to any high-level API, there should be a low-level API that is unopinionated in the sense that it doesn’t prevent the user from doing anything legitimate. Second, an API should guide the user to do the right (opinionated) thing by making the right thing the easy and obvious thing to do.

To realize these two, slightly conflicting goals of enabling everything, but preventing mistakes, we leaned on two tools in particular: types, and examples. Types make it hard to use an object in an inappropriate way by formalizing the API contract at compile time, and even forcing particular transformations. And, examples—code snippets—will be copied. So, good examples will not only teach users how to use a function correctly, but strongly influence how they use it.
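As a small illustration of a type “forcing particular transformations”, consider the keyring scenario from earlier: a certificate fetched by fingerprint should be checked before it is imported. The sketch below encodes that order in the types. All names (UncheckedCert, CheckedCert, Keyring) are hypothetical and are not part of Sequoia or gpg. The stored fingerprint string is also a simplification, since a real implementation would compute the fingerprint from the primary key.

```rust
/// A certificate as fetched from an untrusted source (hypothetical type).
pub struct UncheckedCert {
    fingerprint: String, // simplification: really derived from the primary key
    data: Vec<u8>,
}

/// A certificate whose fingerprint matched what the caller asked for.
/// Code outside this module can only obtain one via `UncheckedCert::check`.
pub struct CheckedCert {
    inner: UncheckedCert,
}

impl UncheckedCert {
    pub fn new(fingerprint: &str, data: Vec<u8>) -> Self {
        UncheckedCert { fingerprint: fingerprint.to_string(), data }
    }

    /// Consumes the unchecked certificate and hands it back on a mismatch.
    pub fn check(self, expected: &str) -> Result<CheckedCert, UncheckedCert> {
        if self.fingerprint.eq_ignore_ascii_case(expected) {
            Ok(CheckedCert { inner: self })
        } else {
            Err(self)
        }
    }
}

impl CheckedCert {
    /// Raw certificate bytes, available once the check has passed.
    pub fn data(&self) -> &[u8] {
        &self.inner.data
    }
}

#[derive(Default)]
pub struct Keyring {
    certs: Vec<CheckedCert>,
}

impl Keyring {
    /// Accepts only `CheckedCert`, so skipping the check is a compile error.
    pub fn import(&mut self, cert: CheckedCert) {
        self.certs.push(cert);
    }

    pub fn len(&self) -> usize {
        self.certs.len()
    }
}
```

The unchecked certificate can still be examined freely, which keeps the low-level path open. Only the import step insists on the check.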

Types

I want to present an example of how we use types in Sequoia to help us make a good API. To understand the example, a tiny bit of background knowledge about OpenPGP is useful.

Figure: A simple OpenPGP Certificate. At the top is the primary key; below it are two subkeys and a User ID, each bound to the primary key by a binding signature.

There are several fundamental data types in OpenPGP. Three are: Certificates, components such as keys and User IDs, and Binding Signatures. The root of a certificate is the primary key, which fully determines a certificate’s fingerprint (fingerprint = Hash(primary key)). A certificate usually includes components like subkeys and User IDs. OpenPGP binds a component to the certificate using a so-called binding signature. Making the fingerprint just the hash of the primary key and using signatures to bind the components to the primary key means that it is possible to add additional components later. Binding signatures also include properties. This makes it possible to change a component, e.g., to extend a subkey’s expiration. A consequence of this is that there can be multiple valid signatures associated with a given component. Binding signatures are not only fundamental, but also an integral part of OpenPGP’s security.

Because there can be multiple valid binding signatures, we need a way to choose the right one. As a first approximation, the right signature is the latest, non-expired, non-revoked, valid signature, which was not created in the future. But what is a valid signature? In Sequoia, the signature does not only need to check out mathematically, it needs to be consistent with a policy. For instance, due to its compromised collision resistance, we only allow SHA-1 in a very limited set of circumstances. (Paul Schaub, who works on PGPainless, recently wrote about these complexities in detail.) Forcing the user of the API to keep all of these concerns in mind invites vulnerabilities. In Sequoia, the easy way to get the expiration time is the safe way. Consider this code, which does the right thing:

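use std::str::FromStr;

use chrono::{DateTime, Utc};
use sequoia_openpgp::cert::prelude::*;
use sequoia_openpgp::policy::StandardPolicy;

// Assumption: CERT is a &str holding the certificate, and this code runs
// inside a function that returns a Result.
let p = &StandardPolicy::new();

let cert = Cert::from_str(CERT)?;
// Passing None as the time evaluates the policy as of now.
for k in cert.with_policy(p, None)?.keys().subkeys() {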
    println!("Key {}: expiry: {}",
             k.fingerprint(),
             if let Some(t) = k.key_expiration_time() {
                 DateTime::<Utc>::from(t).to_rfc3339()
             } else {
                 "never".into()
             });
}

cert is a certificate. We start by applying a policy to it. (Policies are user definable, but normally the StandardPolicy is not only sufficient, but most appropriate.) This effectively creates a view of the certificate where only components with a valid binding signature are visible. Importantly, it also modifies and exposes a number of new methods. The keys method, for instance, has been modified to return a ValidKeyAmalgamation instead of a KeyAmalgamation. (It’s an amalgamation, because it includes not only the Key, but also any associated signatures; some people thought Katamari would have been a better name. ¯\_(ツ)_/¯) A ValidKeyAmalgamation has a valid binding signature according to the above criteria. And, it exposes methods like key_expiration_time, which only make sense on a valid key! Also note: key_expiration_time’s return type is ergonomic. Instead of returning the raw value, key_expiration_time returns a SystemTime, which is safe and easy to work with.

Consistent with our first principle of enabling everything, a developer could still access the individual signatures and examine the subpackets to get the key’s expiration time from a different binding signature. But, compared with the right way to get the key’s expiration time using Sequoia’s API, they would have to go out of their way to do it differently. In our opinion that’s a good API.
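The selection rule and the “valid view” idea described above can be sketched with plain standard-library types. This is a deliberately simplified model, not Sequoia’s implementation. The types BindingSig, Subkey, and ValidSubkey are hypothetical, timestamps are seconds since the epoch, revocation is reduced to a flag, and the verifies field stands in for both the cryptographic check and the policy check.

```rust
/// Simplified binding signature (hypothetical; real ones carry much more).
pub struct BindingSig {
    pub created: u64,              // signature creation time
    pub sig_expires: Option<u64>,  // when the signature itself expires
    pub revoked: bool,             // simplification of a revocation signature
    pub verifies: bool,            // stands in for crypto check + policy check
    pub key_validity: Option<u64>, // key lifetime in seconds after key creation
}

pub struct Subkey {
    pub created: u64,
    pub sigs: Vec<BindingSig>,
}

/// A subkey together with the binding signature chosen for it.
/// Expiration can only be queried through this type.
pub struct ValidSubkey<'a> {
    key: &'a Subkey,
    sig: &'a BindingSig,
}

impl Subkey {
    /// Picks the latest binding signature that verifies, is not revoked,
    /// is not expired, and was not created after `now`.
    pub fn valid_at(&self, now: u64) -> Option<ValidSubkey<'_>> {
        self.sigs
            .iter()
            .filter(|s| s.verifies && !s.revoked)
            .filter(|s| s.created <= now)
            .filter(|s| s.sig_expires.map_or(true, |e| now < e))
            .max_by_key(|s| s.created)
            .map(|sig| ValidSubkey { key: self, sig })
    }
}

impl ValidSubkey<'_> {
    /// `None` means the key never expires.
    pub fn key_expiration_time(&self) -> Option<u64> {
        self.sig.key_validity.map(|v| self.key.created + v)
    }
}
```

Because key_expiration_time exists only on ValidSubkey, a caller cannot ask for an expiration time without first going through the selection step.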

Examples

We released v1.0 of the Sequoia library in December of 2020. Nine months prior to that, we were feature complete and ready to release. But, we waited. We spent the following nine months adding documentation and examples to the public API. Take a look at the documentation for the Cert data structure to see an example of the results. As described in the blog post, we didn’t quite manage to provide an example for every function, but we did get pretty far. And, as a side effect of writing the examples, we identified some rough spots, which we polished.

Since the release, we’ve had contact with a number of developers who have integrated Sequoia into their code. A common refrain is how helpful the documentation and examples are. And, we can confirm: even though it is our own code, we reference the documentation almost every day, and copy our own examples. It’s just easier. And, since the examples show how to correctly use the function, why redo the work from scratch?

RNP’s API

RNP is a young OpenPGP implementation developed primarily by Ribose. About two years ago, Thunderbird decided to integrate Enigmail into Thunderbird and simultaneously replace GnuPG with RNP. That Thunderbird has selected RNP is not only an endorsement of RNP, but it also means that RNP has become perhaps the most used OpenPGP implementation for mail encryption.

A critique can easily be interpreted as being negative. I want to be absolutely clear that I think the work that Ribose is doing is good and important, and I am thankful that they are investing time and resources into a new OpenPGP implementation. The OpenPGP ecosystem desperately needs more diversity. But, that is not an excuse to use an immature product in a safety-critical context.

Safety-Critical Infrastructure

Unfortunately, RNP is not yet at a point where I think it can be safely deployed. Enigmail was used not only by people worried about their privacy, but also by journalists, activists, and lawyers who are worried about their safety and the safety of their communication partners. In an interview with Benjamin Ismaïl, the head of the Asia-Pacific office at Reporters without Borders, in 2017, he said:

We primarily use GPG to freely communicate with our sources. The information they give us about human rights and the violations that they are subjected to are sensitive information, and it is necessary for them to protect their conversations.

Interview with Benjamin Ismaïl from Reporters without Borders

As such, it is essential that Thunderbird continue to provide these users with the safest experience possible even during this transition period.

RNP and Subkey Binding Signatures

When talking about how we use types in Sequoia to make it harder to misuse the API, I showed how to get a key’s expiration time in a few lines of code. I want to start by showing how someone who isn’t an OpenPGP or RNP expert might implement the same functionality using RNP. The following code iterates over a certificate’s (key) subkeys and prints each subkey’s expiration time. Recall: the expiration time is stored on the subkey’s binding signature, and a value of 0 means the key does not expire.

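/* Assumed context: `key` is the certificate's primary key handle, `sk_count`
   its number of subkeys, `desc[i]` a label for test subkey i, and `err` an
   rnp_result_t. PRIu32 comes from <inttypes.h>. */
int i;
for (i = 0; i < sk_count; i ++) {
  rnp_key_handle_t sk;
  err = rnp_key_get_subkey_at(key, i, &sk);
  if (err) {
    printf("rnp_key_get_subkey_at(%d): %x\n", i, err);
    return 1;
  }

  /* Note: nothing here checks the subkey's binding signature. */
  uint32_t expiration_time;
  err = rnp_key_get_expiration(sk, &expiration_time);
  if (err) {
    printf("#%d (%s). rnp_key_get_expiration: %x\n",
           i + 1, desc[i], err);
  } else {
    printf("#%d (%s) expires %"PRIu32" seconds after key's creation time.\n",
           i + 1, desc[i],
           expiration_time);
  }

  /* Release the subkey handle before moving on. */
  rnp_key_handle_destroy(sk);
}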

I tested this code against a certificate with five subkeys. The first subkey has a valid binding signature, and doesn’t expire; the second has a valid binding signature, and expires in the future; the third has a valid binding signature, and is already expired; the fourth has an invalid binding signature, which says that the subkey expires in the future; and, the fifth does not have a binding signature at all. Here’s the output:

#1 (doesn't expire) expires 0 seconds after key's creation time.
#2 (expires) expires 94670781 seconds after key's creation time.
#3 (expired) expires 86400 seconds after key's creation time.
#4 (invalid sig) expires 0 seconds after key's creation time.
#5 (no sig) expires 0 seconds after key's creation time.

The first thing to notice is that the call to rnp_key_get_expiration succeeds whether the subkey has a valid binding signature, has an invalid binding signature, or has no binding signature at all! Given the documentation, this behavior is a bit surprising. It says:

Get the key's expiration time in seconds.
Note: 0 means that the key doesn't expire.

Since the key’s expiration time is stored on the binding signature, I, an OpenPGP expert, understand this to mean that the call to rnp_key_get_expiration would only succeed if the subkey has a valid binding signature. Instead, it appears that if there is no valid binding signature, the function simply defaults to 0, which, given the note, the user of the API would justifiably interpret as meaning the key doesn’t expire.
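The ambiguity is easy to see in a small model. The sketch below is not RNP’s source code. It only models the behavior observed in the output above, and all of the names (SubkeyBinding, Expiration, BindingError, and both functions) are invented for illustration. One function collapses every problem into 0, the other keeps “never expires” apart from “no usable binding signature”.

```rust
/// What we know about a subkey's binding signature (hypothetical model).
pub enum SubkeyBinding {
    Missing,
    Invalid { claimed_expiration: u32 },
    Valid { expiration: u32 }, // 0 means "does not expire"
}

#[derive(Debug, PartialEq)]
pub enum Expiration {
    Never,
    After(u32), // seconds after the key's creation time
}

#[derive(Debug, PartialEq)]
pub enum BindingError {
    Missing,
    Invalid,
}

/// Models the observed behavior: anything without a valid binding
/// signature turns into 0, which reads as "never expires".
pub fn expiration_collapsed(b: &SubkeyBinding) -> u32 {
    match b {
        SubkeyBinding::Valid { expiration } => *expiration,
        _ => 0,
    }
}

/// Keeps the three situations apart, so the caller has to handle them.
pub fn expiration_checked(b: &SubkeyBinding) -> Result<Expiration, BindingError> {
    match b {
        SubkeyBinding::Missing => Err(BindingError::Missing),
        SubkeyBinding::Invalid { .. } => Err(BindingError::Invalid),
        SubkeyBinding::Valid { expiration: 0 } => Ok(Expiration::Never),
        SubkeyBinding::Valid { expiration } => Ok(Expiration::After(*expiration)),
    }
}
```

With the second signature, a caller who forgets about invalid bindings gets a compile-time nudge instead of a plausible-looking 0.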

To improve this code, it is necessary to first check whether the key has a valid binding signature. Some functions to do this were recently added to RNP to address CVE-2021-23991. In particular, the RNP developers added the function rnp_key_is_valid to return whether a key is valid. This addition is an improvement, but it requires the developer to opt in to these safety-critical checks, rather than opt out of them, as they would with Sequoia. Since safety checks are non-functional, they are easy to forget: the code appears to work even if the safety check is forgotten. And since knowing what to check requires expert knowledge, they will be forgotten.

The following code includes the safety check and skips any keys that rnp_key_is_valid considers to be invalid:

/* Same assumed context as before: `key`, `sk_count`, `desc[]`, and `err`
   are set up by the surrounding program. */
int i;
for (i = 0; i < sk_count; i ++) {
  rnp_key_handle_t sk;
  err = rnp_key_get_subkey_at(key, i, &sk);
  if (err) {
    printf("rnp_key_get_subkey_at(%d): %x\n", i, err);
    return 1;
  }

  /* Opt-in safety check: ask RNP whether it considers the subkey valid. */
  bool is_valid = false;
  err = rnp_key_is_valid(sk, &is_valid);
  if (err) {
    printf("rnp_key_is_valid: %x\n", err);
    return 1;
  }

  if (! is_valid) {
    printf("#%d (%s) is invalid, skipping.\n",
           i + 1, desc[i]);
    rnp_key_handle_destroy(sk);
    continue;
  }

  uint32_t expiration_time;
  err = rnp_key_get_expiration(sk, &expiration_time);
  if (err) {
    printf("#%d (%s). rnp_key_get_expiration: %x\n",
           i + 1, desc[i], err);
  } else {
    printf("#%d (%s) expires %"PRIu32" seconds after key's creation time.\n",
           i + 1, desc[i],
           expiration_time);
  }

  /* Release the subkey handle before moving on. */
  rnp_key_handle_destroy(sk);
}

The output is:

#1 (doesn't expire) expires 0 seconds after key's creation time.
#2 (expires) expires 94670781 seconds after key's creation time.
#3 (expired) is invalid, skipping.
#4 (invalid sig) is invalid, skipping.
#5 (no sig) is invalid, skipping.

The code correctly skips the two keys that don’t have a valid binding signature, but it also skips the expired key. That is probably not what we want, although the documentation does warn us that this function “checks … expiration times”.

Although there are cases where we don’t want to use a key or certificate if it is expired, sometimes we do. For instance, if a user forgets to extend a subkey’s expiration time, they should be able to see that the subkey is expired when examining the certificate, and be able to extend the expiration. Although gpg --list-keys doesn’t show expired keys, when editing a certificate, it does show subkeys that are expired so the user can extend their expiry:

$ gpg --edit-key 93D3A2B8DF67CE4B674999B807A5D8589F2492F9
Secret key is available.

sec  ed25519/07A5D8589F2492F9
     created: 2021-04-26  expires: 2024-04-26  usage: C   
     trust: unknown       validity: unknown
ssb  ed25519/1E2F512A0FE99515
     created: 2021-04-27  expires: never       usage: S   
ssb  cv25519/8CDDC2BC5EEB61A3
     created: 2021-04-26  expires: 2024-04-26  usage: E   
ssb  ed25519/142D550E6E6DF02E
     created: 2021-04-26  expired: 2021-04-27  usage: S   
[ unknown] (1). Alice <alice@example.org>

There are other situations where an expired key shouldn’t be considered invalid. For instance, let’s say Alice sends Bob a signed message: “I will pay you 100 Euros in a year,” and the signing key expires in six months. When the year is over, does Alice owe Bob the money on the basis of the signature? I’d say yes. The signature was valid when it was made. The fact that the key expired is irrelevant. Of course, once a key has expired, signatures made after the expiration should be treated as invalid. Likewise a message should not be encrypted with an expired key.
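These two cases differ only in which point in time the key is judged against. A minimal sketch, assuming the signature’s creation time can be trusted and using invented names (KeyLifetime, signature_acceptable, may_encrypt_to) with times in seconds since the epoch:

```rust
/// A key's lifetime; `expires == None` means the key never expires.
pub struct KeyLifetime {
    pub created: u64,
    pub expires: Option<u64>,
}

impl KeyLifetime {
    /// True if the key exists and has not expired at time `t`.
    pub fn alive_at(&self, t: u64) -> bool {
        self.created <= t && self.expires.map_or(true, |e| t < e)
    }
}

/// A signature is judged against the key's state when the signature was made.
pub fn signature_acceptable(key: &KeyLifetime, sig_created: u64) -> bool {
    key.alive_at(sig_created)
}

/// Encryption is judged against the key's state right now.
pub fn may_encrypt_to(key: &KeyLifetime, now: u64) -> bool {
    key.alive_at(now)
}
```

In Alice’s case, the signature made before the key expired stays acceptable a year later, while the same key is no longer a valid encryption target.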

In short, whether a key should be considered valid is highly dependent on the context. rnp_key_is_valid is better than nothing, but, despite its name, it isn’t sufficiently nuanced to generally determine whether a key is valid.

The same commit introduced a second function, rnp_key_valid_till. This function returns “the timestamp till which the key can be considered as valid… If the key was never valid then a zero value will be [returned].” We can use this function to determine whether a key was ever valid by checking whether this function returns a non-zero value:

/* Same assumed context as before: `key`, `sk_count`, `desc[]`, and `err`
   are set up by the surrounding program. */
int i;
for (i = 0; i < sk_count; i ++) {
  rnp_key_handle_t sk;
  err = rnp_key_get_subkey_at(key, i, &sk);
  if (err) {
    printf("rnp_key_get_subkey_at(%d): %x\n", i, err);
    return 1;
  }

  /* A result of 0 means the key was never valid. */
  uint32_t valid_till;
  err = rnp_key_valid_till(sk, &valid_till);
  if (err) {
    printf("rnp_key_valid_till: %x\n", err);
    return 1;
  }

  printf("#%d (%s) valid till %"PRIu32" seconds after epoch; ",
         i + 1, desc[i], valid_till);

  if (valid_till == 0) {
    printf("invalid, skipping.\n");
    rnp_key_handle_destroy(sk);
    continue;
  }

  uint32_t expiration_time;
  err = rnp_key_get_expiration(sk, &expiration_time);
  if (err) {
    printf("rnp_key_get_expiration: %x\n", err);
  } else {
    printf("expires %"PRIu32" seconds after key's creation time.\n",
           expiration_time);
  }

  /* Release the subkey handle before moving on. */
  rnp_key_handle_destroy(sk);
}

The results are:

#1 (doesn't expire) valid till 1714111110 seconds after epoch; expires 0 seconds after key's creation time.
#2 (expires) valid till 1714111110 seconds after epoch; expires 94670781 seconds after key's creation time.
#3 (expired) valid till 1619527593 seconds after epoch; expires 86400 seconds after key's creation time.
#4 (invalid sig) valid till 0 seconds after epoch; invalid, skipping.
#5 (no sig) valid till 0 seconds after epoch; invalid, skipping.

Now we get the results that we want! We correctly print the expiration time for the first three subkeys, and indicate that the last two subkeys are invalid.

But, let’s take a closer look at rnp_key_valid_till. First, in OpenPGP, a key’s expiration time is stored as an unsigned 32-bit offset from the key’s unsigned 32-bit creation time. The sum of the two can therefore be larger than a 32-bit value can hold, yet the function returns a 32-bit timestamp. Thus, the function should have used a wider type or at least checked for overflow. (I reported the issue and it has now been fixed.)
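The arithmetic is worth spelling out. The following sketch is not RNP’s code, and the function names are made up. It only shows what can happen when two 32-bit values are added in 32 bits, next to the two remedies mentioned above.

```rust
/// Adds in 32 bits and silently wraps around on overflow.
pub fn expiry_wrapping(created: u32, validity: u32) -> u32 {
    created.wrapping_add(validity)
}

/// Remedy 1: use a wider type, so the sum always fits.
pub fn expiry_wide(created: u32, validity: u32) -> u64 {
    u64::from(created) + u64::from(validity)
}

/// Remedy 2: stay in 32 bits, but report overflow instead of wrapping.
pub fn expiry_checked(created: u32, validity: u32) -> Option<u32> {
    created.checked_add(validity)
}
```

For a key created at 1619440710 with the largest possible validity period of 4294967295 seconds, the true expiration is 5914408005. Wrapped to 32 bits, that becomes 1619440709, which is one second before the key was created.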

But ignoring that nit, the function remains strange. In OpenPGP a key can be valid during multiple periods. For instance, imagine that a key expires on July 1st and the user only extends the key’s expiration time on July 10th. For the time between July 1st and July 10th, the key was not valid, and any signatures generated during that time should be treated as invalid. So, what should this function return for such a key? More importantly, how should a user of that API interpret the result? And, when is it even appropriate to use this API? (Yes, I asked.)

In Sequoia, we take a different approach. Instead of returning when a key is valid, we reverse the question. A user of the API can ask: is this key valid at time t? In our experience, this is what all of the cases that we’ve encountered actually need.
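In the Sequoia snippet shown earlier, the second argument to with_policy is where such a reference time goes. To show why the point-in-time question copes with gaps in validity, here is a simplified model with hypothetical names (BindingPeriod, key_valid_at). It is not Sequoia’s implementation, and it ignores revocation, signature verification, and policy. Times are in arbitrary units, e.g. days.

```rust
/// One binding signature: when it was made, and when it says the key expires.
pub struct BindingPeriod {
    pub created: u64,
    pub key_expires: Option<u64>, // None: the key does not expire
}

/// Answers "is this key valid at time t?". The binding signature in force
/// at `t` is the newest one created at or before `t`.
pub fn key_valid_at(key_created: u64, bindings: &[BindingPeriod], t: u64) -> bool {
    if t < key_created {
        return false;
    }
    bindings
        .iter()
        .filter(|b| b.created <= t)
        .max_by_key(|b| b.created)
        .map_or(false, |b| b.key_expires.map_or(true, |e| t < e))
}
```

Take a key created on day 0 whose first binding signature expires it on day 181 (July 1st), and a second binding signature made on day 190 (July 10th) that extends it to day 365. The function answers yes for day 100, no for day 185, and yes for day 200, without having to squeeze two validity periods into a single timestamp.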

I didn’t cherry-pick this particular issue with RNP’s API. It’s just an issue that I’ve been thinking about recently. While reimplementing RNP’s API to create an alternative OpenPGP backend for Thunderbird, we encountered many similar issues.

Conclusion

The mistakes that the RNP developers have made are understandable and forgivable. OpenPGP, like many other protocols, is complicated. But, we can’t significantly simplify it if we want to keep its flexible and robust PKI, and not just have a file encryption tool.

Nevertheless, RNP’s API is dangerous. And, Thunderbird is used in security-critical contexts. In an interview in 2017, Michal ‘Rysiek’ Wozniak from the Organized Crime and Corruption Reporting Project (OCCRP) made clear that lives are on the line:

I do strongly believe that had we not been using GnuPG all of this time, many of our sources and many of our journalists, would be in danger or in jail.

Interview with Michal ‘Rysiek’ Wozniak, Organized Crime and Corruption Reporting Project

What are the consequences for Thunderbird? I see three options. First, Thunderbird could switch back to Enigmail. One might think that porting Enigmail to Thunderbird 78 would be hard, but I’ve heard from multiple Thunderbird developers that this would be technically feasible with manageable effort. But one of the reasons that Thunderbird wanted to switch away from Enigmail is the huge amount of time the Enigmail developers spent helping users correctly install and configure GnuPG. So, this way is not ideal.

Second, Thunderbird could switch to a different OpenPGP implementation. These days, there are a bunch to choose from. Personally, I think that Thunderbird should switch to Sequoia. Of course, I work on Sequoia, so I’m biased. But it isn’t about money: I’m paid by a foundation, and on the open market I would probably earn twice as much as I’m earning now. For me, it’s about protecting the users. But, beyond Sequoia’s API and implementation advantages, it has another advantage for Thunderbird: we already did the implementation work. A few weeks ago, we released the Octopus, an alternative OpenPGP backend for Thunderbird. It not only has feature parity with RNP, but also includes a number of oft-requested features like gpg integration, some security fixes, and a number of non-functional improvements.

Third, Thunderbird could get out of the OpenPGP business. I don’t want this solution. But, as I’ve said several times, I’m worried about the safety of some of Thunderbird’s most vulnerable users, and I think not providing any OpenPGP support might be safer than the status quo.