Remove the pain points when managing ACME certificates #117

Open
6 tasks
bfabio opened this issue Jul 24, 2017 · 12 comments

Comments

@bfabio
Contributor

bfabio commented Jul 24, 2017

Maybe it's just me, but the whole system seems really brittle and every time it breaks it makes me wish I could just run certbot and be done with it.

/usr/local/lib/pki/pki-realm is a ~2500-line bash script and it's a pain to debug when something goes wrong.

This could be an umbrella bug to improve the whole experience. I think the main points to tackle are:

  • Document how the system works
  • Document the existence of pki-realm and the arguments it takes, and implement a --help switch
  • Have a verbose and interactive mode, where pki-realm tells the user what it's doing, without running in the background.
  • Don't just create error.log, but send it to the sysadmin by mail as well
  • Be clear when subsequent runs fail because of the presence of error.log.
  • Simplify the whole script, if possible.
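
A minimal sketch of what the --help, verbose-mode and error.log items could look like, assuming a hypothetical Python front end. The real pki-realm is a bash script; the option names, the function names and the location of error.log directly inside the realm directory are all assumptions made for illustration.

```python
import argparse
import sys
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical interface: the real pki-realm script does not define these options.
    parser = argparse.ArgumentParser(
        prog="pki-realm-sketch",
        description="Illustrative front end for managing a PKI realm.",
    )
    parser.add_argument("realm_dir", type=Path,
                        help="directory holding the realm's files")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print each step instead of running silently")
    return parser


def blocking_error(realm_dir: Path) -> Optional[str]:
    """Return a clear message if a leftover error.log would block this run."""
    error_log = realm_dir / "error.log"  # assumed location, for illustration only
    if error_log.exists():
        return (f"Refusing to continue: {error_log} exists from a previous "
                "failed run. Inspect it, fix the cause, then remove the file "
                "to retry.")
    return None


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    message = blocking_error(args.realm_dir)
    if message is not None:
        # Say why the run stops instead of silently skipping the request.
        print(message, file=sys.stderr)
        return 1
    if args.verbose:
        print(f"No error.log in {args.realm_dir}; continuing with the request.")
    return 0
```

Calling main(["/path/to/realm", "--verbose"]) would either report the blocking error.log on stderr and return 1, or announce that it is continuing and return 0; argparse generates the --help output automatically.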
@drybjed
Member

drybjed commented Jul 24, 2017

I agree, it's a mess. The whole PKI realm concept was created to provide a standardized way for services to access X.509 certificates - one part of it being support for multiple Certificate Authorities. ACME support was kinda-sorta bolted on there; when it works, it works, but getting it to work might be a pain sometimes.

I think that the whole concept of a PKI realm should be moved out of an Ansible role into its own separate project. Python comes to mind first, but maybe Go would be easier to handle? I'm not sure yet. Having PKI realm as a separate project could help its development - splitting parts of it as separate plugins, one of which would be support for ACME.

@bfabio
Contributor Author

bfabio commented Jul 24, 2017

@drybjed I'm wondering if making a generic pluggable system is the right thing to do. I would much prefer a system where Let's Encrypt is the first-class citizen and everything else comes after.

I'm afraid that, while laudable, that approach would penalize the most common setup in favor of a flexibility that a tiny minority of users really need.

Just a thought, I'm not against it, but I think the LE setup should be an unstoppable tank that Just Works(TM) every time.

@drybjed
Member

drybjed commented Jul 24, 2017

@bfabio What about internal networks? Let's Encrypt certificates work well on frontend hosts, webservers, public stuff, but setting up an internal network with LE is infeasible. Not all hosts are reachable from the public Internet, but they still require encrypted communication. Also keep in mind that the Let's Encrypt CA has rate limits - I destroy and create hosts sometimes multiple times a day, and relying only on LE certificates I would hit its rate limits within hours. During DebOps development I don't use Let's Encrypt certificates at all, because all my hosts are on a private network behind NAT, but I still require X.509 certificates to work correctly, so that when the roles are deployed in a public production environment, they work in the same way.

It seems that Let's Encrypt became an instant hit in the webdev/HTTP community. That's great, but what about other services? SMTP, IMAP, LDAP, MQTT, AMQP, are they not first class citizens? Should they just stick to self-signed certs set up by hand? For me all different CA models supported by debops.pki (ACME, external CA, internal CA, selfsigned) are on equal footing here.

From the point of view of an application that uses X.509 certificates, there's no difference between internal DebOps CA certificates, Let's Encrypt certificates, or any other CA certificates. Currently the debops.pki role supports each of these models in the same way. Some of this is currently crude and unwieldy, like dealing with Let's Encrypt errors, but that's just an implementation detail. The script could definitely handle Let's Encrypt issues better (send an e-mail to the configured admin address, have a better algorithm to handle errors and error.log, hell, change the configuration file to YAML or INI to drop the requirement for bash 4.x so that Mac OS X users don't have to fiddle with a Homebrew bash installation...).

"Unstoppable tank that 'Just Works(TM)'" - as long as the minimum requirements are met (the debops.nginx role has configured nginx on the host, the DNS configuration is propagated, the host has a public IP address reachable from the Internet, a PKI realm with the desired domain is configured), the Let's Encrypt support in debops.pki should "Just Work". If you have issues, once you resolve them you can forget about LE certificates. Check the certs on the https://debops.org/ website - I haven't messed with them since the host was created, about 1 year ago now. It "Just Works (TM)". The host updates the certificates by itself, I don't think about it.

@bfabio, you posted a todo list for the changes you would want to see to make the ACME/Let's Encrypt support better. That's great! I'm currently working on updates to the DebOps mail stack - Postfix, OpenDKIM, SPF, OpenDMARC, perhaps rspamd a bit later. If you want to help with debops.pki development, that's great news to me. Looking forward to some pull requests. :-) If you need some clarification about how the pki-realm script works, let me know.

@amette

amette commented Nov 7, 2017

To me the main thing that makes the ACME configuration unnecessarily complicated is the choice of default Subject/CN (Common Name) and SANs (Subject Alternative Names). The domain name is used as CN; imho it should be the host's fqdn. That is the only thing that can with reasonable certainty be assumed to point to the host. Using the domain name as the CN falls apart as soon as there is more than one host in the domain.

To make Let's Encrypt work the way I expect, I usually put the following into ansible/inventory/group_vars/all/pki.yml:

pki_realms:
  - name: '{{ ansible_fqdn }}'
    acme_default_subdomains: [ ]
    acme: True
    acme_ca: 'le-staging'

With this, it "just works" out-of-the-box, no matter if it is a one-server-domain or a bigger cluster. I wouldn't actually even be scared to use le-live right away, but better safe than sorry. So once this looks good, I can set acme_ca to le-live in the host-specific inventory file. The resulting certificate can be used for SMTP, IMAP, XMPP, etc. As soon as I have my service working properly with the fqdn, I can set the DNS for the corresponding subdomain (mx., mail., smtp., imap., xmpp., jabber., etc.) to my machine and add it to acme_domains in host_vars.

Also I tried to set up Let's Encrypt within the 'domain' realm for quite a while, which I eventually realised is just not gonna work out. I think the documentation could be a bit clearer about having to use a dedicated realm for Let's Encrypt.

On the other hand: if the ACME integration worked as explained above (use the fqdn as CN and don't assume any sub-domains), one could just configure the realm 'domain' to use acme and all would work out of the box. I'm not completely sure though what other ramifications this would have, as it would effectively kill the internal CA iiuc.

tl;dr: Changing the default values for CN and SAN should make ACME certificates more straightforward to use. No clue about any potentially associated gremlins though.

@drybjed
Member

drybjed commented Nov 7, 2017

At some point I noticed that adding arbitrary subdomains to ACME certificates by default, namely www., was an issue in cases like this, and I changed the default to not include any subdomains. In other words, you can create a FQDN-based realm with an ACME certificate like this:

pki_realms:
  - name: '{{ ansible_fqdn }}'

By default, if debops.nginx is set up, and a host has a public IP address, the debops.pki role will try and request an ACME certificate for all configured realms, unless disabled.

The domain realm has ACME specifically disabled, mostly due to the above reasons. Not all hosts managed by DebOps have a webserver configured, and not all of them have public IP addresses, but you still would want connections secured with TLS, right? That's why I don't think that the domain PKI should be removed.

And it's best if you don't mess with the domain PKI but create a separate one - the domain certificates, even if clients don't have their respective Root CA installed, can be used by various services in the cluster for secure communication between nodes. Think LDAP, connections to the remote database from applications, what have you. You can set up custom PKI with ACME certificates on publicly-accessible nodes of the cluster and point the services accessed by the clients to them.

Actually, debops.nginx has specific support for this use case. If you create a PKI realm based on the host's FQDN, or the host's domain, the debops.nginx role, during configuration generation, will check if an FQDN-based or domain-based PKI realm exists, and it will be used automatically. In other words, if you create a FQDN PKI realm and afterwards re-run debops.nginx, it should automatically switch to it if any servers are configured with that FQDN. It can also be any domain name, or host name, of course, not just the values detected by Ansible.
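
The lookup described above can be sketched roughly as follows. This is an illustration in Python, not the role's actual template logic: the function name pick_realm, the exact-name matching rule and the 'domain' fallback are assumptions made for the example.

```python
from typing import Iterable, Sequence


def pick_realm(server_names: Sequence[str],
               existing_realms: Iterable[str],
               fallback: str = "domain") -> str:
    """Pick the PKI realm for one nginx server.

    Assumption for illustration: a realm matches when its name equals one of
    the server's names; the real role's lookup rules may differ.
    """
    available = set(existing_realms)
    for name in server_names:
        if name in available:
            # A realm named after this server name exists, so prefer it.
            return name
    # No dedicated realm found: keep using the default realm.
    return fallback
```

With realms "domain" and "www.example.org" present, pick_realm(["www.example.org"], ["domain", "www.example.org"]) selects the dedicated realm; before that realm is created, the same call with only "domain" falls back to the default.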

So, the use case you want should already be implemented. Of course for this you need to specifically enable the '{{ ansible_fqdn }}' PKI realm, but due to the various rate limits of Let's Encrypt, and other factors mentioned earlier, I don't think that ACME support like this can be enabled by default. Maaaybe, with some more specific logic that enables the FQDN-based PKI in specific situations.

@prk0ghy

prk0ghy commented Apr 18, 2021

> I agree, it's a mess. The whole PKI realm concept was created to provide a standardized way for services to access X.509 certificates - one part of it being support for multiple Certificate Authorities. ACME support was kinda-sorta bolted on there; when it works, it works, but getting it to work might be a pain sometimes.
>
> I think that the whole concept of a PKI realm should be moved out of an Ansible role into its own separate project. Python comes to mind first, but maybe Go would be easier to handle? I'm not sure yet. Having PKI realm as a separate project could help its development - splitting parts of it as separate plugins, one of which would be support for ACME.

Thank you for your work on this amazing project. I am currently trying to get it to work for me, but PKI is a huge pain point (at least for me). Maybe redeveloping the PKI in Go is not even necessary, since there is already something like that:

https://github.com/smallstep/certificates

Maybe we can get this integrated into DebOps?

@ypid
Member

ypid commented Apr 19, 2021

@prk0ghy I support this. We should not implement our own certificate management again. I would say it was a solid way to learn how PKI works, both for @drybjed who implemented it and for me spending one month reviewing it. Now that we do understand it, we can compare other solutions better.

https://awesomeopensource.com/projects/certificate-authority seems to be a good list.

@drybjed
Member

drybjed commented Apr 19, 2021

Looking at my 2017 comment from 2021 brings a totally new perspective to this issue. :-) The problem with the current PKI implementation is that it is "lopsided" and depends entirely on the remote hosts. The environment we can work with on the Ansible Controller is limited, so I did what I could back then and just relied on the remote hosts to provide initial information about the domain(s) we work with, what the CA certificate should include, etc.

Today, while working on re-implementing the debops scripts, I imagine that the internal CA part of the pki role would be redesigned to use the debops pki subcommand to perform its operations. That way we can implement it in Python and we have control over what is executed on the Ansible Controller. And we can add support for other software as well, such as step-ca. I'm currently swamped by other stuff at work, but hopefully I'll have some free time during the summer to work on this more.

One problem with this is finding a way to have internal CA management without the debops scripts installed, so that the pki role can still function properly. I guess that we can just provide basic self-signed certificates on the remote hosts and tell the users to install the debops scripts to have a fully-fledged internal CA. The remote side could still work independently, handling self-signed and ACME-based certificates.

@ypid
Member

ypid commented Apr 20, 2021

"Don’t roll your own crypto". There is still #106. There has to be an existing tool we can use.

@prk0ghy

prk0ghy commented Apr 22, 2021

Would it be useful to compile a list of features the new pki should have? I think it would be easier to implement if we know exactly what it should be able to do.

@drybjed
Member

drybjed commented Apr 22, 2021

Here are some things I would like to address from the current pki role included in DebOps:

  • Proper implementation of a Bridge CA on the Ansible Controller. With the new debops scripts that would be done using the debops pki subcommand. The Bridge CA is managed in a new project directory which can contain multiple Ansible inventories and connects separate infrastructures together. This will also improve PKI bootstrapping, since users can properly initialize the internal CAs without going to the remote hosts first.

  • The role should implement support for at least one ACME-based CA managed by DebOps. This service could be deployed as the next step in the infrastructure bootstrapping process, and the internal CA could switch to it as the default - that way we don't need to run the pki role manually anymore and the pki-realm scripts on the remote hosts take over handling of internal CA certificates, making long-term maintenance easier. The CA server can also offer OCSP and CRL services to implement proper certificate revocation, which would then pave the way for client certificates. Ideally something included in Debian would be preferred; if we don't find anything good, we can look at step-ca for that purpose.

  • The current ACME support fails to refresh certificates from time to time, due to DNS issues or other causes. The problem is that this fails silently and the administrator learns about it a month or two later, when the certificates stop working. The pki-realm script executed on the remote hosts needs to handle such a failure state better, perhaps by checking the reason in the log file and, if it's one of the known reasons, retrying the ACME request some time later. If it's an unknown reason or a retry fails again, notify the administrator via e-mail.

  • The pki-authority script will be incorporated into the debops scripts Python codebase, so that leaves the pki-realm script on the remote hosts. It could be rewritten in Python, although I'm not sure about the feasibility of keeping it a single giant script for ease of deployment. The UI of the pki-realm script could also be reworked to allow users to manage the PKI realms on the remote hosts manually if needed, with additional introspection - a pki-realm status command, the state of ACME and/or other certificates, and so on.

  • The pki-realm script switching to a different set of certificates a week before they expire needs to go away; it was an ill-advised attempt to handle expiration.

  • There's currently no support for Java certificate stores. I'm not sure how useful that would be, it depends on how many Java-based applications we will end up managing, but it's something to consider.
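
The failure handling proposed in the ACME item above (check the log for a known reason, retry later, otherwise e-mail the administrator) could be sketched like this. It is only an illustration: the patterns treated as "known reasons", the retry limit, the sender address and all function names are assumptions, and real ACME client output will look different.

```python
import re
from email.message import EmailMessage

# Hypothetical patterns for failures worth retrying; adjust to real log output.
TRANSIENT_PATTERNS = (
    re.compile(r"dns", re.IGNORECASE),
    re.compile(r"timeout|timed out", re.IGNORECASE),
)


def is_transient(log_text: str) -> bool:
    """True when the log matches one of the known, retryable reasons."""
    return any(pattern.search(log_text) for pattern in TRANSIENT_PATTERNS)


def next_action(log_text: str, attempts_so_far: int, max_retries: int = 3) -> str:
    """Decide between retrying later ('retry') and alerting the admin ('notify')."""
    if is_transient(log_text) and attempts_so_far < max_retries:
        return "retry"
    # Unknown reason, or the retries are exhausted: do not fail silently.
    return "notify"


def build_notification(realm: str, log_text: str, admin: str) -> EmailMessage:
    """Build (but do not send) the e-mail that reports the failed request."""
    message = EmailMessage()
    message["Subject"] = f"ACME certificate request failed for realm {realm}"
    message["From"] = "pki-realm@localhost"  # assumed sender address
    message["To"] = admin
    message.set_content(
        f"The ACME request for realm {realm} failed.\n\n"
        f"Contents of the error log:\n{log_text}\n"
    )
    return message
```

The message could then be handed to a local MTA, for example with smtplib.SMTP("localhost").send_message(message); whether a mail server is available on the remote host is outside the scope of this sketch.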

@prk0ghy

prk0ghy commented May 2, 2021

@ypid
I went through the list and came up with these candidates, although I think step-certificates and step-cli are the way to go.

https://github.com/NLnetLabs/krill
https://github.com/letsencrypt/boulder
https://github.com/dogtagpki/pki/wiki/Certificate-Authority
https://github.com/cloudflare/cfrpki
https://github.com/fm4dd/webcert
https://github.com/cloudflare/cfssl
https://github.com/smallstep/certificates


5 participants