Discovering services in a distributed environment can be challenging. With machines constantly being created and removed, and load balancers in play, questions arise: how do we determine which services are up, where should traffic be directed, and how do we keep everything healthy? We'll use Consul to explore the subject practically; it offers Service Discovery along with various other features.
In scenarios where applications communicate, like App A calling App B which in turn calls App C, knowing the machine being called becomes crucial. This involves knowing the IP, port, or subdomain linked to an IP through DNS.
As the usage of App B increases, new instances are created to manage load. This raises questions about which instance of App B App A should call, selection criteria, and readiness of new instances.
Challenges emerge:
- Which machine to call?
- Which port to use?
- Do I need instance IPs?
- How to ensure an instance's health?
- How to confirm permission to access?
Service discovery tooling addresses these challenges by making it possible to:
- Discover available machines for access
- Segment machines for security
- Resolve via DNS
- Active health checks
- Verify access permission
Consul is often associated with Service Discovery due to its mature and lightweight approach. It offers a wide array of features, including:
- Service Segmentation: restrict which services are allowed to access which other services.
- Layer 7 Load Balancing: Balancing internet traffic outside the network.
- Key / Value Configuration (Storage): Store configuration variables, secrets, etc.
- Open Source / Enterprise
To secure service-to-service communication, Mutual TLS is used: certificates are issued to the services, and the Service Mesh and its proxies handle the encrypted traffic between them.
- Mutual Authentication: Learn more
- TLS (Transport Layer Security): Learn more
The Service Registry contains all services, along with Health Check information.
Imagine three instances of Application B, each running a Consul agent that continuously reports to the server, which maintains a consolidated service registry. Health Checks are performed locally; if an issue arises, Consul removes the machine from the registry.
This communication uses the Gossip Protocol. Consul also features its own DNS server, allowing DNS queries for the available machines.
It also offers an API for querying and obtaining information from the Service Registry.
Consul's agent runs on each server and is configured to check addresses (URLs, endpoints, TCP ports). If a service stops responding, the agent notifies the Consul Server, which then removes the service from the Registry.
The Consul Server doesn't continually check the Client for service availability. Instead, the Client Agent verifies service health, updating the Server.
- Multi-language
- Multi-operating system
- Consul is not just multi-cloud; it allows communication between Consul instances in different clouds, ensuring consistency.
- Consul's "Datacenter" concept facilitates managing multiple clusters to meet specific needs.
Main functions summarized:
- Agent: A daemon process executed on all cluster nodes, running in Client or Server Mode.
- Client: Locally registers services, performs Health Checks, forwards service information to Server.
- Server: Maintains cluster state, registers services, handles client/server membership, and more.
Dev mode:
- Simulates a Consul server
- NEVER use in production
- For testing features, examples, and proof of concepts
- Runs as a server
- Does not scale
- Registers everything in memory
The example provided in this repository does not cover security aspects; therefore, for production environments, it's necessary to encrypt communication in the Consul cluster using TLS certificates and gossip encryption among members.
❯ docker compose up
- Clone this repository and navigate to the project directory.
- Run `docker-compose up` to start the demonstration environment.
- Open a root shell in each service's container with `docker-compose exec <service-name> sh`.
Consul has three modes: client, server, and dev.
# command example
❯ consul agent -dev
You can see the cluster members and their information using the following command:
❯ consul members
- Always start at least 3 servers for stability and leader election.
- Consul replicates information automatically from the leader to other servers.
- Consul offers a REST API accessible via HTTP and includes a built-in DNS server.
- Consul provides SDKs to simplify service catalog interaction and caching.
Use the following command to access the catalog of the cluster:
❯ curl localhost:8500/v1/catalog/nodes
- Configure three Consul instances in the `docker-compose.yaml` (a minimal sketch is shown after this list).
- Start each agent in server mode with the appropriate parameters, including `-server`, `-bootstrap-expect`, `-node`, `-bind`, `-data-dir`, and `-config-dir`.
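A minimal sketch of what such a `docker-compose.yaml` could look like. This is not the repository's actual file: the image tag, the keep-alive command, the volume paths, and the fixed subnet are assumptions based on the container names and IP addresses used throughout this walkthrough.
version: "3.8"
services:
  consulserver01:
    image: consul:1.10.12                  # assumed image/tag
    container_name: consulserver01
    command: ["tail", "-f", "/dev/null"]   # keep the container alive; the agent is started manually
    volumes:
      - ./servers/server01:/etc/consul.d   # assumed path for the server.json shown later
    networks:
      consul:
        ipv4_address: 172.19.0.2
  # consulserver02 (172.19.0.3) and consulserver03 (172.19.0.4) follow the same pattern.
networks:
  consul:
    ipam:
      config:
        - subnet: 172.19.0.0/24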
To start a server agent:
- Determine your machine's local IP using `ifconfig`.
- Use the following command to launch the agent (replace placeholders as needed):
❯ consul agent -server -bootstrap-expect=3 -node=<node-name> -bind=<local-ip> -data-dir=/var/lib/consul -config-dir=/etc/consul.d
- Create directories `/etc/consul.d` and `/var/lib/consul` in the client's container.
- Start the client agent using the following command:
❯ consul agent -bind=<client-ip> -data-dir=/var/lib/consul -config-dir=/etc/consul.d
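Depending on how the agents were started, they may still need to be joined into a single cluster. Later in this walkthrough that happens either manually with `consul join` or automatically with `-retry-join`; a minimal manual example, where the address is a placeholder for any agent already in the cluster:
# Run from the new agent, pointing at any existing member
❯ consul join <server-ip>
# Confirm membership afterwards
❯ consul members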
- Create a `.json` file; you can name it anything. For this example, let's use `services.json`.
In the file, you'll describe:
- `"id"`: A unique identifier for the service.
- `"name"`: The name of the service.
- `"tags"`: Tags that help with service discovery (e.g., "production", "development", "web").
- `"port"`: The port on which the service is running.
- Other parameters available in the Consul documentation.
{
"service": {
"id": "nginx",
"name": "nginx",
"tags": [
"web"
],
"port": 80
}
}
Registering a service differs from starting a service. When you register the service, you're not starting it yet. Registration and service deployment are independent. Always run an agent on the same machine as the service.
After configuring, use `consul reload` to update Consul with registered services.
❯ consul reload
❯ dig @localhost -p 8600 nginx.service.consul
# expected output
;; ANSWER SECTION:
nginx.service.consul. 0 IN A 172.18.0.3
You can see the service `nginx.service.consul` running on IP `172.18.0.3`. Any machine can see the service by using the same command.
You can also use the Consul API to fetch information about the registered services. Use the following command to list the available services:
❯ curl localhost:8500/v1/catalog/services
# expected output
{"consul":[],"nginx":["web"]}
To register another `nginx` service, create a JSON configuration similar to the first one but with a unique `id`.
{
"service": {
"id": "nginx2",
"name": "nginx",
"tags": [
"web"
],
"port": 80
}
}
Bring up a new client, and using `-retry-join`, the client automatically discovers the entire cluster.
❯ consul agent -bind=172.19.0.6 -data-dir=/var/lib/consul -config-dir=/etc/consul.d -retry-join=172.19.0.2 -retry-join=172.19.0.3
The second `nginx2` service synchronizes, and the join to the cluster is successful.
If we don't implement a health check, Consul will always show the registered services regardless of whether they are actually available. This is why adding the Health Check feature is so useful.
Consul offers various ways to implement Checks, as documented in Define health checks.
We'll add the check only to the service registered on `consulclient01`, and to apply the change we need to run `consul reload` (a sketch of the check definition is shown below).
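The exact check definition isn't shown in this walkthrough; a plausible sketch of what gets added to the service file is an HTTP check against the local Nginx with a 10-second interval, inferred from the `check=nginx` log lines and the 10-second wait mentioned later. The field names are standard Consul options, but the values here are assumptions.
// Assumed check definition for illustration
{
"service": {
"id": "nginx",
"name": "nginx",
"tags": [
"web"
],
"port": 80,
"check": {
"id": "nginx",
"name": "nginx HTTP check",
"http": "http://localhost:80",
"interval": "10s",
"timeout": "1s"
}
}
}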
First, let's verify the services:
❯ dig @localhost -p 8600 nginx.service.consul
# expected output
;; ANSWER SECTION:
nginx.service.consul. 0 IN A 172.18.0.6
nginx.service.consul. 0 IN A 172.18.0.3
We run `consul reload` on the machine where the check was added, i.e., `consulclient01`:
❯ consul reload
Configuration reload triggered
# expected output
2023-08-19T19:28:55.199Z [WARN] agent: Check is now critical: check=nginx
After running the command `dig @localhost -p 8600 nginx.service.consul` again, we only see the service from `client02`:
❯ dig @localhost -p 8600 nginx.service.consul
# expected output
;; ANSWER SECTION:
nginx.service.consul. 0 IN A 172.18.0.6
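Besides DNS, the same filtering is available over Consul's HTTP API: the standard health endpoint with the `passing` filter returns only instances whose checks are healthy.
❯ curl "localhost:8500/v1/health/service/nginx?passing"
# Returns a JSON array with node, service, and check details for healthy instances only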
If we try to reach the service itself with curl, we get the following response, since nothing is installed yet. It's important to note that registering a service is one thing, and actually running the service is another. Up to this point, we haven't even installed Nginx.
❯ curl localhost
curl: (7) Failed to connect to localhost port 80 after 0 ms: Connection refused
Installing and starting Nginx:
❯ apk add nginx
❯ nginx
❯ ps
PID USER TIME COMMAND
213 root 0:00 nginx: master process nginx
214 nginx 0:00 nginx: worker process
215 nginx 0:00 nginx: worker process
216 nginx 0:00 nginx: worker process
217 nginx 0:00 nginx: worker process
218 nginx 0:00 nginx: worker process
219 nginx 0:00 nginx: worker process
220 nginx 0:00 nginx: worker process
❯ curl localhost
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
But when we look at the terminal of `client01` where Consul is running, the service is still critical:
2023-08-19T19:35:25.284Z [WARN] agent: Check is now critical: check=nginx
We need to configure Nginx:
# Adding a text editor
❯ apk add vim
# Creating a directory using -p to create recursively if any of the folders don't exist
❯ mkdir /usr/share/nginx/html -p
# Opening the default Nginx configuration file
❯ vim /etc/nginx/http.d/default.conf
Replace:
# Everything is a 404
location / {
return 404;
}
# with:
root /usr/share/nginx/html;
# Now, edit the content of the file that will be displayed
❯ vim /usr/share/nginx/html/index.html
Hellooo... finally!!!
# Restart Nginx to apply the changes
❯ nginx -s reload
After reloading Nginx, the service status should change in the terminal:
# expected output
2023-08-19T22:22:13.955Z [INFO] agent: Synced check: check=nginx
The check is now synced. Running `curl localhost` should display the content we added to the HTML file:
❯ curl localhost
# expected output
Hellooo... finally!!!
Now we can verify via DNS if the `client01` service is up and running:
❯ dig @localhost -p 8600 nginx.service.consul
# expected output
;; ANSWER SECTION:
nginx.service.consul. 0 IN A 172.19.0.6
nginx.service.consul. 0 IN A 172.19.0.5
We can now kill the Nginx process to see if the Check actually works and if the service becomes unavailable after being online:
# List processes
❯ ps
PID USER TIME COMMAND
87 root 0:00 nginx: master process nginx
# Kill the process and then query via DNS
❯ kill 87
❯ dig @localhost -p 8600 nginx.service.consul
;; ANSWER SECTION:
# expected output
nginx.service.consul. 0 IN A 172.19.0.6
# Automatically, Consul on Client01 indicates that the service is critical
2023-08-19T22:29:14.071Z [WARN] agent: Check is now critical: check=nginx
After restarting Nginx, wait for 10 seconds, and the service should be visible again:
❯ nginx
❯ dig @localhost -p 8600 nginx.service.consul
# expected output
;; ANSWER SECTION:
nginx.service.consul. 0 IN A 172.19.0.5
nginx.service.consul. 0 IN A 172.19.0.6
# Also in the terminal of Consul Client01
2023-08-19T22:30:24.137Z [INFO] agent: Synced check: check=nginx
In this example, we synchronize the servers by using a JSON configuration file placed in the `server` directory; this file contains all the parameters needed to start each server.
// In the case below, the "server" parameter indicates the type of agent being started
// TRUE = SERVER and FALSE = CLIENT
{
"server": true,
"bind_addr": "172.19.0.2",
"bootstrap_expect": 3,
"data_dir": "/tmp",
"retry_join": [
"172.19.0.3",
"172.19.0.4"
]
}
After configuring the above, you can restart the container and then run the simplified Consul command:
❯ docker compose up -d
❯ docker exec -it consulserver01 sh
❯ consul agent -config-dir=/etc/consul.d
Replicate these configurations for the other servers, rerun the Docker command, and start the remaining servers that are missing.
Throughout this example, every time we use the `consul join XXX.XX.X.X` command, we join the network, and it's advisable to add more security. There are two types of encryption here: one covers membership, meaning joining the cluster, synchronizing information, and recognizing all members.
The other type of encryption uses Mutual TLS: we generate certificates through a certificate authority, possibly one of the servers, which then issues and distributes client certificates, enabling secure catalog exchange and API access.
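This walkthrough only implements the membership (gossip) encryption below, but for reference, the mutual TLS side is configured through the agent's TLS options. A minimal sketch with placeholder file paths, using the top-level option names valid for the Consul 1.10 line used here:
// Assumed example, not part of this repository's configs
{
"verify_incoming": true,
"verify_outgoing": true,
"verify_server_hostname": true,
"ca_file": "/etc/consul.d/certs/consul-agent-ca.pem",
"cert_file": "/etc/consul.d/certs/dc1-server-consul-0.pem",
"key_file": "/etc/consul.d/certs/dc1-server-consul-0-key.pem"
}
The certificates themselves can be generated with Consul's built-in helpers (`consul tls ca create` and `consul tls cert create -server`), which matches the idea above of one of the servers acting as the certificate authority.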
We made changes to `server.json` for all servers:
// For the first server, we removed retry_join
{
"server": true,
"bind_addr": "172.19.0.2",
"bootstrap_expect": 3,
"data_dir": "/tmp",
"node_name": "consulserver01"
}
// For the second server, we kept only the address of server01
{
"server": true,
"bind_addr": "172.19.0.3",
"bootstrap_expect": 3,
"data_dir": "/tmp",
"retry_join": [
"172.19.0.2"
],
"node_name": "consulserver02"
}
// We kept server03 the same
{
"server": true,
"bind_addr": "172.19.0.4",
"bootstrap_expect": 3,
"data_dir": "/tmp",
"retry_join": [
"172.19.0.2",
"172.19.0.3"
],
"node_name": "consulserver03"
}
// We added node_name to all servers
Regarding encryption, let's focus on gossip encryption among members. We need to generate a Consul key and pass it via the `encrypt` parameter, with the same key in all `server.json` files. However, for testing purposes, we left the key out of server03.
First, we run `rm -rf /tmp/*` on all terminals to prevent errors related to previously created data. Then we run `consul keygen` and add the generated key to the first two servers.
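For reference, generating the key looks like this; `consul keygen` simply prints a base64-encoded key to stdout (the value is omitted here):
❯ consul keygen
# <base64-encoded gossip key>
The same value is then pasted into the `encrypt` field of server01 and server02: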
{
"server": true,
"bind_addr": "172.19.0.2",
"bootstrap_expect": 3,
"data_dir": "/tmp",
"node_name": "consulserver01",
"encrypt": "run the ```consul keygen``` command and add the key here"
}
{
"server": true,
"bind_addr": "172.19.0.3",
"bootstrap_expect": 3,
"data_dir": "/tmp",
"retry_join": [
"172.19.0.2"
],
"node_name": "consulserver02",
"encrypt": "run the ```consul keygen``` command and add the key here"
}
// For server03, we removed the "encrypt" parameter
{
"server": true,
"bind_addr": "172.19.0.4",
"bootstrap_expect": 3,
"data_dir": "/tmp",
"retry_join": [
"172.19.0.2",
"172.19.0.3"
],
"node_name": "consulserver03"
}
Remember to run `rm -rf /tmp/*` on all terminals to clean up any previous data.
After applying all the changes, we start the first server and, at the end of its terminal, we see that all services have started. We repeat the same steps for server02 and then server03.
Since, in this example, we deliberately didn't provide the encryption key to server03, the output we see in the terminal is as follows:
❯ consul agent -config-dir=/etc/consul.d
# expected output
==> Starting Consul agent...
Version: '1.10.12'
Node name: 'consulserver03'
Datacenter: 'dc1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 172.19.0.4 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2023-08-19T23:22:49.457Z [INFO] agent: Joining cluster...: cluster=LAN
2023-08-19T23:22:49.457Z [INFO] agent: (LAN) joining: lan_addresses=[172.19.0.2, " 172.19.0.3"]
2023-08-19T23:22:49.457Z [INFO] agent: started state syncer
2023-08-19T23:22:49.457Z [INFO] agent: Consul agent running!
2023-08-19T23:22:51.460Z [WARN] agent.server.memberlist.lan: memberlist: Failed to resolve 172.19.0.3: lookup 172.19.0.3: no such host
2023-08-19T23:22:51.460Z [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="2 errors occurred:
* Failed to join 172.19.0.2:8301: Remote state is encrypted and encryption is not configured
* Failed to resolve 172.19.0.3: lookup 172.19.0.3: no such host
We can then see that it cannot join the Consul cluster. And when we run the `consul members` command in the server01 terminal, we notice that only server01 and server02 have been added to the cluster.
❯ consul members
# expected output
Node Address Status Type Build Protocol DC Segment
consulserver01 172.19.0.2:8301 alive server 1.10.12 2 dc1 <all>
consulserver02 172.19.0.3:8301 alive server 1.10.12 2 dc1 <all>
We can also listen on the network interface to verify that the data is indeed being encrypted, attempting to observe the information being transmitted.
First, let's install tcpdump in the container.
❯ apk add tcpdump
# We verified using `ifconfig` that our network interface is `eth0`.
❯ ifconfig
# With the command below, we attempt to read the information being transmitted.
❯ tcpdump -i eth0 -an port 8301 -A
# expected output
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:35:08.963680 IP 172.19.0.3.8301 > 172.19.0.2.8301: UDP, length 116
E.....@[email protected]........ m m.|X......j!@... /......s..@e"iu.$..,xy..g7.y!.y......L~x.B.....@...{:..Qh.A..........*....M..<[email protected].^ ...C.p....0.yK.
23:35:08.964058 IP 172.19.0.2.8301 > 172.19.0.3.8301: UDP, length 181
E...".@.@........... m m..X..n..?X*.:
.ss.&[email protected]"G..6j.k}.qp:.Sb.\.....5.4../..`gv../.r. l(...$.....F&.......z..o....=c...........{.Gt.k.k...]6\@/|...v........../...~N.../f... e.t......BT.Hv...i...A....
23:35:09.292942 IP 172.19.0.2.8301 > 172.19.0.3.8301: UDP, length 116
E...#.@.@..,........ m m.|X..9. .._..ZY0...>".....b.i.+..9.Tb....z....y.....u........M....,._7....M.4[.z..q.zc.....zl........f...L.......T....k.
23:35:09.293476 IP 172.19.0.3.8301 > 172.19.0.2.8301: UDP, length 181
E.....@[email protected]......... m m..X...........$%....&..^.gBT4..zJY....`.+-.'.......a.34.....5......x.\..!z..y..,.<H_}(........lM..\C0k.W.!
!e..........zB.H}......t.r?9..E..5z....Z.........m,...].x....x=..&......Y]....
23:35:09.964173 IP 172.19.0.3.8301 > 172.19.0.2.8301: UDP, length 116
E.....@[email protected]........ m m.|X....4....Q.
@zO..B.c....0e.$.........I...6..P..x.6..S%.h...P_......_.?..8.(S.95.w....Q.@..."...*...B.F...5.9........R
23:35:09.965185 IP 172.19.0.2.8301 > 172.19.0.3.8301: UDP, length 181
E...#.@.@........... m m..X......v.J.+..)[email protected]..].*[...@.%M.......B.).^...iZ>.$2^.R..5.47.........V.j....+=l+5..k...-.i.`.4..>....1.i...&.....{.Y........O....;+.<!.3.....u...z(..n'.akx.Z#.p..D.
23:35:10.843093 IP 172.19.0.2.43660 > 172.19.0.3.8301: Flags [S], seq 1265554281, win 64240, options [mss 1460,sackOK,TS val 2170266511 ecr 0,nop,wscale 7], length 0
E..<..@.@............. mKn.i........XZ.........
The security change is in the data exchange between the servers. Previously, anyone who knew the IP address could join the cluster. However, with this new security model, only machines that have the key are allowed to join the network. This is the main security improvement, encrypting all the network traffic exchanged by servers. Without the correct key, it's impossible to access the network.
Please note that this example is aimed at understanding how to run a cluster and isn't necessarily secure for production environments. In a production environment, you would handle the keys and certificates more securely, preferably using a secure key management system.
The Consul User Interface is accessible via a web browser, and to use it, we need to enable it when bringing up the Cluster. There are two ways to do this:
→ Include the `-ui` flag when starting the agent: `consul agent -config-dir=/etc/consul.d -ui`
→ When running inside Docker, the UI listens on localhost by default, so simply publishing the port isn't enough to reach it from outside the container; we need to make the UI accessible on all interfaces. We could pass `0.0.0.0` on the command line, but we will configure this directly in the configuration file.
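The exact file change isn't shown above; a minimal sketch of what could go into the client's JSON in `/etc/consul.d` (both keys are standard Consul agent options, and the values here only reflect the intent just described):
// Assumed example for illustration
{
"client_addr": "0.0.0.0",
"ui_config": {
"enabled": true
}
}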
Since the UI runs on port 8500, we also had to share this port in the docker-compose.yaml.
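In docker-compose terms, that amounts to a port mapping such as the following on the container that serves the UI (the service name is an assumption):
services:
  consulclient01:
    ports:
      - "8500:8500"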
We then ran `docker-compose up` again and restarted the servers.
So we have access to the Consul User Interface → http://localhost:8500/ui/dc1/services