At Leven Labs, we're all about microservices, but it wasn't immediately apparent to us how they should easily communicate with one another.
We had 3 basic requirements:
Hard-coding addresses and ports in our services wasn't an option, since we'd have to redeploy whenever we wanted to change or add servers. We decided to use SkyDNS, which sits above the etcd key-value store and provides an easy way to do service discovery. SkyDNS exposes a REST API that lets you add and remove SRV-type DNS records that contain a host, port, priority, and weight.
Sounds perfect! Whenever a service starts up, it makes a PUT request to announce itself. Other services can make SRV queries (or regular A queries if they don't care about weights, priorities, or ports) to get the address and port of another service. Multiple instances of a service can start up and announce themselves, adding redundancy that is transparent to the connecting clients. This satisfies our first requirement.
However, when a service fails, every other service has to wait until the record's TTL expires before the failed instance is removed from DNS. A service can only remove its own record if it stops gracefully, which usually isn't the case when something goes wrong.
To remove failed services quickly, we built a small Go service called skyapi that sits in front of SkyDNS. It exposes a WebSocket endpoint that services connect to in order to announce what they're providing. The endpoint accepts GET parameters that let you specify a port, priority, and weight. As soon as the WebSocket is disconnected, the record is removed from SkyDNS, so there's no waiting for a TTL to expire.
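To make the announcement step concrete, here is a small Go sketch that builds the WebSocket URL a service might connect to. Treat the details as assumptions: the /provide path, the parameter names (service, port, priority, weight), and the address 127.0.0.1:8053 are placeholders for this example, so check the skyapi documentation for the real ones.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// announcement holds what a service advertises about itself.
// (Hypothetical type for illustration.)
type announcement struct {
	Service  string
	Port     int
	Priority int
	Weight   int
}

// provideURL builds the WebSocket URL used to announce a service.
// The path and query parameter names are assumptions, not confirmed API.
func provideURL(skyapiAddr string, a announcement) string {
	q := url.Values{}
	q.Set("service", a.Service)
	q.Set("port", strconv.Itoa(a.Port))
	q.Set("priority", strconv.Itoa(a.Priority))
	q.Set("weight", strconv.Itoa(a.Weight))
	u := url.URL{
		Scheme:   "ws",
		Host:     skyapiAddr,
		Path:     "/provide",
		RawQuery: q.Encode(), // Encode sorts parameters by key
	}
	return u.String()
}

func main() {
	u := provideURL("127.0.0.1:8053", announcement{
		Service:  "users",
		Port:     8080,
		Priority: 5,
		Weight:   100,
	})
	fmt.Println(u)
}
```

A WebSocket client would dial this URL at startup and hold the connection open for the life of the process. Nothing needs to be sent on shutdown, because the dropped connection is what triggers the removal.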
Now we have satisfied our first 2 requirements. We can spin up services on the fly and they'll be advertised through SkyDNS, and when a service crashes it'll be removed immediately by skyapi.
Developers need to run all of our services in their development environment in order to develop and test our product. When a developer makes changes to a service, they run the service in Vagrant with the --runmode=dev flag, which sets the priority in SkyDNS to 1 for that particular instance. In SRV records a lower priority value is preferred over a higher one, and since production instances set the priority to 5, the developer's instance will get all local requests. When they're done testing and stop their instance, everything gracefully falls back to the production instance. Keep in mind that all of your services must take priority and weight into account when making SRV lookups in order for this to work. This also only works if each Vagrant box has its own instance of SkyDNS.
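Here is a minimal Go sketch of how a service could map the run mode to an SRV priority. The priorities 1 and 5 come from the description above; the function name parsePriority and the default mode "prod" are assumptions made for this example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

const (
	devPriority  = 1 // developer instances win local lookups
	prodPriority = 5 // production instances are the fallback
)

// parsePriority reads --runmode from args and returns the SRV priority
// the service should announce. (Hypothetical helper for illustration;
// the "prod" default is an assumption.)
func parsePriority(args []string) (int, error) {
	fs := flag.NewFlagSet("service", flag.ContinueOnError)
	runmode := fs.String("runmode", "prod", "run mode: dev or prod")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *runmode == "dev" {
		return devPriority, nil
	}
	return prodPriority, nil
}

func main() {
	priority, err := parsePriority(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("announcing with priority", priority)
}
```

Running the program with --runmode=dev yields priority 1, and running it without the flag yields 5, so a client that prefers the lowest priority value will choose the developer's instance whenever it is up.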
What good would it be if we told you what we do without sharing anything? We've open sourced a bunch of libraries to help you get started with SkyDNS and microservice communication:
Thanks for reading, and let me know in the comments below if you have any suggestions or questions.