As more organizations adopt distributed systems to store, access, and process data faster, they also face more real-time warnings and other critical conditions that can slow a system down or even affect the organization’s market value. For this reason, organizations large and small need monitoring for their complex distributed IT infrastructure.
Monitoring has become a key part of any distributed system. Selecting and designing a monitoring system appropriately addresses only half of what is needed to deliver efficient, reliable, and fast services, but its benefits are substantial. One of them is early warnings and notifications, which make it possible to fix issues at an early stage.
Automated responses that start whenever the system deviates from its normal behavior can play a large role in reducing downtime. Choosing an appropriate monitoring tool is therefore essential, and the choice depends on the characteristics of the system as well as on how the organization will use the tool.
Riemann: An Overview
Riemann is an open-source, event-based monitoring tool for distributed systems, including servers, hosts, and applications. It gathers the events that these systems emit and passes them to a powerful stream processing language, where they can be filtered, processed, and summarized.
Riemann can also act as a routing engine, forwarding events from distributed systems to integrated storage, analysis, or alerting services. Riemann is a stand-alone system: it needs no integration with other tools to process events and view the output, although that output can be lost quickly.
Its low latency gives users almost real-time data. Riemann is not just fast, but also simple and highly configurable, so it is easy to build checks and alerts on the state of incoming events. Because clients push their events to the server, events are visible within milliseconds, which makes it easier to identify problems quickly.
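As a rough illustration of a check on incoming events, the sketch below indexes every event and logs a warning for any event whose state is critical. The state value "critical" and the log message are assumptions chosen for illustration; your clients may report different states.

```clojure
; A minimal sketch of a riemann.config, not a production setup.
(let [index (index)]
  (streams
    ; Keep the latest event per host and service in the index.
    index
    ; Assumed convention: clients report problems with state "critical".
    (where (state "critical")
      #(warn "critical event" %))))
```

The same where filter could wrap an alerting stream, such as an email or chat notification, in place of the log call.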
Riemann Can Identify Problems Faster
Traditional monitoring systems typically run polling loops, for example every five minutes. In a Riemann infrastructure, clients push their events to the server, so events become visible within milliseconds. Thanks to the low latency, users see outages sooner and can also tell the instant a problem has been fixed.
Throughput depends on what your streams do with events. Even so, a stock Riemann config on commodity x86 hardware can handle millions of events per second at sub-millisecond latencies, with 99th-percentile latencies around 5 ms. Riemann is also fully parallel and uses Clojure and JVM concurrency primitives throughout.
Running Riemann
This section will take a look into how Riemann can be operated.
Changing the Config
If you are using the Debian or CentOS packages, Riemann adds a riemann user, stores its configuration in /etc/riemann/riemann.config, and logs to /var/log/riemann/riemann.log. Tailing the log file while working with Riemann is a good idea, since it alerts you to configuration errors and helps with debugging streams.
- tail -F /var/log/riemann/riemann.log
In another terminal, open /etc/riemann/riemann.config in your favorite editor. Adding a logging stream lets you see every event that passes through the streams. #(info %) expands into (fn [x] (info x)), a function that takes an event and logs it at the INFO log level.
(streams
  index
  …
  #(info "received event" %))
Next, reload the config file. Riemann responds to SIGHUP, or you can use the init scripts. A message about reloading the config appears in the Riemann log. If you make a mistake, such as a syntax error, Riemann keeps running with the old config and does not apply the new one. Instead, it logs an explanation of the problem, which makes it easier to investigate and fix.
Once the config is reloaded, you should see events flowing by in the log. Some originate from external sources. Others are produced by Riemann itself as part of its internal performance instrumentation. You can insert logging statements anywhere in the streams to verify what kind of events are flowing through that point.
Reloading is somewhat experimental and is subject to the normal Clojure rules about (def) and (defn). To avoid confusion, use let bindings rather than def so that reloads work as expected. If a reload appears broken, perform a full restart to make sure the changes take effect.
- sudo service riemann reload
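The advice to prefer let bindings over def could look like the sketch below. The threshold value and the names threshold and log-high are hypothetical; the point is only that nothing is defined at the top level, so a reload rebuilds everything from the file.

```clojure
; Hypothetical names and values, for illustration only.
(let [threshold 0.9
      ; A reusable stream bound with let instead of (def ...).
      log-high  (where (and metric (> metric threshold))
                  #(warn "metric above threshold" %))]
  (streams
    log-high))
```

Because threshold and log-high exist only inside the let, editing the value and reloading does not leave a stale top-level var behind.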
A Minimal Configuration
In this subsection, we introduce a minimal configuration that only prints incoming events to standard output. This is useful when debugging clients, to confirm that the expected events are reaching Riemann.
The configuration does the following:
- logging/init Setting up logging to log to standard output (STDOUT)
- tcp-server Starting a Riemann TCP server on port 5555 (the default)
- instrumentation Disabling internal event production
Because streams expects a list of functions to call with each individual event, every incoming event can be printed simply by passing prn.
(logging/init {:console true})
(tcp-server {})
(instrumentation {:enabled? false})
(streams
  prn)
Combining Functions From Multiple Files
The Riemann configuration file can grow very large as streams are added and more events are handled. In addition, if you manage Riemann with a configuration management tool such as Puppet or Chef, a huge configuration file becomes difficult to template. To help with this, Riemann lets you include other functions, such as your own custom stream functions, via Clojure’s namespacing model.
To enable this, Riemann includes the directory containing the riemann.config configuration file in the CLASSPATH.
To make use of other functions, first create a source directory for them. For instance, to hold functions that are particular to your organization, you might create an examplecom.etc namespace.
sudo mkdir -p /etc/riemann/examplecom/etc
After this, create a file in this directory, for example email.clj, containing a namespace with the functions that should be included.
; Create a new function with ns
(ns examplecom.etc.email
  (:require [riemann.email :refer :all]))
(def email (mailer {:from "riemann@example.com"}))
Then, require this namespace in riemann.config. After that, the email var can be used inside the Riemann configuration.
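A minimal sketch of that last step, assuming the examplecom.etc.email namespace from the earlier example: the require call goes in riemann.config, and the email var is then used like any other stream. The recipient address and the "critical" state are placeholders.

```clojure
; In /etc/riemann/riemann.config
(require '[examplecom.etc.email :refer :all])

(streams
  (where (state "critical")
    ; email is the var defined in examplecom.etc.email;
    ; the recipient address is a placeholder.
    (email "ops@example.com")))
```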
Using Riemann With Docker
Rather than installing Riemann directly on your system, you can run it with Docker. A prepackaged Docker image is available as riemannio/riemann.
With the image’s default configuration, Riemann indexes all incoming events and prints all expired ones. You must bind the necessary ports to the host system; otherwise, it is impossible to send events to Riemann.
If you expect a lot of traffic and do not need the containers to be isolated from each other, you can skip the Docker network and use the host network with the --net host flag. Note that the default configuration of Riemann in Docker listens on every network interface. For this reason, either configure the firewall appropriately or mount a separate configuration.
docker run --rm -p 5555:5555 riemannio/riemann
docker run --rm --net host riemannio/riemann
By default, the container uses a configuration file located at /etc/riemann.config, which you can override by mounting your own configuration into the container.
To do this, either mount the configuration file at the default location or override the default container command (/bin/riemann /etc/riemann.config).
docker run \
  --rm \
  -p 5555:5555 \
  -v "$(realpath riemann.config)":/etc/riemann.config \
  riemannio/riemann

docker run \
  --rm \
  -p 5555:5555 \
  -v "$(realpath myapp.clj)":/myapp.clj \
  riemannio/riemann \
  /bin/riemann /myapp.clj
To keep things manageable, you can also use Docker Compose to define how the Riemann container should be run.
version: "3"
services:
  riemann:
    image: riemannio/riemann:latest
    ports:
      - "127.0.0.1:5555:5555"
      - "127.0.0.1:5555:5555/udp"
      - "127.0.0.1:5556:5556"
    volumes:
      - ./riemann.config:/etc/riemann.
What Protocols Can Be Used to Talk to Riemann?
In most cases, use the Riemann TCP protocol. Use UDP if you want the OS and network to discard data automatically instead of relying on application-level flow control. However, do not choose UDP for speed. It is designed to drop data, which it will do, and you will be left trying to figure out what went wrong with your stats.
UDP is suitable for sampling events where you can afford to lose a large part of the data stream. However, if different nodes drop packets at different rates, the sample will be biased. Only use UDP when you know what this means.
Use HTTP only if a TCP client is unavailable; it incurs significant encoding and state machine overhead, takes more bytes on the wire, and can’t represent the full extent of Riemann events. The Graphite server and other compatibility shims should be used only for interop with legacy systems.
Note that the Graphite protocol can represent only a small subset of the Riemann data model, so you must do additional work to reconstruct meaningful events.
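For the TCP protocol, a client-side sketch using riemann-clojure-client might look like the following. It assumes that library is on the classpath and that a Riemann TCP server is listening on 127.0.0.1:5555; the service name and metric are placeholders.

```clojure
(require '[riemann.client :as r])

; Assumes a Riemann TCP server is listening on 127.0.0.1:5555.
(def c (r/tcp-client {:host "127.0.0.1" :port 5555}))

; send-event returns a promise; deref it with a timeout so that a dead
; connection does not block the caller forever.
(-> c
    (r/send-event {:service "example service" :state "ok" :metric 1})
    (deref 5000 ::timeout))
```

Dereferencing the result is what gives TCP its application-level flow control: the caller learns whether the server acknowledged the event, which UDP cannot offer.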
What Ports Does Riemann Use?
Riemann can run servers on several ports. The default ports are as follows:
- OpenTSDB on port 4242
- Graphite on port 2003
- TCP on port 5555
- TLS on port 5554
- UDP on port 5555
- Websockets on port 5556
- REPL on port 5557
You can change each server’s port in your configuration.
(let [host "0.0.0.0"
      iport 1234]
  (tcp-server {:host host :port iport})
  (udp-server {:host host :port iport}))
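The other servers can be configured in the same way. The sketch below assumes that ws-server and graphite-server accept the same :host and :port options as tcp-server; the port numbers are arbitrary examples, not recommendations.

```clojure
; Port numbers here are arbitrary examples.
(let [host "127.0.0.1"]
  ; Websockets server on a non-default port.
  (ws-server       {:host host :port 6556})
  ; Graphite-compatible listener on a non-default port.
  (graphite-server {:host host :port 3003}))
```

Binding to 127.0.0.1 instead of 0.0.0.0 keeps these servers reachable only from the local machine.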
How to Use TLS to Secure Traffic
Riemann can be used over any TCP or UDP VPN, or through an SSH tunnel. Riemann also supports bidirectional TLS authentication, but only as part of its TCP protocol.
In riemann.config, add a new (or repurpose an existing) TCP server for TLS.
(tcp-server {:host "0.0.0.0"
             :port 5554
             :tls? true
             :key "riemann_server.pkcs8"
             :cert "riemann_server.crt"
             :ca-cert "ca.crt"})
For the client TLS options, refer to your client’s documentation or source. In riemann-clojure-client, try something like:
(riemann/tcp-client {:host "1.2.3.4"
                     :port 5554
                     :tls? true
                     :key "riemann_client.pkcs8"
                     :cert "riemann_client.crt"
                     :ca-cert "ca.crt"})
How to Override Riemann Functions
Sometimes Riemann does not work the way you want. In that case, you can redefine any Riemann function in the config file: switch to the right namespace, redefine the function, and switch back to riemann.config.
For example, suppose you want to change how emails are formatted. The namespace riemann.common has a function called body, which accepts a sequence of events and returns a string. We can override it in the config file:
(ns riemann.common)
(defn body [events]
  ; pr-str formats events as a clojure-readable string.
  (pr-str events))
(ns riemann.config)
; And then your servers, streams, etc…
(tcp-server …)
(streams …)
Sharding
Because Riemann can perform arbitrary computation, it cannot shard your workload automatically. Instead, perform some local aggregation on one set of Riemann nodes and reduce the results on higher-level Riemann nodes. You can connect these nodes with the forward stream.
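A sketch of the lower-level node in such a setup is shown here, under the assumption that the higher-level node is reachable at the hypothetical hostname aggregator.example.com and that the events of interest use the placeholder service name "api requests".

```clojure
; "aggregator.example.com" is a hypothetical hostname.
(let [upstream (riemann.client/tcp-client {:host "aggregator.example.com"
                                           :port 5555})]
  (streams
    (where (service "api requests")
      ; Local aggregation: emit one rate event every 10 seconds
      ; instead of forwarding every raw event.
      (rate 10
        ; Send the aggregated event to the higher-level node.
        (forward upstream)))))
```

The higher-level node then receives one summarized event per interval from each lower-level node and can combine them with its own streams.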
How to Troubleshoot Missing Events
First, figure out whether the events you are looking for are making it into Riemann at all. This localizes the issue to either the network or the streams. Add a new top-level stream that simply logs all inbound events; you can wrap that logging stream in a (where) filter to cut down on noise.
(streams
  #(info %))
(streams
  (where (service "something you're looking for")
    #(info %)))
If you do not see the events you are looking for in the logs, they are being dropped before they arrive at the streaming system. Check the following:
- Riemann clients are sending to the correct Riemann host and port.
- Riemann is listening on that host and port. Double-check the config (e.g. (tcp-server) options), that the appropriate server startup message appears in the Riemann logs, and that netstat -lntp | grep riemann shows the port bound.
- The network can transmit packets from the client machine to the Riemann server. Use telnet some.host 5555 or nmap -sT some.host -p 5555.
- The packets are actually making it from the client to the server. UDP messages can be delayed, re-ordered, or duplicated at any time by the network. They can also be dropped if the receiving node’s receive buffer is full. If this is a problem, use TCP instead.
If the messages reach the streaming system but do not arrive at the index or another output stream, move the #(info %) stream downstream, step by step, closer to the exit point. This shows the point in the chain at which the events stop being what you expect.
If you index events but they don’t appear in queries, they may have expired from the index. Try (expired #(warn "expired" %)) to warn you about all expired events. Events expire if their TTL is shorter than the interval between events. They may also expire because the Riemann clock is ahead of the originating event’s timestamp, due to clock skew or network/processing latency.
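One way to wire this in is sketched below. The 60-second default TTL and the 5-second expiry check interval are illustrative values, not recommendations.

```clojure
; Scan the index for expired events every 5 seconds.
(periodically-expire 5)

(let [index (index)]
  (streams
    ; Events without a TTL get 60 seconds before they expire.
    (default :ttl 60
      index
      ; Log every event that has expired from the index.
      (expired #(warn "expired" %)))))
```

If the warnings show events expiring sooner than expected, compare their TTL with how often the client actually sends them.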
You should also look at the internal instrumentation for the Riemann queue depth and core latencies. If the queue is more than a handful of events deep, or stream latencies are on the order of the relevant event TTLs, you may need to reduce the load on the Riemann server.
Check the clock skew between nodes using watch date. Make sure the clocks are set to UTC, never local time, and especially not local time with daylight saving time. You can also use Riemann’s clock-skew stream to measure clock skew.
(streams
  #(info %) ; First, measure here
  (where (service #"^riak .+")
    #(info %) ; Then move the info stream here to check the filter
    (by :service
      (coalesce
        #(info %) ; Third, check the coalesced vector of events
        (smap folds/maximum
          #(info %) ; Fourth, probe here to check the maximum calculation
          (with :host nil
            #(info %) ; Finally, check exactly what events are being applied
                      ; to the index.
            index))))))
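For the clock-skew stream mentioned earlier in this section, a minimal sketch could look like this. It assumes that clock-skew passes its children events whose metric is the host’s time difference, in seconds, from the median across hosts; the 5-second threshold is arbitrary.

```clojure
(streams
  (clock-skew
    ; Assumed: metric is this host's offset from the median host time.
    ; Warn when a host is more than 5 seconds ahead or behind.
    (where (and metric (or (> metric 5) (< metric -5)))
      #(warn "clock skew detected" %))))
```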
How to Measure Riemann Itself
Riemann has built-in instrumentation, which is sampled periodically and injected into the event stream. Query the dashboard for service =~ "riemann %" to see Riemann’s internal metrics. All latencies are measured in milliseconds.
The riemann streams metrics describe the stream processor. Throughput measures the number of events passed to streams every second, and latency describes the time it takes to process an event through the streaming system.
The UDP, TCP, and Websockets servers instrument their throughput and latency. TCP latency measures the time from initial message deframing to the Netty write of the response. For example, riemann server tcp 1.2.3.4:5555 in latency 0.95 measures the 95th percentile time, in milliseconds, for the TCP server on port 5555 to parse, queue, process, and dispatch a response for a message.
riemann netty event-executor queue size measures the main queue that connects Netty IO threads to the parsing/execution thread pool. It is an important indicator of whether the streaming system is overloaded relative to the inbound request load.
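To watch these internal metrics in the log rather than the dashboard, a sketch along the following lines may help. It assumes the instrumentation function accepts an :interval option in seconds and that internal services are named with a "riemann " prefix, as in the examples in this section.

```clojure
; Sample internal metrics every 5 seconds (interval chosen for illustration).
(instrumentation {:interval 5})

(streams
  ; Internal metrics use service names beginning with "riemann ".
  (where (service #"^riemann ")
    #(info "internal metric" %)))
```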
How to Measure Memory, CPU, and Disk Use
The riemann-tools package includes a program called riemann-health, which measures the local host’s CPU, memory, and disk use. It can be installed from RubyGems.
- gem install riemann-tools
Then, run riemann-health with the address of your Riemann server as follows:
- riemann-health --host 1.2.3.4
By default, riemann-health reports utilization as fractions ranging from zero to one: the load average is divided by the number of cores, and disk use is reported as a fraction of capacity. The polling interval can also be adjusted.
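On the Riemann side, these fractions make threshold checks straightforward. The sketch below assumes that riemann-health reports memory use under a service named "memory"; both that name and the 0.9 threshold are assumptions to verify against the events you actually receive.

```clojure
(streams
  ; Assumes a "memory" service whose metric is a fraction between 0 and 1;
  ; 0.9 is an arbitrary threshold.
  (where (and (service "memory") metric (> metric 0.9))
    #(warn "memory use above 90%" %)))
```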
Conclusion
Riemann is an effective monitoring tool that aggregates events from servers and applications and processes them with a powerful stream processing language. It is easy to use and can do many things, including combining events from different services and hosts. Knowing how to run, configure, and troubleshoot it makes monitoring and measuring events much easier.