This article builds on the one Mr. Karim published on the Couchbase blog. If you haven’t read that article, it is worth a look first, so that we can elaborate on in-depth concepts here based on what you learnt there. As the title suggests, this piece dives deep into creating robust solutions by integrating such tools with basic system design principles. However cool these hacker-like tools may look, they must be robust and able to scale up to an organization's needs. This article can be read as a case study of integrating Prometheus and Grafana with the Couchbase Exporter.
Who is the target audience?
Any industry professional or developer who wants to use this integration of tools but can clearly see its scalability (more accurately, synchronization) and automation flaws. Also, anyone genuinely interested in developing robust monitoring solutions that function in an automated, intelligent manner with as little human intervention as possible.
Definitive Key Takeaways
Time Saver Tip: You can skip this section if the out-of-the-box (OOB) CB-Exporter works fine for you.
The Couchbase exporter written by totvslabs provides us with a client that scrapes 4 REST endpoints, namely /pools/default, /pools/default/tasks, /pools/default/buckets, and /pools/nodes. We can run this exporter with the following command.
./couchbase-exporter --couchbase.username Admin --couchbase.password pass --web.listen-address=":9420" --couchbase.url="http://52.38.11.73:8091"
Now, it is important to understand that a couchbase-exporter process binds itself to the listening port and scrapes stats from a given Couchbase server. The OOB implementation scrapes data from the above-mentioned REST endpoints only.
In order to run this exporter process we can either use nohup, so that the process keeps running for an elongated period of time, or create a service that accepts the parameters as command-line arguments.
Both approaches will work fine, but we have opted for the process-based nohup approach to keep things simple with respect to removing a target from the monitoring task.
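For example, the earlier command can be wrapped in nohup roughly like this (the log file name is just an illustration):
nohup ./couchbase-exporter --couchbase.username Admin --couchbase.password pass --web.listen-address=":9420" --couchbase.url="http://52.38.11.73:8091" > cb_exporter_52.38.11.73.log 2>&1 &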
Depending on your use case, you might want to observe new metrics exposed by different endpoints. We will walk through a simple procedure for adding such a new metric to this couchbase-exporter tool and building your own variant.
But limitations with respect to monitoring N1QL queries, active requests, etc. still exist. So, in this section we will discuss a methodology for adding new REST endpoints for monitoring. Please follow along with the steps mentioned below.
Step 1: Find the REST endpoint you want to get stats from and query it via a web browser or Postman with proper credentials. Let’s say we pick the /indexStatus endpoint. Now, copy the response into a JSON-to-Go struct converter and you will get the struct in which you will temporarily store your JSON response.
type Index struct {
    Indexes struct {
        StorageMode string   `json:"storageMode"`
        Partitioned bool     `json:"partitioned"`
        InstID      uint64   `json:"instId"`
        Hosts       []string `json:"hosts"`
        Progress    int      `json:"progress"`
        Definition  string   `json:"definition"`
        Status      string   `json:"status"`
        Bucket      string   `json:"bucket"`
        Indx        string   `json:"index"`
        ID          uint64   `json:"id"`
    } `json:"indexes"`
    Version  int           `json:"version"`
    Warnings []interface{} `json:"warnings"`
}
Remember: Prometheus metric values are stored as float64. Hence, it’s better to convert uint64 fields to float64 (and booleans via a helper such as fromBool) before feeding them into Prometheus through the collector objects.
Step 2: Now, create an index.go file where we will put this struct (just copy-paste). Based on whether your REST endpoint returns a single response or an array of responses, copy the constructor initialization from cluster.go or tasks.go respectively. Ours is an array of responses for /indexStatus, containing stats for multiple indexes if multiple indexes are declared. Hence, we copy the initialization style of tasks.go. But I’d recommend you first try something with a simpler structure, similar to the cluster.go client REST endpoint.
func (c Client) Indexes() ([]Index, error) {
    var index []Index
    err := c.get("/indexStatus", &index)
    return index, errors.Wrap(err, "failed to get indexes")
}
Now, for our case, we will use the tasks.go file as our reference when creating the metrics we want to observe. If you used a struct similar to the cluster.go file, then use that as your reference instead.
Step 3: Next, we will create a collector object that takes the values stored in the client struct object and exposes the metrics we are interested in to the Prometheus datastore. Now, create index.go in the collector directory to perform the above-mentioned task.
// string data-type fields are commented out, as Prometheus can't use them as metric values.
type indexCollector struct {
    mutex  sync.Mutex
    client client.Client

    up             *prometheus.Desc
    scrapeDuration *prometheus.Desc

    indexesStorageMode *prometheus.Desc
    indexesPartioned   *prometheus.Desc
    indexesInstID      *prometheus.Desc
    // indexesHosts *prometheus.Desc
    indexesProgress *prometheus.Desc
    // indexesDefinition *prometheus.Desc
    indexesStatus *prometheus.Desc
    // indexesBucket *prometheus.Desc
    // indexesIndx *prometheus.Desc
    indexesID    *prometheus.Desc
    indexVersion *prometheus.Desc
    // indexWarnings *prometheus.Desc
}
We will then create a NewIndexCollector function that defines the new metrics we are interested in; see below.
func NewIndexCollector(client client.Client) prometheus.Collector {
    const subsystem = "index"
    // nolint: lll
    return &indexCollector{
        client: client,
        up: prometheus.NewDesc(
            prometheus.BuildFQName(namespace, subsystem, "up"),
            "Couchbase cluster API is responding",
            nil,
            nil,
        ),
        scrapeDuration: prometheus.NewDesc(
            prometheus.BuildFQName(namespace, subsystem, "scrape_duration_seconds"),
            "Scrape duration in seconds",
            nil,
            nil,
        ),
        indexesStorageMode: prometheus.NewDesc(
            prometheus.BuildFQName(namespace, subsystem, "indexes_storage_mode"),
            "Mode of Index Storage",
            nil,
            nil,
        ),
        indexesPartioned: prometheus.NewDesc(
            prometheus.BuildFQName(namespace, subsystem, "indexes_partioned"),
            "Partitioned Indexes",
            nil,
            nil,
        ),
        indexesInstID: prometheus.NewDesc(
            prometheus.BuildFQName(namespace, subsystem, "indexes_inst_id"),
            "Inst Id of Index",
            nil,
            nil,
        ),
...
    }
}
Then we implement the Describe method, which sends each metric descriptor into the channel provided by Prometheus.
func (c *indexCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.up
    ch <- c.scrapeDuration
    ch <- c.indexesStorageMode
    ch <- c.indexesPartioned
    ch <- c.indexesInstID
    // ch <- c.indexesHosts
    ch <- c.indexesProgress
    // ch <- c.indexesDefinition
    ch <- c.indexesStatus
    // ch <- c.indexesBucket
    // ch <- c.indexesIndx
    ch <- c.indexesID
    ch <- c.indexVersion
    // ch <- c.indexWarnings
}
Then, with the Collect function, like in tasks.go, we insert the metrics into Prometheus as float64 values (converting booleans via fromBool). Remember to write the loop over the indexes as it is written in the tasks.go file.
indexes, err := c.client.Indexes()
// handle err as in tasks.go, then loop over the indexes
...
// sample code inside the loop (one set of metrics per index)
ch <- prometheus.MustNewConstMetric(c.indexesStorageMode, prometheus.GaugeValue, fromBool(index.Indexes.StorageMode == "plasma"))
ch <- prometheus.MustNewConstMetric(c.indexesPartioned, prometheus.GaugeValue, fromBool(index.Indexes.Partitioned))
ch <- prometheus.MustNewConstMetric(c.indexesInstID, prometheus.GaugeValue, float64(index.Indexes.InstID))
...
// sample code outside the loop
ch <- prometheus.MustNewConstMetric(c.up, prometheus.GaugeValue, 1)
ch <- prometheus.MustNewConstMetric(c.scrapeDuration, prometheus.GaugeValue, time.Since(start).Seconds())
Step 4: Now, the main.go file needs to be altered so that the new index metrics are registered, collected, and put into Prometheus.
...
    nodes   = app.Flag("collectors.nodes", "Whether to collect nodes metrics").Default("true").Bool()
    cluster = app.Flag("collectors.cluster", "Whether to collect cluster metrics").Default("true").Bool()
    index   = app.Flag("collectors.index", "Whether to collect index metrics").Default("true").Bool()
)
...
The index flag needs to be added in the var definition section as shown above, and the collector needs to be registered in the main section as shown below.
if *cluster {
    prometheus.MustRegister(collector.NewClusterCollector(client))
}
if *index {
    prometheus.MustRegister(collector.NewIndexCollector(client))
}
Step 5: Now you need to use the Makefile to build your own variant of couchbase-exporter. But before that, install the prerequisites with the following command
make setup
and after that comment out the grafana dependency of the build target shown below, as those libraries won't have been installed and will give an error.
build: grafana
 go build
.PHONY: build
After that, run the following three commands as mentioned in the guide.
# test if all dependencies are properly installed
make test
# Build the couchbase-exporter
make build
# Finalize by running the linters
make ci
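Once your variant is running, you can sanity-check that the new index metrics are actually exposed; a quick, illustrative check against the exporter's standard /metrics endpoint looks like this (port taken from the earlier command):
curl -s http://localhost:9420/metrics | grep index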
Now we have finished building our own variant of couchbase-exporter. That step may be optional for many developers and users, but discussing it is quite important for our reference guide. We can now move on to an automated approach that orchestrates communication between all these tools, which individually are working perfectly fine.
Salability & Total Fallback Recovery: An Automated Solution Approach
So far we have discussed how to maximize the capabilities of each tool in this integration project. Now we will try to orchestrate these tools to serve the bigger picture.
A satisfactory solution: the solution must start or stop monitoring of the Couchbase VMs with single commands only. There shouldn’t be any need to manually add, remove or maintain the targets.json and configuration files, for starters.
We want to utilize the capabilities of our own network, where multiple VMs can communicate. Hence, we can have an HTTP-server-based REST API with which we can make these target entries appear and disappear at our disposal using curl commands, as sketched below.
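As a sketch of the idea, adding or removing a Couchbase VM from monitoring could then be a single request. The endpoint names and payload below are assumptions; they depend entirely on how you write your own server.
# hypothetical endpoints -- adjust to your own server implementation
curl -X POST http://<main-vm>:5000/targets -H "Content-Type: application/json" -d '{"couchbase_host": "52.38.11.73"}'
curl -X DELETE http://<main-vm>:5000/targets/52.38.11.73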
Hence, the diagram below explains the HTTP server approach: a service running side by side in our main VM that automatically starts and stops CB-Exporter and Prometheus processes and also maintains the targets.json files.
Basically, we also add a targets.json file for the couchbase-exporter tool, which keeps track of all the Couchbase VMs added or removed for monitoring purposes.
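For reference, the Prometheus-side targets.json follows Prometheus's standard file-based service discovery format, for example:
[
  {
    "targets": ["localhost:9420"],
    "labels": { "job": "couchbase" }
  }
]
and prometheus.yml picks it up via a file_sd_configs block (the job name and file path here are illustrative):
scrape_configs:
  - job_name: 'couchbase'
    file_sd_configs:
      - files:
          - 'targets.json'
The couchbase-exporter targets.json, on the other hand, is our own bookkeeping file, so its layout is whatever our HTTP server writes.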
As the flow diagram shows, the HTTP server adds/removes entries from the targets.json files for both couchbase-exporter and Prometheus. The utility functions start and stop the couchbase-exporter processes based on those entries alone. Hence, the complete tool works as an orchestration with this functionality. For code, refer to this repository section.
For a full system recovery, say after all the processes are shut down (which is common, since VMs get rebooted all the time), we just need a series of commands to execute, as we already have the record of the targets we want to monitor. Only for couchbase-exporter does the command need to be executed again and again, once per target; for Prometheus it is a one-time effort.
Total Recovery Script Procedures:

1. Start Grafana, Prometheus, Node Exporter and AlertManager Server.

2. Iterate over the targets.json of couchbase-exporter and start the processes using the utility scripts written for the HTTP server.
Hence, the complete restoration can be done with just this two-step procedure, scripted simply in Python. Keep this script as your homework, and feel free to make a PR for me 😉. Please try to use subprocess.Popen() if you write it in Python; a skeleton is sketched below.
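To give you a head start, here is a minimal, untested skeleton. Every path, service name, port and the targets.json layout are assumptions you will need to adapt to your setup.
import json
import subprocess

# 1. Start the long-running servers (service names are illustrative).
for cmd in (["systemctl", "start", "grafana-server"],
            ["systemctl", "start", "prometheus"],
            ["systemctl", "start", "node_exporter"],
            ["systemctl", "start", "alertmanager"]):
    subprocess.run(cmd, check=True)

# 2. Start one couchbase-exporter process per recorded target.
with open("targets.json") as f:
    targets = json.load(f)  # assumed layout: list of {"host": ..., "port": ...}

for t in targets:
    subprocess.Popen([
        "./couchbase-exporter",
        "--couchbase.username", "Admin",
        "--couchbase.password", "pass",
        "--web.listen-address", ":{}".format(t["port"]),
        "--couchbase.url", "http://{}:8091".format(t["host"]),
    ])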
Author Experience:
I used Python for developing the given server and utility-function scripts. But while developing the couchbase-exporter tool, I realized that a Golang-based web server deployment is a much better solution.
Comment on High Availability
Prometheus runs as a standalone instance on a given VM. That creates a problem: if our Prometheus VM goes down, our backend data will be lost for that period. This is a huge problem if our primary monitoring relies on Prometheus running as a single instance only. Hence, we will need a high-availability-based solution to mitigate this issue. But we would also want our HTTP-server-based automation architecture not to go to waste and to be integrated with ease.
What would that solution look like? We definitely have to run identical Prometheus servers (let's say two, at minimum) to achieve this, and the data has to be posted to both of them. Hence, there are separate couchbase-exporter processes for each of them, with the list of targets being a global one for consistency. Therefore, we will have a VIP (or a user sending duplicate POST requests) that POSTs and DELETEs data on both targets at the same time while also maintaining the global list, for the consistency of our solution. The HTTP server will be running on both VMs, to which the POST/DELETE requests will be communicated via the VIP. Also, another HTTP server on each VM will maintain the consistency of the targets.json files via a simple gossip-style protocol. Finally, two AlertManagers connected in a mesh will receive the metrics from either Prometheus datasource to raise alerts, which will then be deduplicated and sent across multiple communication channels. Below is the abstract diagram representation of the solution.
Now, any alerts that are raised will be sent to each of the AlertManagers running in the mesh cluster in this architecture.
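For reference, recent Alertmanager versions form such a mesh with their built-in clustering flags; a sketch for two VMs (the addresses are placeholders) might look like this:
# on VM 1
alertmanager --config.file=alertmanager.yml --cluster.listen-address=0.0.0.0:9094 --cluster.peer=<vm2-ip>:9094
# on VM 2
alertmanager --config.file=alertmanager.yml --cluster.listen-address=0.0.0.0:9094 --cluster.peer=<vm1-ip>:9094
Each Prometheus instance is then pointed at both AlertManagers, so either one can deliver the alert while deduplication prevents double notifications.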
AlertManager’s deduplication concept will help us send the right alerts via the right medium to the designated users. This is a good monitoring solution, with lots of metrics available for monitoring. But again, for business-critical applications, even the highly available solution we have discussed needs to be tested robustly. Happy monitoring and alerting to you!!
Conclusion
With this article we have tried to discuss almost all the important aspects of these tools. In summary, we have covered building Grafana dashboards, Prometheus monitoring and alerting analysis with its multiple tools, creating a custom couchbase-exporter variant, writing automated solutions, the full-blown recovery homework, and a discussion of high availability for these systems and possible solutions. I hope you have learnt new things about these tools and will create your own scalable monitoring solutions, and that you will be kind and considerate enough to share them with us.
In case there is something we have failed to address or have misinterpreted, please either open an issue or write a response; we would appreciate your time in improving this article.
There is still scope for improvement in the current solution, such as consistency in the HA setup, demonstrating alert de-duplication, syncing multiple AlertManagers, etc. We would really appreciate it if you could share your findings with others as well.
BTW, wouldn't it be nice if we could also do predictive analytics on this monitored data?
My work here is done. This is the way.