This article serves as a logical continuation of my previous piece, "The Internet of Things: Humanity’s New Nervous System".

Building upon the conceptual framework established earlier, I want to pivot from the top-level vision to a more granular, component-level deep dive into this transformative technology. The core objective this time is to address a critical, high-stakes decision faced by every software architect and product manager in the space:

"What is the optimal path for a new IoT project: building a proprietary solution from the ground up, or strategically leveraging the robust, existing offerings on the market?"

Main Idea and Architecture

To truly appreciate the gravity of the build-vs-buy decision, let's ground our discussion in a tangible, high-stakes scenario—the modern oil and gas industry.

Imagine us as Texas oil magnates. We operate two hundred pumpjacks scattered across the vast Texan landscape, steadily extracting crude oil day after day. Central to our operation are the flow meters at each wellhead, which provide critical data on volumetric output. Monitoring these readings is paramount for process regulation and maximizing yield.

Historically, this has been a labor-intensive effort: specialized field crews, driving pickups from site to site, manually logging readings and making on-the-spot operational decisions.

Now, consider a sudden, significant expansion—the successful acquisition of an additional 100 wells. While this is a financial boon, it presents an immediate logistical crisis. Our existing field crews are already operating at capacity; integrating another 100 wells into their manual rounds is impossible, and the overhead cost of hiring and training new, large teams significantly erodes the profit margin.

This is the precise moment when technology intervenes, making the business case for IIoT (Industrial Internet of Things) undeniable.

The solution is clear: Intelligently augment the flow meter. By enabling the wellhead device not just to measure, but to continuously and securely transmit its data stream back to a central command center, we completely transform our operating model.

The core benefit is the shift from routine, inefficient site visits to data-driven, predictive dispatch. Our field crews are no longer reactive surveyors; they become strategic troubleshooters, directed only to the sites where the telemetry indicates a true need for intervention. This small, yet incredibly powerful, deployment represents the essence of modern IIoT.

But before delving into the architecture, let's briefly recall the main components of a successful IoT infrastructure described in the previous article.

These four pillars form the essential pipeline that transforms raw field data into actionable business intelligence:

Well then, let's get started!

Sensing Level

Covering the Device Layer (Sensing Layer) - equipping our pumpjacks with smart flow meters - is relatively straightforward. The modern market is saturated with suitable Commercial Off-the-Shelf (COTS) solutions, ranging from hardened industrial-grade sensors to highly customizable, cost-effective options like a Raspberry Pi augmented with an integrated GSM/LTE module. We can refer to this edge node simply as the "Transmitter".
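To make the Transmitter concrete, here is a minimal sketch of how such an edge node might package a flow-meter reading before sending it upstream. The field names and device ID scheme are illustrative assumptions, not a standard:

```python
import json
import time
from typing import Optional

def build_telemetry(device_id: str, flow_bbl_d: float,
                    ts: Optional[int] = None) -> bytes:
    """Package one flow-meter reading as compact JSON, ready for the
    transmitter to publish over its GSM/LTE link. The schema here is
    purely illustrative."""
    record = {
        "device_id": device_id,
        "flow_bbl_d": round(flow_bbl_d, 2),
        "ts": ts if ts is not None else int(time.time()),
    }
    # MQTT payloads are raw bytes, so we encode the JSON up front.
    return json.dumps(record, separators=(",", ":")).encode("utf-8")

payload = build_telemetry("well-042", 118.377, ts=1700000000)
```

Keeping the payload compact matters on a metered cellular link, which is why the sketch strips whitespace from the JSON.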

What about the "Network" level? This is where we approach the core nervous system of the IoT, its synapses.

IoT Synapses (Network)

There is a vast array of methods for transmitting data from a smart sensor to the central storage and processing module. However, the most popular and robust solution, by far, remains message queuing. This preference is intrinsically linked to the nature of IoT: it is fundamentally a continuous stream of data, or telemetry. The event-driven architectural approach is ideally suited for this workload:

As you might already surmise, microservices are the second critical component in the modern IoT world. This architectural style adds essential flexibility and scalability to handle the varying loads and complex processing needs of millions of devices. Still, we'll dive into that topic in detail a bit later.

“So, which queuing solutions are best suited for these needs?”

The truth is, nearly any reliable message queue can fulfill this fundamental requirement. The most common and powerful contenders in the IoT space include:

However, the reality of the edge layer is that we are talking about smart, yet resource-constrained devices with limited processing power, memory, and bandwidth. Working directly with the "heavyweight" message queues we discussed earlier (like Kafka or standard RabbitMQ connections) can be overly challenging, draining the device's battery and resources.

This is where a special class of messaging protocols comes into play—protocols that, while perhaps not exclusively developed for IoT, have found their most widespread and impactful application within the technology.

The most prominent and successful example is MQTT (Message Queuing Telemetry Transport).

MQTT

MQTT is a lightweight, open-standard messaging protocol built on the Publish-Subscribe (Pub/Sub) pattern.

The initial version was published in 1999 by Andy Stanford-Clark from IBM and Arlen Nipper from Cirrus Link. They conceived MQTT as a method for maintaining reliable machine-to-machine (M2M) communication over networks with limited bandwidth or unpredictable connectivity.

This focus on constraint is critical: one of its very first use cases involved ensuring continuous contact between segments of an oil pipeline and central command links via satellite—a true testament to its reliability in hostile, low-resource environments.

The fundamental difference between MQTT and the standard web paradigm (like HTTP, which uses a direct Request/Response model) is architectural: devices do not communicate directly with one another.
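That indirection can be sketched in a few lines. The in-memory broker below is not MQTT, but it captures the Pub/Sub decoupling: publishers and subscribers only ever see the broker and a topic name, never each other.

```python
from collections import defaultdict

class TinyBroker:
    """An in-memory sketch of the Pub/Sub pattern MQTT is built on."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, callback):
        # Register a subscriber's callback for a topic.
        self._subs[topic].append(callback)

    def publish(self, topic, payload):
        # Fan the message out to every subscriber of this topic.
        for cb in self._subs[topic]:
            cb(topic, payload)

broker = TinyBroker()
received = []
broker.subscribe("wells/flow", lambda t, p: received.append((t, p)))
broker.publish("wells/flow", b"118.4")
```

The publisher above never learns who (if anyone) is listening, which is exactly what lets thousands of field devices and backend consumers evolve independently.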

There are only three main elements in an MQTT system:

Any data published to or received from an MQTT Broker is encoded in a binary format, as MQTT is a binary protocol. This means that to access the original, readable content - whether it's JSON, XML, or plain text - the message must first be correctly interpreted (decoded) by the subscribing client.

Notably, MQTT's built-in security is minimal: it supports simple username/password authentication, but transmits those credentials in cleartext. To protect transmitted information from interception, brokers layer the protocol over TLS (SSL).

As a result, our scheme smoothly transforms into:

Or we can use one topic for all devices, simply by adding the device ID to the message:
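Either topic scheme is straightforward to handle in code. A small sketch (the "wells/<id>/flow" naming is a hypothetical convention, not part of MQTT) showing how a subscriber recovers the device ID from a per-device topic:

```python
def device_id_from_topic(topic: str) -> str:
    """Extract the device ID from a per-device topic such as
    'wells/well-042/flow' (hypothetical naming convention)."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "wells":
        raise ValueError(f"unexpected topic: {topic}")
    return parts[1]

# With per-device topics, a single wildcard subscription 'wells/+/flow'
# matches every well, and the ID travels in the topic itself.
well = device_id_from_topic("wells/well-042/flow")
```

With the single-topic alternative, the subscriber instead reads the device ID out of the decoded message body; the trade-off is coarser access control and no per-device wildcard filtering.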

Data Processing and Storage

So, we have established our architecture:

As you correctly surmised, the Broker level is precisely where we cross the boundary from the constrained IoT sensor layer to the unconstrained data processing layer. In the processing layer, our architectural imagination is only limited by our financial budget and engineering skills.

It is now time to learn how to consume this data from the other side of the Broker.

The first problem we immediately encounter is a significant limitation of MQTT: it offers no long-term data persistence. Aside from a single optional "retained" message per topic and short-lived session queues for disconnected subscribers, the broker stores nothing. In simpler terms: if no Subscriber is actively listening when a Publisher sends a message, that message is sent into the void and lost forever.

To solve this critical gap, a common and highly effective pattern is employed: a dedicated, lightweight, and fast microservice is deployed as the primary Subscriber for all topics. This dedicated microservice acts as the crucial bridge between the low-power MQTT world and the high-throughput processing world.
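A minimal sketch of that bridge, using an in-process queue as a stand-in for the real durable queue (in production, a library like paho-mqtt would supply the MQTT client, and RabbitMQ or Kafka the durable queue; the wiring is omitted here):

```python
import json
import queue

# Stand-in for the downstream AMQP queue (RabbitMQ/Kafka in production).
downstream: "queue.Queue[dict]" = queue.Queue()

def on_mqtt_message(topic: str, payload: bytes) -> None:
    """Bridge callback: decode the sensor payload and hand it off to the
    durable queue so readings survive even if consumers lag behind.
    With paho-mqtt, this would be registered as the client's
    message handler."""
    record = json.loads(payload.decode("utf-8"))
    record["topic"] = topic  # keep provenance for downstream consumers
    downstream.put(record)

on_mqtt_message("wells/well-042/flow", b'{"flow_bbl_d": 118.4}')
```

The key design point is that this service does almost nothing: it only decodes and forwards, so it can keep up with the full firehose of device traffic while heavier processing happens behind the durable queue.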

“Do you remember how I mentioned the Event-Driven Design philosophy earlier? What drives our events now?”

“Another Message Queue!”

This time, we have zero constraints. The data is now off the constrained network and residing on robust cloud/server infrastructure. We can now select any robust AMQP (Advanced Message Queuing Protocol) solution, such as RabbitMQ or the more heavy-duty Apache Kafka, to handle persistence, complex routing, and scaling across multiple consumer microservices.

As a result, our architecture seamlessly acquires all the main components:

When it comes to the Total Cost of Ownership (TCO) of such a self-hosted solution, a machine with 4 vCPU, 16 GB RAM, and 100 GB of storage will be sufficient, costing us $100-200 per month. Of course, you could run your own on-premises hardware for greater savings in the long run.

Failure point analysis

Now, let's step back and critically review the resulting architecture to identify the most critical components that could bring the entire system down.

It's clear, even to the naked eye, that the MQTT Broker is the most vulnerable and critical component.

Not only does it serve as the essential bridge between the constrained sensor layer and the powerful data processing layer, but a self-hosted broker also introduces significant limitations:

When dealing with a few dozen connected devices, setting up your own broker and gradually patching these limitations might be feasible.

However, since we are major oil producers with, say, 300 wellheads and ambitious long-term digitalization plans, we need to consider more robust alternatives. The smart move is to look toward major cloud providers and the highly available, fully managed IoT services they offer.

Cloud IoT Solutions

So what solutions do today's leading cloud providers offer us?

AWS IoT Core

The AWS platform offers a formidable suite of solutions tailored specifically for the Internet of Things (IoT) domain, providing the necessary tools for architecting, deploying, and managing large-scale device fleets.

AWS breaks down the key IoT challenges into highly focused, managed services:

For our current task - architecting a robust and scalable IoT solution - we will primarily focus on AWS IoT Core. This service is a game-changer; it virtually eliminates much of the inherent boilerplate and complexity associated with secure device connectivity.

AWS IoT Core acts as a comprehensive Registry of "Things" with a vast array of surrounding functionality. When a new device (a "Thing") is added to the Registry, the service automatically takes care of critical security steps:

The "cherry on top" for developers is the streamlined onboarding process. Upon "Thing" creation, AWS IoT Core provides a nearly ready-to-use and fully configured package for our device's transmitter or client code. With support for a wide selection of programming languages and environments via the AWS SDKs, this capability significantly accelerates the initial development and integration process.

In addition to connectivity, every created "Thing" can have a virtual representation known as a "Device Shadow".

A Device Shadow enables your application to work with a virtual representation of the device even when the physical device is not connected to the cloud. You can retrieve the last reported state, fetch various data points, and create **delayed commands** that will be automatically delivered to the device once it reconnects. This is crucial for maintaining a responsive user experience and reliable device management, regardless of connectivity status.
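The core of the shadow mechanism is the "delta": the portion of the desired state the device has not yet confirmed. A simplified, flat-state sketch of that computation (real AWS shadow documents are nested and versioned):

```python
def shadow_delta(desired: dict, reported: dict) -> dict:
    """Compute the delta a device shadow resolves: desired settings
    the physical device has not yet reported back. Simplified,
    flat-state sketch of the shadow semantics."""
    return {k: v for k, v in desired.items()
            if reported.get(k) != v}

# An operator changed the sampling interval while the device was offline;
# on reconnect, only the unconfirmed setting is delivered.
desired = {"sample_interval_s": 60, "pump_enabled": True}
reported = {"sample_interval_s": 300, "pump_enabled": True}
delta = shadow_delta(desired, reported)
```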

By modifying our architecture, we get:

And if we go even deeper, then, in essence, the entire "Data Processing and Storage" level can be moved to cloud services and even made almost completely serverless. To do this, we'll do the following:

We ultimately achieve an architecture that is not only robust but also highly optimized for cost and operations:

While our serverless architecture offers immense benefits in scalability and operational efficiency, it is not without its challenges:

Overall, within the framework of our oil mission, we can safely accept all these risks, and careful service configuration can keep costs under control at high volumes.

Total Estimated Cost of Ownership (TCO)

“So, how much will our AWS solution cost us per month?”

Let's expect about 3 million messages per month from 300 wells, since we don't need to send telemetry very frequently.
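It is worth sanity-checking what that volume means per device; assuming roughly 30.44 days per month:

```python
# 3 million messages per month across 300 wells works out to one
# reading every few minutes per well - a very relaxed telemetry rate.
wells = 300
msgs_per_month = 3_000_000

per_well = msgs_per_month // wells           # messages per well per month
minutes_per_month = 30.44 * 24 * 60          # ~43,834 minutes
interval_min = minutes_per_month / per_well  # minutes between readings
```

So each well reports roughly every four to five minutes, which is more than enough resolution for flow monitoring and leaves ample headroom in every pricing tier below.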

AWS IoT Core

| Metric | Monthly Volume | Price per Unit | Free Tier | Calculation | Cost |
|---|---|---|---|---|---|
| Connectivity | 300 devices | $0.01 per device | - | $0.01 x 300 | $3.00 |
| Messaging (Inbound) | 3 million msgs | $2.30 per million | - | $2.30 x 3 | $6.90 |
| Rules (Triggered) | 3 million rules | $0.15 per million | - | $0.15 x 3 | $0.45 |
| Rules (Actioned) | 3 million actions | $0.15 per million | - | $0.15 x 3 | $0.45 |
| **Total** | | | | | **$10.80** |

AWS Lambda (Data Save Service)

| Metric | Monthly Volume | Price per Unit | Free Tier | Calculation | Cost |
|---|---|---|---|---|---|
| Requests (Invocations) | 3 million | $0.20 per million | 1M free | $0.20 x (3 - 1) | $0.40 |
| Duration (assumption: 128 MB, 100 ms per invocation) | 3M x 0.1 s x 0.125 GB ≈ 37.5k GB-s | $0.0000166667 per GB-s | 400k GB-s free | 0 | $0.00 |
| **Total** | | | | | **$0.40** |

AWS SQS

Total Requests: 3M (SendMessage) + 3M (ReceiveMessage) = 6M

| Metric | Monthly Volume | Price per Unit | Free Tier | Calculation | Cost |
|---|---|---|---|---|---|
| Requests (API Requests) | 6 million | $0.40 per million | 1M free | $0.40 x (6 - 1) | $2.00 |
| **Total** | | | | | **$2.00** |

AWS RDS

Assumption: You are using a small, entry-level instance for development or light workload: db.t4g.micro (PostgreSQL/MySQL, On-Demand) in US East (N. Virginia), 20 GB of GP3 storage with 3000 IOPS.

| Metric | Monthly Volume | Price per Unit | Free Tier | Calculation | Cost |
|---|---|---|---|---|---|
| db.t4g.micro Instance | 730 hours | $0.016 per hour | - | $0.016 x 730 | $11.68 |
| GP3 Storage | 20 GB | $0.115 per GB-mo | - | $0.115 x 20 | $2.30 |
| IOPS | 3,000 | Free (3,000 included with GP3) | - | 0 | $0.00 |
| **Total (1 instance)** | | | | | **$13.98** |

AWS ECS (Data Analytics Service)

Assumption: Using AWS Fargate for the analytics container with 1 vCPU and 2 GB of memory. A month has approximately 24 hours/day x 30.44 days/month ≈ 730 hours.

| Metric | Monthly Volume | Price per Unit | Calculation | Cost |
|---|---|---|---|---|
| vCPU Hours | 730 hours | $0.04048 per vCPU-hour | 1 x 730 x $0.04048 | $29.55 |
| Memory Hours | 2 GB x 730 hours | $0.004445 per GB-hour | 2 x 730 x $0.004445 | $6.50 |
| **Total** | | | | **$36.05** |

Total Estimated Cost of Ownership

| Component | Estimated Monthly Cost |
|---|---|
| AWS IoT Core | $10.80 |
| AWS Lambda | $0.40 |
| AWS SQS | $2.00 |
| AWS RDS (2 x db.t4g.micro) | $27.96 |
| AWS ECS (Fargate) | $36.05 |
| **Total** | **$77.21** |

Microsoft Azure IoT Hub

Azure IoT Hub is a managed service that acts as a central message hub in your cloud-based Internet of Things (IoT) solution. It provides reliable, secure, and scalable communication between your IoT application and the devices connected to it. Virtually any device can be connected to IoT Hub.

The service supports several messaging patterns, including device-to-cloud telemetry, file uploads from devices, and request-reply methods for device management. IoT Hub also supports monitoring capabilities, which help you track device creation, connectivity, and failures.

IoT Hub is built to scale up to millions of simultaneously connected devices and millions of events per second to support demanding IoT workloads.

Similar to AWS, Microsoft provides support for digital twins (or "shadows"), allowing for a digital representation of your physical device in the cloud. Just like AWS IoT Core, Azure supports multiple communication protocols, such as MQTT, AMQP, and HTTPS.

In our specific case, we are primarily interested in MQTT.

It is worth noting that while the official documentation states that IoT Hub is not a full-fledged MQTT broker and recommends using Azure Event Grid for more comprehensive MQTT features, the functionality that IoT Hub provides is perfectly sufficient for the scope of our current project.

So, let’s replace the MQTT Broker in our architecture with Azure IoT Hub:

A particularly appealing feature of Azure IoT Hub is the flexibility it offers in device authentication. While it supports the standard practice of using X.509 Certificates (which, like AWS, it can help manage and generate), it also offers a pragmatic alternative: Shared Access Signature (SAS) tokens. This gives developers more options for managing device security, especially in resource-constrained or heterogeneous device environments.
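For illustration, here is a sketch of generating such a SAS token following the documented format (an HMAC-SHA256 signature over the URL-encoded resource URI plus an expiry timestamp, signed with the device's base64 key); the hub and device names below are hypothetical:

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def generate_sas_token(resource_uri: str, device_key_b64: str,
                       ttl_s: int = 3600) -> str:
    """Sketch of an Azure IoT Hub SAS token per the documented format:
    sign '<url-encoded uri>\n<expiry>' with the base64-decoded device key."""
    expiry = int(time.time()) + ttl_s
    uri = urllib.parse.quote(resource_uri, safe="")
    to_sign = f"{uri}\n{expiry}".encode("utf-8")
    key = base64.b64decode(device_key_b64)
    sig = base64.b64encode(hmac.new(key, to_sign, hashlib.sha256).digest())
    return (f"SharedAccessSignature sr={uri}"
            f"&sig={urllib.parse.quote(sig, safe='')}&se={expiry}")

token = generate_sas_token(
    "myhub.azure-devices.net/devices/well-042",  # hypothetical hub/device
    base64.b64encode(b"dummy-device-key").decode())
```

Because the token is derived and expiring, a compromised transmitter leaks at most a short-lived credential rather than a long-term secret, which is the pragmatic appeal over managing per-device X.509 certificates.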

And just like AWS, we can make our entire cloud component almost completely serverless:

Total Estimated Cost of Ownership

Just like in the previous example, we'll take 300 wells that generate about 3 million messages per month.

Azure IoT Hub

Azure pricing is tiered. For our purposes, Standard S1 (400,000 messages/day per unit and up to 500,000 devices) is sufficient: our 3 million messages per month average roughly 100,000 per day, well within a single unit's quota. The monthly cost is $25.00.

Azure Functions (Data Save Service)

| Metric | Monthly Volume | Price per Unit | Free Tier | Calculation | Cost |
|---|---|---|---|---|---|
| Requests (Invocations) | 3 million | $0.40 per million | 250k free | $0.40 x (3 - 0.25) | $1.10 |
| Duration (assumption: 128 MB, 100 ms per invocation) | 3M x 0.1 s x 0.125 GB ≈ 37.5k GB-s | $0.000026 per GB-s | 100k GB-s free | 0 | $0.00 |
| **Total** | | | | | **$1.10** |

Azure Service Bus (ASB)

Like IoT Hub, Service Bus is priced in tiers. The Standard tier, at $5.00 per month, suits us perfectly, as it includes 12.5 million operations.

Azure SQL Database (Data Storage)

| Metric | Monthly Volume | Price per Unit | Free Tier | Calculation | Cost |
|---|---|---|---|---|---|
| vCore | 730 hours | $0.05825 per hour | - | $0.05825 x 1 x 730 | $42.52 |
| Storage | 20 GB | $0.115 per GB-month | - | $0.115 x 20 | $2.30 |
| **Total (1 instance)** | | | | | **$44.82** |

Azure Container Apps (Data Analytics Service)

| Metric | Monthly Volume | Price per Unit | Free Tier | Calculation | Cost |
|---|---|---|---|---|---|
| vCPU | 730 hours | $0.0571 per vCPU-hour | - | 1 x 730 x $0.0571 | $41.68 |
| Memory | 2 GiB x 730 hours | $0.0050 per GiB-hour | - | 2 x 730 x $0.0050 | $7.30 |
| **Total** | | | | | **$48.98** |

Total Estimated Cost of Ownership

| Component | Estimated Monthly Cost |
|---|---|
| Azure IoT Hub | $25.00 |
| Azure Functions | $1.10 |
| Azure Service Bus | $5.00 |
| Azure SQL Database (2 instances) | $89.64 |
| Azure Container Apps | $48.98 |
| **Total** | **$169.72** |

Conclusions

So what conclusions can we draw from this article? Of course, any architectural solution is limited only by our imagination and experience: we can endlessly optimize and improve, replacing some components with others. What matters is that the result keeps us happy oil producers who strive for operational excellence.

Our analysis showed that a managed cloud solution is the best fit for the mission-critical, scalable Industrial Internet of Things (IIoT).

We clearly saw that the perceived savings and low initial costs of in-house development using a standard MQTT broker lead to huge and unjustified risks:

For any serious IIoT project that must operate 24/7, scale, and deliver business value, using managed cloud platforms is the path to success.

By choosing this, you're not just buying a service; you're buying time, reliability, and security. Instead of spending resources on reinventing the wheel, we can focus on what is most important: developing predictive models and applications that directly generate profit for the company.

The choice of cloud solutions is a strategic decision that allows a business to remain focused on its core competencies, rather than on infrastructure management.

It is important to understand that a cloud solution is not a silver bullet. The recent discontinuation of Google Cloud IoT Core in 2023 serves as additional evidence of the complexity in managing an IoT hub. Even a giant like Google acknowledged the expediency of delegating this task to specialized partners (HiveMQ, EMQX, ClearBlade), which further confirms the risks associated with an unproven or self-developed solution.

But that's another story entirely.