By Adam Bellemare, Principal Technologist at Confluent


The emergence of generative AI has resurfaced a long-debated question: how do you get your systems and services the data they need to do their jobs? While this question is most commonly asked about microservices and populating data lakes, generative AI has pushed its way to the front of the list. This article explores how the data demands of generative AI are an extension of the age-old problem of data access, and how data streams can provide the missing answer.


The key problem with accessing data is that the services that create the original record of data are not necessarily the best suited to hosting ad-hoc access to it. Your service may be perfectly capable of performing its actual business responsibilities, yet unable to serve that data to prospective clients. While you can expose the data through an interface, the service may not be able to handle the query volume or the types of queries that are expected.


Data analysts ran into this problem decades ago, when the original system of record (an OLTP database) couldn’t provide the power and performance needed for analytical use cases. A data engineer would extract the data from the original system of record and load it into an OLAP database so the data analysts could do their job. While the tools and technologies have changed over the decades, the gist remains the same: copying data from the operational space to the analytical space.



Fig 1: A simple Extract-Transform-Load (ETL) job copying data from the operational domain to the analytical domain.
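
As a minimal illustration of that pattern, the sketch below copies rows from a hypothetical OLTP orders table into an analytical store. The table names, columns, and use of sqlite3 on both ends are assumptions made for brevity; a real pipeline would read from the operational database and write to a warehouse or OLAP store.

```python
import sqlite3

# Hypothetical databases: in practice the source would be the OLTP system of
# record (e.g. Postgres) and the target an OLAP store or warehouse.
source = sqlite3.connect("oltp_orders.db")
target = sqlite3.connect("olap_warehouse.db")

# Extract: pull the operational rows (assumed schema: id, customer_id, total).
rows = source.execute("SELECT id, customer_id, total FROM orders").fetchall()

# Transform: a trivial example -- aggregate order totals per customer.
totals = {}
for _order_id, customer_id, total in rows:
    totals[customer_id] = totals.get(customer_id, 0.0) + total

# Load: write the derived, analytics-friendly table into the warehouse.
target.execute(
    "CREATE TABLE IF NOT EXISTS customer_order_totals "
    "(customer_id TEXT PRIMARY KEY, total REAL)"
)
target.executemany(
    "INSERT OR REPLACE INTO customer_order_totals VALUES (?, ?)",
    totals.items(),
)
target.commit()
```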


Microservices have the same problem: how do they get the data they need? One common option is to query the original system of record directly, via HTTP, SOAP, or RPC, for example. As in the data analyst case, the same limitations apply: the source service cannot handle the access patterns, latency requirements, and load placed on it by other dependent services. Updating the system to handle these new requirements may not be reasonable either, considering its complexity, limited resources, and competing priorities.


Fig 2: Other services will require the data to solve their own business use cases, resulting in a web of point-to-point connections.


The crux of the matter is that the services that create the data must also provide access to it for external systems. This open-ended requirement complicates things: the service must do a good job of fulfilling its direct business responsibilities, while also supporting data access patterns that go well beyond its own use cases.


Fig 3: The application that created the data is also responsible for fulfilling the on-demand data queries of all other services.


The solution to providing data access to services, systems, and AIs is a dedicated data communication layer, responsible only for the circulation and distribution of data across an organization. This is where data streaming (also known as event streaming) comes in.


In short, your services publish important business data to durable, scalable, and replayable data streams. Other services that need that data can subscribe to the relevant data streams, consume the data, and react to it according to their business needs.


Fig 4: A dedicated data communication layer, provided by data streams, simplifies the exchange of data across your organization.
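
As a concrete sketch of that publish/subscribe flow, the example below writes an order event to a stream and consumes it in another service, using Apache Kafka via the confluent-kafka Python client. The broker address, topic name, and event fields are assumptions, not a prescribed setup.

```python
import json
from confluent_kafka import Producer, Consumer

# --- Producing service: publish important business data to a durable stream ---
producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker address
event = {"order_id": "1234", "customer_id": "56", "total": 19.99, "status": "PLACED"}
producer.produce("orders", key=event["order_id"], value=json.dumps(event))
producer.flush()

# --- Consuming service: subscribe to the stream and react to each event ---
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "shipping-service",     # hypothetical consumer group
    "auto.offset.reset": "earliest",    # replay the stream from the beginning
})
consumer.subscribe(["orders"])

msg = consumer.poll(10.0)               # wait up to 10 seconds for a record
if msg is not None and msg.error() is None:
    order = json.loads(msg.value())
    print(f"Reacting to order {order['order_id']} for customer {order['customer_id']}")
consumer.close()
```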


Data streaming allows you to power services of any size (either micro or macro), populate your data lakes and other analytical endpoints, and power AI applications and services across your business.


Services don’t have to write all of their data to the data stream, only that which is useful to others. A good place to start is to investigate the requests a service handles, such as GET requests, as they illustrate the types of data commonly requested by other services. Also, talk to your colleagues, as they’ll have a good idea of the types of data their services need to accomplish their tasks.
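
As a small, hypothetical illustration, the snippet below builds an outbound event from an internal record, publishing only the fields other teams are likely to ask for and keeping purely internal bookkeeping fields private. All field names here are assumptions.

```python
# Internal representation, including fields only this service cares about.
internal_order = {
    "order_id": "1234",
    "customer_id": "56",
    "total": 19.99,
    "status": "PLACED",
    "db_shard": 7,        # internal bookkeeping -- not useful to others
    "retry_count": 0,     # internal bookkeeping -- not useful to others
}

# Outbound event: only the fields commonly requested by other services,
# e.g. the same fields they ask for today via GET requests.
PUBLIC_FIELDS = ("order_id", "customer_id", "total", "status")
order_event = {field: internal_order[field] for field in PUBLIC_FIELDS}
```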


Other services read the data from the data streams and react to it by updating their own state stores, applying their own business logic, and generating results that they may publish to their own streams. There are three big changes for the consumer (a minimal consumer sketch follows the list):


  1. They no longer request data ad hoc from the producer service - instead, they get all their data through the data stream, including new records, updates, and deletions.
  2. Since they no longer request data on demand, they must maintain a replica of the state that they care about within their own data stores. (Note: They do not need to store ALL data, just the fields that they care about)
  3. The consumer becomes solely responsible for its own performance metrics, as long as the data is available in the data stream. It is no longer reliant on the producer to handle its load or meet its SLAs.
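
Here is a minimal sketch of such a consumer, keeping a local replica of only the fields it cares about and treating an empty value (a tombstone) as a deletion. The topic, field names, and use of an in-memory dict as the state store are assumptions; a real service would persist its replica in its own database.

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "inventory-service",         # hypothetical consumer group
    "auto.offset.reset": "earliest",         # build state by replaying the stream
})
consumer.subscribe(["orders"])

# Local state store: a replica of only the fields this service cares about.
orders = {}

while True:  # runs until the process is stopped
    msg = consumer.poll(1.0)
    if msg is None or msg.error() is not None:
        continue
    order_id = msg.key().decode() if msg.key() else None
    if msg.value() is None:
        # Tombstone: the upstream record was deleted, so drop the local copy.
        orders.pop(order_id, None)
        continue
    event = json.loads(msg.value())
    # New records and updates are both upserts; keep only the needed fields.
    orders[order_id] = {"customer_id": event["customer_id"], "status": event["status"]}
```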


Data streaming offers significant benefits to microservices, AI, and analytics. 


Data streams enable you to power operations, analytics, and AI, all from the same data source. As a data communication layer, they make it easy for your colleagues and their services to find and use the data they need for their business use cases.


The last major benefit is strategic. It is more difficult to quantify, but it is undoubtedly one of the most important. By investing in a data streaming layer, you open up a wide range of possibilities for putting your data to work. Apache Kafka, a popular choice for data streaming, offers a wide range of connectors for integrating with all kinds of systems and services. You’re no longer restricted to the AIs that are integrated with your data lake offering, or those attached to the cloud service provider that stores all your analytical data. Instead, you can easily trial models from all sorts of providers as they become available, giving you a first-mover advantage in leveraging the latest tools.
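
For instance, a connector is typically registered by posting a small JSON configuration to the Kafka Connect REST API. The sketch below does this from Python; the connector class, topic, bucket, and endpoint are placeholders under assumed defaults, not a definitive setup.

```python
import json
import requests

# Hypothetical sink connector: stream the "orders" topic into an object store
# bucket so downstream analytics and AI tooling can read it.
connector = {
    "name": "orders-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "orders",
        "s3.bucket.name": "analytics-landing-zone",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}

# Kafka Connect exposes a REST API, assumed here to be running on localhost:8083.
response = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
response.raise_for_status()
```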


Thinking about data, how to access it, and how to get it to where it needs to be has always been a challenge, particularly across the operational/analytical divide. The advent of GenAI has made it even more important, adding even more weight to solving this age-old problem. At its heart is a simple principle: let your business services focus on their business use cases, and let the data communication layer provide data to all who need it through low-latency data streams. And from that single set of data streams, you’ll be able to power your operational, analytical, and AI use cases.