Today, I’m talking to Prem Ramaswami, the Head of Data Commons at Google. Prem and his team recently launched the Data Commons Model Context Protocol (MCP) Server, which is Google’s effort to give builders and developers access to a boatload of trustworthy, verifiable data. Rather than building its own proprietary protocol, Google Data Commons chose to build on Anthropic's open-source Model Context Protocol. We’re going to talk about MCP, real unit economics, the challenges of unstructured data and hallucinations, and what this all means for the future of building an internet business. Let’s dive in:


David Smooke: What strategies should AI researchers and builders be leveraging to make AI hallucinate less often?


Prem Ramaswami: Researchers and builders can ground AI outputs in trusted, authoritative data sources, so that the model interprets queries but only returns information sourced from reliable databases. That’s what Data Commons is: it brings together the world's public data from trusted and verifiable sources. We help make data sources transparent in an open-source manner, limit model responses strictly to trustworthy data, and incorporate continual evaluation and feedback. We also streamline access to that authoritative data via our recently released Model Context Protocol (MCP) Server, which provides a standardized way for AI agents to discover and access our data resources. That means it’s faster and easier for developers to build and deploy trustworthy AI applications and agents.


Data Commons has been around since 2018. Can you walk us through its purpose, traction, and current scale? What does success look like for Data Commons, and how much historical vs. real-time data are we talking about?


Data Commons stemmed from the fact that there’s a lot of important public data available, but it’s not exactly usable or useful. It’s hard to find, it’s scattered, you need to read a 500-page PDF before using it, and it’s difficult to work with. We make it simpler. Data Commons organizes and unifies the world’s publicly available data from trusted sources such as The ONE Campaign, the U.S. Census Bureau, the United Nations, Eurostat, and the World Bank, making it both universally accessible and useful.


Today, we integrate hundreds of datasets and tens of thousands of variables, serving billions of data points across sectors like health, economics, and sustainability. Success for Data Commons is democratizing access to high-quality, transparent data so anyone can quickly get reliable answers and make informed decisions. That’s especially important in the age of AI. We want making a data-based decision to be the easiest choice.


Why does MCP matter? What is Google's strategic approach to MCP? And more specifically, what is the Data Commons strategic approach to MCP?


MCP creates an open, standardized way for AI agents and applications to access data sources. As AI systems become more prevalent, the reliability and transparency of their outputs depend on how well they can ground their answers in real data – and Data Commons delivers that real data with the benefit of MCP. Rather than having to know the ins and outs of our API, or our data model, you can use the “intelligence” of the LLM to help interact with the data at the right moment. 


We believe that an open ecosystem, where multiple organizations contribute and adopt shared standards, leads to better quality, more reliable AI applications, and broader societal benefit.


We have built an MCP server, making our vast repository of public data easily accessible to AI models, and we are collaborating with partners to set best practices. Our goal is to empower developers, NGOs, journalists, governments, and anyone who needs reliable data, while building a foundation of trust and transparency into the next generation of AI-powered tools.


Why did Google choose to build on Anthropic's open-source MCP standard rather than create its own? What was the internal debate like regarding a proprietary vs. an open-source protocol? And how did Google owning a 14% stake in Anthropic impact the decision?


As a small team working on an open-source effort, Data Commons' goal is primarily to ensure broad interoperability and accelerate the development of reliable, data-grounded AI applications. In addition to MCP, we’ve recently integrated the Statistical Data and Metadata eXchange (SDMX) format, and most of our ontology is an extension of Schema.org, another open web standard. Many Google products, including Google Cloud databases like BigQuery, as well as industry products, have already integrated with MCP, making it an easy choice.


What are the unit economics here? How expensive is a query? Is this a free product forever, or are there future plans for a paid tier based on usage? What is to prevent Data Commons from being sunset in the future?


Data Commons is open source, so we hope for a thriving community of users and developers to help it grow, and Google has shown commitment to that success. Currently, Data Commons is focused on maximizing access. Data Commons helps provide data to Search, and separately we are actively researching different ways to make LLMs more reliable and trustworthy. It is also free to use at DataCommons.org.


One of the clever aspects of MCP is that our users can use their own LLM to interact with the MCP server. Said differently, the user’s LLM is what translates the human language query into a set of API calls and then interprets the result back to the user. Google’s compute isn’t involved! 
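To make that loop concrete, here is a minimal sketch of an MCP client built on the open-source MCP Python SDK (the `mcp` package). The server launch command and tool name below are placeholders for illustration, not the actual Data Commons MCP interface.

```python
# Sketch of the MCP flow described above: the *user's* LLM decides which tool
# to call; this client just forwards the call and hands back structured data.
# The server command and tool name are hypothetical placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

SERVER = StdioServerParameters(
    command="datacommons-mcp",  # placeholder: however the server is launched locally
    args=["serve"],
)

async def answer(question: str) -> str:
    async with stdio_client(SERVER) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # 1. The LLM (not shown here) inspects the tools the server advertises...
            tools = await session.list_tools()
            tool_names = [t.name for t in tools.tools]
            print("available tools:", tool_names)

            # 2. ...and translates the natural-language question into a tool call.
            #    A hypothetical tool and arguments are hard-coded for illustration.
            result = await session.call_tool(
                "get_observations",          # hypothetical tool name
                arguments={"query": question},
            )

            # 3. The LLM would then turn the structured result back into prose.
            return str(result.content)

print(asyncio.run(answer("What is the population of California?")))
```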


I should note that we do cap the number of API requests to Data Commons: we want to encourage broad use, but we also want to ensure there isn’t abuse or pure scraping.


What techniques does the Data Commons API use to make its data cleaner, more structured and more accessible than the average public data dump? And what general advice do you have for usefully structuring unstructured data?


One of our key innovations is to transform data into a common knowledge graph. We import raw public data from thousands of sources into a single, canonical ontology: if a column in one dataset says “Type 2 diabetes” and a column in another dataset has the ICD code “E11”, we can understand they are both referring to the same thing. We can also normalize units to make values more easily comparable. This allows data scientists to focus on data analysis instead of the busywork.
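As a toy sketch of what that canonicalization looks like (the node ID and unit table below are invented for illustration, not the actual Data Commons schema):

```python
# Toy illustration of mapping source-specific labels onto one canonical node
# and normalizing units; the canonical ID and scale factors are made up.
CANONICAL_NODE = {
    "Type 2 diabetes": "dc/disease/Type2Diabetes",  # hypothetical canonical ID
    "E11": "dc/disease/Type2Diabetes",              # ICD-10 code for the same condition
}

UNIT_SCALE = {
    # scale factors to a common base unit ("count")
    "count": 1,
    "thousands": 1_000,
    "millions": 1_000_000,
}

def normalize(label: str, value: float, unit: str) -> tuple[str, float]:
    """Resolve a source label to its canonical node and rescale the value."""
    node = CANONICAL_NODE.get(label, label)          # fall back to the raw label
    return node, value * UNIT_SCALE.get(unit.lower(), 1)

# Two differently labeled rows resolve to the same node with comparable values.
print(normalize("Type 2 diabetes", 38_000, "count"))   # ('dc/disease/Type2Diabetes', 38000)
print(normalize("E11", 38, "thousands"))               # ('dc/disease/Type2Diabetes', 38000)
```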


Every data point is accompanied by detailed metadata and provenance, so users always know where information comes from. Focusing on these principles allows users to turn disparate data into valuable, actionable resources.


What types of verticals and companies do you see using this MCP server to grow their business? And what specific datasets are you most excited to see developers build on, and why?


Many of the world’s pressing challenges are holistic problems. In other words, I can’t just look at one dataset from one government agency but need to combine multiple datasets. 


As an example of this, the ONE Campaign recently launched the ONE Data Agent, an interactive platform for health financing data. This new tool enables users to search through tens of millions of health financing data points in seconds, using plain language. They can then visualize that data and download clean datasets, saving time while helping to improve advocacy, reporting, and policy-making.


I’m excited to see developers build new understanding on datasets imported into Data Commons in public health, climate, economics, education, and many other fields. These are foundational datasets that, when made more accessible and actionable, can drive real-world impact—helping communities measure progress and hopefully more clearly understand which interventions lead to which outcomes. It can help us achieve sustainability goals, spot economic changes early, or supercharge advocacy organizations. The MCP server lowers barriers for innovators in these fields, and I’m eager to see the creative solutions that emerge.


How do you define "trustworthy" data in a way that is verifiable and auditable for a developer building an application on top of your platform?


For us, “trustworthy” data is from authoritative and reputable organizations such as government agencies, academic institutions, and civil society groups. Every data point on our platform is accompanied by detailed metadata, including its original source. 


For developers, this means you can always trace any number or statistic back to its origin, review the context in which it was collected, and understand any limitations. Our platform surfaces this provenance transparently through the API, making it easy to build applications that not only deliver answers, but also provide users with the evidence and audit trail behind every result.
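Here is a rough sketch of how an application might keep that audit trail attached to every number it shows. The `Observation` wrapper and its field names are hypothetical, though `Count_Person` and `geoId/06` (California) are real Data Commons identifiers.

```python
# Illustrative only: a hypothetical wrapper showing how an application can keep
# provenance next to every value it displays. Field names are placeholders,
# not the actual Data Commons client API.
from dataclasses import dataclass

@dataclass
class Observation:
    variable: str
    place: str
    value: float
    source: str        # e.g. "U.S. Census Bureau"
    source_url: str    # link back to the original dataset

def render_with_citation(obs: Observation) -> str:
    """Never show a number without its audit trail."""
    return (
        f"{obs.variable} for {obs.place}: {obs.value:,} "
        f"(source: {obs.source}, {obs.source_url})"
    )

obs = Observation(
    variable="Count_Person",    # a real Data Commons statistical variable
    place="geoId/06",           # California's identifier in Data Commons
    value=39_000_000,           # placeholder value for illustration
    source="U.S. Census Bureau",
    source_url="https://www.census.gov/",
)
print(render_with_citation(obs))
```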


We don’t try to place judgement on those datasets or specific values. Instead, we want potential disagreements in this data to be more easily visible. Each one of these differences is another story to be told.


The industry has a problem with AI "hallucinations." Is Google's long-term bet that the future of credible AI will be built on verifiable data layers like Data Commons, rather than on models with ever-larger training sets?


Not at all. We are very early in our work with LLMs. Google’s transformer paper was released in 2017! At the moment, I believe the answer to hallucinations is to try all of the above. 


Data Commons is attempting to ground outputs in verifiable data with provenance. 


Our long-term bet is that the most reliable AI systems will combine the strengths of these models with robust, auditable data sources. By making it easy for AI agents to reference authoritative data sources, we can deliver answers that are trustworthy, transparent and reliable. 


What's on the Data Commons MCP Server roadmap for next year? Are there specific data sources or capabilities you're planning to add that developers should be excited about?


If you had asked me for my roadmap nine months ago, I wouldn’t have been talking to you about MCP! The rate of development and change right now in the AI space is dizzying. That said, here are a few areas we will focus on:

  1. Currently, Data Commons data has a lot of depth and coverage in the U.S., then India, then OECD countries, and then the coverage thins out, a gap the team is now aggressively working to close. One of our goals is to work with more national statistical agencies, international organizations, and civil society organizations to both build capacity for creating the data and source that data, so that the grounded AI systems we build are more globally representative.
  2. We want Data Commons to be easier to use. For example, we’ve recently worked to make it compatible with the Statistical Data and Metadata eXchange (SDMX) format, and we hope to keep making Data Commons work more seamlessly with other open standards.


Five years from now, do you think every major AI application will have some kind of structured data layer like Data Commons underneath it, or will we still be building on top of pure language models?


As AI becomes more deeply integrated into applications that matter, the need for trustworthy and reliable information will only grow. If I were to hazard a guess, I expect the industry will shift toward hybrid systems, certainly in the near term, where language models provide the interface and reasoning, but the facts and evidence always come from robust, authoritative data sources. We already see this with the use of Retrieval-Augmented Generation (RAG) systems. This combination will be essential for building lasting trust in the next generation of AI-powered products and services.
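A minimal sketch of that hybrid pattern, with stand-in functions (neither `query_data_layer` nor `llm` is a real API), shows the shape: the model handles the language, and the data layer supplies every fact.

```python
# Sketch of the hybrid pattern described above: the language model provides the
# interface and reasoning, while facts come only from a structured data layer.

def query_data_layer(question: str) -> list[dict]:
    """Stand-in for a structured, provenance-tracked lookup (e.g. Data Commons)."""
    return [{
        "variable": "Count_Person",
        "place": "geoId/06",
        "value": 39_000_000,                # placeholder value for illustration
        "source": "U.S. Census Bureau",
    }]

def llm(prompt: str) -> str:
    """Stand-in for any LLM call; it paraphrases, it never invents numbers."""
    return f"[model answer grounded in: {prompt}]"

def answer(question: str) -> str:
    facts = query_data_layer(question)
    # The model is instructed to answer *only* from the retrieved facts,
    # citing the source attached to each one.
    prompt = (
        f"Question: {question}\n"
        f"Facts: {facts}\n"
        "Answer using only these facts, with citations."
    )
    return llm(prompt)

print(answer("How many people live in California?"))
```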


Ten years from now, what aspects of your current job will AI not be able to do?


I’m honestly less worried about AI doing what I do today and more excited about what AI will do that I never could.


Let me give you an example. Our human minds are trained to think in three dimensions. We’re not great in 4D. Yet most of the holistic problems we spoke about earlier are 30-, 60-, or 3,000-dimension problems. When designing an urban space, for example, every change in a building’s footprint casts shadows differently, changes the mobility networks, alters walkability, shifts the financial outcome, and much more. I hope that in 10 years we can more reliably model such systems and understand which interventions deliver the best quality-of-life improvements for all of us. I also hope AI continues to take the tedious parts out of my job.


Tomorrow, what is the most important thing on your calendar?


Personal answer: Dinner with my wife and children. It’s also why I work hard on the problems I do. 


Work answer: Our team meeting! Culture eats strategy for breakfast. 


We work in a difficult, “low dopamine” space where it's unclear if our actions today will move the needle in the future. I get to give this interview, but behind me is a wonderful team that I get to work with daily. Teddy Roosevelt’s quote about “the man in the arena” rings true for me every morning. “If he fails, at least fails while daring greatly.” And I’d add, while having fun with wonderful family, friends, and colleagues. 


Lastly, if you could fix one thing about how the world treats data, what would it be?


Too often, valuable public data is locked away in silos, hard to find, or presented in ways that make it inaccessible to all but a few experts. If we made data, along with clear sourcing, context, and documentation, open and easily usable for everyone, we’d unlock enormous potential for innovation, accountability, and informed decision-making. Now, we do need to be responsible and put safeguards in place to ensure that this data does not reveal information about individuals or expose any at-risk groups to further harm. But I do believe that empowering people with trustworthy, transparent data should be a fundamental principle.


Learn more about Prem Ramaswami, the Head of Data Commons at Google.