Most of the attention around large language models (LLMs) has centered on headline-grabbing use cases: AI coding assistants, chatbots, and email summarizers. But I’ve recently been playing with a less talked-about LLM superpower: the ability to extract structured, useful information from unstructured, messy text.
This ability has major implications for anyone working with large amounts of text. And it’s exactly what I’ve been experimenting with in a project for practiceproblems.org, a site that helps students discover and solve math problems by watching real humans walk through solutions on video.
To find these problems, I need to be able to extract math practice problems from YouTube video transcripts. Done by hand, this is painstaking, boring, tedious work.
But for an LLM with a good prompt, the results are shockingly good. Imagine trying to extract what 3x3 matrix someone is finding the eigenvalues of simply by reading a transcript of them solving it, without them ever explicitly reading off the values. That would be tough for a human, but somehow, these latest LLMs can figure it out… sometimes.
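To make “structured” concrete, here is roughly the kind of record I want back for that eigenvalue example. The schema and the specific matrix below are my own illustration, not the site’s actual format:

```python
# Illustrative only: the field names and the matrix are made up for this example.
extracted_problem = {
    "topic": "linear algebra",
    "statement": "Find the eigenvalues of the matrix A.",
    "matrix_A": [
        [2, 0, 0],
        [1, 3, 0],
        [0, 0, 4],
    ],
    "read_aloud_verbatim": False,  # reconstructed from how the solver talked through it
}
```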
The challenge is that the transcripts are often incomplete. There simply isn’t enough information in the transcript to reconstruct what problem is being solved. A math problem might be shown on screen but never read aloud. Or it might be described in fragments that are only coherent when stitched together.
This means that often, the LLM simply is not able to extract a meaningful problem from a transcript. I found that when I told the LLM it must only extract meaningful problems, it would be way too cautious and maybe only extract 2 or 3 problems from a video that contained 20.
When I told it to extract anything resembling a problem, it would hallucinate, generating problems that didn’t exist and giving way too many false positives to be useful. So, more advanced prompting techniques were needed.
Confidence Scores
“Extract as many problems as you can, and give each one a confidence score from 0 to 10.”
This had two benefits. First, it encouraged the model to attempt extractions instead of skipping potential problems altogether. Second, it gave me a way to rank and filter results in post-processing.
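Here is a minimal sketch of what that looks like end to end, assuming an OpenAI-style chat API and JSON output; the prompt wording, model name, field names, and threshold are my own illustration, not the production pipeline:

```python
import json

from openai import OpenAI  # assumes the OpenAI Python SDK; any chat API with JSON output works

client = OpenAI()

EXTRACTION_PROMPT = """You will be given a transcript of a video of someone solving math problems.
Extract as many problems as you can, and give each one a confidence score from 0 to 10
reflecting how sure you are that the problem really appears in the transcript.
Return JSON shaped like: {"problems": [{"statement": "...", "confidence": 0}]}"""


def extract_problems(transcript: str) -> list[dict]:
    """One extraction call per transcript, returning the raw scored candidates."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        response_format={"type": "json_object"},  # ask for parseable JSON
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)["problems"]


def filter_confident(problems: list[dict], threshold: int = 7) -> list[dict]:
    """Post-processing: keep only the extractions the model itself is fairly sure about."""
    return [p for p in problems if p.get("confidence", 0) >= threshold]
```

Because the threshold lives in post-processing rather than in the prompt, the raw scored results can be stored once and re-filtered later without paying for another round of LLM calls.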
Flags
Telling a model “don’t hallucinate” doesn’t really work, but surprisingly, asking after the fact “did you make this up?” works rather well (at least in this use case). The key thing I’ve found working with LLMs is that instead of simply telling the model “don’t do X” once it starts behaving badly, it’s better to give it the ability to set a flag whenever it does the bad behavior. Then, afterwards, you can manually parse out the bad output.
Asking the LLM to output a "notFoundInOriginalTranscript" flag worked quite well at identifying the cases where it wasn’t able to faithfully reconstruct a problem from the transcript.
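Here is how that flag can slot into the same post-processing step as the confidence filter above; the "notFoundInOriginalTranscript" name is the one from my prompt, everything else is illustrative:

```python
# Appended to the extraction prompt: the model may admit, per problem, that it guessed.
FLAG_INSTRUCTION = """For each problem, also output a boolean field
"notFoundInOriginalTranscript". Set it to true if you could not actually locate the
problem in the transcript and had to guess or invent part of it."""


def drop_flagged(problems: list[dict]) -> list[dict]:
    """Manually parse out the bad output after the fact: discard anything the model
    admitted it made up."""
    return [p for p in problems if not p.get("notFoundInOriginalTranscript", False)]
```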
The reason this works is that prompting the LLM to output confidence scores and flags for bad output allows it to “think twice” about the output. Token prediction conditions on both the prompt and the model’s own output so far, so letting the model emit as much relevant text as possible can be helpful. I believe this is somewhat similar to chain-of-thought prompting, or including “think step by step” in a prompt, where encouraging more token output results in better performance.
I suspect these techniques could be quite helpful in RAG-style pipelines, where fragments of text related to the query are fetched from a database and then fed into the LLM’s context to provide whatever background information may be needed to answer the query. Adding an extra layer of text extraction, so the LLM sees more structured, meaningful data, could help produce better output, though it might hurt your wallet. 😅
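If I were to bolt this onto a RAG pipeline, it might look something like the sketch below, where retrieve, extract, and answer are hypothetical stand-ins for whatever retrieval, extraction, and generation steps the pipeline already has:

```python
import json
from typing import Callable


def rag_answer_with_extraction(
    query: str,
    retrieve: Callable[[str], list[str]],   # fetch raw text fragments from a vector store
    extract: Callable[[str], list[dict]],   # structured extraction, e.g. the confidence/flag approach above
    answer: Callable[[str, str], str],      # final LLM call over the structured context
) -> str:
    """Hypothetical RAG flow with an extra extraction pass between retrieval and generation."""
    structured: list[dict] = []
    for fragment in retrieve(query):
        # One extra LLM call per fragment -- this is the part that hurts your wallet.
        structured.extend(extract(fragment))
    context = json.dumps(structured, indent=2)
    return answer(query, context)
```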