Seerist

Blog · 2026.06.05 · 2041 UTC

When "Good Enough" Isn't Good Enough: Testing AI for Intelligence Work

By

Melissa Newberg, Head of Intelligence, Seerist

Since we started testing last year and then launched AskAnna six months ago, I've run thousands of queries through it - testing it, breaking it, refining the architecture. But at a certain point, internal testing only tells you so much. The frontier AI landscape has moved fast, and I wanted to know how AskAnna actually stacked up against the best general-purpose models available today. It’s something we also get asked a lot.

So I ran a structured evaluation: AskAnna plus five leading models, across 18 questions and seven scoring dimensions selected for the practitioner's needs.

For those unfamiliar, AskAnna is Seerist's AI-powered intelligence tool, built on a closed corpus of Control Risks' analyst reports, human-authored Events, and reliability-rated news sources. The architecture was built on a specific premise: that for intelligence work, where a confident wrong answer can inform a bad decision, how you build matters as much as what you build on.

Consider this an honest look at a landscape that's changing fast: what the best general models can and cannot do, and what still separates a purpose-built intelligence tool from even the most capable frontier models.

It's also written for anyone building in this space. The questions it raises about trust, information quality, and what it actually takes to make AI sound like an analyst apply whether you're a service provider, a practitioner standing something up internally, or somewhere in between.

The rest of this piece explains what was tested, what I found, and what it means for how you should be thinking about AI in intelligence contexts. But if you only have a few minutes, here's the tl;dr:

The leading general-purpose models have gotten genuinely good. They can be analytically serious, well-structured, capable of providing real insight on complex questions. But "good" and "trustworthy" are different things when the output informs a decision with real consequences. In this evaluation, one leading model fabricated specific facts across 40% of questions. A second model, one that many enterprise security teams already have access to, confirmed a fabricated US military operation as fact when asked about a city for which there was no verified intelligence to support it. That's not a minor error when it means a security manager is potentially briefing leadership on a threat that didn't apply or may not have existed in that specific geographic context.

AskAnna had a high overall success rate with zero fabrications across all 18 questions in this test set. It was the only model to contradict a false premise with sourced counter-evidence. Every claim linked to a typed, dated, traceable source. Most importantly, it scored a perfect 100% on operational briefings, the category most directly tied to real decisions about people's safety.

The gap is architectural, not cosmetic. More fundamentally, the challenge in intelligence is determining whether you can trust those outputs when the stakes are high.

What Was Tested

The evaluation covered 18 questions across two test batteries, six models, and seven scoring dimensions. The questions were designed to reflect how intelligence tools are used by everyday practitioners, not to favor any particular product.

It covered nine countries across four continents. Query types ranged from current situation reports and pre-deployment travel briefings to scenario forecasting, kidnap risk assessments, and breaking news verification. I deliberately tested edge cases: false premises, unverified incidents, and scenarios designed to reveal whether a model would invent facts when evidence was incomplete.

Each response was scored across seven dimensions: fabrication resistance, source transparency, epistemic honesty, operational utility, tradecraft discipline, tone, and fit for purpose.

The goal was a structured, honest evaluation of how these tools perform on the questions that matter to practitioners, and what the gaps reveal about where the technology actually is. The scoring criteria were established before testing, and every model was assessed against the same framework. The aim was not to prove a predetermined outcome, but to understand where different approaches succeeded and failed under the same conditions.

Where the General Models Excelled

We'll start here, because there are legitimate strengths.

The general models are good. One leading model produced a Mali kidnap risk assessment I would put in front of a senior analyst without embarrassment. Without real prompting, another generated a Haiti 90-day outlook with a tiered indicator framework that reflected real intelligence tradecraft. It flagged a factual error embedded in one of our questions before answering. On complex, multi-variable questions, its analytical depth was genuine and differentiated. Scary good even.

One model in particular was the most consistent of the five: a clean fabrication record in most rounds, good citation hygiene, rarely wrong. Its framing of the Nuevo León cartel risk was exactly right, even if it did frame the answer in a very AI way: "the question isn't whether your employees will be targeted, it's whether they could be caught near something intended for someone else."

For background research, general situational awareness, and questions where the stakes allow for a slim margin of error, some of these models are impressive and provide outputs I would call “good enough.” But the more interesting question is what “good enough” really means when the stakes are high, not whether LLMs are "good" at intelligence.

Where They Broke Down

The fabrication problem is systematic, not incidental.

A third model fabricated specific facts in 40% of questions, and these were not vague generalizations. It fabricated a US Embassy security alert with a specific date that I couldn't trace to any source, and in a pre-deployment brief it applied an Ebola quarantine warning to a location where no such restriction actually existed.

That last one is the one that gets me. Without tracing that back, a security manager receiving that briefing could delay a trip, brief leadership on a public health risk, or implement screening protocols based on a health threat that didn't apply. The model provided no signal that it had misapplied this information. But it was being helpful nonetheless. And when I went back to ask it to verify that information, it did then backtrack and say it wasn't factually correct, which is almost worse. Because at that point, I wasn't sure whether to believe it then either. And those precious minutes spent verifying pieces of information can really add up over time, when time is already one of the scarcest resources in this industry.

In intelligence, being that kind of helpful is dangerous. Intelligence failures rarely happen because someone lacked information. They often happen because someone accepted information that sounded plausible.

A model trained to be helpful will confirm what you imply.

That's a subtle distinction, but it's one of the most consequential differences between generating an answer and conducting analysis.

When asked whether the security situation in a specific city had "deteriorated significantly in the last 48 hours," most models confirmed the premise or hedged around it. The sourced intelligence picture at that time actually showed a stable posture, with a named official on record stating there were no active alerts, evacuation or otherwise. Our analysts verified that. Yet, every general model agreed with what the question implied, or avoided committing, which in operational terms is the same as confirming it.

The more unsettling version of this failure came from a model that most enterprise security teams already have by default, bundled into their existing software stack. When asked the same false premise question, it not only confirmed the deterioration of the security environment but named a specific US military operation as the cause. The operation did not exist. However, the output was well-formatted, confidently worded, and specific enough that a security manager would have no obvious reason to question it before forwarding. A fabrication that looks authoritative is more dangerous than one that looks uncertain.

Source quality was more inconsistent than expected.

One model, marketed on its web search capability, cited Reddit threads, YouTube travel vlogs, Facebook posts, and an Instagram story as primary sources in professional security briefings. These appeared alongside government advisories and established news wires with no real distinction. A recipient has no way to differentiate a State Department advisory from a backpacker forum from the model’s training data without clicking every link. There is a time and a place for those sources, certainly. But when they're presented alongside government advisories and professional reporting with no distinction, the burden of source evaluation shifts entirely to the reader.

Another model cited no sources at all in 16 of 18 questions. There was no way to verify any specific claim it made, and presumably, it generated the answer exclusively from its training data.

What Intel Standards Actually Require

Intelligence work has specific standards that predate AI: source typing, confidence calibration, distinction between what is known versus assessed versus assumed, temporal discipline, and epistemic honesty when you genuinely don't know.

None of the general models applied these standards consistently without being explicitly prompted. When they did apply them, it was because the question invited it. Even then, the same model that tiered its confidence correctly on one question would, on the next, assign invented probability percentages with no stated methodology behind them. The tradecraft appeared when the format of the question happened to surface it, and it disappeared when it didn't. That inconsistency is itself a trust problem: you can't build a reliable intelligence workflow on a tool that applies analytical standards selectively.

AskAnna applies these standards structurally. Every citation includes a source type, a date, and a traceable link. The difference between a Control Risks analyst assessment and a Reuters news item is visible in the output, not something the reader has to infer. The model is grounded in a closed corpus of human-authored or human-rated content, which means it won't cite a random Reddit thread, and the end user has the ability to choose the inputs.

It might sound like a limitation but it's a deliberate architectural choice with a specific rationale. In intelligence, the provenance of information is as important as the information itself.
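For anyone building something similar, the idea that every citation carries a source type, a date, and a traceable link can be sketched as a small data structure. This is an illustrative Python sketch only, not AskAnna's implementation: the names `SourceType`, `Citation`, and `render_claim`, the category labels, and the example URL are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class SourceType(Enum):
    # Hypothetical labels mirroring the three source categories named in the post.
    ANALYST_REPORT = "Analyst report"
    HUMAN_AUTHORED_EVENT = "Human-authored event"
    RATED_NEWS = "Reliability-rated news"


@dataclass(frozen=True)
class Citation:
    source_type: SourceType  # what kind of source this is
    published: date          # when the source was published
    url: str                 # traceable link (placeholder in this sketch)

    def label(self) -> str:
        """Render the provenance so the reader never has to infer it."""
        return f"[{self.source_type.value} | {self.published.isoformat()} | {self.url}]"


def render_claim(text: str, citation: Citation) -> str:
    """Attach provenance directly to the claim it supports."""
    return f"{text} {citation.label()}"
```

The design point is that a claim cannot be rendered without its provenance, so a news item and an analyst assessment are distinguishable in the output itself.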

Can't I Just Replicate It? It's Harder Than It Looks

Some organizations are building this themselves. It would be dismissive not to acknowledge that. Teams across the security and intelligence industry are investing in the right architectures, curating proprietary content, and building tools that aim to solve the same problems AskAnna was built to solve. The underlying technology is accessible, the patterns are well understood, and the demand is real. After all, we are all becoming builders in the age of AI.

But there's a meaningful difference between most current implementations and what a well-built retrieval-grounded system actually does. It comes down to where in the pipeline trust gets enforced.

Here's the non-technical version: most AI tools work like a very fast researcher. They retrieve relevant information, categorize and/or synthesize it, and sometimes fill gaps with what seems plausible. The result can be impressive. It can also be confidently wrong in ways that are very hard to spot at first glance (or even second), because the invented detail is woven into the real one.

A retrieval-layer approach is different. The corpus is curated before the model ever sees a query. The model can only draw on what's in that corpus so there's no reaching outside to fill gaps, no blending with training data, no "well I don't have this source but here's what seems plausible." When the corpus doesn't contain the answer, the output says so.

That's why AskAnna pushed back on the false premise question when every general model confirmed it. It was less about prompt instruction and more that the corpus was returning a picture that contradicted the user's claim, and the model was inherently constrained to that picture. You can't prompt-engineer your way to that behavior…you have to build it.
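To make the closed-corpus constraint concrete, here is a deliberately simplified Python sketch. It assumes a toy keyword-overlap retriever and hypothetical names (`Document`, `retrieve`, `answer`, `NO_REPORTING`); a production system would use a real retrieval stack and a language model, and nothing here describes AskAnna's actual internals. What it illustrates is the behavior described above: the answer is assembled only from retrieved corpus documents, and an empty retrieval yields an explicit "no reporting" response rather than a plausible guess.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list so trivial words don't count as evidence of relevance.
STOPWORDS = {"the", "in", "a", "an", "of", "has", "is", "with", "no"}
NO_REPORTING = "No confirmed reporting in the corpus for this query."


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str


def tokens(text: str) -> set:
    """Lowercase word set, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(query: str, corpus: list, min_overlap: int = 2) -> list:
    """Toy retrieval: keep documents sharing at least min_overlap terms with the query."""
    query_terms = tokens(query)
    return [doc for doc in corpus if len(query_terms & tokens(doc.text)) >= min_overlap]


def answer(query: str, corpus: list) -> str:
    """Answer strictly from the corpus; say so when nothing relevant exists."""
    hits = retrieve(query, corpus)
    if not hits:
        return NO_REPORTING  # no fallback to outside knowledge
    return "\n".join(f"{doc.text} [{doc.doc_id}]" for doc in hits)
```

In this sketch, a question that presumes deterioration still returns whatever the corpus actually says about that city, which is how a false premise ends up contradicted by the output rather than confirmed.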

Can you improve a general model with careful prompting? Absolutely. I've spent plenty of time trying. The results are often better. But they're also inconsistent, and in intelligence, variable reliability is simply another form of unreliability. The data boundary is almost impossible to enforce through instruction alone and the model's training data is always there as a fallback to satisfy your request.

Building this properly requires decisions at the architecture level: what goes in the data lake, how the data is rated, how retrieval is constrained, and how the system behaves when information is incomplete.

We've done that work. Others are doing it too, but most of what's currently deployed is operating at the synthesis layer only. The distinction matters more than most people realize.

There's one more dimension I haven't seen discussed enough: voice and tone. AskAnna intentionally sounds like it was trained by intelligence professionals, because it was. The hedging language, the way uncertainty is framed, the analytical register, the instinct to distinguish between what is known and what is assessed: all of this comes from nearly two decades of Control Risks analysts writing in a specific tradecraft voice, and that voice is embedded in the corpus. I've spent time trying to get general-purpose models to replicate it and I haven't been able to do it one-for-one. You can get close on individual outputs, but the model doesn't sustain it across thousands of queries. It still drifts back toward the generic confidence of a model that learned to sound authoritative from the internet. The data lake gives AskAnna its analytical identity, not just its facts.

How AskAnna Performed and Where It's Going

AskAnna scored a 91% overall success rate — 343 out of 378 points across 18 questions and seven dimensions. Those numbers matter less than what's behind them.
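The arithmetic behind that figure can be checked in a few lines. Note one assumption: the post does not state the per-dimension scale, but a maximum of 378 points across 18 questions and seven dimensions is consistent with each dimension being scored out of 3, so the sketch below uses that inferred value.

```python
QUESTIONS = 18
DIMENSIONS = 7
MAX_PER_DIMENSION = 3  # assumption: inferred from 378 / (18 * 7), not stated in the post

max_points = QUESTIONS * DIMENSIONS * MAX_PER_DIMENSION
earned = 343  # AskAnna's reported total

rate = earned / max_points
print(f"{earned}/{max_points} = {rate:.1%}")  # prints "343/378 = 90.7%"
```

Rounded to the nearest whole percent, that is the 91% reported above.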

It scored perfectly, per our standards, on operational briefings. These were the questions most directly tied to real decisions: pre-deployment briefs, advisories for staff, executive security briefings. On every one, AskAnna opened with a clear assessment, sourced every claim to a named report, and distinguished between what analysts had assessed and what news had reported. The sourcing is the mechanism by which a security manager can calibrate how much weight to give each claim. It also simply saves the reader a lot of time from a due diligence perspective.

It was the only tool to contradict a false premise. When I embedded a false premise in a question, implying a situation had deteriorated when the true intelligence picture showed stability, every general model confirmed or hedged. AskAnna reported what its corpus showed, even when that contradicted what the question implied. A model that confirms what you say rather than what the evidence shows is a liability in any operational context.

It handled uncertainty correctly. On an unverified social media report of an attack near a hotel in Bamako, AskAnna explicitly stated it had no confirmed reporting of the incident. It provided verified context on the current threat environment and gave the user a clear action: verify before acting. Most models described the situation as if it were realistic or possible. One invented specific details about the attack.

The source architecture is visible throughout. Every output shows whether a claim comes from a Control Risks analyst report, a human-authored Breaking Event, or a reliability-rated news source. It also clearly displays when that source was published. A reader can calibrate how much weight to give each claim based on its provenance and recency rather than accepting the output at face value.

The full archive test showed what's possible. When I re-ran historical comparison questions with AskAnna's 2008–present Control Risks archive enabled, the outputs changed materially. On a Sahel question, it pulled a September 2012 contemporaneous Control Risks assessment, written as the Malian state was collapsing, and showed exactly how that assessment developed against the 2026 reality. That's institutional memory put to analytical use, not just retrieval. It traces the evolution of analyst thinking over time rather than synthesizing a static body of knowledge.

In addition to the above, we're using this evaluation to prioritize enhancements. Like any product, AskAnna still has development areas. Fit for purpose is AskAnna's lowest dimension score. On high-urgency questions where a concise and efficient output is the right answer, the structured format occasionally over-delivers. The output is always accurate and sourced but not always calibrated to the urgency of the moment, where the user just needs the bottom line. We are currently building out additional output formats, including a BLUF-style one to address the above, and creating pathways for the user to incorporate their own templates into the output, introducing more personalization options.

There's also an honest recency limitation. AskAnna's corpus is updated continuously, but it's a closed corpus and not a live web search. So for truly breaking intelligence, where the decisive fact emerged in the last few minutes to an hour, web-search models have a real-time edge that closed-corpus architectures struggle to match. AskAnna already incorporates high-reliability news, and we intend to close that gap further by introducing more content into AskAnna from our existing data lake, including low- and medium-reliability-rated news and social media. We will still maintain our principles of transparency in this, giving the user the control to toggle this on and off, knowing there are many queries or situations where that information would not be an appropriate inclusion.
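A user-controlled reliability toggle of this kind could look something like the following Python sketch. The names (`Reliability`, `NewsItem`, `eligible_items`, `include_lower_reliability`) and the three-tier scale are hypothetical stand-ins chosen for illustration; the post does not describe how the planned feature will be implemented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Reliability(IntEnum):
    # Hypothetical three-tier rating; higher means more reliable.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class NewsItem:
    headline: str
    reliability: Reliability


def eligible_items(items: list, include_lower_reliability: bool = False) -> list:
    """Return the items retrieval may use.

    By default only high-reliability items pass; the user must opt in
    explicitly before low- and medium-rated content is considered.
    """
    floor = Reliability.LOW if include_lower_reliability else Reliability.HIGH
    return [item for item in items if item.reliability >= floor]
```

Defaulting the toggle to off keeps the conservative behavior unless the user deliberately widens the inputs, and each item keeps its rating so the distinction can still be shown in the output.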

Where We Go Next

The general models are getting better by the day. That's genuinely exciting, and it will keep pushing purpose-built tools to be better too. But better at generating text and better at generating intelligence are not the same thing. Conflating the two has a cost that's easy to underestimate until something goes wrong.

Earlier in this piece I posed the question: what does "good enough" actually mean when the stakes are high? So here's the answer I've come up with:

"Good enough" is a reasonable standard for a lot of things. But not when the downstream consequence of getting it wrong is measured in the safety of people, or in the kind of investment and operational decisions that don't get walked back easily. And before you trust any AI tool with those decisions, ask your vendors the harder questions: what's actually in the data lake, who wrote it, how is it rated, and what happens when the model doesn't have the answer?

I have experienced many of these exact scenarios, and this evaluation was designed to test them. They're the scenarios where the gap between a general-purpose model and a purpose-built intelligence tool becomes very real, very fast. In a world where AI is increasingly embedded in decision-making, the question is ultimately whether you can trust what it's telling you when it actually matters.

The tools most likely to go unquestioned are the ones that already live in your stack. Familiarity is not the same as trustworthiness. In intelligence, the output that gets forwarded without verification is the one that causes harm.

Right now it can be good fun to keep asking "can AI do this?" Often, it can! Increasingly though, that's the wrong question.

In this business, whether you’re building yourself or using a provider, we need to be asking: "Can I trust this output when the stakes are high and I'm moving fast?" And that question has a very different answer depending on what's underneath.