methods · research · education

The Numbers Game: What Therapists Should Know About Effect Sizes

The study landed in Dr. Katrin Weiss's inbox on a Monday morning in Munich, sandwiched between appointment reminders and a departmental memo about parking. A pharmaceutical company was promoting a new adjunctive treatment for treatment-resistant depression, and the headline was impressive: "Significant improvement demonstrated (p < 0.001)." Weiss, who had trained in both psychotherapy and psychiatry, felt the familiar twitch of scepticism. She clicked through to the original paper. Sample size: forty-three patients. No confidence intervals reported. Effect size buried in supplementary materials. She closed the tab and returned to her clinical notes.

This small act of statistical hygiene—questioning the claim, seeking the actual numbers—happens less often than it could, given what is at stake in treatment decisions. Not because clinicians are gullible, but because the language of research has evolved in ways that make it hard to access, wrapped in Greek letters and abbreviations that often feel more excluding than inviting. Effect sizes, confidence intervals, numbers needed to treat—used well, these help distinguish robust findings from polished marketing. Understanding them is not a luxury for research-minded specialists. It is practical self-defence.

Let us begin with the most commonly cited metric: Cohen's d, which measures how much two groups differ in standard deviation units. A small effect (d = 0.2) means the treatment group improved about one-fifth of a standard deviation more than controls. A medium effect (d = 0.5) reaches half a standard deviation. A large effect (d = 0.8) suggests the average treated patient did better than roughly seventy-nine percent of the control group, a figure known as Cohen's U3, derived from the overlap of two normal distributions. A close cousin, the "common language effect size," gives the probability that a randomly chosen treated patient outscores a randomly chosen control: about seventy-one percent at d = 0.8.
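
Both translations follow directly from d when outcomes are roughly normal with equal variances in the two groups. A minimal sketch in Python (the helper names are illustrative, not standard library functions):

```python
from scipy.stats import norm

def u3(d):
    """Cohen's U3: proportion of the control group scoring below
    the mean of the treatment group."""
    return norm.cdf(d)

def probability_of_superiority(d):
    """Common-language effect size: probability that a randomly chosen
    treated patient outscores a randomly chosen control patient."""
    return norm.cdf(d / 2 ** 0.5)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: U3 = {u3(d):.0%}, "
          f"P(treated > control) = {probability_of_superiority(d):.0%}")
# d = 0.8: U3 = 79%, P(treated > control) = 71%
```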

But how much do those numbers actually explain? Converted to a correlation between treatment and outcome, a small effect corresponds to r of about 0.10, a medium effect to about 0.24, and a large effect, the kind that makes headlines, to about 0.37. Squared, those correlations mean the treatment accounts for roughly one, six, and fourteen percent of the variance in outcomes. Even when a therapy demonstrates a "large" effect, more than eighty-five percent of what determines outcomes remains unexplained. This is not a failure of therapy. It is the nature of human complexity. The factors shaping recovery extend far beyond any intervention: relationships, employment, genetics, timing, luck. When someone claims to have "proven" a treatment's efficacy, it is usually overreach, at best shorthand, at worst marketing. In psychotherapy, we estimate effects and their uncertainty. We rarely prove anything.
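
The conversion behind those figures is the standard approximation for two equally sized groups, r = d / sqrt(d² + 4); a brief sketch:

```python
def d_to_r(d):
    """Point-biserial correlation implied by Cohen's d for two equally
    sized groups: r = d / sqrt(d^2 + 4)."""
    return d / (d ** 2 + 4) ** 0.5

for d in (0.2, 0.5, 0.8):
    r = d_to_r(d)
    print(f"d = {d}: r ~ {r:.2f}, variance explained ~ {r ** 2:.0%}")
# d = 0.8: r ~ 0.37, variance explained ~ 14%
```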

Consider a concrete example. Imagine a German outpatient trial comparing cognitive-behavioural therapy plus usual care against usual care alone for moderate depression. The study reports d = 0.35, with a 95% confidence interval of [0.10, 0.60]. What does this actually tell you? The point estimate suggests a small-to-medium effect. But the confidence interval—the range of plausible true values—spans from barely detectable (0.10) to moderately substantial (0.60). The study is honest about its uncertainty. Had the interval been [−0.05, 0.75], including zero, the conclusion would shift: we cannot rule out that the treatment has no effect at all.
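
For the curious, an interval like that can be reproduced with the common large-sample formula for the standard error of d. The summary statistics below are invented to match the hypothetical trial, not drawn from any real dataset:

```python
import math

def cohens_d_with_ci(m1, sd1, n1, m2, sd2, n2, z=1.96):
    """Cohen's d from group summary statistics, with an approximate 95%
    confidence interval (large-sample formula; Hedges' correction omitted)."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Invented symptom-reduction scores: CBT plus usual care vs. usual care alone
d, (lo, hi) = cohens_d_with_ci(m1=12.0, sd1=8.0, n1=120, m2=9.2, sd2=8.0, n2=120)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")  # d = 0.35, 95% CI [0.10, 0.60]
```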

This brings us to the most misunderstood statistic in research: the p-value. When a study reports p < 0.05, it means that a result at least as extreme as the one observed would occur less than five percent of the time by chance alone, assuming no real effect exists. It does not mean there is a ninety-five percent chance the treatment works. It does not tell you how large the effect is. A study with ten thousand participants might achieve p < 0.001 for a trivially small effect, while a study with thirty participants might miss significance despite meaningful benefit. The p-value is a threshold, not a measure of importance. Confidence intervals reveal what p-values hide: the range of uncertainty around any finding.
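
A quick simulation makes this concrete. The scenarios below, a trivial true effect of d = 0.05 in a huge trial and a genuine effect of d = 0.5 in a tiny one, are illustrative assumptions rather than data from any study:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)

# Huge trial, trivial true effect (d = 0.05): the p-value is usually tiny anyway
big_control = rng.normal(0.00, 1.0, size=10_000)
big_treated = rng.normal(0.05, 1.0, size=10_000)

# Tiny trial, meaningful true effect (d = 0.5): significance is roughly a coin flip
small_control = rng.normal(0.0, 1.0, size=30)
small_treated = rng.normal(0.5, 1.0, size=30)

print("large n, trivial effect: p =", ttest_ind(big_treated, big_control).pvalue)
print("small n, real effect:    p =", ttest_ind(small_treated, small_control).pvalue)
```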

Translating effect sizes into clinical meaning requires another metric: the number needed to treat, or NNT. This answers a straightforward question: how many patients must receive an intervention for one additional person to benefit compared to control? Lower is better. For our hypothetical CBT trial with d = 0.35, the NNT works out to approximately eight to ten—meaning for every eight to ten patients treated, one additional patient recovers who would not have with usual care alone. For context, antidepressant medication versus placebo typically shows an NNT of seven to eight in large meta-analyses. Lambert and Shimokawa's 2011 meta-analysis of feedback-informed therapy—using routine outcome monitoring to guide treatment—found an effect size of approximately d = 0.25 for preventing deterioration, translating to an NNT of roughly ten to fifteen. These are modest numbers. They are also meaningful. In psychotherapy, where chronicity is common and base rates of spontaneous recovery vary widely, an NNT of twelve represents genuine clinical value. The mistake is expecting miracle cures.
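
One common way to arrive at such figures is the Furukawa method, which converts d into an NNT once a control-group response rate is assumed; the rates below are assumptions chosen to bracket the eight-to-ten range cited above:

```python
from scipy.stats import norm

def nnt_from_d(d, control_response_rate):
    """Convert Cohen's d to a number needed to treat (Furukawa method):
    assumes normal outcomes and a binary 'response' cutoff implied by
    the control group's response rate."""
    cer = control_response_rate
    treated_response_rate = norm.cdf(d + norm.ppf(cer))
    return 1.0 / (treated_response_rate - cer)

# Assumed control response rates of 20-30 percent, for illustration only
for cer in (0.2, 0.3):
    print(f"d = 0.35, control response {cer:.0%}: NNT ~ {nnt_from_d(0.35, cer):.1f}")
# d = 0.35, control response 20%: NNT ~ 9.0
# d = 0.35, control response 30%: NNT ~ 7.6
```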

There are reliable warning signs when evaluating therapy research. Studies with fewer than fifty participants per group are underpowered: their estimates are unstable and liable to fluctuate wildly on replication. Small studies that do reach significance also tend to report inflated effect sizes, a phenomenon called the "winner's curse." Pre-post designs without control groups are essentially meaningless for establishing treatment effects, since patients often improve over time anyway through regression to the mean, life changes, or the natural course of a depressive episode. When a study reports only that patients improved from baseline, without comparison to untreated controls, it tells you almost nothing about whether the treatment caused that improvement. Industry-funded trials without independent replication deserve the same scepticism you would apply to a restaurant reviewing itself. None of these red flags proves a study worthless, but they should lower your confidence considerably.
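
The winner's curse is easy to demonstrate by simulation: run many small trials of a treatment whose true effect is a fixed d, keep only those that reach p < 0.05, and the surviving estimates come out inflated. A sketch with invented numbers:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=7)
true_d, n_per_group, n_studies = 0.35, 25, 5_000

observed, significant = [], []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_d, 1.0, n_per_group)
    sd_pooled = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
    observed.append((treated.mean() - control.mean()) / sd_pooled)
    significant.append(ttest_ind(treated, control).pvalue < 0.05)

observed, significant = np.array(observed), np.array(significant)
print(f"mean d across all small studies:        {observed.mean():.2f}")   # close to 0.35
print(f"mean d among the 'significant' studies: {observed[significant].mean():.2f}")  # noticeably larger
```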

Dr. Weiss has developed a simple heuristic over two decades of practice. When presented with a new treatment claim, she asks three questions: What is the effect size? What is the confidence interval? Has it been replicated independently? If any answer is "not reported" or "unknown," she files the claim under "interesting but unproven" and moves on.

This approach has limits, and honesty requires naming them. Not every clinician wants to engage at this level of statistical detail, and that reluctance is legitimate—therapy is relational work, not data science. Group averages obscure individual variation; the patient who responds dramatically to an intervention with small average effects is no less real than the one who shows no response to a "proven" treatment. Moreover, randomised trials typically exclude the complex, comorbid patients who fill German outpatient and inpatient practices—the person with depression, chronic pain, and a precarious housing situation rarely qualifies for the studies that generate our effect sizes. This means the numbers we cite may not generalise to the people we actually treat. Statistical literacy will not make you a methodologist. It simply gives you enough to ask better questions.

In Germany, where data protection carries almost constitutional weight and guideline-based care shapes reimbursement, these questions matter practically. When the G-BA evaluates a new psychotherapy approach, or when a Krankenkasse demands evidence for a treatment, the underlying data will include effect sizes and confidence intervals. Knowing how to read them—and knowing their limits—is not academic. It determines which treatments become available and how they are delivered. Other European contexts carry their own sensitivities: in Poland, systematic data collection can evoke uncomfortable historical echoes; in France, psychodynamic traditions raise legitimate questions about what quantification leaves out. These are not irrational objections. They are reminders that numbers carry cultural weight, and that measurement serves clinical judgment, not the other way around.

What does this mean on Monday morning? When you hear "proven," ask for effect size and confidence interval. Be cautious with studies under fifty participants per group. Dismiss pre-post designs as evidence for causal effects. Treat an NNT of ten to fifteen as clinically meaningful in psychotherapy, given the chronicity of many conditions. And when research contradicts your experience with a specific patient, use both sources of knowledge—do not throw either away. The research tells you about populations. Your judgment is about the person in front of you.

Statistical sophistication should not become a new form of paralysis. A modest effect size might represent, for one particular patient, the difference between gradual recovery and prolonged suffering. The goal is not certainty but calibrated uncertainty—the ability to distinguish robust findings from noise, honest researchers from skilled marketers. In a field as complex and humbling as psychotherapy, that modest protection is worth a great deal.

Dr. Weiss still receives promotional emails. She still reads them, sometimes. But now she knows which numbers to look for—and what to do when they are missing.
