9 May 2026 · 6 min read
Post #1 · 27 percent return in three months. Too good to be true.
The first backtest of my trading system. For about ten seconds I felt great. Then I didn't.
This morning at a quarter past eight I sat at my laptop with coffee, looking at the results of the first backtest of my trading system. Three months simulated, January through the end of March 2024. Ten thousand euros of starting capital.
The numbers appeared in the terminal:
Total return: +27.36%
Sharpe ratio: 7.10
Max drawdown: -1.79%
Win rate: 69.2%
For about ten seconds I felt great. Then I didn't.
Too good to be true
A Sharpe ratio of 7. For those unfamiliar with the term: it's a measure of return divided by risk. Above 1 is good. Above 2 is exceptional. The most famous hedge fund ever, Renaissance Technologies' Medallion Fund, reportedly sustained a Sharpe of around 2.5 for decades. My terminal was showing almost three times that.
A Sharpe of 7 in three months for a long-only equity strategy? That can't be right. That's not edge, that's a bug.
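To see the absurdity in numbers, here's a quick sketch of how a Sharpe ratio is usually computed (the simplified textbook version: mean daily return over its standard deviation, annualized with the square root of 252 trading days, risk-free rate ignored). The numbers are made up to illustrate:

import numpy as np

# Simplified Sharpe ratio on daily returns, annualized over 252 trading
# days. The risk-free rate is ignored for the sake of the sketch.
def sharpe_ratio(daily_returns):
    return np.mean(daily_returns) / np.std(daily_returns) * np.sqrt(252)

# A Sharpe of 7 implies a mean daily return of 7 / sqrt(252), roughly
# 0.44 standard deviations, day in day out.
rng = np.random.default_rng(42)
steady = rng.normal(0.0044, 0.01, 63)  # 63 trading days, about one quarter
print(sharpe_ratio(steady))            # hovers around 7

A strategy that consistent barely has a losing week. Real strategies fluctuate.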
I typed "+27% Sharpe 7" into the terminal and sent it to Harry, my AI sparring partner. He wrote back immediately:
"This is exactly why we smoke test before running 4 years. Sharpe 7 is bizarrely high. Good = 1.5. Top quant = 2-3. A real strategy that makes 27% in 3 months MUST fluctuate somewhere. A -1.8% drawdown over 63 days is too smooth to be true."
We went searching. Two bugs later, I hadn't even finished my coffee.
Bug one: the "simplification" that broke everything
In my system there's an exit trigger called target_hit. The idea: buy a stock when it's undervalued within its sector, sell it when it reaches the sector median. Classic value mean-reversion.
When we looked at the first run, we saw something odd. 90% of all exits were target_hit. Ninety percent. And often the day after entry. BKNG bought at 124.93 euro, target_hit at 122.57 a day later. That's not a target hit, that's a 1.9 percent loss.
I asked Cursor, my other AI that writes the code, to show the target_hit function. There it was:
def check_target_hit(position, ticker_ev_ebitda, sector_median):
    """Trigger 6: EV/EBITDA reached sector median (coming from below).

    Logic: if at entry the ticker was below sector median,
    and now is at/above, exit.

    For V1 simplification: just check if current >= sector median.
    """
    return ticker_ev_ebitda >= sector_median
Read the docstring. He knew exactly how it should work. "If at entry the ticker was below sector median, and now is at/above, exit." That's correct. But then came the line that broke everything: "For V1 simplification: just check if current >= sector median."
Cursor wrote well-documented code and then skipped half the logic with a comment that it would be fine. Because of that the trigger was no longer about "stock was undervalued and has reached target," but about "stock is above sector median right now." That's half the market, every day.
For my portfolio it meant: buy a stock, check if it's somewhere above the median, sell it again right away. A kind of algorithmic ADHD.
We had Cursor write the fix. Four new tests, old tests adjusted, done in ten minutes. But the real question remained: how many other "V1 simplifications" are there somewhere in my code?
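For reference, a minimal sketch of what the corrected trigger looks like conceptually. This is not the literal fix from my repo; the field was_below_median_at_entry is a made-up name for whatever the position record stores at entry:

def check_target_hit(position, ticker_ev_ebitda, sector_median):
    """Trigger 6: exit when EV/EBITDA reaches the sector median,
    but only for positions that entered from below it."""
    # Without this entry-side check, the trigger fires for half
    # the market on any given day: the V1 bug.
    if not position.was_below_median_at_entry:
        return False
    return ticker_ev_ebitda >= sector_median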
Bug two: talking about what you don't yet know
After the target_hit fix I ran the same three months again.
Total return: +27.36%
Sharpe ratio: 7.10
Identical. Fixing bug one hadn't moved the headline numbers at all, which meant it wasn't the source of the too-good results. Something else was. That's bug two.
Here it gets subtle.
My system calculates scores for 462 stocks every trading day. Those scores are based on the closing prices of that day. Then it decides: buy this, sell that. And in my backtest the purchase also happened at that closing price.
That sounds logical. It isn't.
In the real world the closing price arrives at half past nine in the evening, and then my backend calculates the scores. By eleven I know which stocks look interesting tomorrow. But tomorrow is tomorrow. I can't buy at today's closing price anymore; that moment has passed.
In the backtest, I could. The system looked at the closing price, calculated that the stock scored high, and bought as if it could still get that same closing price. That's look-ahead bias: a system trading on information that wasn't yet available at the moment of the trade.
What's the effect? Imagine NVIDIA posts unexpectedly good quarterly results on a Wednesday, after the market closes. That news is priced into Thursday's close. My system sees a high score for NVIDIA on Thursday evening and "buys" it that same Thursday, at the closing price. But in reality I could only have bought on Friday morning, and by then the market knows the news too. NVIDIA would probably open three percent higher.
Over sixty trading days, with fourteen positions, that difference compounds. The gap between optimistic backtest and reality quickly adds up to thirty percentage points of return. Which is exactly what I saw.
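The mechanics fit in a few lines. A toy pandas sketch, not my actual backtester, with made-up prices:

import pandas as pd

# Toy illustration of look-ahead bias. The dates land on Wed/Thu/Fri;
# Thursday's close carries the surprise news: a jump from 100 to 110.
closes = pd.Series(
    [100.0, 110.0, 113.0],
    index=pd.date_range("2024-01-03", periods=3),
)
signal = closes.pct_change() > 0.05     # the score spikes on the news day

# Biased fill: buy at the very close that produced the signal.
biased = closes[signal]                 # pays 110, as if the news were free
# Realistic fill: the earliest possible trade is the next session.
realistic = closes.shift(-1)[signal]    # pays 113, after the market reacted

print(float(biased.iloc[0]), float(realistic.iloc[0]))  # 110.0 vs 113.0

The repair is that shift: a signal computed from day t's close may only fill at a day t+1 price.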
The pattern
Both bugs have something in common. They're not stupid mistakes in the sense of "syntax error" or "wrong number." They're logic errors in a system that otherwise works as it should. Tests green. Code clean. Output looks plausible.
A test can verify: "if you call the check_target_hit function with these parameters, you get this output." But a test can't verify: "is this conceptually the right check in the first place?" That's something a human has to see. Or better: multiple pairs of eyes, with multiple perspectives.
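A hypothetical example of the gap (pytest style, names made up): this test passes against the buggy V1 function, because it verifies the mechanics while silently ratifying the wrong concept.

from triggers import check_target_hit  # hypothetical module name

def test_target_hit_when_at_or_above_median():
    # Green for the V1 code: at/above the median returns True. But nothing
    # here encodes the intent "the position must have entered from BELOW
    # the median", so the test blesses the broken simplification.
    assert check_target_hit(
        position=None,  # entry state isn't even consulted by the V1 code
        ticker_ev_ebitda=9.0,
        sector_median=8.0,
    )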
In my project that works like this: I think and design, Harry reviews and writes prompts, Cursor builds the code. Three pairs of eyes on every decision. Today it worked. Cursor wrote the bug, Harry saw the numbers and said "this can't be right," I checked the code. Ten minutes later bug one was fixed. An hour later bug two.
What's new for me: I trust Cursor to write code, but I can't trust him to be logically correct. He's brilliant at mechanics. It works. But whether it also makes sense, that's another question.
What now
After the two fixes I ran a sanity check. Not the bull market of Q1 2024, but Q3 2022, when the market dropped about 16 percent. If my strategy is unbiased, it should come out around zero in that period. Not positive, not heavily negative. Just: defensive behavior in a falling market.
Result: +0.04 percent. Flat. No loss, no gain. Exactly what a healthy system in a crashing market should do.
Next time you see a backtest with a Sharpe of 7 and a drawdown under two percent, you'll know: somewhere, in one of hundreds of lines of code, someone is fooling themselves. Sometimes it's you. Sometimes it's your AI.
In this case it was both of us.