A/B Testing Outreach Messages: Definition & Guide
Most people test their outreach messages the wrong way: they send variant A for two weeks, switch to variant B, then compare results across different time periods with different audience compositions. That is not a test. That is a guess with extra steps.
A/B testing outreach messages means sending two message variants to comparable audience segments at the same time, under the same conditions, with exactly one variable changed between them. You measure the outcome (reply rate, acceptance rate, or booked meeting) and declare a winner only when the sample is large enough to trust.
That definition sounds obvious. The execution is where most outbound programs fall apart.
What "One Variable" Actually Means in Practice
Every experienced operator has a story about testing five things at once and having no idea what moved the needle. We have that story too.
The variables you can change in a LinkedIn outreach message fall into a clear hierarchy:
| Variable | What to measure | When to test it |
|---|---|---|
| Opening line | Reply rate | First |
| Call to action | Reply rate / click rate | Second |
| Message length | Reply rate | Third |
| Tone (formal vs. direct) | Reply rate | Fourth |
| Personalisation depth | Reply rate | Only after basics are settled |
Start with the opening line. It is the highest-leverage element in any cold message because recipients on LinkedIn preview the first 30-40 characters before deciding whether to open at all. Change only that. Keep the body, the CTA, and the sign-off identical between variants.
Once you have a winning opener, freeze it and move to the call to action. Then length. Then tone. Sequential testing like this takes longer but it tells you something you can actually act on, rather than giving you a winner you cannot explain.
Minimum Sample Sizes (and Why 20 Sends Is Not Enough)
We cap our own campaign sends at volumes that keep accounts safe, which means tests take real calendar time. That is the honest constraint nobody likes discussing.
The practical minimum is 50 sends per variant. Below that, a single person who happens to be unusually responsive (or unusually unresponsive) can shift your apparent reply rate by several percentage points: at 20 sends, one extra reply moves the rate by 5 points. At 50 sends each, the same reply moves it by 2, so one outlier has much less power to mislead you.
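To make the outlier arithmetic concrete, here is a minimal sketch. The 20% observed reply rate is a hypothetical figure chosen for illustration, and the Wilson score interval is one common way to express how wide the plausible range is at a given sample size. It is not something the test setup described here depends on.

```python
import math


def shift_per_reply(sends: int) -> float:
    """Percentage points the observed rate moves when one more person replies."""
    return 100.0 / sends


def wilson_interval(replies: int, sends: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an observed reply rate."""
    p = replies / sends
    denom = 1 + z**2 / sends
    centre = (p + z**2 / (2 * sends)) / denom
    half = z * math.sqrt(p * (1 - p) / sends + z**2 / (4 * sends**2)) / denom
    return centre - half, centre + half


for n in (20, 50, 100):
    # Hypothetical: assume 20% of recipients replied at each sample size.
    low, high = wilson_interval(round(0.2 * n), n)
    print(f"{n} sends: one reply = {shift_per_reply(n):.1f} pts, "
          f"plausible range {low:.0%} to {high:.0%}")
```

The interval narrows as sends go up, which is the same point as the outlier argument seen from a different angle.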
For connection note tests, where the outcome is binary (accepted or not), 50 per variant is workable because the signal is clean. For longer follow-up message sequences where you are measuring booked meetings, push to 100 per variant. The funnel is longer and variance compounds at each step.
If your LinkedIn acceptance rate is sitting in the typical range for cold outreach, a test of 50 sends per variant will give you enough acceptances to detect a large difference between variants, but not a small one. Decide before you start what effect size would actually change your behaviour, then size the test to detect it. Do not size it based on whatever volume feels convenient.
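Sizing a test to an effect can be sketched with the standard two-proportion sample-size formula. The baseline rate, the lifts, the 5% significance level and the 80% power below are all assumptions picked for illustration. Substitute your own numbers.

```python
import math
from statistics import NormalDist


def sends_per_variant(baseline: float, lift: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sends per variant needed to detect `lift` over `baseline`.

    Uses the normal-approximation formula for comparing two proportions
    (two-sided test). Treat the result as a rough guide, not a guarantee.
    """
    if lift <= 0 or not 0 < baseline < 1 or baseline + lift >= 1:
        raise ValueError("need 0 < baseline < baseline + lift < 1")
    p1, p2 = baseline, baseline + lift
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / lift ** 2)


# Hypothetical 30% baseline acceptance rate, three candidate effect sizes.
for lift in (0.05, 0.10, 0.25):
    print(f"lift of {lift:.0%}: about {sends_per_variant(0.30, lift)} sends per variant")
```

Required volume grows roughly with the inverse square of the lift, so halving the effect you care about roughly quadruples the sends you need.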
Running the Test Safely Inside a Campaign
The mistake we keep seeing in beta is founders who try to collect data fast by spiking daily send volumes. That creates two problems at once: noisy data (because you are reaching different parts of your audience at different densities) and real account risk.
Ampliflow's cloud-based execution runs through the Unipile API with human-like daily rate limits and randomised timing jitter built in. Your laptop can be closed. The jitter exists so that your sends look like a real person's activity, not an automated system hitting the API at exactly 9:00 am every morning. When you override those limits to collect test data faster, you are trading account safety for speed. That trade is rarely worth it.
A cleaner approach: run the split test inside a single campaign using Ampliflow's built-in A/B testing feature. The visual workflow builder splits your imported list (from LinkedIn search or Sales Navigator) into two comparable segments automatically. Variant A goes to one half, variant B to the other, in the same time window, with the same daily pacing. Funnel analytics surface reply rates per variant in real time.
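If you are splitting a list by hand rather than inside a tool, the underlying idea is a random shuffle followed by a cut down the middle. The sketch below is a generic illustration and not a description of how Ampliflow implements its split. The function name `split_ab` and the prospect identifiers are hypothetical.

```python
import random


def split_ab(prospects: list[str], seed: int = 42) -> tuple[list[str], list[str]]:
    """Randomly split a prospect list into two halves of near-equal size.

    Shuffling first avoids accidental bias from list order (for example,
    a search export sorted by seniority or connection degree).
    """
    shuffled = prospects[:]                  # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split reproducible
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical prospect identifiers.
prospects = [f"prospect_{i}" for i in range(101)]
variant_a, variant_b = split_ab(prospects)
print(len(variant_a), len(variant_b))
```

Random assignment is what makes the two segments comparable. Splitting by alphabet, by import date, or by "the first half of the export" does not give you that.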
You also benefit from auto-pause on reply: when someone responds, they exit the sequence immediately. That keeps test data clean because you are not counting people who already converted mid-sequence.
What to Vary in Connection Notes vs. Follow-Up Messages
Connection notes and follow-up messages are different enough that the testing hierarchy shifts between them.
For connection notes (LinkedIn's 300-character limit), the entire message is the opening line. There is no body to fall back on. The most productive variables to test here are:
- Mention a shared context vs. lead with a direct ask
- Name their company vs. name their role
- End with a question vs. make a statement
One finding from our own testing: connection notes that end with a question get accepted at meaningfully higher rates than ones that state a value proposition. Our working theory is that a question signals genuine curiosity rather than a broadcast. Take that with the usual caveat that your audience and ICP will behave differently from ours.
For follow-up messages (after the connection is accepted), the opening line is still first priority, but message length matters more here than it does in connection notes. Short messages, under 60 words, tend to outperform longer ones at the first follow-up. By the third touchpoint in a drip campaign, the relationship has some warmth and you can go longer without losing people.
What Not to Test Yet
Personalisation depth is tempting to test early because it feels like the most sophisticated variable. It is also the hardest to isolate. "Personalised" messages vary in ways you cannot fully control (different people mention different things about different prospects), so you end up comparing apples to oranges rather than two clean variants.
Get your baseline message structure right first. A well-structured generic message usually outperforms a poorly structured personalised one, and you need a strong baseline to even detect the lift that personalisation adds.
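Similarly, do not run three variants at once thinking you will find the winner faster. A three-way test needs a full sample for every variant, so at least one and a half times the total sends of a two-way test, and more once you allow for the extra comparisons. Most outbound programs at founder-led scale do not have the volume to sustain that cleanly.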
Reading the Results Without Fooling Yourself
When variant B beats variant A by 3 percentage points on 30 sends each, that is not a result. That is noise. The honest call is to keep running.
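When variant B beats variant A by 8 percentage points across 100 sends each, that is a far stronger signal and usually worth acting on, although at higher baseline rates a gap that size can still be chance. Promote variant B to your main sequence, archive variant A, and start the next test on a different variable.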
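One way to sanity-check a gap before acting on it is a two-proportion z-test. The sketch below uses only the Python standard library. The reply counts are hypothetical and chosen to roughly match the gaps discussed above. It is a rough check under a normal approximation, not a full analysis.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(replies_a: int, sends_a: int,
                           replies_b: int, sends_b: int) -> float:
    """Two-sided p-value for the difference between two reply rates.

    Uses the pooled two-proportion z-test (normal approximation), which
    gets unreliable when reply counts are very small.
    """
    rate_a = replies_a / sends_a
    rate_b = replies_b / sends_b
    pooled = (replies_a + replies_b) / (sends_a + sends_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return 1.0  # no replies at all, or everyone replied: nothing to compare
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: (replies_a, sends_a, replies_b, sends_b)
scenarios = {
    "about 3 pts on 30 sends each": (6, 30, 7, 30),
    "8 pts on 100 each, 20% baseline": (20, 100, 28, 100),
    "8 pts on 100 each, 5% baseline": (5, 100, 13, 100),
}
for label, counts in scenarios.items():
    print(f"{label}: p = {two_proportion_p_value(*counts):.2f}")
```

The same 8-point gap gives a different p-value depending on the baseline rate, which is one more reason to decide your threshold before the test starts.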
Avoid ending tests early because one variant is "clearly winning." Early leads flip all the time. Set your sample size before the test starts, not after you see interim numbers that happen to look promising.
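A small simulation shows why peeking is risky. In the sketch below both variants have the same true reply rate, so any "winner" is a false positive. The 20% rate, the peek interval, the 0.05 threshold and the trial count are all assumptions picked for illustration.

```python
import math
import random
from statistics import NormalDist


def p_value(replies_a: int, replies_b: int, sends_each: int) -> float:
    """Two-sided pooled z-test p-value for two variants with equal sends."""
    pooled = (replies_a + replies_b) / (2 * sends_each)
    se = math.sqrt(pooled * (1 - pooled) * 2 / sends_each)
    if se == 0:
        return 1.0
    z = (replies_b - replies_a) / sends_each / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(true_rate: float = 0.2, sends: int = 100,
                         peek_every: int = 10, trials: int = 2000,
                         seed: int = 7) -> tuple[float, float]:
    """Return (rate when stopping at any 'significant' peek, rate at fixed size)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(trials):
        a = b = 0
        flagged_at_any_peek = False
        for i in range(1, sends + 1):
            a += rng.random() < true_rate   # both variants share the same true rate
            b += rng.random() < true_rate
            if i % peek_every == 0 and p_value(a, b, i) < 0.05:
                flagged_at_any_peek = True  # an early stopper would have called it here
        peeking_hits += flagged_at_any_peek
        fixed_hits += p_value(a, b, sends) < 0.05
    return peeking_hits / trials, fixed_hits / trials


peeking, fixed = false_positive_rates()
print(f"false winners when peeking: {peeking:.1%}, at fixed sample size: {fixed:.1%}")
```

Because the final check is one of the peeks, the peeking rate can never be lower than the fixed-size rate, and in practice it tends to be noticeably higher.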
Ampliflow's funnel analytics track reply rate and acceptance rate per variant in real time, which is useful for monitoring but dangerous for early stopping. Use the dashboard to check that the test is running cleanly. Do not use it to find permission to call it early.
Building disciplined testing habits is part of treating your lead list as a real asset rather than a list to burn through. Every test improves the next campaign, and that compounds across months in ways that one-shot blasts never do.