Fair Trial: Your First Paired Run

A step-by-step checklist and log template for running your first valid AI vs. manual comparison.

What makes a trial 'fair'

A fair trial controls for confounds: same task, same evaluator, same rubric, same time pressure. Without controls, you're measuring noise, not AI impact.

Pre-Flight Checklist

Before you start, confirm:

  • Task selected: Routine task you do at least weekly
  • Rubric locked: Using anti-drift template (see Rubrics guide)
  • Sample size: Minimum 5 matched pairs (10 total outputs)
  • Evaluator assigned: Same person scores all outputs
  • Order randomized: Evaluator doesn't know which is AI vs. manual
  • Timing ready: Stopwatch for task completion time
  • TLX prepared: Workload questionnaire for each task
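
If you're using the NASA-TLX for the workload measure, the raw (unweighted) variant is the simplest to score: the mean of the six subscale ratings, each on the standard 0-100 scale. A minimal scoring sketch (subscale names follow the standard questionnaire; the example ratings are made up):

```python
# Raw (unweighted) NASA-TLX: the mean of six subscale ratings, each 0-100.
TLX_SUBSCALES = [
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
]

def raw_tlx(ratings: dict) -> float:
    """Return the raw TLX score (0-100) for one questionnaire response."""
    missing = [s for s in TLX_SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing subscale ratings: {missing}")
    return sum(ratings[s] for s in TLX_SUBSCALES) / len(TLX_SUBSCALES)

# Example: a hypothetical response after one task
print(raw_tlx({
    "mental_demand": 55, "physical_demand": 10, "temporal_demand": 40,
    "performance": 25, "effort": 50, "frustration": 30,
}))  # -> 35.0
```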

The Fair Trial Protocol

Step 1: Select 5 Matched Task Instances

Choose 5 representative examples of the task. They should vary in complexity but be comparable:

| Pair | Task Description | Complexity (1-3) |
|------|------------------|------------------|
| 1 | [e.g., "Summarize Q3 report"] | 2 |
| 2 | [e.g., "Summarize competitor analysis"] | 2 |
| 3 | [e.g., "Summarize customer feedback"] | 1 |
| 4 | [e.g., "Summarize market research"] | 3 |
| 5 | [e.g., "Summarize internal audit"] | 2 |

Step 2: Generate Both Versions

For each task:

  1. Manual version: Complete the task without AI (time it)
  2. AI version: Complete the task with AI assistance (time it)
  3. Record workload: Complete TLX after each version

Important: Randomize which version you do first for each pair (flip a coin), so order effects like learning the material on the first pass don't systematically favor one condition.
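
A minimal sketch for pre-generating that schedule, so the coin flips are recorded rather than improvised (the pair labels and wording are illustrative):

```python
import random

# Decide in advance whether the manual or AI version is done first for each pair.
# Recording the schedule up front keeps the randomization honest and auditable.
random.seed()  # use a fixed seed instead if you want a reproducible schedule
pairs = [1, 2, 3, 4, 5]
schedule = {pair: random.choice(["manual first", "AI first"]) for pair in pairs}

for pair, order in schedule.items():
    print(f"Pair {pair}: {order}")
```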

Step 3: Blind the Evaluator

  1. Remove any identifying markers (prompts, AI artifacts)
  2. Assign random IDs (e.g., A1, A2, B1, B2) that don't encode which version is which; see the sketch below this list
  3. Shuffle order before evaluation
  4. Evaluator sees only: Output + Rubric
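
A minimal sketch of the blinding step, assuming the cleaned outputs are files or documents you can relabel; the ID format and file name are illustrative, and the key file must stay hidden from the evaluator until Step 5:

```python
import csv
import random

# Ten cleaned outputs: (pair number, condition). The evaluator never sees this mapping.
outputs = [(pair, cond) for pair in range(1, 6) for cond in ("manual", "ai")]

# Shuffle, then assign anonymous IDs (O01..O10) in the shuffled presentation order.
random.shuffle(outputs)
blinding_key = [
    {"blind_id": f"O{i:02d}", "pair": pair, "condition": cond}
    for i, (pair, cond) in enumerate(outputs, start=1)
]

# Write the key somewhere the evaluator cannot see it; open it only at unblinding.
with open("blinding_key.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["blind_id", "pair", "condition"])
    writer.writeheader()
    writer.writerows(blinding_key)

print([row["blind_id"] for row in blinding_key])  # the evaluation order
```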

Step 4: Score All Outputs

Use your locked rubric. Record:

| Output ID | C1 Score | C2 Score | C3 Score | Total | Time (min) | TLX Score |
|-----------|----------|----------|----------|-------|------------|-----------|
| A1 | | | | | | |
| B1 | | | | | | |
| A2 | | | | | | |
| ... | | | | | | |

Step 5: Unblind and Analyze

Reveal which outputs were AI vs. manual. Calculate:

| Metric | Manual (avg) | AI-Assisted (avg) | Δ |
|--------|--------------|-------------------|---|
| Quality Score | | | |
| Time (min) | | | |
| TLX Workload | | | |
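
A minimal sketch of this calculation, assuming you've typed the Step 4 scores into a list of records after unblinding (the field names and example numbers are illustrative):

```python
from statistics import mean

# One record per output after unblinding; fill in all 10. Numbers here are placeholders.
records = [
    {"pair": 1, "condition": "manual", "quality": 10, "time_min": 28, "tlx": 55},
    {"pair": 1, "condition": "ai",     "quality": 9,  "time_min": 14, "tlx": 40},
    # ... records for pairs 2-5 ...
]

def condition_avg(metric: str, condition: str) -> float:
    """Average a metric across all outputs from one condition."""
    return mean(r[metric] for r in records if r["condition"] == condition)

for metric in ("quality", "time_min", "tlx"):
    manual_avg = condition_avg(metric, "manual")
    ai_avg = condition_avg(metric, "ai")
    print(f"{metric}: manual={manual_avg:.1f}  ai={ai_avg:.1f}  delta={ai_avg - manual_avg:+.1f}")
```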

Fair Trial Log Template

Fair Trial Log — [Task Type] — [Date]

Setup

  • Task: [description]
  • Sample size: 5 pairs
  • Evaluator: [name/initials]
  • Rubric: [link or name]

Pairs

| Pair | Task | Manual First? | Manual Time (min) | AI Time (min) |
|------|------|---------------|-------------------|---------------|
| 1 | | Y/N | | |
| 2 | | Y/N | | |
| 3 | | Y/N | | |
| 4 | | Y/N | | |
| 5 | | Y/N | | |

Blinded Evaluation

| ID | C1 | C2 | C3 | Total |
|----|----|----|----|-------|
| | | | | |

Results (after unblinding)

| Metric | Manual | AI | Δ | Significant? |
|--------|--------|----|---|--------------|
| Quality | | | | |
| Time | | | | |
| TLX | | | | |
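
For the "Significant?" column, a paired test on the per-pair differences is the usual check; with only 5 pairs, statistical power is very limited, so treat the result as directional rather than conclusive. A sketch using SciPy's paired t-test, assuming SciPy is installed (the scores are placeholders):

```python
from scipy.stats import ttest_rel

# Per-pair quality scores, aligned by pair (placeholder numbers).
manual_quality = [10, 8, 9, 7, 10]
ai_quality = [9, 8, 8, 7, 9]

t_stat, p_value = ttest_rel(ai_quality, manual_quality)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# With n = 5 pairs, only large, consistent differences will clear p < 0.05;
# add pairs or repeat the trial before drawing strong conclusions.
```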

Conclusions

  • Quality difference: [higher/lower/same]
  • Time difference: [faster/slower/same]
  • Workload difference: [lighter/heavier/same]
  • Recommendation: [continue/adjust/abandon]

Interpreting Results

Scenario A: AI wins on quality AND time

Action: Document the workflow and scale it. Repeat the trial in 2 weeks to confirm the result holds.

Scenario B: AI wins on time, loses on quality

Action: An AI draft plus a human edit may still work. Check whether AI time + edit time is less than manual time.
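
For example (numbers purely illustrative): if the manual version takes 25 minutes and the AI draft takes 10 minutes plus 5 minutes of editing, the combined workflow takes 15 minutes against 25, so it still wins on time even after the edit closes the quality gap.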

Scenario C: AI loses on time, wins on quality

Action: Use AI for high-stakes outputs where quality justifies time investment.

Scenario D: AI loses on both

Action: Try different prompts, a different model, or accept that this task isn't AI-ready yet.

"Our first fair trial showed AI summaries were 30% faster but scored 15% lower. We almost abandoned it—but the edit time was only 5 minutes. Net win."
PM Lead

Common Fair Trial Mistakes

  • Uncontrolled complexity: Comparing AI on easy tasks vs. manual on hard tasks
  • Evaluator bias: Evaluator knows which is AI and scores accordingly
  • Missing workload data: Claiming "faster" without measuring cognitive load
  • Single trial: One comparison is anecdote; 5+ pairs is evidence
  • Rubric drift: Standards change between manual and AI scoring

Citations

  • Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin.
  • Hart, S. G. (2006). "NASA-Task Load Index (NASA-TLX); 20 Years Later." Proceedings of the Human Factors and Ergonomics Society Annual Meeting.
  • Google Research. (2024). "Rigorous A/B Testing for AI-Assisted Workflows." Technical Report.

Apply this now

Practice prompt

Run a 5-pair fair trial on your most common AI-assisted task this week.

Try this now

Identify one task you currently use AI for—that's your first trial candidate.

Common pitfall

Running unblinded trials—if you know which is AI, you'll score it differently.

Key takeaways

  • Control variables: same task, same evaluator, same rubric, randomized order
  • Minimum 5 matched pairs—single comparisons are anecdotes, not evidence
  • Always capture workload (TLX) alongside time—speed without sustainability is false savings

Next Steps

Ready to measure your AI impact? Start with a quick demo to see your Overestimation Δ and cognitive load metrics.
