I test, compare, and score AI systems with structured precision — turning messy model outputs into clear, actionable quality signals that make AI products smarter and safer.
I’m Olatunji Habeeblahi O., an AI Model Evaluation Engineer and Automation Specialist based in Lagos, Nigeria. My path into AI didn’t begin in a lab — it started with language.
I began my career as a Technical Writer, learning how to translate complex systems into clear, structured communication. That discipline of precision, knowing exactly what something does and why it matters, turned out to be the perfect foundation for everything that followed.
From writing, I moved into AI Automation, building intelligent, self-running workflows in n8n, Zapier, and Make.com. I was designing systems that didn't just execute tasks: they made decisions, routed content, and scaled without adding manual overhead. That work deepened my curiosity about the intelligence behind the tools themselves.
That curiosity led me into Prompt Engineering — studying how language shapes model behaviour, what makes a prompt fail, and how small wording changes can produce entirely different outputs. I began testing systematically, not just intuitively.
Today, I work as a dedicated AI Evaluation Specialist. I test AI systems against structured rubrics, run head-to-head agent comparisons, identify bias and failure modes, and produce the kind of actionable feedback that engineering teams can actually use. I am adept at independent remote work, rapid guideline adaptation, and delivering insights that improve AI system reliability and user experience.
Evaluated AI-driven automation workflows for logical accuracy, consistency, and reliability. Tested system outputs against defined requirements and edge cases. Analyzed AI-generated responses to identify errors, inconsistencies, and areas for improvement. Produced structured, actionable feedback to improve model behaviour and workflow quality. Designed and executed QA test cases and documented evaluation criteria and outcomes.
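For a flavour of that work, here is a minimal Python sketch of the kind of structured test case I write. The function, requirements, and sample response are all hypothetical, not taken from a client project.

```python
# A minimal sketch of an output-quality test case (all names hypothetical).
# It checks one AI-generated response against defined requirements and an
# edge case, then records a structured pass/fail outcome.

def evaluate_response(response: str, required_terms: list[str], max_words: int) -> dict:
    """Score a single model response against simple, explicit requirements."""
    words = response.split()
    checks = {
        "covers_required_terms": all(t.lower() in response.lower() for t in required_terms),
        "within_length_limit": len(words) <= max_words,
        "non_empty": bool(words),  # edge case: the model returned nothing
    }
    return {"passed": all(checks.values()), "checks": checks}

if __name__ == "__main__":
    sample = "The workflow routes invoices to finance and flags duplicates."
    result = evaluate_response(sample, required_terms=["invoices", "duplicates"], max_words=40)
    print(result)  # {'passed': True, 'checks': {...}}
```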
Evaluated operational processes and decision outcomes to drive efficiency improvements. Analyzed market and client data to guide strategic business decisions. Managed documentation, reporting, and stakeholder communication. Applied structured judgment in high-ambiguity scenarios — a discipline that translates directly into AI evaluation work.
Structured head-to-head evaluation of GPT-4, Claude, and Gemini across reasoning, accuracy, and consistency. View Case Study →
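To illustrate the method rather than the actual study data, here is a minimal sketch of how judged head-to-head comparisons roll up into per-dimension win rates; the records and model names are placeholders.

```python
# A minimal sketch of head-to-head tallying (data and names are illustrative).
# Each record is one judged comparison on one dimension; the output is a
# simple win rate per model per dimension.

from collections import Counter

judgments = [
    {"dimension": "reasoning", "winner": "model_a"},
    {"dimension": "reasoning", "winner": "model_b"},
    {"dimension": "accuracy", "winner": "model_a"},
    {"dimension": "consistency", "winner": "model_a"},
]

wins = Counter((j["dimension"], j["winner"]) for j in judgments)
totals = Counter(j["dimension"] for j in judgments)

for (dimension, model), count in sorted(wins.items()):
    print(f"{dimension}: {model} win rate = {count / totals[dimension]:.0%}")
```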
Designed a multi-dimensional rubric to evaluate LLM output quality, adapted from industry standards and original design. View Case Study →
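A minimal sketch of the idea, with illustrative dimensions and weights rather than the actual rubric:

```python
# A minimal sketch of a weighted, multi-dimensional rubric. Dimensions and
# weights here are illustrative, not the case-study rubric itself.

RUBRIC = {
    "factual_accuracy": 0.35,
    "instruction_following": 0.25,
    "coherence": 0.20,
    "tone_and_style": 0.20,
}

def rubric_score(ratings: dict[str, int]) -> float:
    """Combine 1-5 per-dimension ratings into one weighted score."""
    assert set(ratings) == set(RUBRIC), "every dimension must be rated"
    return sum(RUBRIC[d] * ratings[d] for d in RUBRIC)

print(rubric_score({"factual_accuracy": 5, "instruction_following": 4,
                    "coherence": 4, "tone_and_style": 3}))  # 4.15
```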
End-to-end n8n pipeline: RSS ingestion → AI scoring → quality filtering → rewriting → auto-publishing to X and LinkedIn. View Case Study →
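The real pipeline runs as n8n nodes; purely to show the flow of stages, here is a stubbed Python sketch with no network calls. All function names and the quality threshold are hypothetical.

```python
# A minimal sketch of the pipeline's stages as plain functions (stubbed;
# the production build runs as n8n nodes). Names are illustrative.

def fetch_items() -> list[dict]:   # RSS ingestion (stubbed)
    return [{"title": "AI eval news", "body": "sample body"}]

def score(item: dict) -> float:    # AI scoring (stubbed heuristic)
    return 0.9 if "AI" in item["title"] else 0.2

def rewrite(item: dict) -> str:    # rewriting (stubbed)
    return f"Today in AI: {item['title']}"

def publish(post: str) -> None:    # auto-publishing (stubbed)
    print("published:", post)

for item in fetch_items():
    if score(item) >= 0.7:         # quality filtering
        publish(rewrite(item))
```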
A degree built on empirical observation, data analysis, and systematic documentation: skills that transfer directly into structured AI evaluation work.
Available for evaluation contracts, AI QA consulting, and freelance automation projects. Remote-first, professional, and ready to deliver structured results from day one.
Message on WhatsApp