Human Evaluation & AI Assessment
Charity “Kiki” Kuchin

Evaluation Background

My foundation in human evaluation of AI systems comes from leading Google's Common Review Team, a global content review operation I built and ran with three full-time staff, 80 contractors, and a $700K annual budget across teams in the United States, India, and Brazil.

The work was fundamentally a calibration problem. When I inherited the operation, reviewer error rates were at 11%. In a single quarter I redesigned the quality standards, developed a golden-set methodology requiring independent agreement from multiple reviewers before any content entered model training data, and personally led rater calibration sessions in Hyderabad to align judgment across teams working in different cultural and linguistic contexts. Error rates dropped to 1%. I received a Google Platinum Award for Measurable Improvement to Review Quality.

What that work taught me was that the hardest evaluation problems aren't the clear cases. They're the ones where reasonable people disagree, where cultural context shifts the right answer, where a rubric written at 9am breaks on an edge case by noon. Building systems that hold up across raters at scale requires understanding not just what good looks like but why raters diverge, and designing the methodology to reduce that divergence without flattening genuine judgment.

How I Think About Evaluation

Good evaluation is a calibration practice, not a labeling task. The question isn't just whether a model response is correct or helpful; it's whether the criteria I'm applying would produce consistent judgments from another qualified rater working independently. When I encounter a rubric, I immediately look for the places it will break: the cases it didn't anticipate, the spots where two careful people will read the instruction differently and both feel right.

I also think carefully about the distinction between tasks where there's a defensible ground truth and tasks that are genuinely subjective. Conflating those two in evaluation design produces noisy data. The work I value most is in the middle: responses that aren't factually wrong but that vary in quality in ways that matter, and where the evaluator's job is to apply consistent, articulable judgment rather than check against an answer key.

My practitioner experience with current large language models is extensive and hands-on. I work with these systems daily, which means I come to evaluation tasks with genuine familiarity with how they behave, where they tend to fail, and what distinguishes a response that sounds confident from one that's actually sound.

Current Practice

I am currently active as an AI data annotator under NDA.

Certifications

Stanford HAI: Generative AI: Technology, Business, and Society Program https://digitalcredential.stanford.edu/check/D6677745472AB2A9F57814DAFA3E82C776AD7F9878021EF0AB31299CD4DEB735UkYvNHVhcnIzdWMveHR1SDU2cFlqV3VJZ2hnOUhnQzBMVzNPNzhyS0JVdzY0QjJ3

Stanford Engineering: Change Management: Reskilling in the Age of Analytics and AI https://digitalcredential.stanford.edu/check/96CE00A801DE98DF8D09C5C747876A36A5CDED20833BB0553C4D204EDA877A3BQVhWOC9QbmlSdnhZbGwvTVNoUmdneHFZeHRQbHI0ZHVPb0k0VDQ3TDlMbC9URzRC