
SWE Expert

Mercor
Pay: $70–$150/hr (Hourly)
Location: Worldwide (Remote)
Posted: Mar 12, 2026
Languages: English

Description

Role Overview

Mercor is seeking SWE Experts to support the design of evaluation-ready workflows for advanced AI systems. This engagement focuses on translating ambiguous requirements into structured, repeatable artifacts that can be tested automatically. You’ll produce clearly specified deliverables (documentation + scripts) that enable consistent assessment of agent performance across scenarios. Work is contract-based, outcome-oriented, and optimized for reproducibility and clear acceptance criteria.

Key Responsibilities

  • Convert high-level objectives into tightly scoped, testable deliverables with clear inputs/outputs and measurable success criteria.

  • Create structured documentation that defines expected behavior, constraints, and edge cases in a way other evaluators can reuse.

  • Build lightweight automation scripts to support evaluation flows (e.g., generating required artifacts, validating outputs, enforcing format rules).

  • Write deterministic Python verifier scripts that check completion via final state or output validation (files, directories, content assertions); see the sketch after this list.

  • Design prompts/tasks that reliably elicit the target workflow behavior while avoiding leakage of internal instructions or implementation details.

  • Implement robust error handling and actionable failure messages in verification tooling.

  • Develop plausible but ineffective “baseline” or “distractor” approaches to confirm the evaluation discriminates (i.e., only the intended approach passes).

  • Maintain clean artifact hygiene: versionable structure, consistent naming, minimal ambiguity, and reproducible execution.
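To make the verifier expectations above concrete, here is a minimal sketch of a deterministic final-state check in Python. The directory, file name, and content rule are illustrative assumptions, not an actual task spec.

```python
#!/usr/bin/env python3
"""Minimal deterministic verifier sketch (illustrative only).

Checks final state: an expected directory exists, a required file is
present, and its content satisfies a simple assertion. All paths and
rules here are hypothetical examples, not a real task specification.
"""
import sys
from pathlib import Path

OUTPUT_DIR = Path("output")              # hypothetical artifact location
REQUIRED_FILE = OUTPUT_DIR / "report.md" # hypothetical required deliverable

def fail(message: str) -> None:
    # Actionable failure message: state what was expected and what was found.
    print(f"FAIL: {message}", file=sys.stderr)
    sys.exit(1)

def main() -> None:
    if not OUTPUT_DIR.is_dir():
        fail(f"expected directory '{OUTPUT_DIR}' does not exist")
    if not REQUIRED_FILE.is_file():
        fail(f"required file '{REQUIRED_FILE}' is missing")

    content = REQUIRED_FILE.read_text(encoding="utf-8")
    if "## Results" not in content:  # content assertion (example rule)
        fail(f"'{REQUIRED_FILE}' is missing the required '## Results' section")

    print("PASS: all final-state checks satisfied")
    sys.exit(0)

if __name__ == "__main__":
    main()
```

The properties that matter here are determinism (no network calls, randomness, or time dependence) and failure messages that tell the evaluator exactly what was expected versus what was found.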

Ideal Qualifications

  • Strong Python skills (file system operations, parsing, validation, test-style assertions, deterministic execution).

  • Experience with evaluation harnesses, automated grading, or QA-style verification (unit/integration test mindset).

  • Familiarity with prompt design and LLM evaluation methodologies (closed-ended tasks, leakage avoidance, reliability testing).

  • Comfort with structured specs and documentation conventions (Markdown, YAML frontmatter patterns, well-scoped requirements).

  • Working knowledge of common developer tooling: Git, CLI workflows, virtual environments, dependency management.

  • Bonus: embeddings/similarity concepts (e.g., cosine similarity) for “looks relevant but fails” negative-control design; see the sketch after this list.

  • Ability to communicate clearly and keep scope controlled without relying on domain-specific context.
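As a rough illustration of that negative-control idea, the sketch below scores a hypothetical distractor against a target using cosine similarity. The vectors and acceptance threshold are made up for illustration; a real evaluation would define its own embeddings and cutoff.

```python
# Illustrative sketch of a cosine-similarity negative control:
# a distractor that "looks relevant but fails" for a principled reason.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Assumes nonzero vectors, as in the toy example below.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the distractor is close to the target topic
# (high similarity), but it falls below the acceptance threshold, so it
# reads as plausible while still failing.
target     = [0.9, 0.1, 0.3]
distractor = [0.8, 0.3, 0.1]

ACCEPT_THRESHOLD = 0.98  # made-up cutoff for illustration

score = cosine_similarity(target, distractor)
print(f"similarity={score:.3f}  accepted={score >= ACCEPT_THRESHOLD}")
# -> similarity≈0.951, accepted=False: credible-looking, still rejected.
```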

More About the Opportunity

  • Deliverables are primarily documentation + scripts intended to support automated evaluation and consistent replay.

  • Emphasis on: determinism, reproducibility, closed-ended outcomes, and strong verifier reliability.

  • Tasks and validators should be resilient to superficial shortcuts and confirm the intended workflow is actually used.

  • Work can include designing negative controls (distractors) that appear credible while failing for principled reasons.

  • Time-sensitive elements should be explicitly date-bounded where applicable, as sketched below.
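One simple way to make a check explicitly date-bounded in Python is shown below; the validity window is a hypothetical example, not a real task rule.

```python
# Illustrative sketch of an explicitly date-bounded check. The window
# below is a made-up example; a real task would pin its own dates.
from datetime import date

VALID_FROM = date(2026, 3, 12)   # hypothetical start of validity
VALID_UNTIL = date(2026, 6, 12)  # hypothetical end of validity

def task_is_current(today=None):
    """Return True only inside the explicit validity window."""
    today = today or date.today()
    return VALID_FROM <= today <= VALID_UNTIL

if __name__ == "__main__":
    if task_is_current():
        print("OK: within the validity window")
    else:
        print(f"SKIP: task only valid {VALID_FROM} through {VALID_UNTIL}")
```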

Interested in this position?

Apply directly on the company's website