Careers at ClaimHawk
Now Hiring

Build the Future of Digital Labor

Join our mission to create AI agents that can do real work using a computer mouse and keyboard. We're revolutionizing the medical claims industry with cutting-edge agent tooling and vision-language models.

Workflow Datasets · VLMs (QwenVL) · Training Flywheel
MEN WANTED FOR UNCERTAIN JOURNEY
BELOW MARKET WAGES, LONG HOURS
BUILDING TRAINING PIPELINES AND
DATASETS. DEFEAT IS NOT AN OPTION.
Equity and impact from day one,
fame and money on success.

Founding ML Training Specialist

$40k/year + 10% equity + studio in Austin + Uber Eats budget · Austin, TX


This isn't a job; this is a Mission. Mercenaries need not apply.

This is a startup founder's dream job - get in just as things have gotten moving, but before revenue goes from a pledge to a reality. We have a pilot program of 10 dental practices (worth $6,000/mo in revenue) and are wrapping up the MVP. Employee #4 carries a lot of weight: you will have a lot of responsibility, and you will drive the future value of the company with your own hands.

We're looking for someone who believes in creating digital labor that transforms industries. If you're chasing a paycheck, this isn't the role for you. But if you want to build something that matters and are willing to do whatever it takes to make it happen - this is your moment.

Imagine working in the garage at Apple Computer back in the late '70s. You are employee #4. Missionaries will be rewarded far more in the long term than a mercenary could dream of now.

Today the ML Training Specialist, tomorrow the VP of ML Research.

About the Role

We're looking for an experienced ML Training Specialist to help us build and scale our digital labor agents. You'll work on training vision-language models, implementing fine-tuning pipelines, creating scalable inference systems, and building training flywheels that continuously improve our agent capabilities. This is a foundational role where you'll have significant impact on our technology and product direction.

Compensation & Company Stage

What We Offer

$40,000/year salary (starting, scales with investment/revenue)

10% equity in the next funding round (4-year vest, 1-year cliff)

Studio apartment in Austin

Monthly Uber Eats budget

The Reality

We're a seed-stage, pre-revenue company running a pilot program with 10 dental practices, approximately 6 months from revenue. The $40k salary meets FLSA exempt requirements and will scale to market rate as we secure investment and generate revenue. ML engineer salaries will be the first to scale.

Why 10% Equity is Massive

• Investors at this stage typically get 1-4%

• Board members get ~1%

• Pilot partners get 0.5%

Founding ML engineer gets 10%

Location & Requirements

📍 Austin, TX Required

Work from your studio apartment (which we provide). The CTO is down the hall. Immediate collaboration, sharp feedback cycles. This is the distributed garage - everyone has their own workspace, close enough to solve problems together.

🇺🇸 US Citizens Only

Equity component requires US citizenship.

What We're Looking For

Distributed Training (Required)

Hands-on experience with distributed training for models of 7B+ parameters using DeepSpeed, FSDP, or similar frameworks. We use Modal.com for our training infrastructure.
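
To be concrete about the level we mean, here is a minimal sketch of wrapping a 7B-class model with PyTorch FSDP. It is illustrative only: the checkpoint name is a placeholder, and a real run would also set an auto_wrap_policy, optimizer, and dataloader (this is not our production setup).

# Minimal FSDP sketch for a 7B-class model; illustrative only.
# Launch with: torchrun --nproc_per_node=8 train_fsdp.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

def main():
    dist.init_process_group("nccl")  # one process per GPU
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = AutoModelForCausalLM.from_pretrained("your-7b-checkpoint")  # placeholder name
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
        device_id=torch.cuda.current_device(),
    )
    # ... optimizer, DistributedSampler-backed dataloader, and training loop go here ...

if __name__ == "__main__":
    main()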

Fine-Tuning Expertise

Proven track record of fine-tuning large models for specific tasks and domains

Scalable Inference

Experience building and optimizing inference systems that scale efficiently
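
As one example of what "scale efficiently" can look like in practice (vLLM is shown here purely as an illustration, not a statement of our stack), batched offline inference where the engine handles continuous batching:

# Illustrative only: batched inference with vLLM; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="your-7b-checkpoint")  # placeholder checkpoint
params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = [
    "Extract the patient ID from this claim form text: ...",
    "Which button should the agent click next? ...",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)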

Training Flywheels

Ability to implement continuous training loops that improve model performance over time
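
A minimal sketch of the loop shape we have in mind; every helper name below is hypothetical and shown only to illustrate the structure:

# Hypothetical training-flywheel skeleton; the helpers are stubs, not an existing codebase.
from dataclasses import dataclass

@dataclass
class Example:
    screenshot_path: str
    thought: str
    action: str

def collect_agent_interactions() -> list[Example]:
    ...  # pull logged (screen, thought, action) triples from production agents

def filter_and_label(examples: list[Example]) -> list[Example]:
    ...  # drop failures and noise; optionally route samples to human review

def fine_tune(dataset: list[Example]) -> str:
    ...  # launch a fine-tuning run and return the new checkpoint path

def evaluate(checkpoint: str) -> float:
    ...  # run the eval suite and return a score

def flywheel_iteration(best_score: float) -> float:
    dataset = filter_and_label(collect_agent_interactions())
    checkpoint = fine_tune(dataset)
    score = evaluate(checkpoint)
    if score > best_score:
        ...  # promote the checkpoint; the better agent then collects better data
        return score
    return best_score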

Dataset Creation & Testing

Strong ability to create, curate, and test datasets for training and evaluating large models

Bonus Points

  • Experience with vision language models (VLMs) and multimodal AI systems
  • Experience with agentic AI systems or computer-using agents
  • Knowledge of RLHF, RLAIF, or other alignment techniques
  • Contributions to open-source ML projects or research publications
  • Experience with model quantization and optimization techniques

What You'll Work On

  • Training and fine-tuning vision-language models for dental claims processing workflows
  • Building scalable inference infrastructure for production agent deployments
  • Implementing training flywheels that leverage real-world agent interactions to improve models
  • Collaborating with product team to integrate ML capabilities into agent tooling
  • Researching and implementing cutting-edge techniques in agentic AI and LLM training

Why ClaimHawk?

Impact

Work on technology that directly reduces administrative burden in healthcare and creates measurable value for dental practices

Cutting-Edge Work

Push the boundaries of what's possible with agentic AI systems that interact with real-world software

Ownership

Take ownership of critical ML infrastructure and strategy in a foundational role

Growth

Join an early-stage startup with significant upside potential and room to grow into leadership

How to Apply

No resumes. Answer the questions below in an email. We use an automated system to surface the top 5 responses based on depth of experience and technical accuracy.

⚠️ DO NOT USE AN LLM FOR THIS

Answers must be from your mind alone. We check for LLM-generated text and will filter you out. If you need an LLM to answer these questions, you're not who we're looking for.

1. Debugging Failed Evals

You are building a dataset based on some customer screens. The data includes both an action and a thought. You have an eval script that runs against the fine-tuned model, and it consistently fails. What experiments would you run to try to find the problem?

2. GPU Memory Management

You are fine-tuning a model, and the dataset is growing to the point where a single GPU can't fit the batches in memory. How do you solve this?

3. Surprising Ablation

Tell us about an ablation you performed that surprised you.

4. Experiment Design Process

Tell us your thought process when building your hypothesis, data, and experiment, and how you determine if your experiment made positive gains.

5. Wrong Hypothesis

Tell us about a time you were SURE you knew why an eval was failing, and you turned out to be wrong. What did you learn from the experience?

6. Dataset Quality vs Quantity

You need to build a dataset for desktop automation. You have two options: 500 meticulously hand-crafted examples, or 10,000 synthetically generated examples with some noise. Which do you choose and why? How do you validate your choice was correct?

7. Synthetic Data Generation

Describe your experience generating synthetic training data. What methods have you used? What are the failure modes you watch for? How do you validate synthetic data quality before training?

8. Eval Design

You're building an eval for an agent that needs to navigate complex UIs. What metrics do you track? How do you know if a 5% improvement in your metric actually means the agent is better in production?

9. Training Run Goes Wrong

Tell us about your worst training run. What went wrong, how much did it cost you (time/money), and what safeguards did you put in place afterward?

10. Dataset Size Intuition

You're fine-tuning a 7B model to click the correct button on a dental insurance website. How many examples do you need in your training set? Walk us through your reasoning.

Bonus: HIPAA and ML Training

How does HIPAA compliance impact the traditional ML training process when working with healthcare data? What constraints does it create, and how would you work around them?

Email your answers to:

mljobs@claimhawk.app

Subject: ML Training Specialist

Our Culture

We value truth and curiosity most of all

Truth & Curiosity

Seek objective reality over comforting narratives. Ask why relentlessly and dig deeper until you understand how things really work.

Move Fast

Bias toward action and rapid iteration. Ship early, learn quickly, and improve continuously.

Customer Focus

Obsess over customer success and feedback. Build products that create real, measurable value.

Technical Excellence

Push the boundaries of what's possible while maintaining high code quality and system reliability.