
A talk on computer-assisted fieldwork

AI as a scout,
not a judge.

Building a Southern Kurdish dialect tree with PARSE.

The classification puzzle

Sixty years.
No consensus.


  • 11 speakers, 6 varieties — Faili, Kalhori, Khanaqini, Qasri, Mandali, Sahana
  • Iraq–Iran borderlands — a shatter zone of tribal, political, and contact pressure
  • MacKenzie (1961), Fattah (2000), Belelli (2019), Mohammadirad — no agreement
  • Isoglosses cross-cut each other; no clean tree by traditional methods
Map of Southern Kurdish speaker origins across the Iraq-Iran borderlands
Speaker origins — Iraq–Iran borderlands

Why a "press the button" AI does not work here

Three problems. One job.


A guiding principle

The machine narrows the search.
The linguist makes the judgment.

I.

AI scouts

Locates likely regions in long recordings. Offers a candidate transcription.

II.

The linguist judges

Every accept, edit, or reject is a decision the linguist makes.

III.

Saved as drafts

Models write candidates, not commitments. Nothing enters the dataset until a person accepts it.

IV.

The trail stays visible

Computed result and human correction stored as separate layers. Auditable forever.
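Principles III and IV describe a data structure: a machine layer that is never overwritten, and a human layer that changes only through explicit actions. A minimal sketch follows, assuming one record per concept × speaker cell. The class name `Cell`, its fields, and the speaker label are hypothetical and are not PARSE's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Cell:
    """One concept x speaker cell, with machine and human layers kept apart.

    Hypothetical schema for illustration; not PARSE's actual data model.
    """
    concept: str
    speaker: str
    computed: str                   # model candidate; never overwritten
    human: Optional[str] = None     # set only by an explicit human action
    log: List[Tuple[str, str]] = field(default_factory=list)

    def accept(self) -> None:
        # Acceptance copies the candidate into the human layer and logs it.
        self.human = self.computed
        self.log.append(("accept", self.computed))

    def edit(self, form: str) -> None:
        # A correction writes to the human layer only.
        self.human = form
        self.log.append(("edit", form))

    def reject(self) -> None:
        # Rejection clears the human layer but keeps the candidate for audit.
        self.human = None
        self.log.append(("reject", self.computed))

    @property
    def disagrees(self) -> bool:
        # True when the linguist's form differs from what the model proposed.
        return self.human is not None and self.human != self.computed
```

For example, `Cell("hand", "speaker_01", computed="das")` followed by `.edit("dast")` leaves `computed == "das"` and `human == "dast"`, so the disagreement stays visible. A cell whose `human` is still `None` is simply a draft.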

Where AI sits in the pipeline

Four stations. One chain of human review.


Step 1

Read

Three models transcribe 2–5 h of audio per speaker — orthographic, Kurdish-script, and phoneme-level IPA.

AI
Step 2

Review

Linguist checks quality, picks the best repetition, splits multi-word responses, re-runs noisy outputs.

Human
Step 3

Locate

Use verified anchors plus the transcript layers to locate the 85 target words among 530+ elicited words.

AI
Step 4

Group cognates

LexStat clusters forms using Levenshtein distance + sound correspondences learned from the data.

Computational
The verified result of all four stations feeds Bayesian inference → a probability-weighted distribution over candidate trees.
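Step 4 turns one concept's forms into cognate groups by distance plus a cut-off. The sketch below is a simplified stand-in, not LexStat. It uses plain, unweighted edit distance, so it cannot learn that a regular sound change is cheap, which is the signal the slide says LexStat weights in. The clustering is average-linkage (UPGMA-style) with a threshold; the function names and the threshold value are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List


def norm_edit_distance(a: str, b: str) -> float:
    """Unweighted Levenshtein distance divided by the longer string's length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1] / max(len(a), len(b), 1)


def cognate_sets(forms: Dict[str, str], threshold: float) -> List[List[str]]:
    """Average-linkage (UPGMA-style) flat clustering of one concept's forms.

    forms maps speaker -> IPA string. Clusters keep merging while the mean
    pairwise distance between the two closest clusters is below threshold.
    """
    clusters = [[spk] for spk in forms]

    def linkage(x: List[str], y: List[str]) -> float:
        ds = [norm_edit_distance(forms[p], forms[q]) for p in x for q in y]
        return sum(ds) / len(ds)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) >= threshold:
            break  # closest pair is still too far apart: stop merging
        clusters[i] += clusters[j]
        del clusters[j]  # safe: combinations yields i < j
    return clusters
```

The threshold is the lever: lower it and near-matches stay apart; raise it and look-alikes (including shared borrowings) get merged. Either failure is why the output is reviewed by hand.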

Job 1 · Reading the audio

Three reads
of the same audio.


  • Speech-to-text. Whisper produces orthographic words with word-level timestamps.
  • Kurdish-tuned. A Whisper variant fine-tuned for Southern Kurdish returns Kurdish-script transcriptions.
  • IPA, independently. A phoneme-level model reads the same audio as IPA — not as words.
  • Three layers, one waveform. Locating the 85 target words among 530+ means joining these layers with cross-speaker reference data.
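Because all three reads share one time axis, joining them reduces to interval overlap. A minimal sketch, assuming each layer is a list of `(start, end, label)` spans in seconds; the tier names and the `min_overlap` cut-off are hypothetical.

```python
from typing import Dict, List, Tuple

# (start_seconds, end_seconds, label); tier names below are hypothetical.
Interval = Tuple[float, float, str]


def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length in seconds of the intersection of [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def read_window(tiers: Dict[str, List[Interval]], start: float, end: float,
                min_overlap: float = 0.05) -> Dict[str, List[str]]:
    """Return what each transcription layer heard inside one time window."""
    return {
        name: [label for s, e, label in spans
               if overlap(start, end, s, e) >= min_overlap]
        for name, spans in tiers.items()
    }
```

Word-level and phoneme-level tiers can be queried the same way; the IPA tier simply returns several short labels for one word window.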
PARSE Annotate workstation showing waveform with parallel transcription tiers and ranked candidate regions
Annotate — three layers, one waveform

Job 2 · Review and decide

The work AI
can't do.


  • Check transcription quality. Is what the model heard plausible given the audio?
  • Pick the repetition. Each lexeme was elicited 2–4 times; choose which to use.
  • Split multi-word responses. Speakers sometimes gave synonyms or false starts; each form belongs in its own slot.
  • Re-run noisy IPA. If the model output came back garbled, run it again with different settings.
  • Spectrogram check. For ambiguous IPA, verify against formants and voice-onset time.
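"Came back garbled" can be partly pre-screened. One known heuristic comes from openai-whisper, which re-decodes a segment when its text's compression ratio exceeds 2.4, because a decoder stuck in a loop produces highly compressible output. The sketch below applies the same test to an IPA string. Using it for IPA, and the default threshold there, are assumptions; the source does not say how PARSE decides to re-run. It only catches repetition, not phonetically wrong output, so the listening check still applies.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 byte length over zlib-compressed length.

    Repetitive output (e.g. a decoder stuck in a loop) compresses very well,
    so it scores high; short or varied strings score low.
    """
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def flag_for_rerun(ipa: str, max_ratio: float = 2.4) -> bool:
    """Hypothetical pre-filter: flag empty or looping IPA output for a re-run."""
    if not ipa.strip():
        return True
    return compression_ratio(ipa) > max_ratio
```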
PARSE transcription lanes showing parallel IPA and orthographic tiers for one speaker
Parallel IPA and orthography lanes for review

Job 3 · Anchoring

The dataset
becomes its own map.


  • Verified words form a map of the elicitation
  • Missing words can be predicted to fall in a narrow time window between the verified words elicited just before and after them
  • Cross-speaker matching: seven verified "hand"s make the eighth easier
  • Gets faster as the dataset grows — not magic, just more reference points
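The "narrow time window" can be made concrete. A minimal sketch, assuming items were elicited in a fixed order and that each verified anchor records an item's position in that order and its time in the recording. The window comes from linear interpolation between the nearest verified items on either side; the function name and the `pad` value are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, Tuple


def predict_window(anchors: Dict[int, float], item: int,
                   pad: float = 2.0) -> Tuple[float, float]:
    """Guess where elicitation item `item` falls in a recording.

    anchors maps item position in the (assumed fixed) elicitation order to a
    verified time in seconds. The guess interpolates linearly between the
    nearest verified items on either side, pads it by `pad` seconds, and
    clips the result so it never extends past those two anchors.
    """
    if item in anchors:
        return anchors[item], anchors[item]
    keys = sorted(anchors)
    pos = bisect_left(keys, item)
    if pos == 0 or pos == len(keys):
        raise ValueError("item must lie between two verified anchors")
    lo, hi = keys[pos - 1], keys[pos]
    t = anchors[lo] + (anchors[hi] - anchors[lo]) * (item - lo) / (hi - lo)
    return max(anchors[lo], t - pad), min(anchors[hi], t + pad)
```

Each newly verified word becomes another anchor, so the gaps shrink and the windows tighten. That is the "more reference points" effect in code form.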
PARSE pipeline view showing the full processing chain from raw audio to ranked candidates
Raw audio → ranked candidates, ready for human review

Job 4 · Adjudicating

Cognate groups —
and where they fail.


  • LexStat uses Levenshtein distance on IPA strings, weighted by sound-correspondence patterns learned from the data itself
  • Got this wrong: grouped Arabic waqt ("time") as cognate — a shared borrowing, not shared inheritance
  • Missed this: failed to link dast and das ("hand") — final-cluster deletion looked like a mismatch
  • Both fixed by hand; corrections preserved alongside the algorithm's output
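Keeping both versions makes disagreements checkable. A minimal sketch of one way to report them, assuming each version is a map from form (here hypothetical speaker labels for one concept) to a cognate-set id, and comparing the two pair by pair. A miss like dast/das would surface under `joined_by_human`. The function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def linked_pairs(cogids: Dict[str, int]) -> Set[Pair]:
    """Every unordered pair of forms assigned to the same cognate set."""
    return {(a, b) if a < b else (b, a)
            for a, b in combinations(cogids, 2)
            if cogids[a] == cogids[b]}


def audit(computed: Dict[str, int], human: Dict[str, int]) -> Dict[str, Set[Pair]]:
    """Compare the algorithm's cognate sets with the linguist's, pair by pair."""
    algo, expert = linked_pairs(computed), linked_pairs(human)
    return {
        "split_by_human": algo - expert,   # linked by the algorithm, undone by hand
        "joined_by_human": expert - algo,  # missed by the algorithm, added by hand
    }
```

Comparing pairs rather than set ids sidesteps the fact that the two versions may number their sets differently.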
PARSE Compare mode showing concept-by-speaker matrix with cognate adjudication controls
Compare — concept × speaker matrix

PARSE — keeping the human in charge

Two modes. One dataset.

PARSE Annotate mode — waveform, IPA and orthography tiers, ranked candidates
Annotate · close listening · one speaker
PARSE Compare mode — concept by speaker matrix with cognate adjudication
Compare · pattern recognition · all speakers

Original audio never cut · Algorithm and human stored separately · LingPy + NEXUS export
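The slide names NEXUS export but not its coding. Binary presence/absence coding is a common input for Bayesian lexical phylogenetics, so a minimal sketch of that coding follows. Each (concept, cognate set) pair becomes one 0/1 character, and a speaker with no recorded form for a concept gets `?` rather than `0`. PARSE's actual exporter may differ. Taxon labels are assumed to be single tokens without spaces, since NEXUS would otherwise require quoting.

```python
from typing import Dict, List, Optional


def to_nexus(cognates: Dict[str, Dict[str, Optional[int]]],
             taxa: List[str]) -> str:
    """Binary-code cognate sets and render a NEXUS DATA block.

    cognates[concept][taxon] is a cognate-set id, or None if no form was recorded.
    """
    columns = []
    for concept in sorted(cognates):
        row = cognates[concept]
        for cogid in sorted({c for c in row.values() if c is not None}):
            # '?' = missing data, '1' = form in this set, '0' = form in another set
            columns.append([
                "?" if row.get(t) is None else ("1" if row[t] == cogid else "0")
                for t in taxa
            ])
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(taxa)} NCHAR={len(columns)};",
        '  FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";',
        "  MATRIX",
    ]
    for i, taxon in enumerate(taxa):
        lines.append(f"    {taxon} " + "".join(col[i] for col in columns))
    lines += ["  ;", "END;"]
    return "\n".join(lines)
```

Coding a gap as `?` instead of `0` matters: `0` would assert that a speaker lacks the cognate, while `?` tells the inference engine the cell is simply unknown.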

Beyond the four jobs

AI all the way down.


Step 4, in plain English

Like a confidence interval —
but for trees.

I.

Old way

Pick one "best" tree and report it as the answer.

II.

New way

Sample a probability-weighted ensemble of plausible trees.

III.

Each grouping gets a number

"Kalhori–Khanaqini together: 87% of sampled trees." A probability, not a yes/no.

IV.

Honest about uncertainty

When the data is ambiguous, the method says so. Low support is information, not failure.

Why this method, for this data

Built for messy,
contested data.


From the engine, visually

What the output
actually looks like.

Rooted Bayesian phylogenetic tree of Kurdish, Gorani, and Zazaki varieties with posterior probabilities at each internal node
Rooted tree — posterior support at each node
Unrooted radial display of the same Kurdish, Gorani, and Zazaki variety relationships
Unrooted radial display — same data

Numbers like 0.9999, 0.5675 = posterior probability — the share of sampled trees containing that grouping.
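That definition can be computed directly. A minimal sketch, assuming rooted sampled trees written as nested tuples of taxon names, with burn-in already removed. MCMC samples are equally weighted, so the share of samples containing a grouping is its posterior estimate. The source does not say which engine produced the trees; dedicated tools such as BEAST's TreeAnnotator summarize real output. The taxon letters and function names here are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Set, Tuple, Union

# A tree is a leaf name or a tuple of subtrees, e.g. (("A", "B"), "C").
Tree = Union[str, Tuple["Tree", ...]]
Clade = FrozenSet[str]


def collect_clades(tree: Tree, out: Set[Clade]) -> Clade:
    """Add every grouping in `tree` to `out`; return the tree's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(sub, out) for sub in tree))
    out.add(leaves)
    return leaves


def clade_support(samples: Iterable[Tree]) -> Dict[Clade, float]:
    """Share of sampled trees that contain each grouping."""
    counts: Counter = Counter()
    n = 0
    for tree in samples:
        found: Set[Clade] = set()
        collect_clades(tree, found)
        counts.update(found)  # each clade counted at most once per tree
        n += 1
    return {clade: k / n for clade, k in counts.items()}
```

With three samples of `(("A", "B"), "C")` and one of `("A", ("B", "C"))`, the A+B grouping gets 0.75 and B+C gets 0.25: a number per grouping, not a yes/no.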

What this delivers — and what it doesn't claim

The honest ledger.


DELIVERS

  • 30 hours of audio → a verified 85 × 11 matrix
  • Every cell listened to and approved by a human
  • Full audit trail where algorithm and linguist disagreed
  • A Bayesian distribution over candidate trees — uncertainty is part of the result

DOES NOT CLAIM

  • The AI adjudicates dialect membership
  • The AI decides what is a cognate
  • The AI transcribes IPA for publication
  • The classification answer comes from the machine

For language assessment

The rater's judgment is the result.
Audit trails matter more than raw accuracy.


The same pattern — scout, second opinion, human-owned decision, full provenance — applies wherever expert judgment doesn't scale. Proficiency rating, error coding, dictation, oral-task transcription. The tool may not transfer; the discipline does.