A talk on computer-assisted fieldwork

AI as a scout,
not a judge.

Building a Southern Kurdish dialect tree with PARSE.

Lucas Ardelean · University of Bamberg · University of Zurich · 2026

The classification puzzle

Sixty years.
No consensus.

11 speakers, 6 varieties — Faili, Kalhori, Khanaqini, Qasri, Mandali, Sahana
Iraq–Iran borderlands — a shatter zone of tribal, political, and contact pressure
MacKenzie (1961), Fattah (2000), Belelli (2019), Mohammadirad — no agreement
Isoglosses cross-cut each other; no clean tree by traditional methods

Map of Southern Kurdish speaker origins across the Iraq-Iran borderlands — Speaker origins — Iraq–Iran borderlands

Why a "press the button" AI does not work here

Three problems. One job.

Diagnostic contrasts get flattened. Uvular /q/ collapses to /k/. Dark /ɫ/ merges into plain /l/. Unaspirated stops get heard as voiced. The contrasts that distinguish one Southern Kurdish variety from another are exactly the ones that models trained mainly on European languages were never built to hear.
Borrowing looks like cognacy. Basic vocabulary is shot through with Arabic and Persian loans. A machine treats waqt ("time") as shared inheritance: the forms match phonetically, but the word is a shared loan, not an inherited one.
Fieldwork is rich; this question needs a slice. 530+ words elicited per speaker; only 85 are needed for the phylogeny. The other 445+ are valuable for other research (phonology, syntax, sociolinguistic variation), just not for the question a phylogeny asks.
The job is not "transcribe everything." It is to find the right 85, in the right order, with the right confidence.

A guiding principle

The machine narrows the search.
The linguist makes the judgment.

I.

AI scouts

Locates likely regions in long recordings. Offers a candidate transcription.

II.

The linguist judges

Every accept, edit, or rejection is an explicit human action.

III.

Saved as drafts

Models write candidates, not commitments. Nothing enters the dataset until a linguist accepts it.

IV.

The trail stays visible

Computed result and human correction stored as separate layers. Auditable forever.
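A minimal sketch of the two-layer idea, assuming a hypothetical record shape. The class and field names below are illustrative only and are not PARSE's actual schema; the values are made up. The point is that the model's candidate is never overwritten, and the human decision sits next to it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """What the model proposed. Frozen, so it cannot be edited later."""
    concept: str
    speaker: str
    ipa: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class Decision:
    """What the linguist did with the candidate."""
    action: str          # "accept", "edit", or "reject"
    ipa: Optional[str]   # corrected form for "edit", otherwise None
    reviewer: str


@dataclass
class Cell:
    """One concept-by-speaker cell: computed layer plus human layer."""
    candidate: Candidate
    decision: Optional[Decision] = None  # None means still a draft

    def final_form(self) -> Optional[str]:
        # Nothing counts as data until a human has acted on it.
        if self.decision is None or self.decision.action == "reject":
            return None
        if self.decision.action == "edit":
            return self.decision.ipa
        return self.candidate.ipa


# Illustrative values only.
cell = Cell(Candidate("hand", "speaker_01", "dast", 812.4, 813.1))
draft_form = cell.final_form()                 # still a draft
cell.decision = Decision("edit", "das", "reviewer_a")
edited_form = cell.final_form()                # human-corrected form
```

After the edit, `cell.candidate.ipa` still holds the model's original proposal, so any disagreement between algorithm and linguist remains inspectable.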

Where AI sits in the pipeline

Four stations. One chain of human review.

Step 1

Read

Three models transcribe 2–5 h of audio per speaker — orthographic, Kurdish-script, and phoneme-level IPA.

AI

Step 2

Review

Linguist checks quality, picks the best repetition, splits multi-word responses, re-runs noisy outputs.

Human

Step 3

Locate

Use verified anchors plus the transcript layers to find the right 85 target words among the 530+ elicited.

AI

Step 4

Group cognates

LexStat clusters forms using Levenshtein distance + sound correspondences learned from the data.

Computational

All four outputs feed Bayesian inference → a probability-weighted distribution over candidate trees.
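Bayesian phylogenetic tools commonly take cognate judgments as binary presence/absence characters, one column per cognate set. The sketch below assumes that encoding; the talk does not say which encoding PARSE exports. Variety labels A, B, C and the cognate-set IDs are placeholders, not results.

```python
def binary_matrix(cognates):
    """Turn {variety: {concept: cognate_set_id or None}} into 0/1/? strings.

    Each column is one (concept, cognate_set_id) pair. A variety gets "1"
    if its form belongs to that set, "0" if it belongs to another set for
    the same concept, and "?" if the concept is missing for that variety.
    """
    columns = sorted(
        {
            (concept, cid)
            for forms in cognates.values()
            for concept, cid in forms.items()
            if cid is not None
        }
    )
    rows = {}
    for variety, forms in cognates.items():
        bits = []
        for concept, cid in columns:
            own = forms.get(concept)
            if own is None:
                bits.append("?")
            else:
                bits.append("1" if own == cid else "0")
        rows[variety] = "".join(bits)
    return columns, rows


# Placeholder data: three varieties, two concepts.
toy = {
    "A": {"hand": 1, "time": 1},
    "B": {"hand": 1, "time": 2},
    "C": {"hand": 2, "time": None},
}
columns, rows = binary_matrix(toy)
```

Missing cells stay explicit as "?" rather than being guessed, which keeps gaps in the elicitation visible to the tree inference.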

Job 1 · Reading the audio

Three reads
of the same audio.

Speech-to-text. Whisper produces orthographic words with word-level timestamps.
Kurdish-tuned. A Whisper variant fine-tuned for Southern Kurdish returns Kurdish-script transcriptions.
IPA, independently. A phoneme-level model reads the same audio as IPA — not as words.
Three layers, one waveform. Finding the right 85 of 530+ words joins these layers with cross-speaker reference data.
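One way to join a word-level layer with a phoneme-level layer is by time overlap. The sketch below assumes both layers carry start and end times in seconds and assigns each phone to the word whose span contains the phone's midpoint. The function name and the toy timings are hypothetical, not taken from PARSE.

```python
def attach_ipa(words, phones):
    """Attach phone symbols to words by midpoint.

    words:  list of (text, start_s, end_s) from the orthographic layer
    phones: list of (symbol, start_s, end_s) from the IPA layer
    """
    joined = []
    for text, w_start, w_end in words:
        inside = [
            symbol
            for symbol, p_start, p_end in phones
            if w_start <= (p_start + p_end) / 2 < w_end
        ]
        joined.append((text, "".join(inside)))
    return joined


# Toy timings, for illustration only.
words = [("dast", 1.00, 1.40)]
phones = [
    ("d", 1.00, 1.08),
    ("a", 1.08, 1.25),
    ("s", 1.25, 1.33),
    ("t", 1.33, 1.40),
    ("m", 1.60, 1.70),  # falls outside the word span, so it is ignored
]
joined = attach_ipa(words, phones)
```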

PARSE Annotate workstation showing waveform with parallel transcription tiers and ranked candidate regions — Annotate — three layers, one waveform

Job 2 · Review and decide

The work AI
can't do.

Check transcription quality. Is what the model heard plausible given the audio?
Pick the repetition. Each lexeme was elicited 2–4 times; choose which to use.
Split multi-word responses. Speakers sometimes gave synonyms or false starts — both belong, in their own slots.
Re-run noisy IPA. If the model output came back garbled, run it again with different settings.
Spectrogram check. For ambiguous IPA, verify against formants and voice onset time (VOT).

PARSE transcription lanes showing parallel IPA and orthographic tiers for one speaker — Parallel IPA and orthography lanes for review

Job 3 · Anchoring

The dataset
becomes its own map.

Verified words form a map of the elicitation
Missing words can be predicted to fall in a narrow time window
Cross-speaker matching: seven verified "hand"s make the eighth easier
Gets faster as the dataset grows — not magic, just more reference points
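The narrowing effect can be sketched as follows, assuming words were elicited in a known wordlist order (an assumption of this example). A missing word must fall between the nearest verified word before it and the nearest verified word after it. Function name and timestamps are hypothetical.

```python
def predict_window(order, verified, target):
    """Return (earliest_s, latest_s) for where `target` should occur.

    order:    concepts in elicitation order
    verified: {concept: (start_s, end_s)} for human-approved words
    latest_s is None when no later anchor exists yet.
    """
    i = order.index(target)
    # End time of the closest verified word before the target.
    earliest = next(
        (verified[c][1] for c in reversed(order[:i]) if c in verified), 0.0
    )
    # Start time of the closest verified word after the target.
    latest = next(
        (verified[c][0] for c in order[i + 1:] if c in verified), None
    )
    return earliest, latest


order = ["head", "eye", "hand", "foot", "water"]
verified = {"head": (10.0, 10.8), "foot": (42.5, 43.2)}

wide = predict_window(order, verified, "hand")

# One more verified anchor shrinks the search window.
verified["eye"] = (20.0, 20.6)
narrow = predict_window(order, verified, "hand")
```

Each newly verified word tightens the window for its neighbours, which is why the search speeds up as review progresses.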

PARSE pipeline view showing the full processing chain from raw audio to ranked candidates — Raw audio → ranked candidates, ready for human review

Job 4 · Adjudicating

Cognate groups —
and where they fail.

LexStat uses Levenshtein distance on IPA strings, weighted by sound-correspondence patterns learned from the data itself
Got this wrong: grouped Arabic waqt ("time") as cognate — a shared borrowing, not shared inheritance
Missed this: failed to link dast and das ("hand"); loss of the final /t/ in the cluster looked like a mismatch
Both fixed by hand; corrections preserved alongside the algorithm's output
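Both failure modes can be reproduced with plain normalized edit distance. This is a simplified stand-in, not LexStat's weighted scoring, and the cutoff is a hypothetical value chosen only to make the example concrete. Identical loans score a perfect match, a one-segment loss counts against the pair, and the human corrections live in their own layer.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb))
            )
        prev = cur
    return prev[-1]


def normalized(a, b):
    """Edit distance scaled to 0..1 by the longer form."""
    return levenshtein(a, b) / max(len(a), len(b))


THRESHOLD = 0.2  # hypothetical cutoff, not a LexStat default

pairs = {"hand": ("dast", "das"), "time": ("waqt", "waqt")}

# Computed layer: what string similarity alone would conclude.
computed = {c: normalized(a, b) <= THRESHOLD for c, (a, b) in pairs.items()}

# Human layer: stored separately, never written over the computed layer.
overrides = {
    "hand": True,   # same inherited word, final /t/ lost
    "time": False,  # shared Arabic borrowing, not shared inheritance
}

final = {c: overrides.get(c, computed[c]) for c in pairs}
```

Distance cannot see history: `waqt` against `waqt` scores 0.0 whether the word was inherited or borrowed, which is why that call stays with the linguist.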

PARSE Compare mode showing concept-by-speaker matrix with cognate adjudication controls — Compare — concept × speaker matrix

PARSE — keeping the human in charge

Two modes. One dataset.

PARSE Annotate mode — waveform, IPA and orthography tiers, ranked candidates — Annotate · close listening · one speaker

PARSE Compare mode — concept by speaker matrix with cognate adjudication — Compare · pattern recognition · all speakers

Original audio never cut · Algorithm and human stored separately · LingPy + NEXUS export

Beyond the four jobs

AI all the way down.

Setup work, agent-driven. Import wordlists, rename surveys, organize speakers, kick off jobs, export data — all callable.
Any pipeline job, from chat. Read, Review, Locate, Group cognates: an agent can invoke the same four jobs directly.
No coding required. A researcher with basic AI literacy gets the same tooling a developer would.
The principles still hold. Agent scouts. Linguist judges. Drafts, not commitments. Audit trail intact.

The Bayesian step, in plain English

Like a confidence interval —
but for trees.

I.

Old way

Pick one "best" tree and report it as the answer.

II.

New way

Sample a probability-weighted ensemble of plausible trees.

III.

Each grouping gets a number

"Kalhori–Khanaqini together: 87% of sampled trees." A probability, not a yes/no.

IV.

Honest about uncertainty

When the data is ambiguous, the method says so. Low support is information, not failure.

Why this method, for this data

Built for messy,
contested data.

Sample-size honesty. 11 speakers, 6 varieties. Traditional methods produce over-confident single answers; Bayesian methods report what the evidence can actually support.
Contact zones are first-class citizens. Borrowing and convergence are modeled alongside inheritance, and different features change at different speeds — unlike tree-only methods that break when isoglosses cross-cut.
Every hypothesis gets a probability. Instead of "no consensus," you get "Kalhori–Khanaqini grouping: 87% under a contact model." Testable. Updatable by future fieldwork.
Lowers the transcription bar. Levenshtein on hand-IPA needs publication-quality precision, which takes months per speaker. Bayesian inference over cognate sets only needs a "same word or not?" judgment, a bar the orthographic and IPA model layers can clear together.

From the engine, visually

What the output
actually looks like.

Rooted Bayesian phylogenetic tree of Kurdish, Gorani, and Zazaki varieties with posterior probabilities at each internal node — Rooted tree — posterior support at each node

Unrooted radial display of the same Kurdish, Gorani, and Zazaki variety relationships — Unrooted radial display — same data

Numbers such as 0.9999 and 0.5675 are posterior probabilities: the share of sampled trees that contain that grouping.
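That share is straightforward to compute once each sampled tree is reduced to its set of groupings. The sketch below assumes a tree is represented as a set of clades, each clade a frozenset of tip names. The tips A to D and the four sampled trees are placeholders, not results from this study.

```python
def clade_support(sampled_trees, grouping):
    """Fraction of sampled trees in which `grouping` appears as a clade."""
    target = frozenset(grouping)
    hits = sum(1 for clades in sampled_trees if target in clades)
    return hits / len(sampled_trees)


# Placeholder posterior sample: four trees over tips A, B, C, D.
sampled_trees = [
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "C"}), frozenset({"B", "D"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
]

support_ab = clade_support(sampled_trees, {"A", "B"})
support_cd = clade_support(sampled_trees, {"C", "D"})
```

A real posterior sample would contain thousands of trees, but the arithmetic is the same: count the trees with the grouping and divide by the sample size.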

What this delivers — and what it doesn't claim

The honest ledger.

DELIVERS

30 hours of audio → a verified 85 × 11 matrix
Every cell listened to and approved by a human
Full audit trail where algorithm and linguist disagreed
A Bayesian distribution over candidate trees — uncertainty is part of the result

DOES NOT CLAIM

The AI adjudicates dialect membership
The AI decides what is a cognate
The AI transcribes IPA for publication
The classification answer comes from the machine

For language assessment

The rater's judgment is the result.
Audit trails matter more than raw accuracy.

The same pattern — scout, second opinion, human-owned decision, full provenance — applies wherever expert judgment doesn't scale. Proficiency rating, error coding, dictation, oral-task transcription. The tool may not transfer; the discipline does.

PARSE — github.com/ArdeleanLucas/PARSE · MIT licensed
Thank you. Questions welcome.