From interview transcripts and open-ended survey responses to large corpora and web text, we provide defensible qualitative coding and modern NLP workflows with reproducible code.
Inductive/deductive frameworks; memos; exemplar quotes.
Cohen’s κ, Krippendorff’s α; calibration & consensus rules.
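A minimal sketch of the agreement check, assuming scikit-learn and toy labels from two hypothetical coders (Krippendorff's α follows the same pattern via the third-party `krippendorff` package):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two coders on the same 8 segments
coder_a = ["theme1", "theme2", "theme1", "theme3", "theme2", "theme1", "theme3", "theme2"]
coder_b = ["theme1", "theme2", "theme2", "theme3", "theme2", "theme1", "theme3", "theme1"]

# Chance-corrected agreement: 6/8 raw agreement, adjusted for label frequencies
kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

Values below ~0.6 typically trigger another calibration round before full coding proceeds.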
Category frequencies, co-occurrence, keyness, collocations.
Rule-based/ML sentiment; lexicons; domain adaptation notes.
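At its simplest, a lexicon-based scorer is just set lookups; this sketch uses an illustrative mini-lexicon (not a published one) to show where domain adaptation slots in, by swapping the word sets:

```python
# Illustrative lexicon only; real projects use a validated, domain-adapted one.
POS = {"good", "great", "helpful", "clear"}
NEG = {"bad", "poor", "confusing", "slow"}

def lexicon_score(text: str) -> float:
    """Return (pos - neg) / matched tokens; 0.0 when nothing matches."""
    tokens = text.lower().split()
    pos = sum(t in POS for t in tokens)
    neg = sum(t in NEG for t in tokens)
    hits = pos + neg
    return (pos - neg) / hits if hits else 0.0

print(lexicon_score("great session but slow start"))  # one pos, one neg -> 0.0
```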
LDA/CTM/BERT-based topics; coherence; interpretability reporting.
SVM/logistic/trees; feature engineering; confusion matrices.
spaCy pipelines, custom labels, quality checks.
n-grams, keywords-in-context (KWIC), dispersion & concordances.
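A KWIC view needs only a windowed scan over tokens; this is a minimal sketch with a hypothetical sentence and a fixed three-token window:

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) tuples for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

text = "the coder reviewed the transcript and the coder flagged ambiguity".split()
for left, kw, right in kwic(text, "coder"):
    print(f"{left:>25} | {kw} | {right}")
```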
| Area | Examples | Notes |
|---|---|---|
| Pre-processing | Lowercasing, stopwords, lemmatisation | Custom dictionaries per domain |
| Feature sets | Bag-of-words, tf-idf, embeddings | Justified choice & ablations |
| Model quality | Accuracy/F1/AUC, coherence | Cross-validation & error analysis |
| Reliability | κ/α, % agreement | Coder training & reconciliation logs |
| Ethics | Consent, anonymisation | PII removal & risk notes |
| Reporting | Tables, wordclouds, KWIC | Journal-style figures with captions |
Codebook + coder training + κ/α + summary tables.
Clean → model (topics/sentiment or classifier) → figures + write-up.
Qual coding + NLP automation + mixed-methods integration.
Pricing varies with corpus size, annotation depth, and turnaround. You’ll receive a clear plan after discovery.
Research aims, data sources, ethics/privacy constraints.
Codebook/NLP plan, reliability and validation strategy, milestones.
Transcription QA, anonymisation, tokenisation, splits.
Qual coding and/or NLP modelling with diagnostics.
Tables, figures, exemplar quotes; Methods/Results text.
Datasets (as allowed), scripts/notebooks, codebook, change log.
Yes—native projects or exports (CSV/Excel); we can round-trip coded outputs.
We provide training rounds, compute κ/α, and document reconciliation steps.
Classical ML (tf-idf + SVM/logistic), topic models (LDA/BERT), lexicons, and modern embeddings.
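The classical baseline can be sketched as a scikit-learn pipeline; the snippets and labels below are hypothetical stand-ins for a coded corpus:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled snippets; real projects train on the coded corpus.
texts = ["refund denied twice", "very helpful support", "slow and unhelpful reply",
         "quick friendly resolution", "ignored my complaint", "great service overall"]
labels = ["neg", "pos", "neg", "pos", "neg", "pos"]

# tf-idf features feeding a logistic classifier, fit end to end
clf = Pipeline([("tfidf", TfidfVectorizer()), ("logreg", LogisticRegression())])
clf.fit(texts, labels)
print(clf.predict(["helpful and quick service"]))
```

In practice this baseline is cross-validated and compared against alternatives via confusion matrices and F1.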
Yes—with rate-limit aware collection, deduplication, bot/spam filters, and ethics notes.
Language-specific tokenisation/stoplists; multilingual models when appropriate.
You get all scripts/notebooks and (where allowed) an anonymised corpus with a data dictionary.
Yes—joint displays, code counts, exemplar quotes, and statistical links.
Co-occurrence networks, topic-term heatmaps, sentiment timelines, and KWIC tables.
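The edge weights behind a co-occurrence network are simple pair counts over the codes applied to each document; a minimal sketch with hypothetical coding output:

```python
from collections import Counter
from itertools import combinations

# Hypothetical coding output: the set of codes applied to each document
doc_codes = [
    {"access", "cost", "trust"},
    {"access", "cost"},
    {"trust", "privacy"},
    {"access", "trust"},
]

# Count each unordered code pair across documents
cooc = Counter()
for codes in doc_codes:
    cooc.update(combinations(sorted(codes), 2))

# Weighted edge list for the network figure, heaviest first
for (a, b), n in cooc.most_common(3):
    print(a, b, n)
```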
Small sets: days; large corpora or heavy coding: weeks with milestones.
By corpus size, annotation depth, modelling complexity, and turnaround. Quote follows discovery.