An interesting realization here is that an automated transcription of a lecture is better for this purpose than manual closed captions or a written transcript, because those edit the material down.
You need to tokenize whatever text you have:
from nltk import word_tokenize
tokens = word_tokenize(transcript)
Realistically, you only care if this is a frequent occurrence, so the count is best combined with a threshold, or fed into a polynomial function that lowers the quality score for a transcript as the problem gets more severe.
check = ["um", "uh", "ah", "ehm", "eh", "uhm", "umm", "er"]

def umsScore(tokens):
    # Count the tokens that are filler words, ignoring case.
    cnt = 0
    for t in tokens:
        if t.lower() in check:
            cnt = cnt + 1
    return cnt

If you're looking for a Python book, Natural Language Processing with Python is a great way to learn the language while building some really interesting projects.
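As a rough sketch of the threshold and polynomial ideas mentioned earlier, here is one way they could fit together. The function names, the 2% threshold, the scale factor, and the exponent are all assumptions made for illustration, not values from any library or from testing; you'd want to tune them on your own transcripts.

```python
# Hypothetical names and constants, chosen only for illustration.
FILLERS = {"um", "uh", "ah", "ehm", "eh", "uhm", "umm", "er"}


def filler_rate(tokens):
    """Return the fraction of tokens that are filler words (0.0 if empty)."""
    if not tokens:
        return 0.0
    fillers = sum(1 for t in tokens if t.lower() in FILLERS)
    return fillers / len(tokens)


def quality_penalty(rate, threshold=0.02, scale=100.0, power=2):
    """Polynomial penalty: zero up to the threshold, then growing with the excess."""
    excess = max(0.0, rate - threshold)
    return (scale * excess) ** power


def adjusted_quality(base_score, tokens):
    """Subtract the penalty from a base score, never going below zero."""
    return max(0.0, base_score - quality_penalty(filler_rate(tokens)))
```

Using a rate rather than a raw count keeps long lectures from being penalized just for being long, and the threshold means an occasional "um" costs nothing.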