Channel: CodeSection,代码区,Python开发技术文章_教程 - CodeSec

Scoring documents for quality in Python: how often does a speaker say “um”?


An interesting realization here is that an automated transcription of a lecture is better suited to this purpose than manual closed captions or a written transcript, because those are edited down and typically drop the filler words you want to count.

You need to tokenize whatever text you have:

from nltk import word_tokenize
tokens = word_tokenize(transcript)
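If NLTK or its tokenizer data is not installed, a rough regex tokenizer is usually enough for counting filler words. The sketch below is an assumption, not part of NLTK: `simple_tokenize` is a hypothetical helper, and unlike `word_tokenize` it drops punctuation and keeps contractions such as "we're" as a single token.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for word_tokenize: keep runs of letters and
    # apostrophes, discard punctuation and digits.
    return re.findall(r"[A-Za-z']+", text)

transcript = "Um, so today we're, uh, going to talk about parsing."
tokens = simple_tokenize(transcript)
print(tokens)
```

Either tokenizer produces a list of strings, so the filler count works the same way on both.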

Realistically, you only care when filler words are frequent, so the count is best combined with a threshold, or fed into a polynomial function that lowers a transcript's quality score as the problem gets more severe.

# Filler words to look for (lowercase)
check = {"um", "uh", "ah", "ehm", "eh", "uhm", "umm", "er"}

def umsScore(tokens):
    # Count the tokens that are filler words, ignoring case
    cnt = 0
    for t in tokens:
        if t.lower() in check:
            cnt = cnt + 1
    return cnt
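One possible way to turn the raw count into a score is sketched below, combining the threshold and polynomial ideas. Everything here is an assumption for illustration: the names `filler_rate` and `quality_penalty`, the 2% threshold, and the scale factor are not from the original article. The count is normalized by transcript length so long lectures are not penalized just for being long.

```python
FILLERS = {"um", "uh", "ah", "ehm", "eh", "uhm", "umm", "er"}

def filler_rate(tokens):
    """Fraction of tokens that are filler words (0.0 for empty input)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in FILLERS)
    return hits / len(tokens)

def quality_penalty(rate, threshold=0.02, scale=50.0):
    """Quadratic penalty: zero at or below the threshold, capped at 1.0.

    threshold and scale are illustrative values, not tuned ones.
    """
    excess = max(0.0, rate - threshold)
    return min(1.0, (scale * excess) ** 2)

tokens = "um so today we are uh going to um talk about parsing".split()
rate = filler_rate(tokens)
print(rate, quality_penalty(rate))
```

A quadratic keeps the penalty small for an occasional "um" and lets it grow quickly once fillers become a habit; a different degree or cap may fit other data better.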

If you're looking for a Python book, Natural Language Processing with Python is a great way to learn the language while building some really interesting projects.

