If you import the Google N-Grams data into Postgres, you can use it to compute TF-IDF measures on documents.
In my environment, I have talk transcripts stored in JSON files. In this example, I’ll show how to measure the distance between these and a word list (e.g. “I, me, my, myself, mine” etc).
import json

def get_transcript(theFile):
    # "path" is the directory that holds the transcript JSON files
    try:
        with open(path + theFile, encoding="utf8") as json_data:
            d = json.load(json_data)
        return d["transcript_s"]
    except Exception:
        print("Found error")
        return None

Once we have a transcript, we need to tokenize the text into words. The best way to do this is with NLTK, since it offers a lot of choices for how to go about it.
from nltk.tokenize import RegexpTokenizer
from collections import defaultdict

def get_tokens(text):
    # Split on words, dollar amounts, or any other run of non-whitespace
    tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
    return tokenizer.tokenize(text)

def get_counts(tokens):
    counts = defaultdict(int)
    for curr in tokens:
        counts[curr] += 1
    return counts

Before we compute TF-IDF, we need to know how often each word occurs in the N-Grams dataset. The important thing here is to memoize the results, so we only hit the database once per token.
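It's also worth being explicit about the table layout the lookup below assumes: one table per leading letter (ngrams_a, ngrams_b, and so on, matching the "ngrams_" + first-letter lookup in the code), with at least ngram, year, and volume_count columns. Here is a minimal sketch of that assumed layout; the match_count column is my guess based on the columns in Google's 1-gram files, not something the lookup depends on:

import psycopg2

conn = psycopg2.connect(
    "dbname='postgres' user='postgres' host='localhost' password='postgres'")
cur = conn.cursor()
# One table per leading letter; only ngrams_a is shown here.
# Columns mirror Google's 1-gram files: ngram, year, match_count, volume_count.
cur.execute("""
    create table if not exists ngrams_a (
        ngram        text,
        year         int,
        match_count  bigint,
        volume_count bigint
    )
""")
conn.commit()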
import psycopg2

seen_tokens = {}

def get_docs_with_token(token):
    # Memoize: only query the database the first time we see a token
    if token in seen_tokens:
        return seen_tokens[token]
    conn = psycopg2.connect(
        "dbname='postgres' "
        "user='postgres' "
        "host='localhost' "
        "password='postgres'")
    cur = conn.cursor()
    # The n-grams are split into one table per leading letter
    table = token[0].lower()
    cur.execute(
        "select volume_count from ngrams_" + table +
        " where year = 2008 and ngram = %s", (token,))
    rows = cur.fetchall()
    result = 0
    for row in rows:
        result = row[0]
    seen_tokens[token] = result
    return result

Once we have this, we can define the TF-IDF function for one term in our search. Strangely, the "log" function in Python is a natural log (there is no "ln" like you might expect). There are some options here: you may wish to dampen the values ("Relevant Search" says that Lucene takes the square root of values).
Note also that we're using "volumes" reported by Google n-grams as the number of documents in the "full" set. I've hard-coded the max number of documents in that set, since there is no point querying for it, but if you wanted to re-run this computation for every year in the dataset, it would need to be an array or a SQL query (a sketch of the latter is below).
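Here is roughly what the SQL-query version might look like. This is only a sketch: it assumes a hypothetical ngrams_totals table with one row per year, which you would load yourself from the "total counts" file Google publishes alongside the n-grams:

def get_total_docs(year):
    # Hypothetical table: ngrams_totals(year int, volume_count bigint)
    conn = psycopg2.connect(
        "dbname='postgres' user='postgres' host='localhost' password='postgres'")
    cur = conn.cursor()
    cur.execute(
        "select volume_count from ngrams_totals where year = %s", (year,))
    row = cur.fetchone()
    return row[0] if row else 0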
import math

def tfidf_token(search_token, all_tokens, all_token_counts):
    total_terms = len(all_tokens)
    term_count = all_token_counts[search_token]
    total_docs = 206272
    tf = 1.0 * term_count / total_terms
    docs_with_term = get_docs_with_token(search_token)
    if docs_with_term == 0:
        # Token never appears in the n-grams data; skip it rather than divide by zero
        return 0
    idf = math.log(1.0 * total_docs / docs_with_term)
    return tf * idf
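As one example of the dampening mentioned above, here is a sketch of a square-root-dampened variant. The function name and the exact formula are illustrative assumptions of mine, not what Lucene or "Relevant Search" prescribe:

def tfidf_token_dampened(search_token, all_tokens, all_token_counts):
    total_terms = len(all_tokens)
    term_count = all_token_counts[search_token]
    total_docs = 206272
    # Square-root dampening: very frequent terms contribute less than linearly
    tf = math.sqrt(1.0 * term_count / total_terms)
    docs_with_term = get_docs_with_token(search_token)
    if docs_with_term == 0:
        return 0
    idf = math.log(1.0 * total_docs / docs_with_term)
    return tf * idf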
Once we have this per-term function, it's a trivial exercise to get the score for each search term and sum them up:

def tfidf_search(search, file):
    transcript = get_transcript(file)
    all_tokens = get_tokens(transcript)
    all_token_counts = get_counts(all_tokens)
    vals = [tfidf_token(token, all_tokens, all_token_counts)
            for token in search]
    print(vals)
    score = sum(vals)
    print(score)
    return score

Once we've done this, all sorts of interesting possibilities are now available.
personal = ["I", "i", "Me", "me", "My", "my", "myself", "Myself"]

# "files" is the list of transcript file names in the path directory
for file in files:
    tfidf_search(personal, file)
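For instance, to see which talks lean most heavily on personal language, you could rank the transcripts by their scores. A small sketch, again assuming files is the list of transcript file names:

# Score every transcript and list the most "personal" talks first
scores = {file: tfidf_search(personal, file) for file in files}
for file, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(file, score)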