If you import the Google N-Grams data into Postgres, you can use it to compute TF-IDF measures on documents.
In my environment, I have talk transcripts stored in JSON files. In this example, I’ll show how to measure the distance between these and a word list (e.g. “I, me, my, myself, mine” etc).
import json

def get_transcript(theFile):
    # "path" is the directory that holds the transcript JSON files
    try:
        with open(path + theFile, encoding="utf8") as json_data:
            d = json.load(json_data)
        return d["transcript_s"]
    except Exception:
        print("Found error")
        return None

Once we have a transcript, we need to tokenize the text into words. The best way to do this is with NLTK, since it offers a lot of choices for how to go about it.
from nltk.tokenize import RegexpTokenizer
from collections import defaultdict

def get_tokens(text):
    # Split on words, dollar amounts, or any other run of non-whitespace
    tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
    return tokenizer.tokenize(text)

def get_counts(tokens):
    counts = defaultdict(int)
    for curr in tokens:
        counts[curr] += 1
    return counts

Before we compute TF-IDF, we need to know how often each word occurs in the N-Grams dataset. The important thing here is to memoize the results, so we only hit the database once per token.
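It's also worth being explicit about the table layout the lookup below assumes: one table per leading letter (ngrams_a, ngrams_b, and so on, matching the "ngrams_" + first-letter lookup in the code), with at least ngram, year, and volume_count columns. Here is a minimal sketch of that assumed layout; the match_count column is my guess based on the columns in Google's 1-gram files, not something the lookup depends on:

import psycopg2

conn = psycopg2.connect(
    "dbname='postgres' user='postgres' host='localhost' password='postgres'")
cur = conn.cursor()
# One table per leading letter; only ngrams_a is shown here.
# Columns mirror Google's 1-gram files: ngram, year, match_count, volume_count.
cur.execute("""
    create table if not exists ngrams_a (
        ngram        text,
        year         int,
        match_count  bigint,
        volume_count bigint
    )
""")
conn.commit()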
import psycopg2

seen_tokens = {}

def get_docs_with_token(token):
    # Memoize: only query the database the first time we see a token
    if token in seen_tokens:
        return seen_tokens[token]
    conn = psycopg2.connect(
        "dbname='postgres' "
        "user='postgres' "
        "host='localhost' "
        "password='postgres'")
    cur = conn.cursor()
    # The n-grams are split into one table per leading letter
    table = token[0].lower()
    cur.execute(
        "select volume_count from ngrams_" + table +
        " where year = 2008 and ngram = %s", (token,))
    rows = cur.fetchall()
    result = 0
    for row in rows:
        result = row[0]
    seen_tokens[token] = result
    return result

Once we have this, we can define the TF-IDF function for one term in our search. Strangely, the "log" function in Python is a natural log (there is no "ln" like you might expect). There are some options here: you may wish to dampen the values ("Relevant Search" says that Lucene takes the square root of values).
Note also that we're using "volumes" reported by Google n-grams as the number of documents in the "full" set. I've hard-coded the max number of documents in that set, since there is no point querying for it, but if you wanted to re-run this computation for every year in the dataset, it would need to be an array or a SQL query (a sketch of the latter is below).
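Here is roughly what the SQL-query version might look like. This is only a sketch: it assumes a hypothetical ngrams_totals table with one row per year, which you would load yourself from the "total counts" file Google publishes alongside the n-grams:

def get_total_docs(year):
    # Hypothetical table: ngrams_totals(year int, volume_count bigint)
    conn = psycopg2.connect(
        "dbname='postgres' user='postgres' host='localhost' password='postgres'")
    cur = conn.cursor()
    cur.execute(
        "select volume_count from ngrams_totals where year = %s", (year,))
    row = cur.fetchone()
    return row[0] if row else 0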
import math

def tfidf_token(search_token, all_tokens, all_token_counts):
    total_terms = len(all_tokens)
    term_count = all_token_counts[search_token]
    total_docs = 206272
    tf = 1.0 * term_count / total_terms
    docs_with_term = get_docs_with_token(search_token)
    if docs_with_term == 0:
        # Token never appears in the n-grams data; skip it rather than divide by zero
        return 0
    idf = math.log(1.0 * total_docs / docs_with_term)
    return tf * idf
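As one example of the dampening mentioned above, here is a sketch of a square-root-dampened variant. The function name and the exact formula are illustrative assumptions of mine, not what Lucene or "Relevant Search" prescribe:

def tfidf_token_dampened(search_token, all_tokens, all_token_counts):
    total_terms = len(all_tokens)
    term_count = all_token_counts[search_token]
    total_docs = 206272
    # Square-root dampening: very frequent terms contribute less than linearly
    tf = math.sqrt(1.0 * term_count / total_terms)
    docs_with_term = get_docs_with_token(search_token)
    if docs_with_term == 0:
        return 0
    idf = math.log(1.0 * total_docs / docs_with_term)
    return tf * idf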
Once we have this per-term function, it's a trivial exercise to get the score for each search term and sum them up:

def tfidf_search(search, file):
    transcript = get_transcript(file)
    all_tokens = get_tokens(transcript)
    all_token_counts = get_counts(all_tokens)
    vals = [tfidf_token(token, all_tokens, all_token_counts)
            for token in search]
    print(vals)
    score = sum(vals)
    print(score)
    return score

Once we've done this, all sorts of interesting possibilities are now available.
personal = ["I", "i", "Me", "me", "My", "my", "myself", "Myself"]

# "files" is the list of transcript file names in the path directory
for file in files:
    tfidf_search(personal, file)
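For instance, to see which talks lean most heavily on personal language, you could rank the transcripts by their scores. A small sketch, again assuming files is the list of transcript file names:

# Score every transcript and list the most "personal" talks first
scores = {file: tfidf_search(personal, file) for file in files}
for file, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(file, score)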