Back when I was learning about text mining, I wrote this post titled IR Math with Java: TF, IDF and LSI . A recent comment/question on that post sparked off a train of thought which ended up being a driver for this post. In addition, I also wanted to compare a few text vectorization strategies for something else I was doing, so that was another driver. In contrast to the previous post, where I was exploring ideas described in the Text Mining Application Programming book using toy datasets, in this post I use a larger dataset. In addition, the dataset is (sort of) labeled, so I use that to compare the approaches quantitatively.
The dataset I chose for this exercise is the Reuters-21578 corpus from the UCI Machine Learning Repository . The corpus is a collection of 21,578 news stories that appeared on the Reuters newswire service in 1987. Each document is manually categorized into zero or more category tags. There are 481 unique tags across the documents. The number of tags per document vary from 0 (for about 1,862 documents) to 35. The distribution is heavily right-skewed, with the mean number of tags per document being 2.194 and the median being 2. The top histograms below show the distribution of tag tags per document across the corpus. The bottom chart shows the distribution of the top 20 (by frequency) tags.

In addition to the category tags, each document has a title and a block of text. For our analysis, we will consider the title to be part of the text and treat each document as simply a collection of terms. The longest document has 53 sentences, but the average document contains about 6.67 sentences each.
Since the category tags are manually assigned, we can think of them as ground truth labels. Then the overlap of the category tags for a pair of documents can be considered the true value of the similarity between them. We can now try various vectorization techniques using the title and body of the documents, and compute the similarity between pairs of document vectors. The correlation between the distribution of similarities computed between document vectors and that computed between category tags would then indicate the overall quality of the vectorization technique.
The vectorization techniques I have compared in this post are raw word counts (aka Term Frequency or TF), Term Frequency Inverse Document Frequency (TF-IDF) , Latent Semantic Analysis (LSA) , Global Vectors for Word Representation (GloVe) and Word2Vec embeddings . The general approach is as follows. We compute (once) the category tag vectors based on raw counts. Then for each vectorization strategy, we generate the document vectors using that strategy. We then compute the category tag similarities and corresponding text similarity between all pairs of documents, and compute the Pearson Correlation coefficient between these two distributions.
Before we do that though, we need to parse out the Reuters-21578 corpus into a format our downstream components can consume easily. Scikit-Learn examples contain a parser for the Reuters-21578 corpus that I adapted in my code (almost verbatim) below, which parses the dataset and writes out the text and tags into two separate files.
# Source: src/parse-input.py from __future__ import division, print_function from sklearn.externals.six.moves import html_parser from glob import glob import collections import nltk import os import re class ReutersParser(html_parser.HTMLParser): """ Utility class to parse a SGML file and yield documents one at a time. """ def __init__(self, encoding='latin-1'): html_parser.HTMLParser.__init__(self) self._reset() self.encoding = encoding def handle_starttag(self, tag, attrs): method = 'start_' + tag getattr(self, method, lambda x: None)(attrs) def handle_endtag(self, tag): method = 'end_' + tag getattr(self, method, lambda: None)() def _reset(self): self.in_title = 0 self.in_body = 0 self.in_topics = 0 self.in_topic_d = 0 self.title = "" self.body = "" self.topics = [] self.topic_d = "" def parse(self, fd): self.docs = [] for chunk in fd: self.feed(chunk.decode(self.encoding)) for doc in self.docs: yield doc self.docs = [] self.close() def handle_data(self, data): if self.in_body: self.body += data elif self.in_title: self.title += data elif self.in_topic_d: self.topic_d += data def start_reuters(self, attributes): pass def end_reuters(self): self.body = re.sub(r'\s+', r' ', self.body) self.docs.append({'title': self.title, 'body': self.body, 'topics': self.topics}) self._reset() def start_title(self, attributes): self.in_title = 1 def end_title(self): self.in_title = 0 def start_body(self, attributes): self.in_body = 1 def end_body(self): self.in_body = 0 def start_topics(self, attributes): self.in_topics = 1 def end_topics(self): self.in_topics = 0 def start_d(self, attributes): self.in_topic_d = 1 def end_d(self): self.in_topic_d = 0 self.topics.append(self.topic_d) self.topic_d = "" def stream_reuters_documents(reuters_dir): """ Iterate over documents of the Reuters dataset. The Reuters archive will automatically be downloaded and uncompressed if the `data_path` directory does not exist. Documents are represented as dictionaries with 'body' (str), 'title' (str), 'topics' (list(str)) keys. """ parser = ReutersParser() for filename in glob(os.path.join(reuters_dir, "*.sgm")): for doc in parser.parse(open(filename, 'rb')): yield doc def maybe_build_vocab(reuters_dir, vocab_file): vocab = collections.defaultdict(int) if os.path.exists(vocab_file): fvoc = open(vocab_file, "rb") for line in fvoc: word, idx = line.strip().split("\t") vocab[word] = int(idx) fvoc.close() else: counter = collections.Counter() num_docs_read = 0 for doc in stream_reuters_documents(reuters_dir): if num_docs_read % 100 == 0: print("building vocab from {:d} docs" .format(num_docs_read)) topics = doc["topics"] if len(topics) == 0: continue title = doc["title"] body = doc["body"] title_body = ". ".join([title, body]).lower() for sent in nltk.sent_tokenize(title_body): for word in nltk.word_tokenize(sent): counter[word] += 1 for i, c in enumerate(counter.most_common(VOCAB_SIZE)): vocab[c[0]] = i + 1 num_docs_read += 1 print("vocab built from {:d} docs, complete" .format(num_docs_read)) fvoc = open(vocab_file, "wb") for k in vocab.keys(): fvoc.write("{:s}\t{:d}\n".format(k, vocab[k])) fvoc.close() return vocab ##################### main ###################### DATA_DIR = "../data" REUTERS_DIR = os.path.join(DATA_DIR, "reuters-21578") VOCAB_FILE = os.path.join(DATA_DIR, "vocab.txt") VOCAB_SIZE = 5000 vocab = maybe_build_vocab(REUTERS_DIR, VOCAB_FILE) ftext = open(os.path.join(DATA_DIR, "text.tsv"), "wb") ftags = open(os.path.join(DATA_DIR, "tags.tsv"), "wb") num_read = 0 for doc in stream_reuters_documents(REUTERS_DIR): # skip docs without specified topic topics = doc["topics"] if len(topics) == 0: continue title = doc["title"] body = doc["body"] num_read += 1 # concatenate title and body and convert to list of word indexes title_body = ". ".join([title, body]).lower() title_body = re.sub("\n", "", title_body) title_body = title_body.encode("utf8").decode("ascii", "ignore") ftext.write("{:d}\t{:s}\n".format(num_read, title_body)) ftags.write("{:d}\t{:s}\n".format(num_read, ",".join(topics))) ftext.close() ftags.close()The next step is to build the vectors for the category tags. A document can have zero or more tags, but tags are never repeated within a document. So we use a CountVectorizer to build a sparse vector of the same size as the number of unique tags. The vector is mostly zero except for the positions represented by its tags.
# Source: src/tag-sims.py from __future__ import division, print_function from sklearn.feature_extraction.text import CountVectorizer import os import re import dsutils DATA_DIR = "../data" VECTORS_FILE = os.path.join(DATA_DIR, "tag-vecs.mtx") tags = [] ftags = open(os.path.join(DATA_DIR, "tags.tsv"), "rb") for line in ftags: docid, taglist = line.strip().split("\t") taglist = re.sub(",", " ", taglist) tags.append(taglist) ftags.close() cvec = CountVectorizer() X = cvec.fit_transform(tags) dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)On the document text side, the baseline vectorizer using raw counts is very similar. The only difference is that we filter out English stop words and we limit our vocabulary to the top 5,000 of the approximately 45,000 terms in the vocabulary.
# Source: src/wordcount-sims.py from __future__ import division, print_function from sklearn.feature_extraction.text import CountVectorizer import os import dsutils DATA_DIR = "../data" MAX_FEATURES = 50 VECTORS_FILE = os.path.join(DATA_DIR, "wordcount-{:d}-vecs.mtx".format(MAX_FEATURES)) texts = [] ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb") for line in ftext: docid, text = line.strip().split("\t") texts.append(text) ftext.close() cvec = CountVectorizer(max_features=MAX_FEATURES, stop_words="english", binary=True) X = cvec.fit_transform(texts) dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)Having generated these files, we can now compute the similarity between all pairs of tag vectors and text vectors. The similarity metric used is Cosine Similarity , chosen because it can be efficiently computed using matrix operations. We then extract the upper triangular matrix from each matrix so we count each pair only once. Further, the diagonal is also excluded so we don't consider similarities between the same vectors. The upper triangular matrices are flattened and the Pearson correlation coefficient calculated.
# Source: src/calc-pearson.py from __future__ import division, print_function from scipy import stats import os import time import dsutils DATA_DIR = "../data" VECTORIZER = "wordcount" #VECTORIZER = "tfidf" #VECTORIZER = "lsa" #VECTORIZER = "glove" #VECTORIZER = "w2v" X_IS_SPARSE = True Y_IS_SPARSE = True #Y_IS_SPARSE = False NUM_FEATURES = 5000 XFILE = os.path.join(DATA_DIR, "tag-vecs.mtx") YFILE = os.path.join(DATA_DIR, "{:s}-{:d}-vecs.{:s}" .format(VECTORIZER, NUM_FEATURES, "mtx" if Y_IS_SPARSE else "csv")) X = dsutils.load_vectors(XFILE, is_sparse=X_IS_SPARSE) Y = dsutils.load_vectors(YFILE, is_sparse=Y_IS_SPARSE) XD = dsutils.compute_cosine_sims(X, is_sparse=X_IS_SPARSE) YD = dsutils.compute_cosine_sims(Y, is_sparse=Y_IS_SPARSE) XDT = dsutils.get_upper_triangle(XD, is_sparse=X_IS_SPARSE) YDT = dsutils.get_upper_triangle(YD, is_sparse=Y_IS_SPARSE) corr, _ = stats.pearsonr(XDT, YDT) print("Pearson correlation: {:.3f}".format(corr))Another thing to note is that the CountVectorizer returns a Scipy Sparse Matrix , so the tag vectors and the raw count based text vectors are both sparse. We continue to use sparse matrix operations all the way till we extract the upper triangle from the similarity matrices, ie, until line 36 above. The input vectors to the stats.pearson call are both dense.
However, for some of the later vectorization approaches starting with LSA, the vectors are necessarily dense, so we use Numpy's operations for dense matrices instead. That is the reason we specify the is_sparse parameter for all our parameters. Also, sparse matrices are stored in Matrix Market Format , and dense matrices are stored in Numpy Text (CSV) format , so the setting of the is_sparse parameter can be used to detect the file name suffix as well. Code for the dsutils module is shown below:
# Source: dsutils.py from __future__ import division, print_function from scipy import sparse, io, stats import matplotlib.pyplot as plt import numpy as np import numpy.linalg as LA def compute_cosine_sims(X, is_sparse=True): if is_sparse: Xnormed = X / sparse.linalg.norm(X, "fro") Xtnormed = X.T / sparse.linalg.norm(X.T, "fro") S = Xnormed * Xtnormed else: Xnormed = X / LA.norm(X, ord="fro") Xtnormed = X.T / LA.norm(X.T, ord="fro") S = np.dot(Xnormed, Xtnormed) return S def save_vectors(X, filename, is_sparse=True): if is_sparse: io.mmwrite(filename, X) else: np.savetxt(filename, X, delimiter=",", fmt="%.5e") def load_vectors(filename, is_sparse=True): if is_sparse: return io.mmread(filename) else: return np.loadtxt(filename, delimiter=",") def get_upper_triangle(X, k=1, is_sparse=True): if is_sparse: return sparse.triu(X, k=k).toarray().flatten() else: return np.triu(X, k=k).flatten()For word count based vectors, using a vocabulary of the top 5,000 words, correlation of the cosine similarity distribution with the tag vectors was 0.135. Filtering out the English stopwords increased it to 0.276. Binarizing the count vector (so we count each word in a document only once) increased it further to 0.414. Varying the vocabulary size did not change these numbers very significantly.
Generating vectors for TF-IDF vectors is simply a matter of using a different vectorizer, the TfidfVectorizer , also available in Scikit-Learn. Like the CountVectorizer, it generates sparse vectors.
# Source: src/tfidf-sims.py from __future__ import division, print_function from sklearn.feature_extraction.text import TfidfVectorizer import os import dsutils DATA_DIR = "../data" MAX_FEATURES = 300 VECTORS_FILE = os.path.join(DATA_DIR, "tfidf-{:d}-vecs.mtx".format(MAX_FEATURES)) texts = [] ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb") for line in ftext: docid, text = line.strip().split("\t") texts.append(text) ftext.close() tvec = TfidfVectorizer(max_features=MAX_FEATURES, min_df=0.1, sublinear_tf=True, stop_words="english", binary=True) X = tvec.fit_transform(texts) dsutils.save_vectors(X, VECTORS_FILE, is_sparse=True)With a vocabulary of 5,000 most important terms, the correlation was 0.453. Adding stopwords made it rise to 0.464, and binarizing the vector gave us our best correlation of 0.466.
Next up is using Latent Semantic Analysis (LSA) to rotate the co-ordinate space so that the first few dimensions contain the maximum variances, and reducing the features to the first few dimensions. As you can see from the code below, we use a TfidfVectorizer to generate vectors against the full vocabulary, then use TruncatedSVD to rotate the co-ordinate space and restrict the number of dimensions. The resulting vectors are dense.
# Source: src/lsa-sims.py from __future__ import division, print_function from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.decomposition import TruncatedSVD import os import dsutils DATA_DIR = "../data" MAX_FEATURES = 50 VECTORS_FILE = os.path.join(DATA_DIR, "lsa-{:d}-vecs.csv".format(MAX_FEATURES)) texts = [] ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb") for line in ftext: docid, text = line.strip().split("\t") texts.append(text) ftext.close() tvec = TfidfVectorizer(sublinear_tf=True, stop_words="english", binary=True) Xraw = tvec.fit_transform(texts) lsa = TruncatedSVD(n_components=MAX_FEATURES, random_state=42) X = lsa.fit_transform(Xraw) dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)Unlike textbook examples where the first few dimensions account for 90+ percent of the variance, I needed to go to the top 1000 dimensions to get 44 percent of the variance.
In [32]: np.sum(lsa.explained_variance_ratio_[0:10]) Out[32]: 0.0465150637457711 In [33]: np.sum(lsa.explained_variance_ratio_[0:300]) Out[33]: 0.24895843614681598 In [34]: np.sum(lsa.explained_variance_ratio_[0:500]) Out[34]: 0.31942420156803719 In [35]: np.sum(lsa.explained_variance_ratio_[0:500]) Out[35]: 0.32257375258317 In [36]: np.sum(lsa.explained_variance_ratio_[0:1000]) Out[36]: 0.44443753062911762Paradoxically, using a dimension of 1,000 for the text vectors gave me a correlation of 0.424, while reducing the dimension progressively to 500, 300, 200, 100 and 50 gave me correlations of 0.431, 0.437, 0.442, 0.450 and 0.457 respectively. In other words, decreasing the number of dimensions resulted in higher correlation between similarities achieved using category tags and LSA vectors.
The next vectorizing approach I tried uses GloVe embeddings. GloVe uses matrix factorization on a matrix of word co-occurrence statistics from a corpus to generate word representations that include its semantics. The GloVe project has made these embeddings available via their website (see link). We will be using the glove.6B set, which is created out of Wikipedia 2016 and Gigaword 5 corpora, containing 6 billion tokens and a vocabulary of 400,000 words. The zip file contains 4 flat files, containing 50, 100, 200 and 300 dimensional representations of these 400,000 vocabulary words.
In the code below, I use CountVectorizer with a given vocabulary size to generate the count vector from the text, then for each word in a document, get the corresponding GloVe embedding and add it into the document vector, multiplied by the count of the words. I then normalize the resulting document vector by the nuber of words. The resulting dense vector is then written out to file.
# Source: src/glove-sims.py from __future__ import division, print_function from sklearn.feature_extraction.text import CountVectorizer import collections import numpy as np import os import dsutils DATA_DIR = "../data" EMBEDDING_SIZE = 200 VOCAB_SIZE = 5000 GLOVE_VECS = os.path.join(DATA_DIR, "glove.6B.{:d}d.txt".format(EMBEDDING_SIZE)) VECTORS_FILE = os.path.join(DATA_DIR, "glove-{:d}-vecs.csv".format(EMBEDDING_SIZE)) texts = [] ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb") for line in ftext: docid, text = line.strip().split("\t") texts.append(text) ftext.close() # read glove vectors glove = collections.defaultdict(lambda: np.zeros((EMBEDDING_SIZE,))) fglove = open(GLOVE_VECS, "rb") for line in fglove: cols = line.strip().split() word = cols[0] embedding = np.array(cols[1:], dtype="float32") glove[word] = embedding fglove.close() # use CountVectorizer to compute vocabulary cvec = CountVectorizer(max_features=VOCAB_SIZE, stop_words="english", binary=True) C = cvec.fit_transform(texts) word2idx = cvec.vocabulary_ idx2word = {v:k for k, v in word2idx.items()} # compute document vectors. This is just the sum of embeddings for # individual words. Thus if a document contains the words "u u v" # then the document vector is 2*embedding(u) + embedding(v). X = np.zeros((C.shape[0], EMBEDDING_SIZE)) for i in range(C.shape[0]): row = C[i, :].toarray() wids = np.where(row > 0)[1] counts = row[:, wids][0] num_words = np.sum(counts) if num_words == 0: continue embeddings = np.zeros((wids.shape[0], EMBEDDING_SIZE)) for j in range(wids.shape[0]): wid = wids[j] embeddings[j, :] = counts[j] * glove[idx2word[wid]] X[i, :] = np.sum(embeddings, axis=0) / num_words dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)I tried various combinations of GloVe embedding dimension and vocabulary size. The best correlation numbers were 0.457 and 0.458 with GloVe dimension of 200 and a vocabulary size of 5,000 with stopword filtering, for non-binarized and binarized count vectors respectively. Larger GloVe dimensions and larger vocabulary sizes tended to perform better until 200d.
My final vectorizing approach was Word2Vec. Word2Vec achieves a similar semantic representation as GloVe, but it does so by training a model to predict a word given its neighbors. A binary word2vec model, trained on the Google News corpus of 3 billion words is available here , and gensim provides an API to read this binary model in python.
# Source: src/w2v_sims.py from __future__ import division, print_function from gensim.models.word2vec import Word2Vec from sklearn.feature_extraction.text import CountVectorizer import numpy as np import os import dsutils DATA_DIR = "../data" MAX_FEATURES = 300 VOCAB_SIZE = 5000 WORD2VEC_MODEL = os.path.join(DATA_DIR, "GoogleNews-vectors-negative300.bin.gz") VECTORS_FILE = os.path.join(DATA_DIR, "w2v-{:d}-vecs.csv".format(MAX_FEATURES)) texts = [] ftext = open(os.path.join(DATA_DIR, "text.tsv"), "rb") for line in ftext: docid, text = line.strip().split("\t") texts.append(text) ftext.close() # read word2vec vectors word2vec = Word2Vec.load_word2vec_format(WORD2VEC_MODEL, binary=True) # use CountVectorizer to compute vocabulary cvec = CountVectorizer(max_features=VOCAB_SIZE, stop_words="english", binary=True) C = cvec.fit_transform(texts) word2idx = cvec.vocabulary_ idx2word = {v:k for k, v in word2idx.items()} # compute document vectors. This is just the sum of embeddings for # individual words. Thus if a document contains the words "u u v" # then the document vector is 2*embedding(u) + embedding(v). X = np.zeros((C.shape[0], 300)) for i in range(C.shape[0]): row = C[i, :].toarray() wids = np.where(row > 0)[1] counts = row[:, wids][0] num_words = np.sum(counts) if num_words == 0: continue embeddings = np.zeros((wids.shape[0], MAX_FEATURES)) for j in range(wids.shape[0]): wid = wids[j] try: emb = word2vec[idx2word[wid]] embeddings[j, :] = counts[j] * emb except KeyError: continue X[i, :] = np.sum(embeddings, axis=0) / num_words dsutils.save_vectors(X, VECTORS_FILE, is_sparse=False)Since the word2vec model provides vectors of a single dimensionality (300), I tried a few variations of vocabulary size (with stopwords). I see that correlation rises from 0.429 to 0.534 as I increase the vocabulary size from 50 to 5000. Binarizing the text vector results in a drop to 0.522.
The chart below summarizes the spread of correlation numbers against the category tag similarity matrix for document similarity matrices produced by each of the different vectorizers. The top of the blue area represents the best result I got out of that vectorizer with some combination of hyperparameters and the bottom represents the worst result. Obviously, my tests were not that extensive, and its very likely that these vectorizers might yield better results with some other combination of hyperparameters. But it does give an indication of the relative merits of different vectorizers, which is what I was after. Based on this, it looks like TF-IDF is still the best approach for traditional vectorization and word2vec is the best approach for deep learning based vectorization (although I have seen cases where GloVe is clearly better).

So anyway, thats all I had for today. If you enjoyed this post and would like to work with the code, it can be found in my Github project reuters-docsim
. If you have ideas for other vectorization approaches for this corpus, do drop me a note or better still, a pull request with the vectorizer code.