Visualizing Tweet Vectors Using Python

I try to experiment with a lot of different technologies. I’ve found that having experience with a diverse set of concepts, languages, libraries, tools etc. leads to more robust thinking when trying to solve a problem. If you don’t know that something exists then you can’t use it when it would be helpful to do so! There are lots of ways to gain these experiences. One can find great content online for almost any topic imaginable. But I’ve found that the best way to understand a technology is to try to build something with it.

My latest target was a basket of different libraries in the Python ecosystem covering things like web development, caching, asynchronous messaging, and visualization. And since I’m a data scientist, I threw in a machine learning component just for fun. To explore these technologies, I created a semi-practical application that reads from the Twitter stream, parses tweets, and does some machine learning magic to score each tweet’s sentiment and project it onto a two-dimensional grid, where tweets with similar content appear closer to each other. It does all of this more or less in real time using asynchronous messaging.

The remainder of this blog post is devoted to showing how to build this from scratch. Just to be completely transparent about what it does and what technologies are involved, here are the components that I’ll demonstrate how to build:

- Basic web development in Python using Flask + standard front-end stuff (Bootstrap, jQuery, etc.)
- Asynchronous, chainable task queues using Celery and Redis
- Real-time, event-based communication between Flask and connected clients using Socket-IO
- Twitter stream filtering/parsing using Pattern
- Streaming real-time visualization using NVD3
- Sentiment analysis and word embeddings using Scikit-learn and Gensim (word2vec)

And here's a screenshot of what the finished app looks like.


[Screenshot: the finished "Visualizing Tweet Vectors" application]

Why are we doing this? Does this app have some tangible, real-world purpose? Probably not. But it’s fun and neat and you’ll hopefully learn a lot. If that sounds interesting then read on!

(NOTE: The completed application is located here. Feel free to reference it often, as I may leave out some details throughout the post.)

Setup

To get started, let’s talk about the setup required to run the application. In addition to a working Python interpreter, you’ll need to install the libraries that the application uses. You’ll also need to know how to perform a few tasks such as starting Redis, launching Celery, and running a Flask server. Fortunately these are all very easy to do. I wrote up detailed instructions in the project’s README file. Follow those steps and you should be up and running.

Sentiment & Word Vector Models

Now let’s build sentiment and word vector models to transform tweets. We’ll use this Twitter sentiment dataset as training data for both models. The first step is to read in the dataset and do some pre-processing using TF-IDF to convert each tweet to a bag-of-words representation.

print('Reading in data file...')
data = pd.read_csv(path + 'Sentiment Analysis Dataset.csv',
                   usecols=['Sentiment', 'SentimentText'],
                   error_bad_lines=False)

print('Pre-processing tweet text...')
corpus = data['SentimentText']
vectorizer = TfidfVectorizer(decode_error='replace', strip_accents='unicode',
                             stop_words='english', tokenizer=tokenize)
X = vectorizer.fit_transform(corpus.values)
y = data['Sentiment'].values

Note that we’re using a custom tokenizer designed to handle patterns common in tweets. I borrowed this from a script Christopher Potts wrote and adapted it slightly (final version is in the "scripts" folder). Next, we can train the sentiment classifier and word2vec model.
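Potts’ tokenizer is far more thorough than anything worth reproducing inline, but to give a feel for why tweets need special handling, here is a minimal sketch of the idea (my own simplified version, not the script used in the app): a single regex whose alternatives match emoticons, @-mentions, hashtags, and URLs as whole tokens before falling back to plain words.

```python
import re

# Order matters: emoticons and Twitter entities must match before plain words,
# otherwise a naive word pattern would split ":)" or "@user" apart.
TOKEN_RE = re.compile(r"""
    (?:[<>]?[:;=8][\-o\*']?[\)\]\(\[dDpP/\\:\}\{@\|])  # emoticons like :) ;-P
    |(?:@[\w_]+)                                       # @-mentions
    |(?:\#+[\w_]+[\w'\-]*[\w_]+)                       # hashtags
    |(?:https?://\S+)                                  # URLs
    |(?:[a-z][a-z'\-_]+[a-z])                          # words with ' or -
    |(?:[\w_]+)                                        # everything else word-like
    """, re.VERBOSE | re.IGNORECASE)

def simple_tweet_tokenize(text):
    """Tokenize a tweet, keeping emoticons, mentions, and hashtags intact."""
    return [token.lower() for token in TOKEN_RE.findall(text)]
```

For example, `simple_tweet_tokenize("@bob loves #python :)")` keeps the mention, hashtag, and emoticon as single tokens instead of shredding them on punctuation.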

print('Training sentiment classification model...')
classifier = MultinomialNB()
classifier.fit(X, y)

print('Training word2vec model...')
corpus = corpus.map(lambda x: tokenize(x))
word2vec = Word2Vec(corpus.tolist(), size=100, window=4, min_count=10, workers=4)
word2vec.init_sims(replace=True)

This should run pretty fast since the training data set is not that big. We now have a model that can read a tweet and classify its sentiment as either positive or negative, and another model that transforms the words in a tweet to 100-dimensional vectors. But we still need a way to use those 100-dimensional vectors to spatially plot a tweet on a 2-dimensional grid. To do that, we’re going to fit a PCA transform for the word vectors and keep only the first 2 principal components.

print('Fitting PCA transform...')
word_vectors = [word2vec[word] for word in word2vec.vocab]
pca = PCA(n_components=2)
pca.fit(word_vectors)

Finally, we’re going to save all of these artifacts to disk so we can call them later from the web application.

print('Saving artifacts to disk...')
joblib.dump(vectorizer, path + 'vectorizer.pkl')
joblib.dump(classifier, path + 'classifier.pkl')
joblib.dump(pca, path + 'pca.pkl')
word2vec.save(path + 'word2vec.pkl')

Web App Initialization

Now that we have all the models we need ready to go, we can get started on the meat of the application. First, some initialization. This code runs only once, when the Flask server is launched.

# Initialize and configure Flask
app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret'
app.config['CELERY_BROKER_URL'] = 'redis://localhost:6379/0'
app.config['CELERY_RESULT_BACKEND'] = 'redis://localhost:6379/0'
app.config['SOCKETIO_REDIS_URL'] = 'redis://localhost:6379/0'
app.config['BROKER_TRANSPORT'] = 'redis'
app.config['CELERY_ACCEPT_CONTENT'] = ['pickle']

# Initialize SocketIO
socketio = SocketIO(app, message_queue=app.config['SOCKETIO_REDIS_URL'])

# Initialize and configure Celery
celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)

There’s a bunch of stuff going on here so let’s break it down. We’ve created a variable called "app" that’s an instantiation of Flask, and set some configuration items to do things like tell it to use Redis as the broker (note that "config" is just a dictionary of key/value pairs which we can use for other settings not required by Flask). We also created a SocketIO instance, which is a class from the Flask-SocketIO integration library that basically wraps Flask with SocketIO support. Finally, we created our Celery app and updated its configuration settings to use the "config" dictionary we defined for Flask.

Next we need to load the models we created earlier into memory so they can be used by the application.

# Load transforms and models
vectorizer = joblib.load(path + 'vectorizer.pkl')
classifier = joblib.load(path + 'classifier.pkl')
pca = joblib.load(path + 'pca.pkl')
word2vec = Word2Vec.load(path + 'word2vec.pkl')

Finally, we’ll create some helper functions that use these models to classify the sentiment of a tweet and transform a tweet into 2D coordinates.

def classify_tweet(tweet):
    """Classify the sentiment of a tweet as positive (1) or negative (0)."""
    pred = classifier.predict(vectorizer.transform([tweet]))
    return str(pred[0])
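The second helper described above maps a tweet to 2D coordinates. One plausible sketch, assuming we average the tweet’s word vectors and project the mean through the fitted PCA (the function and argument names here are my own, not necessarily what the finished app uses):

```python
import numpy as np

def tweet_to_coords(tweet, word_vectors, pca, tokenize):
    """Map a tweet to (x, y) by averaging its word vectors and projecting with PCA.

    word_vectors: any mapping from word -> vector (e.g. a trained word2vec model)
    pca: a fitted PCA transform with 2 components
    tokenize: the same tokenizer used at training time
    """
    vecs = [word_vectors[w] for w in tokenize(tweet) if w in word_vectors]
    if not vecs:
        return (0.0, 0.0)  # no in-vocabulary words: place at the origin
    mean_vec = np.mean(vecs, axis=0)
    x, y = pca.transform([mean_vec])[0]
    return (float(x), float(y))
```

Averaging is a common, simple way to pool word vectors into a single tweet-level vector; tweets sharing vocabulary end up with similar means and therefore land near each other on the grid.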
