
Introduction to Natural Language Processing with Python


In this talk, Jess Bowden introduces the area of NLP (Natural Language Processing) and gives a basic introduction to its principles. She uses Python and some of its fundamental NLP packages, such as NLTK, to illustrate examples and topics, demonstrating how to get started with processing and analysing natural language. She also looks at what NLP can be used for, gives a broad overview of its sub-topics, and shows how to get yourself started with a demo project.

This talk was part of AsyncJS (May event).

[00:00:08] Today, I'm going to be talking about natural language processing, specifically with Python, just a bit of an introduction. Yes. This is an overview of what I'll be talking about today, so I'm going to give a little introduction to who I am in case you don't know me. Then a bit of an introduction into what natural language processing is and what's going on, in case you don't know, and then why I think you should be using Python for NLP. Some people might disagree, that's fine. Then an introduction, like a crash course on what the syntax is in Python, just some things that are different that you might not know about. Sorry if you do. Then I'll be looking at preparing your data for building prototypes and how to load data in Python, and then a little look at how to explore and analyse data, things like tokenising, and then I'll be looking at a couple of little sentiment-based projects that you could hopefully play around with yourself. Then looking at some more advanced things, perhaps.

[00:01:17] Yes, like I said, I'm Jessica, I work at Brandwatch, Dan said that. I just started on the data science team and I've been there for about two years now, and that's my Twitter handle. Yes, natural language processing is a really, really broad topic. I'll be trying to cover some basic techniques today. It covers topics like machine translation, summarising blocks of text, like something that got big like Summly, which is a terrible name; spam detection, sentiment analysis. There are a couple more really big fields. I think Python is great, and it's the main language that I use for programming now. It's really readable, so it makes for really fast prototypes, and it's got really rich support for text analysis, strings and lists. There are loads of great NLP libraries available, like NLTK, spaCy and TextBlob, and there are also some really great parsing libraries. I've also just added a couple of tools I like using, if you want to have a look in your spare time.

[00:02:35] Now, I'm going to do a little bit of a crash course in case you're not familiar with Python. Sorry if you are. The first thing to know is that Python has no curly brackets for separating your blocks of code. I'm pretty sure they do in JavaScript, sorry, my JavaScript is awful. It's really dependent on white space for indenting and separating new lines. There are generally no semicolons, so whilst this block of text would still run, it's not very Pythonic and you should avoid it, because semicolons are actually used to separate multiple statements on the same line. You might use it like that, or if you're importing multiple modules on one line, but even then it's avoided. Strings are written like that, and you can format strings like that with curly braces and the format function. Pretty similar to other languages, I think. Then just a couple of notes on data structures. Lists, so the equivalent in JavaScript is arrays; you define them with square brackets and you iterate through them as follows. A nice thing to note is that strings in Python work a bit like lists: you can just iterate through them and slice them like that, which is a thing you'll probably see me using quite a lot. You've also got list comprehensions, which just allow you to perform some "if" conditionals on some data and return it into a list. This here is just getting all the ones which are even numbers in the range of zero to ten, and I thought I'd just show the equivalent as an actual for loop, which is a lot more lines.
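A rough sketch of the list, slicing and comprehension examples being described (the code on the slides isn't reproduced in the transcript, so the values here are assumed):

```python
# Assumed example values; the talk's slides are not shown in the transcript.
words = ["alice", "was", "beginning", "to", "get", "very", "tired"]

# Strings behave a lot like lists: you can iterate over them and slice them.
first_word = words[0]
print(first_word[:3])   # "ali"  -- slicing a string
print(words[1:3])       # ["was", "beginning"]  -- slicing a list

# List comprehension: even numbers in the range zero to ten.
evens = [n for n in range(10) if n % 2 == 0]

# The equivalent as an actual for loop -- a lot more lines.
evens_loop = []
for n in range(10):
    if n % 2 == 0:
        evens_loop.append(n)

print(evens == evens_loop)  # True
```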
[00:04:25] Dictionaries, really similar to JSON blobs, just key-value storage. Yes, they look really similar to JSON, and when you read in JSON in Python it often comes back as dictionaries anyway. You can access the values for keys like this and iterate through them as follows. It's also worth noting that this is called "unpacking variables", which I don't think you get in JavaScript. You can do that now? Okay. Well, I don't know about anything else, I don't know about any JavaScript. You can also do dictionary comprehensions in Python, which are pretty cool. You can also do set comprehensions. Here you're just iterating through the values in the dictionary we defined here and selecting all the ones where the first letter of the key begins with J, which is pretty nice. Then, last but not least, the data structure sets, which are really similar to lists but they're unordered and they've got no duplicates. On a last note, comparing values and comparing objects in Python is as follows: you compare values with double equals and objects with the keyword "is". In case you didn't know, the null keyword is just "None" in Python. I've also included a couple of links to coding style in case any of you are interested in making your code super Pythonic and really annoying with PEP 8.

[00:06:02] Yes, just a little intro to getting started with NLP in Python. In case you need to, that's how you open and read text files from a local file, and then like this for online files, in case you need to read an online text file to process. Then I'm going to do a little introduction to NLTK, which is a really popular NLP library in Python. It's quite old and it's not often updated now, but it's really great for educational purposes, which is why I'm introducing it here. It's got a free book included, it's got loads of open datasets that are free, you can just use it, it's great. The first thing I want to go over is tokenising. Tokenising is where you just split your document up into logical chunks, which are usually broken up by sentences, so if I wanted to tokenise the first line from Alice in Wonderland, it would end up looking like this if I used the default NLTK tokenizer. It just breaks it up; it looks at punctuation and spaces. The next thing is stemmers and lemmatizers; they basically just reduce words to their normalised form. "Am" would become "be" and "cars" would become "car". That's how you use a stemmer and this is how you use a lemmatizer in NLTK. They look like they do pretty much the same thing, but stemmers are a lot more naive and they don't analyse the text like a lemmatizer does. They're a lot faster, so they're fine if you just want to chunk your text and have it in a comparable format, but you're better off using lemmatizers if you want to cluster similar text in some way. You'll notice things like the "e" of "Alice" has just been chopped off, but that's not a plural just because it's got an "e" on the end. The same with "Lewis" or "Carroll" as well; that's really crap.

[00:08:20] Lemmatizing is just the same idea as stemming: it's reducing a word to its normal form. This one doesn't work so well because I haven't added in the part of speech that it is. That should come later, but it basically considers the context, and it doesn't just go through naively and chop off wherever it sees an "s".
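A minimal sketch of the NLTK tokenising, stemming and lemmatizing steps being described, assuming an Alice in Wonderland-style sentence as input (the exact slide code isn't reproduced in the transcript):

```python
# Requires NLTK plus its tokenizer and WordNet data on first run:
# nltk.download("punkt"); nltk.download("wordnet")
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = ("Alice was beginning to get very tired of sitting by her "
            "sister on the bank.")

# Tokenising: breaks the text up on punctuation and spaces.
tokens = word_tokenize(sentence)
print(tokens)

# Stemming: naive suffix chopping, e.g. "Alice" becomes "alic".
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# Lemmatizing: without a part-of-speech tag it assumes nouns, so a verb
# like "was" only becomes "be" if you pass pos="v".
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
print(lemmatizer.lemmatize("was", pos="v"))  # "be"
```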

Okay, so stemmers, do they do it towards the end or the beginning? Do they do it at the end as well?

No, just from the end.

Just from the end.

[00:08:54] Yes, so stemmers just chop off plurals, whereas lemmatizers consider the context. Okay, so now I'm going to look at exploring and analysing data. The first thing that's quite fun that you can do with NLTK is exploring frequency distributions, so we can try and find out which are the most informative tokens in our text. To do this we can just use the FreqDist class from NLTK, run it against our s
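A minimal sketch of the FreqDist step being described; the text and the simple whitespace tokenisation here are assumptions, not the corpus used in the talk:

```python
# FreqDist counts token frequencies so you can see the most common tokens.
from nltk import FreqDist

text = ("alice was beginning to get very tired of sitting by her sister "
        "on the bank and of having nothing to do")

tokens = text.split()
fdist = FreqDist(tokens)

print(fdist.most_common(5))  # [('to', 2), ('of', 2), ...] -- most frequent first
print(fdist["of"])           # count for a single token: 2
```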
