
Introduction to Natural Language Processing with Python


In this talk, Jess Bowden introduces the area of NLP (Natural Language Processing) and gives a basic introduction to its principles. She uses Python and some of its fundamental NLP packages, such as NLTK, to illustrate examples and topics, demonstrating how to get started with processing and analysing natural language. She also looks at what NLP can be used for, gives a broad overview of its sub-topics, and shows how to get yourself started with a demo project.

This talk was part of AsyncJS (May event).

[00:00:08] Today, I'm going to be talking about natural language processing, specifically with Python, just a bit of an introduction. Yes. This is an overview of what I'll be talking about today, so I'm going to give a little introduction to who I am in case you don't know me. Then a bit of an introduction into what natural language processing is and what's going on, in case you don't know, and then why I think you should be using Python for NLP. Some people might disagree, that's fine. Then an introduction, like a crash course on what the syntax is in Python, just some things that are different that you might not know about. Sorry if you do. Then I'll be looking at preparing your data for building prototypes and how to load data in Python, and then a little look at how to explore and analyse data, things like tokenising, and then I'll be looking at a couple of little sentiment-based projects that you could hopefully play around with yourself. Then looking at some more advanced things, perhaps.

[00:01:17] Yes, like I said, I'm Jessica, I work at Brandwatch, Dan said that. I just started on the data science team and I've been there for about two years now, and that's my Twitter handle. Yes, natural language processing is a really, really broad topic. I'll be trying to cover some basic techniques today. It covers topics like machine translation, summarising blocks of text, like something that got big like Summly, which is a terrible name; spam detection, sentiment analysis. There are a couple more really big fields. I think Python is great, and it's the main language that I use for programming now. It's really readable, so it makes for really fast prototypes, and it's got really rich support for text analysis, strings and lists. There are loads of great NLP libraries available, like NLTK, spaCy and TextBlob, and there are also some really great parsing libraries. I've also just added a couple of tools I like using, if you want to have a look in your spare time.

[00:02:35] Now, I'm going to do a little bit of a crash course in case you're not familiar with Python. Sorry if you are. The first thing to know is that Python has no curly brackets for separating your blocks of code. I'm pretty sure they do in JavaScript, sorry, my JavaScript is awful. It's really dependent on white space for indenting and separating new lines. There are generally no semicolons, so whilst this block of text would still run, it's not very Pythonic and you should avoid it, because semicolons are actually used to separate multiple statements on the same line. You might use it like that, or if you're importing multiple modules on one line, but even then it's avoided. Strings are written like that, and you can format strings like that with curly braces and the format function. Pretty similar to other languages, I think. Then just a couple of notes on data structures. Lists, so the equivalent in JavaScript is arrays; you define them with square brackets and you iterate through them as follows. A nice thing to note is that strings in Python work a bit like lists: you can just iterate through them and slice them like that, which is a thing you'll probably see me using quite a lot. You've also got list comprehensions, which just allow you to perform some "if" conditionals on some data and return it into a list. This here is just getting all the ones which are even numbers in the range of zero to ten, and I thought I'd just show the equivalent as an actual for loop, which is a lot more lines.
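A rough sketch of the list, slicing and comprehension examples being described (the code on the slides isn't reproduced in the transcript, so the values here are assumed):

```python
# Assumed example values; the talk's slides are not shown in the transcript.
words = ["alice", "was", "beginning", "to", "get", "very", "tired"]

# Strings behave a lot like lists: you can iterate over them and slice them.
first_word = words[0]
print(first_word[:3])   # "ali"  -- slicing a string
print(words[1:3])       # ["was", "beginning"]  -- slicing a list

# List comprehension: even numbers in the range zero to ten.
evens = [n for n in range(10) if n % 2 == 0]

# The equivalent as an actual for loop -- a lot more lines.
evens_loop = []
for n in range(10):
    if n % 2 == 0:
        evens_loop.append(n)

print(evens == evens_loop)  # True
```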
[00:04:25] Dictionaries, really similar to JSON blobs, just key-value storage. Yes, they look really similar to JSON, and when you read in JSON in Python it often comes back as dictionaries anyway. You can access the values for keys like this and iterate through them as follows. It's also worth noting that this is called "unpacking variables", which I don't think you get in JavaScript. You can do that now? Okay. Well, I don't know about anything else, I don't know about any JavaScript. You can also do dictionary comprehensions in Python, which are pretty cool. You can also do set comprehensions. Here you're just iterating through the values in the dictionary we defined here and selecting all the ones where the first letter of the key begins with J, which is pretty nice. Then, last but not least, the data structure sets, which are really similar to lists but they're unordered and they've got no duplicates. On a last note, comparing values and comparing objects in Python is as follows: you compare values with double equals and objects with the keyword "is". In case you didn't know, the null keyword is just "None" in Python. I've also included a couple of links to coding style in case any of you are interested in making your code super Pythonic and really annoying with PEP 8.

[00:06:02] Yes, just a little intro to getting started with NLP in Python. In case you need to, that's how you open and read text files from a local file, and then like this for online files, in case you need to read an online text file to process. Then I'm going to do a little introduction to NLTK, which is a really popular NLP library in Python. It's quite old and it's not often updated now, but it's really great for educational purposes, which is why I'm introducing it here. It's got a free book included, it's got loads of open datasets that are free, you can just use it, it's great. The first thing I want to go over is tokenising. Tokenising is where you just split your document up into logical chunks, which are usually broken up by sentences, so if I wanted to tokenise the first line from Alice in Wonderland, it would end up looking like this if I used the default NLTK tokenizer. It just breaks it up; it looks at punctuation and spaces. The next thing is stemmers and lemmatizers; they basically just reduce words to their normalised form. "Am" would become "be" and "cars" would become "car". That's how you use a stemmer and this is how you use a lemmatizer in NLTK. They look like they do pretty much the same thing, but stemmers are a lot more naive and they don't analyse the text like a lemmatizer does. They're a lot faster, so they're fine if you just want to chunk your text and have it in a comparable format, but you're better off using lemmatizers if you want to cluster similar text in some way. You'll notice things like the "e" of "Alice" has just been chopped off, but that's not a plural just because it's got an "e" on the end. The same with "Lewis" or "Carroll" as well; that's really crap.

[00:08:20] Lemmatizing is just the same idea as stemming: it's reducing a word to its normal form. This one doesn't work so well because I haven't added in the part of speech that it is. That should come later, but it basically considers the context, and it doesn't just go through naively and chop off wherever it sees an "s".
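A minimal sketch of the NLTK tokenising, stemming and lemmatizing steps being described, assuming an Alice in Wonderland-style sentence as input (the exact slide code isn't reproduced in the transcript):

```python
# Requires NLTK plus its tokenizer and WordNet data on first run:
# nltk.download("punkt"); nltk.download("wordnet")
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = ("Alice was beginning to get very tired of sitting by her "
            "sister on the bank.")

# Tokenising: breaks the text up on punctuation and spaces.
tokens = word_tokenize(sentence)
print(tokens)

# Stemming: naive suffix chopping, e.g. "Alice" becomes "alic".
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# Lemmatizing: without a part-of-speech tag it assumes nouns, so a verb
# like "was" only becomes "be" if you pass pos="v".
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
print(lemmatizer.lemmatize("was", pos="v"))  # "be"
```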

Okay, so stemmers, do they do it towards the end or the beginning? Do they do it at the end as well?

No, just from the end.

Just from the end.

[00:08:54] Yes, so stemmers just chop off plurals, whereas lemmatizers consider the context. Okay, so now I'm going to look at exploring and analysing data. The first thing that's quite fun that you can do with NLTK is exploring frequency distributions, so we can try and find out which are the most informative tokens in our text. To do this we can just use the FreqDist class from NLTK, run it against our s
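A minimal sketch of the FreqDist step being described; the text and the simple whitespace tokenisation here are assumptions, not the corpus used in the talk:

```python
# FreqDist counts token frequencies so you can see the most common tokens.
from nltk import FreqDist

text = ("alice was beginning to get very tired of sitting by her sister "
        "on the bank and of having nothing to do")

tokens = text.split()
fdist = FreqDist(tokens)

print(fdist.most_common(5))  # [('to', 2), ('of', 2), ...] -- most frequent first
print(fdist["of"])           # count for a single token: 2
```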
