
Moving from R to Python: The Libraries You Need to Know

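Editor's note: R and Python have a lot in common, and both are favorites among data scientists. Whatever R can do with its rich collection of packages, Python has well-established libraries that can do the same. This article starts from the same problems and maps R packages to their Python counterparts. It is well worth reading, and even more worth putting into practice. Real knowledge comes from practice!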

Why the switch?

One of my favorite parts of machine learning in Python is that it got the benefit of observing the R community and then emulating the best parts of it. I'm a big believer that a language is only as helpful as its libraries. So in this post I'm going to go over some critical packages that I use almost every time I work in R, and their counterpart(s) in Python.

glm, knn, randomForest, e1071 (yes, this is actually a meaningful package's name) -> scikit-learn

One thing that is both a blessing and a curse in R is that machine learning algorithms are generally segmented by package. Instead of a single library (or a small set of libraries) implementing the common algorithms, each algorithm gets its own package. This is nice because you can find very esoteric, cutting-edge implementations, but it can be a pain for day-to-day use where you might be switching between algorithms. Python's scikit-learn solves this pain really well: it provides a common set of ML algorithms under the same API, which makes switching between LogisticRegression and GradientBoostingClassifier a one-liner.
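To make the "same API" point concrete, here is a minimal sketch. The synthetic dataset and the `fit_and_score` helper are assumptions made up for illustration and are not from the original post; the point is only that both estimators expose the same `fit` and `score` methods.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative (not from the original post).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def fit_and_score(model):
    """Hypothetical helper: fit any scikit-learn classifier, return test accuracy."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Swapping algorithms only changes the constructor; fit/score stay the same.
scores = {
    "logistic": fit_and_score(LogisticRegression(max_iter=1000)),
    "gbm": fit_and_score(GradientBoostingClassifier(random_state=0)),
}
print(scores)
```

In R the equivalent comparison would typically mean loading two packages with two different calling conventions.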

reshape/reshape2, plyr/dplyr -> pandas

This was actually the subject of one of our first posts. pandas took the best parts of data munging in R and turned them into a Python package. This includes its own implementation of a data frame, along with ways to modify and restructure it. Basically, it took the best parts of reshape/reshape2 and plyr/dplyr and Pythonified them!
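As a rough sketch of how that mapping looks in practice, the snippet below pairs a few reshape2/dplyr idioms with pandas calls. The toy data and column names are invented for illustration, and the R functions in the comments are loose analogues rather than exact equivalents.

```python
import pandas as pd

# Toy data, invented for illustration.
wide = pd.DataFrame({
    "city": ["Austin", "Boston"],
    "q1": [10, 20],
    "q2": [30, 40],
})

# reshape2::melt  ->  DataFrame.melt (wide to long)
long = wide.melt(id_vars="city", var_name="quarter", value_name="sales")

# dplyr group_by + summarise  ->  groupby + agg
totals = long.groupby("city", as_index=False).agg(total_sales=("sales", "sum"))

# reshape2::dcast  ->  DataFrame.pivot (long back to wide)
wide_again = long.pivot(index="city", columns="quarter", values="sales")

print(long)
print(totals)
print(wide_again)
```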

ggplot2 -> ggplot + seaborn + bokeh

One thing that R still does better than Python is plotting. Hands down, R is better in just about every facet. Even so, Python plotting has matured, though the community remains fractured. If you like the ggplot-style syntax, then look no further than Yhat's own ggplot. If you're after statistical and technical plots, then reach for seaborn. And if you're in the market for slick, great-looking interactive plots, then try out bokeh.

stringr -> nothing

String manipulation in "base R" is nearly as unintuitive as it is silly. Any time I'm working with strings in R I do 2 things (in order):

1. briefly nod in appreciation to New Zealand for producing Hadley Wickham
2. import stringr

Much obliged, New Zealand

stringr is an absolute lifesaver. It's well written, performant (at least I think so), and easy to install. Don't overlook this last item: if people can't install your software, there's no sense in making it.

OK, stringr appreciation monologue complete. The good news for you is that Python is so good at string manipulation that you don't really need a string library! It has a fantastic built-in regular expressions library, re, and a built-in string meta-library appropriately called string. So lucky for you, Python comes with all string-related batteries included!
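For readers coming from stringr, here is a small sketch of some rough standard-library analogues. The sample text is made up, and the stringr names in the comments are approximate counterparts, not exact equivalents. Note that much of the everyday work is done by methods on `str` itself, alongside `re`.

```python
import re

# Sample text, invented for illustration.
text = "  R and Python: 2 languages, 1 goal  "

# str_trim  ->  str.strip
trimmed = text.strip()

# str_to_upper  ->  str.upper
shouted = trimmed.upper()

# str_detect  ->  re.search
has_digit = re.search(r"\d", trimmed) is not None

# str_extract_all  ->  re.findall
numbers = re.findall(r"\d+", trimmed)

# str_replace_all  ->  re.sub
masked = re.sub(r"\d+", "N", trimmed)

# str_split  ->  str.split
parts = trimmed.split(": ")

print(trimmed, shouted, has_digit, numbers, masked, parts, sep="\n")
```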

RStudio -> Rodeo

To many users, RStudio is synonymous with R. And why not? It's a great IDE for data analysis in R. Historically speaking, there haven't been a lot of comparable options for Python. Of course, this is no longer the case. We released the very first version of Rodeo just over a year ago and released version 2.0 for Windows, OS X, and Linux about a month ago.

"Ever since we've used RStudio, we've been looking for an IDE like it for Python. We went through IDEs such as Sublime Text and Spyder, none of which suited our likings. We searched and found Rodeo and couldn't have been more pleased with the IDE." -Stephen Hsu, University of California, Berkeley

Knitr -> Jupyter

Knitr is a great way to create reproducible and highly visual analyses using R. It's been a staple in RStudio for a while now. In the Python world, the most analogous package is Jupyter. Jupyter notebooks provide an interactive environment for programming in Python (and other languages) that focuses on reproducibility and visualization. It even has a plugin for R!

sqldf -> pandasql

sqldf is a great way for SQL users to comfortably manipulate data in R. I myself used it when I first started learning R. Way back when, Yhat actually built a similar package for Python called pandasql. Same concept: write SQL queries against your data frames, get data frames back! Fast-forward 3 years and pandasql has over 256 stars on GitHub :). Not bad for a library with only 358 lines of code!

Original article: http://blog.yhat.com/posts/moving-from-r-to-python.html (thanks to the author for sharing)


Modification is prohibited. Reposting is permitted, provided you credit 数据人网 and link to the original article.
