
Higher-Order Functions


Higher-order functions are the counterpart to first-class functions: a higher-order function can take functions as arguments or return them as results.

The three classics: map, filter, and fold

Every programming language that supports functional-style programming provides at least the three functions map, filter, and fold. map applies a function to each element of its list. filter removes all elements of a list that do not satisfy a predicate. fold is the most powerful of the three: it successively applies a binary operation to the elements of the list and thereby reduces the list to a single value.

The name variations

The statement that every programming language supporting functional-style programming has to provide the three functions map, filter, and reduce comes with a small caveat: the names of the three functions vary between languages. The table compares the names of the Haskell functions with their counterparts in C++ and Python.


[Table: name variations — Haskell: map, filter, foldl; Python: map, filter, reduce; C++: std::transform, std::remove_if (with erase), std::accumulate]

I want to stress two points. First, Haskell has four variations of fold; they differ in whether the binary operation starts at the beginning or at the end of the list and whether it has an initial value. Second, concatenating the names map and reduce (fold) gives MapReduce. That is no accident: Google's framework is based on a map phase and a reduce phase, and is therefore based on the ideas behind the functions map and reduce (fold).



The easiest way to get a feeling for the functions is to use them. As input, I choose a list (vec) with the integers from 1 to 9 and a list (str) with the words "Programming", "in", "a", "functional", "style.". In the case of C++, the lists are a std::vector.

// Haskell
vec = [1..9]
str = ["Programming","in","a","functional","style."]
// Python
vec=range(1, 10)
str=["Programming", "in", "a", "functional", "style."]
// C++
std::vector<int> vec{1, 2, 3, 4, 5, 6, 7, 8, 9};
std::vector<std::string> str{"Programming", "in", "a", "functional", "style."};

For simplicity, I will display the results directly in Haskell's list syntax.

map

map applies a callable to each element of a list. A callable is anything that behaves like a function; in C++ it can be a function, a function object, or a lambda function. The best fit for higher-order functions is often a lambda, for two reasons. On the one hand, you can express your intent very concisely, so the code is easier to understand. On the other hand, the lambda defines its functionality exactly at the place where it is used. Because of that, the compiler gets maximum insight into the source code and has the maximum potential to optimise. That is the reason you will often get more performant executables with a lambda function.

// Haskell
map(\a -> a * a) vec
map(\a -> length a) str
// Python
map(lambda x: x * x, vec)
map(lambda x: len(x), str)
// C++
std::transform(vec.begin(), vec.end(), vec.begin(),
               [](int i){ return i*i; });
std::vector<std::size_t> vec2;   // receives the word lengths
std::transform(str.begin(), str.end(), std::back_inserter(vec2),
               [](const std::string& s){ return s.length(); });
// [1,4,9,16,25,36,49,64,81]
// [11,2,1,10,6]
Of course, the syntax of lambda functions differs between Haskell, Python, and C++. A lambda is introduced in Haskell by a backslash, \a -> a*a, in Python by the keyword lambda, lambda x: x*x, and in C++ by square brackets, [](int i){ return i*i; }. These are only syntactic differences. More interesting is the fact that Haskell invokes functions without parentheses, and that Haskell and Python generate a new list, while in C++ you can either modify the existing std::vector or fill a new one.
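A small practical note: the Python snippets above read like Python 2. In Python 3, map and filter return lazy iterators and reduce has moved into functools, so a sketch of the same calls looks like this (I use strs instead of str to avoid shadowing the built-in):

from functools import reduce   # reduce is no longer a builtin in Python 3

vec = range(1, 10)
strs = ["Programming", "in", "a", "functional", "style."]

list(map(lambda x: x * x, vec))     # [1, 4, 9, 16, 25, 36, 49, 64, 81]
list(map(len, strs))                # [11, 2, 1, 10, 6]
reduce(lambda a, b: a * b, vec, 1)  # 362880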

filter

filter keeps only those elements in the list that satisfy the predicate. A predicate is a callable that returns a boolean. Here is the example.

// Haskell
filter (\x -> x < 3 || x > 8) vec
filter (\x -> isUpper (head x)) str
// Python
filter(lambda x: x < 3 or x > 8, vec)
filter(lambda x: x[0].isupper(), str)
// C++
auto it = std::remove_if(vec.begin(), vec.end(),
    [](int i){ return !((i < 3) or (i > 8)); });
auto it2 = std::remove_if(str.begin(), str.end(),
    [](const std::string& s){ return !(std::isupper(s[0])); });
// [1,2,9]
// ["Programming"]

The function composition isUpper (head x) checks for each word whether it starts (head x) with a capital letter (isUpper).

Two quick remarks on std::remove_if. It removes no element; it only returns the new logical end of the list, so afterwards you have to apply the erase-remove idiom. Also, the logic of std::remove_if is the other way around: it removes the elements that satisfy the condition, which is why I have to negate the condition.

fold

fold is the most powerful of the three higher-order functions; you can implement map and filter by using fold (a small sketch of that follows at the end of this section). The code snippet shows the calculation of the factorial of 9 and string concatenation in Haskell, Python, and C++.

// Haskell
foldl (\a b -> a * b) 1 vec
foldl (\a b -> a ++ ":" ++ b ) "" str
//Python
reduce(lambda a , b: a * b, vec, 1)
reduce(lambda a, b: a + ":" + b, str, "")
// C++
std::accumulate(vec.begin(), vec.end(), 1,
[](int a, int b){ return a*b; });
std::accumulate(str.begin(), str.end(), std::string(""),
                [](std::string a, std::string b){ return a + ":" + b; });
// 362880
// ":Programming:in:a:functional:style."
Like its Python counterpart reduce and its C++ counterpart std::accumulate, foldl needs an initial value: 1 in the case of the factorial, and the empty string in the case of the string concatenation.
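To back up the claim that fold subsumes the other two, here is a minimal Python sketch of my own (not from the original article) that rebuilds map and filter on top of functools.reduce:

from functools import reduce

def map_via_fold(f, xs):
    # fold forward over an empty accumulator, appending f(x) for every element
    return reduce(lambda acc, x: acc + [f(x)], xs, [])

def filter_via_fold(pred, xs):
    # keep x only when the predicate holds
    return reduce(lambda acc, x: acc + [x] if pred(x) else acc, xs, [])

map_via_fold(lambda x: x * x, range(1, 10))              # [1, 4, 9, 16, 25, 36, 49, 64, 81]
filter_via_fold(lambda x: x < 3 or x > 8, range(1, 10))  # [1, 2, 9]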

Building Decision Tree Algorithm in Python with scikit learn


Decision tree algorithm in Python: implementation with scikit-learn

One of the cutest and most lovable supervised algorithms is the decision tree algorithm. It can be used for both classification and regression purposes.

In the previous article, on how the decision tree algorithm works, we gave enough of an introduction to the working aspects of the algorithm. In this article, we are going to build a decision tree classifier in Python using the scikit-learn machine learning package, for the balance scale dataset.

In short, this article explains how to implement a decision tree classifier on the balance scale data set. We will program our classifier in Python and use its sklearn library.

How we can implement Decision Tree classifier in Python with Scikit-learn

Decision tree algorithm prerequisites

Before you start building the decision tree classifier in Python, please gain enough knowledge of how the decision tree algorithm works. If you don't yet have a basic understanding of it, you can spend some time on the how-the-decision-tree-algorithm-works article.

Once we have finished modeling the decision tree classifier, we will use the trained model to predict whether the balance scale tips to the right, tips to the left, or is balanced. The great thing about sklearn is that it provides the functionality to implement machine learning algorithms in a few lines of code.

Before we get started, let's quickly look at the assumptions we make while creating the decision tree, and at the decision tree algorithm pseudocode.

Assumptions we make while using a decision tree: in the beginning, the whole training set is considered as the root. Feature values are preferred to be categorical; if values are continuous, they are discretized prior to building the model. Records are distributed recursively on the basis of attribute values. The order in which attributes are placed as the root or as internal nodes of the tree is determined using a statistical approach.

Decision tree algorithm pseudocode:

1. Place the best attribute of our dataset at the root of the tree.
2. Split the training set into subsets, such that each subset contains data with the same value for an attribute.
3. Repeat steps 1 and 2 on each subset until you find leaf nodes in all the branches of the tree.

While building our decision tree classifier, we can improve its accuracy by tuning it with different parameters. But this tuning should be done carefully: overdoing it can make the algorithm overfit our training data and ultimately produce a model that generalizes badly.

Sklearn Library Installation

Python's sklearn library holds tons of modules that help to build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, and supervised and unsupervised learning algorithms. It is similar to the Caret library in R.

To use it, we first need to install it. The best way to install data science libraries and their dependencies is to install the Anaconda distribution. You can also install only the most popular machine learning Python libraries.

The sklearn library gives us direct access to different modules for training our model with different machine learning algorithms, such as the K-nearest neighbor classifier, the support vector machine classifier, the decision tree, linear regression, and so on.

Balance Scale Data Set Description

The balance scale data set consists of 5 attributes: 4 feature attributes and 1 target attribute. We will try to build a classifier for predicting the Class attribute. The index of the target attribute is 1.

1. Class Name: 3 (L, B, R)

2. Left-Weight: 5 (1, 2, 3, 4, 5)

3. Left-Distance: 5 (1, 2, 3, 4, 5)

4. Right-Weight: 5 (1, 2, 3, 4, 5)

5. Right-Distance: 5 (1, 2, 3, 4, 5)

Index | Variable Name | Variable Values
1. | Class Name (target variable) | "R": balance scale tips to the right; "L": balance scale tips to the left; "B": balance scale is balanced
2. | Left-Weight | 1, 2, 3, 4, 5
3. | Left-Distance | 1, 2, 3, 4, 5
4. | Right-Weight | 1, 2, 3, 4, 5
5. | Right-Distance | 1, 2, 3, 4, 5

The above table shows all the details of data.

Balance Scale Problem Statement

The problem we are going to address is: model a classifier for evaluating the direction in which the balance tips.

Decision Tree classifier implementation in Python with the sklearn library

The modeled decision tree will compare the new records' metrics with the prior records (training data) that correctly classified the balance scale's tip direction.

Python packages used:

NumPy: a numeric Python module that provides fast mathematical functions and robust data structures for efficient computation of multi-dimensional arrays and matrices. We use NumPy to read data files into arrays and for data manipulation.

Pandas: provides the DataFrame object for data manipulation and for reading and writing data between different file formats. DataFrames can hold multidimensional-array data of different types.

Scikit-learn: a machine learning library that includes various machine learning algorithms. We use its train_test_split, DecisionTreeClassifier, and accuracy_score functions.
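As a preview of the workflow described above, here is a minimal sketch; the UCI download URL, the column names, and the 70/30 split are my assumptions for illustration, not something prescribed by the article:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# assumed location of the UCI balance-scale data; verify before running
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data"
data = pd.read_csv(url, names=["Class", "LW", "LD", "RW", "RD"])

X = data[["LW", "LD", "RW", "RD"]]   # the four feature attributes
y = data["Class"]                    # target: L, B, or R

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

clf = DecisionTreeClassifier(criterion="gini", random_state=100)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))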

If you haven't set up a machine learning environment on your system yet, the post below will be helpful.

Python Machine learning setup in ubuntu

Why Python Can Hold the Top Spot as the Language of the AI Era


Which language will become the number-one development language of the AI and big-data era? This should no longer be a question worth arguing about. Three years ago, Matlab, Scala, R, Java and Python each still had a chance and the picture was unclear; three years later the trend is unmistakable, especially after Facebook open-sourced PyTorch a few days ago. Python's position as the premier language of the AI era is essentially settled, and the only remaining suspense is who will take the second seat.



Still, there is some noise in the market. Recently a young woman interested in learning data science told me that a friend had advised her to start with Java, because Hadoop and other big-data infrastructure are written in Java. Coincidentally, a personal blog post published on IBM developerWorks last month ran some statistics on data from the job site Indeed. The article itself was fair and factual, but once it reached China some commentators twisted its message, claiming that Python's dominance is not yet established, that the outcome is still anyone's guess, and that learners should not blindly follow the crowd but keep hedging their bets across many languages.

Let me state my position clearly: for developers hoping to join the AI and big-data industry, putting your eggs in the Python basket is not only safe, it is necessary. Put differently, if you want a future in this field, don't overthink it; close your eyes and learn Python first. Of course Python has its problems and weaknesses, and you can and should pair it with one or several other languages, but there is no doubt that Python will hold the position of first language for data analysis and AI. I would even argue that, because Python has secured that position, because the industry will need a huge number of practitioners, and because Python is rapidly becoming the default teaching language for introductory programming courses in schools around the world, this open-source dynamic scripting language has a real chance of becoming, in the near future, the first true lingua franca of programming.

Debating the rise and fall of programming languages is usually dismissed by veterans as a flame-war topic. But I think Python's ascent this time is a big deal. Imagine that fifteen years from now every knowledge worker under forty, in any country, from doctors to structural engineers, from office secretaries to film directors, from composers to salespeople, can use the same programming language to do basic data processing, call AI APIs in the cloud, drive intelligent robots, and exchange ideas with one another. The significance of such a universal network of programming collaboration would far outweigh any language war. Right now, Python looks most likely to play that role.

Python's victory is surprising, because its shortcomings are obvious. Its syntax is a school of its own and feels alien to many veterans; "bare" Python is slow, roughly tens to thousands of times slower than C depending on the task; because of the Global Interpreter Lock (GIL), a single Python process cannot execute in parallel across multiple cores; Python 2 and Python 3 have coexisted for years, forcing many modules to maintain two versions and creating needless confusion and friction for developers; and because it is not controlled by any single company, no technology giant has ever thrown its full weight behind Python, so relative to how widely it is used, its core infrastructure has received remarkably little investment and support. Even today, at 26 years of age, Python still has no official, standard JIT compiler; Java, by contrast, got a standard JIT within three years of its release.

Another episode is even more telling. The core GIL code was written in 1992 by the language's creator, Guido van Rossum, and for the following eighteen years not a single byte of that crucial code was changed by anyone. Eighteen years! Only in 2010 did Antoine Pitrou make the first improvement to the GIL in nearly two decades, and even that applies only to Python 3.x. In other words, most developers using Python 2.7 today are still constrained, in every program they write, by a piece of code from 26 years ago.

Speaking of Python's shortcomings, a small anecdote of my own comes to mind. Years ago I wrote an article declaring that I was bullish on Python and bearish on Ruby. About two years ago a netizen tracked me down on Weibo and berated me: because he had read that article back then, he had been led astray, devoted himself to Python, and kept Ruby at arm's length. He did become proficient in Python, but when he finally learned Ruby recently he found it so beautiful and so sweet that he was overjoyed, and then indignantly realized I had misled him and made him miss the most beautiful programming language during the most beautiful years of his life. I did not argue with him, and I don't know whether he has since retrained from Python backend, big-data analysis, machine learning and AI engineering into a Rails rapid-development expert. I only felt that truly recognizing the value of something is not easy.

Python is a racing driver who charged into the leading pack carrying all sorts of defects. Even a few years ago few people believed it had a shot at the crown; many thought Java's position was unshakable, and others said everything would be rewritten in JavaScript. But look again today: Python is already the first language of data analysis and AI, the first language of hackers in network offense and defense, is becoming the first language of introductory programming education, and the first language of cloud system administration. Python has also long been one of the mainstream languages for web development, game scripting, computer vision, IoT management and robotics, and given the growth in Python users we can expect, it has a chance to reach the top in several more fields.

And don't forget that the vast majority of future Python users will not be professional programmers, but the people who today use Excel, PowerPoint, SAS, Matlab and video editors. Take AI: first ask where the bulk of the AI workforce will be. Looked at statically, today, you might answer that the core of AI is the AI scientists in research institutes, the machine-learning PhDs and the algorithm experts. But Kai-Fu Lee's "three stages of the AI dividend", which I mentioned last time, makes clear that if you look just three to five years ahead, the AI industry's workforce will form a huge pyramid: those AI scientists are only the tiny tip, while 95% or more of AI practitioners will be AI engineers, application engineers and users of AI tools. I believe almost all of them will be swept up by Python and become its vast reserve army. These potential Python users are still outside the tech world today, but as AI applications spread, millions of teachers, office workers, engineers, translators, editors, doctors, salespeople, managers and civil servants will pour into the Python and AI tide, carrying the domain knowledge and data resources of their own fields, and profoundly reshape the whole landscape of IT, or rather DT (data technology).

So why has Python been able to come from behind?

Speaking in generalities, I could list Python's virtues: a clean, elegant language design, programmer friendliness, high development productivity. But I don't think that is the root cause, because some other languages do not fare badly on those counts either.

Others argue that Python's advantage lies in its rich resources: solid numerical, charting and data-processing infrastructure and an excellent ecosystem that has attracted scientists and domain experts in droves, so the snowball keeps growing. But I think that reverses cause and effect. Why was Python, of all languages, able to attract people and build such good infrastructure? Why doesn't PHP, "the best language in the world", have libraries on the level of NumPy, NLTK, scikit-learn, pandas and PyTorch? Why did JavaScript's explosive growth produce libraries of wildly uneven quality and a general mess, while Python's libraries are both abundant and orderly, maintained at a consistently high standard?

In my view the fundamental reason comes down to one thing: among the mainstream languages, Python is the only one with a clear strategic positioning that it has held onto without wavering. Too many other languages keep eroding and blurring their own strategic positioning with unprincipled tactical busyness, and end up the worse for it.

What is Python's strategic positioning? It is simple: to be a simple, easy-to-use yet professional and rigorous general-purpose composition language, or glue language, one that ordinary people can pick up easily and use to wire together all kinds of basic program components and make them work in concert.

Because it sticks to this positioning, Python has always put the beauty and consistency of the language itself ahead of clever tricks, developer efficiency ahead of CPU efficiency, and the ability to expand horizontally ahead of the ability to dive vertically. Holding these strategic choices for the long term has given Python an ecosystem other languages can only envy.

For example, anyone willing to learn can master the basics of Python in a few days and then get a great many things done; that return on investment is probably unmatched by any other language. Or consider speed: precisely because Python itself is slow, heavily used core libraries are written largely in C, with the result that real programs built on Python run very fast, because quite possibly more than 80% of the time the system is executing code written in C. Had Python instead insisted on competing on raw speed, it might have made the bare interpreter a few times faster, but then nobody would have had an incentive to write C modules for it; the final speed would have fallen far short of the mixed model, the language itself would likely have grown more complex, and the result would have been a slow and ugly language.

More importantly, Python is excellent at wrapping, composing and embedding: all sorts of complexity can be packaged inside a Python module and exposed through a beautiful interface. Often a library is itself written in C/C++, yet calling it directly from C or C++, from environment setup to the API, is painful, whereas going through its Python wrapper one level up is cleaner, faster and prettier. In the AI domain these traits become Python's great strength, and riding AI and data science, Python has climbed to the top of the programming-language food chain. With Python and AI bound together, e-commerce, search engines, social networks and smart hardware all become merely the data cows, electronic nerves and execution tools downstream in the ecosystem, all taking orders from above.

People unfamiliar with the history of programming languages may feel that Python's strategic positioning is cynical and unambitious. But the record shows that being simple yet rigorous, easy to use yet professional, is very hard, and holding the line as a glue language is harder still.

Some languages were designed from the start for academic rather than practical purposes, with learning curves so steep that ordinary people cannot approach them. Some languages depend too heavily on the commercial backing of a corporate patron: glorious while the money flows, struggling to survive once they fall out of favor. Some languages were designed with a specific scenario in mind, whether massive concurrency, matrix computation, or web template rendering, and become awkward the moment they leave that scenario. More languages still, having tasted a little success, cannot wait to become all-round champions, stretching their tentacles frantically in every direction, especially overeager to add expressive power and squeeze out performance, even at the cost of mutilating the core language until it becomes a behemoth nobody can control. By contrast, Python is a model of success in modern language design and evolution.

The reason Python is so clear about its strategic positioning and so steadfast in keeping it is, at bottom, that its community has built an exemplary mechanism for decision-making and governance. With Guido van Rossum (the BDFL, as Pythonistas know), David Beazley, Raymond Hettinger and others at its core and the PEP process as its organizational platform, it is democratic yet orderly, centralized yet open-minded. As long as this mechanism holds, Python will keep rising steadily for the foreseeable future.

The most likely challenger to Python is, of course, Java. Java has a huge installed base of users and is itself a language with a clear and firm strategic positioning. But I don't think Java has much of a chance, because it is fundamentally designed for building large, complex systems. What is a large complex system? One that is explicitly described and constructed by humans, whose scale and complexity are exogenous, imposed from outside. The essence of AI, however, is a self-learning, self-organizing system whose scale and complexity are grown by a mathematical model fed on data; they are endogenous. Most of Java's language machinery therefore gets little traction on big-data processing and AI development: where Java is strong is not needed here, and what is needed here feels awkward in Java. Python's concise power in data handling, by contrast, has long been common knowledge. Compare a Java and a Python machine-learning program with identical functionality and any normal person can tell at a glance that the Python program is the cleaner and more pleasant one.

Around 2003 or 2004 I bought a Python book by a Brazilian author. He said he had chosen Python resolutely because as a child he often dreamed that the future world would be ruled by a giant python. At the time I pitied the poor fellow for having such terrifying dreams. Looking back today, perhaps he was simply like Mr. Anderson, the programmer in The Matrix, who accidentally crossed over into the future and glimpsed the truth of the world.

This article is republished from the WeChat public account AI100 (rgznai100).

End.

Pyston 0.6.1 Released: a High-Performance Python JIT


Pyston 0.6.1 has been released. Pyston is a JIT-based implementation of Python 2.7 from Dropbox. Pyston parses Python code and lowers it to LLVM's intermediate representation (IR); the IR then passes through LLVM's optimizer and runs on the LLVM JIT engine, so what ultimately executes is machine code.

This release builds on 0.6 with a series of performance improvements that make Pyston 95% faster than CPython on standard benchmarks. It is also the last release sponsored by Dropbox, for the following reasons:

Too much time was being spent on compatibility and memory usage

Dropbox increasingly writes performance-sensitive code in other languages, such as Go

The project's costs kept growing while the benefits did not

Even so, the project's source code remains open source and available.

Download:

Million requests per second with Python


Is it possible? Probably not until recently. Many large companies have been investigating migrating to other programming languages to boost their operating performance and save on server costs, but there's really no need. Python can be the right tool for the job, and there is a lot of work on performance happening in the community. CPython 3.6 boosted overall interpreter performance with a new dictionary implementation, and CPython 3.7 is going to be even faster thanks to a faster calling convention and dictionary lookup caches. For number-crunching tasks you can use PyPy with its just-in-time code compilation. Recently it gained the ability to run the NumPy test suite and improved overall compatibility with C extensions drastically. Later this year PyPy is expected to reach Python 3.5 conformance.

All this great work inspired me to innovate in one of the areas which Python is used extensively, web and micro-services development.

Enter Japronto!

Japronto is a brand new micro-framework tailored for your micro-services needs. Its main goals include being fast, scalable and lightweight. It lets you do synchronous and asynchronous programming with asyncio, and it's shamelessly fast. Even faster than NodeJS and Go.
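For orientation, the kind of hello-world used in such a benchmark looks roughly like the snippet below. I am reproducing the shape of Japronto's README example from memory, so treat the exact class and method names as assumptions rather than gospel:

from japronto import Application

def hello(request):
    # the handler receives a request object and returns a Response built from it
    return request.Response(text='Hello world!')

app = Application()
app.router.add_route('/', hello)
app.run()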


Figure: Python micro-frameworks (blue), the dark side of the force (green) and Japronto (purple)

This micro benchmark was done using a "Hello world!" application, but it clearly demonstrates server-framework overhead for a number of solutions. These results were obtained on an AWS c4.2xlarge instance with 8 VCPUs, launched in the São Paulo region with default shared tenancy, HVM virtualization and magnetic storage. The machine was running Ubuntu 16.04.1 LTS (Xenial Xerus) with a Linux 4.4.0-53-generic x86_64 kernel, and the OS reported a Xeon E5-2666 v3 @ 2.90GHz CPU. I used Python 3.6, freshly compiled from source. To be fair, all the contestants (including Go) were running a single worker process. Servers were load tested using wrk with 1 thread, 100 connections and 24 simultaneous (pipelined) requests per connection (a cumulative parallelism of 2400 requests).


Figure: HTTP pipelining (image credit: Wikipedia)

HTTP pipelining is crucial here, since it's one of the optimizations that Japronto takes into account when executing requests. Most servers execute requests from pipelining clients in the same fashion they would for non-pipelining clients and don't try to optimize for it (in fact Sanic and Meinheld also silently drop requests from pipelining clients, which is a violation of the HTTP 1.1 protocol). In simple words, pipelining is a technique in which the client doesn't need to wait for the response before sending subsequent requests over the same TCP connection. To ensure integrity of the communication, the server sends back the responses in the same order the requests were received.
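To make pipelining concrete, here is a small, framework-agnostic Python sketch of a client that writes two GET requests over one TCP connection before reading anything back; the host and port are placeholders for a local test server, and a real load generator such as wrk does this at far higher volume:

import socket

HOST, PORT = "127.0.0.1", 8080   # assumed local test server

request = (b"GET / HTTP/1.1\r\n"
           b"Host: localhost\r\n"
           b"\r\n")

with socket.create_connection((HOST, PORT)) as sock:
    # both requests leave in a single send(); a pipelining-aware server can
    # parse them from one read and must answer them in order
    sock.sendall(request * 2)
    print(sock.recv(65536).decode(errors="replace"))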

The gory details of optimizations

When many small GET requests are pipelined together by the client, there is a great chance they are going to arrive in one TCP packet (thanks to Nagle's algorithm) on the server side and be read back by one system call. Doing a system call and moving data from kernel space to user space is a very expensive operation compared to, say, moving memory inside process space. That's why doing as few system calls as possible (but no fewer) is important. When Japronto receives data and successfully parses several requests out of it, it tries to execute all of them as fast as possible, glue the responses back together in the correct order, and write them back in one system call. In fact the kernel can aid with the gluing part, thanks to scatter/gather IO system calls, which Japronto doesn't use yet. Beware that all this is not always possible, since some of the requests could take too long and waiting for them would needlessly increase latency. Care needs to be taken when tuning heuristics that weigh the cost of system calls against the expected request completion time.


Figure: Japronto gives a 1,214,440 RPS median of grouped continuous data, calculated as the 50th percentile using interpolation.

Besides delaying writes for pipelined clients there are several other techniques employed in the code. Japronto is written almost entirely in C. The parser, protocol, connection reaper, router, request and response objects are written as C extensions. Japronto tries hard to delay creation of Python counterparts of its internal structures until asked explicitly. For example headers dictionary won’t be created until requested in a view. All the token boundaries are already marked before but normalization of header keys and creation of several str objects is done when accessed for the first time.

Japronto relies on the excellent picohttpparser C library for parsing status line, headers and chunked HTTP message body. Picohttpparser directly employs text processing instructions found in modern CPUs with SSE4.2 extensions (almost any 10 year old x86_64 CPU has it) to quickly match boundaries of HTTP tokens. The I/O is handled by the super awesome uvloop, which itself is a wrapper around libuv. At the lowest level this is a bridge to epoll system call providing asynchronous notifications on read-write readiness.


Figure: Picohttpparser relies on SSE4.2 and the CMPESTRI x86_64 intrinsic to do parsing

Python is a garbage collected language and care needs to be taken when designing high performance systems not to needlessly increase pressure on the GC. The internal design of Japronto tries to avoid reference cycles and do as little allocations/deallocations as possible. It does so by preallocating some objects in so called arenas. It also tries to reuse Python objects for future requests if they are no longer referenced instead of throwing them away.

All allocations are done as multiples of 4KB, and internal structures are laid out carefully so that data frequently used together sits close in memory, minimizing the possibility of cache misses. Japronto tries not to copy between buffers unnecessarily and does many operations in place; for example, percent-decoding of the path happens in place before matching in the router.

Call for help

I've been working on Japronto continuously for the last 3 months, often during weekends as well as regular working days. This was only possible thanks to taking a break from my regular programming job and putting all my effort into this project. I think it's time to share the fruit of my work with the community.

Currently

Dropbox pulls the plug on faster Python project



Pyston, Dropbox's project to create a faster Python runtime along the lines of just-in-time compiling systems like PyPy, will no longer be sponsored by Dropbox after its latest release.

Version 0.6.1 brings Pyston's performance up to almost twice that of CPython, the standard-issue Python interpreter. "On web-workload benchmarks that we created, we are 48% faster," wrote Dropbox's Kevin Modzelewski on the Pyston blog. "On Dropbox’s server, we are 10% faster."

Despite these improvements, Dropbox decided further work on Pyston wouldn't be worth it from a cost-benefit point of view.

Compatibility was one of the biggest stumbling blocks. The blog post noted that it took "more time than we expected" to make Pyston compatible with existing Python code.

For context, PyPy has achieved a high level of compatibility with existing Python code, but only after years of work, and C extensions for Python -- a major part of the Python ecosystem -- were a major obstacle. Pyston also lacked Python 3 compatibility; PyPy provides it, but only through a different edition of the program that supports up to Python 3.3.

Dropbox also said it's moving away from using Python for performance-sensitive code in the first place, opting for "other languages such as Go" instead.

Dropbox's need to justify Pyston as a business investment seems to have been a big part of why it elected to find other ways to speed up its services. PyPy, on the other hand, is a relatively independent project that doesn't have any one company's fortunes tied to it (or vice versa) and isn't on any specific timetable to deliver results.

The innovations Pyston brought to the table aren't going to evaporate once Dropbox withdraws from the project. For one, Pyston is open source, so another team could pick up where Dropbox left off. Also, many of the innovations Dropbox brought to Pyston could in theory be donated back to mainline Python development: "We are also looking into upstreaming parts of our code back to CPython, since our code is now based on theirs," Modzelewski wrote.

Dropbox doesn't believe the speedups obtained from its work on Pyston are limited to what it was able to achieve. "[T]he 10% speedup on Dropbox code is just a small fraction of what we think is possible with our approach," Modzelewski wrote, but "[we] have not had time to optimize this particular workload."

There's no question efforts will continue to make Python performant through projects like PyPy, Cython, and now maybe Pyston in the hands of another team. But Python's convenience and ecosystem, rather than its raw performance, will remain the biggest part of its value as a language for now.

Pythonnet brings Python to Microsoft .Net



The pythonnet package gives Python developers interoperability between Microsoft's .Net Common Language Runtime and the CPython implementation of the language.

Also known as Python for .Net, the package lets developers script .Net applications or build entire applications in Python, using .Net services and components built in any language targeting the CLR. It also provides an application scripting tool and enables Python code to be embedded into a .Net application. But there are limitations.

"Note that this package does not implement Python as a first-class CLR language -- it does not produce managed code (IL) from Python code," the GitHub description notes . "Rather, it is an integration of the CPython engine with the .Net or Mono runtime."

Developers can thus use CLR services, existing Python code, and C-based extensions while keeping native execution speed for Python code. The Pythonnet team is working on CLR support and wants Pythonnet to behave as you would expect in Python, except in .Net-specific cases, where the intent is to behave as developers would expect in C#.
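In day-to-day use the interop is compact; a minimal sketch, assuming pythonnet is installed (pip install pythonnet) and a CLR or Mono runtime is available on the machine:

import clr                   # provided by the pythonnet package

from System import DateTime  # after importing clr, .Net namespaces import like modules

now = DateTime.Now           # call straight into the CLR from CPython
print(now.ToString())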

On Windows, Pythonnet supports version 4.0 of the .Net CLR; it also works with Mono, the open source, cross-platform .Net framework, on Linux and MacOS. For a pure managed-code implementation of Python, the Pythonnet builders recommend IronPython, an open source version of Python integrated with the .Net Framework.

Pythonnet is another example of the growing popularity of Python, which has seen a boost from its usage in artificial intelligence applications and has been lauded for ease of use. Google, with its recent Grumpy project, began bridging Python to the search giant's own Go language.

How to Scrape Autohome's Car Model Database


Actually, I already described how to scrape Autohome's car model database in the earlier post "Analyzing APIs with mitmproxy", but that article scraped the data through API endpoints. Generally speaking, because APIs rarely change, they are more stable than web pages and are usually the best choice for scraping. API scraping has drawbacks, though: some data has no API at all, and even when an API exists the data may be in an encrypted format, in which case you have to scrape the web pages.

Since we are scraping web pages, we have to mention Scrapy, arguably the king of crawlers. I once heard of someone using Scrapy to crawl the entire Taobao product catalog in a few days on limited hardware, so Scrapy is more than enough for scraping Autohome's model database.

Before scraping, we should get a rough idea of the site's structure. According to its encyclopedia entry it has roughly four levels: brand, manufacturer, series and model. This article focuses on the series and model levels. We also need to decide which page to start crawling from; two good candidates are the product library and the browse-by-brand page, and either works. I chose browse-by-brand, but since that page loads data alphabetically via JS, using it directly could cause unnecessary trouble; fortunately we can use the per-letter pages from A to Z directly.

Assuming you already have a working Scrapy environment (note: the code in this article targets Python 3):

shell> scrapy startproject autohome
shell> cd autohome
shell> scrapy genspider automobile www.autohome.com.cn -t crawl

This generates a basic spider skeleton. Scrapy has two kinds of spiders, spider and crawl: spider is for simple scraping, while crawl supports more complex scraping. Complex in what way? Mainly that the spider can extract the links it needs according to rules and follow them level by level automatically. For scraping Autohome's model database a plain spider would do, but since crawl is more powerful, this article uses crawl. Its workflow is roughly: set the starting pages via start_urls, declare which links to process via rules, and when a matching link is hit the corresponding callback fires; inside the callback you pick out data with XPath/CSS selectors and load items through an item loader:


[Figure: car series (车系) page]

[Figure: car model (车型) page]

File: autohome/items.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst

class SeriesItem(scrapy.Item):
    series_id = scrapy.Field(
        input_processor=MapCompose(lambda v: v.strip("/")),
        output_processor=TakeFirst()
    )
    series_name = scrapy.Field(output_processor=TakeFirst())

class ModelItem(scrapy.Item):
    model_id = scrapy.Field(
        input_processor=MapCompose(lambda v: v[6:v.find("#")-1]),
        output_processor=TakeFirst()
    )
    model_name = scrapy.Field(output_processor=TakeFirst())
    series_id = scrapy.Field(output_processor=TakeFirst())

File: autohome/autohome/spiders/automobile.py:

# -*- coding: utf-8 -*-
import json
import string
from scrapy import Request
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import parse_qs, urlencode, urlparse
from autohome.items import ModelItem, SeriesItem

class AutomobileSpider(CrawlSpider):
    name = "automobile"
    allowed_domains = ["www.autohome.com.cn"]
    start_urls = [
        "http://www.autohome.com.cn/grade/carhtml/" + x + ".html"
        for x in string.ascii_uppercase if x not in "EIUV"
    ]
    rules = (
        Rule(LinkExtractor(allow=("/\d+/#",)), callback="parse_item"),
    )

    def parse(self, response):
        params = {
            "url": response.url,
            "status": response.status,
            "headers": response.headers,
            "body": response.body,
        }
        response = HtmlResponse(**params)
        return super().parse(response)

    def parse_item(self, response):
        sel = response.css("div.path")
        loader = ItemLoader(item=SeriesItem(), selector=sel)
        loader.add_css("series_id", "a:last-child::attr(href)")
        loader.add_css("series_name", "a:last-child::text")
        series = loader.load_item()
        # upcoming & currently on sale
        for sel in response.css("div.interval01-list-cars-infor"):
            loader = ItemLoader(item=ModelItem(), selector=sel)
            loader.add_css("model_id", "a::attr(href)")
            loader.add_css("model_name", "a::text")
            loader.add_value("series_id", series['series_id'])
            yield loader.load_item()
        # discontinued
        url = "http://www.autohome.com.cn/ashx/series_allspec.ashx"
        years = response.css(".dropdown-content a::attr(data)")
        for year in years.extract():
            qs = {
                "y": year,
                "s": series["series_id"]
            }
            yield Request(url + "?" + urlencode(qs), self.stop_sale)

    def stop_sale(self, response):
        data = parse_qs(urlparse(response.url).query)
        body = json.loads(response.body_as_unicode())
        for spec in body["Spec"]:
            yield {
                "model_id": str(spec["Id"]),
                "model_name": str(spec["Name"]),
                "series_id": str(data["s"][0]),
            }

Copy the two code listings above into the corresponding files, and we can let the spider crawl:

shell> scrapy crawl automobile -o autohome.csv

The scraped results are saved to autohome.csv. If you save to a JSON file instead, you may find the output is all Unicode escapes; set FEED_EXPORT_ENCODING to fix that. If you want to save to a database, use a Scrapy pipeline.
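For the encoding part, a couple of lines in settings.py are enough; the pipeline below is only a skeleton to show where database writes would go (the class name is my own placeholder):

# autohome/settings.py (excerpt)
FEED_EXPORT_ENCODING = "utf-8"   # emit readable UTF-8 instead of \uXXXX escapes
ITEM_PIPELINES = {"autohome.pipelines.SavePipeline": 300}

# autohome/pipelines.py (skeleton)
class SavePipeline:
    def process_item(self, item, spider):
        # write the item to the database of your choice here
        return item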

If you have read the Scrapy documentation in full, you may remember this passage in the spiders chapter:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

In other words, when using a crawl spider you should avoid overriding the parse method; yet the code above does exactly that. The reason is that Autohome's letter pages are not standards-compliant:

shell> curl -I http://www.autohome.com.cn/grade/carhtml/A.html
HTTP/1.1 200 OK
Date: ...
Server: ...
Content-Type: text/html, text/html; charset=gb2312
Content-Length: ...
Last-Modified: ...
Accept-Ranges: ...
X-IP: ...
Powerd-By-Scs: ...
X-Cache: ...
X-Via: ...
Connection: ...

At first glance nothing looks wrong, but look closely and you will see that text/html appears twice in the Content-Type header. This causes Scrapy to fail when deciding whether the page is an HTML page. To work around it I overrode parse and re-wrapped what would otherwise be a TextResponse object as an HtmlResponse. The scraping even uncovered a bug for Autohome; they really ought to thank me.

Sometimes, to avoid the spider being blocked, you need to fake the User-Agent or even disguise your IP through proxy services; space does not allow me to go into that here. In fact, Scrapy is not just a library but a platform, and this article only scratches the surface. Interested readers should read the official documentation; there are also plenty of examples online for reference.


Import Python: The Hacker's Guide To Python - Book Review and Interview With The ...

Book Review

I first heard about The Hacker's Guide to Python at PyCon 2014. I bought the ebook version, converted it to my favourite format (.dead_tree) and read it.


I mentally classify self-published Python books into two categories.

Category One: written by authors who have gained insights from shipping Python software in production or working on complex software products for years, and then turned those lessons into a book.

Category Two: written by authors whose background is mainly Python training, with little experience of working on complex or large software used by real people. These books rely on covering APIs and modules, remixing and improvising content found on the internet, or chapters of code building toy examples.

There are very few books that belong to Category One, and this is one of them.

Here is a quick review / tour of what I learned from the book.

Chapter Two gives you a listing of the standard modules that matter most in day-to-day Python programming, each with a single-line description: helpful for those who have just learned Python. The chapter shares an interesting "tale from the trenches" on how and why OpenStack migrated from SQLAlchemy to Alembic, and then a checklist of sorts for evaluating external libraries. Insights like these put the book in Category One.

Chapter Three is on documentation. It shows how to get started with Sphinx and reST for your documentation.

Chapters Four and Five talk about distributing Python software and about virtual environments. Chapter Four starts with distutils and a trivial setup.py example showing how it's used, followed by pbr and how it lets you write your setup.py. pbr is pretty cool since it allows automatic dependency installation based on requirements.txt, documentation generation using Sphinx, and more. The chapter also shows how to share your project on PyPI; if you are looking to put an open source project out there, Chapter Four is for you. Chapter Five is a simple introduction to virtualenv.

Chapter Six is on unit testing and covers unit testing best practices from a Python standpoint. It touches upon everything from fixtures to mocking. It was from this book that I first learned about subunit for test aggregation, archiving, and isolation.

Chapters Seven and Eight talk about decorators and functional programming. These two chapters have a lot of short code snippets to explain every aspect of decorators and functional programming in Python.

Chapter Nine is on the abstract syntax tree and the ast module. It's highly unlikely that the vast majority of us Python programmers will ever need to use it, but finding out programmatically what the current grammar looks like is insightful.

Chapter Ten is on performance and optimizations. It starts off by showing how to use cProfile, pyprof2calltree, and the dis module, then moves on to the utility of the bisect module and mentions the blist and bintree packages; I learned of these modules first from this book. Imagine you have a function called millions of times a week (say, a basic math function in a graphics library you ship) and you want memoization: the chapter covers functools.lru_cache. The first time I read it I thought, "Ah, they have a name for it." It's funny how often programmers come up with algorithms and patterns in day-to-day programming that already exist and are documented, only to learn about them later.

Chapter Eleven is on scaling and architecture. The title means different things to different people; someone coming from a Hadoop or Spark background may have a different point of view. This chapter talks about multithreading vs multiprocessing, i.e. the multiprocessing package vs the threading module (ah, the GIL). It also touches upon event-driven programming. Coming from C++/Qt, I have learned first-hand the simplicity of event-driven programming, and it's surprising how many programmers have never used it or aren't even aware of it.

ImportPython newsletter and blog subscribers can get a 20% discount on the book by clicking here.


Interview with Julien (the author)

Ankur - Hey Julien. Why don't you tell us a bit about yourself?

Julien - Hey! Well, sure. I'm a 33-year-old software engineer living in Paris, France. I work for Red Hat. While being based in Paris, I work remotely, as my team is spread all across the world! That allows me to travel from time to time and enjoy working with different views.

Some people might know me for my open source contributions: I started to be interested in Free Software 15 years ago. Starting as a user, I quickly became a contributor (Debian, Freedesktop, GNU Emacs, Python, OpenStack…) and then a maintainer (the awesome window manager, some Python modules…) of various open source software.

Besides talking to computers, I spend a lot of time cooking and learning cuisine. It's like hacking with food, and I find it wonderful. These days I'm also training for the Paris half-marathon, which I have already run a couple of times!



Ankur - How did you get started with Python, and later OpenStack?

Julien - I got hooked on Perl pretty young, because it was the high-level language most of my friends were using, so they taught me and I started programming with it. But as I learned more about programming and the object-oriented approach, I discovered Python 10 years ago and decided to learn it.

It took me a while to find an itch to scratch with Python, but I finally did, and my first project was rebuildd, a daemon to rebuild Debian packages. I learned Python with a book from O'Reilly back then and it was pretty disappointing. The book talked about things I did not care about, such as installation (I just did apt-get install), and the rest was just copy/paste from the official Python documentation, which I could search and read online anyway.

I wrote a few small Python applications, and 5 years ago got hired to work on OpenStack, where I really started to solve big problems with Python.

Ankur - I read the book when you self-published the first edition. How did it start? Did the book come out of your personal notes?

Julien - When I started to work on OpenStack, I started to review code a lot: it is part of the process of contributing to OpenStack. After doing that for 2 years, I started to notice all the things new and seasoned Python developers did not know about the language. So I (mentally) took note of them, and that gave me a pretty long list of pitfalls engineers would often fall into.

Around that same time, I read a book about self-publishing (Authority by Nathan Barry) and it just struck me: I should write a book to enlighten my fellow Python developers!

Ankur - Where I come from (India), OpenStack had a substantial impact on the Python ecosystem in 2014-2015, in terms of hiring and of training new developers in Python for OpenStack development. But interest seems to have waned in 2016. A few of my OpenStack developer friends suggest everything from the project being too complex, to Python not being the right language for the job, to the whole private-cloud bandwagon coming to an end. Is OpenStack adoption in decline, and if so, why?

Julien - There was a huge increase in recruiting and staffing people on OpenStack. Five years ago, nobody had a clue what the project would become, so everybody jumped onto the wagon and tried to influence the project. Five years later, some have succeeded, some have failed, and the hype has declined. The project is getting mature and therefore boring. That does not affect adoption, which is, on the other hand, increasing as the project becomes more fully grown.

Python is definitely the right language for the job. Using another programming language would not solve anything new: there are few technical challenges that could not be solved with Python. If there is any problem in OpenStack, it is the technical debt some projects still carry and the time it takes to reduce it, plus social issues, as in any activity involving human beings!

Ankur - About the new book you are working on, Scaling Python: how much of your experience with OpenStack is behind the decision to write about scaling Python code?

Julien - A lot. OpenStack is supposed to work on platforms with a very large number of servers, which is not what your typical web application is capable of, for example. So designing applications to achieve large scalability using Python is quite interesting and makes you discover things that you had no idea existed. Once again, I think it's interesting to share that knowledge with the world, and writing a book is a good way to do it.

Ankur - Thanks for answering the questions. Best of luck.

Disclaimer: the author of the book will be sponsoring an upcoming issue of the ImportPython newsletter. I chose to review the book and interview the author because this is a book I read years ago and recommend. Note that ImportPython does not make any money, directly or indirectly, from sales of the book.

Caktus Consulting Group: Caktus at PyCaribbean


For the first time, Caktus will be a gold sponsor at PyCaribbean, February 18-19th in Bayamón, Puerto Rico. We're pleased to announce two speakers from our team.

Erin Mullaney, Django developer, will give a talk on RapidPro, the open source SMS system backed by UNICEF. Kia Lam, UI developer, will talk about how women can navigate the seas of the tech industry with a few guiding principles and new perspectives. Erin and Kia join fantastic speakers from organizations like 18F, the Python Software Foundation, IBM, and Red Hat.

We hope you can join us, but if you can’t, there’ll be videos!

2016 in Review


2016 flew by, and 2017 is already more than a month old. With some free time over the New Year holiday, I want to look back at last year, and I hope to make this year fuller.

Four big things happened in 2016: my graduation project in the first half of the year, finishing my bachelor's degree mid-year, taking over full responsibility for the lab's servers in August, and a first semester of graduate school that rushed past.

My graduation project was distributed matrix factorization based on the parameter-server idea. At first I was overambitious and wanted to implement the whole thing in Python, using the mpi4py library for communication. The factorization algorithm was lifted straight from the paper "Distributed Stochastic ADMM for Matrix Factorization", with no real innovation of my own. Around early-to-mid April I realized the design unavoidably needed a dedicated IO thread, and since Python's threads are not real threads I worried about performance (a silly worry; if you're already using Python, why fuss about performance), so in late April I rewrote it in C with OpenMPI. Testing showed the results were not as good as I had imagined; in the end I was too embarrassed even to put the speedup numbers in the thesis. Looking back, if I had written it in C from the start and spent more time optimizing the client-server communication, would it have turned out better? Of course, if it were today there would be plenty of ready-made parameter-server frameworks to use, and I would not have had to hack something together and debug endless problems.

At graduation in mid-year I watched classmates of four years scatter in all directions while I stayed on at the same university for graduate school. I did not feel much of the sorrow of parting, only a vague sense that some people I may never see again. It was the same when I finished high school; maybe I am just someone who holds feelings lightly. Still, facing an unknown future without my friends around brought a little unease.

After graduation I gave myself half a month off, then returned to school and took over the lab's servers in August. For more than a month I practically lived in the machine room. Once the base environment was set up I immediately learned Docker and shell scripting, moved the user environments into Docker, and wrote my own shell-based manager tool for them. What struck me most during that period was how important documentation is for anything operated long-term, especially when it has to be handed over. Because my predecessor's documentation was incomplete, I spent most of August falling into pits. Later I wrote up the entire configuration process and design of the environment, which will benefit whoever handles the next handover, whether that is me or the next administrator. The environment has been running very stably ever since. Because I built an update command into the manager tool, it can automatically fetch release versions from the GitHub repo and apply updates, which is very convenient. I feel like I am drifting further and further down the road of ops (just kidding, mostly).

Graduate life started in September. Looking back, it really did not go well. The pressure of scholarships, an advisor who pays nothing (I do want to complain: never mind the first-semester coursework, I spent all of August running to the machine room; no credit, but surely some sympathy), difficulty concentrating in class, oddly awkward relations with new classmates, weekends spent on Andrew Ng's CS229 and on tensor decomposition, and from November on also mentoring a few undergraduates. I basically never got more than 3 km from campus all semester, and finals were nerve-racking beyond words. I shed a whole layer of skin getting through that semester, and only breathed out once finals ended early this year. As I put it at the time, I was spinning like a top all day without knowing what I was so busy with. Now I am leading two undergraduates working on distributed tensor decomposition, getting a deeper understanding of git and learning that a complete project needs code, tests and documentation alike. Call it feeling my way forward.

Looking back at the goal I set at last year's lab annual meeting, "an interesting project", I failed to achieve it, which I regret. A glance at my GitHub profile shows mostly blank space this past year; although I opened a dozen or so new repos, most are small scripts not fit to show.

In 2017 I want to draw up a reading list and read more widely. I want to take classes seriously and leave no regrets. I want to properly learn C++, so my thinking is no longer confined to Python. I want to build an interesting project, carrying over last year's goal. And I want to find a girlfriend, so that I learn to think about the future and to shoulder responsibility as a man.

2017, to be a better me.

Python Excel Tutorial: The Definitive Guide


Originally published at https://www.datacamp.com/community/tutorials/python-excel-tutorial

You will probably already know that Excel is a spreadsheet application developed by Microsoft. You can use this easily accessible tool to organize, analyze and store your data in tables. What’s more, this software is widely used in many different application fields all over the world.

And, whether you like it or not, this applies to data science.

You’ll need to deal with these spreadsheets at some point, but you won’t always want to continue working in it either. That’s why Python developers have implemented ways to read, write and manipulate not only these files, but also many other types of files.

Today’s tutorial will give you some insights into how you can work with Excel and Python. It will provide you with an overview of packages that you can use to load and write these spreadsheets to files with the help of Python. You’ll learn how to work with packages such as pandas , openpyxl , xlrd , xlutils and pyexcel .
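To give a flavour of what is coming, the pandas route is the shortest one. A minimal sketch, assuming a workbook named example.xlsx with a sheet called Sheet1, and that an Excel engine such as xlrd or openpyxl is installed for pandas to use:

import pandas as pd

# read one sheet of a workbook into a DataFrame (file and sheet names are assumptions)
df = pd.read_excel("example.xlsx", sheet_name="Sheet1")
print(df.head())

# write the same data back out to a new workbook
df.to_excel("copy_of_example.xlsx", index=False)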

It might also be interesting for you to take a look at DataCamp’s Importing Data in Python course . If you also want to know more about how to read files into R, consider taking DataCamp’s R Tutorial on Reading and Importing Excel Files into R .


Starting Point: The Data

When you're starting a data science project, you will often work with data that you have gathered, maybe from web scraping, but probably mostly from datasets that you download from other places, such as Kaggle, Quandl, etc.

But more often than not, you'll also find data on Google or in repositories that are shared by other users. This data might be in an Excel file or saved to a file with a .csv extension; the possibilities can seem endless sometimes. But whenever you have data, your first step should be to make sure that you're working with qualitative data.

In the case of a spreadsheet, you should corroborate that it’s qualitative because you might not only want to check if this data can answer the research question that you have in mind but also if you can trust the data that the spreadsheet holds.

Quality of Your Excel Spreadsheet

To check the overall quality of your spreadsheet, you can go over the following checklist:

- Does the spreadsheet represent static data?
- Does your spreadsheet mix data, calculation, and reporting?
- Is the data in your spreadsheet complete and consistent?
- Does your spreadsheet have a systematic worksheet structure?
- Did you check if the live formulas in the spreadsheet are valid?

This list of questions is to make sure that your spreadsheet doesn’t ‘sin’ against the best practices that are generally accepted in the industry. Of course, the above list is not exhaustive: there are many more general rules that you can follow to make sure your spreadsheet is not an ugly duckling. However, the questions that have been formulated above are most relevant for when you want to make sure if the spreadsheet is qualitative.

Quality of Your Data

Prior to reading your spreadsheet into Python, you also want to consider adjusting your file to meet some basic principles, such as:

- The first row of the spreadsheet is usually reserved for the header, while the first column is used to identify the sampling unit.
- Avoid names, values or fields with blank spaces. Otherwise, each word will be interpreted as a separate variable, resulting in errors related to the number of elements per line in your data set. Consider using underscores, dashes, camel case, or concatenating words.
- Short names are preferred over longer names.
- Try to avoid using names that contain symbols such as ?, $, %, ^, &, *, (, ), -, #, ?, ,, <, >, /, |, \, [, ], {, and }.
- Delete any comments that you have made in your file, to avoid extra columns or NA's being added to your file.
- Make sure that any missing values in your data set are indicated with NA.

Next, after you have made the necessary changes or when you have taken a thorough look at your data, make sure that you save your changes if you have made any. By doing this, you can revisit the data later to edit it, to add more data or to change them, while you preserve the formulas that you maybe used to calculate the data, etc.

If you’re working with Microsoft Excel, you’ll see that there are a considerable amount of options to save your file: besides the default extension .xls or .xlsx , you can go to the “File” tab, click on “Save As” and select one of the extensions that are listed as the “Save as Type” options. The most commonly used extensions to save datasets for data science are .csv and .txt (as tab-delimited text file). Depending on the saving option that you choose, your data set’s fields are separated by tabs or commas, which will make up the “field separator characters” of your data set.

Now that you have checked and saved your data, you can start preparing your workspace!


Prepping Your Workspace

Preparing your workspace is one of the first things that you can do to make sure that you start off well. The first step is to check your working directory.

When you’re working in the terminal, you might first navigate to the directory that your file is located in and then start up Python. That also means that you have to make sure that your file is located in the directory that you want to work from!

But perhaps more importantly, if you have already started your Python session and you’ve got no clue of the directory that you’re working in, you should consider executing the following commands:

# Import `os`
import os

# Retrieve current working directory (`cwd`)
cwd = os.getcwd()

# Change directory
os.chdir("/path/to/your/folder")

# List all files and directories in current directory
os.listdir('.')

Great, huh?

You’ll see that these commands are pretty vital not only for loading your data but also for further analysis. For now, let’s just continue: you have gone through all the checkups, you have saved your data and prepped your workspace.

Can you already start with reading the data in Python?

Unfortunately, you’ll still need to do one more last thing.

Even though you don’t have an idea yet of the packages that you’ll need to import your data, you do have to make sure that you have everything ready to install those packages when the time comes.

Pip

That's why you need to have pip and setuptools installed. If you have Py

Python 3 and JavaScript escape(): Getting Data Across Correctly and Fixing Garbled Chinese Characters


A few days ago I wrote a small web app with Python's Bottle framework. For the Ajax interaction, the front end first serialized the object with JSON.stringify and then encoded the result with escape() to make sure it was transmitted correctly.

Combined with jQuery's $.ajax that should basically have been enough, but perhaps through inexperience, even the encoded data was still hard to handle on the Python side.

After some thought I gradually worked out an approach, found similar approaches online, and implemented it.

The basic idea is as follows:

escape('你好世界ABC'); // returns "%u4F60%u597D%u4E16%u754CABC"

After this string is submitted to Bottle, I tried to decode it with Python's urllib and hit a problem:

>>> urllib.parse.unquote('%u4F60%u597D%u4E16%u754CABC')
'%u4F60%u597D%u4E16%u754CABC'

The string comes back exactly as it went in. After studying it for a moment I realized, rather sheepishly, that this is not URL-encoded text at all, so unquote cannot decode it. What we need instead is decode('UTF-8')!

%uXXXX is JavaScript's way of writing Unicode, so we have to turn it into the standard Unicode form \uXXXX.

Moreover, Python's unquote only URL-decodes a str, so it cannot unpack these Unicode-escaped Chinese characters; that is why decode('UTF-8') is needed.

But what we receive is a str, which has no decode, only encode. After digging through the documentation I found urllib.parse.unquote_to_bytes, which URL-decodes a str and returns bytes.

Exactly what I needed: with the returned bytes I can call decode.

So I wrote this:

def load_json(value):
    value = value.replace('%u', '\\u')  # turn %uXXXX into \uXXXX so the escapes can be resolved later
    byts = urllib.parse.unquote_to_bytes(value)  # URL-decode, returning bytes
    byts = byts.decode('UTF-8')  # decode the bytes to str; json.loads resolves the \uXXXX escapes
    return json.loads(byts)

And ran the following test:

escape('{"value":[123,"你好世界ABC"]}')
// "%7B%22value%22%3A%5B123%2C%22%u4F60%u597D%u4E16%u754CABC%22%5D%7D"

Python shell:

>>> load_json('%7B%22value%22%3A%5B123%2C%22%u4F60%u597D%u4E16%u754CABC%22%5D%7D')
{'value': [123, '你好世界ABC']}

The test passes; it works.

Summary:

With this approach, no matter what the characters are, they end up Unicode-encoded. JavaScript uses escape (other functions would do as well) to encode the characters; although you get %uXXXX, you can convert it into the standard \uXXXX form.

And even if some wilful browser does not produce %uXXXX, we only replace the '%u' prefix, which does not touch the actual characters.

The full pipeline:

JavaScript object -> JSON.stringify(obj) -> escape(json_str) -> the browser's automatic URL encoding (wilful browsers excepted) -> URL decode with Python urllib -> replace %uXXXX with \uXXXX -> decode('UTF-8') -> json.loads()

This is just a bit of personal experience; if anything is wrong or can be done better, corrections are welcome and I will gladly learn from them.

PythonFOSDEM 2017 - Call for Volunteers

Introduction

The Python community will be represented at FOSDEM 2017 with the Python devrooms. This year we have two devrooms, the first one for 150 people on Saturday and the second one for 450 people on Sunday; it's really cool, because it meant we could accept 24 talks instead of 16.

This is the official call for volunteers for the Python devroom at FOSDEM 2017, from the 4th to the 5th of February 2017.

FOSDEM is the Free and Open Source Software Developers' European Meeting, a free and non-commercial two-day weekend event that offers open source contributors a place to meet, share ideas and collaborate.

It's the biggest event of its kind in Europe, with over 5000 hackers and over 400 speakers.

For this edition, Python will be represented by its community. If you want to talk with a lot of Python users, it's the place to be!

But we have an issue: we need some volunteers, because over the weekend there will only be 4 of us.

If you think you can help us, please fill in this Google Form.

Thank you so much for your help

Stephane

Invalid version number error with Python

Problem

I tried to import a python package that I had installed from source. The import failed with this error:

File "/usr/lib/python2.7/distutils/version.py", line 40, in __init__ self.parse(vstring) File "/usr/lib/python2.7/distutils/version.py", line 107, in parse raise ValueError, "invalid version number '%s'" % vstring ValueError: invalid version number '2.7.0rc3' Solution

It turns out that the package version number has to be in the x.y.z format; otherwise Python throws this error.

Since I had the source code of this package, I found all instances of 2.7.0rc3 and changed them to 2.7.0. Typically, this will be in the setup.py and version.py files. I removed the previously installed package and reinstalled the changed source code. After this I was able to import it successfully.
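The behaviour is easy to reproduce in an interpreter; a small sketch using the same distutils classes that appear in the traceback:

from distutils.version import StrictVersion, LooseVersion

LooseVersion("2.7.0rc3")    # accepted: LooseVersion is permissive
StrictVersion("2.7.0")      # accepted: plain x.y.z
StrictVersion("2.7.0rc3")   # raises ValueError: invalid version number '2.7.0rc3'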

Tried with: Ubuntu 14.04


A New Blog, A New Beginning


A new year, a new look: belated New Year greetings to everyone.

Attentive readers may have noticed that over the New Year break I replaced the blog I had used for more than four years, overhauling it from back end to front end into what you see now. The new blog was essentially finished a few days ago; since then I have been polishing features to make it a truly complete and capable blog system.

I named it Talkbook, "a book that tells its story unhurriedly".

Talkbook is built on Python 3 and Django, deployed with Docker, Gunicorn and Gevent, backed by Postgres and Redis, and managed with one command via Docker Compose. Its features include:

- Writing posts in Markdown
- Custom templates, with switching between multiple templates
- Multiple tags per article
- One-click multi-image upload, with automatic thumbnail generation
- Custom image watermarks
- Email notifications
- Threaded comments and replies
- One-click Docker installation
- Compatibility with permalinks from the old Emlog blog
- Import of Emlog data (articles, comments, categories, tags)
- One-click upgrades that keep all dependencies at their latest stable versions
- Solid security

Planned improvements:

- A plugin mechanism
- Search engine optimization
- Import from other blog systems

I will not add many more features, to keep the blog light and focused. Compared with a static blog, its advantages are:

- Customizable URLs, so permalinks do not break when switching blog engines
- Page-view statistics are preserved
- Old comments and their threading are preserved
- Written in Python, so Python developers get a blog they can customize themselves

Of course, static blogs are the trend (and all the "advantages" above can be achieved with a static blog too); if I were a newcomer starting today, I would probably just use a static blog.

Talkbook will not be open-sourced or sold for now, but if you are sure you know some of the technologies above, you are welcome to join development as a contributor.

By the way, a few days ago I ran a small poll on whether people prefer the old blog's look or the new one. The result: on Weibo more people liked the old look, while on V2EX ( https://www.v2ex.com/t/337442 ) everyone preferred the new one.



That left me in an awkward spot, so I ported the old blog's template over; you can now switch templates via the "更换模板" (change template) link at the bottom right of the blog.

I hope more people will like it.

Personal thoughts about Pyston’s outcome


I try not to read HN/Reddit too much about Pyston, since while there are certainly some smart and reasonable people on there, there also seem to be quite a few people with axes to grind (*cough cough* Python 3). But there are some recurring themes I noticed in the comments about our announcement about Pyston's future, so I wanted to try to talk about some of them. I'm not really aiming to change anyone's mind, but since I haven't really talked through our motivations and decisions for the project, I wanted to make sure to put them out there.

Why we built a JIT

Let's go back to 2013 when we decided to do the project: CPU usage at Dropbox was an increasingly large concern. Despite the common wisdom that "Python is IO-bound", requests to the Dropbox website were spending around 90% of their time on the webserver CPU, and we were buying racks of webservers at a worrying pace.

At a technical level, the situation was tricky, because the CPU time was spread around in many areas, with the hottest areas accounting for a small (single-digit?) percentage of the entire request. This meant that potential solutions would have to apply to large portions of the codebase, as opposed to something like trying to Cython-ize a small number of functions. And unfortunately, PyPy was not, and still is not, close to the level of compatibility to run a multi-million-LOC codebase like Dropbox's, especially with our heavy use of extension modules.

So, we thought (and I still believe) that Dropbox's use-case falls into a pretty wide gap in the Python-performance ecosystem, of people who want better performance but who are unable or unwilling to sacrifice the ecosystem that led them to choose Python in the first place. Our overall strategy has been to target the gap in the market, rather than trying to compete head-to-head with existing solutions.

And yes, I was excited to have an opportunity to tackle this sort of problem. I think I did as good a job as I could to discount that, but it's impossible to know what effect it actually had.

Why we started from scratch

Another common complaint is that we should have at least started with PyPy or CPython's codebase.

For PyPy, it would have been tricky, since Dropbox's needs are both philosophically and technically opposed to PyPy's goals. We needed a high level of compatibility and reasonable performance gains on complex, real-world workloads. I think this is a case that PyPy has not been able to crack, and in my opinion is why they are not enjoying higher levels of success. If this was just a matter of investing a bit more into their platform, then yes it would have been great to just "help make PyPy work a bit better". Unfortunately, I think their issues (lack of C extension support, performance reliability, memory usage) are baked into their architecture. My understanding is that a "PyPy that is modified to work for Dropbox" would not look much like PyPy in the end.

For CPython, this was more of a pragmatic decision. Our goal was always to leverage CPython as much as we could, and now in 2017 I would recklessly estimate that Pyston's codebase is 90% CPython code. So at this point, we are clearly a CPython-based implementation.

My opinion is that it would have been very tough to start out this way. The CPython codebase is not particularly amenable to experimentation in these fundamental areas. And for the early stages of the project, our priority was to validate our strategies. I think this was a good choice because our initial strategy (using LLVM to make Python fast) did not work, and we ended up switching gears to something much more successful.

But yes, along the way we did reimplement some things. I think we did a good job of understanding that those things were not our value-add and to treat them appropriately. I still wonder if there were ways we could have avoided more of the duplicated effort, but it's not obvious to me how we could have done so.

Issues people don't think about

It's an interesting phenomenon that people feel very comfortable having strong opinions about language performance without having much experience in the area. I can't judge, because I was in this boat -- I thought that if web browsers made JS fast, then we could do the same thing and make Python fast. So instead of trying to squelch the "hey they made Lua fast, that means Lua is better!" opinions, I'll try to just talk about what makes Python hard to run quickly (especially as compared to less-dynamic languages like JS or Lua).

The thing I wish people understood about Python performance is that the difficulties come from Python's extremely rich object model, not from anything about its dynamic scopes or dynamic types. The problem is that every operation in Python will typically have multiple points at which the user can override the behavior, and these features are used, often very extensively. Some examples are inspecting the locals of a frame after the frame has exited, mutating functions in-place, or even something as banal as overriding isinstance. These are all things that we had to support, and are used enough that we have to support efficiently, and don't have analogs in less-dynamic languages like JS or Lua.
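
To illustrate that last point (my own toy example in Python 3 syntax, not code from Pyston or Dropbox): overriding isinstance only takes an __instancecheck__ hook on a metaclass, which means even a plain isinstance() call can end up running arbitrary user code.

# Toy example: user code hooks isinstance() via a metaclass.
class AlwaysMatches(type):
    def __instancecheck__(cls, obj):
        return True          # arbitrary user logic runs on every isinstance() check

class Anything(metaclass=AlwaysMatches):
    pass

print(isinstance(42, Anything))       # True
print(isinstance("hello", Anything))  # True

A compiler that wants to treat isinstance() as a cheap, predictable builtin has to detect or guard against exactly this kind of override.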

On the flip side, the issues with Python compatibility are also quite different than most people understand. Even the smartest technical approaches will have compatibility issues with codebases the size of Dropbox. We found, for example, that there are simply too many things that will break when switching from refcounting to a tracing garbage collector, or even switching the dictionary ordering. We ended up having to re-do our implementations of both of these to match CPython's behavior exactly.

Memory usage is also a very large problem for Python programs, especially in the web-app domain. This is, unintuitively, driven in part by the GIL: while a multi-process approach will be conceptually similar to a multi-threaded approach, the multi-process approach uses much more memory. This is because Python cannot easily share its memory between different processes, both for logistical reasons, but also for some deeper reasons stemming from reference counting. Regardless of the exact reasons, there are many parts of Dropbox that are actually memory-capacity-bound, where the key metric is "requests per second per GB of memory". We thought a 50% speed increase would justify a 2x memory increase, but that trade-off is a net loss for a memory-bound service. Memory usage is not something that gets talked about that often in the Python space (except for MicroPython), and would be another reason that PyPy would struggle to be competitive for Dropbox's use-case.

So again, this post is me trying to explain some of the decisions we made along the way, and hopefully stay away from being too defensive about it. We certainly had our share of bad bets and schedule overruns, and if I were to do this all over again my plan would be much better the second time around. But I do think that most of our decisions were defensible, which is why I wanted to take the time to talk about them.

Causal inference in python

Causality

This package contains tools for causal analysis using observational (rather than experimental) datasets.

Installation

Assuming you have pip installed, just run

pip install causality

Causal Inference

The causality.inference module will contain various algorithms for inferring causal DAGs. Currently (2016/01/23), the only algorithm implemented is the IC* algorithm from Pearl (2000). It has decent test coverage, but feel free to write some more! I've left some stubs in tests/unit/test_IC.py.

To run a graph search on a dataset, you can use the algorithms like this (using IC* as an example):

import numpy
import pandas as pd
from causality.inference.search import IC
from causality.inference.independence_tests import RobustRegressionTest

# generate some toy data:
SIZE = 2000
x1 = numpy.random.normal(size=SIZE)
x2 = x1 + numpy.random.normal(size=SIZE)
x3 = x1 + numpy.random.normal(size=SIZE)
x4 = x2 + x3 + numpy.random.normal(size=SIZE)
x5 = x4 + numpy.random.normal(size=SIZE)

# load the data into a dataframe:
X = pd.DataFrame({'x1' : x1, 'x2' : x2, 'x3' : x3, 'x4' : x4, 'x5' : x5})

# define the variable types: 'c' is 'continuous'. The variables defined here
# are the ones the search is performed over -- NOT all the variables defined
# in the data frame.
variable_types = {'x1' : 'c', 'x2' : 'c', 'x3' : 'c', 'x4' : 'c', 'x5' : 'c'}

# run the search
ic_algorithm = IC(RobustRegressionTest, X, variable_types)
graph = ic_algorithm.search()

Now, we have the inferred graph stored in graph . In this graph, each variable is a node (named from the DataFrame columns), and each edge represents statistical dependence between the nodes that can't be eliminated by conditioning on the variables specified for the search. If an edge can be oriented with the data available, the arrowhead is indicated in 'arrows' . If the edge also satisfies the local criterion for genuine causation, then that directed edge will have marked=True . If we print the edges from the result of our search, we can see which edges are oriented, and which satisfy the local criterion for genuine causation:

>>> graph.edges(data=True)
[('x2', 'x1', {'arrows': [], 'marked': False}),
 ('x2', 'x4', {'arrows': ['x4'], 'marked': False}),
 ('x3', 'x1', {'arrows': [], 'marked': False}),
 ('x3', 'x4', {'arrows': ['x4'], 'marked': False}),
 ('x4', 'x5', {'arrows': ['x5'], 'marked': True})]

We can see the edges from 'x2' to 'x4' , 'x3' to 'x4' , and 'x4' to 'x5' are all oriented toward the second of each pair. Additionally, we see that the edge from 'x4' to 'x5' satisfies the local criterion for genuine causation. This matches the structure given in figure 2.3(d) in Pearl (2000).

Nonparametric Effects Estimation

The causality.nonparametric module contains a tool for non-parametrically estimating a causal distribution from an observational data set. You can supply an "admissable set" of variables to control for, and then measure either the causal effect distribution of an effect given the cause, or the expected value of the effect given the cause.

I've recently added adjustment for direct causes, where you can estimate the causal effect of fixing a set of X variables on a set of Y variables by adjusting for the parents of X in your graph. Using the dataset above, you can run it like this:

from causality.nonparametric.causal_reg import AdjustForDirectCauses
from networkx import DiGraph

g = DiGraph()
g.add_nodes_from(['x1', 'x2', 'x3', 'x4', 'x5'])
g.add_edges_from([('x1', 'x2'), ('x1', 'x3'), ('x2', 'x4'), ('x3', 'x4')])

adjustment = AdjustForDirectCauses(g, X, ['x2'], ['x3'], variable_types=variable_types)

Then, you can see the set of variables being adjusted for by

>>> print adjustment.admissable_set
set(['x1'])

If we hadn't adjusted for 'x1' we would have incorrectly found that 'x2' had a causal effect on 'x3' due to the confounding pathway x2, x1, x3 . Adjusting for 'x1' removes this bias.

You can see the causal effect of the intervention, P(x3|do(x2)) , using the measured causal effect in adjustment :

>>> x = pd.DataFrame({'x2' : [0.], 'x3' : [0.]})
>>> print adjustment.effect.pdf(x)
0.268915603296

This is close to the correct value of 0.282 for a Gaussian with mean 0 and variance 2. If you change the value of 'x2' , you'll find that the probability density of 'x3' doesn't change. This would not be true for the plain conditional distribution, P(x3|x2) , since in that case observation and intervention are not equivalent.
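
For example, reusing the adjustment object from above (my own quick check; the exact density depends on the random toy data, but it should again land near 0.28):

>>> x = pd.DataFrame({'x2' : [1.], 'x3' : [0.]})
>>> print adjustment.effect.pdf(x)   # roughly the same density as with x2 = 0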

Other Notes

This repository is in its early phases. The run-time for the tests is long. Many optimizations will be made in the near future, including

Implement fast mutual information calculation, O(N log N)
Speed up integrating out variables for controlling
Take a user-supplied graph, and find the set of admissable sets
Front-door criterion method for determining causal effects

Pearl, Judea. Causality. Cambridge University Press, 2000.

pip changing from pep8 to pycodestyle


I recently updated one of my Atom packages, linter-pep8 , to version 2.0, under which it was renamed to linter-pycodestyle . This is because the pep8 package was renamed to pycodestyle to reduce confusion between the package and the PEP 8 specification.

However, after I opened Atom I got an error message: Error: spawn pycodestyle ENOENT



This was because I hadn't upgraded the Python package. As I wasn't using pep8 for anything else, I uninstalled it and installed pycodestyle. On Windows I'd installed Python 3.6 x64 for all users, so Python was installed in C:\Program Files\Python36\

"C:\Program Files\Python36\Scripts\pip.exe" uninstall pep8 "C:\Program Files\Python36\Scripts\pip.exe" install pycodestyle

On Linux, pip was in my PATH environment variable, so I simply ran

sudo pip uninstall pep8
sudo pip install pycodestyle

And that fixed up my issues.
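
If you want to double-check the new package from Python itself rather than through Atom, something like this should work (a small sketch using pycodestyle's documented API; 'example.py' is just a placeholder file name):

import pycodestyle

# Run the pycodestyle checks programmatically on a file of your choice.
style_guide = pycodestyle.StyleGuide(quiet=True)
report = style_guide.check_files(['example.py'])
print(report.total_errors)   # 0 means no style violations were found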

Sending SMS messages with Amazon SNS and Python


There are many services out there that will let you programmatically send SMS messages. One of the more popular is Twilio , and they have a great API and a Python client that's easy to use. There's an interesting Quora thread with several other suggestions as well.

Another option is to use Amazon's Simple Notification Service (SNS), which also supports sending SMS messages. I recently incorporated this into a project, and thought I'd share.

Step 1: API key + boto3

If you're already using AWS, you've probably jumped through these hoops. I'm not going to walk you through them, but just realize you need to figure out how to sign up for an AWS account and get some API keys.

The second part of this is boto3 , Amazon's Python SDK.

pip install boto3

Boto's quickstart guide should help, and it also includes some info on getting boto configured.

Step 2: Send your message

At the bare minimum, you can just send a message directly to a single phone number. Here's the code:

import boto3

# Create an SNS client
client = boto3.client(
    "sns",
    aws_access_key_id="YOUR ACCESS KEY",
    aws_secret_access_key="YOUR SECRET KEY",
    region_name="us-east-1"
)

# Send your SMS message.
client.publish(
    PhoneNumber="+12223334444",
    Message="Hello World!"
)

Note the format of the phone number. It has to be in something called E.164 format . For US phone numbers, this means the +1 country code, then the area code and the rest of the phone number without any additional formatting.
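
If your numbers come in as ordinary US phone strings, a tiny helper like this (my own sketch, not part of boto3) can normalize them to E.164 before calling publish:

def to_e164_us(raw):
    """Convert a 10-digit US number like '(222) 333-4444' to E.164 format."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) != 10:
        raise ValueError("expected a 10-digit US phone number")
    return "+1" + digits

print(to_e164_us("(222) 333-4444"))   # +12223334444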

If you just need to send a message every once in a while (e.g. to notify yourself when something happens), then congrats! You're done.

Step 3: Do actual Pub-Sub

If you need to send messages to multiple recipients, it's worthwhile to read through Amazon's docs on sending to multiple phone numbers .

The SNS service implements the Publish-Subscribe pattern, and you can use it to send messages to a topic . Here are the steps to make this work:

1. Create a named topic. This is just a communication channel to which you can subscribe phone numbers.
2. Subscribe your recipients to the topic.
3. Publish a message on the topic.

The python code looks something like this:

import boto3

# Create an SNS client
client = boto3.client(
    "sns",
    aws_access_key_id="YOUR ACCESS KEY",
    aws_secret_access_key="YOUR SECRET KEY",
    region_name="us-east-1"
)

# Create the topic if it doesn't exist (this is idempotent)
topic = client.create_topic(Name="notifications")
topic_arn = topic['TopicArn']  # get its Amazon Resource Name

# Add SMS subscribers
for number in some_list_of_contacts:
    client.subscribe(
        TopicArn=topic_arn,
        Protocol='sms',
        Endpoint=number  # <-- the number that will receive the SMS messages
    )

# Publish a message.
client.publish(Message="Good news everyone!", TopicArn=topic_arn)

All your subscribers should receive an SMS message once you've published it on the topic. In addition, you should be able to monitor SNS usage on the AWS console , which will tell you how many messages are sent (as well as how many SMS messages fail). If you plan to use SNS for any commercial usage, you'll also want to read up on SNS Pricing .

That's it! Hope this article has helped. Let me know in the comments below :)
