
Gitee Recommendation | SaltOps: an Ops Platform Built with SaltStack and Django

saltops [still under development; do not use for production systems] Goals

SaltOps is an operations platform built on SaltStack and Django.

Its main features are a CMDB, package release management, and a tool system, with the platform ultimately acting in the role of a package release and tooling hub,

and it integrates with systems such as Jenkins and Zabbix.

What the system provides:
CMDB: there is no way around it, asset information is still needed, and the Salt agent is very well suited to collecting this basic information. The package release process also needs CMDB data, so the CMDB exists as a by-product.
Package release: package deployment is driven by Salt's state.sls; you write the .sls files and then call Salt to perform the release, after which the mapping between applications and hosts falls into place naturally.
Tool platform: since Salt is already wired in, building the tool platform on top of it is the natural next step.

Why DjangoAdmin?

DjangoAdmin is mostly meant for back-office administrators. The reason for using it here is a lack of resources: with only limited time to write code each day, DjangoAdmin means most of the UI does not have to be built by hand, which saves a lot of effort.

Paired with django-jet it does not look bad either.

Some screenshots of the SaltOps UI.

[Python] A Multithreaded Zhihu Crawler


Zhihu questions crawled with multiple threads:

Which songs are worth recommending for single-track repeat for a whole week? Which lines from songs you have looped moved you?

Since winter break I have wanted to study Python web scraping systematically; my earlier learning was piecemeal, which left me with only a half-understanding of many things.

The idea for the project came from the thought that a song someone plays on repeat must have moved them at least once while listening: perhaps a melody in it touched them, or the lyrics happened to mirror their own life. Either way, a song good enough to loop cannot be bad, so I wanted to collect those song titles and build a playlist on NetEase Cloud Music.

However, once I saw the answers people had written, I realized it was going to be a bit messy.

For example:

南山南

----------------I'm a restless divider line---------- lots of people in the comments say they simply must also recommend 傲寒, also by Ma

A Hands-On Introduction to PentestBox


PentestBox is unlike the Linux penetration-testing distributions that run in a virtual machine or a dual-boot environment. It packages all of the security tools and runs them natively on Windows, effectively removing the need for a virtual machine or dual-boot setup. The tool has been covered in the news before; for details see the earlier article "You may regret not trying it: Pentest Box, a Windows penetration-testing toolkit".
Installation: installing PentestBox is simple. First download it (Baidu Netdisk): pentestbox [2g62]

After the download finishes, run the provided installer (installing to the C: drive is recommended), extract the files, and click Next.

Once extraction completes, find PentestBox.exe or PentestBox.bat in the install directory and run it.

See the tool list for everything the system includes.

Tool management

The toolsmanager command installs, updates, and uninstalls tools.

Type it in the PentestBox terminal to open it; it pulls updates from GitHub.



You will see the list of tool categories; keep entering numbers to continue. Here I choose 10, the category that contains whatweb.



Type whatweb to install whatweb.


Updating

update

To update a specific program, for example: update webapplication


Keyboard shortcuts

CTRL + T: open a new tab

CTRL + C: stop the running script

CTRL + W: close the current console

ALT + Enter: toggle full screen

Adding your own tools

If you want to add your own favorite tools, edit the customaliases file under C:\PentestBox\bin\customtools.

python

sqlmap=python "%pentestbox_ROOT%\bin\customtools\sqlmap\sqlmap.py" $*

Ruby

"%pentestbox_ROOT%bincustomtoolswpscanwpscan.rb" $*

EXE

tool="%pentestbox_ROOT%bincustomtoolstool.exe" $*

Java

tool=start javaw -jar "%pentestbox_ROOT%\bin\customtools\tool.jar" $*

Restart PentestBox and run the alias you defined to launch your own tool.

Demo: attacking an Android phone with PentestBox

Video: https://v.qq.com/x/page/f03688969zb.html

(The content demonstrated in the video is for learning purposes only; do not do anything illegal to other people's phones.)

Weibo

*Author: lr3800_. Please credit CodeSec when republishing.

Anaconda 4.3.0, a Python Distribution for Scientific Computing


Anaconda 4.3.0 has been released. Anaconda is a Python distribution for scientific computing that runs on Linux, Mac, and Windows and bundles many popular Python packages for scientific computing and data analysis.

The changelog is as follows:

Highlights:

The Anaconda3 installers are based on Python 3.6. Anaconda 4.3 supports Python 2.7, 3.4, 3.5 and 3.6. Anaconda 4.3 will be the last release which supports Python 3.4. We will discontinue regular Python 3.4 package updates in the next release.

The Intel Math Kernel Library (MKL) is updated from 11.3.3 to 2017.0.1.

Over 90 packages are updated.

seaborn is now installed by default.

Changes:

Updates jpeg and libpng to increase compatibility with conda-forge.

Warns about possible errors if installing on Windows into an install path with spaces, and does not allow installation if the install path contains unicode characters.

Fixes many Windows menu uninstallation issues and some other often reported uninstallation issues on Windows.

Anaconda 4.2 is the last release that supports macOS 10.7 and macOS 10.8. Anaconda 4.3 supports macOS versions from 10.9 through the current version 10.12.

conda-build, anaconda-clean and the Jupyter Notebook extensions are no longer installed by default but can be installed with a single conda command.

Updated packages:

anaconda-client from 1.5.1 to 1.6.0

anaconda-navigator from 1.3.1 to 1.4.3

astroid from 1.4.7 to 1.4.9

astropy from 1.2.1 to 1.3

backports_abc from 0.4 to 0.5

beautifulsoup4 from 4.5.1 to 4.5.3

bokeh from 0.12.2 to 0.12.4

boto from 2.42.0 to 2.45.0

bottleneck from 1.1.0 to 1.2.0

cairo from 1.12.18 to 1.14.8

cffi from 1.7.0 to 1.9.1

click from 6.6 to 6.7

cloudpickle from 0.2.1 to 0.2.2

conda from 4.2.9 to 4.3.8

contextlib2 from 0.5.3 to 0.5.4

cryptography from 1.5 to 1.7.1

curl from 7.49.0 to 7.52.1

cython from 0.24.1 to 0.25.2

cytoolz from 0.8.0 to 0.8.2

dask from 0.11.0 to 0.13.0

datashape from 0.5.2 to 0.5.4

decorator from 4.0.10 to 4.0.11

docutils from 0.12 to 0.13.1

flask from 0.11.1 to 0.12

flask-cors from 2.1.2 to 3.0.2

fontconfig from 2.11.1 to 2.12.1

gevent from 1.1.2 to 1.2.1

glib from 2.43.0 to 2.50.2

greenlet from 0.4.10 to 0.4.11

hdf5 from 1.8.15.1 to 1.8.17

idna from 2.1 to 2.2

ipaddress from 1.0.16 to 1.0.18

ipykernel from 4.5.0 to 4.5.2

jdcal from 1.2 to 1.3

jinja2 from 2.8 to 2.9.4

jpeg from 8d to 9b

jupyter_core from 4.2.0 to 4.2.1

lazy-object-proxy from 1.2.1 to 1.2.2

libpng from 1.6.22 to 1.6.27

libxml2 from 2.9.2 to 2.9.4

libxslt from 1.1.28 to 1.1.29

llvmlite from 0.13.0 to 0.15.0

lxml from 3.6.4 to 3.7.2

matplotlib from 1.5.3 to 2.0.0

menuinst from 1.4.1 to 1.4.4

mkl from 11.3.3 to 2017.0.1

multipledispatch from 0.4.8 to 0.4.9

nbformat from 4.1.0 to 4.2.0

nltk from 3.2.1 to 3.2.2

notebook from 4.2.3 to 4.3.1

numba from 0.28.1 to 0.30.1

numpy from 1.11.1 to 1.11.3

openpyxl from 2.3.2 to 2.4.1

openssl from 1.0.2j to 1.0.2k

pandas from 0.18.1 to 0.19.2

partd from 0.3.6 to 0.3.7

path.py from 8.2.1 to 10.0

pathlib2 from 2.1.0 to 2.2.0

pexpect from 4.0.1 to 4.2.1

pillow from 3.3.1 to 4.0.0

pip from 8.1.2 to 9.0.1

pixman from 0.32.6 to 0.34.0

prompt_toolkit from 1.0.3 to 1.0.9

psutil from 4.3.1 to 5.0.1

py from 1.4.31 to 1.4.32

pycparser from 2.14 to 2.17

pyflakes from 1.3.0 to 1.5.0

pylint from 1.5.4 to 1.6.4

pyopenssl from 16.0.0 to 16.2.0

pytables from 3.2.2 to 3.3.0

pytest from 2.9.2 to 3.0.5

python from 2.7.12 to 2.7.13

python-dateutil from 2.5.3 to 2.6.0

pytz from 2016.6.1 to 2016.10

pyzmq from 15.4.0 to 16.0.2

qt from 5.6.0 to 5.6.2

qtawesome from 0.3.3 to 0.4.3

qtpy from 1.1.2 to 1.2.1

requests from 2.11.1 to 2.12.4

scikit-learn from 0.17.1 to 0.18.1

sphinx from 1.4.6 to 1.5.1

spyder from 3.0.0 to 3.1.2

sqlalchemy from 1.0.13 to 1.1.5

toolz from 0.8.0 to 0.8.2

tornado from 4.4.1 to 4.4.2

traitlets from 4.3.0 to 4.3.1

werkzeug from 0.11.11 to 0.11.15

wrapt from 1.10.6 to 1.10.8

xlsxwriter from 0.9.3 to 0.9.6

xlwings from 0.10.0 to 0.10.2

xlwt from 1.1.2 to 1.2.0

zeromq from 4.1.4 to 4.1.5

Newly added packages:

chardet 2.3.0

isort 4.2.5

libiconv 1.14

numpydoc 0.6.0

pcre 8.39 (on Linux)

scandir 1.4

seaborn 0.7.1

subprocess32 3.2.7 (Python 2)

Removed packages:

anaconda-clean

dynd-python

filelock

libdynd

nb_anacondacloud

nb_conda

nb_conda_kernels

nbpresent

patchelf

pkginfo

Drawing with Python's turtle Module

Drawing with Python's turtle module.

Step 1: import the turtle module. Importing a module tells Python that you want to use it.

import turtle

Step 2: create the canvas by calling the Pen function from turtle.

t = turtle.Pen()

Step 3: move the turtle.

t.forward(50)

forward means just what it says, so this line moves the turtle forward by 50 pixels:


t.left(90)

This turns the turtle left by 90 degrees.



Now we can try to draw a square: move forward, turn 90 degrees, and repeat four times.

>>> t.forward(50)
>>> t.left(90)
>>> t.forward(50)
>>> t.left(90)
>>> t.forward(50)
>>> t.left(90)
>>> t.forward(50)
>>> t.left(90)

The result:


Step 4: clearing the canvas.

>>> t.reset()

The reset command clears the canvas and puts the turtle back at its starting position.

>>> t.clear()

The clear command only wipes the screen; the turtle stays where it is.

We can also turn the turtle right, or make it move backward. up() lifts the pen off the paper (in other words, the turtle stops drawing) and down() puts the pen back down to start drawing again.

Let's put these together and draw two lines:

>>> t.reset()         # clear the canvas and move the turtle back to the start
>>> t.backward(100)   # move backward 100 pixels
>>> t.up()            # lift the pen, stop drawing
>>> t.right(90)       # turn right 90 degrees
>>> t.forward(20)     # move forward 20 pixels
>>> t.left(90)        # turn left 90 degrees
>>> t.down()          # put the pen down, ready to draw
>>> t.forward(100)    # move forward 100 pixels

The result:


Summary

Learning to draw with the turtle module feels like learning to draw in primary school all over again. Back then you just picked up a pen and drew on paper; drawing with Python breaks that act into steps: prepare the canvas, pick up the pen, draw, put the pen down. It is great fun. Take it slowly; this is only the beginning. ^_^

Pythonistas (and a Python!) at PyCon Jamaica

$
0
0

This past November marked the first PyCon Jamaica . Held in the capital, Kingston, the conference began on November 17th with a day of tutorials followed by a single track of talks on November 18th. I attended both as a representative of the python Software Foundation, which sponsored the conference, and as a speaker.

Python in Kingston’s Higher Education

Kingston, home to approximately 33% of Jamaicans, boasts several institutions of higher learning including the Caribbean Maritime Institute and the Mona campus of the University of the West Indies. PyCon Jamaica kicked off with tutorials at the University of the West Indies. Most of the tutorials focused on introductory topics (e.g. Introduction to Plone). Participants came from a wide range of backgrounds, from mechanical engineers to undergraduates with a marketing concentration. Interestingly, I was informed that Python isn't yet part of the standard computer science offering at the university, yet it has become a language of considerable interest in many of Kingston's professional sectors.


David Bain , organizer of PyCon Jamaica and the local Python Jamaica user group, explained that he thinks the interest in Python has risen as students have become increasingly exposed to web technologies. Bain added that PyCon Jamaica is a way to help demonstrate to students and professionals the various applications Python has. "Jamaica wants to be seen as a viable source for local and North American nearshore developer talent, our event signals that software development talent is here," Bain explained.

Modernizing the Public Sector with Python

Conference talks were held at the Hope Zoo, a facility housing vast botanical gardens, a zoo, and a community center. There were three international speakers, Joir-dan Gumbs of IBM, Star Ying of the US Dept of Commerce, and myself, alongside several local speakers. The tutorials had been more student-centric, but the conference catered to those using Python in the Jamaican public sector.


A common theme from local speakers was how Python has helped local professionals modernize outdated practices. Marc Murray of the Jamaican Ministry of Health described how he has used Python throughout his fifteen-plus-year career to automate processes and enable better data collection and data sharing. More than one speaker acknowledged the struggle with institutional knowledge silos in the local government. With Python, though, these knowledge silos have started to be disrupted: agencies are able to share the same data sets with greater ease and promote transparency.

#PyConJamaica2016 : Challenges #Python answers in gov't - information silos, use to inc info access, speed, remove redundancy @PythonJamaica

― Lorena Mesa (@loooorenanicole) November 18, 2016

Python's data-processing power was the star in a talk by student Dominic Mills . Mills recently completed an internship at CERN, where he built a Django prototype for debugging hardware in future experiments. Crucial to this project was not only the collection of data via Celery but the capacity to analyze it. Mills used bokeh for real time analysis of the sensor data, permitting monitoring and alarms to be raised if unfavorable conditions were found.


Collectively the speakers at PyCon Jamaica reflect how Jamaican programmers are embracing Python for data collection and analysis in a variety of specialties. Python’s open source packages and rich community support seemed to be its biggest selling points. Speaker Joir-dan Gumbs commented that, “ the best part for me was the presentations of how Python is enhancing the lives of Jamaicans, as well as the networking.”



I’m excited to see what PyCon Jamaica 2017 will hold. Already the conference is rich in data science and data visualization content. After all, if PyCon Jamaica 2016 included an appearance from the Hope Zoo’s own python what will we see next? Perhaps two pythons, and of course many more Jamaican Pythonistas.


Advanced Drawing with Python's turtle Module


In Python the turtle can draw much more than simple black lines: it can draw more complex geometric shapes, use different colors, and even fill shapes with color.

1. Starting from the basic square

Import the turtle module and create a Pen object:

>>> import turtle
>>> t = turtle.Pen()

The code we used earlier to draw a square was:

>>> t.forward(50)
>>> t.left(90)
>>> t.forward(50)
>>> t.left(90)
>>> t.forward(50)
>>> t.left(90)
>>> t.forward(50)

That is rather long; a for loop tidies it up:

>>> t.reset()
>>> for x in range(1, 5):
        t.forward(50)
        t.left(90)

The result:


2. Drawing stars

We only need to change the for loop a little:

>>> t.reset()
>>> for x in range(1, 9):    # loop eight times
        t.forward(100)       # move forward 100 pixels
        t.left(225)          # turn left 225 degrees

The result:



We can take this further, for example turning 175 degrees each time and looping 37 times:

>>> t.reset()
>>> for x in range(1, 38):
        t.forward(100)
        t.left(175)

The result:



We can also draw a spiral star:

>>> t.reset()
>>> for x in range(1, 20):
        t.forward(100)
        t.left(95)

The result:



Now let's use an if statement to control how the turtle turns and draw a different kind of star: the turtle turns by one angle on one step and by a different angle on the next.

Here we create a loop that runs 18 times (range(1, 19)) and move the turtle forward 100 pixels (t.forward(100)). Then comes the if statement (if x % 2 == 0), which asks whether the remainder of x divided by 2 is equal to 0. If x is even, we turn the turtle left 175 degrees (t.left(175)); otherwise (else) we turn it left 225 degrees. The code:

>>> t.reset()
>>> for x in range(1, 19):
        t.forward(100)
        if x % 2 == 0:
            t.left(175)
        else:
            t.left(225)

The result:


3. Drawing a car

Let's try to draw a small car; set yourself a small goal and one day you may just reach it.

(This code introduces the color, begin_fill, end_fill, circle, and setheading functions.)

>>> import turtle
>>> t = turtle.Pen()
>>> t.color(1, 0, 0)
>>> t.begin_fill()
>>> t.forward(100)
>>> t.left(90)
>>> t.forward(20)
>>> t.left(90)
>>> t.forward(20)
>>> t.right(90)
>>> t.forward(20)
>>> t.left(90)
>>> t.forward(60)
>>> t.left(90)
>>> t.forward(20)
>>> t.right(90)
>>> t.forward(20)
>>> t.left(90)
>>> t.forward(20)
>>> t.end_fill()
(car body)
>>> t.color(0, 0, 0)
>>> t.up()
>>> t.forward(10)
>>> t.down()
>>> t.begin_fill()
>>> t.circle(10)
>>> t.end_fill()
(left wheel)
>>> t.setheading(0)
>>> t.up()
>>> t.forward(90)
>>> t.right(90)
>>> t.forward(10)
>>> t.setheading(0)
>>> t.begin_fill()
>>> t.down()
>>> t.circle(10)
>>> t.end_fill()
(right wheel)

The combined result:


Here is a rundown of the new functions:

1. color changes the pen color.

2. begin_fill and end_fill fill a region of the canvas with color.

3. circle draws a circle of the given size.

4. setheading points the turtle in the given direction.
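To see how these fit together, here is a minimal sketch (assuming the Pen object t created above) that draws a filled blue circle while the turtle faces straight up:

>>> t.color(0, 0, 1)    # pure blue
>>> t.setheading(90)    # face straight up before drawing
>>> t.begin_fill()      # everything until end_fill() gets filled
>>> t.circle(40)        # a circle with radius 40 pixels
>>> t.end_fill()        # fill the circle with the current color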

Summary:

This time we dug deeper into Python's turtle module, drawing several basic geometric shapes and using for loops and if statements to control the turtle's movements on screen. We also changed the color of the turtle's pen and filled the shapes it drew. Next we will move on to filling with color.

It keeps getting more fun. The harder you work, the luckier you get. ^_^

Writing Python Extensions in Rust


In December I spent a few days with Rust. I wrote a few lines of code and tried to get a feel for the syntax and the language. One of the major things on my TODO list was figuring out how to write Python extensions in Rust. Armin Ronacher wrote an excellent post on the Sentry blog back in October 2016, and I decided to learn from that same code base. It is always much easier to make small changes and then see what actually changes as a result. This is also my first use of the CFFI module; before this, I always wrote Python C extensions from scratch. In this post I will assume that you already have a working Rust installation on your system, and we will go ahead from that.

Creating the initial Rust project

I am already in my new project directory, which is empty.

$ cargo init
 Created library project
$ ls
Cargo.toml  src

Now, I am going to update the Cargo.toml file with the following content. Feel free to adjust based on your requirements.

[package]
name = "liblearn"
version = "0.1.0"
authors = ["Kushal Das <mail@kushaldas.in>"]

[lib]
name = "liblearn"
crate-type = ["cdylib"]

Using the crate-type attribute we tell the Rust compiler what kind of artifact to generate. We will create a dynamic system library for our example. On my linux computer it will create a *.so file. You can read more about the crate-types here .

Next we update our src/lib.rs file. Here we also declare that we have a src/ksum.rs module.

#[cfg(test)]
mod tests {
    #[test]
    fn it_works() {
    }
}

pub mod ksum;

use std::ffi::CStr;
use std::os::raw::{c_uint, c_char};

#[no_mangle]
pub unsafe extern "C" fn sum(a: c_uint, b: c_uint) -> c_uint {
    println!("{}, {}", a, b);
    a + b
}

#[no_mangle]
pub unsafe extern "C" fn onbytes(bytes: *const c_char) {
    let b = CStr::from_ptr(bytes);
    println!("{}", b.to_str().unwrap())
}

We have various types which can help us to handle the data coming from the C code. We also have two unsafe functions, the first is sum , where we are accepting two integers, and returning the addition of those values. We are also printing the integers just for our learning purpose.

We also have a onbytes function, in which we will take a Python bytes input, and just print it on the STDOUT. Remember this is just an example, so feel free to make changes and learn more :). The CStr::from_ptr function helps us with converting raw C string to a safe C string wrapper in Rust. Read the documentation for the same to know more.

All of the functions also have the no_mangle attribute so that the Rust compiler does not mangle their names, which makes it possible to use them from C code. Marking the functions extern is what exposes them across the Rust FFI boundary. At this point you should be able to build the Rust project with the cargo build command.

Writing the Python code

Next we create a build.py file on the top directory, this will help us with CFFI. We will also need our C header file with proper definitions in it, include/liblearn.h

#ifndef LIBLEARN_H_INCLUDED
#define LIBLEARN_H_INCLUDED

unsigned int sum(unsigned int a, unsigned int b);
void onbytes(const char *bytes);

#endif

The build.py

import sys
import subprocess
from cffi import FFI

def _to_source(x):
    if sys.version_info >= (3, 0) and isinstance(x, bytes):
        x = x.decode('utf-8')
    return x

ffi = FFI()
ffi.cdef(_to_source(subprocess.Popen([
    'cc', '-E', 'include/liblearn.h'],
    stdout=subprocess.PIPE).communicate()[0]))
ffi.set_source('liblearn._sumnative', None)

Feel free to consult the CFFI documentation to learn things in depth. If you want to convert Rust Strings to Python and return them, I would suggest you to have a look at the unpack function .

The actual Python module source

We have the liblearn/__init__.py file, which holds the actual code for the Python extension module we are writing.

import os
from ._sumnative import ffi as _ffi

_lib = _ffi.dlopen(os.path.join(os.path.dirname(__file__), '_liblearn.so'))

def sum(a, b):
    return _lib.sum(a, b)

def onbytes(word):
    return _lib.onbytes(word)

setup.py file

I am copy pasting the whole setup.py below. Most of it is self explanatory. I also kept the original comments which explain various points.

import os import sys import shutil import subprocess try: from wheel.bdist_wheel import bdist_wheel except ImportError: bdist_wheel = None from setuptools import setup, find_packages from distutils.command.build_py import build_py from distutils.command.build_ext import build_ext from setuptools.dist import Distribution # Build with clang if not otherwise specified. if os.environ.get('LIBLEARN_MANYLINUX') == '1': os.environ.setdefault('CC', 'gcc') os.environ.setdefault('CXX', 'g++') else: os.environ.setdefault('CC', 'clang') os.environ.setdefault('CXX', 'clang++') PACKAGE = 'liblearn' EXT_EXT = sys.platform == 'darwin' and '.dylib' or '.so' def build_liblearn(base_path): lib_path = os.path.join(base_path, '_liblearn.so') here = os.path.abspath(os.path.dirname(__file__)) cmdline = ['cargo', 'build', '--release'] if not sys.stdout.isatty(): cmdline.append('--color=always') rv = subprocess.Popen(cmdline, cwd=here).wait() if rv != 0: sys.exit(rv) src_path = os.path.join(here, 'target', 'release', 'libliblearn' + EXT_EXT) if os.path.isfile(src_path): shutil.copy2(src_path, lib_path) class CustomBuildPy(build_py): def run(self): build_py.run(self) build_liblearn(os.path.join(self.build_lib, *PACKAGE.split('.'))) class CustomBuildExt(build_ext): def run(self): build_ext.run(self) if self.inplace: build_py = self.get_finalized_command('build_py') build_liblearn(build_py.get_package_dir(PACKAGE)) class BinaryDistribution(Distribution): """This is necessary because otherwise the wheel does not know that we have non pure information. """ def has_ext_modules(foo): return True cmdclass = { 'build_ext': CustomBuildExt, 'build_py': CustomBuildPy, } # The wheel generated carries a python unicode ABI tag. We want to remove # this since our wheel is actually universal as far as this goes since we # never actually link against libpython. Since there does not appear to # be an API to do that, we just patch the internal function that wheel uses. if bdist_wheel is not None: class CustomBdistWheel(bdist_wheel): def get_tag(self): rv = bdist_wheel.get_tag(self) return ('py2.py3', 'none') + rv[2:] cmdclass['bdist_wheel'] = CustomBdistWheel setup( name='liblearn', version='0.1.0', url='http://github.com/kushaldas/liblearn', description='Module to learn writing Python extensions in rust', license='BSD', author='Kushal Das', author_email='kushaldas@gmail.com', packages=find_packages(), cffi_modules=['build.py:ffi'], cmdclass=cmdclass, include_package_data=True, zip_safe=False, platforms='any', install_requires=[ 'cffi>=1.6.0', ], setup_requires=[ 'cffi>=1.6.0' ], classifiers=[ 'Intended Audience :: Developers', 'License :: OSI Approved :: BSD License', 'Operating System :: OS Independent', 'Programming Language :: Python', 'Topic :: Software Development :: Libraries :: Python Modules' ], ext_modules=[], distclass=BinaryDistribution ) Building the Python extension $ python3 setup.py build running build running build_py creating build/lib creating build/lib/liblearn copying liblearn/__init__.py -> build/lib/liblearn Finished release [optimized] target(s) in 0.0 secs generating cffi module 'build/lib/liblearn/_sumnative.py' running build_ext

Now we have a build directory. We go inside of the build/lib directory, and try out the following.

$ python3
Python 3.5.2 (default, Sep 14 2016, 11:28:32)
[GCC 6.2.1 20160901 (Red Hat 6.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import liblearn
>>> liblearn.sum(12,30)
12, 30
42
>>> b = "Kushal in bengali ".encode("utf-8")
>>> liblearn.onbytes(b)
Kushal in bengali

This post is only about how to start writing a new extension. My knowledge with Rust is very minimal. In future I will write more as I learn. You can find all the source files in github repo .

Thank you Siddhesh , and Armin for reviewing this post.


Setting Up a Python Development Environment in Visual Studio Code


There are plenty of environments for developing Python. I had already set up a Python environment in VS2013, but VS2013 uses too much memory every time it starts (it is that powerful). Now that VS Code is out, both lightweight and slick, it is the perfect excuse to try it for Python development. See the Visual Studio Code 1.9 installation tutorial.

Here are the steps to set up the Python environment:

1. Open VS Code, press F1 or Ctrl+Shift+P to open the command palette, and type ext install



Type Python and choose the first result; it is the most widely used extension and supports code auto-completion and more. Click the Install button to install it.



Now let's try writing a Python program.



Find the path where Python is installed (see the Python installation tutorial).



Point VS Code at that Python path.



Advanced Drawing with Python's turtle Module (Continued)

4. Filling with color

The color function takes three arguments. The first specifies how much red, the second how much green, and the third how much blue. For example, to get the bright red of the car we used color(1, 0, 0), i.e. a pen that is 100% red.

This mix of red, green, and blue is called RGB (Red, Green, Blue). Because red, green, and blue are the primary colors of light, any color can be mixed by changing the proportions of the three.

Although we are not mixing paint on the screen (we are mixing light!), you can picture the RGB scheme as three buckets of paint: one red, one green, one blue. Each bucket is full, and a full bucket counts as 1 (100%). Pour all of the red and all of the green together into a big vat and you get yellow.

Now let's try to draw a yellow circle with the turtle, using 100% of the red and green paint and no blue.

The result:

>>> t.color(1, 1, 0)    # 100% red, 100% green, 0% blue
>>> t.begin_fill()      # fill the shape that follows
>>> t.circle(50)
>>> t.end_fill()        # fill the circle with the RGB color
(a yellow circle)
1. A function that draws a filled circle

To make it easier to experiment with different colors, let's turn the circle-filling code into a function:

>>> def mycircle(red, green, blue):
        t.color(red, green, blue)
        t.begin_fill()
        t.circle(50)
        t.end_fill()

We can draw a very bright green circle using only green, as in figure A:

>>> mycircle(0,1,0)

Or use half of the green (0.5) to draw a dark green circle, as in figure B:

>>> mycircle(0, 0.5, 0)

(Figures A and B: the bright green and dark green circles)
2. Pure white and pure black

When the sun goes away, the world turns dark (assuming lamps have not been invented yet). By analogy, setting all three colors to 0 is like having no light at all, so everything drawn is black; setting all three to 1 gives white.

>>> mycircle(0,0,0)

5. A function that draws squares

>>> def mysquare(size):
        for x in range(1, 5):
            t.forward(size)
            t.left(90)

>>> mysquare(25)
>>> mysquare(50)
>>> mysquare(75)
>>> mysquare(100)
>>> mysquare(125)

The result:


6. Drawing a filled square

To fill a square, first reset the canvas, start the fill, and then call the square function:

>>> t.reset()
>>> t.begin_fill()
>>> mysquare(50)
>>> t.end_fill()   # until the fill ends you will only see an empty square

The result:



We can now change the function so that it can draw either a filled or an unfilled square.

>>> def mysquare(size, filled):
        if filled == True:
            t.begin_fill()
        for x in range(1, 5):
            t.forward(size)
            t.left(90)
        if filled == True:
            t.end_fill()

Now we can draw a filled square:

>>> mysquare(50,True)

followed by an unfilled one:

>>> mysquare(150, False)

7. Drawing a filled star

Now let's write a mystar function:

>>> def mystar(size, filled):
        if filled == True:        # check whether filled is True
            t.begin_fill()        # if so, start filling
        for x in range(1, 19):
            t.forward(size)
            if x % 2 == 0:
                t.left(175)
            else:
                t.left(225)
        if filled == True:
            t.end_fill()

Now we can draw a golden star (90% red, 75% green, 0% blue):

>>> t.color(0.9,0.75,0)
>>> mystar(120,True)

The result:



To give the star an outline, change the color to black and draw the star again without filling:

>>> t.color(0,0,0)
>>> mystar(120,False)

The result:


Summary

A very productive afternoon: I forgot to eat and forgot about games, and just kept drawing with Python. I learned how to use the turtle module to draw basic geometric shapes, and how to control the turtle on screen with for loops and if statements. I can also change the color of the turtle's pen and fill the shapes it draws, and I used functions (def) to reuse the drawing code and work more efficiently.

The road ahead is long; keep walking it!

A Unique Slug Generator for Django

Using the [Random String Generator](https://www.codingforentrepreneurs.com/blog/random-string-generator-in-python/), we create unique slugs for any given model.

```
from django.utils.text import slugify

'''
random_string_generator is located here:
http://joincfe.com/blog/random-string-generator-in-python/
'''
from yourapp.utils import random_string_generator

def unique_slug_generator(instance, new_slug=None):
    """
    This is for a Django project and it assumes your instance
    has a model with a slug field and a title character (char) field.
    """
    if new_slug is not None:
        slug = new_slug
    else:
        slug = slugify(instance.title)

    Klass = instance.__class__
    qs_exists = Klass.objects.filter(slug=slug).exists()
    if qs_exists:
        new_slug = "{slug}-{randstr}".format(
            slug=slug, randstr=random_string_generator(size=4))
        return unique_slug_generator(instance, new_slug=new_slug)
    return slug
```

The above assumes that your model at least has the following:

```
class YourModel(models.Model):
    title = models.CharField(max_length=120)
    slug = models.SlugField(blank=True)
```
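A common way to wire this up is a pre_save signal receiver that fills in the slug just before the object is saved. A minimal sketch, assuming the hypothetical yourapp package and YourModel from above, and that unique_slug_generator lives in yourapp.utils next to random_string_generator:

```
from django.db.models.signals import pre_save
from django.dispatch import receiver

from yourapp.models import YourModel             # hypothetical model from above
from yourapp.utils import unique_slug_generator  # assumed location of the helper

@receiver(pre_save, sender=YourModel)
def yourmodel_pre_save(sender, instance, *args, **kwargs):
    # Only generate a slug when one has not been provided already.
    if not instance.slug:
        instance.slug = unique_slug_generator(instance)
```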

BrandPost: Python: High Performance or Not? You Might Be Surprised



The concept of an “accelerated python” is relatively new, and it’s made Python worth another look for Big Data and High Performance Computing (HPC) applications.

Thanks to some Python aficionados at Intel, who have utilized the well-known Intel Math Kernel Library (MKL) under the covers, we can all use an accelerated Python that yields big returns for Python performance without requiring that we change our Python code!

Sure, Python is amazing. But Python is relatively slow because it’s an interpreted (not a compiled) language. We can learn and explore interactively―including doing a “Hello, World!” program interactively:

% python

>>> print("Hello, World.")

Hello, World.

>>> import matplotlib.pyplot as myplt

>>> myplt.plot([3,14,15,92])

>>> myplt.ylabel('hello numbers')

>>> myplt.show()


Why It Works

The reason an “accelerated Python” can be so effective comes from a combination of three factors:

Python has mature and widely used packages and libraries

Computation in Big Data and HPC applications is focused in small parts of the code

MKL is fast

Python has mature and widely used packages and libraries: These libraries can be accelerated, without needing to change our Python code at all. All we have to do is install an accelerated Python. Under the covers, Intel has accelerated NumPy, SciPy, pandas, scikit-learn, Jupyter, matplotlib, and mpi4py. NumPy is a library of routines for operating on N-dimensional arrays. SciPy is a library of fundamental routines for scientific computing, including numerical integration and optimization. Libraries pandas, scikit-learn, and Jupyter provide key routines for Big Data and Machine Learning. Library matplotlib provides data plotting, and mpi4py provides MPI usage.

Computation in Big Data and HPC applications is focused in small parts of the code: Big Data and High Performance Computing (HPC) generally focus most “work” in a few key algorithms, which have been widely studied and supported in libraries, notably NumPy, SciPy, pandas, scikit-learn, Jupyter, matplotlib, and mpi4py.

MKL is fast: Intel’s Math Kernel Library is highly tuned for math, and perfect to accelerate NumPy, SciPy, and other libraries that are already used by many Python applications. Additional capabilities for acceleration come from the Intel Data Analytics Acceleration Library (DAAL) and Intel Threading Building Blocks (TBB).
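A quick, rough way to see what your own NumPy build is linked against, and to time a heavy operation before and after switching to an accelerated distribution (a sketch; the matrix size and any speedup you observe depend on your machine and build):

import timeit
import numpy as np

np.show_config()                  # shows which BLAS/LAPACK (e.g. MKL) NumPy was built against
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
# Time five matrix multiplications; rerun under an MKL-backed NumPy to compare.
print(timeit.timeit(lambda: a.dot(b), number=5))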

Eat Our Cake and Have It Too

Python is a simple language with a straightforward syntax. It’s known for its expressiveness, easy-to-read syntax, large community of users, and an impressive range of libraries. It encourages innovative and incremental programming, which makes it a natural for the sort of trailblazing that new work entails. Data scientists seeking to squeeze information from Big Data have found Python a perfect fit as a result. Now we can have high performance too.

When thinking of the user-friendly nature of Python, we have to ask, “Does user friendly always mean slow?” It turns out―because we most often focus our heavy computations into forms (like matrix algebra) that use high-speed libraries (compiled code that our Python code utilizes)―we can have the best of both worlds: easy to use and fast. Acceleration is a big step forward, and it’s automatic when we install the accelerated distributions from Intel or the Anaconda Cloud. Look for big speedups in scikit-learn and basic operations in NumPy already, and if you stay current with the updates, expect the opportunity for very large speedups (10X or more) for NumPy universal functions (elementwise operations), more in scikit-learn, big speedups (10X or more) for FFT, optimized memory operations for NumPy, caffe, and theano deep learning packages by March 2017. There will be even more as time goes on.

Free and Easy Downloads

You can learn more about, and download, the Intel distribution for Python at https://software.intel.com/intel-distribution-for-python . It’s free (but not completely open source), and it has gained considerable popularity with Python users because of its speed. The Intel packages for accelerating Python performance are also available on the Anaconda Cloud , where the unique packages in the Intel channel on Anaconda Cloud are: distarray , tbb , pydaal (see https://www.continuum.io/sites/default/files/AnacondaIntelFAQFINAL.pdf ).

Click here to download free trial software

Open Source Load Testing Tool Review


(Version 1.0)

Benchmarking & Comparing Open-Source Load Testing Tools

Ragnar Lönn


There are tons of load testing tools, both open- and closed-source. Open-source tools are growing in popularity, and we use mainly open-source software (OSS) at Load Impact, so we thought it might be useful to take a deep look at the available OSS load testing options.

OSS load testing options represent kind of a jungle, and the difference between one tool and the next in terms of usability, performance, feature set or reliability can be enormous. Thus, we’d like to help people find the right tool for their use case.

We have chosen to look at what we consider to be the most popular, open-source load testing tools out there today. This list includes:

Jmeter Gatling Locust The Grinder Apachebench Artillery Tsung Vegeta Siege Boom Wrk

This review contains both my own personal views on the good and bad sides of these load testing tools, and my thoughts based on a round of benchmark testing that gives a sense about relative tool performance.

If you are mainly interested in the results from the benchmarking, they will be posted in a followup article very soon :)

So ― what's the setup?


Setting up the Open Source Load Testing Review

We installed, configured and ran these tools from the command line, plus spent a lot of time trying to extract results from them using shell script and standard Unix tools. Then we have tried to figure out the relative performance of the tools. First, by manually trying to squeeze the best performance out of each one, optimizing configuration parameters for the tool in question, and then by running a benchmark test on all the tools, trying to run them with as similar configuration as possible.

The benchmark numbers are one thing, but the opinions on usability that I give will be colored by my use case. For example, a tool that is hard to run from the command line, or that is hard to get useful results from when you run it that way, will make me frustrated. And when I’m frustrated, I’ll have to whine and complain about it. So that is what this article is all about.

One big caveat: I have not been looking much at the data visualization options included with each tool, so you may want to see this comparison as most useful in an automation setting, or if you are planning to push results into some external data storage and/or visualization system anyway.



Try it yourself: run the Docker image

To make things easy for people, I’ve created a public Docker image that can be used to easily repeat all the tests we have made (or to just run one of the load testing tools without having to install it yourself). Try this:

docker run loadimpact/loadgentest

Or, if you want to be able to simulate extra network delay you should do:

docker run --cap-add=NET_ADMIN loadimpact/loadgentest



You can also build the Docker image yourself, if you clone our Github repo:

git clone https://github.com/loadimpact/loadgentest

At https://github.com/loadimpact/loadgentest you will also find some documentation on how to use the setup, by the way!

So ... Which tool is best?

To go against all common wisdom regarding how to keep visitors reading your article, I’ll go ahead and just say: Gatling! Which is funny, because a month ago I would never have said that, given that Gatling is kind of a modern Jmeter, and I usually hate Java applications.

So Gatling is number one, the best, overall winner, etc. Of course... if you're into Python you may be better off with Locust. Oh yeah, and if you want high performance, Wrk would be a better option. However, Wrk is scriptable in Lua, not Python, and has a callback-based API that is not as nice as Locust's.

If you just need to hit a single URL and nothing else, Apachebench may be the way to go. With Apachebench, you won’t get distracted by all its features because it can literally not do anything but hit one specific URL.

Actually, I think I’ll withdraw that single recommendation. Instead I will say which tools I think have something going for them, and why, and then I’ll show you some benchmarking figures and let you decide for yourself, because one size really doesn’t fit all here.

The top 3 open source load testing tools
Gatling ( http://gatling.io )

Apart from maybe its text output to stdout (which is as messy as Locust’s), Gatling is a very nice load testing tool. Its performance is not fantastic, but good enough. It has a consistent design where things actually make a little bit of sense, and the documentation is very good.

It has a DSL (domain specific language) more or less equal to what Jmeter and Tsung offer. However, while Jmeter and Tsung use XML with their specific tags to implement stuff like loops, Gatling lets you define Scala classes that offer similar functionality but that are a lot more readable.

Yes, Scala . Apparently everyone reacts the same way upon hearing that. Gatling’s documentation specifically tells you not to panic, which of course made me panic, but once I realized I couldn’t get out of this and bothered to read the docs and look at the examples I saw that it’s not tricky at all.

My initial assumption was that Gatling would be as clunky to use as Jmeter since it’s a Java app and with DSL functionality similar to Jmeter. But I have now seen the error of my ways and while I still very much prefer to use a “real,” dynamic scripting language to define what happens in a load test, Gatling’s DSL is perhaps the second best (to a real, dynamic language) thing around. Here is how it can look:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class GatlingSimulation extends Simulation {

val httpConf = http

.baseURL("http://myhost.mydomain.com")

.disableCaching

.acceptHeader("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")

.userAgentHeader("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8

Choropleth Maps in Python


Choropleth maps are a great way to represent geographical data. I have done a basic implementation of two different data sets. I have used jupyter notebook to show the plots.

World Power Consumption 2014

First do Plotly imports

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

The next step is to fetch the dataset; we'll use the pandas library to read the CSV file.

import pandas as pd
df = pd.read_csv('2014_World_Power_Consumption')

Next, we create the data and layout variables, each of which is a dict:

data = dict(type='choropleth', locations = df['Country'], locationmode = 'country names',
            z = df['Power Consumption KWH'], text = df['Country'],
            colorbar = {'title':'Power Consumption KWH'},
            colorscale = 'Viridis', reversescale = True)

Let's make a layout:

layout = dict(title='2014 World Power Consumption',
              geo = dict(showframe=False, projection={'type':'Mercator'}))

Pass the data and layout and plot using iplot

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)
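If you are working outside a notebook, the same figure can also be written to a standalone HTML file with plotly's offline plot function (a small sketch; the output filename is just an example):

from plotly.offline import plot

# Writes an interactive HTML file instead of rendering inline in a notebook.
plot(choromap, filename='2014_world_power_consumption.html', validate=False)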

The output will look like the image below:



Check github for full code.

In next post I will try to make a choropleth for a different data set.

References:

https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp

https://plot.ly/python/choropleth-maps/

These 2 Cheap Bundles of Courses Will Help You Learn Python


Are you thinking about learning to program? python is a fantastic starter language, and we have two huge bundles of courses available on MakeUseOf Deals that’ll get you up and running! It’s the perfect time to learn.

Python Programming Bootcamp ($39)

This bundle is gigantic! It comes with course after course of useful content. Don't miss out if you want to get a firm grasp on all things Python!

We need to kick it off with a course that will teach you all of the basics, and one that will do it without being too intimidating. That's why there's the class called A Gentle Introduction to Python Programming . It will see you learning by starting with the basic concepts before moving on to the harder stuff. You'll find that it's a perfect bridge to lead you into the rest of the content.

Next, we go to Python Programming: The Step-by-Step Python Coding Guide . Over 30,000 students have taken this course as their first, so it’s another great jumping off point as you move on in your Python programming journey. You’ll learn about editors to write code in, performing math, capturing user input, and much more.



Python Made Easy: The Complete Python Developer Course features 12 hours of training that will expand your Python skills to an even higher level. You'll learn about Python keywords, operators, statements, and expressions. You'll study object-oriented programming, which is useful even in other languages!

You need to actually write some code and learn by doing, which is where Python Tutorial: Python Network Programming Build 7 Apps comes in. As you can guess from its name, you’ll actually be writing seven apps as you make your way through the content, which is really the best way to gain a true understanding of the material.



The ethical hacking field is growing all the time, and it can be quite a lucrative career path if you have the skills. Python For Offensive PenTest: A Complete Practical Course is all about using Python scripts that you create to build white hat hacking tools. You’ll set up your own virtual hacking workplace, learn to counter attacks, and much more.

The final course in the bundle is called Analytics, Machine Learning & NLP in Python . It will teach you Machine Learning, which is the study of pattern recognition and prediction within the field of computer science. Specifically, it’ll teach you how to put Python to use in this field!

Buy: Python Programming Bootcamp

The Perfect Python Programming Bundle ($29)

If you want to go a little cheaper, there’s this bundle, which comes in at $29. It still features four deep courses that’ll teach you lots of valuable stuff, so don’t let the low price fool you.

The bundle starts off with Introduction to JavaScript Programming for Non-Programmers. It's a great course to start with if you're more or less unfamiliar with programming, as it's designed to get you started.



Next, you’ll get into Fundamentals of Operating Systems. It sounds boring, but it’s actually quite a useful class. Understanding how an OS works is pivotal to making programs that make the most of them.

Finally, we get into the Python-specific courses with Python 3 Programming Essentials. It's the starting point for learning to write in the incredibly powerful Python language. It will show you how to write, debug, and execute Python code. You'll learn about the basics of modules and object-oriented programming in Python. Basically, there's a lot of good stuff to learn here!



Building on the last class, you'll get Advanced Python 3 Programming. In case the name wasn't a giveaway, it's all about moving past the easy stuff and really seeing what Python can do. You'll definitely need to have taken the first class (or have some previous experience), but it'll be worth it because you'll learn a ton.

Buy: The Perfect Python Programming Bundle

Go On, Python Master!

Are you ready to write masterful programs in Python? You will be if you snag one of these course bundles! Don’t wait, though, because they are on sale for a limited time, and you don’t want to miss the boat!


Kracekumar Ramaraju: Return Postgres data as JSON in Python


Postgres has supported JSON and JSONB for a couple of years now; support for JSON functions landed in version 9.2. These functions let the Postgres server return JSON-serialized data, which is a handy feature. Consider a case where a Python client fetches 20 records from Postgres: the driver converts the data returned by the server to tuples/dicts/proxies, and the application or web server then converts them back to JSON and sends the result to the client. This case is common in web applications. Not every API fits this pattern, but there is a clear use case.

Postgres Example

Consider two tables, author and book with the following schema.

https://gist.github.com/kracekumar/322a2fd5ea09ee952e8a7720fd386184

The Postgres function row_to_json converts a particular row to JSON. Here is a list of the authors in the table.

https://gist.github.com/kracekumar/3f4bcdd16d080b5a36436370823e0495
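The gist is not inlined here, but the shape of such a query is simple. A sketch run through psycopg2, assuming an author table like the one in the schema gist and a hypothetical local database named library:

import psycopg2

conn = psycopg2.connect(dbname='library')   # hypothetical connection settings
cur = conn.cursor()
# row_to_json serializes each author row on the server side.
cur.execute("SELECT row_to_json(author) FROM author;")
for (author,) in cur.fetchall():
    print(author)   # psycopg2 parses the json type into a Python dict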

That was simple; now let me show a query with an inner join. The book table contains a foreign key to the author table, and when returning a list of books it is useful to include the author name in the result.

https://gist.github.com/kracekumar/eb9f1009743ccb47df2b3a5f078a4444

As you can see, the query construction is verbose: it has an extra select statement compared to a normal query. The idea is simple. First do the inner join, then select the desired columns, and finally convert the row to JSON using row_to_json. row_to_json has been available since version 9.2. The same functionality can be achieved with other functions such as json_build_object in 9.4. You can read more about them in the docs.

Python Example

The Postgres drivers psycopg2 and pg8000 handle the JSON response, but the result is parsed and returned as a tuple/dictionary. That means that if you execute raw SQL, the returned JSON data is converted to a Python dictionary using json.loads. Here is the function that facilitates the conversion in psycopg2 and pg8000.

https://gist.github.com/kracekumar/2d1d0b468cafa5197f5e21734047c46d

The psycopg2 converts returned JSON data to list of tuples with a dictionary.

One way to circumvent the problem is to cast the result as text. The Python drivers don’t parse the text. So the JSON format is preserved.

https://gist.github.com/kracekumar/b8a832cd036b54075a2715acf2086d62
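A minimal psycopg2 sketch of that cast-to-text trick (same hypothetical database as in the earlier sketch):

import psycopg2

conn = psycopg2.connect(dbname='library')   # hypothetical connection settings
cur = conn.cursor()
# The ::text cast stops the driver from parsing the JSON, so the serialized
# string survives intact and can be handed straight to an HTTP response.
cur.execute("SELECT row_to_json(author)::text FROM author;")
print(cur.fetchall())   # a list of 1-tuples, each holding a JSON string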

Look carefully at the printed results: the result is a list of tuples, each containing a plain string.

For SQLAlchemy folks here is how you do it

https://gist.github.com/kracekumar/287178bcb26462a1b34ead4de10f0529

Another way to run the SQL statement is to use the text function.

The other workaround is to unregister the JSON converter. These two lines should do

import psycopg2.extensions as ext ext.string_types.pop(ext.JSON.values[0], None)

Here is a relevant issue in psycopg2.

From Scratch: Implementing Linear Regression with Stochastic Gradient Descent in Python


Optimization lies at the core of many machine learning algorithms. Optimization algorithms are used to find a good set of model parameters for a given training set, and the most common optimization algorithm in machine learning is stochastic gradient descent (SGD).

This tutorial walks you through implementing stochastic gradient descent to optimize a linear regression algorithm in Python. After working through it you will know:

How to estimate linear regression coefficients with stochastic gradient descent

How to make predictions with multivariate linear regression

How to apply linear regression with stochastic gradient descent to make predictions on new data

Description

This section introduces linear regression, stochastic gradient descent, and the wine quality dataset used in this tutorial.

Multivariate linear regression

Linear regression is a technique for predicting real values. Confusingly, problems that require predicting real values are called regression problems.

Linear regression models the data with a straight line through the input and output values. In more than two dimensions this line becomes a plane or a hyperplane. A prediction is made as a combination of the input values.

y = b0 + b1 * x1 + b2 * x2 + ...

The coefficients (b) weight each input attribute (x), and the goal of the learning algorithm is to discover a set of coefficients that yields good predictions (y). These coefficients can be found using stochastic gradient descent.

Stochastic gradient descent

Gradient descent is the process of minimizing a function by following the gradient of a cost function. It requires knowing the form of the cost and its derivative so that, from a given point, we know the direction to move in, for example downhill toward the minimum.

In machine learning, we can use stochastic gradient descent to minimize the error of a model on the training data, performing one evaluation and one update per iteration.

The optimization works by showing the model one training instance at a time: the model makes a prediction for it, the error is calculated, and the model is updated to reduce the error on the next prediction. The procedure is repeated for a fixed number of iterations and can be used to find the coefficients that give the smallest error on the training data. In machine-learning terms, each iteration updates the coefficients (b) with the following equation:

b = b - learning_rate * error * x

where b is the coefficient (weight) being optimized, learning_rate is a learning rate you set by hand (such as 0.01), error is the model's prediction error attributable to that weight on the training example, and x is the input value.

The wine quality dataset

After developing linear regression with stochastic gradient descent, we can apply it to a dataset of wine quality. The dataset contains measurements of 4,898 white wines, including acidity and pH, and the goal is to predict the quality of each wine, on a scale from 0 to 10, from these objective measurements.

The first five rows of the dataset are shown below.

7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6

6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6

8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6

7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6

7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6

All of the data must be normalized to values between 0 and 1, because each attribute has different units and, in turn, a different scale. By predicting the mean value on the normalized dataset (the Zero Rule algorithm), a baseline root mean squared error (RMSE) of 0.148 is achieved.

Details of the dataset are available at the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Wine+Quality

Download the dataset and save it to your current working directory as winequality-white.csv. (Note: the header row at the start of the file must be removed and the ';' separators changed to ',' so the file is in standard CSV format.)
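That preprocessing step only takes a few lines. A small sketch, assuming the downloaded file is ';'-separated with a single header row and saved under the name used in this tutorial:

import csv

# Read the raw UCI file (';'-separated, with a header row), drop the header,
# and rewrite the rows as the comma-separated file the tutorial expects.
with open('winequality-white.csv', 'r') as src:
    rows = list(csv.reader(src, delimiter=';'))[1:]   # [1:] skips the header
with open('winequality-white.csv', 'w', newline='') as dst:
    csv.writer(dst).writerows(rows)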

Tutorial

This tutorial is divided into three parts:

1. Making predictions

2. Estimating coefficients

3. Predicting wine quality

Together these provide the foundation you need to implement and apply linear regression with stochastic gradient descent to your own predictive modeling problems.

1. Making predictions

The first step is to build a function that makes predictions. It is needed both when evaluating candidate coefficients during stochastic gradient descent and after the model is finalized, when we make predictions on test data or new data.

The predict() function below predicts an output value for a row given a set of coefficients.

The first coefficient is always the intercept, also called the bias or b0, because it stands on its own and is not tied to a specific input value.

# Make a prediction with coefficients
def predict(row, coefficients):
    yhat = coefficients[0]
    for i in range(len(row)-1):
        yhat += coefficients[i + 1] * row[i]
    return yhat

We can test this function with a small contrived dataset.

x, y

1, 1

2, 3

4, 3

3, 2

5, 5

A plot of a small portion of the data is shown below:

(Figure: the contrived data for linear regression)

We can also make predictions for this dataset using a set of coefficients prepared in advance. The predict() function is tested below.

# Make a prediction with coefficients
def predict(row, coefficients):
    yhat = coefficients[0]
    for i in range(len(row)-1):
        yhat += coefficients[i + 1] * row[i]
    return yhat

dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
coef = [0.4, 0.8]
for row in dataset:
    yhat = predict(row, coef)
    print("Expected=%.3f, Predicted=%.3f" % (row[-1], yhat))

There is a single input value (x) and two coefficients (b0 and b1). The prediction equation for this problem is:

y = b0 + b1 * x

or, with the specific coefficients we chose by hand:

y = 0.4 + 0.8 * x

Running this function, we get predictions that are reasonably close to the expected output values (y):

Expected=1.000, Predicted=1.200

Expected=3.000, Predicted=2.000

Expected=3.000, Predicted=3.600

Expected=2.000, Predicted=2.800

Expected=5.000, Predicted=4.400

Now we are ready to use stochastic gradient descent to optimize our coefficient values.

2. Estimating coefficients

We can estimate the coefficient values for the training data using stochastic gradient descent. It requires two parameters:

Learning rate: limits the amount each coefficient is corrected each time it is updated.

Epochs: the number of passes over the training data made while updating the coefficients.

These, together with the training data, are the arguments to the function. The function performs three loops:

1. Loop over each epoch

2. Loop over each row of the training data within an epoch

3. Loop over each coefficient and update it for a row within an epoch

As you can see, we update each coefficient for each row of the training data in every epoch. The update is based on the error the model made, calculated as the difference between the prediction made with the candidate coefficients and the expected output value:

error = prediction - expected

There is one coefficient to weight each input attribute, and these are updated in a consistent way, for example:

b1(t+1) = b1(t) - learning_rate * error(t) * x1(t)

The special coefficient at the beginning of the list, also called the intercept or bias, is updated in a similar way, except without an input value, because it is not associated with a specific input:

b0(t+1) = b0(t) - learning_rate * error(t)

Now we can put all of this together. The coefficients_sgd() function below calculates coefficient values for a training set using stochastic gradient descent:

# Estimate linear regression coefficients using stochastic gradient descent
def coefficients_sgd(train, l_rate, n_epoch):
    coef = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            yhat = predict(row, coef)
            error = yhat - row[-1]
            sum_error += error**2
            coef[0] = coef[0] - l_rate * error
            for i in range(len(row)-1):
                coef[i + 1] = coef[i + 1] - l_rate * error * row[i]
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
    return coef

In addition, we keep track of the sum of the squared error (a positive value) each epoch so that a helpful progress message can be printed in the outer loop. The complete small example is listed below:

# Make a prediction with coefficients
def predict(row, coefficients):
    yhat = coefficients[0]
    for i in range(len(row)-1):
        yhat += coefficients[i + 1] * row[i]
    return yhat

# Estimate linear regression coefficients using stochastic gradient descent
def coefficients_sgd(train, l_rate, n_epoch):
    coef = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            yhat = predict(row, coef)
            error = yhat - row[-1]
            sum_error += error**2
            coef[0] = coef[0] - l_rate * error
            for i in range(len(row)-1):
                coef[i + 1] = coef[i + 1] - l_rate * error * row[i]
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
    return coef

# Calculate coefficients
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
l_rate = 0.001
n_epoch = 50
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)

We train the model with a learning rate of 0.001 for 50 epochs, i.e. 50 exposures of the coefficients to the entire training dataset. For each epoch, the sum squared error for that epoch is printed, followed by the final set of coefficients:

>epoch=45, lrate=0.001, error=2.650

>epoch=46, lrate=0.001, error=2.627

>epoch=47, lrate=0.001, error=2.607

>epoch=48, lrate=0.001, error=2.589

>epoch=49, lrate=0.001, error=2.573

[0.22998234937311363, 0.8017220304137576]

You can see how the error keeps dropping across epochs. We could probably train for more epochs, or increase how much the coefficients change each epoch by raising the learning rate.

Experiment with it and see what results you get.
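For example, a quick sweep over a few learning rates (reusing the coefficients_sgd() function and the small dataset from the listing above) makes the effect easy to see:

# Compare the coefficients reached after 50 epochs at different learning rates.
for l_rate in (0.0001, 0.001, 0.01):
    coef = coefficients_sgd(dataset, l_rate, 50)
    print(l_rate, coef)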

Now let's apply this algorithm to a real dataset.

3. Predicting wine quality

We will train a linear regression model on the wine quality dataset using stochastic gradient descent. The example assumes that a CSV copy of the dataset, named winequality-white.csv, is in the current working directory.

The dataset is loaded first, string values are converted to numeric, and each column is normalized to the range 0 to 1. This is done with the helper functions load_csv() and str_column_to_float() to load and prepare the data, and dataset_minmax() and normalize_dataset() to normalize it.

We will use k-fold cross-validation to estimate how well the learned model performs on unseen data. This means we will construct and evaluate k models and estimate the performance as the mean model error. Root mean squared error is used to evaluate each model; the helper functions cross_validation_split(), rmse_metric(), and evaluate_algorithm() handle splitting, scoring, and evaluating the generated models.

We train the model with the predict(), coefficients_sgd(), and linear_regression_sgd() functions created earlier. The complete code is listed below:

# Linear Regression With Stochastic Gradient Descent for Wine Quality

from random import seed

from random import randrange

from csv import reader

from math import sqrt

# Load a CSV file

def load_csv(filename):

dataset = list()

with open(filename, 'r') as file:

csv_reader = reader(file)

for row in csv_reader:

if not row:

continue

dataset.append(row)

return dataset

# Convert string column to float

def str_column_to_float(dataset, column):

for row in dataset:

row[column] = float(row[column].strip())

# Find the min and max values for each column

def dataset_minmax(dataset):

minmax = list()

for i in range(len(dataset[0])):
col_values = [row[i] for row in dataset]

value_min = min(col_values)

value_max = max(col_values)

minmax.append([value_min, value_max])

return minmax

# Rescale dataset columns to the range 0-1

def normalize_dataset(dataset, minmax):

for row in dataset:

for i in range(len(row)):

row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Split a dataset into k folds

def cross_validation_split(dataset, n_folds):

dataset_split = list()

dataset_copy = list(dataset)

fold_size = len(dataset) / n_folds

for i in range(n_folds):

fold = list()

while len(fold) < fold_size:

index = randrange(len(dataset_copy))

fold.append(dataset_copy.pop(index))

dataset_split.append(fold)

return dataset_split

# Calculate root mean squared error

def rmse_metric(actual, predicted):

sum_error = 0.0

for i in range(len(actual)):

prediction_error = predicted[i] - actual[i]

sum_error += (prediction_error ** 2)

mean_error = sum_error / float(len(actual))

return sqrt(mean_error)

# Evaluate an algorithm using a cross validation split

def evaluate_algorithm(dataset, algorithm, n_folds, *args):

folds = cross_validation_split(dataset, n_folds)

scores = list()

for fold in folds:

train_set = list(folds)

train_set.remove(fold)

train_set = sum(train_set, [])

test_set = list()

for row in fold:

row_copy = list(row)

test_set.append(row_copy)

row_copy[-1] = None

predicted = algorithm(train_set, test_set, *args)

actual = [row[-1] for row in fold]

rmse = rmse_metric(actual, predicted)

scores.append(rmse)

return scores

# Make a prediction with coefficients

def predict(row, coefficients):

yhat = coefficients[0]

for i in range(len(row)-1):

yhat += coefficients[i + 1] * row[i]

return yhat

# Estimate linear regression coefficients using stochastic gradient descent

def coefficients_sgd(train, l_rate, n_epoch):

coef = [0.0 for i in range(len(train[0]))]

for epoch in range(n_epoch):

for row in train:

yhat = predict(row, coef)

error = yhat - row[-1]
coef[0] = coef[0] - l_rate * error

for i in range(len(row)-1):

coef[i + 1] = coef[i + 1] - l_rate * error * row[i]

# print(l_rate, n_epoch, error)

return coef

# Linear Regression Algorithm With Stochastic Gradient Descent

def linear_regression_sgd(train, test, l_rate, n_epoch):

predictions = list()

coef = coefficients_sgd(train, l_rate, n_epoch)

for row in test:

yhat = predict(row, coef)

predictions.append(yhat)

return(predictions)

# Linear Regression on wine quality dataset

seed(1)

# load and prepare data

filename = 'winequality-white.csv'

dataset = load_csv(filename)

for i in range(len(dataset[0])):

str_column_to_float(dataset, i)

# normalize

minmax = dataset_minmax(dataset)

normalize_dataset(dataset, minmax)

# evaluate algorithm

n_folds = 5

l_rate = 0.01

n_epoch = 50

scores = evaluate_algorithm(dataset, linear_regression_sgd, n_folds, l_rate, n_epoch)

print('Scores: %s' % scores)

print('Mean RMSE: %.3f' % (sum(scores)/float(len(scores))))

A k value of 5 was used for cross-validation, giving each fold 4898/5 = 979.6 (just under 1000) records to be evaluated on each iteration. A learning rate of 0.01 and 50 training epochs were chosen after a little experimentation.

You can try your own configurations and see whether you can beat my score.

Running this example prints a score for each of the 5 cross-validation folds and then the mean RMSE. We can see that the RMSE (on the normalized dataset) is 0.126, which beats the baseline of 0.148 obtained by simply predicting the mean (the Zero Rule algorithm).

Scores: [0.12259834231519767, 0.12733924130891316, 0.12610773846663892, 0.1289950071681572, 0.1272180783291014]

Mean RMSE: 0.126

Extensions

Here are some extension exercises you can think about and try:

Tune the example. Tune the learning rate, the number of epochs, and even the way the raw data is prepared, and see whether you can improve the final result.

Batch stochastic gradient descent. Change the algorithm so that updates are accumulated across each epoch and the coefficients are updated in a single batch only at the end of the epoch; a small sketch of this variant appears after this list.

Additional regression problems. Apply the technique to other regression problems in the UCI machine learning repository.

Will you explore any of these extensions?
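A minimal sketch of that batch variant, reusing the predict() function from this tutorial (updates are summed over the whole epoch and applied once at the end of each epoch; the learning rate may need to be smaller since each update now aggregates every row):

# Batch variant of coefficients_sgd(): accumulate the updates over a whole
# epoch and apply them to the coefficients once per epoch.
def coefficients_batch(train, l_rate, n_epoch):
    coef = [0.0 for _ in range(len(train[0]))]
    for epoch in range(n_epoch):
        updates = [0.0 for _ in range(len(coef))]
        for row in train:
            error = predict(row, coef) - row[-1]
            updates[0] += error
            for i in range(len(row) - 1):
                updates[i + 1] += error * row[i]
        # One update per coefficient per epoch instead of one per row.
        for i in range(len(coef)):
            coef[i] = coef[i] - l_rate * updates[i]
    return coef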

Review

In this tutorial you saw how to implement multivariate linear regression with stochastic gradient descent in Python, including:

How to make predictions for a multivariate linear regression problem

How to optimize a set of coefficients using stochastic gradient descent

How to apply the method to a real regression predictive modeling problem

From Scratch: Implementing the Random Forest Algorithm in Python


High variance makes decision trees fragile with respect to the specific training data they are built from. Bagging (short for bootstrap aggregating) builds an ensemble from samples of the training data and can effectively reduce the variance of decision trees, but the trees remain highly correlated, which is not the ideal state for them to be in.

Random forest is an extension of bagging. Besides building each model from a sample of the training data, it also constrains the features that may be used to construct each tree, forcing the trees to differ from one another and thereby improving performance.

This tutorial explores how to implement the random forest algorithm in Python. After working through it you will know:

the difference between bagged decision trees and the random forest algorithm;

how to construct bagged decision trees with more variance;

how to apply the random forest algorithm to a predictive modeling problem.

Algorithm description

This section briefly describes the random forest algorithm itself and the sonar dataset used in the experiments in this tutorial.

The random forest algorithm

Every step of building a decision tree involves a greedy selection of the best split point in the dataset.

This mechanism makes unpruned decision trees prone to high variance. Aggregating the predictions of multiple trees, each built from a different sample of the training data (a different view of the problem), can stabilize and reduce this variance. The approach is called bootstrap aggregating, or bagging for short. Its limitation is that the same greedy algorithm builds every tree, so each tree may choose the same or very similar split points, making the trees similar to one another (correlated); in turn their predictions are similar, which undermines the variance reduction we set out to achieve.

We can force the decision trees to differ by limiting the features the greedy algorithm may evaluate at each split point while a tree is being built. This is the random forest algorithm.

Like bagging, random forest draws samples of the training set and trains a tree on each. The difference is that at each split point, where the data is split and added to the growing tree, only a fixed random subset of the attributes may be considered.

For classification problems, the kind of problem we explore in this tutorial, the number of attributes considered for a split is limited to the square root of the number of input features, as follows:

num_features_for_split = sqrt(total_input_features)

This small change makes the resulting decision trees different from one another (decorrelated), which makes their predictions more diverse. A diverse combination of predictions tends to perform better than a single decision tree or bagging alone.
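For the sonar dataset used below, which has 60 input columns plus the class label, that works out as follows (a quick check):

from math import sqrt

# 60 input features in the sonar dataset; the class label is not counted.
n_features = int(sqrt(60))
print(n_features)   # 7 features are considered at each split point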

声纳数据集(Sonar dataset)

我们将在本教程里使用声纳数据集作为输入数据。这是一个描述声纳反射到不同物体表面后返回的不同数值的数据集。60 个输入变量表示声纳从不同角度返回的强度。这是一个二元分类问题(binary classification problem),要求模型能够区分出岩石和金属柱体的不同材质和形状,总共有 208 个观测样本。

该数据集非常易于理解:每个变量都是连续的,且大致都在 0 到 1 的标准范围之内,便于数据处理。作为输出变量,字符串 'M' 表示金属矿物质,'R' 表示岩石,二者需分别转换成整数 1 和 0。

通过预测数据集(M 或者金属矿物质)中拥有最多观测值的类,零规则算法(Zero Rule Algorithm)可实现 53% 的精确度。
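作为参照,下面给出零规则分类基准的一个极小示意:总是预测训练集中出现次数最多的类别。函数名 zero_rule_classification 为假设命名,这段代码并非原教程内容,仅用于说明上面 53% 这一基准是如何得到的。

# 零规则(Zero Rule)分类基准:总是预测训练集中最常见的类别(示意代码)
def zero_rule_classification(train, test):
    outcomes = [row[-1] for row in train]
    most_common = max(set(outcomes), key=outcomes.count)
    return [most_common for _ in test]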

更多有关该数据集的内容可参见 UCI Machine Learning repository:https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)

免费下载该数据集,将其命名为 sonar.all-data.csv,并存储到需要被操作的工作目录当中。

教程

此次教程分为两个步骤。

1. 分裂点的计算。

2. 声纳数据集案例研究

这些步骤能为你在自己的预测建模问题上实现和应用随机森林算法打下基础。

1. 分裂点的计算

在决策树中,我们需要找到某个属性及其取值来作为分裂点,使得按该分裂点划分数据所付出的代价(cost)最低。

分类问题的成本函数(cost function)通常是基尼指数(Gini index),即计算由分裂点产生的数据组的纯度(purity)。对于这样二元分类的分类问题来说,指数为 0 表示绝对纯度,说明类值被完美地分为两组。
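下面用一个极小的虚构例子示意这种基尼指数的计算方式。它假设 gini_index 按本文后面完整代码中的(未加权)版本定义:对每个类别、每个分组累加 proportion * (1 - proportion)。

# 两个分组(每个分组是若干行,行的最后一列是类别 0 或 1)
perfect = ([[1, 0], [2, 0]], [[3, 1], [4, 1]])   # 每组只含一个类别
mixed   = ([[1, 0], [2, 1]], [[3, 0], [4, 1]])   # 每组各类别各占一半
print(gini_index(perfect, [0, 1]))  # 0.0,完美分裂
print(gini_index(mixed, [0, 1]))    # 1.0,即 4 * (0.5 * 0.5)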

从一棵决策树中找到最佳分裂点需要在训练数据集中对每个输入变量的值做成本评估。

在装袋算法和随机森林中,这一过程是在对训练集有放回抽样得到的样本上执行的。随机森林会对输入数据同时做行采样和列采样:行采样采用有放回方式,也就是说同一行可能被选入样本不止一次。

我们可以只随机抽取一部分输入属性来考察,而不是枚举所有输入属性的取值,以此来优化寻找代价最低分裂点的过程。

这个属性子集是随机选取且不放回的,这意味着在寻找代价最低的分裂点时,每个输入属性至多被考察一次。

如下面的代码所示,函数 get_split() 实现了上述过程。它以一个数据集和待评估的输入特征数量(n_features)作为参数,这里的数据集可以是实际训练集的一个抽样样本。辅助函数 test_split() 用于按候选分裂点分割数据集,函数 gini_index() 则用于评估分裂得到的行组(groups of rows)所对应的代价。

从下面的代码可以看出,特征列表是通过随机选择特征索引生成的。通过枚举这个特征列表,训练集中对应的具体取值会被逐一评估为候选分裂点。

# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    features = list()
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index':b_index, 'value':b_value, 'groups':b_groups}
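下面给出一个极小的调用示意。数据是虚构的小样本,并假设 randrange、test_split、gini_index 均已按完整代码定义;这不是原教程的内容。

# 虚构的小数据集:两列特征 + 类别标签
toy = [[2.7, 2.5, 0], [1.3, 1.8, 0], [3.6, 4.0, 1], [3.3, 3.5, 1]]
node = get_split(toy, 1)  # 每次分裂只随机考察 1 个特征
# 返回形如 {'index': ..., 'value': ..., 'groups': (左组, 右组)} 的字典
print(node['index'], node['value'])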

至此,我们知道该如何改造一棵用于随机森林算法的决策树。我们可将之与装袋算法结合运用到真实的数据集当中。

2. 关于声纳数据集的案例研究

在这个部分,我们将把随机森林算法用于声纳数据集。本示例假定声纳数据集的 csv 格式副本已存在于当前工作目录中,文件名为 sonar.all-data.csv。

首先加载该数据集,将字符串转换成数字,并将输出列从字符串转换成数值 0 和 1. 这个过程是通过辅助函数 load_csv()、str_column_to_float() 和 str_column_to_int() 来分别实现的。

我们将通过 K 折交叉验证(k-fold cross validation)来估计学习模型在未知数据上的表现。这意味着我们将构建并评估 K 个模型,并取这 K 个模型性能的平均值作为估计,每个模型的评估指标是分类准确度。辅助函数 cross_validation_split()、accuracy_metric() 和 evaluate_algorithm() 分别实现了上述功能。

装袋部分基于分类与回归树(CART)算法实现。辅助函数 test_split() 将数据集分割成不同的组;gini_index() 评估每个分裂点;前文提及的改进版 get_split() 函数用来选取分裂点;函数 to_terminal()、split() 和 build_tree() 用于创建单棵决策树;predict() 用一棵决策树做预测;subsample() 为训练集建立子样本;bagging_predict() 用一组决策树做预测。

新命名的函数 random_forest() 首先从训练集的子样本中创建决策树列表,然后对其进行预测。

正如我们开篇所说,随机森林与决策树关键的区别在于前者在建树的方法上的小小的改变,这一点在运行函数 get_split() 得到了体现。

完整的代码如下:

# Random Forest Algorithm on Sonar Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)  # 取整,保证在 Python 3 下每折大小为整数
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini

# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    features = list()
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index':b_index, 'value':b_value, 'groups':b_groups}

# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, n_features, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left, n_features)
        split(node['left'], max_depth, min_size, n_features, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right, n_features)
        split(node['right'], max_depth, min_size, n_features, depth+1)

# Build a decision tree
def build_tree(train, max_depth, min_size, n_features):
    root = get_split(train, n_features)  # 在传入的训练样本上选取根节点分裂点
    split(root, max_depth, min_size, n_features, 1)
    return root

# Make a prediction with a decision tree
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio):
    sample = list()
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

# Make a prediction with a list of bagged trees
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)

# Random Forest Algorithm
def random_forest(train, test, max_depth, min_size, sample_size, n_trees, n_features):
    trees = list()
    for i in range(n_trees):
        sample = subsample(train, sample_size)
        tree = build_tree(sample, max_depth, min_size, n_features)
        trees.append(tree)
    predictions = [bagging_predict(trees, row) for row in test]
    return(predictions)

# Test the random forest algorithm
seed(1)
# load and prepare data
filename = 'sonar.all-data.csv'
dataset = load_csv(filename)
# convert string attributes to integers
for i in range(0, len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
max_depth = 10
min_size = 1
sample_size = 1.0
n_features = int(sqrt(len(dataset[0])-1))
for n_trees in [1, 5, 10]:
    scores = evaluate_algorithm(dataset, random_forest, n_folds, max_depth, min_size, sample_size, n_trees, n_features)
    print('Trees: %d' % n_trees)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

这里对代码末尾(# Test the random forest algorithm 注释之后)各项参数的赋值做一个说明。

将 K 赋值为 5 用于交叉验证,得到每个子样本为 208/5 = 41.6,即超过 40 条声纳返回记录会用于每次迭代时的评估。

每棵树的最大深度设置为 10,节点继续分裂所需的最小训练行数为 1。用于创建每棵树的训练样本大小与原始数据集相同,这也是随机森林算法的默认做法。

我们把在每个分裂点需要考虑的特征数设置为总的特征数目的平方根,即 sqrt(60)=7.74,取整为 7。

我们同时评估三种不同数量的树,以展示增加树的数量通常能带来性能提升。

最后,运行这个示例代码将会 print 出每组树的相应分值以及每种结构的平均分值。如下所示:

Trees: 1

Scores: [68.29268292682927, 75.60975609756098, 70.73170731707317, 63.41463414634146, 65.85365853658537]

Mean Accuracy: 68.780%

Trees: 5

Scores: [68.29268292682927, 68.29268292682927, 78.04878048780488, 65.85365853658537, 68.29268292682927]

Mean Accuracy: 69.756%

Trees: 10

Scores: [68.29268292682927, 78.04878048780488, 75.60975609756098, 70.73170731707317, 70.73170731707317]

Mean Accuracy: 72.683%

扩展

本节会列出一些与本次教程相关的扩展内容。大家或许有兴趣一探究竟。

算法调校(Algorithm Tuning)。本文所用的配置参数只是经过简单尝试得到的,并未充分调优。尝试更多的树、不同的特征数量,甚至不同的树结构,都可能进一步改进试验结果。

更多问题。该方法同样适用于其他分类问题;若换用新的代价函数以及新的组合各树预测值的方式,甚至可以用于回归问题。

回顾总结

通过本次教程的探讨,你知道了随机森林算法是如何实现的,特别是:

随机森林与装袋决策树的区别。

如何用决策树生成随机森林算法。

如何将随机森林算法应用于解决实际操作中的预测模型问题。

相关链接:

从头开始:用Python实现带随机梯度下降的线性回归

Python Flask静态目录


在创建了 Flask 项目之后,如果不想使用模板引擎、想做前后端分离的项目,就需要用到静态目录了。Flask 默认的静态目录是 static,也就是说所有的静态文件需要放到 static 文件夹下才能访问到。目录结构如下:

- app.py
- static
  - index.html

要访问 index.html,需要通过 URL:http://localhost:5000/static/index.html

也就是说要加上static路径进行访问。但这样又很不方便,因为要加static的话,访问html的url就显得有些“丑”了。这个时候,可以使用参数 static_url_path ,将其设置为 "" ,就可以跳过 static ,直接以 http://localhost:5000/index.html 访问。

app.py代码如下:

from flask import Flask

# 创建应用,将 static_url_path 设为空字符串
app = Flask(__name__, static_url_path="")

转载请注明出处

http://www.zgljl2012.com/2017/01/04/python-flaskjing-tai-mu-lu/
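在上面代码的基础上加上启动入口,就得到一个可直接运行的最小示例(端口为 Flask 默认的 5000,仅作示意):

from flask import Flask

# static_url_path 设为空字符串后,static 目录下的文件可直接按根路径访问
app = Flask(__name__, static_url_path="")

if __name__ == '__main__':
    # 启动后访问 http://localhost:5000/index.html 即可取到 static/index.html
    app.run(port=5000)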

Python之路【第二十二章】:Django 缓存

缓存

由于 Django 是动态网站框架,每次请求都会到数据库等后端做相应操作,当访问量大时耗时会非常明显。最简单的解决方式是使用缓存:把某个 view 的返回值保存到内存或 Memcached 等缓存中,在设定的时间(例如 5 分钟)内再有人访问时,不再执行 view 中的操作,而是直接从缓存中取出之前保存的内容返回。
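这种「先查缓存、没有再计算并回填」的过程,可以用 Django 的低级缓存 API 粗略示意如下。这只是说明原理的假设性示例(视图名与缓存 key 均为假设),与后文基于配置和装饰器的用法相互独立:

from django.core.cache import cache
from django.http import HttpResponse

def heavy_view(request):
    result = cache.get('heavy_view_result')           # 先尝试从缓存取
    if result is None:
        # 缓存未命中时才真正执行耗时操作(这里用一段假设的计算代替数据库查询)
        result = str(sum(i * i for i in range(100000)))
        cache.set('heavy_view_result', result, 300)   # 缓存 5 分钟(300 秒)
    return HttpResponse(result)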

Django中提供了6种缓存方式:

开发调试
内存
文件
数据库
Memcache缓存(python-memcached模块)
Memcache缓存(pylibmc模块)

1、配置

① 开发配置

# 此为开发调试用,实际内部不做任何操作
# 配置:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.dummy.DummyCache',  # 引擎
        'TIMEOUT': 300,  # 缓存超时时间(默认300,None表示永不过期,0表示立即过期)
        'OPTIONS': {
            'MAX_ENTRIES': 300,    # 最大缓存个数(默认300)
            'CULL_FREQUENCY': 3,   # 缓存到达最大个数之后,剔除缓存个数的比例,即:1/CULL_FREQUENCY(默认3)
        },
        'KEY_PREFIX': '',  # 缓存key的前缀(默认空)
        'VERSION': 1,      # 缓存key的版本(默认1)
        'KEY_FUNCTION': 函数名,  # 生成key的函数(默认函数会生成为:【前缀:版本:key】)
    }
}

# 自定义key
def default_key_func(key, key_prefix, version):
    """
    Default function to generate keys.

    Constructs the key used by all other methods. By default it prepends
    the `key_prefix'. KEY_FUNCTION can be used to specify an alternate
    function with custom key making behavior.
    """
    return '%s:%s:%s' % (key_prefix, version, key)

def get_key_func(key_func):
    """
    Function to decide which key function to use.

    Defaults to ``default_key_func``.
    """
    if key_func is not None:
        if callable(key_func):
            return key_func
        else:
            return import_string(key_func)
    return default_key_func

② 内存配置

# 此缓存将内容保存至内存的变量中
# 配置:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'unique-snowflake',
    }
}
# 注:其他配置同开发调试版本

③ 文件配置

# 此缓存将内容保存至文件
# 配置:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': '/var/tmp/django_cache',
    }
}
# 注:其他配置同开发调试版本

④ 数据库配置

# 此缓存将内容保存至数据库
# 配置:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
        'LOCATION': 'my_cache_table',  # 数据库表
    }
}
# 注:执行创建表命令 python manage.py createcachetable

⑤ Memcache缓存(python-memcached模块 )

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': 'unix:/tmp/memcached.sock',
    }
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': [
            '172.19.26.240:11211',
            '172.19.26.242:11211',
        ]
    }
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': [
            # 权重
            ('172.19.26.240:11211', 1),
            ('172.19.26.242:11211', 15),
        ]
    }
}

⑥ Memcache缓存(pylibmc模块)

# 此缓存使用pylibmc模块连接memcache
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyLibMCCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyLibMCCache',
        'LOCATION': '/tmp/memcached.sock',
    }
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyLibMCCache',
        'LOCATION': [
            '172.19.26.240:11211',
            '172.19.26.242:11211',
        ]
    }
}

2、应用

① 全站使用

使用中间件:请求经过一系列认证等操作后,如果所需内容在缓存中已存在,则由 FetchFromCacheMiddleware 获取缓存内容并返回给用户;在响应返回给用户之前,若缓存中尚不存在该内容,则 UpdateCacheMiddleware 会将其保存至缓存,从而实现全站缓存。

MIDDLEWARE = [
    # 写到最上面
    'django.middleware.cache.UpdateCacheMiddleware',
    # 其他中间件...
    # 写到最下面
    'django.middleware.cache.FetchFromCacheMiddleware',
]

② 单独视图缓存

方式一:

from django.views.decorators.cache import cache_page

@cache_page(60 * 15)
def my_view(request):
    ...

方式二:

from django.views.decorators.cache import cache_page

urlpatterns = [
    url(r'^foo/([0-9]{1,2})/$', cache_page(60 * 15)(my_view)),
]

③ 局部视图使用

a. 引入TemplateTag

{% load cache %}

b. 使用缓存

{% cache 5000 缓存key %}
    缓存内容
{% endcache %}

3、单独视图缓存示例

cache方法处理的请求,都进行缓存10秒

HTML文件:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <h1>{{ ctime }}</h1>
    <h1>{{ ctime }}</h1>
    <h1>{{ ctime }}</h1>
</body>
</html>

配置文件:

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': os.path.join(BASE_DIR, 'cache'),
    }
}

处理文件:

from django.views.decorators.cache import cache_page

@cache_page(10)  # 装饰cache方法
def cache(request):
    import time
    ctime = time.time()
    return render(request, 'cache.html', {'ctime': ctime})

4、局部视图示例

缓存html文件某一部分

HTML文件:

{% load cache %}
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <h1>{{ ctime }}</h1>
    <h1>{{ ctime }}</h1>
    {# 10秒 #}
    {% cache 10 c1 %}
        <h1>{{ ctime }}</h1>
    {% endcache %}
</body>
</html>

配置文件:

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': os.path.join(BASE_DIR, 'cache'),
    }
}

处理文件:

def cache(request):
    import time
    ctime = time.time()
    return render(request, 'cache.html', {'ctime': ctime})

5、全局生效

配置文件:

MIDDLEWARE = [
    'django.middleware.cache.UpdateCacheMiddleware',
    'django.middleware.security.SecurityMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
    'django.middleware.cache.FetchFromCacheMiddleware',
]

其余文件与前面的示例一致,全局缓存的优先级更高。请求流程为:请求经过一系列认证等操作后,如果所需内容在缓存中已存在,则由 FetchFromCacheMiddleware 获取并返回给用户;如果不存在,则继续往下执行 views 函数,最后由 UpdateCacheMiddleware 将结果保存至缓存,从而实现全站缓存。
