
A Detailed Guide to Common PEP 8 Python Coding Conventions


Common PEP 8 Python coding conventions.

Code Layout

Indentation

Use 4 spaces per indentation level.
Inside brackets, either align continuation lines vertically with the opening delimiter (implicit continuation) or use a hanging indent.

EXAMPLE:

# Vertical (implicit) alignment: line up with the opening delimiter
foo = long_function_name(var_one, var_two,
                         var_three, var_four)
# Hanging indent: one extra level of indentation is usually enough
foo = long_function_name(
    var_one, var_two,
    var_three, var_four)
# Hanging indent in a def: add a further level to distinguish the
# arguments from the following block
def long_function_name(
        var_one, var_two, var_three,
        var_four):
    print(var_one)
# Dedent the closing bracket
my_list = [
    1, 2, 3,
    4, 5, 6,
]
result = some_function_that_takes_arguments(
    'a', 'b', 'c',
    'd', 'e', 'f',
)

Bad examples:

# When not using vertical alignment, the first line must not hold arguments.
foo = long_function_name(var_one, var_two,
    var_three, var_four)
# The arguments' hanging indent is indistinguishable from the following block.
def long_function_name(
    var_one, var_two, var_three,
    var_four):
    print(var_one)
# Closing bracket not dedented -- not recommended.
my_list = [
    1, 2, 3,
    4, 5, 6,
    ]
result = some_function_that_takes_arguments(
    'a', 'b', 'c',
    'd', 'e', 'f',
    )

Maximum Line Length

Limit every line to at most 79 characters.
Outside brackets, long lines can generally be continued with a backslash.
Inside brackets, continuation needs no backslash.

EXAMPLE:

# Continuation without brackets: use a backslash
with open('/path/to/some/file/you/want/to/read') as file_1, \
     open('/path/to/some/file/being/written', 'w') as file_2:
    file_2.write(file_1.read())
# Continuation inside brackets: prefer breaking after operators
class Rectangle(Blob):

    def __init__(self, width, height,
                 color='black', emphasis=None, highlight=0):
        if (width == 0 and height == 0 and
                color == 'red' and emphasis == 'strong' or
                highlight > 100):
            raise ValueError("sorry, you lose")
        if width == 0 and height == 0 and (color == 'red' or
                                           emphasis is None):
            raise ValueError("I don't think so -- values are %s, %s" %
                             (width, height))

Blank Lines

Separate top-level function and class definitions with two blank lines.
Separate method definitions inside a class with a single blank line.

EXAMPLE:

# Methods inside a class are separated by one blank line; top-level
# functions and class definitions are separated by two blank lines.
class A(object):

    def method1():
        pass

    def method2():
        pass

    def method3():
        pass

Imports

Each imported module should be on its own line.
Import in the following order, with a blank line between the groups and the modules within each group sorted alphabetically:

standard library
related third-party libraries
local libraries

EXAMPLE:

# Imports sorted alphabetically by module name
import active
import adidas
import create

Bad examples:

# Multiple modules imported on one line
import sys, os, knife
# Not in alphabetical order
import create
import active
import beyond

Strings

Single and double quotes are equivalent, but they must come in matched pairs and must not be mixed. (A common suggestion is double quotes for sentences and single quotes for single words, but this is not mandatory.)

EXAMPLE:

# Single and double quotes behave the same
name = 'JmilkFan'
name = "Hey Guys!"

Whitespace in Expressions and Statements

Avoid whitespace immediately inside brackets.

EXAMPLE:

spam(ham[1], {eggs: 2})

Bad examples:

spam( ham[ 1 ], { eggs: 2 } )
Avoid whitespace before a comma, colon, or semicolon.
EXAMPLE:

if x == 4: print x, y; x, y = y, x

Bad examples:

if x == 4 : print x , y ; x , y = y , x
No whitespace before the opening parenthesis of a function call.

EXAMPLE:

spam(1)
dct['key'] = lst[index]

Bad examples:

spam (1)
dct ['key'] = lst [index]
Don't add extra spaces around assignment (or other) operators just to align them.

EXAMPLE:

x = 1
y = 2
long_variable = 3

Bad examples:

x             = 1
y             = 2
long_variable = 3
Put a single space on each side of these binary operators:
assignment and augmented assignment operators ( = , += , -= , etc.)
comparison operators ( == , < , > , != , <> , <= , >= , in , not in , is , is not )
boolean operators ( and , or , not )

EXAMPLE:

a = b
a or b
# No spaces around '=' when used for a keyword argument
name = get_name(age, sex=None, city='Beijing')

Comments

Block Comments

Block comments usually appear before the code they describe and share its indentation. Each line starts with '#' followed by a single space.

EXAMPLE:

# Have to define the param `args(List)`,
# otherwise the CLI options will be captured when executing `python manage.py server`.
# oslo_config: (args if args is not None else sys.argv[1:])
CONF(args=[], default_config_files=[CONFIG_FILE])
Inline comments (avoid pointless comments)

EXAMPLE:

x = x + 1 # Compensate for border
Docstrings

EXAMPLE:

# Multi-line docstring: capitalize the first word; the closing """ goes on its own line
"""Return a foobang
Optional plotz says to frobnicate the bizbaz first.
"""
# For a one-line docstring, keep the closing """ on the same line.
"""Return a foobang"""

Naming Conventions

Packages and modules:

Package and module names should be short and all lowercase; underscores may join words.

Classes:

Class names use the CapWords (CamelCase) convention.

class MyClass(object):
    pass

Global variables:
Globals should, as far as possible, be used only inside their own module. For modules that may be imported with from moduleName import variableName, use the __all__ mechanism to keep globals from being imported elsewhere, or prefix the global's name with an underscore.

EXAMPLE:

_name = 'name'
Functions:
Function names should be all lowercase, with words separated by underscores.

EXAMPLE:

vcenter_connection = ''
Constants:
Constants are written in all capital letters with underscores separating words, and are usually defined at module level.

EXAMPLE:

MAX_OVERFLOW = ''
TOTAL = 1
Methods and instance variables:
Non-public methods and instance variables start with a single leading underscore.
Two leading underscores are sometimes used to avoid name clashes with subclasses.
Note that if class Foo has an attribute named __a, it cannot be accessed as Foo.__a (a determined user can still reach it via Foo._Foo__a), so double leading underscores are generally used only to avoid name clashes with attributes of base classes.
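A minimal sketch of the name mangling described above:

class Foo(object):
    def __init__(self):
        self._internal = 1  # non-public by convention only
        self.__a = 2        # mangled to _Foo__a

f = Foo()
print(f._internal)  # 1 -- accessible, but flagged as internal
# print(f.__a)      # AttributeError: 'Foo' object has no attribute '__a'
print(f._Foo__a)    # 2 -- the mangled name still works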

Programming Recommendations

Compare with None using is or is not, never ==.

Use is not rather than not ... is; the former reads better.

EXAMPLE:

# Yes
if foo is not None:
# No
if not foo is None:
Use a def statement instead of assigning a lambda to a name; it works better for tracebacks and string representations.

# Yes
def f(x):
    return 2*x
# No
f = lambda x: 2*x

Exception classes should inherit from Exception rather than BaseException.

In Python 2, use raise ValueError('message') instead of raise ValueError, 'message'

(for Python 3 compatibility and easier line continuation).

When catching exceptions, name the specific exception rather than using a bare except Exception; catch "what went wrong" instead of "something happened".

EXAMPLE:

# Yes (catch the specific exception)
try:
    import platform_specific_module
except ImportError:
    platform_specific_module = None
# No (don't catch everything)
try:
    import platform_specific_module
except:
    platform_specific_module = None
Keep the body of a try/except clause as small as possible, so that it doesn't mask unrelated errors.

EXAMPLE:

# Yes
try:
    value = collection[key]
except KeyError:
    return key_not_found(key)
else:
    return handle_value(value)
# No
try:
    return handle_value(collection[key])
except KeyError:
    # Will also catch a KeyError raised inside handle_value(),
    # not just one from the collection lookup
    return key_not_found(key)
When a function or method has no meaningful return value, return None explicitly.

# Yes
def foo():
    return None
# No
def foo():
    return
Use string methods instead of the string module.

Since Python 2.0, string methods are always faster and share the same API as Unicode strings.
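For instance (the string-module form is Python 2 only; Python 3 removed these functions):

# Yes
'hello'.upper()
# No (Python 2 only)
import string
string.upper('hello')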

Use .startswith() and .endswith() instead of string slicing to check prefixes and suffixes.

startswith() and endswith() are cleaner and less error-prone.

EXAMPLE:

# Yes
if foo.startswith('bar'):
# No
if foo[:3] == 'bar':
Use isinstance() instead of comparing object types directly.

EXAMPLE:

# Yes
if isinstance(obj, int):
# No
if type(obj) is type(1):
Empty sequences are falsy:

# Yes
if not seq:
    pass
if seq:
    pass
# No
if len(seq):
    pass
if not len(seq):
    pass
Don't compare boolean values to True or False using ==

# Yes
if greeting:
    pass
# No
if greeting == True:
    pass
if greeting is True:  # Worse
    pass

Thanks for reading; I hope this helps, and thank you all for supporting this site!


Learning Python: Object-Oriented Programming (A Beginner's Primer)


Preface

I've recently been studying object-oriented programming in Python. Having never used another object-oriented language before, I find this part especially exciting, so here is a summary.

Overview

Python supports several programming paradigms: procedural, object-oriented, aspect-oriented (see decorators), and so on.
Procedural: write the code top-down, following the business logic.
Functional: wrap a piece of functionality in a function so it never has to be rewritten; just call the function.
Object-oriented: classify and encapsulate functions, making development "faster, better, stronger...".

The OOP Mindset

The basic philosophy of OOP: the world consists of objects, each with its own behavior and internal state, and the world is built from objects interacting and communicating.
Uniqueness: no two leaves in the world are identical, and likewise no two objects are.
Classification: classes are abstractions of the real world.
The three pillars: encapsulation, inheritance, and polymorphism.

The three pillars of object orientation:

1. Encapsulation

Encapsulation abstracts over a concrete object: certain parts are hidden, invisible from outside the program, and cannot be called externally.

Privatization: restrict certain attributes of a class or function to a given scope so they cannot be called from outside.

In Python, privatization is simple: prefix the name of the attribute (method or data) you want to make private with two underscores.

For example:

class ProtectMe(object):
    def __init__(self):
        self.me = "qiwsir"
        self.__name = "kivi"

    def __python(self):
        print("I love Python.")

    def code(self):
        print("Which language do you like?")
        self.__python()

if __name__ == "__main__":
    p = ProtectMe()
    print(p.me)
    print(p.__name)

# Output:
qiwsir
Traceback (most recent call last):
  File "21102.py", line 21, in <module>
    print(p.__name)
AttributeError: 'ProtectMe' object has no attribute '__name'

As shown, the __name attribute is hidden and cannot be accessed directly.

To read a private attribute, you can use the property decorator:

class ProtectMe(object):
    def __init__(self):
        self.me = "qiwsir"
        self.__name = "kivi"

    @property
    def name(self):
        return self.__name

if __name__ == "__main__":
    p = ProtectMe()
    print(p.name)

# Output:
kivi
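If controlled writes are needed too, a @name.setter can be added; a minimal sketch (the validation rule here is just an illustration):

class ProtectMe(object):
    def __init__(self):
        self.__name = "kivi"

    @property
    def name(self):
        return self.__name

    @name.setter
    def name(self, value):
        # validate before storing the new value
        if not value:
            raise ValueError("name cannot be empty")
        self.__name = value

p = ProtectMe()
p.name = "qiwsir"
print(p.name)  # qiwsir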

So, to use encapsulation you need to:

encapsulate the content somewhere
access the encapsulated content from somewhere

Step 1: encapsulate the content somewhere


self is a formal parameter: when obj1 = Foo('wupeiqi', 18) executes, self is obj1;

when obj2 = Foo('alex', 78) executes, self is obj2.

So the content is actually stored on the objects obj1 and obj2: each object carries its own name and age attributes, laid out in memory roughly as in the diagram below.

[Diagram: how obj1 and obj2 are laid out in memory]

Step 2: access the encapsulated content

Encapsulated content can be accessed in two ways:

directly through the object
indirectly through self
class Role(object):
    ac = None  # class variable

    def __init__(self, name, role, weapon, life_value):
        # initializer
        self.name = name  # instance (member) variables
        self.role = role
        self.weapon = weapon
        self.life_val = life_value

    def buy_weapon(self, weapon):  # define a method
        # self refers to the instance itself
        self.weapon = weapon
        # print("%s is buying [%s]" % (self.name, weapon))

# Turning an abstract class into a concrete object is called instantiation

p1 = Role("sanjiang", 'Police', "B10", 90)  # an instance
t1 = Role("Chunyun", 'Terrorist', "B11", 100)

2. Inheritance

Inheritance in OOP works like inheritance in real life: the child inherits the content of the parent.

class SchoolMember(object):
    # member_nums = 0
    def __init__(self, name, age, sex):
        self.name = name
        self.age = age
        self.sex = sex
        # self.enroll()

    def enroll(self):
        SchoolMember.member_nums += 1
        print("SchoolMember [%s] is enrolled!" % self.name)

    def tell(self):
        print("Hello my name is [%s]" % self.name)


class Teacher(SchoolMember):
    def __init__(self, name, age, sex, course, salary):  # override the parent's __init__
        super(Teacher, self).__init__(name, age, sex)  # inherit (new-style classes)
        # SchoolMember.__init__(self, name, age, sex)  # inherit (old-style classes)
        self.course = course
        self.salary = salary

    def teaching(self):
        print("Teacher [%s] is teaching [%s]" % (self.name, self.course))


class Student(SchoolMember):
    def __init__(self, name, age, sex, course, tuition):
        super(Student, self).__init__(name, age, sex)
        self.course = course
        self.tuition = tuition

    def pay_tuition(self):
        print("ca, student [%s] paying tuition [%s] again" % (self.name, self.tuition))

Summary

That's all for this article. I hope it helps with your study or work; if you have questions, feel free to leave a comment.

Working with Byte/Binary Streams in Python Using the struct Module


Preface

I recently used Python to parse the MNIST dataset in IDX format, which requires reading binary files; I used the struct module for this. Many tutorials online are well written but not very beginner-friendly, so I reorganized my notes into a quick-start guide.

Note: in this tutorial the following four terms are used interchangeably: binary stream, binary array, byte stream, byte array.

Quick Start

When struct converts an integer, a float, or a character stream (character array) into a byte stream (byte array), a format string fmt tells the module the type of the object being converted: for example 'i' for an integer, 'f' for a float, and 's' for an ASCII character.

import struct

def demo1():
    # bin_buf = struct.pack(fmt, buf) packs buf into the binary array bin_buf
    # buf = struct.unpack(fmt, bin_buf) converts the binary array bin_buf back into buf

    # integer -> binary stream
    buf1 = 256
    bin_buf1 = struct.pack('i', buf1)  # 'i' stands for 'integer'
    ret1 = struct.unpack('i', bin_buf1)
    print bin_buf1, ' <====> ', ret1

    # float -> binary stream
    buf2 = 3.1415
    bin_buf2 = struct.pack('d', buf2)  # 'd' stands for 'double'
    ret2 = struct.unpack('d', bin_buf2)
    print bin_buf2, ' <====> ', ret2

    # string -> binary stream
    buf3 = 'Hello World'
    bin_buf3 = struct.pack('11s', buf3)  # '11s' means a 'string' of length 11
    ret3 = struct.unpack('11s', bin_buf3)
    print bin_buf3, ' <====> ', ret3

    # struct -> binary stream
    # Suppose we have a C struct:
    # struct header {
    #     int buf1;
    #     double buf2;
    #     char buf3[11];
    # }
    bin_buf_all = struct.pack('id11s', buf1, buf2, buf3)
    ret_all = struct.unpack('id11s', bin_buf_all)
    print bin_buf_all, ' <====> ', ret_all

The output looks like this:

[Screenshot: demo1 output]

The struct Module in Detail

Main functions

The three most important functions in struct are pack(), unpack(), and calcsize():

# Pack the values into a string according to the given format string
# (effectively a byte stream resembling a C struct)
string = struct.pack(fmt, v1, v2, ...)
# Parse the byte stream `string` according to the given format fmt and return a tuple
tuple = struct.unpack(fmt, string)
# Compute how many bytes of memory the given format fmt occupies
offset = struct.calcsize(fmt)
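For example, calcsize makes the struct sizes and alignment padding visible (the values below assume native alignment on a typical 64-bit platform):

import struct

print struct.calcsize('i')    # 4
print struct.calcsize('id')   # 16: 4-byte int + 4 bytes padding + 8-byte double
print struct.calcsize('=id')  # 12: '=' disables native alignment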

Format strings in struct

The formats supported by struct are listed in the table below:

| Format | C Type             | Python             | Bytes |
|--------|--------------------|--------------------|-------|
| x      | pad byte           | no value           | 1     |
| c      | char               | string of length 1 | 1     |
| b      | signed char        | integer            | 1     |
| B      | unsigned char      | integer            | 1     |
| ?      | _Bool              | bool               | 1     |
| h      | short              | integer            | 2     |
| H      | unsigned short     | integer            | 2     |
| i      | int                | integer            | 4     |
| I      | unsigned int       | integer or long    | 4     |
| l      | long               | integer            | 4     |
| L      | unsigned long      | long               | 4     |
| q      | long long          | long               | 8     |
| Q      | unsigned long long | long               | 8     |
| f      | float              | float              | 4     |
| d      | double             | float              | 8     |
| s      | char[]             | string             | 1     |
| p      | char[]             | string             | 1     |
| P      | void *             | long               |       |

Note 1: q and Q are only meaningful when the machine supports 64-bit operations.
Note 2: each format may be preceded by a number giving a repeat count.
Note 3: s denotes a string of a given length (4s means a 4-character string), while p denotes a Pascal string.
Note 4: P converts a pointer; its size depends on the machine word size (4 bytes on a 32-bit system).

To exchange data with C structs, you also need to account for the fact that some C/C++ compilers use byte alignment (typically 4-byte units on 32-bit systems), and that struct converts using the local machine's byte order by default. The first character of the format string can change the byte order and alignment, as defined below:

| Character | Byte order             | Size and alignment          |
|-----------|------------------------|-----------------------------|
| @         | native                 | native, padded to alignment |
| =         | native                 | standard, no padding        |
| <         | little-endian          | standard, no padding        |
| >         | big-endian             | standard, no padding        |
| !         | network (= big-endian) | standard, no padding        |

Use it as the first character of fmt, e.g. '@5s6sif'.
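A small illustration of the byte-order prefixes (Python 2 string output shown as byte escapes):

import struct

print repr(struct.pack('<i', 1))    # '\x01\x00\x00\x00' little-endian
print repr(struct.pack('>i', 1))    # '\x00\x00\x00\x01' big-endian
print repr(struct.pack('!h', 256))  # '\x01\x00' network byte order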

Summary

That's all for this article. I hope it helps with your study or work; if you have questions, feel free to leave a comment.

Reading, Writing, and Formatting Excel Files in Python with xlrd and xlwt


Preface

Working with Excel in Python mainly involves the xlrd and xlwt libraries: xlrd reads Excel files and xlwt writes them. This article walks through reading, writing, and formatting Excel files with xlrd and xlwt; the detailed steps follow.

Start the script with # -*- coding:utf-8 -*-

1. Confirm that the source workbook exists, then use xlrd to read the value of the first column of every row in the first sheet.

import xlrd, xlwt
import os

assert os.path.isfile('source_excel.xls'), "The timesheet does not exist. Exit..."

book = xlrd.open_workbook('source_excel.xls')
sheet = book.sheet_by_index(0)

for rows in range(sheet.nrows):
    value = sheet.cell(rows, 0).value

2. Use xlwt to write the data read from the source sheet into a new workbook, setting column width and cell formats. Merge a block of cells (rows 0-2, columns 0-8), write the title into it, and apply the previously defined title_style.

This uses write_merge.

wbk = xlwt.Workbook(encoding='utf-8')
sheet_w = wbk.add_sheet('write_after', cell_overwrite_ok=True)
sheet_w.col(3).width = 5000
title_style = xlwt.easyxf('font: height 300, name SimSun, colour_index red, bold on; align: wrap on, vert centre, horiz center;')
sheet_w.write_merge(0, 2, 0, 8, u'这是标题', title_style)

3. When a function needs a global variable, remember the global keyword; otherwise you get: UnboundLocalError: local variable 'xxx' referenced before assignment.

check_num = 0

def check_data(sheet):
    global check_num
    check_num = check_num + 1

4. Write dates and formatted values. The dates read from the sheet look like 2014/4/10; after processing, only the day is kept, collected into a list, joined with commas, and written to the new workbook.

# row2, date_num, and normal_style come from surrounding code not shown here
date_arr = []
date = sheet.cell(row, 2).value.rsplit('/')[-1]
if date not in date_arr:
    date_arr.append(date)
sheet_w.write_merge(row2, row2, 6, 6, date_num, normal_style)
sheet_w.write_merge(row2, row2, 7, 7, ','.join(date_arr), normal_style)

5. When a date read from Excel is in xldate format, use xlrd's xldate_as_tuple to convert it to a date. First check that the cell's ctype really is xldate (3), otherwise it raises an error. The resulting date can then be turned into a string with strftime, e.g. date.strftime("%Y-%m-%d-%H").

from datetime import date, datetime
from xlrd import xldate_as_tuple

if sheet.cell(rows, 3).ctype == 3:
    num = num + 1
    date_value = xldate_as_tuple(sheet.cell_value(rows, 3), book.datemode)
    date_tmp = date(*date_value[:3]).strftime("%d")

6. Finally, save the new workbook:

wbk.save('new_excel.xls')

Summary

That's all for this article. I hope it helps with your study or work; if you have questions, feel free to leave a comment.

Installing the Third-Party Libraries xlrd/xlwt and Reading/Writing Excel Spreadsheets in Python


Preface

I'm sure everyone has run into situations where Excel spreadsheets need processing; doing it by hand is far too tedious. We can solve this with Python using two extensions, xlrd and xlwt.

xlrd and xlwt are third-party Python libraries, so they must be installed separately: download them from the Python package index at https://pypi.python.org/pypi, or install them with easy_install or pip. Below are the installation details and how to read and write Excel spreadsheets.

Writing Excel data with xlwt

Installing xlwt:

$ sudo pip install xlwt

Sample code:

import xlwt
xls = xlwt.Workbook()
sheet = xls.add_sheet('sample')
sheet.write(0, 0, 'netcon')
sheet.write(0, 1, 'conw.net')
xls.save('sample.xls')

This is the simplest possible example: it creates an Excel workbook, adds a sheet named sample, and writes 'netcon' and 'conw.net' into cells A1 and B1.

Reading Excel data with xlrd

Installing xlrd:

$ sudo pip install xlrd

Sample code:

import xlrd
xls = xlrd.open_workbook('sample.xls')
sheet = xls.sheets()[0]
values = sheet.row_values(0)
print(values)

This code uses xlrd to read the spreadsheet created above; the output is:

['netcon', 'conw.net']

Summary

That's all for this article. I hope it helps with your study or work; if you have questions, feel free to leave a comment.

An Example of Multithreaded Port Scanning in Python


This article demonstrates multithreaded port scanning in Python, shared for reference; the details follow.

The program below is Python code for a multithreaded scan of a given host.

#!/usr/bin/env python
# encoding: utf-8
import socket, sys, thread, time

openPortNum = 0
socket.setdefaulttimeout(3)

def usage():
    print '''Usage:
    Scan the port of one IP: python port_scan_multithread.py -o <ip>
    Scan the port of many IPs: python port_scan_multithread.py -m <ip1, ip2, ip3, ip4 ...>
    '''
    print 'Exit'
    sys.exit(1)

def socket_port(ip, PORT):
    global openPortNum
    if PORT > 65535:
        print 'Port scanning beyond the port range, interrupt to scan'
        sys.exit(1)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    result = s.connect_ex((ip, PORT))
    if result == 0:
        print ip, PORT, 'is open'
        openPortNum += 1
    s.close()

def start_scan(IP):
    for port in range(0, 65535 + 1):
        thread.start_new_thread(socket_port, (IP, int(port)))
        time.sleep(0.006)

if __name__ == '__main__':
    t = 0
    if len(sys.argv) < 2 or sys.argv[1] == '-h':
        usage()
    elif sys.argv[1] == '-o':
        ONE_IP = raw_input('Please input ip of scanning: ')
        t = time.time()
        start_scan(ONE_IP)
    elif sys.argv[1] == '-m':
        MANY_IP = raw_input('Please input many ip of scanning: ')
        IP_SEG = MANY_IP.split(',')
        t = time.time()
        for i in IP_SEG:
            start_scan(i)
    print
    print 'total open port is %s, scan used time is: %f ' % (openPortNum, time.time() - t)
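The script above targets Python 2 (thread, raw_input, print statements). A rough Python 3 sketch of the scanning core using concurrent.futures, offered as an assumption-laden alternative rather than a drop-in replacement:

import socket
from concurrent.futures import ThreadPoolExecutor

socket.setdefaulttimeout(3)

def check_port(ip, port):
    # returns the port number if a TCP connection succeeds, else None
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    is_open = (s.connect_ex((ip, port)) == 0)
    s.close()
    return port if is_open else None

def start_scan(ip):
    with ThreadPoolExecutor(max_workers=200) as pool:
        results = pool.map(lambda p: check_port(ip, p), range(1, 65536))
    return [p for p in results if p is not None]

if __name__ == '__main__':
    open_ports = start_scan('127.0.0.1')
    print('total open ports: %d' % len(open_ports))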

Sample run:

[Screenshot: sample scan output]


I hope this article helps with your Python programming.

Simple Examples of String Case Conversion in Python


① Convert all letters to uppercase

# -*- coding:utf-8 -*-

if __name__ == "__main__":
    a = 'hello, world!'
    print(a.upper())

# Output:
HELLO, WORLD!

② Convert all letters to lowercase

# -*- coding:utf-8 -*-

if __name__ == "__main__":
    a = 'HELLO, WORLD!'
    print(a.lower())

# Output:
hello, world!

③ Capitalize the first letter, lowercase the rest

# -*- coding:utf-8 -*-

if __name__ == "__main__":
    a = 'HELLO, WORLD!'
    print(a.capitalize())

# Output:
Hello, world!
④ Capitalize the first letter of every word, lowercase the rest

# -*- coding:utf-8 -*-

if __name__ == "__main__":
    a = 'HELLO, WORLD!'
    print(a.title())

# Output:
Hello, World!

⑤ Check whether all letters are uppercase

# -*- coding:utf-8 -*-

if __name__ == "__main__":
    a = 'HELLO, WORLD!'
    print(a.isupper())

    b = 'hello, world!'
    print(b.isupper())

# Output:
True
False
⑥ Check whether all letters are lowercase

# -*- coding:utf-8 -*-

if __name__ == "__main__":
    a = 'HELLO, WORLD!'
    print(a.islower())

    b = 'hello, world!'
    print(b.islower())

# Output:
False
True
⑦ Check whether every word starts with an uppercase letter

# -*- coding:utf-8 -*-

if __name__ == "__main__":
    a = 'HELLO, WORLD!'
    print(a.istitle())

    b = 'hello, world!'
    print(b.istitle())

    c = 'Hello, World!'
    print(c.istitle())

# Output:
False
False
True

That's all for these simple examples of string case conversion in Python. I hope they serve as a useful reference, and thank you for supporting this site.

Kaggle | Seventeen Classic Examples of Mapping Data with Python and R (with Resources)


A Big Data Digest production; see the end of the article for reposting requirements.

Translation team | 寒小阳 黄念 黄卓君

Author | Megan Risdal

Kaggle users have now created nearly 30,000 kernels on our open data science platform, an astonishing and ever-growing body of reproducible knowledge. I find our code and datasets one of the best places to learn about the latest techniques and libraries in Python and R.


In this post I turn some excellent user kernels into mini tutorials for getting started with mapping the datasets published on Kaggle. You will learn how to lay out and visualize geospatial data in Python and R using several approaches, with real code samples. I also list resources so you can learn about each package highlighted in each tutorial, plus further user analyses for more inspiration.

Foreword

Creating a simple exploratory map no longer requires learning how to manipulate shapefiles or worry about projections. And whether you prefer R or Python, there are quick and easy ways to put your data on a map.

Note: a shapefile is a non-topological vector data format describing the geometry and attributes of spatial features.

Maps in R

For R users, Kaggler Umesh shows that all you need is ggplot2 and Hadley Wickham's maps package, using CDC data published on Kaggle to show which US states have the highest percentage of daily smokers.

Package download link: http://docs.ggplot2.org/current/map_data.html

Creating the map itself is then as familiar as creating any other ggplot visualization.


The final result clearly shows which US states have the most daily smokers.

[Figure: the distribution of smokers and non-smokers across US states.]

Here are some more good resources for working with maps, mapdata, and ggplot2:

Drawing maps in R

Drawing maps in R with the ggplot2 package

Note that you currently cannot use ggmaps inside Kernels; in most cases you cannot do things like call external APIs from our environment.

Maps in Python

For Python users, matplotlib's basemap toolkit is a good starting point for drawing 2D maps. You can read more in the basemap documentation, which has a variety of examples.

Package download link: http://matplotlib.org/basemap/

Many users have written great kernels, but Kaggler Dotman shows how easily basemap can visualize nearly one million Uber trips in New York City:

[Figure: visualizing Uber trip data in New York City.]

For more examples of how basemap in Python can produce effective map visualizations, check out these user kernels:

A map visualization of US broadband and mobile access (by Jesse Lieman-Sifry).

Working with shapefiles using 2014 American Community Survey data (code forked by Phil Butcher).

A choropleth map of crime in South Africa (by Kostya Bahshetsyan).

Interactive Maps

With interactive maps (and interactive data visualization in general), you can limit the display to what you think is most relevant to your broader audience, while still letting users drill down where they want more detail. Here I highlight user-created maps built with Plotly, Leaflet, and Highcharter.

Plotly

In a dataset published by FiveThirtyEight, users can examine the causes of US police deaths going back to 1971. Given the location information, Kaggler Abigail Larion mapped police deaths by state using Python and Plotly. Her code shows how simple it is to create a clean, interactive choropleth of counts (normalized by state population):


[Figure: police deaths across the United States.]

For more examples of interactive choropleth maps with Plotly, check out the detailed code samples on their pages; there are R and Python samples to suit your mapping needs. Following their tutorials, you can also try any other map type with Plotly:

County-level choropleths

Scatter plots on maps

Bubble maps

Lines on maps

Small-multiple maps

Because code plus data is the best way to learn, and because Plotly is popular with Python users on Kaggle, here are a few more good kernels:

Maps of temperature and global-warming analysis

Global life expectancy at birth

UFO reports in the US

Leaflet

Another way to create interactive maps in Kaggle Kernels is Leaflet, an open-source JavaScript library for mobile-friendly interactive maps. There is a great R package for Leaflet that makes it easy to integrate and control Leaflet maps in R. You can read about Leaflet's widgets and how to manipulate their attributes in its tutorials.

A fantastic kernel by Ewen Henderson uses the wonderfully concise Leaflet to examine neighborhood listings and "superhosts" in Airbnb data from Boston.


[Figure: analyzing Airbnb hosts in Boston.]

Not every Leaflet tutorial applies specifically to making maps in Kernels, but these may help you get started:

A one-page quick-start guide

Interactive choropleth maps (a case study)

Using GeoJSON with Leaflet

Highcharter

The Highcharter R package is, as far as I know, fairly new, yet it powers some of the slickest kernels I have seen. As its homepage says, "Highcharter is a R wrapper for the Highcharts Javascript library and its modules"; you can find the documentation there. In another kernel, Ewen Henderson analyzes 2016 polling data published by FiveThirtyEight as a Kaggle dataset and makes Highcharter look remarkably easy to use. Note his fitting choice of Highcharter's FiveThirtyEight theme.


[Figure: (average) Republican vs. Democratic preference in 2016 presidential election polling data.]

For more advanced inspiration, here are further resources:

Inspiring visualizations from Highcharter's showcase

More "highmaps" examples

Animated Maps

Interactive maps are great when you want readers to explore the data at their leisure. If your goal is to tell a particular story, to convey change over time as an extra dimension of the data, or simply to add some compelling drama, you may opt for animation. And yes, you can render animated GIF visualizations in kernels.

One user, pavelevap, used a dataset of recorded historical global temperatures to create a striking animation of average temperatures in cities around the world. As the animation unfolds, you find yourself eagerly hoping more blue circles will appear, which makes pavelevap's visualization and use of basemap quite effective.


[Figure: average annual temperatures of 500 random cities, 1950-2013.]

Other examples of animated maps:

A day in China (Charles Darwin, a Kaggle user; Python).

Animating historical temperature anomalies (Donyoe, R).

Unconventional Maps

Having coordinate data doesn't mean it belongs on a traditional world map. Much of what you've learned here about map-making, interactivity, and animation transfers to a soccer pitch or even to interplanetary space. I'll leave you with these bonus examples of mapping coordinate data:

Exploring event data with martijn (R). This kernel shows not only how to tidy messy XML files, but also how to plot and map events occurring during European soccer matches.


[Figure: locations of goals in the European soccer database.]

Examining Kobe Bryant's shot selection (Arjoonn Sharma, Python). The author shows that the less time remained, the riskier and more distant Kobe's shot attempts became.

[Figure: exploring the timing behind Kobe Bryant's shot selection.]

Mapping the 3D spatial positions of exoplanets with DBenn (R). This kernel shows off Plotly's cool 3D plotting features by visualizing the positions of planets beyond our solar system.

[Figure: plotting exoplanets in 3D space with Plotly.]

Check out the interactive code in that kernel.

So there you have them: seventeen examples of data-mapping techniques. Fork and extend any of these kernels, season them with your own flair, or choose "New Script" or "New Notebook" and put your new map-making skills to work on the 200+ featured datasets published on Kaggle.

Source link:


My experience with type hints and mypy

$
0
0

The CLA bot for the PSF is designed defensively because if the bot accidentally lets a pull request through from someone that has not signed the CLA that could lead to legal troubles. To alleviate any worries I may have about bugs lurking in the code I have made sure that the CLA bot's code is thoroughly tested. I use Travis to make sure that continuous integration is passing , I use Codecov and coverage.py to make sure that there's 100% branch coverage , and the bot does not deploy to Heroku unless CI is passing (aside: thanks to Heroku for donating free hosting to the PSF which I'm taking advantage of for the bot).

But one thing I had not taken advantage of until today to help code defensively is type hints and mypy . I didn't do this from the outset because mypy didn't support async functions when the CLA bot was initially written. But with the advent of variable annotations in python 3.6 and mypy's support for async functions I thought I would see what it was like to add type hints to pre-existing code.

What worked

Following the general approach outlined by the Dropbox team during their Dec 2016 BayPiggies talk, which mirrors what Zulip outlined in October 2016, worked out well. Basically you run mypy with no types to make sure it won't trip over anything, and then you slowly add types, one object at a time. Since mypy only types things that have been given types, you don't have to worry about mypy over-reaching and producing false positives. This gives you a nice iterative process where you don't have to convert all of your code at once.

What didn't work

Unfortunately mypy isn't ready to take full advantage of Python 3.6. Now for most people this won't be a problem, but if you're not aware of this it can trip you up. For instance, even though typing.Collection exists, typeshed doesn't support the class . And because of how mypy is structured, if typeshed doesn't have something from the typing module it will claim it doesn't exist. In the end I was able to work around this by using typing.AbstractSet , but it was a bit frustrating to not get to fully use all the types available in Python 3.6.

You also can't use f-strings in mypy yet (I've been told they're coming). Since mypy has to mirror so much of the Python internals spanning Python 2 & 3 it hasn't had its parser updated yet to handle f-strings. Luckily it's coming, but it would have been nice if support was available when Python 3.6 was released (which is not a criticism since the mypy team has only so much time and their own priorities).

I did end up skipping the type hinting of the test suite to avoid the work. When you're faking things out and using types that you know would not normally be passed in, it leads to a lot of type errors (all of which were legitimate, but I simply did not care). I could have updated the test code to pass the appropriate type, but I was lazy. I also could have loosened the type hints to be more permissive, but I did not think that was the best solution due to my laziness. (It's now an open issue to resolve this.)

Did I get anything out of this?

You can look at the pull request which added type hints . While mypy did find a couple bugs, all of them would have been found by most linters anyway.

What mypy really got me was better documentation. While I was adding the type hints there were a couple of times where I had to examine the code to realize what the appropriate type was. Now that I have the functions and methods all hinted I don't have to guess anymore. That should make long-term maintenance a bit easier. And I don't think the code reads poorly because of the type hints so I don't think there's a penalty there. This is also useful for the CLA bot as it's entirely abstracted out into a few abstract base classes to make swapping out any server that it communicates with easy; having type hints means mypy verifies the type hinting contract between ABCs and their subclasses.

After having gone through the experience, would I bother typing new Python 3 code? My answer is yes once mypy supports f-strings. When I design an API I already have to think about what type of objects would be acceptable, so quickly writing down my assumptions doesn't hurt anything, it's relatively quick, and it benefits anyone having to work with my code. But I also wouldn't contort my code to fit within the confines of type hints (i.e. if type hints forces me to write cleaner code then that's great, but if something is so dynamic that it can't have type hints then that's fine and I'll happily use typing.Any as an escape hatch).

In the end I view type hints as enhanced documentation that has tooling to help verify that the documentation about types is accurate. And for that use-case I see type hints worth doing and not at all a burden.

Full-Text Search for Zhihu Live: Searching with Elasticsearch


Most websites include a search feature. It helps users find what they couldn't otherwise locate, and can even help them discover new interests, which does a great deal for stickiness and user experience. Take Douban as an example: from the main site's search, users can find movies, books, music, users, sites, groups, games, and more.

Traditional database systems are designed for CRUD operations. As we all know, what gets stored in a database is carefully considered: with a large user base and dataset, every extra column means a lot of extra space and extra indexes, and affects performance, so we cannot put all the data in the database. For example, for a diary entry you might store the author, publication time, and title, but not the body text: it takes too much space.

A current best practice is to put frequently and heavily accessed data directly in memory or a key-value store. But to search, you would then have to collate content from the different stores and rank the matches with some algorithm before returning them. Those stores are very likely maintained and developed by different teams, so the network traffic between the products alone is a substantial cost, before you even merge and rank the results from each of them. That clearly doesn't work.

Full-text search software such as Elasticsearch has the following characteristics:

Query speed. Full-text search stores data purely for fast reads; compared with database queries it is much, much faster.
Complex query expressions. Database queries usually support only limited patterns such as AND/OR; full-text search supports far more query types.
Flexible ranking. A database sorts by its built-in rules, and only on fields that have indexes. Full-text search supports those rules, plus ranking by relevance: Elasticsearch ships with advanced features such as text relevance, decay, and analysis, which help search quality enormously. Its built-in Chinese tokenization is poor, but it has a good plugin mechanism, and with a third-party Chinese analyzer plugin you can get very good Chinese search results.

Elasticsearch (ES) is a real-time distributed search and analytics engine with a RESTful API: stable, reliable, fast, and powerful.

I first paid attention to Elasticsearch when GitHub dropped Solr and adopted Elasticsearch for petabyte-scale search; by now ES has become hugely popular. Beyond that, real-time log-analysis platforms based on ELK (Elasticsearch, Logstash, Kibana) are also in wide use.

In the previous article we crawled the Live data we need. Let's look at one record:

In : from models import Live
In : live = Live.get(789840559912009728)
In : live
Out: Live(index='live', doc_type='live', id='789840559912009728')
In : for k, v in vars(live)['_d_'].items():
...:     print('{}: {}'.format(k, '{}{}'.format(v[:30], '...') if isinstance(v, str) and len(v) > 30
...:           else v))
...:
subject: python 工程师的入门和进阶
feedback_score: 4.5
status: False
description: 我是董伟明,豆瓣高级产品开发工程师。从 2011 年开始接触...
speaker_message_count: 208
liked_num: 1134
outline:
starts_at: 2016-12-27 21:00:00
speaker_id: 314
topic_names: Python
seats_taken: 3562
amount: 9.99
tag_names: 互联网

For my particular need, I'm essentially using ES as a NoSQL database.

Suppose a user searches for Lives related to Python: we need to find the word Python in every field ES should cover, and return the matching documents ranked by relevance and other factors.

Ranking

By default ES ranks documents by relevance. A multi-field query looks like this:

In : SEARCH_FIELDS = ['subject', 'description', 'outline', 'tag_names', 'topic_names']
In : lives = s.query('multi_match', query='python', fields=SEARCH_FIELDS).execute()
In : lives[0]._score
Out: 9.952918
In : lives[1]._score
Out: 9.250518

Results ranked higher receive higher scores; ES considers them more relevant.

For many scenarios this is wrong, because in a document like the one above, different fields deserve different weights. Suppose two documents contain the keyword Python the same number of times, but the first one's topic_names is Python and the other's is Ruby; the first should clearly rank higher, because Zhihu topics are a fairly trustworthy classification, while text in the description containing Python doesn't necessarily make it a Python-related Live: the speaker may merely mention having once studied Python. So different fields need different weights. I'm not an algorithms engineer; I just tuned them by my own judgment:

In : from elasticsearch_dsl.query import Q
In : a = Q('multi_match', query='python',
...:       fields=['subject^5', 'outline^2', 'description', 'topic_names^10', 'tag_names^5'])
In : lives = s.query(a).execute()
In : lives[0]._score
Out: 73.56763

You can see the score rose considerably, because I gave subject a weight of 5, outline 2, and so on. Note that topic_names is weighted ten times higher than before, which easily spreads apart the scores of different Lives.

One more point: I used Q here. It differs little from the call above, but as this article goes deeper, wrapping queries in Q makes the code easier to understand.

Score Decay

Weights alone aren't enough, because we also want to give new Lives a chance. If a celebrity held a Live long ago with highly relevant text, high revenue, and good reviews, should that Live stay near or at the top forever? That would badly hurt newcomers' motivation, since latecomers would struggle to get enough exposure. So once a "protection period" ends, a Live's score should keep dropping over time.

ES has built-in support for decay functions. For numeric, date, and geo-point types you can set an ideal value; the more the actual value deviates from it (in either direction), the less it matches expectations and the lower the score.

It supports the following parameters:

origin: the ideal value of the field, which receives the full score (1.0)
offset: values within this distance from the origin also receive the full score
scale: once a value leaves the origin-plus-offset range, its score starts to decay; the scale determines how fast
decay: the score assigned at the distance scale (default 0.5), effectively a cut-off whose exact effect depends on the decay mode

A decay function can use one of three modes: linear, exponential (exp), and Gaussian (gauss), each with a different decay curve. Borrowing the official figure:


[Figure: decay curves of the linear, exp, and gauss functions.]

You can see that:

linear decays along a straight line; anything beyond the zero point scores 0
exp decays quickly at first, then slowly
gauss decays slowly, then quickly, then slowly again

For Live search ranking I chose gauss: I want a Live's score to drop for times more than 7 days before or after it starts.

Let's use an example to see how time decay affects the score.

In : from elasticsearch_dsl.query import Q, SF
In : lives = s.query('multi_match', query='python', fields=SEARCH_FIELDS).execute()
In : live = lives[0]
In : live._score, live.subject, live.starts_at
Out: (9.952918, 'Python 工程师的入门和进阶', datetime.datetime(2016, 12, 27, 21, 0))
In : sf = SF('gauss', starts_at={'origin': 'now', 'offset': '60d', 'scale': '10d'})
In : sf2 = SF('gauss', starts_at={'origin': 'now', 'offset': '7d', 'scale': '10d'})

My Live falls within sf's 60-day offset, but outside sf2's 7-day range. Now watch the time decay affect the score:

In : lives = s.query(Q('function_score', boost_mode='multiply', functions=[sf])).execute()
In : lives[0]._score, lives[0].subject
Out: (10.952918, 'Python 工程师的入门和进阶')
In : lives = s.query(Q('function_score', boost_mode='multiply', functions=[sf2])).execute()
In : lives[0]._score, lives[0].subject
Out: (10.065867, 'Python 工程师的入门和进阶')

boost_mode specifies how the computed score is combined with the original _score:

multiply: multiply the result by _score
sum: add the result to _score
min: take the smaller of the result and _score
max: take the larger of the result and _score
replace: replace _score with the result

Here is the effect of sum:

In : lives = s.query(Q('function_score', boost_mode='sum', functions=[sf])).execute()
In : lives[0]._score, lives[0].subject
Out: (11.952918, 'Python 工程师的入门和进阶')
In : lives = s.query(Q('function_score', boost_mode='sum', functions=[sf2])).execute()
In : lives[0]._score, lives[0].subject
Out: (11.065802, 'Python 工程师的入门和进阶')

Since we boosted the field weights considerably, multiplying makes more sense here.

Data Normalization

I believe Zhihu Live's ranking also considers a Live's revenue (price times number of buyers): as the saying goes, "in a market economy, prices roughly reflect supply and demand", and whether a Live is popular shows in its revenue and price. But even with boosted text weights, scores only reach the tens or hundreds; multiplying directly by revenue is clearly wrong. A niche but excellent topic has a small audience and can't charge much, so its score would end up very low; conversely, a low-quality Live that draws a crowd would rank high purely on revenue. Incidentally, Zhihu Live's annual picks include several Lives rated only 3 or 3.5 stars, which I find hard to fathom after years under Douban's rating value system...

The Lives I've seen range from 2-3k in revenue for a small one to 100k+ for a big one. How do we standardize this metric so that different values become comparable?

That is data normalization: after standardization, the metrics sit in the same order of magnitude and are suitable for combined evaluation.

There are many normalization methods; for simplicity and practicality I chose a logarithmic transform. Here is what the transform looks like, plotted with Jupyter Notebook, Matplotlib, and NumPy:


[Figure: the log-transform curve.]

This maps 2k to about 3.3 and 100k to 5. Revenue still matters, but its influence is greatly compressed.
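A quick check of those numbers with the standard library:

In : import math
In : math.log10(2000)
Out: 3.3010299956639813
In : math.log10(100000)
Out: 5.0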

Let's verify. Before normalization:

In : s = Live.search()
In : s = s.query('multi_match', query='python', fields=SEARCH_FIELDS)
In : lives = s.query(Q('function_score', boost_mode='multiply',
...:                   functions=[SF('script_score',
...:                                 script="doc['seats_taken'].value * doc['amount'].value")])).execute()
In : lives[0]._score
Out: 75663.34

That score... let me sit quietly for a moment.

ES ships with the scripting languages Painless and Groovy, so we can do some simple programming inside ES itself. Python is also supported but requires a separate install; I think it's unnecessary, especially since Painless became the default language in 5.0. Painless has these traits:

supports advanced data structures such as List, Map, and arrays
performance close to Java
built-in regular expressions
supports anonymous functions (lambdas)
simple syntax; developers who know Java, Python, or JavaScript pick it up easily

A side note: when writing scripts, avoid variables that change with every request, such as a user id, because ES caches scripts; generating a different script for each query hurts performance, as the sketch below shows.
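One way to keep a script cacheable is to pass the changing values through params instead of interpolating them into the script source. A sketch along the lines of the queries above (the bonus parameter is invented for illustration):

In : sf = SF('script_score', script={
...:     'lang': 'painless',
...:     'inline': "Math.log10(doc['seats_taken'].value * doc['amount'].value + params.bonus)",
...:     'params': {'bonus': 1}})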

Now let's bring in log10 and look again:

In : sf = SF('script_score', script={'lang': 'painless',
...:         'inline': "Math.log10(doc['seats_taken'].value * doc['amount'].value)"})
In : lives = s.query(Q('function_score', functions=[sf])).execute()
In : lives[0]._score
Out: 14.504177

Much more reasonable.

That's all for today; the next article will cover aggregation analysis.

Mark Needham: Go vs Python: Parsing a JSON response from a HTTP API

$
0
0

As part of a recommendations with Neo4j talk that I’ve presented a few times over the last year I have a set of scripts that download some data from the meetup.com API .

They’re all written in python but I thought it’d be a fun exercise to see what they’d look like in Go. My eventual goal is to try and parallelise the API calls.

This is the Python version of the script:

import requests
import os
import json

key = os.environ['MEETUP_API_KEY']
lat = "51.5072"
lon = "0.1275"
seed_topic = "nosql"
uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(seed_topic, lat, lon, key)
r = requests.get(uri)
all_topics = [topic["urlkey"] for result in r.json()["results"] for topic in result["topics"]]
for topic in all_topics:
    print topic

We’re using the requests library to send a request to the meetup API to get the groups which have the topic ‘nosql’ in the London area. We then parse the response and print out the topics.

Now to do the same thing in Go! The first bit of the script is almost identical:

import ( "fmt" "os" "net/http" "log" "time" ) func handleError(err error) { if err != nil { fmt.Println(err) log.Fatal(err) } } func main() { var httpClient = &http.Client{Timeout: 10 * time.Second} seedTopic := "nosql" lat := "51.5072" lon := "0.1275" key := os.Getenv("MEETUP_API_KEY") uri := fmt.Sprintf("https://api.meetup.com/2/groups?&topic=%s&lat=%s&lon=%s&key=%s", seedTopic, lat, lon, key) response, err := httpClient.Get(uri) handleError(err) defer response.Body.Close() fmt.Println(response) }

If we run that this is the output we see:

$ go run cmd/blog/main.go
&{200 OK 200 HTTP/2.0 2 0 map[X-Meetup-Request-Id:[2d3be3c7-a393-4127-b7aa-076f150499e6] X-Ratelimit-Reset:[10] Cf-Ray:[324093a73f1135d2-LHR] X-Oauth-Scopes:[basic] Etag:["35a941c5ea3df9df4204d8a4a2d60150"] Server:[cloudflare-nginx] Set-Cookie:[__cfduid=d54db475299a62af4bb963039787e2e3d1484894864; expires=Sat, 20-Jan-18 06:47:44 GMT; path=/; domain=.meetup.com; HttpOnly] X-Meetup-Server:[api7] X-Ratelimit-Limit:[30] X-Ratelimit-Remaining:[29] X-Accepted-Oauth-Scopes:[basic] Vary:[Accept-Encoding,User-Agent,Accept-Language] Date:[Fri, 20 Jan 2017 06:47:45 GMT] Content-Type:[application/json;charset=utf-8]] 0xc420442260 -1 [] false true map[] 0xc4200d01e0 0xc4202b2420}

So far so good. Now we need to parse the response that comes back.

Most of the examples that I came across suggest creating a struct with all the fields that you want to extract from the JSON document, but that feels a bit overkill for such a simple script.

Instead we can just create maps of (string -> interface{}) and then apply type conversions where appropriate. I ended up with the following code to extract the topics:

import "encoding/json" var target map[string]interface{} decoder := json.NewDecoder(response.Body) decoder.Decode(&target) for _, rawGroup := range target["results"].([]interface{}) { group := rawGroup.(map[string]interface{}) for _, rawTopic := range group["topics"].([]interface{}) { topic := rawTopic.(map[string]interface{}) fmt.Println(topic["urlkey"]) } }

It’s more verbose that the Python version because we have to explicitly type each thing we take out of the map at every stage, but it’s not too bad. This is the full script:

package main

import (
    "fmt"
    "os"
    "net/http"
    "log"
    "time"
    "encoding/json"
)

func handleError(err error) {
    if err != nil {
        fmt.Println(err)
        log.Fatal(err)
    }
}

func main() {
    var httpClient = &http.Client{Timeout: 10 * time.Second}
    seedTopic := "nosql"
    lat := "51.5072"
    lon := "0.1275"
    key := os.Getenv("MEETUP_API_KEY")
    uri := fmt.Sprintf("https://api.meetup.com/2/groups?&topic=%s&lat=%s&lon=%s&key=%s", seedTopic, lat, lon, key)
    response, err := httpClient.Get(uri)
    handleError(err)
    defer response.Body.Close()

    var target map[string]interface{}
    decoder := json.NewDecoder(response.Body)
    decoder.Decode(&target)

    for _, rawGroup := range target["results"].([]interface{}) {
        group := rawGroup.(map[string]interface{})
        for _, rawTopic := range group["topics"].([]interface{}) {
            topic := rawTopic.(map[string]interface{})
            fmt.Println(topic["urlkey"])
        }
    }
}

Once I’ve got these topics the next step is to make more API calls to get the groups for those topics.

I want to make those API calls in parallel while making sure I don’t exceed the rate limit restrictions on the API and I think I can make use of go routines, channels, and timers to do that. But that’s for another post!

Python Tips and Tricks (Part 3): Functions

0X00 Arbitrary numbers of arguments

A Python function is normally defined like def add(a, b), with a fixed number of parameters. So how can it accept any number of arguments, the way rm 1.txt 2.jpg 3.mp3 4.cpp does? Simple: use * and **. In the code below, the first parameter a receives 'hello,world' while *b receives all remaining arguments as a tuple.

#!/usr/bin/python
# coding=utf-8

def add(a, *b):
    print(a)
    return b

if __name__ == '__main__':
    x = add('hello,world', 2, 3, 4, 5)
    print(x)
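The same trick works for keyword arguments with **, which collects them into a dict; a small sketch:

def add(a, **kwargs):
    print(a)
    return kwargs

print(add('hello', x=1, y=2))  # prints 'hello', then {'x': 1, 'y': 2}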

0X01 Adding annotations

When defining a function in Python you can also attach annotations, which act as reminders when the function is called. You may never get lost in a few-dozen-line script, but when modifying someone else's code or building a large project, such questions are inevitable. Plain comments work too, but annotations are simpler and more convenient. Note that unlike type declarations in C, the types given in annotations carry no real force: passing arguments or returning values that don't match the annotations raises no error.

#!/usr/bin/python
# coding=utf-8

def add(a: int, b: int) -> int:  # declares a and b as int, and an int return value
    return a + b

if __name__ == '__main__':
    print(add(3, 5))

0X02 Default arguments

Many of the built-in functions we use have several optional parameters, yet we don't need to pass each one, because Python lets parameters have default values: if an argument isn't passed, the default is used, as in the add function below.

#!/usr/bin/python
# coding=utf-8

def add(a=3, b=5):
    return a + b

if __name__ == '__main__':
    print(add())           # no arguments: defaults 3 and 5 give 8
    print(add(1))          # a=1, so the result is 6
    print(add(4, 6))       # a=4 and b=6, so the result is 10
    print(add(b=3))        # only b=3 is passed, so the result is 6
    print(add(b=3, a=10))  # keyword arguments may come in any order

0X03 Mini functions: lambda

Calling these "anonymous functions" feels slightly off, since a function defined this way does get a name; it's just extremely small, so "mini function" fits better. Python's lambda keyword defines such a function, with the declaration, return value, and body all on one line. A function like this can't be very powerful, but it does cut down on repetitive code. The first line of the example below defines a function named add with parameters x and y and return value x + y; the general pattern is: name = lambda parameters : return-value. Another example: my_sqrt = lambda x : math.sqrt(x). Note that inside a lambda you cannot use if-else statements, while, try-except, or anything similar; the whole function must fit on one line.

#!/usr/bin/python
# coding=utf-8

if __name__ == '__main__':
    add = lambda x, y: x + y
    print(add(3, 5))
    print(add(2, 7))
    print(add(1, 9))

An Introduction to Multithreading and Multiprocessing in Python (Part 1)


By Yang Dong. Reposting is welcome; please keep this notice. Thanks!

Source: https://andyyoung01.github.io/ or http://andyyoung01.16mb.com/

Python lets us write multithreaded or multiprocess applications with its APIs. This part covers the basic concepts of multithreading and multiprocessing in Python; the next part looks at actual code examples.

Overview of Python Threads and Processes

Python threads and processes are scheduled by the operating system's scheduler. When a thread or process blocks, for example while waiting for I/O, the OS lowers its execution priority, so a CPU core can be assigned to a thread or process that has actual computation to do.

The diagram below shows Python's thread and process architecture:


[Figure: Python thread/process architecture.]

Threads live inside processes. A process can contain multiple threads, but always at least one, called the main thread. Threads within a process share the process's memory, so different threads in one process can communicate simply by referring to shared objects. Separate processes do not share memory, so inter-process communication goes through other interfaces such as files, sockets, or specially allocated shared-memory regions.

When a thread has work to do, it asks the operating system's thread scheduler for some CPU time. The scheduler assigns CPU cores to waiting threads based on various parameters, and its implementation differs between operating systems. Different threads of the same process may run simultaneously on different CPU cores (except under CPython).

Python Threads and the GIL

Python's CPython interpreter (the standard one downloaded from www.python.org) contains a Global Interpreter Lock (GIL), which ensures that only one thread in a Python process can execute at any moment, even when multiple CPU cores are available. So multithreaded CPython programs cannot run in parallel across multiple cores. Even so, a thread blocked waiting on I/O still has its priority lowered by the OS and is parked in the background, so that threads with real computation to do can run. The diagram below sketches this:

[Figure: thread states under the GIL.]

The "Waiting for GIL" state in the diagram is a thread that has finished its I/O and wants to leave the blocked state and resume, but another thread holds the GIL, so the ready thread is forced to wait. In many network applications, far more time is spent waiting for I/O than actually processing data. As long as the number of concurrent connections isn't very large, the per-connection-thread blocking caused by the GIL is relatively low, so for such network services, using threads for concurrent connections is still a suitable architecture.

Comparing Python Processes and Threads

To the operating system, an application is a process: open a browser, that's a process; open a text editor, that's a process. Each process has its own process id, and processes share the operating system's memory resources under its control. The process is the smallest unit of resource allocation.

Within a single process, a video player for example must play video and audio at the same time, so it needs at least two concurrently running "subtasks"; these subtasks are implemented with threads. The thread is the smallest unit of execution. A process may contain many threads, which are independent of each other while sharing the resources the process owns.

In terms of OS resources, a process is more "heavyweight" than a thread: creating a new process takes more time than creating a new thread, and a process uses more memory.

One thing to note, as the sketch below shows: if you need to run a computation-heavy Python program, it is best implemented with multiple processes. If every thread in the program has heavy computation to do, they all need the CPU, but because of the GIL they cannot truly run in parallel on different CPU cores, which severely degrades overall performance.
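A minimal sketch of that advice: for a CPU-bound function, multiprocessing.Pool can use several cores, which threads in CPython cannot:

from multiprocessing import Pool

def cpu_bound(n):
    # deliberately heavy computation
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with Pool(4) as pool:  # 4 worker processes, each with its own GIL
        results = pool.map(cpu_bound, [10 ** 6] * 8)
    print(len(results))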

This part covered the basic concepts of multithreading and multiprocessing in Python; the next part will look at real code examples.

New Release Of Nim Borrows From Python, Rust, Go, and Lisp

$
0
0
An anonymous reader writes:

"Nim compiles and runs fast, delivers tiny executables on several platforms, and borrows great ideas from numerous other languages ," according to InfoWorld. After six years, they write, Nim is finally "making a case as a mix of the best of many worlds: The compilation speed and cross-platform targeting of Go, the safe-by-default behaviors of Rust, the readability and ease of development of python, and even the metaprogramming facilities of the Lisp family..."

Its syntax "might remind you of Python as it uses indented code blocks".

There's an improved output system in the newest release , and both its compiler and library are MIT licensed. Share your thoughts and opinions in the comments. Is anybody excited about writing code in Nim?

Finally Learning to Admit I Can't Do It


If I had to name my biggest technical gain of 2016, it would be a change of mindset. Before that I had no clear direction for my technical work: to solve problems, I just dealt with whatever came up, and if something couldn't be cracked quickly I'd route around it, and I still got good results. The upshot was that I became a "full-stack" engineer: people who didn't know my real level thought I was very capable, since every hot potato handed to me came back better than expected. I myself felt adrift and confused: a few years from now, when people mention me, will they think of a dependable full-stack guy, or of something more?

I want to become a real technical expert, not a "do-everything engineer" wearing an expert's title. With that ideal I joined Alibaba Cloud's cloud database team. But no platform can promise to make you an expert; what a platform provides is hard problems and high-caliber colleagues. Everything else you must prove yourself, and opportunities you must win yourself.

Merely knowing more than others doesn't make you an expert; at most it makes you a consultant. My boss told me that an expert is someone who can solve the problems others cannot. To do that, you must become number one in your field: first in the team, then in the company, and finally in the industry.

Where is my field? I don't know. No matter; start by doing the work at hand well.

Later, unhappy with some of the efficiency tooling we used, I built two wheels of my own over a couple of weekends and, without meaning to, solved a problem the team had long talked about but never fixed. That won me the team's first "Innovation Star" award; the prize was a DJI drone.



Some things others can't finish in half a year, you knock out in two days; some things you can't finish in half a year, others knock out in two days. I hope everyone gets to do what they are best at and enjoy most, to go deep and do it well, so that each person's value is truly realized.

What am I best at? I don't know. I am not a MySQL expert, not a Redis expert, not familiar with NGINX or Kafka, not a veteran of any piece of infrastructure or middleware, of any database or operating system; as a Java engineer I don't even know the JVM well, or what a Servlet really is, and as a Python engineer I'm not familiar with the Python interpreter. Yet when it was needed, I could hack the low-level SQL parsing inside the Druid connection pool, and hack pip into an "mpip" that could upload, download, and install Python modules from a Nexus repository.

So am I, in the end, just a jack of all trades with nothing to show? Recalling what my boss had said, I seemed to find the crux of the problem: I demanded too much of myself, wanted to do everything, and so did nothing well.

So I began to learn to admit that my ability falls short, to admit that I can't solve something. Admitting I couldn't solve a problem used to fill me with shame; now I can admit it with an easy conscience, let a colleague bring in the old hands, and stand beside them learning, instead of hugging the code and quietly debugging for half a day on my own.

Once my mindset changed, the whole world suddenly looked better.

I still don't know what I'm good at, but at least I now know what I'm not good at. Leaving your comfort zone means doing what you're not good at; but I didn't even have a comfort zone, didn't even have a specialty, so let me find the comfort zone first.

For a good half of this year I hardly wrote any production code; instead I carried out a marathon of online changes, working through the night two or three days a week, to the point where I almost forgot how to write code. Yet in that period colleagues kept coming to discuss Java system architecture and HTTP API design with me, and I realized that my experience with Java system design and microservice architecture, and even the Taobao-style middleware and internal wheels I'd seen at Cainiao, could serve as references for colleagues designing systems; I could share which pits I had fallen into and what I had done to stay out of them.

I suddenly discovered that merely moving my lips to hand out a design, and letting someone else grind out the implementation, feels wonderful, but it's poison! Isn't that exactly what we call a PowerPoint architect? Supposedly that's how architects at Huawei work: they only produce designs and never write code. In those days I joked with colleagues that since I could no longer write code, I might as well become a PPT architect and just sit in on design discussions :joy:

Eventually the marathon change was finished: zero incidents, very nice, beyond expectations. Running it had put me in debt on several projects, debts I had shamelessly postponed again and again; now that I was out of the change, they had to be repaid. The road back wasn't smooth either: my boss had me start on an important low-level system. I thought I'd be allowed to clear my debts first, but I had barely begun repaying when I contributed a design in one discussion and then spent a month grinding away in that system to implement it; it was handed over for testing only a few days ago. I designed, from scratch, a brand-new distributed architecture for the system, one that claims to solve various problems the old architecture couldn't, and with colleagues brought a brand-new business up on the new architecture; migration of the old business is now being planned.

At last, the chance came to build a system that represents my true engineering ability. It turns out that all the work before, and all the recognition, being a dependable full-stack engineer, pulling off the huge change with zero incidents alongside my colleagues, was preparation for this opportunity; even the daily, trivial system maintenance no longer seems so trivial and dull.

When you don't know what to do, start by doing well what is in your hands. If you can't do the simple things well, you won't get the chance to do the less simple ones.

"The deed done, he dusts off his clothes and goes, hiding his trace and his name." Having delivered a shiny new architecture and a new business, I thought I could quietly withdraw and keep repaying my project debts. No such luck: the CAP theorem has gone to people's heads; mention distributed systems and everyone talks Paxos and Raft, so even though my system has nothing to do with Raft, colleagues still asked me about the theoretical basis of its design.

To me, designing a distributed application and designing a microservice application are both pure software engineering; apart from one using communication within the system to transfer state and the other using communication between systems to serve business requests, there is no essential difference. Distributed systems like Storm and Kafka don't rest on a theoretical foundation either, do they? But looked at the other way, my system has apparently become important enough to need theory to guarantee its correctness; if I really could produce a theory or a proof, couldn't I then claim to "know a little about distributed systems"?

I never seriously read the books on distributed systems, and at last I've paid the price for lacking that education. I'll use this opportunity to catch up properly; perhaps distributed system design can become the field I'm good at.


Understanding Higher Order Local Gradient Computation for Backpropagation in Deep Neural Networks

Introduction

One of the major difficulties in understanding how neural networks work is due to the backpropagation algorithm. There are endless texts and online guides on backpropagation, but most are useless. I read several explanations of backpropagation when I learned about it from 2013 to 2014, but I never felt like I really understood it until I took/TA-ed the Deep Neural Networks class at Berkeley, based on the excellent Stanford CS 231n course.

The course notes from CS 231n include a tutorial on how to compute gradients for local nodes in computational graphs , which I think is key to understanding backpropagation. However, the notes are mostly for the one-dimensional case, and their main advice for extending gradient computation to the vector or matrix case is to keep track of dimensions. That’s perfectly fine, and in fact that was how I managed to get through the second CS 231n assignment.

But this felt unsatisfying.

For some of the harder gradient computations, I had to test several different ideas before passing the gradient checker, and sometimes I wasn’t even sure why my code worked! Thus, the purpose of this post is to make sure I deeply understand how gradient computation works.

Note: I’ve had this post in draft stage for a long time. However, I just found out that the notes from CS 231n have been updated with a guide from Erik Learned-Miller on taking matrix/vector derivatives. That’s worth checking out, but fortunately, the content I provide here is mostly distinct from his material.

The Basics: Computational Graphs in One Dimension

I won’t belabor the details on one-dimensional graphs since I assume the reader has read the corresponding Stanford CS 231n guide. Another nice post is from Chris Olah’s excellent blog . For my own benefit, I reviewed derivatives on computational graphs by going through the CS 231n example with sigmoids (but with the sigmoid computation spread out among finer-grained operations). You can see my hand-written computations in the following image. Sorry, I have absolutely no skill in getting this up quickly using tikz, Inkscape, or other visualization tactics/software. Feel free to right-click and open the image in a new tab. Warning: it’s big. (But I have to say, the iPhone7 plus makes really nice images. I remember the good old days when we had to take our cameras to CVS to get them developed…)


[Image: hand-written computational graph derivation for the sigmoid example, from lecture 4 of CS 231n.]

Another note: from the image, you can see that this is from the fourth lecture of CS 231n class. I watched that video on YouTube, which is excellent and of high-quality. Fortunately, there are also automatic captions which are highly accurate. (There’s an archived reddit thread discussing how Andrej Karpathy had to take down the videos due to a related lawsuit I blogged about earlier , but I can see them just fine. Did they get back up somehow? I’ll write more about this at a later date.)

When I was going through the math here, I came up with several rules to myself:

There’s a lot of notation that can get confusing, so for simplicity, I always denoted inputs as

and outputs as , though in this example, we only have one output at each step. By doing this, I can view the s as a function of the terms, so the local gradient turns into and then I can substitute in terms of the inputs.

When doing backpropgation, I analyzed it node-by-node , and the boxes I drew in my image contain a number which indicates the order I evaluated them. (I skipped a few repeat blocks just as the lecture did.) Note that when filling in my boxes, I only used the node and any incoming/outgoing arrows. Also, the

and keep getting repeated, i.e. the next step will have equal to whatever the was in the previous block.

Always remember that when we have arrows here, the part above the arrow contains the value of

(respectively, ) and below the arrow we have (respectively ).

Hopefully this will be helpful to beginners using computational graphs.

Vector/Matrix/Tensor Derivatives, With Examples

Now let’s get to the big guns ― vectors/matrices/tensors. Vectors are a special case of matrices, which are a special case of tensors, the most generalized

-dimensional array. For this section, I will continue using the “partial derivative” notation

to represent any derivative form (scalar, vector, or matrix).

ReLU

Our first example will be with ReLU s, because that was covered a bit in the CS 231n lecture. Let's suppose $x \in \mathbb{R}^3$, a 3-D column vector representing some data from a hidden layer deep into the network. The ReLU operation's forward pass is extremely simple: $y_i = \max(0, x_i)$, which can be vectorized using np.maximum.

The backward pass is where things get tricky. The input is a 3-D vector, and so is the output! Hence, taking the derivative of the function $y = f(x)$ means we have to consider the effect of every $x_i$ on every $y_j$. The only way that's possible is to use Jacobians. Using the example here, denoting the derivative as $\partial y/\partial x$ where $y$ is a function of $x$, we have:

The most interesting part of this happens when we expand the Jacobian and see that we have a bunch of derivatives, but they all evaluate to zero on the off-diagonal. After all, the effect (i.e. derivative) of $x_j$ on $y_i = \max(0, x_i)$ is zero for $j \neq i$. The diagonal term is only slightly more complicated: an indicator function (which evaluates to either 0 or 1) depending on the outcome of the ReLU. This means we have to cache the result of the forward pass, which is easy to do in the CS 231n assignments.

How does this get combined into the incoming (i.e. "upstream") gradient, which is a vector $\partial L/\partial y$? We perform a matrix times vector operation with that and our Jacobian from above. Thus, the overall gradient we have for $x$ with respect to the loss function, which is what we wanted all along, is:

This is as simple as doing mask * y_grad where mask is a numpy array with 0s and 1s depending on the value of the indicator functions, and y_grad is the upstream derivative/gradient. In other words, we can completely bypass the Jacobian computation in our python code! Another option is to use y_grad[x <= 0] = 0 , where x is the data that was passed in the forward pass (just before ReLU was applied). In numpy, this will set all indices to which the condition x <= 0 is true to have zero value, precisely clearing out the gradients where we need it cleared.
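Since the inline math was lost above, here is the standard 3-D ReLU case written out, a reconstruction consistent with the surrounding prose rather than the author's original notation:

y_i = \max(0, x_i), \qquad
\frac{\partial y}{\partial x} =
\begin{pmatrix}
\mathbb{1}[x_1 > 0] & 0 & 0 \\
0 & \mathbb{1}[x_2 > 0] & 0 \\
0 & 0 & \mathbb{1}[x_3 > 0]
\end{pmatrix}, \qquad
\frac{\partial L}{\partial x} = \frac{\partial y}{\partial x}\,\frac{\partial L}{\partial y}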

In practice, we tend to use mini-batches of data, so instead of a single $x$, we have a matrix $X$ with $N$ columns. Denote the $i$th column as $x^{(i)}$. Writing out the full Jacobian is too cumbersome in this case, but to visualize it, think of having $N = 2$ and then stacking the two samples into a six-dimensional vector. Do the same for the output $y$. The Jacobian turns out to again be a diagonal matrix, particularly because the derivative of $x^{(i)}$ on the output $y^{(j)}$ is zero for $i \neq j$. Thus, we can again use a simple masking, element-wise multiply on the upstream gradient to compute the local gradient of $L$ w.r.t. $X$. In our code we don't have to do any "stacking/destacking"; we can actually use the exact same code mask * y_grad with both of these being 2-D numpy arrays (i.e. matrices) rather than 1-D numpy arrays. The case is similar for larger minibatch sizes using $N$ samples.

Remark : this process of computing derivatives will be similar to other activation functions because they are elementwise operations.

Affine Layer (Fully Connected), Biases

Now let’s discuss a layer which isn’t elementwise: the fully connected layer operation

. How do we compute gradients? To start, let’s consider one 3-D element

so that our operation is

According to the chain rule, the local gradient with respect to

is

Since we’re doing backpropagation, we can assume the upstream derivative is given, so we only need to compute the

Jacobian. To do so, observe that

and a similar case happens for the second component. The off-diagonal terms are zero in the Jacobian since

has no effect on for

. Hence, the local derivative is

That’s pretty nice ― all we need to do is copy the upstream derivative. No additional work necessary!

Now let’s get more realistic. How do we extend this when

is a matrix? Let’s continue the same notation as we did in the ReLU case, so that our columns are for

. Thus, we have:

Remark : crucially, notice that the elements of

are repeated

across columns.

How do we compute the local derivative? We can try writing out the derivative rule as we did before:

but the problem is that this isn’t matrix multiplication. Here,

is a function from to

, and to evaluate the derivative, it seems like we would need a 3-D matrix for full generality.

Fortunately, there’s an easier way with computational graphs . If you draw out the computational graph and create nodes for

, you see that you have to write plus nodes to get the output, each of which takes in one of these terms along with adding . Then this produces

. See my hand-drawn diagram:


[Image: hand-drawn computational graph splitting the affine operation into per-sample plus nodes.]

This captures the key property of independence among the samples in

. To compute the local gradients for , it therefore suffices to compute the local gradients for each of the and then add them together. (The rule in computational graphs is to add

incoming derivatives, which can be verified by looking at trivial 1-D examples.) The gradient is

See what happened? This immediately reduced to the same case we had earlier, with a

Jacobian being multiplied by an upstream derivative. All of the Jacobians turn out to be the identity, meaning that the final derivative is the sum of the columns of the original upstream derivative matrix. As a sanity check, this is a vector with the same dimension as $b$, as desired. In numpy, one can do this with something similar to np.sum(Y_grad), though you'll probably need the axis argument to make sure the sum is across the appropriate dimension.
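In numpy terms, with Y_grad holding the upstream derivative one column per sample (the shapes here are assumed for illustration), the bias gradient is a row-wise sum:

import numpy as np

d, N = 3, 5
Y_grad = np.random.randn(d, N)  # upstream derivative, one column per sample
b_grad = Y_grad.sum(axis=1)     # sum over the columns: shape (d,)
print(b_grad.shape)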

Affine Layer (Fully Connected), Weight Matrix

Going from biases, which are represented by vectors, to weights , which are represented by matrices, brings some extra difficulty due to that extra dimension.

Let’s focus on the case with one sample

. For the derivative with respect to , we can ignore since the multivariate chain rule states that the expression differentiated with respect to causes

to disappear, just like in the scalar case.

The harder part is dealing with the chain rule for the

expression, because we can’t write the expression “ ”. The function is a vector , and the variable we’re differentiating here is a matrix

. Thus, we’d again need a 3-D like matrix to contain the derivatives.

Fortunately, there’s an easier way with the chain rule. We can still use the rule, except we have to sum over the intermediate components , as specified by the chain rule for higher dimensions; see the Wikipedia article for more details and justification . Our “intermediate component” here is the

vector, which has two components. We therefore have:

We fortunately see that it simplifies to a simple matrix product! This seems to suggest the following rule: try to simplify any expressions to straightforward Jacobians, gradients, or scalar derivatives, and sum over as needed. Above, splitting the components of

allowed us to utilize the derivative since is now a real-valued function

, thus enabling straightforward gradient derivations. It also meant the upstream derivative could be analyzed component-by-component, making our lives easier.

A similar case holds for when we have multiple columns

in . We would have another

sum above, over the columns, but fortunately this can be re-written as matrix multiplication.

Convolutional Layers

How do we compute the convolutional layer gradients? That’s pretty complicated so I’ll leave that as an exercise for the reader. For now.

A Few Django ORM Mistakes

$
0
0

See if you can figure out what's wrong with the code snippets below! Ask yourself what the problem is, what effect will it have, and how can you fix it?

These examples are for Django, but probably apply to many other ORMs.

Bug 1

def create():
    with transaction.atomic():
        thing = Thing.objects.create(foo=1, bar=1)
        set_foo(thing.id)
        thing.bar = 2
        thing.save()

def set_foo(id):
    thing = Thing.objects.get(id=id)
    thing.foo = 2
    thing.save()

Hint

The save method saves all attributes.

Solution

The problem with this code is that two python instances of the same database row exist. Here's the annotated source:

def create():
    with transaction.atomic():
        # Create database row foo=1 bar=1
        thing = Thing.objects.create(foo=1, bar=1)
        set_foo(thing.id)
        # The database row has been updated with foo=2 bar=1, but this
        # instance still has foo=1 bar=1 as it hasn't been reloaded
        thing.bar = 2
        thing.save()  # Writes foo=1 bar=2
        # The foo=2 write has been lost

def set_foo(id):
    # Look up the same Thing, but create a new instance
    thing = Thing.objects.get(id=id)
    thing.foo = 2
    thing.save()  # Writes foo=2, bar=1

The result is a single Thing with a foo of 1 and a bar of 2 . A write has been lost!

Here's one possible fix:

def create():
    with transaction.atomic():
        thing = Thing.objects.create(foo=1, bar=2)
        set_foo(thing)
        thing.bar = 3
        thing.save()

def set_foo(thing):
    thing.bar = 4
    thing.save()

Bug 2

class Thing(Model):
    foo = ...
    bar = ...

def thing_set_foo(id, value):
    thing = Thing.objects.get(id=id)
    thing.foo = value
    thing.save()

def thing_set_bar(id, value):
    thing = Thing.objects.get(id=id)
    thing.bar = value
    thing.save()

Hint

Assume thing_set_foo and thing_set_bar can happen simultaneously.

Solution

It's possible for a thread to read from the database just before a write happens in another thread, resulting in the following situation:


[Diagram: two threads read the same row, then each writes back its own stale copy, losing one update.]

Here's one possible solution:

def thing_set_foo(id, value):
    thing = Thing.objects.get(id=id)
    thing.foo = value
    thing.save(update_fields=["foo"])

def thing_set_bar(id, value):
    thing = Thing.objects.get(id=id)
    thing.bar = value
    thing.save(update_fields=["bar"])

Bug 3

def increment(id):
    counter = Counter.objects.get(id=id)
    counter.count = counter.count + 1
    counter.save()

Solution

This is very much like bug 2, but the twist is that the increment function can conflict with itself. If called in two different threads, even though increment is called twice the total may still only be 1.


[Diagram: two concurrent increments both read the same count and both write back the same value, so one increment is lost.]

One way to fix this is to make the increment operation atomic .



The way to do this in the Django ORM is to use F objects:

from django.db.models import F

def increment(id):
    counter = Counter.objects.get(id=id)
    counter.count = F('count') + 1
    counter.save()
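One Django detail worth noting (a hedged aside, not from the original post): after save(), the in-memory attribute still holds the F() expression rather than the number, so reload it before using the value.

def increment_and_get(id):
    counter = Counter.objects.get(id=id)
    counter.count = F('count') + 1
    counter.save(update_fields=["count"])
    counter.refresh_from_db()   # replaces the F() expression with the real value
    return counter.count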

Isolation Levels

READ COMMITTED Isolation Level

This is the default for PostgreSQL. Transactions can read updates from other transactions after they have been committed.


REPEATABLE READ Isolation Level

This is the default for MySQL. A snapshot is established on the first read in the transaction, and all subsequent reads are from the snapshot.



Going forward, assume we are using MySQL in its default configuration.
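For reference, here is a hedged sketch of how the isolation level can be pinned in Django's MySQL backend (the OPTIONS key exists since Django 2.0; the database name is illustrative):

# settings.py (illustrative)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "app",
        # one of: "read uncommitted", "read committed",
        # "repeatable read", "serializable"
        "OPTIONS": {"isolation_level": "read committed"},
    }
}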

Bug 4

def state_transition(id):
    with transaction.atomic():
        stateful = Stateful.objects.get(id=id)
        if stateful.state == DONE:
            raise AlreadyDone
        do_state_transition()
        stateful.state = DONE
        stateful.save()

Solution

It is possible for do_state_transition to be executed twice if the state transition is executed concurrently. This could be a problem if your state transition includes side effects!



One simple solution to this problem is to lock the object:

def state_transition(id):
    with transaction.atomic():
        stateful = Stateful.objects.select_for_update().get(id=id)
        if stateful.state == DONE:
            raise AlreadyDone
        do_state_transition()
        stateful.state = DONE
        stateful.save()

But, generally, you should try to avoid doing side effects in transactions!
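One way to keep the side effect out of the transaction (a sketch using Django's transaction.on_commit, not from the original post):

from django.db import transaction

def state_transition(id):
    with transaction.atomic():
        stateful = Stateful.objects.select_for_update().get(id=id)
        if stateful.state == DONE:
            raise AlreadyDone
        stateful.state = DONE
        stateful.save()
        # Runs only after a successful COMMIT; skipped on rollback.
        transaction.on_commit(do_state_transition)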

Bug 5

def create_payment(collection_id, amount):
    with transaction.atomic():
        payment_collection = PaymentCollection.objects.get(id=collection_id)
        Payment.objects.create(
            amount=amount, payment_collection=payment_collection)
        payment_collection.total = (
            payment_collection.payment_set.all()
            .aggregate(total=Sum('amount'))['total'])
        payment_collection.save()

Solution

If executed concurrently, one transaction will not see the newly created Payment in the other transaction. The last write will win and the total will be inconsistent.



In addition, on MySQL this can deadlock, causing your transaction to roll back entirely! Creating the Payment takes a lock that blocks the other transaction's aggregation read.

Both issues can be fixed by locking the model being updated (not the Payment!) at the start of the transaction:

def create_payment(collection_id, amount):
    with transaction.atomic():
        payment_collection = (PaymentCollection.objects
                              .select_for_update().get(id=collection_id))
        Payment.objects.create(
            amount=amount, payment_collection=payment_collection)
        payment_collection.total = (
            payment_collection.payment_set.all()
            .aggregate(total=Sum('amount'))['total'])
        payment_collection.save()

Or, alternatively, making the update atomic:

def create_payment(collection_id, amount):
    payment_collection = PaymentCollection.objects.get(id=collection_id)
    Payment.objects.create(amount=amount, payment_collection=payment_collection)
    with connection.cursor() as cursor:
        cursor.execute("""
            UPDATE payment_collection,
                (SELECT payment_collection_id, SUM(amount) AS total
                 FROM payment
                 GROUP BY payment_collection_id) totals
            SET payment_collection.total = totals.total
            WHERE totals.payment_collection_id = payment_collection.id
              AND payment_collection.id = %s
        """, [payment_collection.id])

Note that this cannot be in a transaction, or the deadlock issues will remain!

In my opinion, the safest way to do this is to use a SQL view instead of storing the total. Unfortunately, views can be awkward to use with Django.

CREATE VIEW payment_collection_totals AS
SELECT payment_collection_id, SUM(amount) AS total
FROM payment
GROUP BY payment_collection_id;

CREATE VIEW payment_collection_with_total AS
SELECT payment_collection.*, COALESCE(totals.total, 0) AS total
FROM payment_collection
LEFT JOIN payment_collection_totals totals
    ON totals.payment_collection_id = payment_collection.id;
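One way to consume such a view from Django is an unmanaged model; this is a hedged sketch with assumed names matching the view above:

from django.db import models

class PaymentCollectionWithTotal(models.Model):
    # ... the PaymentCollection fields go here, plus:
    total = models.DecimalField(max_digits=12, decimal_places=2)

    class Meta:
        managed = False   # Django will neither create nor migrate this relation
        db_table = "payment_collection_with_total"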

Bug 6

def foo(id):
    with transaction.atomic():
        foo = Foo.objects.get(id=id)
        bar = Bar.objects.create(...)
        with lock():
            foo.refresh_from_db()
            # If foo.bar has already been set in another thread,
            # raise an exception and rollback the transaction
            if foo.bar:
                raise Exception
            foo.bar = bar
            foo.save()

Hint

This bug occurs on MySQL, but not on PostgreSQL.

Solution

This bug is a result of the REPEATABLE READ isolation level. The read after the transaction starts establishes a snapshot, so when refresh_from_db is performed after waiting for a lock, the snapshot is read, not the most recent value.

This means that when the foo.bar check is performed, we are checking potentially stale data. This can cause multiple Bar objects to be created, with only one of them linked to the correct Foo.

Confusingly, replacing with lock() with a select_for_update() will work for MySQL, because MySQL has a weird quirk where locked reads do not read from the snapshot. When using REPEATABLE READ with PostgreSQL, this will throw an error instead.

The preferred fix is either to move the lock to the top, outside of the transaction, or to use select_for_update() as follows:

def foo(id):
    with transaction.atomic():
        foo = Foo.objects.select_for_update().get(id=id)
        bar = Bar.objects.create(...)
        # If foo.bar has already been set in another thread,
        # raise an exception and rollback the transaction
        if foo.bar:
            raise Exception
        foo.bar = bar
        foo.save()

Tips

Remember the ORM is an in-memory cache.
ORMs can obscure bugs. Look at the SQL!
Avoid read-modify-write. If you can't, you'll probably need a lock.
If it's not locked, it's not up to date.
Lock before reads to avoid weird MySQL behaviour: locking reads don't use the snapshot in MySQL.
Prefer immutable database design if practical. See: Immutable Data.
Consider using a serializable isolation level: you don't have to worry about locking if you use serializable transactions, though it has other drawbacks. The PostgreSQL implementation is nicer than the MySQL one, IMO.

Fensterbrief - a python script to organize and work with letters based on LaTeX ...


I have been using LaTeX for more than a decade. Naturally, it is my preferred tool suite when it comes to writing letters. Since working with LaTeX and doing office work requires a bit of bookkeeping, I started developing a helper tool for working with LaTeX letters and the corresponding templates.

I think it all started when I had to write a letter a long time ago: instead of writing the letter directly, I implemented a CGI script that asked for metadata and then rendered everything into a PDF file, using LaTeX code as the intermediate format. You can find several similar scripted approaches on the Internet. These days it is more natural for me to keep the LaTeX letter files on my filesystem, and when I write a new letter I may reuse an old one, especially to reuse a recipient address I manually inserted into an earlier letter to the same recipient. So I keep the LaTeX letter files for later reuse and the PDF output as a printable copy, along with several unnecessary artifact files that usually remain in the filesystem too. My letters are organized in a directory structure: each directory corresponds to a business process and stores one or more documents related to it. The files and folders follow a fixed naming scheme, which includes a date in ISO notation (YYYY-MM-DD) for letters and a YYYY-MM format for folder names, to support a lexicographical order based on file names.

For a long time, I organized these files and directories manually. Recently, I started automating the process and wrote a Python script that implements it. The idea is to have a tool that lets me write letters almost as fast as writing e-mails, in other words in a fire-and-forget style. Hence, I recently added support for Markdown and for looking up postal addresses via the Google API.

The tool is called Fensterbrief and it supports these features:

intended to be used via the command line
maintains a folder and document structure
support for LaTeX- and Markdown-based letters
support for fax transmissions via simple-fax.de
support for buying postage from Deutsche Post
lookup of postal addresses via the Google API

The source code, further documentation, and usage examples are on GitHub.

What is a variable?


You can visualize a variable 1 as a box that contains a thing 2. For example, this variable called x contains the integer 0.

This other variable, called most_recent_email, contains a pointer (more below), which we draw with an arrow.



Beyond these basics, programming languages disagree about what a variable is and can do. I will use variable to mean the green box, identifier to mean the name (like x or most_recent_email), and value to refer to a thing that goes inside the variable.

Declaration

Declaration creates a new variable out of thin air 3. We'll draw a bomb inside the box, since declaration by itself does not put a value in the variable, so errors will occur if you try to use what is inside.


Language    Code
C           int x;
Rust        let x: i32;
Java        int x;
Java        final int i;
Allocation

Allocation is like declaration, but creates a variable without an identifier. It also creates a pointer pointing at the variable.


Language    Code (footnote 4)
C           malloc(1000000000);
C++         new int[1000000000];

Initialization

Initialization, which is often done at the same time as or immediately after declaration or allocation, is putting a value into a variable for the first time. You can initialize variables you previously declared:


Language    Code
C           x = 7;
Rust        x = 7;

You can declare and initialize at the same time (we introduce the lock drawing for the optional const or final, which we will come back to in a second):


Language            Code
Java                final int i = 7;
C/C++               int const i = 7;
PHP (footnote 6)    $x = 7;
R (footnote 6)      x <- 7

In a very common pattern, you can allocate, initialize, declare, and initialize all in quick succession. The x variable contains a pointer. Many languages call this kind of variable a reference: 5


Language    Code
Python      x = 7
ES6         let x = 7;
Java        Integer x2 = new Integer(7);
Haskell     let i = 7 in ...
C++         int* x2 = new int; *x2 = 7;
C++         int* const i2 = new int; *i2 = 7;

Assignment

Assignment puts a new value into a variable, wiping out the previous value. It is like erasing a chalkboard and writing something new on it.



Examples of assignment:

Language    Code
C           x = 9;
Java        x = 9;
PHP         $x = 9;

Also common is assignment of a variable that contains a pointer (many languages call this kind of variable a reference):


Language    Code
Python      x = 9
C++         x2 = new int; *x2 = 9;

Not all programming languages and not all variables support assignment. Variables that don't support assignment are referred to in many programming languages as const. Using the i examples from above, all of these will cause an error:

Language    Code
Java        i = 9;
C/C++       i = 9;
C++         i2 = new int; *i2 = 9;

Identifier Aliasing

Some languages let you attach more than one identifier to the same variable. This is known as aliasing, although PHP calls it "references".


Language    Code
PHP         $y =& $x;

Pointer Aliasing

Many languages let you create multiple pointers to a single allocated variable. This is also sometimes known as aliasing.


Language    Code
Python      y = x
C++         int* y2 = x2;
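A small Python sketch of the pointer-aliasing row above (my illustration, not from the original article): two identifiers, one object.

x = [1, 2, 3]
y = x              # y now points at the same list object as x
y.append(4)
print(x)           # [1, 2, 3, 4] -- the mutation is visible through both names
print(x is y)      # True: same object, two names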

Footnotes

1. Some variables in computer science are actually "free variables", which are different from what is described here. Parameter assignment, closures, currying, unification, and variables in mathematics are all topics for another post. The word "variable" means too many things.

2. Ultimately all variables store sequences of bits, but that is a topic for another post.

3. Actually, out of a big pool of memory.

4. The compiler I tried optimized out these nonsensical examples, which you should never actually write in real code.

5. These examples don't literally store a 7. In Python, for example, the value is actually made up of two separate parts, a PyObject_HEAD and the 7. In the Haskell example, the value is probably a thunk that will evaluate to 7.

6. PHP secretly allocates a zval here, but tries to make it look as if its variables store values directly. R does something similar.

Python: Aligning Wide (Chinese) Characters in the Terminal, and More on Encoding


This started as a fix for wide-character handling in a small terminal program of mine, and then I got tangled up with encodings.

I previously wrote a summary post, Python Encoding.

Over the past two days I spent quite some time digging into this area again; the more I researched, the more confusing it became, and some questions remain open. Working through and summarizing all of this left me in despair... I hope I never have to wrestle with it again.

The notes below were researched on Linux with Python 2.7.

First, the problem of aligning Chinese (wide) characters in terminal output:

Aligning Chinese (wide) characters in the terminal

Say I print a table in the terminal that mixes Chinese and English text. Their display widths differ, and len() returns the number of encoded bytes, not the display width:

>>> len(u'我'.encode('utf-8'))
3
>>> len(u'我'.encode('gbk'))
2

Here '我' occupies 3 bytes in UTF-8 and 2 bytes in GBK, so fixing column widths via len() is clearly unsuitable and produces misaligned output.

Later I came across this post and learned about the unicodedata module. Since each Unicode code point is unique, this module can report the width class of a given Unicode character.

My approach:

#coding: utf-8
import unicodedata

def wide_chars(s):
    """return the extra width for wide characters

    ref: http://stackoverflow.com/a/23320535/1276501
    """
    if isinstance(s, str):
        s = s.decode('utf-8')
    return sum(unicodedata.east_asian_width(x) in ('F', 'W') for x in s)

# example
max_width = 20
s1 = u'中文'
# Format Specification Mini-Language
# ref: https://docs.python.org/2/library/string.html#format-specification-mini-language
fmt_str1 = u'{0:<%s}|' % (max_width - wide_chars(s1))
s2 = u'ab'
fmt_str2 = u'{0:<%s}|' % (max_width - wide_chars(s2))
s3 = u'新年快乐'
fmt_str3 = u'{0:<%s}|' % (max_width - wide_chars(s3))

print(fmt_str1.format(s1))
print(fmt_str2.format(s2))
print(fmt_str3.format(s3))

F means Fullwidth characters, such as ￥; W means Wide characters, such as Chinese characters and punctuation. See EAST ASIAN WIDTH for details.

(Note: I used to assume wide characters were all fullwidth characters, but it turns out the halfwidth Chinese comma is also classified as F; I didn't dig further into this.)

The function above returns the number of wide (Chinese) characters in a unicode string. Most Chinese characters occupy three bytes in UTF-8 but two display columns, so subtracting the wide-character count from the field width makes the byte length and the display width agree. Note that this works on unicode strings; if you use str under Python 2.x, you need + wide_chars() instead.

Of course, the above is still not fully rigorous, because I haven't checked which characters might not be 3 bytes. The article "Python 中计算字符宽度" describes another approach. For current purposes, though, the example above is good enough.
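Building on wide_chars() above, here is a hedged helper sketch (my addition) that left-justifies a string by display columns instead of len(); it assumes Python 2 and UTF-8 input, as in the rest of this post.

def ljust_display(s, width):
    if isinstance(s, str):
        s = s.decode('utf-8')
    padding = width - (len(s) + wide_chars(s))  # each wide char takes 2 columns
    return s + u' ' * max(padding, 0)

print(ljust_display(u'中文', 20) + u'|')
print(ljust_display(u'ab', 20) + u'|')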

And then the character-encoding storm began...

Terminal Emulator Character Encoding and $LANG

For example, on the Mac I use iTerm2 as my terminal. Its character encoding is set under Profiles -> Terminal -> Character Encoding, where I selected Unicode (UTF-8).

$LANG is set to zh_CN.UTF-8: zh_CN is the language (Chinese here), and UTF-8 is the character set. Terminal output such as the results of ls and date, and vim's messages, all appear in Chinese.

About locale

The meaning of each field in the locale output is explained in man 7 locale.

There are two special locales: C and POSIX.

For example, when viewing man pages on a Mac you may see mojibake because of the character set. In that case one often runs LANG=C man xxx, which means dropping localization and letting the program present output in its own language, usually English.

What does "LC_ALL=C" do?
LC_ALL=C 的含义

Among the locale fields, the commonly used ones are LC_ALL, LANG, LC_CTYPE, LC_TIME, etc.:

LC_ALL: the highest-priority global setting, usually left empty; when empty, the corresponding LC_* value is looked up, and if that is unset too, the default LANG value is used.
LC_CTYPE: governs how byte sequences are interpreted, i.e. how characters are displayed.
LC_TIME: date and time display format and localization.
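As a quick check from Python (a hedged stdlib sketch I am adding, not from the original post), you can inspect the effective locale like this:

import locale

locale.setlocale(locale.LC_ALL, '')        # adopt the environment's settings
print(locale.getlocale(locale.LC_CTYPE))   # e.g. ('zh_CN', 'UTF-8')
print(locale.getpreferredencoding())       # e.g. 'UTF-8'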

References:

Ubuntu wiki - locale
Ubuntu 国际化与本土化

More on Python Source Code Encodings

I mentioned this in Python Encoding; here are a few additions.

Its purpose is to let the Python interpreter encode and decode the non-ASCII characters in a source file when reading it.

For example, the most common form:

# coding=utf-8

If another character set is declared, e.g. gb2312:

# coding=gb2312

then the file must actually be encoded in gb2312; if it is saved as UTF-8 instead, reading it as gb2312 will fail.

There is also the program's output at runtime. For example:

#coding=gb2312
a = "中文"
print(a)
print(a.decode('gb2312').encode('gb2312'))

The terminal encoding must be set to gb2312, otherwise the output is mojibake.

Python sys.getdefaultencoding()

Under Python 2.x the default is ascii, regardless of the system environment.

This is also mentioned in Python Encoding.

Another example, taken from Python - Default Encoding:

#coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
a = '中文' + u'abc'

Without changing the default encoding to utf-8, this raises:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

because the string concatenation effectively does:

a = '中文'.decode(sys.getdefaultencoding()) + u'abc'

In vim: encoding vs fileencoding vs fileencodings

Note the difference between fileencoding and fileencodings: one is singular, the other plural.

encoding: string. Affects buffers, registers, etc., i.e. how characters are represented once typed. Defaults to "latin1" or a value derived from $LANG.
fileencoding: string. The character encoding used when writing a file. If unset, it follows encoding; otherwise writing converts between the two. When reading a file, it is set from fileencodings.
fileencodings: list. When opening an existing file, vim tries to read it with the first encoding in the list; on failure, it tries the next one.

In other words:

Reading a file: convert from fileencoding to encoding.
Writing a file: convert from encoding to fileencoding.

That is why you may see the converted flag when opening or saving a file.

An example:

set encoding=utf-8 set fileencoding=gb2312 set fileencodings=utf-8,gb2312

Create a new file: vim displays it using utf-8. After typing some Chinese and saving, you will see a "converted" hint in the bottom-left corner, and file -i filename reports charset=iso-8859-1. When the file is opened again, fileencodings determines fileencoding, and the content is converted to the encoding setting.

Further reading:

Character Encoding Tricks for Vim
用 vim 打开后中文乱码怎么办?

One remaining question (TODO):

If both encoding and fileencoding are set to gb2312 while the terminal is set to UTF-8, the file is actually saved as utf-8; if the terminal is gb2312, it is saved as gb2312. Whereas if encoding is utf-8, fileencoding is gb2312, and the terminal is UTF-8, the file is saved as gb2312.

I don't yet understand why. My guess is that the terminal character set interacts with the encoding of the buffer: although encoding is set to gb2312, what actually arrives is the terminal's utf-8, but because fileencoding is set to the same gb2312 as encoding, no conversion is performed.

To sum up: the almighty UTF-8 wins. Don't mix inconsistent settings; unify everything from the terminal to program configuration as UTF-8.

Other references:

UTF-8 and Unicode FAQ for Unix/Linux
Effect of $LANG on terminal
浅析 Linux 的国际化与本地化机制