
Review: 6 Python IDEs go to the mat


Of all the metrics you could use to gauge the popularity and success of a language, one surefire factor is the number of development environments available for it. Python's rise in popularity over the last several years has brought with it a strong wave of IDE support, with tools aimed both at the general programmer and at those who use Python for tasks like scientific work and analytical programming.

These six IDEs with Python support cover the gamut of use cases. Some are multilanguage IDEs that have Python support through an add-on or a repackaging of another product with Python-specific extensions. Each benefits a slightly different audience of Python developer, although many strive to be useful as universal solutions.


Controlling 4DSystems Diablo16 and Picaso displays from Python


A few weeks ago I had a really good time testing the 4DSystems 4Duino-24 board. One of the things I noticed is that the Serial Command Set interface is really flexible. You can easily drive the display from an 8-bit microcontroller, but you can also use more powerful controllers like an ESP8266 or an ARM machine like a Raspberry Pi, or even my laptop.

4DSystems provides libraries for all those platforms and others. Most of those libraries share a common language: C (they have also developed libraries in BASIC for PICAXE and in Pascal). But even though I spend a lot of time writing C code, when I'm on my laptop I prefer higher-level languages like Node.js or Python. So why not use Python to control these displays?

Actually, Python, being written in C itself, has great support for wrapping C libraries so you can use them from the language. Using Python for development has several advantages:

- Powerful language with complex but easy-to-use data structures
- Rapid development, since it's an interpreted language
- Mostly platform independent (you still need to compile the C libraries for your platform, but the wrapper and example should work without modifications)
- It's cool

Since the 4Duino-24 uses a Picaso chip I asked 4DSystems for a sample of one of their Diablo16 products. A gen4-uLCD-32DCT-CLB display arrived shortly after but I had no time to play with it until this week.



The gen4-uLCD-32DCT-CLB display with a nice bezel and the flat cable connected to the interface board



The back of the gen4-uLCD-32DCT-CLB display



The Diablo16 controller

Compiling the C library

What I wanted to build was a Python wrapper for the Diablo16 Serial Library, so the first thing I needed was a library to wrap.

The easiest way is to check out the Diablo16 Serial Linux Library repository and build it. It is meant for the Raspberry Pi but does work on my x86_64 Linux laptop. Once you have checked it out and "cd" into the folder, it should be as simple as typing "make". But it isn't.

$ make
[Compile] diabloSerial.c
diabloSerial.c: In function ‘OpenComm’:
diabloSerial.c:2240:14: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
write(cPort, (unsigned char *)&ch, 1);
^
[Link (Dynamic)]
/usr/bin/ld: diabloSerial.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
diabloSerial.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
make: *** [diabloSerial.so] Error 1

The linker complains about non-position-independent code being used in a shared library. But, helpfully enough, it gives you the solution. Just edit the "Makefile" to add the -fPIC flag to the compile options:

CFLAGS = $(DEBUG) -Wall $(INCLUDE) -Winline -pipe -fPIC

And…

$ make clean; make
rm -f diabloSerial.o *~ core tags *.bak Makefile.bak libdiabloSerial.*
[Compile] diabloSerial.c
diabloSerial.c: In function ‘OpenComm’:
diabloSerial.c:2240:14: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
write(cPort, (unsigned char *)&ch, 1);
^
[Link (Dynamic)]

The result is a "libdiabloSerial.so" that you will have to copy somewhere the Python wrapper can find it. More about this soon.

Wrapping it up

The Diablo16 and Picaso wrapper libraries are released as free, open source software and can be checked out from my 4DSystems Python repository on Bitbucket.

The trick is to use the ctypes library for Python. ctypes is a "foreign function library (that) provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python".
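As an aside, here is a minimal sketch (not from the original article) of what calling a C function through ctypes looks like, using the standard math library instead of libdiabloSerial so it can be tried anywhere:

from ctypes import CDLL, c_double
from ctypes.util import find_library

# load the C math library and describe the signature of cos() so ctypes
# converts arguments and the return value correctly
libm = CDLL(find_library("m"))
libm.cos.restype = c_double
libm.cos.argtypes = [c_double]

print(libm.cos(0.0))  # prints 1.0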

It actually feels like magic when you call your first C function from Python code. So I spent a few hours wrapping the Diablo16 Serial library. I found a way to preserve most of the API in a simple and quick-to-code way. I only changed the way the wrapper returns values in some cases, like when returning strings or arrays of values.

Since it was the first time I was writing a wrapper with ctypes, I spent some time trying to find a way to avoid writing thousands of lines of function definitions. I'm quite happy with the solution I found: most of the code has been moved to configuration, and only the functions that deal with pointers (strings, arrays, ...) or by-reference variables have their own wrapper function in the library.

I also realised that the differences between the Diablo16 and Picaso APIs are really minimal, so I created two libraries that both inherit from the same class and only define the calls that differ.

So this is the DiabloSerial.py library, which inherits from BaseSerial.py and defines only the calls that are specific to the Diablo16 controller, plus the relative path to the dynamic library to wrap.

from ctypes import *
from BaseSerial import BaseSerial

class DiabloSerial(BaseSerial):

    def library(self):
        return cdll.LoadLibrary("libs/libdiabloSerial.so")

    def definitions(self):
        definitions = super(DiabloSerial, self).definitions()
        definitions['bus_Read8'] = [c_uint16, []]
        definitions['bus_Write8'] = [None, [c_uint16]]
        definitions['putstr'] = [c_uint16, [c_char_p]]
        return definitions
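I'm not reproducing BaseSerial.py here, but a minimal sketch of how such a definitions dictionary could be applied to the loaded library might look like the following (the constructor and attribute handling below are my assumption, not the actual implementation):

class BaseSerial(object):

    def __init__(self):
        self._lib = self.library()
        # apply each [restype, argtypes] pair to the corresponding C function
        for name, (restype, argtypes) in self.definitions().items():
            func = getattr(self._lib, name)
            func.restype = restype
            func.argtypes = argtypes

    def library(self):
        raise NotImplementedError

    def definitions(self):
        return {}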

There is little error checking and I'm not sure all the calls will work, especially the error callbacks. I'm testing the wrapper as I write sample code, and so far so good. But if you happen to find a bug or have a suggestion, please tell me.

Wiring the display

In the box, aside from the display, there was a small documentation booklet, a 30 pin flat cable and a gen4-IB interface board . This board brings out the minimum pins required to communicate with the display controller: 5V, TX, RX, GND and RES for reset.



So I grabbed my FTDI based USB to UART board and wired them together using 5V logic and crossing RX and TX lines. I didn’t have to wire the RES pin.


A simple example

So let's test the wrapper.

import sys
from DiabloSerial import DiabloSerial
from DiabloConstants import *

def callback(errcode, errbyte):
    print "ERROR: ", errcode, errbyte
    sys.exit(1)

if __name__ == "__main__":
    diablo = DiabloSerial(500, True, callback)
    diablo.OpenComm('/dev/ttyUSB0', 9600)
    diablo.gfx_ScreenMode(PORTRAIT);
    dia

Adding extra behavior before and after a Python function runs



2016.10.19 17:46:04



In the past, whenever I wanted to add some extra behavior before and after a function (such as filtering, timing, and so on), a decorator was always the first thing that came to mind. For example, this program that measures elapsed time:

from functools import wraps, partial
from time import time

def timing(func=None, frequencies=1):
    if func is None:
        # print("+None")
        return partial(timing, frequencies=frequencies)
    # else:
    #     print("-None")

    @wraps(func)
    def _wrapper(*args, **kwargs):
        start_time = time()
        for t in range(frequencies):
            result = func(*args, **kwargs)
        end_time = time()
        print('Elapsed time: {:.6f}s.'.format(end_time - start_time))
        return result
    return _wrapper

@timing
def run():
    l = []
    for i in range(5000000):
        l.extend([i])
    return len(l)

Running it:

In [4]: run()
Elapsed time: 2.383398s.
Out[4]: 5000000

(If you like to get to the bottom of things, remove the comments and think about what output you would expect.)

Today I stumbled upon Python's context managers (ContextManager) and found they work very nicely for this too. They are closely tied to the with statement, which I had surprisingly never paid attention to before.

from time import time

def run2():
    l = []
    for i in range(5000000):
        l.extend([i])
    return len(l)

class ElapsedTime():
    def __enter__(self):
        self.start_time = time()
        return self

    def __exit__(self, exception_type, exception_value, traceback):
        self.end_time = time()
        print('Elapsed time: {:.6f}s.'.format(self.end_time - self.start_time))

with ElapsedTime():
    run2()
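Not in the original post, but for completeness: the same timer can also be written with contextlib.contextmanager, which saves the explicit __enter__/__exit__ boilerplate:

from contextlib import contextmanager
from time import time

@contextmanager
def elapsed_time():
    start_time = time()
    try:
        yield
    finally:
        print('Elapsed time: {:.6f}s.'.format(time() - start_time))

with elapsed_time():
    run2()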

After skimming a bit of the official documentation, there is actually quite a lot to context managers. Python has grown into something that isn't so simple anymore. If it still looks simple, that's only because you haven't kept up and are relying on the same old tricks. So knowledge needs constant refreshing to cover your blind spots; that is what I most want to say here.

I'll find some time later to dig into this properly.

Building RESTful APIs With Flask: ORM Independent


In the first part of this three-part tutorial series, we saw how to write RESTful APIs all by ourselves using Flask as the web framework. In the second part, we created a RESTful API using Flask-Restless, which depends on SQLAlchemy as the ORM. In this part, we will use another Flask extension, Flask-Restful, which abstracts your ORM and does not make any assumptions about it.

I will take the same sample application as in the last part of this series to maintain context and continuity. Although this example application is based on SQLAlchemy itself, this extension can be used along with any ORM in a similar fashion, as shown in this tutorial.

Installing Dependencies

While continuing with the application from the first part, we need to install only one dependency:

$ pip install Flask-Restful

The Application

Before we start, you might want to remove the code that we wrote for the second part of this tutorial series for more clarity.

As always, we will start with changes to our application's configuration, which will look something like the following lines of code:

flask_app/my_app/__init__.py

from flask.ext.restful import Api
api = Api(app)

Just adding the above couple of lines to the existing code should suffice.

flask_app/my_app/catalog/views.py

import json
from flask import Blueprint, abort
from flask.ext.restful import Resource
from flask.ext.restful import reqparse
from my_app.catalog.models import Product
from my_app import api, db

catalog = Blueprint('catalog', __name__)

parser = reqparse.RequestParser()
parser.add_argument('name', type=str)
parser.add_argument('price', type=float)

@catalog.route('/')
@catalog.route('/home')
def home():
    return "Welcome to the Catalog Home."

class ProductApi(Resource):
    def get(self, id=None, page=1):
        if not id:
            products = Product.query.paginate(page, 10).items
        else:
            products = [Product.query.get(id)]
        if not products:
            abort(404)
        res = {}
        for product in products:
            res[product.id] = {
                'name': product.name,
                'price': product.price,
            }
        return json.dumps(res)

    def post(self):
        args = parser.parse_args()
        name = args['name']
        price = args['price']
        product = Product(name, price)
        db.session.add(product)
        db.session.commit()
        res = {}
        res[product.id] = {
            'name': product.name,
            'price': product.price,
        }
        return json.dumps(res)

api.add_resource(
    ProductApi,
    '/api/product',
    '/api/product/<int:id>',
    '/api/product/<int:id>/<int:page>'
)

Most of the code above is self-explanatory, but I will highlight a few points. The code looks very similar to what we wrote in the first part of this series, but here the extension does a bunch of behind-the-scenes optimizations and provides a lot more features that can be leveraged.

Here, the methods declared in any class that subclasses Resource are automatically considered for routing. Also, any parameters that we expect to receive with incoming HTTP calls need to be parsed using reqparse.

Testing the Application

This application can be tested in exactly the same way as we did in the second part of this tutorial series. I have kept the routing URLs the same for that purpose.
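For reference, a quick smoke test from the command line could look like this (assuming the development server is running on localhost:5000; the host, port, and sample values are placeholders):

$ curl -d "name=iPhone" -d "price=549.0" http://localhost:5000/api/product
$ curl http://localhost:5000/api/product
$ curl http://localhost:5000/api/product/1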

Conclusion

In this last part of this three-part tutorial series on developing RESTful APIs with Flask, we saw how to write ORM-independent RESTful APIs. This wraps up the basics of writing RESTful APIs with Flask in various ways.

There is more that can be learned about each of the methods covered, and you can explore this on your own, using the basics you've learned in this series.

Tomorrow FOSSASIA meets PyLadies Pune



Posted: 2016-10-19T19:15:15+05:30

Tomorrow we have a special PyLadies meetup at the local Red Hat office. Hong Phuc Dang from FOSSASIA is coming down for a discussion with the PyLadies team here. She will be talking about the various projects FOSSASIA is working on, including codeheat. In the second half, I will be running a workshop on creating command line shells using Python.



On Friday we will be moving to Belgaum, Karnataka, India. We will be participating in Science Hack Day India; the idea is to have fun along with school kids and build something. Praveen Patil is leading the effort for this event.

The Python logging module


In real life, keeping records is very important. Banks keep records of transfers; airplanes carry black boxes (flight data recorders) that record everything during a flight. If something goes wrong, people can use that data to figure out what actually happened. Logging is just as important for developing, debugging, and operating software. Without logs, when a program crashes you have almost no way of figuring out what happened. For example, logging is essential when you are writing a server program. Below is a screenshot of the log file of the EZComet.com server.



After the service crashed, without logs I would have had almost no way of knowing what went wrong. Logs matter not only for servers but also for desktop GUI applications. For example, when your customer's PC program crashes, you can ask them to send you the log file so you can find where the problem is. Trust me, you can never predict what strange problems will appear in different PC environments. I once received an error log like this:

2011-08-22 17:52:54,828 - root - ERROR - [Errno 10104] getaddrinfo failed
Traceback (most recent call last):
  File "<string>", line 124, in main
  File "<string>", line 20, in __init__
  File "h:workspaceprojectbuildpyi.win32mrdjoutPYZ1.pyz/wx._core", line 7978, in __init__
  File "h:workspaceprojectbuildpyi.win32mrdjoutPYZ1.pyz/wx._core", line 7552, in _BootstrapApp
  File "<string>", line 84, in OnInit
  File "h:workspaceprojectbuildpyi.win32mrdjoutPYZ1.pyz/twisted.internet.wxreactor", line 175, in install
  File "h:workspaceprojectbuildpyi.win32mrdjoutPYZ1.pyz/twisted.internet._threadedselect", line 106, in __init__
  File "h:workspaceprojectbuildpyi.win32mrdjoutPYZ1.pyz/twisted.internet.base", line 488, in __init__
  File "h:workspaceprojectbuildpyi.win32mrdjoutPYZ1.pyz/twisted.internet.posixbase", line 266, in installWaker
  File "h:workspaceprojectbuildpyi.win32mrdjoutPYZ1.pyz/twisted.internet.posixbase", line 74, in __init__
  File "h:workspaceprojectbuildpyi.win32mrdjoutPYZ1.pyz/socket", line 224, in meth
gaierror: [Errno 10104] getaddrinfo failed

I eventually found out that this customer's PC was infected by a virus that made the gethostname call fail. See, without a log to check, how could you possibly know that?

Print is not a good way to log

Although logging is very important, not all developers use it correctly. I have seen developers log like this: insert print statements during development and remove them when they are done. Like this:

print 'Start reading database'
records = model.read_recrods()
print '# records', records
print 'Updating record ...'
model.update_records(records)
print 'done'

This approach works for simple script-style programs, but for a complex system you had better not do it. First of all, you have no way to leave only the truly important messages in a log file: you will see lots of messages but won't find anything useful. You also have no way to control the output other than removing the print statements, and it is very likely you will forget to remove the useless ones. Moreover, everything print emits goes to standard output, which seriously interferes with any other data you want to read from stdout. Of course you could send the messages to stderr instead, but print is still a poor way to do logging.

Use Python's standard logging module

So what is the right way to log? It is actually very simple: use Python's standard logging module. Thanks to the Python community, logging is a standard module. It is easy to use and very flexible. You can use the logging system like this:

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info('Start reading database')
# read database here

records = {'john': 55, 'tom': 66}
logger.debug('Records: %s', records)
logger.info('Updating records ...')
# update records here

logger.info('Finish updating records')

When you run it, you will see:

INFO:__main__:Start reading database
INFO:__main__:Updating records ...
INFO:__main__:Finish updating records

You might ask how this differs from using print. It has the following advantages:

- You can control the level of the messages and filter out the unimportant ones.
- You can decide where the output goes, and how it gets there.

There are several importance levels to choose from: debug, info, warning, error, and critical. By giving the logger or handler a different level, you can send only error messages to a specific log file, or record only debug information while debugging. Let's change the logger level to DEBUG and look at the output again:

logging.basicConfig(level=logging.DEBUG)

The output becomes:

INFO:__main__:Start reading database
DEBUG:__main__:Records: {'john': 55, 'tom': 66}
INFO:__main__:Updating records ...
INFO:__main__:Finish updating records

As you can see, after changing the logger level to DEBUG, the debug records appear in the output. You can also choose how these messages are handled. For example, you can use a FileHandler to write the records to a file:

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# create a file handler
handler = logging.FileHandler('hello.log')
handler.setLevel(logging.INFO)

# create a logging format
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# add the handlers to the logger
logger.addHandler(handler)

logger.info('Hello baby')

Log at the appropriate level

With such a flexible logging module, you can output log records at the appropriate level to any destination and then configure them. So you might ask, what is the appropriate level? Here I'll share some of my experience.

In most cases you don't want to read too much detail in the log, so the DEBUG level should only be used while debugging. I use DEBUG only for detailed debugging information, especially when the amount of data is large or the frequency is high, such as the intermediate state of every loop iteration inside an algorithm.

def complex_algorithm(items):
    for i, item in enumerate(items):
        # do some complex algorithm computation
        logger.debug('%s iteration, item=%s', i, item)

For routine business such as handling requests or server state changes, I use the INFO level.

def handle_request(request):
    logger.info('Handling request %s', request)
    # handle request here
    result = 'result'
    logger.info('Return result: %s', result)

def start_service():
    logger.info('Starting service at port %s ...', port)
    service.start()
    logger.info('Service is started')

I use WARNING when something important happens that is not an error, for example when a user logs in with a wrong password, or when the connection gets slow.

def authenticate(user_name, password, ip_address):
    if user_name != USER_NAME and password != PASSWORD:
        logger.warn('Login attempt to %s from IP %s', user_name, ip_address)
        return False
    # do authentication here

When an error occurs, the ERROR level is certainly the one to use, for example when an exception is thrown, an IO operation fails, or there is a connection problem.

def get_user_by_id(user_id):
    user = db.read_user(user_id)
    if user is None:
        logger.error('Cannot find user with user_id=%s', user_id)
        return user
    return user

I rarely use CRITICAL. You can use this level when something truly terrible happens, say, running out of memory, a full disk, or a nuclear meltdown (hopefully that never happens :S).

Although you are not forced to set the logger name to __name__, doing so brings many benefits. In Python, the variable __name__ is the name of the current module. For example, calling logging.getLogger(__name__) in module "foo.bar.my_module" is equivalent to calling logging.getLogger("foo.bar.my_module"). When you need to configure the logger, you can configure "foo", and all modules in package foo will share the same configuration. And when you read the log file, you can tell exactly which module a message comes from.
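A small sketch of what that buys you (the module names here are only for illustration):

import logging

# configure the package-level logger once...
foo_logger = logging.getLogger('foo')
foo_logger.setLevel(logging.WARNING)
foo_logger.addHandler(logging.FileHandler('foo.log'))

# ...and records from any child logger propagate to its handlers
logging.getLogger('foo.bar.my_module').warning('ends up in foo.log')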

Capture exceptions and record them with the traceback

It is a good habit to log when something goes wrong, but it is useless without the traceback. You should catch exceptions and log them together with the traceback, as in the following example:

try:
    open('/path/to/does/not/exist', 'rb')
except (SystemExit, KeyboardInterrupt):
    raise
except Exception, e:
    logger.error('Failed to open file', exc_info=True)

Calling a logger method with the parameter exc_info=True makes the traceback go to the logger. You can see the result below:

ERROR:__main__:Failed to open file
Traceback (most recent call last):
  File "example.py", line 6, in <module>
    open('/path/to/does/not/exist', 'rb')
IOError: [Errno 2] No such file or directory: '/path/to/does/not/exist'

Using the logging module in Python involves four main classes, best summarized by the official documentation:

Logger provides the interface that application code uses directly;

Handler sends the log records (created by loggers) to the appropriate destination;

Filter provides a finer-grained facility for deciding which log records to output;

Formatter determines the final output format of a log record.

The logging module was introduced in Python 2.3. Below are some commonly used classes and module-level functions.

Module-level functions

logging.getLogger([name]): returns a logger object; if no name is given, the root logger is returned

logging.debug(), logging.info(), logging.warning(), logging.error(), logging.critical(): log a message at the corresponding level using the root logger

logging.basicConfig(): creates a StreamHandler with a default Formatter, performs the basic configuration, and adds it to the root logger



Every program obtains a Logger before it outputs any information. The Logger usually corresponds to the program's module name; for example, the GUI module of a chat tool can obtain its Logger like this:

LOG = logging.getLogger("chat.gui")

and the core module like this:

LOG = logging.getLogger("chat.kernel")

Logger.setLevel(level): sets the lowest log level to handle; levels below it are ignored. DEBUG is the lowest built-in level, CRITICAL the highest

Logger.addFilter(filt), Logger.removeFilter(filt): add or remove the specified filter

Logger.addHandler(hdlr), Logger.removeHandler(hdlr): add or remove the specified handler

Logger.debug(), Logger.info(), Logger.warning(), Logger.error(), Logger.critical(): log a message at the corresponding level

Set the logger's level; the levels are:



NOTSET < DEBUG < INFO < WARNING < ERROR < CRITICAL

If the logger's level is set to INFO, records below the INFO level are not output, while records at INFO level and above are output.

Handlers

A handler object is responsible for sending the log information to the specified destination. Python's logging system provides many kinds of Handlers: some output to the console, some write to files, and others send the information over the network. If none of them fits, you can also write your own Handler. Multiple handlers can be added with the addHandler() method.

Handler.setLevel(level): sets the level of messages to process; messages below it are ignored

Handler.setFormatter(): assigns a format to this handler

Handler.addFilter(filt), Handler.removeFilter(filt): add or remove a filter object

Formatters

A Formatter object sets the final rules, structure, and content of a log message. The default time format is %Y-%m-%d %H:%M:%S. Below are some commonly used Formatter fields:

%(name)s: name of the Logger

%(levelno)s: log level as a number

%(levelname)s: log level as text

%(pathname)s: full path of the module that issued the logging call (may be unavailable)

%(filename)s: file name of the module that issued the logging call

%(module)s: name of the module that issued the logging call

%(funcName)s: name of the function that issued the logging call

%(lineno)d: line number of the statement that issued the logging call

%(created)f: current time, as a UNIX-standard floating-point timestamp

%(relativeCreated)d: milliseconds since the Logger was created, at the time the record is emitted

%(asctime)s: current time as a string; the default format is "2003-07-08 16:49:45,896" (the number after the comma is milliseconds)

%(thread)d: thread ID (may be unavailable)

%(threadName)s: thread name (may be unavailable)

%(process)d: process ID (may be unavailable)

%(message)s: the message supplied by the user

Setting up filters

Careful readers will have noticed that the argument passed to logging.getLogger() earlier looks like "A.B.C". That format is used precisely so that filters can be configured. Look at this piece of code:

LOG = logging.getLogger("chat.gui.statistic")
console = logging.StreamHandler()
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
console.setFormatter(formatter)
filter = logging.Filter("chat.gui")
console.addFilter(filter)
LOG.addHandler(console)

The difference from before is that we added a filter to the Handler. Now log messages pass through the filter when they are output. A filter named "A.B" only lets loggers whose names carry the "A.B" prefix output messages. Multiple filters can be added, and if any one of them rejects a record, it will not be output. Of course, loggers under that prefix (such as "chat.gui.statistic" in the example) still have their messages output. Filters can also be added to Loggers themselves.

Each Logger can have multiple Handlers attached. Next, let's introduce some commonly used Handlers:

1) logging.StreamHandler

This Handler can output information to any file-like object similar to sys.stdout or sys.stderr. Its constructor is:

StreamHandler([strm])

where the strm parameter is a file object. The default is sys.stderr.

2) logging.FileHandler

Similar to StreamHandler, but used to write log information to a file; FileHandler opens the file for you. Its constructor is:

FileHandler(filename[, mode])

filename is the file name; a file name must be given.

mode is the mode in which the file is opened; see the built-in function open() for details. The default is 'a', i.e. append to the end of the file.

3) logging.handlers.RotatingFileHandler

This Handler is similar to FileHandler above, but it can manage the file size. When the file reaches a certain size, it automatically renames the current log file and then creates a new log file with the same name to keep writing. For example, suppose the log file is chat.log. When chat.log reaches the specified size, RotatingFileHandler automatically renames it to chat.log.1; if chat.log.1 already exists, it is first renamed to chat.log.2, and so on. Finally chat.log is recreated and logging continues. Its constructor is:

RotatingFileHandler(filename[, mode[, maxBytes[, backupCount]]])

where the filename and mode parameters are the same as for FileHandler.

maxBytes specifies the maximum size of the log file. If maxBytes is 0, the log file can grow without limit and the renaming described above never happens.

backupCount specifies the number of backup files to keep. For example, if it is set to 2, then when the renaming described above happens, the existing chat.log.2 is not renamed but deleted.
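A minimal usage sketch (the file name and limits are arbitrary):

import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger('chat')
# rotate chat.log once it grows past 1 MB, keeping chat.log.1 and chat.log.2
handler = RotatingFileHandler('chat.log', maxBytes=1024 * 1024, backupCount=2)
logger.addHandler(handler)
logger.warning('this ends up in chat.log')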

4) logging.handlers.TimedRotatingFileHandler

This Handler is similar to RotatingFileHandler, except that it does not decide when to create a new log file by checking the file size; instead, a new log file is created automatically at a fixed time interval. The renaming process is similar to RotatingFileHandler, except that the suffix of the rotated file is the current time rather than a number. Its constructor is:

TimedRotatingFileHandler(filename[, when[, interval[, backupCount]]])

where the filename and backupCount parameters have the same meaning as in RotatingFileHandler.

interval is the time interval.

The when parameter is a string giving the unit of the time interval, case-insensitive. It takes the following values:

S - seconds

M - minutes

H - hours

D - days

W - every week (interval==0 means Monday)

midnight - every day at midnight
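For example, a handler that starts a new file every day at midnight and keeps a week of backups could be set up like this (a sketch, not from the original article):

import logging
from logging.handlers import TimedRotatingFileHandler

logger = logging.getLogger('chat')
handler = TimedRotatingFileHandler('chat.log', when='midnight', interval=1, backupCount=7)
logger.addHandler(handler)
logger.warning('rotated once a day')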

5) logging.handlers.SocketHandler

6) logging.handlers.DatagramHandler

These two Handlers are similar: both send log information over the network. The difference is that the former uses TCP and the latter uses UDP. Their constructors are:

Handler(host, port)

where host is the host name and port is the port number.

7) logging.handlers.SysLogHandler

8) logging.handlers.NTEventLogHandler

9) logging.handlers.SMTPHandler

10) logging.handlers.MemoryHandler

11)logging.handlers.HTTPHandler

# encoding:utf-8
#import logging

#FORMAT = '%(asctime)-15s %(clientip)s %(user)-8s %(message)s'
#logging.basicConfig(format=FORMAT)
#d = {'clientip': '192.168.0.1', 'user': 'fbloggs'}
#logger = logging.getLogger('tcpserver')
#logger.warning('Protocol problem: %s', 'connection reset', extra=d)

#FORMAT = '%(asctime)-15s %(message)s'
#logging.basicConfig(filename = "C:\\Users\\june\\Desktop\\1.txt", level = logging.DEBUG, filemode = "a", format=FORMAT)
#logging.debug('this is a message')
#logging.debug('test')

#import logging
#import datetime
#
#curDate = datetime.date.today() - datetime.timedelta(days=0)
#logName = 'C:\\Users\\june\\Desktop\\error_%s.log' %curDate
#
#logging.basicConfig(level=logging.INFO,
#                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
#                    #datefmt='%a, %d %b %Y %H:%M:%S',
#                    filename=logName,
#                    filemode='a')
#
##2013-10-21 03:25:51,509 writeLog.py[line:14] INFO This is info message
##2013-10-21 03:25:51,510 writeLog.py[line:15] WARNING This is warning message
#logging.debug('This is debug message')
#logging.info('This is info message')
#logging.warning('This is warning message')

import logging
import logging.config

logging.config.fileConfig("logging.conf")

#create logger
loggerInfo = logging.getLogger("infoLogger")

#"application" code
loggerInfo.debug("debug message")
loggerInfo.info("info message")
loggerInfo.warn("warn message")
loggerInfo.error("error message")
loggerInfo.critical("critical message")

loggerError = logging.getLogger("errorLogger")
loggerError.error("Error: Hello world!")

#coding=utf-8
import logging
import datetime

format = '%(asctime)s - %(filename)s - [line:%(lineno)d] - %(levelname)s - %(message)s'
curDate = datetime.date.today() - datetime.timedelta(days=0)
infoLogName = r'C:/Users/june/Desktop/info_%s.log' % curDate
errorLogName = r'C:/Users/june/Desktop/error_%s.log' % curDate

formatter = logging.Formatter(format)

infoLogger = logging.getLogger("infoLog")
errorLogger = logging.getLogger("errorLog")

infoLogger.setLevel(logging.INFO)
errorLogger.setLevel(logging.ERROR)

infoHandler = logging.FileHandler(infoLogName, 'a')
infoHandler.setLevel(logging.INFO)
infoHandler.setFormatter(formatter)

errorHandler = logging.FileHandler(errorLogName, 'a')
errorHandler.setLevel(logging.ERROR)
errorHandler.setFormatter(formatter)

testHandler = logging.StreamHandler()
testHandler.setFormatter(formatter)
testHandler.setLevel(logging.ERROR)

infoLogger.addHandler(infoHandler)
infoLogger.addHandler(testHandler)
errorLogger.addHandler(errorHandler)

#infoLogger.debug("debug message")
#infoLogger.info("info message")
#infoLogger.warn("warn message")
# # the following line is printed both to the file and to the terminal
#infoLogger.error("error message")
#
#errorLogger.error("error message")
#errorLogger.critical("critical message")

'''
Created on 18 August 2016

@author: apple
'''
#-*- coding:utf-8 -*-

# Build a logging setup that both prints to the console and writes to a log file

import logging
import time
import os
import os.path

class Logger():
    def __init__(self, log_name, logger_name):

        '''
        Specify where to save the log

[Python]-16 Regular Expression Basics

Introduction

Regular expressions are not part of Python's syntax; they are an independent, powerful tool for processing strings. In languages that support them, regular expressions work the same way. This article introduces Python's re module, which provides support for the common regular expression syntax.

0x1. Basic regular expression syntax

The basic syntax of regular expressions is the same in every programming language that supports them. Below are some commonly used constructs and what they mean:

Character matching:

● . : (dot) matches any single character (except the newline \n);

● \ : (backslash) escapes special characters;

● \d : matches a single digit;

● \D : matches a single non-digit character;

● \w : matches a single word character (letters, digits, and the underscore);

● \W : matches a single non-word character;

● \s : matches a single whitespace character (including space, \n, \r, \t, \f, \v);

● \S : matches a single non-whitespace character;

Ranges and quantifiers:

● [...] : (square brackets) matches a single character out of the set or range inside the brackets; for example [abc] matches any one of the three characters a, b, c, and [a-z] matches any single lowercase letter;

● [^...] : the negation of the bracket match, i.e. matches any single character that is not in the set or range inside the brackets;

● * : (asterisk) matches the preceding character 0 or more times; for example "ab*c" matches "ac", "abc", and "abbbbc";

● + : (plus) matches the preceding character 1 or more times; the only difference from the asterisk is that it must match at least once, so "ab+c" cannot match "ac" but matches "abc" and "abbbbc";

● ? : (question mark) matches the preceding character 0 or 1 times; for example "ab?c" matches "ac" and "abc", but not "abbbbc";

● {m} : matches the preceding character exactly m times; for example "ab{3}c" matches "abbbc" but not "abc" or "abbc";

● {m,n} : matches the preceding character between m and n times; if m is omitted it defaults to 0, if n is omitted it defaults to infinity, and if both are omitted it is equivalent to the asterisk; for example "ab{2,4}c" matches "abbc", "abbbc", and "abbbbc", but not "abc";

● *? : makes the asterisk non-greedy; most regular expression engines are greedy by default, and non-greedy mode is mostly used for grouping, demonstrated in the second part of this article;

● +? : makes the plus non-greedy;

● ?? : makes the question mark non-greedy;

● {m,n}? : makes the brace quantifier non-greedy;

Boundaries, alternation, and grouping:

● ^ : (caret) matches the start of the string; for example "^ab" matches "ab123" but not "bb123"; any string starting with "ab" matches;

● $ : (dollar sign) matches the end of the string; for example ".*\.py$" matches every string ending in ".py";

● ^...$ : a full match; for example "^abcd$" matches only the string "abcd";

● | : (the vertical bar) first checks whether the left side matches; if it does, the right side is not checked, otherwise the right side is checked; for example "^abc|.*\.py$" matches every string that starts with "abc" or ends with ".py", such as "abchello" or "hello.py";

● () : (parentheses) the characters inside the parentheses are treated as a single group, often combined with |; for example "a(\d\d)b" means there must be two digits between a and b (as in "a12b"), and "a(123|456)b" means there must be 123 or 456 between a and b;

Now that you know these basic rules, let's see how to use these regular expressions from Python.

0x2. Common regular expression examples in Python

a. Matching with the re module

Python's built-in re module provides regular expression support. Let's start with a few simple examples:

#!/usr/bin/env python3
#coding=utf-8

# import the re module
import re

# re.match(r"regex", "string to match"); the r prefix before the regex avoids escaping backslashes.
# This example matches any string with three characters between a and b.
rex = re.match(r"a...b", "a123b")
print(rex)  # prints an SRE_Match object

# re.match returns an SRE_Match object if the string matches, and None otherwise
if rex:
    print("True")
else:
    print("False")

# Program output
<_sre.SRE_Match object; span=(0, 5), match='a123b'>
True

# Modify the re.match(r"a...b", "a123b") part of the program above, remove the print(rex) statement,
# and try the regular expressions introduced in the first section to see what the program outputs.

# 1. Matching digits and non-digits
re.match(r"a\d\d\db", "a111b")
True
re.match(r"a\d\d\db", "a1b")
False
re.match(r"a\Db", "a2b")
False

# 2. Matching word and non-word characters
re.match(r"a\w\w\we", "abcde")
True
re.match(r"a\w\w\we", "a12De")
True
re.match(r"a\w\W\we", "a1_De")
False

# 3. Bracket matching
re.match(r"a[bcd]e", "abce")
False
re.match(r"a[bcd]e", "abe")
True
re.match(r"a[^bcd]e", "abe")
False
re.match(r"a[^bcd]e", "axe")
True

# 4. Asterisk matching
re.match(r"ab*c", "abc")
True
re.match(r"ab*c", "ac")
True

# 5. Plus matching
re.match(r"ab+c", "ac")
False
re.match(r"ab+c", "abbbc")
True

# 6. Question mark matching
re.match(r"ab?c", "ac")
True
re.match(r"ab?c", "abbc")
False

# You can test the rest yourself following the format above

b. Splitting strings

Splitting a string with a regular expression is more flexible and powerful than the string's built-in split method. See the example below:

#!/usr/bin/env python3
#coding=utf-8
import re

# ordinary string split
print("a b c  d   e".split(" "))

# split with a regular expression, treating one or more spaces as the separator
print(re.split(r"\s+", "a b c  d   e"))

# split on one or more spaces, dots, semicolons, or commas
print(re.split(r"[\s\.\;\,]+", "a ,b ;;c ... d e"))

# Program output: the plain string split produces empty strings for consecutive spaces,
# while the regular expression does not
['a', 'b', 'c', '', 'd', '', '', 'e']
['a', 'b', 'c', 'd', 'e']
['a', 'b', 'c', 'd', 'e']

c. Grouping strings

The groups() method of the match object returned by the re module splits the matched string into groups; each group to capture is wrapped in a pair of parentheses in the regular expression. See the example below:

#!/usr/bin/env python3
#coding=utf-8
import re

# First check whether the string starts with three digits and ends with 6 to 8 digits;
# if it matches, groups() and group() extract the content captured by each pair of parentheses
a = re.match(r"^(\d{3})-(\d{6,8})$", "010-88888888")
if a:
    print(a.groups())   # the tuple of captured groups
    print(a.group(0))   # the whole matched string
    print(a.group(1))   # the content of the first pair of parentheses
    print(a.group(2))   # the content of the second pair of parentheses, and so on if there are more

# Program output
('010', '88888888')
010-88888888
010
88888888

Let's look at another example that matches a clock time:

#!/usr/bin/env python3
#coding=utf-8
import re

t = "23:53:09"
# A full match from start to end: before the first colon, either the first digit is 0 or 1 and the
# second is 0-9, or the first digit is 2 and the second is 0-3, or it is a single digit 0-9; these
# are the hours of the day. The following two groups each match a number from 0 to 59.
a = re.match(r'^([01][0-9]|2[0-3]|[0-9]):([0-5][0-9]|[0-9]):([0-5][0-9]|[0-9])$', t)
if a:
    print(a.groups())
    print(a.group(0))
    print(a.group(1))
    print(a.group(2))
    print(a.group(3))

# Program output
('23', '53', '09')
23:53:09
23
53
09

d. Compiling regular expressions

In each regular expression match above, the Python interpreter first compiles the regular expression and then uses the compiled pattern to match the string. When a large number of repeated matches are performed, this is quite inefficient.

If a regular expression never changes, it should be precompiled with compile(). Each subsequent match then uses the precompiled pattern directly and does not need to compile it again, which improves the program's efficiency. Below is a precompilation example:

#!/usr/bin/env python3
#coding=utf-8
import re

t1 = "23:53:09"
t2 = "19:02:12"

# precompile the clock-matching regular expression
re_c = re.compile(r'^([01][0-9]|2[0-3]|[0-9]):([0-5][0-9]|[0-9]):([0-5][0-9]|[0-9])$')

# every later match uses the compiled pattern and no longer needs to compile
a = re_c.match(t1)
b = re_c.match(t2)
print(a.groups())
print(b.groups())

# Program output
('23', '53', '09')
('19', '02', '12')

e. Non-greedy matching

In most languages, regular expressions are greedy by default, meaning they match as many characters as possible (a small number of languages default to non-greedy). Let's use an example to explain what greedy mode means:

#!/usr/bin/env python3
#coding=utf-8
import re

# Match an 'a', followed by zero or more 'b's, ending with one or more 'b's plus a 'c'.
# In greedy mode the first group matches as many 'b's as possible, so the second group
# is left with only a single 'bc'.
print(re.match(r"^(ab*)(b+c)$", "abbbbbc").groups())
# In non-greedy mode the asterisk matches as little as possible, i.e. zero 'b's.
print(re.match(r"^(ab*?)(b+c)$", "abbbbbc").groups())
# The plus matches one or more; in non-greedy mode it matches just one.
print(re.match(r"^(ab+)(b+c)$", "abbbbbc").groups())
print(re.match(r"^(ab+?)(b+c)$", "abbbbbc").groups())
# The question mark matches zero or one; in non-greedy mode it matches zero.
print(re.match(r"^(ab?)(b+c)$", "abbbbbc").groups())
print(re.match(r"^(ab??)(b+c)$", "abbbbbc").groups())

# Program output: greedy mode matches as much as possible, non-greedy as little as possible
('abbbb', 'bc')
('a', 'bbbbbc')
('abbbb', 'bc')
('ab', 'bbbbc')
('ab', 'bbbbc')
('a', 'bbbbbc')

Sending HTML emails with embedded images from Django


Currently I'm working on an application which sends HTML emails with embedded or inline images and multiple CSV and PDF attachments.

Let's assume that we will be using an object containing data for our email. I'm providing my models here just as a reference to get the idea:

class RenderedReport(models.Model):
    report = models.ForeignKey('Report', related_name='rendered_reports')
    approved_by = models.ForeignKey('auth.User', blank=True, null=True)
    date_rendered = models.DateTimeField(auto_now_add=True)
    date_queried = models.DateTimeField(blank=True, null=True)
    date_approved = models.DateTimeField(blank=True, null=True)
    date_sent = models.DateTimeField(blank=True, null=True)

    class Meta:
        get_latest_by = 'date_rendered'
        ordering = ('-date_rendered',)

    def __unicode__(self):
        return str(self.report)

class RenderedView(models.Model):
    rendered_report = models.ForeignKey('RenderedReport', related_name='rendered_views')
    view = models.ForeignKey('View', related_name='rendered_views')
    png = ImageField(upload_to='reports')
    pdf = models.FileField(upload_to='reports', blank=True)
    csv = models.FileField(upload_to='reports', blank=True)

    class Meta:
        ordering = ['view']

    def __unicode__(self):
        return str(self.view)

    @property
    def image_filename(self):
        return os.path.basename(self.png.name)

I don't like the idea of re-inventing the wheel, so I will be using a responsive email template from Zurb Studios .

I'm skipping the entire HTML template code for brevity because it wasn't modified. We only need this part:

<!-- BODY -->
<table class="body-wrap">
  <tr>
    <td></td>
    <td class="container" bgcolor="#FFFFFF">
      <div class="content">
        <table>
          <tr>
            <td>
              {% for view in views %}
                <h3>{{ view }}</h3>
                <p><img src="cid:{{ view.image_filename }}" /></p>
                {% if not forloop.last %}<p> </p>{% endif %}
              {% endfor %}
            </td>
          </tr>
        </table>
      </div>
    </td>
    <td></td>
  </tr>
</table><!-- /BODY -->

Now it's time to get our HTML email rendered. We won't be sending a plain-text version of our email.

I'm providing a simplified but working snippet of Python code:

import os
from email.mime.image import MIMEImage

from django.core.mail import EmailMultiAlternatives
from django.template.loader import render_to_string

rendered_report = RenderedReport.objects.get(pk=1)
views = rendered_report.rendered_views.all()

context = {'views': views}
html_content = render_to_string('reports/email.html', context=context).strip()

subject = 'HTML Email'
recipients = ['john.doe@test.com']
reply_to = ['noreply@test.com']

msg = EmailMultiAlternatives(subject, html_content, config.formatted_email_from, recipients, reply_to=reply_to)
msg.content_subtype = 'html'  # Main content is text/html
msg.mixed_subtype = 'related'  # This is critical, otherwise images will be displayed as attachments!

for view in views:
    # Create an inline attachment
    image = MIMEImage(view.png.read())
    image.add_header('Content-ID', '<{}>'.format(view.image_filename))
    msg.attach(image)

    # Create a regular attachment with a CSV file
    if view.csv:
        filename = os.path.basename(view.csv.name)
        msg.attach(filename, view.csv.read(), 'text/csv')

    # Create a regular attachment with a PDF file
    if view.pdf:
        filename = os.path.basename(view.pdf.name)
        msg.attach(filename, view.pdf.read(), 'application/pdf')

msg.send()

This will send a responsive HTML email containing inline images and attachments.

Please pay additional attention to the line with msg.mixed_subtype = 'related'. It sets the email header Content-Type: multipart/related, guaranteeing that your images will be displayed inline and not as attachments.

Here is an example of how the <img> tag will be rendered: <img src="cid:20161010_dailykpisnapshot_OCuZ4O4.png">

And here are the email headers:

Content-Type: image/png
Content-Disposition: inline
Content-Transfer-Encoding: base64
Content-ID: <20161010_dailykpisnapshot_OCuZ4O4.png>


Eight sorting algorithms in Python


1. Insertion sort

#-*- coding:utf-8 -*-
'''
Description
The basic operation of insertion sort is to insert one element into an already sorted sequence,
obtaining a new sorted sequence that is one element longer. The algorithm is suitable for small
amounts of data and has O(n^2) time complexity. It is a stable sorting method. It splits the array
into two parts: the first part contains all elements except the last one (leaving a free slot so
there is room to insert), and the second part contains just that one element (the element to
insert). Once the first part is sorted, the last element is inserted into it.
'''
def insert_sort(lists):
    count = len(lists)
    for i in range(1, count):
        key = lists[i]
        j = i - 1
        while j >= 0:
            if lists[j] > key:
                lists[j + 1] = lists[j]
                lists[j] = key
            j -= 1
    return lists

lst1 = raw_input().split()
lst = [int(i) for i in lst1]
#lst = input()
insert_sort(lst)
for i in range(len(lst)):
    print lst[i],

2. Shell sort

#-*- coding:utf8 -*-
'''
Description
Shell sort is a variant of insertion sort, also called diminishing increment sort; it is a more
efficient improvement of straight insertion sort and is not stable. It is named after D. L. Shell,
who proposed it in 1959. Shell sort groups the records by a certain increment of the index and
sorts each group with straight insertion sort; as the increment gradually decreases, each group
contains more and more elements. When the increment reaches 1, the whole sequence forms a single
group and the algorithm terminates.
'''
def shell_sort(lists):
    count = len(lists)
    step = 2
    group = count / step
    while group > 0:
        for i in range(group):
            j = i + group
            while j < count:
                k = j - group
                key = lists[j]
                while k >= 0:
                    if lists[k] > key:
                        lists[k + group] = lists[k]
                        lists[k] = key
                    k -= group
                j += group
        group /= step
    return lists

lst1 = raw_input().split()
lst = [int(i) for i in lst1]
#lst = input()
shell_sort(lst)
for i in range(len(lst)):
    print lst[i],

3. Bubble sort

#-*- coding:utf8 -*-
'''
Description
It repeatedly walks through the sequence to sort, compares two elements at a time, and swaps them
if they are in the wrong order. The walk through the sequence is repeated until no more swaps are
needed, which means the sequence is sorted.
'''
def bubble_sort(lists):
    count = len(lists)
    for i in range(count):
        for j in range(i + 1, count):
            if lists[i] > lists[j]:
                lists[i], lists[j] = lists[j], lists[i]
    return lists

lst1 = raw_input().split()
lst = [int(i) for i in lst1]
#lst = input()
bubble_sort(lst)
for i in range(len(lst)):
    print lst[i],

4. Selection sort

#-*- coding:utf8 -*-
'''
Description
Basic idea: in the first pass, select the smallest record among r1 ~ r[n] and swap it with r1; in
the second pass, select the smallest record among r2 ~ r[n] and swap it with r2; and so on. In the
i-th pass, select the smallest record among r[i] ~ r[n] and swap it with r[i], so the sorted prefix
keeps growing until the whole sequence is sorted.
'''
def select_sort(lists):
    count = len(lists)
    for i in range(count):
        min = i
        for j in range(i + 1, count):
            if lists[min] > lists[j]:
                min = j
        lists[min], lists[i] = lists[i], lists[min]
    return lists

lst1 = raw_input().split()
lst = [int(i) for i in lst1]
#lst = input()
select_sort(lst)
for i in range(len(lst)):
    print lst[i],

5. Quick sort

#-*- coding:utf8 -*-
'''
Description (recursive; relatively hard to understand)
One pass of partitioning splits the data into two independent parts, where every element in one
part is smaller than every element in the other. Each part is then quick-sorted in the same way;
the whole process proceeds recursively until the entire sequence is ordered.
'''
def quick_sort(lists, left, right):
    if left >= right:
        return lists
    key = lists[left]
    low = left
    high = right
    while left < right:
        while left < right and lists[right] >= key:
            right -= 1
        lists[left] = lists[right]
        while left < right and lists[left] <= key:
            left += 1
        lists[right] = lists[left]
    lists[right] = key
    quick_sort(lists, low, left - 1)
    quick_sort(lists, left + 1, high)
    return lists

lst1 = raw_input().split()
lst = [int(i) for i in lst1]
#lst = input()
quick_sort(lst, 0, len(lst) - 1)
for i in range(len(lst)):
    print lst[i],

6. Heap sort

#-*- coding:utf8 -*-
'''
Description (relatively hard to understand)
Heapsort is a sorting algorithm built on the heap data structure; it is a kind of selection sort
that exploits the array's ability to index elements directly. A heap can be a max-heap or a
min-heap and is a complete binary tree. A max-heap requires that no node's value is greater than
its parent's value, i.e. A[PARENT[i]] >= A[i]. To sort an array in non-descending order we use a
max-heap, since by definition the largest value sits at the top of the heap.
'''
# sift down to repair the heap
def adjust_heap(lists, i, size):
    lchild = 2 * i + 1
    rchild = 2 * i + 2
    max = i
    if i < size / 2:
        if lchild < size and lists[lchild] > lists[max]:
            max = lchild
        if rchild < size and lists[rchild] > lists[max]:
            max = rchild
        if max != i:
            lists[max], lists[i] = lists[i], lists[max]
            adjust_heap(lists, max, size)

# build the heap
def build_heap(lists, size):
    for i in range(0, (size / 2))[::-1]:
        adjust_heap(lists, i, size)

# heap sort
def heap_sort(lists):
    size = len(lists)
    build_heap(lists, size)
    for i in range(0, size)[::-1]:
        lists[0], lists[i] = lists[i], lists[0]
        adjust_heap(lists, 0, i)

lst1 = raw_input().split()
lst = [int(i) for i in lst1]
#lst = input()
heap_sort(lst)
for i in range(len(lst)):
    print lst[i],

7. Merge sort

#-*- coding:utf8 -*-
'''
Description (recursive)
Merge sort is an efficient sorting algorithm built on the merge operation and is a very typical
application of divide and conquer. Already-sorted subsequences are merged to obtain a fully sorted
sequence: first make each subsequence ordered, then make the segments ordered relative to each
other. Merging two sorted lists into one is called a two-way merge.
The merge works as follows: compare a[i] and a[j]; if a[i] <= a[j], copy a[i] from the first list
into r[k] and increment i and k; otherwise copy a[j] from the second list into r[k] and increment
j and k. Repeat until one of the lists is exhausted, then copy the remaining elements of the other
list into r. Merge sort is usually implemented recursively: split the interval [s, t] at its
midpoint, sort the left half, sort the right half, and finally merge the two sorted halves into the
sorted interval [s, t] with one merge operation.
'''
def merge(left, right):
    # the merge step
    i, j = 0, 0
    result = []
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge_sort(lists):
    if len(lists) <= 1:
        return lists
    mid = len(lists) / 2
    left = merge_sort(lists[:mid])
    right = merge_sort(lists[mid:])
    return merge(left, right)

lst1 = raw_input().split()
lst = [int(i) for i in lst1]
#lst = input()
tt = merge_sort(lst)
for i in range(len(tt)):
    print tt[i],

8. Radix sort

#-*- coding:utf8 -*-
'''
Description (I had never come across this one before)
Radix sort belongs to the family of distribution sorts, also known as bucket sort or bin sort. As
the name suggests, it distributes the elements into buckets according to parts of their key values
to achieve the sorting. Radix sort is a stable sort with time complexity O(nlog(r)m), where r is
the chosen radix and m is the number of piles; in some cases radix sort is more efficient than
other stable sorting algorithms.
'''
import math

def radix_sort(lists, radix=10):
    k = int(math.ceil(math.log(max(lists), radix)))
    bucket = [[] for i in range(radix)]
    for i in range(1, k + 1):
        for j in lists:
            # take the i-th digit (counting from the least significant) as the bucket index
            bucket[j / (radix ** (i - 1)) % radix].append(j)
        del lists[:]
        for z in bucket:
            lists += z
            del z[:]
    return lists

lst1 = raw_input().split()
lst = [int(i) for i in lst1]
#lst = input()
radix_sort(lst)
for i in range(len(lst)):
    print lst[i],

Below is a comparison of the time complexity and stability of each sorting algorithm:

Which sorting algorithm is the fastest on average?

Sorting method | Average case | Best case | Worst case | Auxiliary space | Stability

Bubble sort | O(n^2) | O(n) | O(n^2) | O(1) | stable

Selection sort | O(n^2) | O(n^2) | O(n^2) | O(1) | unstable

Insertion sort | O(n^2) | O(n) | O(n^2) | O(1) | stable

Shell sort | O(n*log(n))~O(n^2) | O(n^1.3) | O(n^2) | O(1) | unstable

Heap sort | O(n*log(n)) | O(n*log(n)) | O(n*log(n)) | O(1) | unstable

Merge sort | O(n*log(n)) | O(n*log(n)) | O(n*log(n)) | O(n) | stable

Quick sort | O(

OpenCV with Python Blueprints' first anniversary: Giveaway


A year ago today, Packt Publishing Ltd. released OpenCV with Python Blueprints, my first technical book on computer vision and machine learning using the OpenCV library. To celebrate this 1-year anniversary, I'm giving away 3 print copies of the book via Amazon Giveaways! Read on to find out how you can participate.



Michael Beyeler

OpenCV with Python Blueprints

Design and develop advanced computer vision projects using OpenCV with Python

Packt Publishing Ltd., London, England

Paperback: 230 pages

ISBN 978-178528269-0

[ GitHub ] [ Discussion Group ] [ Free Sample ]

Critical reception

When Packt Publishing first approached me with the idea of writing yet another book on OpenCV, I was skeptical. The proposal came during a hectic time of my PhD studies, so there was no way I was going to tell my adviser about it. :-) He would have probably told me to let it go, focus on more imminent goals instead such as earning an academic degree―and where was that paper draft I promised him a week ago anyway... That's what a good adviser is supposed to tell his over-eager student, and the over-eager student is supposed to listen. But, of course, I signed up for it anyway.

Knowing that this was just too good an opportunity to pass up, I did my research on the topic, and quickly realized that an advanced guide on how to use OpenCV for real-world applications was actually hard to come by. Of course, there was the O'Reilly classic , but that was about it. And I knew I didn't want to write yet another introductory tutorial masquerading as a book, or simply teach theory that readers could have looked up in a two-minute internet search. Instead, each chapter in the book was supposed to cover a self-contained, practical, real-world project, spanning all possible domains of the OpenCV library (such as image manipulation, augmented reality, object tracking, 3D scene reconstruction, statistical learning, and object categorization). A tall order.

One year later, I am really proud of the result, and thankful for all the support I have received. The source code has been bookmarked by over 50 people on GitHub , people have used it in their scholarly publications , and reviews have been more than heartwarming. Some sample testimonials from readers across various online retailers can be found below (platforms sorted by number of ratings):

Amazon (9 ratings), $37.49:
"This book is great for someone who is at a beginner level in computer vision and wants to get into an intermediate level by learning with real world examples."
- Sebastian Montabone

Barnes & Noble (5 ratings), $39.99:
"The book is written with a touch of personality that makes this an engaging read instead of a dry math text."
- Penny V. Webster

Google Books (4 ratings)

GoodReads (3 ratings):
"Fantastic book with a great balance of theory/concepts vs. advanced hands-on examples."
- James Schultz

Powell's (2 ratings), $53.32:
"Excellent compendium to learn OpenCV and apply it in advanced practical projects."
- Nancy Hurd

PacktPub (1 rating), $39.99

O'Reilly (1 rating), $39.99

Amazon Giveaway

With this I would like to thank everyone who has supported me throughout this year and helped make the book a success!

Because cold-hard cash is better than words, I am giving away 3 print copies (each a $39 value) via Amazon Giveaways . All you need to do is click on the link and follow me on Twitter. That's it! The Amazon site will instantly let you know whether you have won.

Thanks for stopping by―and good luck!

Python built-in functions (6): bool


English documentation:

class bool([x])

Return a Boolean value, i.e. one of True or False. x is converted using the standard truth testing procedure. If x is false or omitted, this returns False; otherwise it returns True. The bool class is a subclass of int (see Numeric Types ― int, float, complex). It cannot be subclassed further. Its only instances are False and True (see Boolean Values).

Notes:

1. Returns a Boolean value, True or False.

2. If the argument is omitted, False is returned.

>>> bool()  # no argument passed
False

3. The argument is converted using the standard truth testing procedure.

3.1 When a Boolean is passed, it is returned unchanged.

>>> bool(True)
True
>>> bool(False)
False

3.2 When a string is passed, an empty string returns False; any other string returns True.

>>> bool('')
False
>>> bool('0')
True

3.3 When a number is passed, zero returns False; any other value returns True.

>>> bool(0)
False
>>> bool(1)
True
>>> bool(-1.0)
True

3.4 When a tuple, list, dict, or similar container is passed, an empty container returns False; a non-empty one returns True.

>>> bool(())        # empty tuple
False
>>> bool((0,))      # non-empty tuple
True
>>> bool([])        # empty list
False
>>> bool([0])       # non-empty list
True
>>> bool({})        # empty dict
False
>>> bool({'k':'v'}) # non-empty dict
True
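Beyond the built-in types above, user-defined classes take part in the same truth testing procedure through __bool__() (or, failing that, __len__()). A small sketch, not from the original article:

class Basket:
    def __init__(self, items):
        self.items = items

    def __len__(self):
        # bool(Basket(...)) falls back to __len__ when __bool__ is not defined
        return len(self.items)

print(bool(Basket([])))       # False: empty container
print(bool(Basket(['egg'])))  # True: non-empty container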

Why K&R C is a Must-Read

$
0
0

In 1988 Brian Kernighan and Dennis Ritchie released the 2nd edition of "The C Programming Language" (better known as "K&R C"), a must-read for any serious programmer. It is praised in reviews , random HN posts, and in books like P. Seibel's excellent Coders at Work amongst others. Is reading it solid advice or is it a product of a cargo cult? I decided to find out.

The book's small size, slightly less than 300 pages, of which around 190 are the core material, masks a mountain of knowledge. I promised myself that, to grok everything, I would do every exercise and that it "shouldn't take me more than two or three weeks." That was on a fine April day. So I read the text, I did every exercise, I compared my code to others' solutions, when, on the last day of September, I bumped up against the appendix. This could have been compressed into a month's worth of focused study were it not for client work and other happenings in the real world. My motivation would ebb and flow, but the clear writing and interesting problems helped me get to the end.

The book contains many exercises, most of them small, 30-minute things. Taken altogether, they amount to roughly 6500 lines of C and progress from "substitute all substrings in a string" to "recursively read directory inodes to get each directory's files' names and sizes". I underestimated the time to complete an exercise the first few times; this humbled me. My first big wakeup call was that, in C's procedural style, my ways of managing state, learned through OOP and a bit of FP, were useless. There were other challenges as well: managing memory manually, handling low-level input, referencing and dereferencing memory using pointers, and figuring out the meaning of a few anachronisms like "tabstops." All sunk-cost fallacies aside, what did I get out of this?

I can finally explore everything ever written in C. This isn't normally a strong selling point of learning a programming language. Just today I chanced upon this great post about what really happens when you instantiate a python class . Python's default implementation, CPython, is written in C, so I was able to follow and understand the code that's run when you Foo(a, b) . A few days earlier, I used this newly found superpower to have a look at some venerable linux utilities like cat , cp , and netcat . It blew my mind to read programs written before I was born. I can't decide whether I should buckle down and read CPython's source, or the legendarily good redis code, or maybe, out of nostalgia's sake, grok Quake 3 Arena . The world just became more open and interesting.

Knowing even a little bit of C strips away many layers of abstraction from the world; it turns the abstract into the mechanical: "get me 12 bytes of memory, write 12 bytes of data (no more!) in there, now hand its address over to a function." It pushes me to ask questions: how does Python allocate memory for lists and tuples ? How does Linux " watch " your files for modifications? How is a JS associative array structured in terms of raw bytes? What kind of system calls happen when you run a script ? I put on my lab coat and safety goggles and go at it with strace and gdb like I would with a microscope or petri dish.

" So you're telling me, that after all that time and effort, you're essentially a software archeologist with better debugging skills? ", you may ask.

Yeah, how awesome is that :)? But if that's not enough - after K&R you're ready to do one more thing. See, since C is so widespread, you're likely to be working with software that can be extended or modified using it. In my little part of the woods, the popular thing to do is to write Python or Postgresql modules in C when the task at hand demands speed. Why not take it further and write your own nginx modules ? Or, using Lua's source as a guide, write a simple interpreter?

You'll get by without K&R. Even live a happy and productive life, too. But reading it will allow you scratch that itch when you're looking at the screen, palms on your temples, and muttering " why does this thing... "

Fixing Python Performance with Rust



Sentry processes over a billion errors every month. We've been able to scale most of our systems, but in the last few months, one component has stood out as a computational chokepoint: Python's source map processing.

Starting last week, the infrastructure team decided to investigate the scaling shortcomings of our source map processing. Our JavaScript client has jumped to become our most popular integration, and one of the reasons is our ability to un-minify JavaScript via Source Maps. Processing does not come operationally free, though. We have to fetch, de-compress, un-minify, and reverse-transpile our way into making a JavaScript stack trace legible.

When we had written the original processing pipeline almost 4 years ago , the source map ecosystem was just starting to come to fruition. As it grew into what is now a complex and mature mapping process, so did our time to process them in Python.

As of yesterday, we have dramatically cut down that processing time (and CPU utilization on our machines) by replacing our source map handling with a Rust module that we interface with from Python.

To explain how we got here, we first need to better explain source maps and their shortcomings in Python.

Source Maps in Python

As our users' applications become more and more complex, so do their source maps. Parsing the JSON itself is fast enough in Python, as the maps mostly contain just a few strings. The problem lies in objectification. Each source map token yields a single Python object, and we had some source maps that expanded to a few million tokens.

The problem with objectifying source map tokens is that we pay an enormous price for a base Python object, just to get a few bytes from a token. Additionally, all these objects engage in reference counting and garbage collection, which contributes even further to the overhead. Handling a 30MB source map makes a single Python process expand to ~ 800MB in memory, executing millions of memory allocations and keeping the garbage collector very busy with tokens’ short-lived nature.

Since this objectification requires object headers and garbage collection mechanisms, we had very little room for actual processing improvement inside of Python.
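To make that overhead concrete, here is a rough back-of-the-envelope check (my own illustration, not from the Sentry post) of what a tiny token-like object costs in CPython compared with the handful of integers it stores:

import sys

class Token(object):
    def __init__(self, dst_line, dst_col, src_id, name_id):
        self.dst_line = dst_line
        self.dst_col = dst_col
        self.src_id = src_id
        self.name_id = name_id

t = Token(1, 2, 3, 4)
# the object header plus its attribute dict dwarf the four small ints it carries
print(sys.getsizeof(t) + sys.getsizeof(t.__dict__))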

Source Maps in Rust

After the investigation had pointed us towards Python's shortcomings, we decided to vet the performance of our Rust source map parser, previously written for our CLI tool. After applying the parser to a particularly problematic source map, it showed that parsing with this library alone could cut down the processing time from >20 seconds to <0.5 sec. This meant that even ignoring any optimizations, just replacing Python with Rust could relieve our chokepoint.

Once we proved that Rust was definitively faster, we cleaned up some Sentry internal APIs so that we could replace our original implementation with a new library. That Python library is named libsourcemap and is a thin wrapper around our own rust-sourcemap .

The Results

After deploying the library, the machines that were dedicated to source map processing instantly sighed in relief.



With all of the CPUs efficiently processing, our worst source map times diminished to a tenth of their original time.



More importantly, the slowest times were not the only maps to receive improvements. The average processing time reduced to ~400ms.



Since JavaScript is our most popular project language, this change reached as far as reducing the end-to-end processing time for all events to ~300ms.


Embedding Rust in Python

There are various methods to expose a Rust library to Python and the other way round. We chose to compile our crate into a dylib and to provide some good ol’ C functions, exposed to Python through CFFI and C headers. With the headers, CFFI generates a tiny shim that can call out into Rust. From there, libsourcemap can open a dynamically shared library that is generated from Rust at runtime.

There are two steps to this process. The first is a build module that configures CFFI when setup.py runs:

import subprocess
from cffi import FFI

ffi = FFI()
ffi.cdef(subprocess.Popen([
    'cc', '-E', 'include/libsourcemap.h'],
    stdout=subprocess.PIPE).communicate()[0])
ffi.set_source('libsourcemap._sourcemapnative', None)

After building the module, the header is run through the C preprocessor so that it expands macros, a process that CFFI cannot do by itself. Additionally, this tells CFFI where to put the generated shim module. All that needs to happen after that is loading the module:

import os
from libsourcemap._sourcemapnative import ffi as _ffi

_lib = _ffi.dlopen(os.path.join(os.path.dirname(__file__), '_libsourcemap.so'))

The next step is to write some wrapper code to provide a Python API to the Rust objects, and since we’re Sentry, we started with the ability to forward exceptions. This happens in a two-part process: First, we made sure that in Rust, we used result objects wherever possible. In addition, we set up landing pads for panics to make sure they never cross a DLL boundary. Second, we defined a helper struct that can store error information; and passed it as an out parameter to functions that can fail.

In Python, a helper context manager was provided:

@contextmanager
def capture_err():
    err = _ffi.new('lsm_error_t *')
    def check(rv):
        if rv:
            return rv
        try:
            cls = special_errors.get(err[0].code, SourceMapError)
            exc = cls(_ffi.string(err[0].message).decode('utf-8', 'replace'))
        finally:
            _lib.lsm_buffer_free(err[0].message)
        raise exc
    yield err, check

We have a dictionary of specific error classes ( special_errors ) but if no specific error can be found, a generic SourceMapError will be raised.

From there, we can actually define the base class for a source map:

class View(object):

    @staticmethod
    def from_json(buffer):
        buffer = to_bytes(buffer)
        with capture_err() as (err_out, check):
            return View._from_ptr(check(_lib.lsm_view_from_json(
                buffer, len(buffer), err_out)))

    @staticmethod
    def _from_ptr(ptr):
        rv = object.__new__(View)
        rv._ptr = ptr
        return rv

    def __del__(self):
        if self._ptr:
            _lib.lsm_view_free(self._ptr)
            self._ptr = None
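With that in place, loading a source map from Python looks roughly like this (only from_json appears in the snippet above; the file name is a placeholder and the token lookup methods are omitted here):

with open('app.min.js.map', 'rb') as f:
    view = View.from_json(f.read())
# view now wraps the Rust-side object and frees it in __del__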

Exposing a C ABI in Rust

We start with a C header containing some exported functions, so how can we export them from Rust? There are two tools: the special #[no_mangle] attribute, and the std::panic module, which provides a landing pad for Rust panics. We built ourselves some helpers to deal with this : a function to notify Python about an exception and two landing pad helpers; a generic one and one that boxes up the return value. With this, it becomes quite nice to write wrapper methods:

#[no_mangle]
pub unsafe fn lsm_view_from_json(bytes: *const u8, len: c_uint,
                                 err_out: *mut CError) -> *mut View {
    boxed_landingpad(|| {
        View::json_from_slice(slice::from_raw_parts(bytes, len as usize))
    }, err_out)
}

#[no_mangle]
pub unsafe fn lsm_view_free(view: *mut View) {
    if !view.is_null() {
        Box::from_raw(view);
    }
}

The way boxed_landingpad works is quite simple. It invokes the closure, catches the panic with panic::catch_unwind , unwraps the result and boxes up the success value in a raw pointer. In case an error happens it fills out err_out and returns a NULL pointer. In lsm_view_free , one just has to reconstruct the box from the raw pointer.

Building the Extension

To actually build the extension, we have to run some less-than-beautiful steps inside of setuptools . Thankfully, it did not take us much time to write since we already had a similar set of steps for our DSYM handling library .

The handy part of this setup is that a source distribution invokes cargo to do the building, while binary wheels ship the final dylib, removing the need for any end user to navigate the Rust toolchain.
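For a rough idea of the CFFI side of such a setup, here is a minimal sketch using CFFI's standard cffi_modules setuptools hook rather than Sentry's exact build steps; the module name build.py is an assumption for the CFFI build module shown earlier:

from setuptools import setup

setup(
    name='libsourcemap',
    packages=['libsourcemap'],
    # run build.py (the CFFI build module from above) at build time;
    # the real project additionally has to invoke cargo to produce the dylib
    setup_requires=['cffi>=1.0.0'],
    cffi_modules=['build.py:ffi'],
    install_requires=['cffi>=1.0.0'],
)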

What went well? What didn’t?

I was asked on Twitter what alternatives to Rust there would have been. Truth be told, Rust is pretty hard to replace for this. The reason is that, unless you want to fully rewrite an entire Python component in a different codebase, you can only write a native extension. In that case, your requirements on the language are pretty harsh: it must not have an invasive runtime, must not have a GC, and must support the C ABI. Right now, the only languages I think fit this bill are C, C++, and Rust.

What worked well:

Marrying Rust and Python with CFFI. There are some alternatives to this which link against libpython, but they make for a significantly more complex wheels build.
Using ancient CentOS versions to build somewhat portable Linux wheels with Docker. While this process is tedious, the difference in stability between different Linux flavors and kernels makes Docker and CentOS an acceptable build solution.
The Rust ecosystem. We’re using serde for deserialization and a base64 module from crates.io , both working really well together. In addition, the mmap support uses another crate that was provided by the community (memmap)[LINKME].

What didn’t work well:

Iteration and compilation times really could be better. We are compiling modules and headers every time we change a character.
The setuptools steps are very brittle. We probably spent more time making setuptools work than on any other developmental roadblock that came up. Luckily, we did this once before, so it was easier this time around.

While Rust is pretty great for what we do, without a doubt there is a lot that needs to improve. In particular, the infrastructure for exporting C ABIs (and making them useful for Python) could use lots of improvements. Compile times are also not great at all. Hopefully incremental compilation will help there.

Next Steps

There is even more room for us to improve on this if we want. Instead of parsing the JSON, we can start caching in a more efficient format, which is a bunch of structs stored in memory. In particular, if paired with a file system cache, we could almost entirely eliminate the cost of loading since we bisect the index, and that can be done quite efficiently with mmap.

Given the good results of this, we will most likely evaluate Rust more in the future to handle some expensive code paths that are common. However, there is no CPU-bound fruit that currently hangs lower than source maps. For most of our other operations, we’re spending more time waiting for IO.

In Summary

This project has been a tremendous success. It took us very little time to implement, it lowered processing times for our users, and it also will help us scale horizontally. Rust has been the perfect tool for this job because it allowed us to offload an expensive operation into a native library without having to use C or C++, which would not be well suited for a task of this complexity. While it was very easy to write a source map parser in Rust, it would have been considerably less fun and more work in C or C++.

We love Python at Sentry, and are proud contributors to numerous Python open-source initiatives. While Python remains our favorite go-to, we believe in using the right tool for the job, no matter what language it may be. Rust proved to be the best tool for this job, and we are excited to see where Rust and Python will take us in the future. If you feel the same way, we’re hiring in multiple positions and would love to hear from you.

NumPy Tutorial: Data analysis with Python


NumPy is a commonly used python data analysis package. By using NumPy, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn , that use NumPy under the hood. NumPy was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages NumPy in some way.

In this tutorial, we’ll walk through using NumPy to analyze data on wine quality. The data contains information on various attributes of wines, such as pH and fixed acidity , along with a quality score between 0 and 10 for each wine. The quality score is the average of at least 3 human taste testers. As we learn how to work with NumPy, we’ll try to figure out more about the perceived quality of wine.



The wines we'll be analyzing are from the Minho region of Portugal.

The data was downloaded from the UCI Machine Learning Repository , and is available here . Here are the first few rows of the winequality-red.csv file, which we’ll be using throughout this tutorial:

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality" 7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5 7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5

The data is in what I’m going to call ssv (semicolon separated values) format: each value is separated by a semicolon ( ; ), and rows are separated by a new line. There are 1600 rows in the file, including a header row, and 12 columns.

Before we get started, a quick version note: we’ll be using Python 3.5 . Our code examples will be done using Jupyter notebook .


Lists Of Lists for CSV Data

Before using NumPy, we’ll first try to work with the data using Python and the csv package. We can read in the file using the csv.reader object, which will allow us to read in and split up all the content from the ssv file.

In the below code, we:

Import the csv library.
Open the winequality-red.csv file.
With the file open, create a new csv.reader object. Pass in the keyword argument delimiter=";" to make sure that the records are split up on the semicolon character instead of the default comma character.
Call the list type to get all the rows from the file. Assign the result to wines .

In[1]:

import csv

with open("winequality-red.csv", 'r') as f:
    wines = list(csv.reader(f, delimiter=";"))

Once we’ve read in the data, we can print out the first 3 rows:

In[3]:

print(wines[:3])

[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'], ['7.4', '0.7', '0', '1.9', '0.076', '11', '34', '0.9978', '3.51', '0.56', '9.4', '5'], ['7.8', '0.88', '0', '2.6', '0.098', '25', '67', '0.9968', '3.2', '0.68', '9.8', '5']]

The data has been read into a list of lists. Each inner list is a row from the ssv file. As you may have noticed, each item in the entire list of lists is represented as a string, which will make it harder to do computations.

We’ll format the data into a table to make it easier to view:

| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| 7.4 | 0.70 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |

As you can see from the table above, we’ve read in three rows, the first of which contains column headers. Each row after the header row represents a wine. The first element of each row is the fixed acidity , the second is the volatile acidity , and so on. We can find the average quality of the wines. The below code will:

Extract the last element from each row after the header row.
Convert each extracted element to a float.
Assign all the extracted elements to the list qualities .
Divide the sum of all the elements in qualities by the total number of elements in qualities to get the mean.

In[4]:

qualities = [float(item[-1]) for item in wines[1:]]
sum(qualities) / len(qualities)

Out[4]:

5.6360225140712945

Although we were able to do the calculation we wanted, the code is fairly complex, and it won’t be fun to have to do something similar every time we want to compute a quantity. Luckily, we can use NumPy to make it easier to work with our data.

Numpy 2-Dimensional Arrays

With NumPy, we work with multidimensional arrays. We’ll dive into all of the possible types of multidimensional arrays later on, but for now, we’ll focus on 2-dimensional arrays. A 2-dimensional array is also known as a matrix.
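As a taste of where this is heading, here is a minimal sketch (not the tutorial's exact code) of loading the same ssv file directly into a 2-dimensional NumPy array and recomputing the mean quality:

import numpy as np

# skip the header row and parse every field as a float
wines = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)
print(wines.shape)          # (1599, 12): one row per wine, one column per attribute
print(wines[:, -1].mean())  # mean of the last column (quality), same value as above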

Python Modules: Organizing Your Python Code More Logically


Modules let you organize your Python code logically.

Grouping related code into a module makes your code easier to use and easier to understand.

A module is also a Python object with arbitrarily named attributes that you can bind and reference.

Simply put, a module is a file containing Python code. A module can define functions, classes, and variables, and it can also contain executable code.

Example

The Python code for a module named aname normally lives in a file called aname.py. The example below is a simple module, support.py.

def print_func( par ):
    print "Hello : ", par
    return

The import Statement

To use a Python source file, simply execute an import statement in another source file. The syntax is:

import module1[, module2[,... moduleN]]

When the interpreter encounters an import statement, it imports the module if it can be found on the current search path.

The search path is a list of all the directories that the interpreter searches. For example, to use the module support.py defined above, put the following command at the top of the script:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Import the module
import support
# Now you can call the functions defined in the module
support.print_func("Zara")

The above example produces the following output:

Hello : Zara

A module is only imported once, no matter how many times you import it. This prevents the module's code from being executed over and over.

The from…import Statement

Python's from statement lets you import specific attributes from a module into the current namespace. The syntax is:

from modname import name1[, name2[, ... nameN]]

For example, to import the fibonacci function from the module fib, use the following statement:

from fib import fibonacci

This statement does not import the entire fib module into the current namespace; it only introduces the single name fibonacci from fib into the global symbol table of the importing module.

The from…import * Statement

It is also possible to import all names from a module into the current namespace, using the following statement:

from modname import *

This provides an easy way to import every item from a module. However, this form should be used sparingly.

Locating Modules

When you import a module, the Python interpreter searches for it in the following order:

The current directory.
If the module is not found in the current directory, Python searches each directory in the shell variable PYTHONPATH.
If all else fails, Python checks the default path. On UNIX, this default path is normally /usr/local/lib/python/.

The module search path is stored in the sys module's sys.path variable. This variable contains the current directory, PYTHONPATH, and the installation-dependent default directories.
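For example, a small illustrative snippet (not one of the original examples) that prints the search path:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Print the list of directories Python searches when importing modules
import sys
print sys.path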

The PYTHONPATH Variable

PYTHONPATH is an environment variable consisting of a list of directories. Its syntax is the same as that of the shell variable PATH.

On Windows, a typical PYTHONPATH looks like this:

set PYTHONPATH=c:\python20\lib;

On UNIX, a typical PYTHONPATH looks like this:

set PYTHONPATH=/usr/local/lib/python

Namespaces and Scoping

Variables are names (identifiers) that map to objects. A namespace is a dictionary of variable names (keys) and their corresponding objects (values).

A Python expression can access variables in both the local namespace and the global namespace. If a local variable and a global variable have the same name, the local variable shadows the global one.

Each function has its own namespace. Class methods follow the same scoping rules as ordinary functions.

Python makes educated guesses about whether a variable is local or global: it assumes that any variable assigned within a function is local.

Therefore, to assign a value to a global variable inside a function, you must use the global statement.

The statement global VarName tells Python that VarName is a global variable, so Python will not look for it in the local namespace.

For example, we define a variable Money in the global namespace and then assign to Money inside a function. Python assumes Money is a local variable, but since we never assigned to that local variable before accessing it, an UnboundLocalError is raised. Uncommenting the global statement fixes the problem.

#!/usr/bin/python
# -*- coding: UTF-8 -*-
Money = 2000
def AddMoney():
    # Uncomment the following line to fix the code:
    # global Money
    Money = Money + 1
    print Money
AddMoney()
print Money

The dir() Function

The dir() function returns a sorted list of strings containing the names defined in a module.

The returned list contains all the modules, variables, and functions defined in a module. Here is a simple example:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Import the built-in math module
import math
content = dir(math)
print content;

The above example produces the following output:

['__doc__', '__file__', '__name__', 'acos', 'asin', 'atan',
'atan2', 'ceil', 'cos', 'cosh', 'degrees', 'e', 'exp',
'fabs', 'floor', 'fmod', 'frexp', 'hypot', 'ldexp', 'log',
'log10', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh',
'sqrt', 'tan', 'tanh']

Here, the special string variable __name__ holds the module's name, and __file__ holds the filename from which the module was loaded.

The globals() and locals() Functions

Depending on where they are called, the globals() and locals() functions can be used to return the names in the global and local namespaces.

If locals() is called inside a function, it returns all the names that can be accessed locally from that function.

If globals() is called inside a function, it returns all the names that can be accessed globally from that function.

Both functions return dictionaries, so the names can be extracted with the keys() method.
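A small illustrative snippet showing the difference (the function ShowNames and the variable rate here are just examples, not part of the original tutorial):

#!/usr/bin/python
# -*- coding: UTF-8 -*-
Money = 2000
def ShowNames():
    rate = 1.5
    # locals() only contains names defined inside this function, e.g. ['rate']
    print locals().keys()
    # globals() contains module-level names such as Money and ShowNames
    print 'Money' in globals().keys()
ShowNames()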

The reload() Function

When a module is imported into a script, the code in the module's top level is executed only once.

Therefore, if you want to re-execute the top-level code in a module, you can use the reload() function, which re-imports a previously imported module. The syntax is:

reload(module_name)

Here, module_name is the module object itself, not a string containing its name. For example, to reload the hello module:

reload(hello)

Packages in Python

A package is a hierarchical file directory structure that defines a Python application environment consisting of modules, subpackages, sub-subpackages, and so on.

Consider a file pots.py in a directory named Phone. The file contains the following source code:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
def Pots():
print "I'm Pots Phone"

Similarly, we have two other files containing different functions:

Phone/Isdn.py, containing the function Isdn()
Phone/G3.py, containing the function G3()

Now, create the file __init__.py in the Phone directory:

Phone/__init__.py

To make all of these functions available when you import Phone, you need to put explicit import statements in __init__.py, as follows:

from Pots import Pots
from Isdn import Isdn
from G3 import G3

After you add this code to __init__.py, all of these functions are available when you import the Phone package.

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Import the Phone package
import Phone
Phone.Pots()
Phone.Isdn()
Phone.G3()

The above example produces the following output:

I'm Pots Phone
I'm 3G Phone
I'm ISDN Phone

As above, for the sake of the example we put just one function in each file, but you can put many functions in a file. You can also define Python classes in these files and then build a package out of those classes.


Epilepsy Detection Using EEG Data


In this example we’ll use the cesium library to compare various techniques for epilepsy detection using a classic EEG time series dataset from Andrzejak et al. . The raw data are separated into five classes: Z, O, N, F, and S; we will consider a three-class classification problem of distinguishing normal (Z, O), interictal (N, F), and ictal (S) signals.

The overall workflow consists of three steps: first, we “featurize” the time series by selecting some set of mathematical functions to apply to each; next, we build some classification models which use these features to distinguish between classes; finally, we validate our models by generating predictions for some unseen holdout set and comparing them to the true class labels.

First, we’ll load the data and inspect a representative time series from each class:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set()

from cesium import datasets

eeg = datasets.fetch_andrzejak()

# Group together classes (Z, O), (N, F), (S) as normal, interictal, ictal
eeg["classes"] = eeg["classes"].astype('U16')  # allocate memory for longer class names
eeg["classes"][np.logical_or(eeg["classes"]=="Z", eeg["classes"]=="O")] = "Normal"
eeg["classes"][np.logical_or(eeg["classes"]=="N", eeg["classes"]=="F")] = "Interictal"
eeg["classes"][eeg["classes"]=="S"] = "Ictal"

fig, ax = plt.subplots(1, len(np.unique(eeg["classes"])), sharey=True)
for label, subplot in zip(np.unique(eeg["classes"]), ax):
    i = np.where(eeg["classes"] == label)[0][0]
    subplot.plot(eeg["times"][i], eeg["measurements"][i])
    subplot.set(xlabel="time (s)", ylabel="signal", title=label)
Featurization

Once the data is loaded, we can generate features for each time series using the cesium.featurize module. The featurize module includes many built-in choices of features which can be applied for any type of time series data; here we’ve chosen a few generic features that do not have any special biological significance.

If Celery is running, the time series will automatically be split among the available workers and featurized in parallel; setting use_celery=False will cause the time series to be featurized serially.

from cesium import featurize

features_to_use = ['amplitude', 'percent_beyond_1_std', 'maximum', 'max_slope',
                   'median', 'median_absolute_deviation', 'percent_close_to_median',
                   'minimum', 'skew', 'std', 'weighted_average']
fset_cesium = featurize.featurize_time_series(times=eeg["times"],
                                              values=eeg["measurements"],
                                              errors=None,
                                              features_to_use=features_to_use,
                                              targets=eeg["classes"],
                                              use_celery=True)
print(fset_cesium)

<xarray.Dataset>
Dimensions:                    (channel: 1, name: 500)
Coordinates:
  * channel                    (channel) int64 0
  * name                       (name) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ...
    target                     (name) object 'Normal' 'Normal' 'Normal' ...
Data variables:
    minimum                    (name, channel) float64 -146.0 -254.0 -146.0 ...
    amplitude                  (name, channel) float64 143.5 211.5 165.0 ...
    median_absolute_deviation  (name, channel) float64 28.0 32.0 31.0 31.0 ...
    percent_beyond_1_std       (name, channel) float64 0.1626 0.1455 0.1523 ...
    maximum                    (name, channel) float64 141.0 169.0 184.0 ...
    median                     (name, channel) float64 -4.0 -51.0 13.0 -4.0 ...
    percent_close_to_median    (name, channel) float64 0.505 0.6405 0.516 ...
    max_slope                  (name, channel) float64 1.111e+04 2.065e+04 ...
    skew                       (name, channel) float64 0.0328 -0.09271 ...
    weighted_average           (name, channel) float64 -4.132 -52.44 12.71 ...
    std                        (name, channel) float64 40.41 48.81 47.14 ...

The output of featurize_time_series is an xarray.Dataset which contains all the feature information needed to train a machine learning model: feature values are stored as data variables, and the time series index/class label are stored as coordinates (a channel coordinate will also be used later for multi-channel data).

Custom feature functions

Custom feature functions not built into cesium may be passed in using the custom_functions keyword, either as a dictionary {feature_name: function} , or as a dask graph . Functions should take three arrays times, measurements, errors as inputs; details can be found in the cesium.featurize documentation . Here we’ll compute five standard features for EEG analysis provided by Guo et al. (2012) :

import numpy as np
import scipy.stats

def mean_signal(t, m, e):
    return np.mean(m)

def std_signal(t, m, e):
    return np.std(m)

def mean_square_signal(t, m, e):
    return np.mean(m ** 2)

def abs_diffs_signal(t, m, e):
    return np.sum(np.abs(np.diff(m)))

def skew_signal(t, m, e):
    return scipy.stats.skew(m)

Now we’ll pass the desired feature functions as a dictionary via the custom_functions keyword argument.

guo_features = {
    'mean': mean_signal,
    'std': std_signal,
    'mean2': mean_square_signal,
    'abs_diffs': abs_diffs_signal,
    'skew': skew_signal
}

fset_guo = featurize.featurize_time_series(times=eeg["times"],
                                           values=eeg["measurements"],
                                           errors=None,
                                           targets=eeg["classes"],
                                           features_to_use=list(guo_features.keys()),
                                           custom_functions=guo_features,
                                           use_celery=True)
print(fset_guo)

<xarray.Dataset>
Dimensions:    (channel: 1, name: 500)
Coordinates:
  * channel    (channel) int64 0
  * name       (name) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
    target     (name) object 'Normal' 'Normal' 'Normal' 'Normal' 'Normal' ...
Data variables:
    abs_diffs  (name, channel) float64 4.695e+04 6.112e+04 5.127e+04 ...
    mean       (name, channel) float64 -4.132 -52.44 12.71 -3.992 -18.0 ...
    mean2      (name, channel) float64 1.65e+03 5.133e+03 2.384e+03 ...
    skew       (name, channel) float64 0.0328 -0.09271 -0.0041 0.06368 ...
    std        (name, channel) float64 40.41 48.81 47.14 47.07 44.91 45.02 ...

Multi-channel time series

The EEG time series considered here consist of univariate signal measurements along a uniform time grid. But featurize_time_series also accepts multi-channel data; to demonstrate this, we will decompose each signal into five frequency bands using a discrete wavelet transform as suggested by Subasi (2005) , and then featurize each band separately using the five functions from above.

import pywt

n_channels = 5
eeg["dwts"] = [pywt.wavedec(m, pywt.Wavelet('db1'), level=n_channels-1)
               for m in eeg["measurements"]]
fset_dwt = featurize.featurize_time_series(times=None,
                                           values=eeg["dwts"],
                                           errors=None,
                                           features_to_use=list(guo_features.keys()),
                                           targets=eeg["classes"],
                                           custom_functions=guo_features)
print(fset_dwt)

<xarray.Dataset>
Dimensions:    (channel: 5, name: 500)
Coordinates:
  * channel    (channel) int64 0 1 2 3 4
  * name       (name) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
    target     (name) object 'Normal' 'Normal' 'Normal' 'Normal' 'Normal' ...
Data variables:
    abs_diffs  (name, channel) float64 2.513e+04 1.806e+04 3.241e+04 ...
    skew       (name, channel) float64 -0.0433 0.06578 0.2999 0.1239 0.1179 ...
    mean2      (name, channel) float64 1.294e+04 5.362e+03 2.321e+03 664.4 ...
    mean       (name, channel) float64 -17.08 -6.067 -0.9793 0.1546 0.03555 ...
    std        (name, channel) float64 112.5 72.97 48.17 25.77 10.15 119.8 ...

The output featureset has the same form as before, except now the channel coordinate is used to index the features by the corresponding frequency band. The functions in cesium.build_model and cesium.predict all accept featuresets from single- or multi-channel data, so no additional steps are required to train models or make predictions for multichannel featuresets using the cesium library.

Model Building

Model building in cesium is handled by the build_model_from_featureset function in the cesium.build_model submodule. The featureset output by featurize_time_series contains both the feature and target information needed to train a model; build_model_from_featureset is simply a wrapper that calls the fit method of a given scikit-learn model with the appropriate inputs. In the case of multichannel features, it also handles reshaping the featureset into a (rectangular) form that is compatible with scikit-learn .

For this example, we’ll test a random forest classifier for the built-in cesium features, and a 3-nearest neighbors classifier for the others, as suggested by Guo et al. (2012) .

from cesium.build_model import build_model_from_featureset
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split

train, test = train_test_split(np.arange(len(eeg["classes"])), random_state=0)

rfc_param_grid = {'n_estimators': [8, 16, 32, 64, 128, 256, 512, 1024]}
model_cesium = build_model_from_featureset(fset_cesium.isel(name=train),
                                           RandomForestClassifier(max_features='auto',
                                                                  random_state=0),
                                           params_to_optimize=rfc_param_grid)

knn_param_grid = {'n_neighbors': [1, 2, 3, 4]}
model_guo = build_model_from_featureset(fset_guo.isel(name=train),
                                        KNeighborsClassifier(),
                                        params_to_optimize=knn_param_grid)
model_dwt = build_model_from_featureset(fset_dwt.isel(name=train),
                                        KNeighborsClassifier(),
                                        params_to_optimize=knn_param_grid)

Prediction

Making predictions for new time series based on these models follows the same pattern: first the time series are featurized using featurize_time_series , and then predictions are made based on these features using predict.model_predictions :

from sklearn.metrics import accuracy_score
from cesium.predict import model_predictions

preds_cesium = model_predictions(fset_cesium, model_cesium, return_probs=False)
preds_guo = model_predictions(fset_guo, model_guo, return_probs=False)
preds_dwt = model_predictions(fset_dwt, model_dwt, return_probs=False)

print("Built-in cesium features: training accuracy={:.2%}, test accuracy={:.2%}".format(
    accuracy_score(preds_cesium[train], eeg["classes"][train]),
    accuracy_score(preds_cesium[test], eeg["classes"][test])))
print("Guo et al. features: training accuracy={:.2%}, test accuracy={:.2%}".format(
    accuracy_score(preds_guo[train], eeg["classes"][train]),
    accuracy_score(preds_guo[test], eeg["classes"][test])))
print("Wavelet transform features: training accuracy={:.2%}, test accuracy={:.2%}".format(
    accuracy_score(preds_dwt[train], eeg["classes"][train]),
    accuracy_score(preds_dwt[test], eeg["classes"][test])))

Built-in cesium features: training accuracy=100.00%, test accuracy=83.20%
Guo et al. features: training accuracy=90.93%, test accuracy=84.80%
Wavelet transform features: training accuracy=100.00%, test accuracy=95.20%

The workflow presented here is intentionally simplistic and omits many important steps such as feature selection, model parameter selection, etc., which may all be incorporated just as they would for any other scikit-learn analysis. But with essentially three function calls ( featurize_time_series , build_model_from_featureset , and model_predictions ), we are able to build a model from a set of time series and make predictions on new, unlabeled data. In upcoming posts we’ll introduce the web frontend for cesium and describe how the same analysis can be performed in a browser with no setup or coding required.

Written by Cesium Developers in misc on Fri 08 July 2016. Tags: example

Summary: Common Python Web Scraping Techniques



Author: j_hao104

I have been using Python for a little over a year now. The scenarios where Python sees the most use are rapid web development, web scraping, and operations automation: I have written simple websites, auto-posting scripts, scripts for sending and receiving email, and simple CAPTCHA recognition scripts.

A lot of crawler code gets reused across projects, so I am summarizing the common patterns here to save effort later on.


1. Basic Page Fetching

GET method


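A minimal GET sketch with urllib2 (Python 2, as used throughout this article; the URL is just a placeholder):

import urllib2

url = "http://www.example.com"       # placeholder URL
response = urllib2.urlopen(url)      # a plain GET request
print response.read()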

POST method


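And a minimal POST sketch, again with a placeholder URL and example form fields:

import urllib
import urllib2

url = "http://www.example.com/login"            # placeholder URL
form = {"name": "abc", "password": "1234"}      # example form fields
data = urllib.urlencode(form)
response = urllib2.urlopen(url, data)           # passing data makes this a POST request
print response.read()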
2. Using Proxy IPs

While developing crawlers you will often find your IP getting banned, which is when proxy IPs come in handy.

The urllib2 package provides the ProxyHandler class, which lets you access pages through a proxy, as in the following snippet:


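A minimal sketch of such a snippet (the proxy address and URL are placeholders):

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})   # placeholder proxy address
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)                              # make it the default opener
response = urllib2.urlopen("http://www.example.com")        # placeholder URL
print response.read()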
3. Handling Cookies

Cookies are data (usually encrypted) that some websites store on the user's machine in order to identify the user and track sessions. Python provides the cookielib module for handling cookies; its main job is to provide objects that can store cookies, so that they can be used together with the urllib2 module to access Internet resources.

Code snippet:


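A minimal sketch of using cookielib together with urllib2 (the URL is a placeholder):

import cookielib
import urllib2

cookie_jar = cookielib.CookieJar()   # keeps cookies in memory
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
response = opener.open("http://www.example.com")   # placeholder URL
for cookie in cookie_jar:
    print cookie.name, cookie.value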

The key is CookieJar(), an object that manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The cookies are kept entirely in memory, so they are lost once the CookieJar instance is garbage collected; none of this requires any manual handling.

4. Masquerading as a Browser

Some sites dislike being visited by crawlers and reject all such requests, so accessing them directly with urllib2 often results in HTTP Error 403: Forbidden.

Pay particular attention to certain headers, which the server will check:

1. User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser.

2. Content-Type: when calling a REST interface, the server checks this value to determine how the content of the HTTP body should be parsed.

You can work around this by modifying the headers of the HTTP request, as in the following snippet:


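A minimal sketch of sending a browser-like User-Agent with urllib2 (the UA string and URL are placeholders):

import urllib2

headers = {
    # pretend to be an ordinary desktop browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'
}
request = urllib2.Request(url="http://www.example.com", headers=headers)
print urllib2.urlopen(request).read()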
5. Page Parsing

For page parsing the most powerful tool is of course regular expressions, which differ for every site and every user, so there is no need to go into detail. Here are two good references:

Introduction to regular expressions >>>

Online regular expression tester >>>

Next come the parsing libraries. Two are commonly used, lxml and BeautifulSoup; here are two good sites introducing their usage:

lxml>>>

BeautifulSoup>>>

My take on these two libraries: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python and is slower, but its features are practical, for example retrieving the source of an HTML node through a search; lxml is written in C, is fast, and supports XPath.

6. Handling CAPTCHAs

Simple CAPTCHAs can be recognized with simple techniques, and I have only ever done simple CAPTCHA recognition myself. For the truly inhumane ones, such as 12306, you can use a CAPTCHA-solving service where humans do the solving, which of course costs money.

7. gzip Compression

Ever run into pages that stay garbled no matter which encoding you try? That means you do not yet know that many web services can send compressed data, which can reduce the amount of data transmitted over the wire by 60% or more. This is especially true of XML web services, since XML data compresses very well.

However, a server generally will not send you compressed data unless you tell it that you can handle compressed data.

So you need to modify the code like this:


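A minimal sketch of requesting compressed data (the URL is a placeholder):

import urllib2

request = urllib2.Request("http://www.example.com")   # placeholder URL
request.add_header('Accept-encoding', 'gzip')         # tell the server we can handle gzip
response = urllib2.urlopen(request)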

This is the key: create a Request object and add an Accept-encoding header telling the server that you can accept gzip-compressed data.

Then decompress the data:


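A minimal sketch of the decompression step, assuming response is the object from the previous snippet:

import gzip
import StringIO

if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO.StringIO(response.read())
    data = gzip.GzipFile(fileobj=buf).read()   # decompress the gzipped body
else:
    data = response.read()
print data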
8. Multithreaded Concurrent Fetching

If a single thread is too slow, you need multiple threads. Here is a simple thread pool template; the program just prints 1-10, but you can see that it runs concurrently.

Although Python's multithreading is famously underwhelming, for network-heavy work like crawling it can still improve efficiency to some degree.


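A minimal thread pool sketch in the same spirit (it only prints 1-10, as described above; the worker count is arbitrary):

from Queue import Queue
from threading import Thread

q = Queue()

def worker():
    while True:
        item = q.get()
        print item          # a real crawler would fetch a URL here
        q.task_done()

for _ in range(4):          # start a few daemon worker threads
    t = Thread(target=worker)
    t.setDaemon(True)
    t.start()

for i in range(1, 11):      # enqueue the numbers 1 to 10
    q.put(i)

q.join()                    # wait until every item has been processed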

End.

2016 Latest Baidu Cloud Drive Search Engine Source Code, with Python Crawler + PHP Website + Xunsearch Search Engine


Source code overview:

Applicable to: Baidu cloud drive search engine source code, Baidu search engine source code, cloud drive search crawler source code

Demo URL: (refer to the screenshots)

Runtime environment: PHP + MySQL

Other notes: this is the source code of a search engine and a Baidu cloud drive crawler, a Python Baidu cloud drive search engine with crawler + website. The search uses Xunsearch for efficient querying. The source code is simple, comes with an installation tutorial, and can be extended. It is shared here for free; it automatically updates Baidu cloud drive content and scrapes fully automatically!



# 爱百应 - Baidu Cloud Search Engine: Installation and Deployment Tutorial

## Runtime Environment

Before you start, you need to install:

* PHP 5.3.7 +

* MySQL

* Python 2.7 ~

* [xunsearch](http://xunsearch.com/) search engine

## Getting the Source Code

```

git clone git@github.com:k1995/BaiduyunSpider.git

```

Or download it manually:

```

https://github.com/k1995/BaiduyunSpider/archive/master.zip

```

After downloading, the ___project directory structure___ looks roughly like this:

```

--- indexer/ # indexer

--- spider/ # crawler

--- sql/

--- web/ # website

--- application/

--- config/ # configuration

--- config.php

--- database.php # database configuration

...

...

--- static/ # static assets: css|js|font

--- system/

--- index.php

...

```

## Deployment

### Create the Database

Create a database named `pan` with the encoding set to `utf-8`, then import the `sql` files to create the tables.

### Deploy the Website

Both `nginx` and `apache` servers are supported.

__apache__ requires *mod_rewrite* to be enabled.

__nginx__ is configured as follows:

```

location /

{

index index.php;

try_files $uri $uri/ /index.php/$uri;

}

location ~ [^/]\.php(/|$)

{

fastcgi_pass 127.0.0.1:9000;

fastcgi_index index.php;

include fastcgi.conf;

include pathinfo.conf;

}

```

#### Edit the Configuration Files

In `config.php`, change the site title, description, and other information

In `database.php`, change the database username, password, and other settings

> The website is built on the CodeIgniter framework. If you run into problems with installation, deployment, or further development, please refer to the [official documentation](http://codeigniter.org.cn/user_guide/general/welcome.html)

### Start the Crawler

Enter the `spider/` directory and edit the database settings in `spider.py`.

__If this is your first deployment, you need to run the command below to seed the crawl__

```

python spider.py --seed-user

```

What this actually does is scrape information about popular Baidu Cloud sharing users, which then serve as the starting point for crawling data.

Then run:

```

python spider.py

```

At this point the crawler is up and running.

### Install xunsearch

Currently __xunsearch__ is used as the search engine; it will be replaced with `elasticsearch` later on.

For the installation process, refer to the link below (you do not need to install the PHP SDK; I have already bundled it into web):

http://xunsearch.com/doc/php/guide/start.installation

### Index the Data

We have now set up the crawler's data collection and the website, but search does not work yet. The last step is building the index.

Enter the `indexer/` directory and, in `indexer.php`, replace $prefix with the root path of your web directory:

```

require '$prefix/application/helpers/xs/lib/XS.php';

```

Also update the database username and password.

Then run:

```

python ./index.php

```

At this point the whole program is installed. If you have questions, please post in the [GitHub Chinese community](http://www.githubs.cn/topic/118).

Python for loops


It’s always interesting to explain a new programming language to students. Python does present some challenges to that learning process. I think for loops can be a bit of a challenge until you understand them. Many students are most familiar with the traditional for loop, like Java's:

for (i = 0; i < 5; i++) { ... }

Python supports three types of for loops: a range for loop, a for-each loop, and a for loop with enumeration. Below are examples of each of these loops.

A range for loop goes from a low numerical value to a high numerical value, like:

for i in range(0,3):
    print i

It prints the following range values:

0
1
2

A for-each loop goes from the first to the last item while ignoring indexes, like:

list = ['a','b','c']
for i in list:
    print i

It prints the following elements of the list:

a
b
c

A for loop with enumeration visits the indexes and the elements at the same time, like:

list = ['a','b','c']
for i, e in enumerate(list):
    print "[" + str(i) + "][" + list[i] + "]"

The i represents the index values and the e represents the elements of a list. The str() function casts the numeric value to a string.

It prints the following:

[0][a]
[1][b]
[2][c]

This should help my students and I hope it helps you if you’re trying to sort out how to use for loops in Python.

Python Built-in Functions (8): bytes


English documentation:

class bytes ( [ source [, encoding [, errors ] ] ] )

Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256 . bytes is an immutable version of bytearray; it has the same non-mutating methods and the same indexing and slicing behavior.

Accordingly, constructor arguments are interpreted as for bytearray().

Notes:

1. The return value is a new immutable byte sequence; every element must be in the range 0 - 255. It behaves the same as the bytearray function, the only difference being that the returned byte sequence cannot be modified.

2. When none of the three arguments are passed, a byte sequence of length 0 is returned.

>>> b = bytes()
>>> b
b''
>>> len(b)
0

3. When the source argument is a string, the encoding argument must also be provided; the function converts the string into a byte sequence using the str.encode method.

>>> bytes('中文') # an encoding must be supplied
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    bytes('中文')
TypeError: string argument without an encoding
>>> bytes('中文','utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'

4. When the source argument is an integer, an empty byte sequence of that length is returned.

>>> bytes(2)
b'\x00\x00'
>>> bytes(-2) # the integer must be >= 0, since it is used as the sequence length
Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    bytes(-2)
ValueError: negative count

5. When the source argument is an object that implements the buffer interface, the bytes are read in read-only mode into the byte sequence, which is then returned.

6. When the source argument is an iterable, every element of the iterable must satisfy 0 <= x < 256 so that it can be used to initialize the sequence.

>>> bytes([1,2,3])
b'\x01\x02\x03'
>>> bytes([256,2,3])
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    bytes([256,2,3])
ValueError: bytes must be in range(0, 256)

7. The returned sequence cannot be modified.

>>> b = bytes(10)
>>> b
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> b[0]
0
>>> b[1] = 1 # cannot be modified
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    b[1] = 1
TypeError: 'bytes' object does not support item assignment
>>> b = bytearray(10)
>>> b
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
>>> b[1] = 1 # can be modified
>>> b
bytearray(b'\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00')