
How Many of These Python Tips and Tricks Do You Know?


Learn these and your Python code will improve, along with your skills.

1. Path Operations

Compared with the path functions in the os module, the Path class from Python 3's standard-library pathlib module makes handling paths much easier.

Getting the current path

First, import the os and pathlib packages.

os version:

print(os.path.dirname(__file__))
print(os.getcwd())

pathlib version:

print(pathlib.Path.cwd())

So far there seems to be little difference; now look at the following.

Getting the directory two levels up

os version:

print(os.path.dirname(os.path.dirname(os.getcwd())))

pathlib version:

print(pathlib.Path.cwd().parent.parent)

Joining paths

os version:

print(os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), "yamls", "a.yaml"))

pathlib version:

parts=["yamls","a.yaml"] print(pathlib.Path.cwd().parent.parent.joinpath(*parts)) 复制代码 运行时拼接路径

os version:

# site_name is defined elsewhere at run time
os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))),
             'yamls', f'{site_name}.yaml')

pathlib version:

parts=["yamls","a.yaml"] print(pathlib.Path(__file__).resolve().parent.parent.joinpath(*parts)) 复制代码

Also note that pathlib produces a Path object. It can be passed straight to open() and similar file operations, but treating it as a string raises errors; in that case convert it with os.fspath(). In practice, though, you rarely need to operate on the path as a raw string.

All in all, pathlib is the more convenient way to join paths.
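For instance, a minimal sketch of converting a Path object to a plain string when you do need string operations (the file name here is just an illustration):

import os
import pathlib

p = pathlib.Path.cwd() / "yamls" / "a.yaml"
path_str = os.fspath(p)  # convert the Path object to a plain string
print(path_str.endswith(".yaml"))  # ordinary string methods now work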

2. Saving a Well-Formatted YAML File

Programming inevitably involves writing configuration files, and how you write them is a skill in itself.

YAML is a language designed specifically for configuration files; it is concise and powerful, and far more convenient than the JSON format.

In Python, YAML support comes from the PyYAML package.

First, install the third-party libraries:

pip install pyyaml
pip install ruamel.yaml

There is plenty of material online about reading YAML, so I will skip it; here the focus is on writing.

from ruamel import yaml

# conf_file is the path of the configuration file to write
data = {"age": 23, "sex": "男", "name": "牛皮"}
with open(conf_file, "w", encoding='utf-8') as fs:
    yaml.dump(data, fs, Dumper=yaml.RoundTripDumper, allow_unicode=True)

Like json, yaml uses dump to write to a file.

3. Iterating Over Two Lists at Once

I used to solve it like this:

a = ["a", "b", "c", "d"] b = [1, 2, 3] # 空的补充None for index, a_item in enumerate(a): b_item = None if len(b) - 1 <= index: pass else: b_item = b[index] print({a_item:b_item}) 复制代码

Now I solve it with zip_longest from the itertools standard library, an upgraded zip that can pad missing values via the fillvalue parameter. Of course, if both iterables have the same number of elements, plain zip is enough.

from itertools import zip_longest

a = ["a", "b", "c", "d", "e"]
b = [1, 2, 3]
# Fill in the missing values (here with 0)
for a_item, b_item in zip_longest(a, b, fillvalue=0):
    print({a_item: b_item})

4. The Ternary Expression Can Be Used Like This Too?

Normally we write it like this:

a="hello" if 2>1 else "bye" print(a) 复制代码

We know that in Python, False is really 0 and True is 1, so the expression above can also be written like this:

a=["hello","bye"][2<1] print(a) 复制代码

Because 2 < 1 is False, i.e. 0, the first element, hello, is printed.

5. Replacing Simple Classes with namedtuple

Let's start with a simple example:

import collections

# Person = collections.namedtuple('Person', 'name age')
# Using a Python keyword as a field name raises an error; the rename flag fixes this.
# Renamed fields are named after their index in the tuple: 'class' becomes _2, 'def' becomes _3.
Person = collections.namedtuple('Person', ['name', 'age', 'class', 'def', 'name', 'name'], rename=True)
p = Person(name='lisa', age='12', _2="class2", _3="def", _4="name2", _5="name3")
print(p)
# A field name that appears a second time is likewise replaced by its index, as above.
# _fields shows the field names; keywords and duplicates appear as _ plus their index.
print(p._fields)
# _asdict converts the namedtuple to an OrderedDict...
od = p._asdict()
print(od)
# ...which can then be converted to a plain dict.
print(dict(od))
# Because a namedtuple is immutable, _replace() builds and returns a new instance.
new_p = p._replace(name="samJ")
print(new_p)
print(new_p is p)  # Shows they are not the same object.

Here is a practical example using pyppeteer, to give you a feel for it:

import asyncio
import pyppeteer
from collections import namedtuple

Response = namedtuple("rs", "title url html cookies headers history status")

async def get_html(url, timeout=30):  # 30 s by default
    browser = await pyppeteer.launch(headless=True, args=['--no-sandbox'])
    page = await browser.newPage()
    res = await page.goto(url, options={'timeout': int(timeout * 1000)})
    data = await page.content()
    title = await page.title()
    resp_cookies = await page.cookies()
    resp_headers = res.headers
    resp_history = None
    resp_status = res.status
    response = Response(title=title, url=url, html=data, cookies=resp_cookies,
                        headers=resp_headers, history=resp_history, status=resp_status)
    return response

if __name__ == '__main__':
    url_list = ["http://www.10086.cn/index/tj/index_220_220.html",
                "http://www.10010.com/net5/011/",
                "http://python.jobbole.com/87541/"]
    task = (get_html(url) for url in url_list)
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(asyncio.gather(*task))
    for res in results:
        print(res.title)

6. Using Enums to Make Numbers Easier to Understand

import enum

@enum.unique
class Sex(enum.Enum):
    man = 12
    woman = 13
    # Because of the @enum.unique decorator, adding a member with a duplicate
    # value would raise an error:
    # boy = 12

print(Sex.man.name)
print(Sex.woman.value)

# Iterating over an enum
for item in Sex:
    print(item.name)
    print(item.value)

print("-" * 40)

# Another way to create an enum
words = enum.Enum(
    value='item',
    names=('a b c d e f'),
)
# Any member listed in names can be accessed
print(words.c)
print(words.f)
# names does not contain w, so this raises an error
try:
    print(words.w)
except AttributeError as e:
    print(e.args)

print("-" * 40)

for word in words:
    print(word.name, word.value)  # Values default to auto-increment from 1

print("-" * 40)

# To assign custom values, pass names as a list of tuples instead
words2 = enum.Enum(
    value='item2',
    names=[('a', 23), ('b', 56), ("c", 12), ("d", 333)]
)
for word2 in words2:
    print(word2.name, word2.value)

7. Merging Dictionaries with ChainMap

from collections import ChainMap

d1 = {'a': 1, 'b': 2}
d2 = {'a2': 3, 'b2': 4}
d3 = {'a3': 5, 'b3': 6}
d4 = {'a4': 7, 'b4': 8}
c = ChainMap(d1, d2, d3, d4)  # Combine several dicts into one view
for k, v in c.items():
    print(k, v)
print(c.maps)  # The list of mappings that will be searched
c.maps = list(reversed(c.maps))  # Reverse the search order
print(c)
# c and d1-d4 share the same underlying mappings, so modifying c changes one
# of d1-d4, and modifying d1-d4 changes c.
# Use new_child to create a new child mapping; changes to it no longer touch
# the underlying data.
c2 = c.new_child()
c2["a4"] = 100
print(c)
print(c2)
# c is unchanged; only c2 changed.
d5 = {"a5": 34, "b5": 78}
c2 = c2.new_child(d5)  # A new mapping can also be stacked on top of the existing ones
print(c2)

8. Inserting Elements Without Breaking the List Order

import bisect

"""
The bisect module maintains sorted lists. It implements an algorithm for
inserting an element into a sorted list, which in some cases is more efficient
than repeatedly sorting the list, or building a large list and then sorting
it. Bisect refers to bisection: binary search finds the proper position for a
new element in a sorted list, so there is no need to call sort each time to
keep the list ordered.
"""
values = [14, 85, 77, 26, 50, 45, 66, 79, 10, 3, 84, 77, 1]
print("New Pos Content")
print("--- --- -------")
l = []
for i in values:
    position = bisect.bisect(l, i)  # Returns the insertion position
    bisect.insort(l, i)             # Equivalent to insort_right
    print('{:3}{:3}'.format(i, position), l)

"""
The functions provided by the bisect module:
bisect.bisect_left(a, x, lo=0, hi=len(a)): find the index at which x would be
  inserted into the sorted list a. lo and hi bound the search interval; the
  whole list is used by default. If x is already present, the insertion point
  is to its left. Returns the index.
bisect.bisect_right(a, x, lo=0, hi=len(a)) and bisect.bisect(a, x, lo=0, hi=len(a)):
  like bisect_left, but if x is already present the insertion point is to its right.
bisect.insort_left(a, x, lo=0, hi=len(a)): insert x into the sorted list a;
  equivalent to a.insert(bisect.bisect_left(a, x, lo, hi), x).
bisect.insort_right(a, x, lo=0, hi=len(a)) and bisect.insort(a, x, lo=0, hi=len(a)):
  like insort_left, but if x is already present it is inserted to its right.
The functions fall into two groups: the bisect* functions only find the index
and do not insert, while the insort* functions actually insert. A typical
application of the module is computing grade levels.
"""

9. How Much Do You Know About Set Operations on Dictionaries?

# Use the & operator to find what two dicts have in common.
# Dictionary keys support the common set operations: union, intersection, difference.
a = {'x': 1, 'y': 2, 'z': 3}
b = {'w': 2, 'z': 3, 'x': 3}
# Keys the two dicts share
c = a.keys() & b.keys()
print(c)
# Key/value pairs the two dicts share
d = a.items() & b.items()
print(d)
# Build a new dict with certain keys removed
e = {k: a[k] for k in a.keys() - {'z', 'x'}}
print(e)

10. Giving a Slice a Name

a = "safr3.14"
print(a[-4:])
# The above can be rewritten as
pie = slice(len(a) - 4, len(a))
print(a[pie])

11. Finding the Most Frequent Elements

from collections import Counter

text = "abcdfegtehto;grgtgjri"  # Any iterable works
lis = ["a", "c", "d", "t", "b"]
dic = {"a": 1, "b": 4, "c": 2, "d": 9}  # A dict works too
c = Counter()  # An empty counter can be created and then updated
c.update(text)
c2 = Counter()
c2.update(dic)
c3 = Counter(lis)  # Or pass the object in directly
print(c)
print(c2)
print(c3)
# most_common(n) returns the n most frequent elements as a list of tuples
print(c.most_common(4))

For more tools and Python tips, follow the WeChat official account: python学习开发.


Keras vs TensorFlow vs PyTorch : Comparison of the deep learning Frameworks

$
0
0

Keras, TensorFlow and PyTorch are among the top three frameworks preferred by data scientists as well as beginners in the field of deep learning. This comparison of Keras vs TensorFlow vs PyTorch will give you crisp knowledge about the top deep learning frameworks and help you find out which one is suitable for you. In this blog you will get a complete insight into the above three frameworks in the following sequence:

Introduction

Keras

Keras is an open-source neural network library written in Python. It is capable of running on top of TensorFlow. It is designed to enable fast experimentation with deep neural networks.

TensorFlow

TensorFlow is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library that is used for machine learning applications like neural networks.

PyTorch

PyTorch is an open source machine learning library for Python, based on Torch. It is used for applications such as natural language processing and was developed by Facebook’s AI research group.

Comparison Factors

All three frameworks are related to each other and also have certain basic differences that distinguish them from one another.

So let's have a look at the parameters that distinguish them:

Level of API

Keras is a high-level API capable of running on top of TensorFlow, CNTK and Theano. It has gained favor for its ease of use and syntactic simplicity, facilitating fast development.

TensorFlow is a framework that provides both high- and low-level APIs. PyTorch, on the other hand, is a lower-level API focused on direct work with array expressions. It has gained immense interest in the last year, becoming a preferred solution for academic research and for deep learning applications that require optimizing custom expressions.
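To make the difference in API level concrete, here is a minimal sketch of the same two-layer classifier in both libraries (the layer sizes are illustrative; Keras hides the training loop behind compile/fit, while PyTorch leaves it to you):

from keras.models import Sequential
from keras.layers import Dense

keras_model = Sequential([
    Dense(64, activation="relu", input_shape=(784,)),  # hidden layer
    Dense(10, activation="softmax"),                   # output layer
])
keras_model.compile(optimizer="adam", loss="categorical_crossentropy")

import torch.nn as nn

torch_model = nn.Sequential(
    nn.Linear(784, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
# In PyTorch you then write the forward pass, loss, backward pass and
# optimizer step yourself, which is exactly the lower-level control
# described above.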

Speed

Keras is comparatively slower, whereas TensorFlow and PyTorch run at a similar, faster pace that is suitable for high-performance workloads.

Architecture

Keras has a simple architecture and is more readable and concise. TensorFlow, on the other hand, is not very easy to use, even though it provides Keras as a framework that makes work easier. PyTorch has a complex architecture, and its readability is lower compared to Keras.

Debugging

With Keras, there is rarely a need to debug simple networks. With TensorFlow, however, debugging is quite difficult. PyTorch has better debugging capabilities than the other two.

Dataset

Keras is usually used for small datasets, as it is comparatively slower. TensorFlow and PyTorch, on the other hand, are used for high-performance models and large datasets that require fast execution.

Popularity

With the increasing demand in the field of data science, deep learning technology has grown enormously in the industry. With this, all three frameworks have gained quite a lot of popularity. Keras tops the list, followed by TensorFlow and PyTorch; it has gained immense popularity due to its simplicity compared to the other two.


These were the parameters that distinguish the three frameworks, but there is no absolute answer as to which one is better. The choice ultimately comes down to:

Your technical background
Your requirements and ease of use

Final Verdict

Now, coming to the final verdict of Keras vs TensorFlow vs PyTorch, let's have a look at the situations in which each of these three deep learning frameworks is preferable.



Keras is most suitable for:

Rapid prototyping
Small datasets
Multiple back-end support

TensorFlow is most suitable for:

Large datasets
High-performance functionality
Object detection

PyTorch is most suitable for:

Flexibility
Short training duration
Debugging capabilities

With this, we come to the end of this comparison of Keras vs TensorFlow vs PyTorch. I hope you enjoyed this article and understood which deep learning framework is most suitable for you.

Now that you have understood the comparison between Keras, TensorFlow and PyTorch, check out the AI and Deep Learning With TensorFlow course by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. This certification training is curated by industry professionals as per industry requirements and demands. You will master concepts such as the SoftMax function, autoencoder neural networks and Restricted Boltzmann Machines (RBM), and work with libraries like Keras and TFLearn.

Got a question for us? Please mention it in the comments section of “Keras vs TensorFlow vs PyTorch” and we will get back to you.

SHAP and LIME Python Libraries: Part 1 Great Explainers, with Pros and Cons t ...


This blog post provides a brief technical introduction to the SHAP and LIME python libraries, followed by code and output to highlight a few pros and cons of each.

Introduction

Model explainability is a priority in today’s data science community. As data scientists, we want to prevent model bias and help decision makers understand how to use our models in the right way. Data science leaders and executives are mindful of existing and upcoming legislation that requires models to provide evidence of how they work and how they avoid mistakes (e.g., SR 11-7 and The FUTURE of AI Act ).

Part 1 in this blog post provides a brief technical introduction to the SHAP and LIME Python libraries, followed by code and output to highlight a few pros and cons of each. Part 2 will explore these libraries in more detail by applying them to a variety of Python models. The goal of these posts is to familiarize readers with how to use these libraries in practice and how to interpret their output, helping you leverage model explanations in your own work.

SHAP and LIME

SHAP and LIME are both popular Python libraries for model explainability. SHAP (SHapley Additive exPlanations) leverages the idea of Shapley values for model feature influence scoring. The technical definition of a Shapley value is the "average marginal contribution of a feature value over all possible coalitions." In other words, Shapley values consider all possible predictions for an instance using all possible combinations of inputs. Because of this exhaustive approach, SHAP can guarantee properties like consistency and local accuracy. LIME (Local Interpretable Model-agnostic Explanations) builds sparse linear models around each prediction to explain how the black box model works in that local vicinity. In their NIPS paper, the authors of SHAP show that Shapley values provide the only guarantee of accuracy and consistency, and that LIME is actually a subset of SHAP that lacks the same properties. For further study, I found the SHAP GitHub and LIME GitHub sites helpful resources.

So why would anyone ever use LIME? Simply put, LIME is fast, while Shapley values take a long time to compute. For you statisticians out there, this situation reminds me somewhat of Fisher’s Exact Test versus a Chi-Squared Test on contingency tables. Fisher’s Exact Test provides the highest accuracy possible because it considers all possible outcomes, but it takes forever to run on large tables. This makes the Chi-Squared Test, a distribution-based approximation, a nice alternative.

The SHAP Python library helps with this compute problem by using approximations and optimizations to greatly speed things up while seeking to keep the nice Shapley properties. When you use a model with a SHAP optimization, things run very fast and the output is accurate and reliable. Unfortunately, SHAP is not optimized for all model types yet.

For example, SHAP has a tree explainer that runs fast on trees, such as gradient boosted trees from XGBoost and scikit-learn and random forests from scikit-learn, but for a model like k-nearest neighbors, even on a very small dataset, it is prohibitively slow. Part 2 of this post will review a complete list of SHAP explainers. The code and comments below document this deficiency of the SHAP library on the Boston Housing dataset. This code is a subset of a Jupyter notebook I created to walk through examples of SHAP and LIME. The notebook is hosted on Domino's trial site. Click here to view, download, or run the notebook.
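For contrast, here is a minimal sketch of the fast, tree-optimized path (the XGBoost model is illustrative and separate from the notebook code that follows):

import shap
import xgboost as xgb

X, y = shap.datasets.boston()
xgb_model = xgb.XGBRegressor().fit(X, y)
explainer = shap.TreeExplainer(xgb_model)  # optimized explainer for tree ensembles
shap_values = explainer.shap_values(X)     # runs in seconds rather than hours
shap.summary_plot(shap_values, X)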

# Load Libraries
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
import sklearn.ensemble
import numpy as np
import lime
import lime.lime_tabular
import shap
import xgboost as xgb
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d, Axes3D
import seaborn as sns
import time
%matplotlib inline
# Load Boston Housing Data
X,y = shap.datasets.boston()
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# K Nearest Neighbor
knn = sklearn.neighbors.KNeighborsRegressor()
knn.fit(X_train, y_train)
# Create the SHAP Explainers
# SHAP has the following explainers: deep, gradient, kernel, linear, tree, sampling
# Must use Kernel method on knn
# Summarizing the data with k-Means is a trick to speed up the processing
"""
Rather than use the whole training set to estimate expected values, we summarize with
a set of weighted kmeans, each weighted by the number of points they represent.
Running without kmeans took 1 hr 6 mins 7 sec. Running with kmeans took 2 min 47 sec.
Boston Housing is a small dataset.
Running SHAP on models that require the Kernel method becomes prohibitive.
"""
# build the kmeans summary
X_train_summary = shap.kmeans(X_train, 10)
# using the kmeans summary
t0 = time.time()
explainerKNN = shap.KernelExplainer(knn.predict,X_train_summary)
shap_values_KNN_test = explainerKNN.shap_values(X_test)
t1 = time.time()
timeit=t1-t0
timeit
# without kmeans: a test run took 3967.6232330799103 seconds
"""
t0 = time.time()
explainerKNN = shap.KernelExplainer(knn.predict, X_train)
shap_values_KNN_test = explainerKNN.shap_values(X_test)
t1 = time.time()
timeit = t1 - t0
timeit
"""
# now we can plot the SHAP explainer
j = 0  # index of the test observation to explain (defined earlier in the full notebook)
shap.force_plot(explainerKNN.expected_value, shap_values_KNN_test[j], X_test.iloc[[j]])

Running SHAP on a knn model built on the Boston Housing dataset took over an hour, which is a tough pill to swallow. We can get that down to three minutes if we sacrifice some accuracy and reliability by summarizing the data first with a k-means algorithm. As an alternative approach, we could use LIME. LIME runs instantaneously with the same knn model and does not require summarizing with k-means. See the code and output below. Note that LIME’s output is different than the SHAP output, especially for features AGE and B. With LIME not having the same accuracy and consistency properties as Shapley Values, and with SHAP using a k-means summary before calculating influence scores, it’s tough to tell which comes closer to the correct answer.

# The LIME explainer was built earlier in the full notebook; an assumed setup would be:
# explainer = lime.lime_tabular.LimeTabularExplainer(
#     X_train.values, feature_names=X_train.columns, mode="regression")
exp = explainer.explain_instance(X_test.values[j], knn.predict, num_features=5)
exp.show_in_notebook(show_table=True)
While LIME provided a nice alternative in the knn model example, LIME is unfortunately not always able to save the day. It doesn’t work out-of-the-box on all models. For example, LIME cannot handle the requirement of XGBoost to use xgb.DMatrix() on the input data. See below for one

[Backend Stack: Django] Part 1: Deploying Django with Docker Containers — An Introduction to Docker



You may wonder why a series about Django opens by introducing Docker. To be honest, not knowing Docker will not affect your study of Django at all. I had heard about Docker long ago, though, and am taking this opportunity to learn it and put it into practice, tying the whole process together. I am not very familiar with Docker myself; I am learning as I go. When I pick up something new, I like to organize it around questions and consult online resources or books whenever I hit something I do not understand. This post takes the same approach and unfolds around a few questions. It will not dig too deeply into Docker; the goal is "good enough to use". Deeper knowledge can be picked up later in practice as problems come up, since time and energy are limited.

1. Containerization vs. Virtualization

Virtualization runs one or more independent machines virtually on top of physical hardware through a middleware layer. Users cannot perceive which machine is actually serving them; what they experience is just like using a single machine, even though that "machine" may physically be a cluster of several hosts rather than a single one. A virtual machine abstracts hardware resources: each VM instance occupies a specified amount of CPU, memory, disk and other resources, and these resources are not shared between VM instances.

And what is containerization? Containerization starts from the application: it splits applications into containers, and these containers run directly in user space on top of the operating-system kernel. Container technology lets multiple independent user spaces run on a single host. In other words, containers abstract software resources; a container is not much different from any other application running on Linux.

In the early days, virtualization was thought to offer the greatest flexibility in managing virtual infrastructure. As time went on, however, people noticed a problem: every virtual machine has to run a complete operating system plus the large set of applications installed in it. In real development and production environments, what we actually care about is the application we deploy. If every release requires a full operating system and its accompanying dependencies, deployment becomes heavyweight and performance suffers. So people began to ask: is there some way to focus on the application itself, while the redundant operating system and environment underneath are shared and reused? Put another way: once I have deployed a service and got it running, can I move it somewhere else without installing another operating system and dependency environment? This is the scenario from which containerization emerged.

2. Characteristics of Containers

Containers are self-contained: a container packages the application together with all of its dependencies and can run directly.

Containers are portable, which guarantees that the application has exactly the same runtime environment in development, testing, production and so on.

Containers are isolated from one another: multiple containers running on the same host do not affect each other.

Containers are lightweight: they start in seconds and consume very few resources.

3. What Is Docker? Its Components and Image Structure

Docker is an open-source engine that automatically deploys applications into containers. With Docker, developers only need to care about the application running inside the container, and operations staff only need to care about managing the containers.

It guarantees that the development environment in which the code is written is consistent with the production environment the application is deployed to. For programmers who have watched everything work in development, only to face problem after problem once deployed and endless rounds of joint debugging, this is an enormous blessing.

Docker 目前大量用于:

持续集成和持续部署 (CI/CD), 加速应用管道自动化和应用部署,

以及结合微服务技术构建可伸缩扩展的服务框架,

服务器资源共享

创建隔离的运行环境

这些场景。

So which components is Docker made of?

Docker consists mainly of:

The Docker engine: a client-server application. The client sends requests to the Docker server through the docker command-line tool and a RESTful API; the Docker server, also called the daemon, does all the work and returns the results. Server and client can run on the same host, or a local Docker client can connect to a remote Docker server on another host.
Docker images: users run their containers from images. You can think of an image as the container's source code, or as the installation disc for an operating system; writing a Dockerfile is like burning that disc.
Docker containers: if the Docker image is the installation disc, the Docker container is the runnable system produced from that disc.
Registry: analogous to GitHub, except that GitHub stores code while a Registry stores Docker images; in other words, it is a Docker image repository.

Below is a diagram of how the Docker components fit together:

[Figure: the Docker components and how they interact]

From a user's perspective, the key skill is building images. Docker images are built from a Dockerfile; writing Dockerfiles is covered below. First, let's look at how a Docker image is structured:


[Figure: the layered structure of a Docker image]

Containers start and run from images, so images are the cornerstone of Docker. A Docker image is a layered, read-only file system. At the very bottom sits a boot file system, bootfs. Docker users almost never interact with the boot file system; in fact, once a container has started, bootfs is moved into memory and the boot file system is unmounted.

The second layer of a Docker image is rootfs (the root file system), which sits on top of the boot file system and can hold one of several operating systems. In a traditional Linux system, the root file system is first mounted read-only and switched to read-write once booting completes. In Docker, however, rootfs stays read-only forever, and Docker uses union mounting to load more read-only file systems on top of it. Union mounting loads several file systems at once, yet from the outside only a single file system is visible: the layers are merged so that the final file system contains all of their files and directories. Docker calls such a file system an image.

One image can be placed on top of another; the image below is called the parent image. A container can run one or more of the user's processes. When a container starts, Docker adds a read-write file-system layer on top of the image, and the programs we run in Docker execute in this layer. When a container first starts, the read-write layer is empty; whenever a file changes, the change is applied to this layer. For example, to modify a file, the file is first copied from the read-only layer into the read-write layer, and the read-only version is then hidden. This is Docker's copy-on-write mechanism.

4. Common Docker Commands

Image commands:

Pull an image to the local machine: docker pull ubuntu
List local images: docker images
Search for an image: docker search xxx
Delete an image: docker rmi d5a6e75613ea
Log in to / out of Docker Hub: docker login / docker logout
Push an image: first create a repository on Docker Hub; the standard name format is username/image-name. For an image named testdocker, the build command is:
docker build -t "tbfungeek/testdocker:0.0.1" .
and it can then be pushed to Docker Hub with:
docker push tbfungeek/testdocker:0.0.1

Container commands:

Show Docker info: docker info
List containers, including stopped ones: docker ps -a
Create a container: docker run -dit -p 8888:80 --name test ubuntu /bin/bash
Delete a container: docker rm <container id>
Start a container: docker start xxxx
Restart a container: docker restart xxxx
Attach to a container: docker attach xxxx
Exit a container: exit
Stop a container: docker stop
View logs: docker logs -f xxxx
View ports: docker port 4d17d19e34e2

5. Common Dockerfile Instructions

The build is executed by the Docker background daemon, not by the CLI. Before the build, the build process sends the entire context (recursively) to the daemon.

When creating a Docker image, it is recommended to start from a newly created empty directory as the build context and to put the Dockerfile in the top level of that context (the build directory can be specified with the -f flag, but the top of the context is still recommended). Keep only the files required to build the current image in this context directory, and ignore unneeded files with a .dockerignore file.

The Docker daemon executes the instructions in the Dockerfile one by one, committing and producing a new image at every step, and finally prints the ID of the resulting image. When generation completes, the daemon automatically cleans up the context you sent.

Each instruction in a Dockerfile is executed independently and creates a new image, so a command such as RUN cd /tmp has no effect on the next instruction.

Docker reuses intermediate images it has already generated to speed up docker build.

1. Create a directory.
2. Create a Dockerfile.
3. Write the Dockerfile:

# Version: 0.0.1
FROM ubuntu:latest
MAINTAINER linxiaohai "tbfungeek@163.com"
RUN apt-get update && apt-get install -y vim
EXPOSE 80

4. Build the Dockerfile into an image:

docker build -f web_container/Dockerfile .
docker build --no-cache -t "linxiaohai/web:v1" .
docker build --no-cache -t "linxiaohai/web:v1" git@github.com:xxx/web_container

FROM

FROM <image>
FROM <image>:<tag>
FROM <image>@<digest>

The first non-comment instruction in a Dockerfile must be FROM; it specifies which image to use as the base image.

LABEL

Adds labels to the image being built.

If the base image also carries labels, they are inherited; a label with the same name is overwritten. To reduce the number of layers, put as many labels as possible into a single LABEL instruction, for example:

LABEL author="lin xiaohai" \
      version="0.0.1"

Once set, the labels can be viewed with docker inspect:

"Labels": {
    "author": "lin xiaohai",
    "version": "0.0.1"
}

VOLUME

VOLUME creates a mount point, i.e. it adds a volume to the containers created from the image being built:

VOLUME ["/var/log"] VOLUME /var/log /var/db

For example, create a mount point through VOLUME:

ENV volum "/home/mydata"
VOLUME ${volum}

Build the image, naming it docker_file, then run a container from the newly built image. When running the container, the -v flag binds a local directory to the container's volume (mount point) so that the container can access data on the host:

docker run -dit -v ~/test:/home/mydata/ --name "volumetests" docker_file

USER

USER specifies the user the image runs as:

USER daemon

When specifying a user with USER, you can use a user name, UID or GID, or a combination of the two. All of the following are legal:

USER user
USER user:group
USER uid
USER uid:gid
USER user:gid
USER uid:group

Once USER specifies a user, the subsequent RUN, CMD and ENTRYPOINT instructions in the Dockerfile all run as that user. After the image is built, the user can be overridden with the -u flag of docker run.

WORKDIR

WORKDIR /path/to/workdir

The WORKDIR instruction sets the working directory for the RUN, CMD and ENTRYPOINT instructions in the Dockerfile (the default is the / directory). It may appear several times in a Dockerfile; a relative path is resolved against the previous WORKDIR value, as the example below shows.
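For example, a small illustration of the relative-path behavior (following the pattern from the official Docker documentation):

WORKDIR /a
WORKDIR b
WORKDIR c
RUN pwd
# pwd prints /a/b/c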

ARG

ARG declares a variable that can be passed to the build at run time:

ARG <name>[=<default value>]

When building the image with docker build, these variables can be set or overridden with the --build-arg <name>=<value> flag.

Docker has a set of built-in build arguments that do not need to be declared in the Dockerfile: HTTP_PROXY, http_proxy, HTTPS_PROXY, https_proxy, FTP_PROXY, ftp_proxy, NO_PROXY, no_proxy.

RUN

The RUN instruction executes an arbitrary command on top of the current image and commits the result as a new (intermediate) image, which the later steps continue to use.

As seen above, RUN supports two forms.

The shell form, equivalent to running /bin/sh -c "...":

RUN apt-get install vim -y

The exec form, which does not invoke a shell, so environment variables such as $HOME are not expanded; in exchange, it can run in images that have no bash and avoids mis-parsing of the command string:

RUN ["apt-get", "install", "vim", "-y"] 或 RUN ["/bin/bash", "-c", "apt-get install vim -y"] 与shell风格相同

RUN can execute any command, then creates a new layer on top of the current image and commits it. The committed image is used for the next step in the Dockerfile.

When executing several commands with one RUN instruction, you can continue the line with \ or separate the commands with semicolons on a single line, as in the sketch below.
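A minimal illustration (the packages are arbitrary):

RUN apt-get update && \
    apt-get install -y vim curl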

CMD

A Dockerfile may contain only one effective CMD; if there are several, only the last one takes effect. CMD's main purpose is to provide the default command or arguments for the container started by docker run after the build completes. These defaults can include an executable command, or only arguments (in which case the executable must be specified beforehand in ENTRYPOINT).

CMD is very similar to ENTRYPOINT. The difference is that if docker run is given a command of its own, it overrides CMD, whereas ENTRYPOINT treats everything after the container name as arguments passed to its specified command (the command itself is not overridden). CMD can also serve on its own as the optional default arguments for the command set by ENTRYPOINT.

The difference between CMD and RUN is that RUN executes while the image is being built, before CMD and ENTRYPOINT; CMD runs every time a container starts, while RUN runs only once when the image is created and its result is baked into the image.
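A small hedged illustration of the CMD/ENTRYPOINT interplay (image name and command are arbitrary):

ENTRYPOINT ["ping"]
CMD ["localhost"]
# docker run <image>          -> ping localhost (CMD supplies the default argument)
# docker run <image> 8.8.8.8  -> ping 8.8.8.8   (the CMD default is overridden)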

ENTRYPOINT

ENTRYPOINT sets the command executed when the container starts. If there are several ENTRYPOINT instructions, only the last one takes effect.

With the exec form, the arguments given to docker run are appended after the ENTRYPOINT command.

ENV

Sets environment variables:

ENV <key> <value>

Once set, the variable can be used by all subsequent RUN commands, and it remains in effect when the resulting image is run. To change these environment variables at run time, pass the --env <key>=<value> flag to docker run.

ADD

Copies files from the build context into the image at build time. Format:

ADD <src>... <dest>
ADD ["<src>",... "<dest>"]

<src> may be a file, a directory, or a file URL. It supports wildcards (similar to shell globbing), and several <src> entries may be given. Each <src> must be inside the context directory or its subdirectories; a file such as ../a.txt cannot be added. If <src> is a directory, everything inside it is copied, but not the directory itself. If <src> is an archive in a format Docker recognizes, Docker unpacks it (as with tar -x) and copies the contents to <dest>.

<dest> may be an absolute path or a path relative to WORKDIR. Directories missing from the path are created automatically, cascading as needed. Decide whether <dest> should end with a slash; getting into the habit of ending it with / avoids having it treated as a file.

COPY

COPY has the same syntax and behavior as ADD, except that it supports neither remote URLs for <src> nor automatic unpacking. Even so, Best Practices for Writing Dockerfiles recommends preferring COPY and replacing ADD with a combination of RUN and COPY: although COPY can only copy local files into the container, its behavior is more transparent than ADD's. Use ADD only when copying tar archives, e.g. ADD trusty-core-amd64.tar.gz /.

EXPOSE

The EXPOSE instruction tells the container which ports to listen on at run time, but these ports are only used for communication between containers (links); the outside host cannot reach them. To expose a port to the outside host, use the -p option when starting the container.

ONBUILD

Adds a trigger to the image: when a new image is built with this image as its base image, the trigger executes its instruction. Format:

ONBUILD [INSTRUCTION]

For example, suppose the image we build is used to deploy Python code, but several projects may reuse the image. A suitable approach is:

[...]
# In the next build that uses this image as its base, run ADD . /app/src
# to copy the project code into the new image
ONBUILD ADD . /app/src
# and then build the Python code
ONBUILD RUN /usr/local/bin/python-build --dir /app/src
[...]

Note

ONBUILD is only inherited by direct child images, not by grandchildren.

ONBUILD ONBUILD, ONBUILD FROM and ONBUILD MAINTAINER are not allowed.

STOPSIGNAL

STOPSIGNAL sets the system-call signal that will be sent to the container to stop it:

STOPSIGNAL signal

The signal must be a valid value from the kernel's syscall table, e.g. 9 or SIGKILL.

The following materials can be used for further study:

Random Notes (Learning dlib)


[Notice: All rights reserved. Reposting is welcome; please do not use for commercial purposes. Contact: feixiaoxing@163.com]

OpenCV is very widely used, but its performance really leaves something to be desired, so people have gradually started looking at other open-source libraries, and dlib is a good choice. That said, OpenCV is not without merit; nowadays it is mainly used for basic image-data processing. dlib can handle face detection, face rotation, face recognition, video detection and more, and for ordinary scenarios it basically works without major problems.

1. Installing OpenCV

shell> sudo pip install opencv-python

2. Installing dlib

shell> sudo apt-get install libpython-dev

shell> sudo pip install dlib

Because dlib has to be compiled here from C/C++ sources, installing libpython-dev is essential.

3. The Simplest dlib Application

#!/usr/bin/python
# The contents of this file are in the public domain. See LICENSE_FOR_EXAMPLE_PROGRAMS.txt
#
# This example program shows how to find frontal human faces in an image. In
# particular, it shows how you can take a list of images from the command
# line and display each on the screen with red boxes overlaid on each human
# face.
#
# The examples/faces folder contains some jpg images of people. You can run
# this program on them and see the detections by executing the
# following command:
# ./face_detector.py ../examples/faces/*.jpg
#
# This face detector is made using the now classic Histogram of Oriented
# Gradients (HOG) feature combined with a linear classifier, an image
# pyramid, and sliding window detection scheme. This type of object detector
# is fairly general and capable of detecting many types of semi-rigid objects
# in addition to human faces. Therefore, if you are interested in making
# your own object detectors then read the train_object_detector.py example
# program.
#
#
# COMPILING/INSTALLING THE DLIB PYTHON INTERFACE
# You can install dlib using the command:
# pip install dlib
#
# Alternatively, if you want to compile dlib yourself then go into the dlib
# root folder and run:
# python setup.py install
#
# Compiling dlib should work on any operating system so long as you have
# CMake installed. On Ubuntu, this can be done easily by running the
# command:
# sudo apt-get install cmake
#
# Also note that this example requires Numpy which can be installed
# via the command:
# pip install numpy

import sys
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
win = dlib.image_window()

for f in sys.argv[1:]:
    print("Processing file: {}".format(f))
    img = dlib.load_rgb_image(f)
    # The 1 in the second argument indicates that we should upsample the image
    # 1 time. This will make everything bigger and allow us to detect more
    # faces.
    dets = detector(img, 1)
    print("Number of faces detected: {}".format(len(dets)))
    for i, d in enumerate(dets):
        print("Detection {}: Left: {} Top: {} Right: {} Bottom: {}".format(
            i, d.left(), d.top(), d.right(), d.bottom()))
    win.clear_overlay()
    win.set_image(img)
    win.add_overlay(dets)
    dlib.hit_enter_to_continue()

# Finally, if you really want to you can ask the detector to tell you the score
# for each detection. The score is bigger for more confident detections.
# The third argument to run is an optional adjustment to the detection threshold,
# where a negative value will return more detections and a positive value fewer.
# Also, the idx tells you which of the face sub-detectors matched. This can be
# used to broadly identify faces in different orientations.
if (len(sys.argv[1:]) > 0):
    img = dlib.load_rgb_image(sys.argv[1])
    dets, scores, idx = detector.run(img, 1, -1)
    for i, d in enumerate(dets):
        print("Detection {}, score: {}, face_type:{}".format(
            d, scores[i], idx[i]))

4. Other Resources

http://dlib.net/

Sending Emails With Python


You probably found this tutorial because you want to send emails using Python. Perhaps you want to receive email reminders from your code, send a confirmation email to users when they create an account, or send emails to members of your organization to remind them to pay their dues. Sending emails manually is a time-consuming and error-prone task, but it's easy to automate with Python.

In this tutorial you’ll learn how to:

Set up a secure connection using SMTP_SSL() and .starttls()

Use Python’s built-in smtplib library to send basic emails

Send emails with HTML content and attachments using the email package

Send multiple personalized emails using a CSV file with contact data

Use the Yagmail package to send email through your Gmail account using only a few lines of code

You’ll find a few transactional email services at the end of this tutorial, which will come in useful when you want to send a large number of emails.


Getting Started

Python comes with the built-in smtplib module for sending emails using the Simple Mail Transfer Protocol (SMTP). smtplib uses the RFC 821 protocol for SMTP. The examples in this tutorial will use the Gmail SMTP server to send emails, but the same principles apply to other email services. Although the majority of email providers use the same connection ports as the ones in this tutorial, you can run a quick Google search to confirm yours.

To get started with this tutorial, set up a Gmail account for development , or set up an SMTP debugging server that discards emails you send and prints them to the command prompt instead. Both options are laid out for you below. A local SMTP debugging server can be useful for fixing any issues with email functionality and ensuring your email functions are bug-free before sending out any emails.

Option 1: Setting up a Gmail Account for Development

If you decide to use a Gmail account to send your emails, I highly recommend setting up a throwaway account for the development of your code. This is because you’ll have to adjust your Gmail account’s security settings to allow access from your Python code, and because there’s a chance you might accidentally expose your login details. Also, I found that the inbox of my testing account rapidly filled up with test emails, which is reason enough to set up a new Gmail account for development.

A nice feature of Gmail is that you can use the + sign to add any modifiers to your email address, right before the @ sign. For example, mail sent to my+person1@gmail.com and my+person2@gmail.com will both arrive at my@gmail.com . When testing email functionality, you can use this to emulate multiple addresses that all point to the same inbox.

To set up a Gmail address for testing your code, do the following:

Create a new Google account.
Turn Allow less secure apps to ON. Be aware that this makes it easier for others to gain access to your account.

If you don’t want to lower the security settings of your Gmail account, check out Google’s documentation on how to gain access credentials for your Python script, using the OAuth2 authorization framework.

Option 2: Setting up a Local SMTP Server

You can test email functionality by running a local SMTP debugging server, using the smtpd module that comes pre-installed with Python. Rather than sending emails to the specified address, it discards them and prints their content to the console. Running a local debugging server means it's not necessary to deal with encryption of messages or use credentials to log in to an email server.

You can start a local SMTP debugging server by typing the following in Command Prompt:

$ python -m smtpd -c DebuggingServer -n localhost:1025

On linux, use the same command preceded by sudo .

Any emails sent through this server will be discarded and shown in the terminal window as a bytes object for each line:

---------- MESSAGE FOLLOWS ----------
b'X-Peer: ::1'
b''
b'From: my@address.com'
b'To: your@address.com'
b'Subject: a local test mail'
b''
b'Hello there, here is a test email'
------------ END MESSAGE ------------

For the rest of the tutorial, I’ll assume you’re using a Gmail account, but if you’re using a local debugging server, just make sure to use localhost as your SMTP server and use port 1025 rather than port 465 or 587. Besides this, you won’t need to use login() or encrypt the communication using SSL/TLS.
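For example, a minimal sketch of sending through the local debugging server (the addresses are placeholders; the message is printed to the server's console instead of being delivered):

import smtplib

with smtplib.SMTP("localhost", 1025) as server:
    server.sendmail(
        "my@address.com",
        "your@address.com",
        "Subject: a local test mail\n\nHello there, here is a test email",
    )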

Sending a Plain-Text Email

Before we dive into sending emails with HTML content and attachments, you’ll learn to send plain-text emails using Python. These are emails that you could write up in a simple text editor. There’s no fancy stuff like text formatting or hyperlinks. You’ll learn that a bit later.

Starting a Secure SMTP Connection

When you send emails through Python, you should make sure that your SMTP connection is encrypted, so that your message and login credentials are not easily accessed by others. SSL (Secure Sockets Layer) and TLS (Transport Layer Security) are two protocols that can be used to encrypt an SMTP connection. It’s not necessary to use either of these when using a local debugging server.

There are two ways to start a secure connection with your email server:

SMTP_SSL()
.starttls()

In both instances, Gmail will encrypt emails using TLS, as this is the more secure successor of SSL. As per Python’s Security considerations , it is highly recommended that you use create_default_context() from the ssl module. This will load the system’s trusted CA certificates, enable host name checking and certificate validation, and try to choose reasonably secure protocol and cipher settings.

If you want to check the encryption for an email in your Gmail inbox, go to More → Show original to see the encryption type listed under the Received header.

smtplib is Python’s built-in module for sending emails to any Internet machine with an SMTP or ESMTP listener daemon.

I’ll show you how to use SMTP_SSL() first, as it instantiates a connection that is secure from the outset and is slightly more concise than the .starttls() alternative. Keep in mind that Gmail requires that you connect to port 465 if using SMTP_SSL() , and to port 587 when using .starttls() .

Option 1: Using SMTP_SSL()

The code example below creates a secure connection with Gmail’s SMTP server, using the SMTP_SSL() of smtplib to initiate a TLS-encrypted connection. The default context of ssl validates the host name and its certificates and optimizes the security of the connection. Make sure to fill in your own email address instead of my@gmail.com :

import smtplib, ssl

port = 465  # For SSL
password = input("Type your password and press enter: ")

# Create a secure SSL context
context = ssl.create_default_context()

with smtplib.SMTP_SSL("smtp.gmail.com", port, context=context) as server:
    server.login("my@gmail.com", password)
    # TODO: Send email here

Using with smtplib.SMTP_SSL() as server: makes sure that the connection is automatically closed at the end of the indented code block. If port is zero, or not specified, .SMTP_SSL() will use the standard port for SMTP over SSL (port 465).

It’s not safe practice to store your email password in your code, especially if you intend to share it with others. Instead, use input() to let the user type in their password when running the script, as in the example above. If you don’t want your password to show on your screen when you type it, you can import the getpass module and use .getpass() instead for blind input of your password.
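For example, a minimal sketch of blind password input with getpass:

import getpass

password = getpass.getpass("Type your password and press enter: ")  # input stays hidden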

Option 2: Using .starttls()

Instead of using .SMTP_SSL() to create a connection that is secure from the outset, we can create an unsecured SMTP connection and encrypt it using .starttls() .

To do this, create an instance of smtplib.SMTP , which encapsulates an SMTP connection and allows you access to its methods. I recommend defining your SMTP server and port at the beginning of your script to configure them easily.

The code snippet below uses the construction server = SMTP() , rather than the format with SMTP() as server: which we used in the previous example. To make sure that your code doesn’t crash when something goes wrong, put your main code in a try block, and let an except block print any error messages to stdout :

import smtplib, ssl

smtp_server = "smtp.gmail.com"
port = 587  # For starttls
sender_email = "my@gmail.com"
password = input("Type your password and press enter: ")

# Create a secure SSL context
context = ssl.create_default_context()

# Try to log in to server and send email
try:
    server = smtplib.SMTP(smtp_server, port)
    server.ehlo()  # Can be omitted
    server.starttls(context=context)  # Secure the connection
    server.ehlo()  # Can be omitted
    server.login(sender_email, password)
    # TODO: Send email here
except Exception as e:
    # Print any error messages to stdout
    print(e)
finally:
    server.quit()

To identify yourself to the server, .helo() (SMTP) or .ehlo() (ESMTP) should be called after creating an .SMTP() object, and again after .starttls() . This function is implicitly called by .starttls() and .sendmail() if needed, so unless you want to check the SMTP service extensions of the server, it is not necessary to use .helo() or .ehlo() explicitly.

Sending Your Plain-text Email

After you initiated a secure SMTP connection using either of the above methods, you can send your email using .sendmail() , which pretty much does what it says on the tin:

server.sendmail(sender_email, receiver_email, message)

I recommend defining the email addresses and message content at the top of your script, after the imports, so you can change them easily:

sender_email = "my@gmail.com" receiver_email = "your@gmail.com" message = """\ Subject: Hi there This message is sent from Python.""" # Send email here

The message string starts with "Subject: Hi there" followed by two newlines ( \n ). This ensures Hi there shows up as the subject of the email, and the text following the newlines will be treated as the message body.

The code example below sends a plain-text email using SMTP_SSL() :

import smtplib, ssl

port = 465  # For SSL
smtp_server = "smtp.gmail.com"
sender_email = "my@gmail.com"  # Enter your address
receiver_email = "your@gmail.com"  # Enter receiver address
password = input("Type your password and press enter: ")
message = """\
Subject: Hi there

This message is sent from Python."""

context = ssl.create_default_context()
with smtplib.SMTP_SSL(smtp_server, port, context=context) as server:
    server.login(sender_email, password)
    server.sendmail(sender_email, receiver_email, message)

For comparison, here is a code example that sends a plain-text email over an SMTP connection secured with .starttls() . The server.ehlo() lines may be omitted, as they are called implicitly by .starttls() and .sendmail() , if required:

import smtplib, ssl

port = 587  # For starttls
smtp_server = "smtp.gmail.com"
sender_email = "my@gmail.com"
receiver_email = "your@gmail.com"
password = input("Type your password and press enter:")
message = """\
Subject: Hi there

This message is sent from Python."""

context = ssl.create_default_context()
with smtplib.SMTP(smtp_server, port) as server:
    server.ehlo()  # Can be omitted
    server.starttls(context=context)
    server.ehlo()  # Can be omitted
    server.login(sender_email, password)
    server.sendmail(sender_email, receiver_email, message)

Sending Fancy Emails

Python’s built-in email package allows you to structure more fancy emails, which can then be transferred with smtplib as you have done already. Below, you’ll learn how to use the email package to send emails with HTML content and attachments.

Including HTML Content

If you want to format the text in your email ( bold , italics , and so on), or if you want to add any images, hyperlinks, or responsive content, then HTML comes in very handy. Today’s most common type of email is the MIME (Multipurpose Internet Mail Extensions) Multipart email, combining HTML and plain-text. MIME messages are handled by Python’s email.mime module. For a detailed description, check the documentation .

As not all email clients display HTML content by default, and some people choose only to receive plain-text emails for security reasons, it is important to include a plain-text alternative for HTML messages. As the email client will render the last multipart attachment first, make sure to add the HTML message after the plain-text version.

In the example below, our MIMEText() objects will contain the HTML and plain-text versions of our message, and the MIMEMultipart("alternative") instance combines these into a single message with two alternative rendering options:

import smtplib, ssl
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

sender_email = "my@gmail.com"
receiver_email = "your@gmail.com"
password = input("Type your password and press enter:")

message = MIMEMultipart("alternative")
message["Subject"] = "multipart test"
message["From"] = sender_email
message["To"] = receiver_email

# Create the plain-text and HTML version of your message
text = """\
Hi,
How are you?
Real Python has many great tutorials:
www.realpython.com"""
html = """\
<html>
  <body>
    <p>Hi,<br>
       How are you?<br>
       <a href="http://www.realpython.com">Real Python</a> has many great tutorials.
    </p>
  </body>
</html>
"""

# Turn these into plain/html MIMEText objects
part1 = MIMEText(text, "plain")
part2 = MIMEText(html, "html")

# Add HTML/plain-text parts to MIMEMultipart message
# The email client will try to render the last part first
message.attach(part1)
message.attach(part2)

# Create secure connection with server and send email
context = ssl.create_default_context()
with smtplib.SMTP_SSL("smtp.gmail.com", 465, context=context) as server:
    server.login(sender_email, password)
    server.sendmail(
        sender_email, receiver_email, message.as_string()
    )

In this example, you first define the plain-text and HTML message as string literals, and then store them as plain / html MIMEText objects. These can then be added in this order to the MIMEMultipart("alternative") message and sent through your secure connection with the email server. Remember to add the HTML message after the plain-text alternative, as email clients will try to render the last subpart first.

Adding Attachments Using the email Package

In order to send binary files to an email server that is designed to work with textual data, they need to be encoded before transport. This is most commonly done using base64 , which encodes binary data into printable ASCII characters.

The code example below shows how to send an email with a PDF file as an attachment:

import email, smtplib, ssl
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

subject = "An email with attachment from Python"
body = "This is an email with attachment sent from Python"
sender_email = "my@gmail.com"
receiver_email = "your@gmail.com"
password = input("Type your password and press enter:")

# Create a multipart message and set headers
message = MIMEMultipart()
message["From"] = sender_email
message["To"] = receiver_email
message["Subject"] = subject
message["Bcc"] = receiver_email  # Recommended for mass emails

# Add body to email
message.attach(MIMEText(body, "plain"))

filename = "document.pdf"  # In same directory as script

# Open PDF file in binary mode
with open(filename, "rb") as attachment:
    # Add file as application/octet-stream
    # Email client can usually download this automatically as attachment
    part = MIMEBase("application", "octet-stream")
    part.set_payload(attachment.read())

# Encode file in ASCII characters to send by email
encoders.encode_base64(part)

# Add header as key/value pair to attachment part
part.add_header(
    "Content-Disposition",
    f"attachment; filename= {filename}",
)

# Add attachment to message and convert message to string
message.attach(part)
text = message.as_string()

# Log in to server using secure context and send email
context = ssl.create_default_context()
with smtplib.SMTP_SSL("smtp.gmail.com", 465, context=context) as server:
    server.login(sender_email, password)
    server.sendmail(sender_email, receiver_email, text)

The MIMEMultipart() message accepts parameters in the form of RFC5233-style key/value pairs, which are stored in a dictionary and passed to the .add_header method of the Message base class.

Check out the documentation for Python’s email.mime module to learn more about using MIME classes.

Sending Multiple Personalized Emails

Imagine you want to send emails to members of your organization, to remind them to pay their contribution fees. Or maybe you want to send students in your class personalized emails with the grades for their recent assignment. These tasks are a breeze in Python.

Make a CSV File With Relevant Personal Info

An easy starting point for sending multiple personalized emails is to create a CSV (comma-separated values) file that contains all the required personal information. (Make sure not to share other people’s private information without their consent.) A CSV file can be thought of as a simple table, where the first line often contains the column headers.

Below are the contents of the file contacts_file.csv , which I saved in the same folder as my Python code. It contains the names, addresses, and grades for a set of fictional people. I used my+modifier@gmail.com constructions to make sure all emails end up in my own inbox, which in this example is my@gmail.com :

name,email,grade
Ron Obvious,my+ovious@gmail.com,B+
Killer Rabbit of Caerbannog,my+rabbit@gmail.com,A
Brian Cohen,my+brian@gmail.com,C

When creating a CSV file, make sure to separate your values by a comma, without any surrounding whitespaces.

Loop Over Rows to Send Multiple Emails

The code example below shows you how to open a CSV file and loop over its lines of content (skipping the header row). To make sure that the code works correctly before you send emails to all your contacts, I’ve printed Sending email to ... for each contact, which we can later replace with functionality that actually sends out emails:

import csv

with open("contacts_file.csv") as file:
    reader = csv.reader(file)
    next(reader)  # Skip header row
    for name, email, grade in reader:
        print(f"Sending email to {name}")
        # Send email here

In the example above, using with open(filename) as file: makes sure that your file closes at the end of the code block. csv.reader() makes it easy to read a CSV file line by line and extract its values. The next(reader) line skips the header row, so that the following line for name, email, grade in reader: splits subsequent rows at each comma and stores the resulting values in the strings name, email and grade for the current contact.

If the values in your CSV file contain whitespaces on either or both sides, you can remove them using the .strip() method.
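For example, a small sketch cleaning one row (the values are illustrative):

row = [" Ron Obvious ", " my+ovious@gmail.com", " B+ "]
name, email, grade = (value.strip() for value in row)
print(name, email, grade)  # whitespace removed from both sides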

Personalized Content

You can put personalized content in a message by using str.format() to fill in curly-bracket placeholders. For example, "hi {name}, you {result} your assignment".format(name="John", result="passed") will give you "hi John, you passed your assignment" .

As of Python 3.6, string formatting can be done more elegantly using f-strings, but these require the placeholders to be defined before the f-string itself. In order to define the email message at the beginning of the script and fill in placeholders for each contact when looping over the CSV file, the older .format() method is used.

With this in mind, you can set up a general message body, with placeholders that can be tailored to individuals.

Code Example

The following code example lets you send personalized emails to multiple contacts. It loops over a CSV file with name,email,grade for each contact, as in the example above.

The general message is defined in the beginning of the script, and for each contact in the CSV file its {name} and {grade} placeholders are filled in, and a personalized email is sent out through a secure connection with the Gmail server, as you saw before:

import csv, smtplib, ssl

message = """Subject: Your grade

Hi {name}, your grade is {grade}"""
from_address = "my@gmail.com"
password = input("Type your password and press enter: ")

context = ssl.create_default_context()
with smtplib.SMTP_SSL("smtp.gmail.com", 465, context=context) as server:
    server.login(from_address, password)
    with open("contacts_file.csv") as file:
        reader = csv.reader(file)
        next(reader)  # Skip header row
        for name, email, grade in reader:
            server.sendmail(
                from_address,
                email,
                message.format(name=name, grade=grade),
            )

Yagmail

There are multiple libraries designed to make sending emails easier, such as Envelopes , Flanker and Yagmail . Yagmail is designed to work specifically with Gmail, and it greatly simplifies the process of sending emails through a friendly API, as you can see in the code example below:

import yagmail

receiver = "your@gmail.com"
body = "Hello there from Yagmail"
filename = "document.pdf"

yag = yagmail.SMTP("my@gmail.com")
yag.send(
    to=receiver,
    subject="Yagmail test with attachment",
    contents=[body, filename],
)

This code example sends an email with a PDF attachment in a fraction of the lines needed for our example using email and smtplib .

When setting up Yagmail, you can add your Gmail validations to the keyring of your OS, as described in the documentation . If you don’t do this, Yagmail will prompt you to enter your password when required and store it in the keyring automatically.
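For example, a minimal sketch of the one-time keyring registration described in the Yagmail documentation (the credentials are placeholders):

import yagmail

yagmail.register("my@gmail.com", "my_password")  # stores the password in the OS keyring
# From then on, yagmail.SMTP("my@gmail.com") needs no password argument.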

Transactional Email Services

If you plan to send a large volume of emails, want to see email statistics, and want to ensure reliable delivery, it may be worth looking into transactional email services. Although all of the following services have paid plans for sending large volumes of emails, they also come with a free plan so you can try them out. Some of these free plans are valid indefinitely and may be sufficient for your email needs.

Below is an overview of the free plans for some of the major transactional email services. Clicking on the provider name will take you to the pricing section of their website.

Sendgrid: 40,000 emails for your first 30 days, then 100/day
Sendinblue: 300 emails/day
Mailgun: first 10,000 emails free
Mailjet: 200 emails/day
Amazon SES: 62,000 emails/month

You can run a Google search to see which provider best fits your needs, or just try out a few of the free plans to see which API you like working with most.

Sendgrid Code Example

Here’s a code example for sending emails with Sendgrid to give you a flavor of how to use a transactional email service with Python:

import os
import sendgrid
from sendgrid.helpers.mail import Content, Email, Mail

sg = sendgrid.SendGridAPIClient(
    apikey=os.environ.get("SENDGRID_API_KEY")
)
from_email = Email("my@gmail.com")
to_email = Email("your@gmail.com")
subject = "A test email from Sendgrid"
content = Content(
    "text/plain", "Here's a test email sent through Python"
)
mail = Mail(from_email, subject, to_email, content)
response = sg.client.mail.send.post(request_body=mail.get())

# The statements below can be included for debugging purposes
print(response.status_code)
print(response.body)
print(response.headers)

To run this code, you must first:

Sign up for a (free) Sendgrid account
Request an API key for user validation
Add your API key by typing setx SENDGRID_API_KEY "YOUR_API_KEY" in Command Prompt (to store this API key permanently), or set SENDGRID_API_KEY YOUR_API_KEY to store it only for the current client session

More information on how to set up Sendgrid for Mac and Windows can be found in the repository's README on GitHub.

Conclusion

You can now start a secure SMTP connection and send multiple personalized emails to the people in your contacts list!

You’ve learned how to send an HTML email with a plain-text alternative and attach files to your emails. The Yagmail package simplifies all these tasks when you’re using a Gmail account. If you plan to send large volumes of email, it is worth looking into transactional email services.

Enjoy sending emails with Python, and remember: no spam please !

Turn your ML model into a web service in under 10 minutes with AWS CodeStar


As a data scientist, developing a model that makes the right predictions feels incredibly rewarding on its own. An hdf5 file on your machine, however, is often not really helpful for your company or anyone with the same problem you just solved. The next step is therefore often to create a web service for your model that you can access via an API. One option would be to write a Flask application to host on your own server. Unfortunately, this approach is often complex and doesn't scale well. While there are many tools around to help with the set-up and management of virtual servers, everyone who has tried setting up an EC2 instance knows the hassles that come with it. Sticking with AWS, the next option would be running your model on Lambda, exposing it through API Gateway, and so on. Since at least four different services plus code need to be managed, this method might be easier but can still be quite complex. Fortunately, Amazon recognized this issue and introduced a solution in 2017: AWS CodeStar.

CodeStar streamlines the app creation and deployment process by connecting multiple AWS services in a mostly intuitive and easy-to-use way. As an example, we will deploy my implementation of Rob Renalds' Gibberish Detector, a Markov-chain-based tool that detects whether a string contains real words or just random "gibberish". I trained the model on German text and stored it as a Python pickle file.

By the end of this post, we will have a working web service that takes in one variable through a GET request, runs the model via Python code and returns the resulting prediction as JSON.

Step 1: Creating your CodeStar project
Over 30 templates make it easy to set up your web service or app

Since our goal is to establish a web service, we can choose the first Python template. As a result, our service will run “serverless” on Lambda. The following three steps are relatively self-explanatory:

1. We decide on a project name, which AWS will also convert into a URL-friendly project id. This project id will later be part of the HTTP endpoint.
2. Next, we have to choose our preferred Git repository. If you decide on GitHub, you have the option to change the repository name and description and to set it either to public or private. CodeStar will then create the repository for you with all the necessary files in it, but more on those later.
3. To work its magic, CodeStar needs permission to manage all the different tools in our pipeline on your behalf.
Our AWS CodePipeline

Step 2: Connect to your source repository

One of the great things about CodeStar is that the code is all managed via Git, and every update you push into it will update the Lambda function and automatically be deployed. Since AWS automatically creates the repository for you, all that needs to be done to start coding is a git clone your_repository_url .


AWS CodeStar creates all the necessary files for you

Step 3: Create GET parameter in AWS API Gateway

To make changes to the API parameters, we need to open our project in AWS API Gateway. The fastest way leads over the side-bar in our dashboard: Project -> Project Resources -> AWS APIGateway.


AWS API Gateway before our changes

The first step is to add a Method Request . For this exercise, we add one string parameter, which is called string . Since we need this input to run our model, we can make this parameter Required . API Gateway requires us to have a fitting Request Validator in place. Because we are only using URL parameters, the validator Validate query string parameters and headers will do fine. Here is how the resulting page should look:


AWS API Gateway configuration for one URL parameter

Step 4: Write the Lambda function

CodeStar builds from your Git repository. As a result, you can write code in your favorite IDE and push once you are done. The created repository contains the following items by default:

index.py: This file contains the code for your Lambda function.
README.md: The readme file contains basic information about the next steps to take and links to the official documentation.
template.yml: The structure of your 'serverless' AWS architecture.
buildspec.yml: This file contains additional commands that are executed during the build process. A standard pre-build command is the execution of a unit test.
tests/: Contains the file test_handler.py with the unit test mentioned above.

First, we have to make our model file accessible to the function. The easiest way is to add the file into our Git repository. AWS Lambda has relatively generous storage limits , which should be sufficient for most use-cases. Once uploaded, Lambda can access the file the usual way, using open .

Finally, we can write our Python code into index.py, which will become our Lambda function. With our set-up in steps 3 and 4, we can access the URL GET parameter easily through the event parameter:

req_name = event['queryStringParameters']['string']
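Putting this together, a minimal sketch of what index.py could look like follows. The model file name and the scoring step are hypothetical placeholders, not the actual Gibberish Detector code; only the event access and the response shape follow the set-up above, and the handler name has to match the one referenced in template.yml (typically index.handler):

import json
import pickle

# Hypothetical file name; the pickled model is assumed to sit in the repository root
with open('gibberish_model.pki', 'rb') as f:
    MODEL = pickle.load(f)


def handler(event, context):
    # The required URL parameter configured in API Gateway in step 3
    req_name = event['queryStringParameters']['string']

    # Placeholder decision; replace with the real Markov-chain scoring
    is_gibberish = False

    # With Lambda proxy integration, API Gateway expects this response shape
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps({'input': req_name, 'gibberish': is_gibberish})
    }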

You can find the full code on GitHub . After implementing the main function, we have to update the unit test. Remember, if the unit test fails, the service will not be deployed. For that reason, we update everything accordingly:
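A minimal sketch of such a test, reusing the hypothetical handler and event shape from the sketch above:

# tests/test_handler.py -- a sketch; adapt the names to your actual handler
import json

from index import handler


def test_handler_returns_200_for_valid_input():
    event = {'queryStringParameters': {'string': 'hello world'}}
    result = handler(event, None)
    assert result['statusCode'] == 200
    body = json.loads(result['body'])
    assert 'gibberish' in body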

All we have to do now is push the model file and code to the project repository, using the usual commands: git add . , git commit , and git push . As soon as the changes are online, CodeStar will automatically update its code base, then build and deploy everything. You can find the status on the right-hand side of the dashboard.


How it should look

Final words

If you followed along: congratulations, you just made your machine learning model publicly available in less than 10 minutes! You can find the endpoint for your API on the dashboard; add the parameters to it, and voila. Due to the integration of AWS CodePipeline, it is easy to keep your model updated, and the connection to Amazon CloudWatch gives you many insights into what happens to your function once it's out in the wild.

Making your machine learning models public via Lambda is just one of many great things you can do with CodeStar. The next time you get lost in setting up any AWS workflow that involves the usual 5+ services, take a look, maybe CodeStar can help you to reduce your time to production as well.

Bye Bye 403: Building a Filter Resistant Web Crawler Part II: Building a Proxy L ...



Woohooo! We've got our environment set up and are ready to start building our bot! Seeing as the last post in this series (located here ) was mostly informational, let's get right into the code for part two.

So, it is my assumption that if you are reading this series so far, you know the basics of web scraping in Python, and that you are looking to learn how to not get blocked rather than the basics. So let's get right into the first line of defense we have against web filters: IP address cycling.

Many web scrapers will purchase proxies that they are allowed to change once or twice a month for the purpose of not getting filtered by address. But did you know that an IP is usually filtered in under 100 requests? And, seeing as most scrapers make at least 100 requests an hour, purchasing enough IP address proxies to last a month can get expensive.

There are free options out there, but the disadvantage there is that those IP address lists are usually monitored and only last a couple of hours before they're blacklisted. But there is some good news: free proxy providers (good ones, anyway) originate new IPs faster than their old ones get pinged. So we're going to use my personal favorite: free-proxy-list.net

To use this effectively and account for the IPs we've used and abandoned, we're going to build a web scraper to retrieve the IPs dynamically! The first step in that process, and the topic of this article, is to build our proxy list. Create a new file called proxy_retriever.py that contains the following:

import requests
import pandas as pd
from bs4 import BeautifulSoup


class proxyRetriever():

    def __init__(self):
        self.s = requests.Session()
        self.ip_list = []

Pretty simple so far. We've imported our HTTP library (requests) and our HTML parser (BeautifulSoup), as well as the library we're going to use to build the CSV file containing our proxies (pandas). Next, we create a new class called proxyRetriever to handle all actions involved in building our list. Now, you may want to lay out your project differently as far as the OOP concepts you use. But, in my personal opinion, an application that is as compartmentalized as possible makes life easier for yourself and everybody who uses your code after you. Basically, every class handles one general task (building a proxy list, for example). Each function within that class will handle one specific task, and the more specific the better. This way, your code will be very DRY, and debugging will be a cinch, because if something doesn't work correctly, you will have a very good idea of where the problem is based solely on which class and function is supposed to handle that one, very specific thing.

You will also notice that we are defining a requests Session here rather than simply using requests.get(). There are a few reasons why I find this superior to repeatedly using one-off HTTP requests, but a specific post on that is coming soon, so I won't go into it here. Just play along for now.

The second class-level property is the ip_list we will be building. Defining our properties in this way makes it easier for our class functions to share access to that particular variable without having to pass data around to each other, which can A) get messy when multiple threads and/or processes become involved and B) get messy even in a single-threaded environment. To me, it does not matter one whit how elegant your solution is or how brilliant it makes you look if the person who maintains it when you no longer do has no clue how to work with it. Making yourself look more clever than the next guy is the kind of A-type grandstanding that has no place in an environment as collaborative and diverse as the development and engineering field. Simple, readable, easy-to-follow solutions are superior to those that might be a smidgen faster or more efficient, but are totally untenable.

Now that our class is defined and has some initialization behavior, lets add our first function:

def connect_and_parse(self, website):
    r = self.s.get(website)
    soup = BeautifulSoup(r.text, "html.parser")
    proxy_table = soup.find('tbody')
    proxy_list = proxy_table.find_all('tr')
    elites = [tr for tr in proxy_list if 'elite' in tr.text]
    tds = []
    for tr in elites:
        tds.append([td.text for td in tr])
    return tds

With the principle of "one function per function" in mind, connect_and_parse takes a web address as an argument (in this case 'https://free-proxy-list.net/' ), connects via our requests session, and pulls down the HTML located at that address. For the sake of simplicity, I tend to use the "html.parser" parser included with BeautifulSoup, rather than rely on another dependency like lxml.

Pro Tip: Remember, BeautifulSoup is a markup parser. It works with HTML strings and nothing else. I see many folks get into a groove, pass the response object (in this case r) to BeautifulSoup, and hit an exception. Our response is an object that contains properties that BeautifulSoup works with (like .text and .content) but is useless to the parser by itself. Just remember to pass r.text or r.content to avoid that annoying holdup.

Using the browser tools of your preference (I use Chrome Dev Tools), examine the site's content to find that the table containing our proxy information is inside the first instance of the 'tbody' HTML tag. Calling the soup.find() function finds the first occurrence of this tag and assigns it to proxy_table.

To create a list of the rows of this table, assign proxy_table.find_all('tr') to the proxy_list variable.

Pro Tip: Any item returned by the soup.find() function will be a BeautifulSoup object that you can further parse with find(), find_all(), next_sibling(), etc. However, soup.find_all() returns a list filled with BeautifulSoup objects. Even if this list contains only one item, you will still need to access that item specifically to parse it further. In this example, if we tried to call something like proxy_list.find('some-other-element') , we would get an exception, because the proxy_list variable is a list containing soup objects and therefore only has the properties of a list. To further parse an element of find_all(), iterate through the list or access an item by its index.

Our next line creates a list of "elite" status proxies. These types of proxies are totally anonymous and very difficult for filters to pin down and filter on the fly. They are also HTTPS, so your traffic will be both anonymous and encrypted with SSL, both things we want our crawler to have. This process is done using a list comprehension, which you can get a crash course in here .

The next line is an example of readability vs. cleverness. We are creating a list of the cells in each row stored in the elites list. This could be done with a list comprehension; however, it would be a nested list comprehension, a bit of a brain teaser that is difficult to read and would need to be read multiple times just to understand the logic flow. That would make debugging any issues it could later develop take much longer than needed. So instead, we use a traditional for loop to access the rows in elites, then leverage a simpler, more readable list comprehension to process the cells of each row into our desired information on each proxy we are collecting, as shown in the comparison below.
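To make the trade-off concrete, here is the nested comprehension next to the loop actually used in connect_and_parse; both produce the same list of lists:

# One-liner: compact, but harder to scan and debug
tds = [[td.text for td in tr] for tr in elites]

# Loop + inner comprehension: same result, easier to follow
tds = []
for tr in elites:
    tds.append([td.text for td in tr])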

Cool! So we've got our proxy information in a neat little list of lists. Now to process it and toss it into a CSV file we can reference later:

def clean_and_sort(self, data_set):
    columns = ['Address', 'Port', 'Country_Code', 'Country', 'Proxy_Type']
    mapped_lists = {}
    i = 0
    while i < 5:
        mapped_lists.update({columns[i]: [tag[i] for tag in data_set]})
        i += 1
    df = pd.DataFrame(mapped_lists, columns=columns)
    df.to_csv('proxy_ips.csv')

This next function takes the output of our previous function (our collection of lists of proxy details), cleans it, sorts it, and outputs it to a CSV. The columns variable is a list that contains the column names we will be using for our CSV file. We then create an empty dictionary and use a while loop to process only the first 5 items in each list we created in our connect_and_parse function into a dictionary mapping data to CSV columns. This is necessary because, if we re-examine free-proxy-list.net, the table contains 8 columns, and we only care about the first 5. Now, we could have written our scraper to only retrieve the first 5. However, connect_and_parse is not responsible for cleaning and sorting our data, only grabbing it and returning it as an unceremonious blob. Is it the most efficient solution? Perhaps, perhaps not. But when it comes to debugging a problem with sorting the data correctly once retrieved, you will be glad you went about it this way. Because no matter what you change (and subsequently break) in the course of debugging the clean_and_sort function, connect_and_parse will still work just fine. Can you say the same if both responsibilities were given to connect_and_parse by itself?

From there, we call on pandas to create a dataframe from the data in our mapped_lists dictionary, using our columns list as the column headers, and export the dataframe to our new file, proxy_ips.csv. Once opened, this file will look like this:


(screenshot of the resulting proxy_ips.csv)
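Putting both methods together, a hypothetical run of the class looks like this:

# Hypothetical usage of the proxyRetriever class defined above
retriever = proxyRetriever()
proxy_rows = retriever.connect_and_parse('https://free-proxy-list.net/')
retriever.clean_and_sort(proxy_rows)  # writes proxy_ips.csv next to the script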

Okay! Our first line of defense is in place! But before we can start going after the data we want, we have one more common filtering vector we need to confuse: User-Agents. Tune in for part three to continue your studies in filter-proofing your web crawler!


Heap sort in Python with heapq


Heap sort


This is 崔斯特's seventy-eighth original article.

Heap sort in Python

The heapq module implements heap sort in Python and provides the related functions, giving us a simple and quick way to implement sorting algorithms in Python.

heapq's official documentation and source code: Heap queue algorithm

The examples below show how to use heapq.

Implementing heap sort

from heapq import *

def heap_sort(iterable):
    h = []
    for value in iterable:
        heappush(h, value)
    return [heappop(h) for _ in range(len(h))]

if __name__ == '__main__':
    print(heap_sort([1, 3, 5, 9, 2, 123, 4, 88]))

Output: [1, 2, 3, 4, 5, 9, 88, 123]

Now let's go over the main functions.

heappush()

heapq.heappush(heap, item): pushes item onto the heap list heap . Without this step, the later heappop() calls won't work as intended.

heappop()

heapq.heappop(heap): removes the smallest value from the heap list heap and returns it.

>>> h = []                 # define a list
>>> from heapq import *    # import the heapq module
>>> h
[]
>>> heappush(h, 5)         # push values onto the heap one by one
>>> heappush(h, 2)
>>> heappush(h, 3)
>>> heappush(h, 9)
>>> h                      # the value of h
[2, 5, 3, 9]
>>> heappop(h)             # remove the smallest value from h and return it
2
>>> h
[3, 5, 9]
>>> h.append(1)            # note: if a value is appended rather than pushed,
>>> h                      # the heap functions cannot work with it; as far as
[3, 5, 9, 1]               # the heap is concerned, it does not exist
>>> heappop(h)             # the smallest value heappop can find in h is 3, not 1
3
>>> heappush(h, 2)         # this pushes 2 onto the heap, and 1 also becomes part of it
>>> h
[1, 2, 9, 5]
>>> heappop(h)             # now the heap does include 1
1

heapq.heappushpop(heap, item)

This combines the heappush and heappop above, performing both in one call. Note: it is equivalent to calling heappush(heap, item) first and then heappop(heap).

>>> h
[1, 2, 9, 5]
>>> heappop(h)
1
>>> heappushpop(h, 4)   # push 4 and remove the smallest value 2, returning it;
2                       # equivalent to heappush(h, 4) followed by heappop(h)
>>> h
[4, 5, 9]

heapq.heapify(x)

x must be a list. This function turns the list into a heap, in place, so that the heap functions can be used on it in any situation.

>>> a = [3, 6, 1]
>>> heapify(a)    # after turning a into a heap, we can operate on it
>>> heappop(a)
1
>>> b = [4, 2, 5] # b is not a heap; operating on it gives the result below
>>> heappop(b)    # the first value is removed and returned in list order,
4                 # not the smallest one
>>> heapify(b)    # turn it into a heap first, then operate on it
>>> heappop(b)
2

heapq.heapreplace(heap, item)

This is the combined operation of heappop(heap) and heappush(heap, item). Note the difference from heappushpop(heap, item): the order is reversed; here the pop happens first, then the push.

>>> a = []
>>> heapreplace(a, 3)   # raises an error if the list is empty
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index out of range
>>> heappush(a, 3)
>>> a
[3]
>>> heapreplace(a, 2)   # pops first (heappop(a) -> 3), then pushes (heappush(a, 2))
3
>>> a
[2]
>>> heappush(a, 5)
>>> heappush(a, 9)
>>> heappush(a, 4)
>>> a
[2, 4, 9, 5]
>>> heapreplace(a, 6)   # pops and returns the smallest value in a, then pushes 6
2
>>> a
[4, 5, 9, 6]
>>> heapreplace(a, 1)   # 1 is pushed afterwards; before it went in, the smallest value in a was 4
4
>>> a
[1, 5, 9, 6]

heapq.merge(*iterables)

Example:

>>> a = [2, 4, 6]
>>> b = [1, 3, 5]
>>> c = merge(a, b)
>>> list(c)
[1, 2, 3, 4, 5, 6]

The use of this function is demonstrated in detail in merge sort .

heapq.nlargest(n, iterable[, key]), heapq.nsmallest(n, iterable[, key])

Get the n largest or smallest values from a list.

>>> a
[2, 4, 6]
>>> nlargest(2, a)
[6, 4]

The kth largest element in an array

Honestly, everything above was just a build-up to this problem.

Find the kth largest element in an unsorted array. Note that it is the kth largest element in sorted order, not the kth distinct element.

Example 1:
Input: [3,2,1,5,6,4] and k = 2
Output: 5

Example 2:
Input: [3,2,3,1,2,4,5,5,6] and k = 4
Output: 4

Note:

You may assume that k is always valid, and that 1 ≤ k ≤ the length of the array.

I won't cover other solutions here. Of course you can't write it like this in an interview, but it is a nice line of thought.

class Solution:
    def findKthLargest(self, nums, k):
        """
        :type nums: List[int]
        :type k: int
        :rtype: int
        """
        import heapq
        heapq.heapify(nums)
        return heapq.nlargest(k, nums)[-1]

Seeing people use return sorted(nums)[-k] really makes my blood boil.

Reference: https://github.com/qiwsir/algorithm/blob/master/heapq.md

PySide2 and PyQt versions.


A short intro to this Python module named PySide2 can be found on the Wikipedia webpage .

The purpose of this tutorial is to introduce concepts about licenses and to develop Python programs with ergonomic interfaces using PySide2 and the PyQt versions.

PySide2 is a Python binding of the cross-platform GUI toolkit Qt, currently developed by The Qt Company under the Qt for Python project. It is one of the alternatives to the standard library package Tkinter.

...

PySide was released under the LGPL in August 2009 by Nokia,[1] the former owners of the Qt toolkit, after Nokia failed to reach an agreement with PyQt developers Riverbank Computing[7] to change its licensing terms to include LGPL as an alternative license.

Now about the LGPL software license:


The GNU Lesser General Public License (LGPL) is a free software license published by the Free Software Foundation (FSF). The license allows developers and companies to use and integrate software released under the LGPL into their own (even proprietary) software without being required by the terms of a strong copyleft license to release the source code of their own components.

I wrote about PySide in the past; you can find those articles on this blog or on my website .

Let's start by installing PySide2 using pip3.6 .

C:\Python364>cd Scripts

C:\Python364\Scripts>pip3.6.exe install PySide2
Collecting PySide2
  Downloading https://files.pythonhosted.org/packages/10/ba/7448ec862655c356ade22351ed46c9260773186c37ba0d8ceea1ef8c7515/PySide2-5.11.2-5.11.2-cp35.cp36.cp37-none-win_amd64.whl (128.7MB)
    100% |
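To verify the installation, a minimal sketch along these lines (assuming the PySide2 5.11.2 build installed above) should open a small window:

import sys

from PySide2.QtWidgets import QApplication, QLabel

# Every Qt application needs exactly one QApplication instance
app = QApplication(sys.argv)

label = QLabel("Hello, PySide2!")
label.show()

# Start the Qt event loop and exit with its return code
sys.exit(app.exec_())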

Stack Abuse: Sets in Python

Introduction

In Python, a set is a data structure that stores unordered items. The set items are also unindexed. Like a list, a set allows the addition and removal of elements. However, there are a few unique characteristics that define a set and separate it from other data structures:

A set does not hold duplicate items.
The elements of the set are immutable, that is, they cannot be changed, but the set itself is mutable, that is, it can be changed.
Since set items are not indexed, sets don't support any slicing or indexing operations.

In this article, we will be discussing the various operations that can be performed on sets in Python.

How to Create a Set

There are two ways through which we can create sets in Python.

We can create a set by passing all the set elements inside curly braces {} and separating them with commas (,). A set can hold any number of items, and the items can be of different types, for example integers, strings, tuples, etc. However, a set does not accept an element that is mutable, for example a list, a dictionary, etc.

Here is an example of how to create a set in Python:

num_set = {1, 2, 3, 4, 5, 6}
print(num_set)

Output

{1, 2, 3, 4, 5, 6}

We just created a set of numbers. We can also create a set of string values. For example:

string_set = {"Nicholas", "Michelle", "John", "Mercy"}
print(string_set)

Output

{'Michelle', 'Nicholas', 'John', 'Mercy'}

You must have noticed that the elements in the above output are not ordered in the same way we added them to the set. The reason for this is that set items are not ordered. If you run the same code again, it is possible that you will get an output with the elements arranged in a different order.

We can also create a set with elements of different types. For example:

mixed_set = {2.0, "Nicholas", (1, 2, 3)}
print(mixed_set)

Output

{2.0, 'Nicholas', (1, 2, 3)}

All the elements of the above set belong to different types.

We can also create a set from a list. This can be done by calling Python's built-in set() function. For example:

num_set = set([1, 2, 3, 4, 5, 6])
print(num_set)

Output

{1, 2, 3, 4, 5, 6}

As stated above, sets do not hold duplicate items. Suppose our list had duplicate items, as shown below:

num_set = set([1, 2, 3, 1, 2])
print(num_set)

Output

{1, 2, 3}

The set has removed the duplicates and returned only one of each duplicate item. This also happens when we are creating a set from scratch. For example:

num_set = {1, 2, 3, 1, 2}
print(num_set)

Output

{1, 2, 3}

Again, the set has removed the duplicates and returned only one of the duplicate items.

The creation of an empty set is somewhat tricky. If you use empty curly braces {} in Python, you create an empty dictionary rather than an empty set. For example:

x = {}
print(type(x))

Output

<class 'dict'>

As shown in the output, the type of variable x is a dictionary.

To create an empty set in Python we must use the set() function without passing any value for the parameters, as shown below:

x = set()
print(type(x))

Output

<class 'set'>

The output shows that we have created a set.

Accessing Set Items

Python does not provide us with a way of accessing an individual set item. However, we can use a for loop to iterate through all the items of a set. For example:

months = set(["Jan", "Feb", "March", "Apr", "May", "June", "July", "Aug", "Sep", "Oct", "Nov", "Dec"])

for m in months:
    print(m)

Output

March
Feb
Dec
Jan
May
Nov
Oct
Apr
June
Aug
Sep
July

We can also check for the presence of an element in a set using the in keyword as shown below:

months = set(["Jan", "Feb", "March", "Apr", "May", "June", "July", "Aug", "Sep", "Oct", "Nov", "Dec"])
print("May" in months)

Output

True

The code returned "True", which means that the item was found in the set. Similarly, searching for an element that doesn't exist in the set returns "False", as shown below:

months = set(["Jan", "Feb", "March", "Apr", "May", "June", "July", "Aug", "Sep", "Oct", "Nov", "Dec"])
print("Nicholas" in months)

Output

False

As expected, the code returned "False".

Adding Items to a Set

Python allows us to add new items to a set via the add() function. For example:

months = set(["Jan", "March", "Apr", "May", "June", "July", "Aug", "Sep", "Oct", "Nov", "Dec"])
months.add("Feb")
print(months)

Output

{'Oct', 'Dec', 'Feb', 'July', 'May', 'Jan', 'June', 'March', 'Sep', 'Aug', 'Nov', 'Apr'}

The item "Feb" has been successfully added to the set. If it was a set of numbers, we would not have passed the new element within quotes as we had to do for a string. For example:

num_set = {1, 2, 3}
num_set.add(4)
print(num_set)

Output

{1, 2, 3, 4}

In the next section, we will be discussing how to remove elements from sets.

Removing Items from a Set

Python allows us to remove an item from a set, but not using an index as set elements are not indexed. The items can be removed using either the discard() or remove() methods.

Keep in mind that the discard() method will not raise an error if the item is not found in the set. However, if the remove() method is used and the item is not found, an error will be raised.

Let us demonstrate how to remove an element using the discard() method:

num_set = {1, 2, 3, 4, 5, 6}
num_set.discard(3)
print(num_set)

Output

{1, 2, 4, 5, 6}

The element 3 has been removed from the set.

Similarly, the remove() method can be used as follows:

num_set = {1, 2, 3, 4, 5, 6}
num_set.remove(3)
print(num_set)

Output

{1, 2, 4, 5, 6}

Now, let us try to remove an element that does not exist in the set. Let's first use the discard() method:

num_set = {1, 2, 3, 4, 5, 6}
num_set.discard(7)
print(num_set)

Output

{1, 2, 3, 4, 5, 6}

The above output shows that the set was not affected in any way. Now let's see what happens when we use the remove() method in the same scenario:

num_set = {1, 2, 3, 4, 5, 6}
num_set.remove(7)
print(num_set)

Output

Traceback (most recent call last):
  File "C:\Users\admin\sets.py", line 2, in <module>
    num_set.remove(7)
KeyError: 7

The output shows that the method raised an error as we attempted to remove an element that is not in the set.

With the pop() method, we can remove and return an element. Since the elements are unordered, we cannot tell or predict the item that will be removed. For example:

num_set = {1, 2, 3, 4, 5, 6}
print(num_set.pop())

Output

1

You can use the same method to remove an element and then print the elements that remain in the set. For example:

num_set = {1, 2, 3, 4, 5, 6}
num_set.pop()
print(num_set)

Output

{2, 3, 4, 5, 6}

Those are the elements remaining in the set.

The Python's clear() method helps us remove all elements from a set. For example:

num_set = {1, 2, 3, 4, 5, 6}
num_set.clear()
print(num_set)

Output

set()

The output is an empty set() with no elements in it.

Set Union

Suppose we have two sets, A and B. The union of the two sets is a set with all the elements from both sets. Such an operation is accomplished via Python's union() function.

Here is an example:

months_a = set(["Jan", "Feb", "March", "Apr", "May", "June"])
months_b = set(["July", "Aug", "Sep", "Oct", "Nov", "Dec"])

all_months = months_a.union(months_b)
print(all_months)

Output

{'Oct', 'Jan', 'Nov', 'May', 'Aug', 'Feb', 'Sep', 'March', 'Apr', 'Dec', 'June', 'July'}

A union can also be performed on more than two sets, and all their elements will be combined into a single set. For example:

x = {1, 2, 3}
y = {4, 5, 6}
z = {7, 8, 9}

output = x.union(y, z)
print(output)

Output

{1, 2, 3, 4, 5, 6, 7, 8, 9}

During the union operation, duplicates are ignored, and only one of the duplicate items is shown. For example:

x = {1, 2, 3}
y = {4, 3, 6}
z = {7, 4, 9}

output = x.union(y, z)
print(output)

Output

{1, 2, 3, 4, 6, 7, 9}

The | operator can also be used to find the union of two or more sets. For example:

months_a = set(["Jan", "Feb", "March", "Apr", "May", "June"])
months_b = set(["July", "Aug", "Sep", "Oct", "Nov", "Dec"])

print(months_a | months_b)

Output

{'Feb', 'Apr', 'Sep', 'Dec', 'Nov', 'June', 'May', 'Oct', 'Jan', 'July', 'March', 'Aug'}

If you want to perform a union on more than two sets, separate the set names using the | operator. For example:

x = {1, 2, 3}
y = {4, 3, 6}
z = {7, 4, 9}

print(x | y | z)

Output

{1, 2, 3, 4, 6, 7, 9}

Set Intersection

Suppose you have two sets A and B. Their intersection is a set with elements that are common in both A and B.

The intersection operation on sets can be achieved via either the & operator or the intersection() method. For example:

x = {1, 2, 3}
y = {4, 3, 6}

print(x & y)

Output

{3}

The two sets have 3 as the common element. The same can also be achieved with the intersection() method:

x = {1, 2, 3}
y = {4, 3, 6}

z = x.intersection(y)
print(z)

Output

{3}

In the next section, we will be discussing how to determine the difference between sets.

Set Difference

Suppose you have two sets A and B. The difference of A and B (A - B) is the set with all elements that are in A but not in B. Consequently, (B - A) is the set with all the elements in B but not in A.

To determine set differences in Python, we can use either the difference() function or the - operator. For example:

set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}

diff_set = set_a.difference(set_b)
print(diff_set)

Output

{1, 2, 3}

In the script above, only the first three elements of set set_a are not available in set set_b , hence they form our output. The minus ( - ) operator can also be used to find the difference between the two sets, as shown below:

set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}

print(set_a - set_b)

Output

{1, 2, 3}

The symmetric difference of sets A and B is the set with all elements that are in A and B except the elements that are common to both sets. It is determined using Python's symmetric_difference() method or the ^ operator. For example:

set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}

symm_diff = set_a.symmetric_difference(set_b)
print(symm_diff)

Output

{1, 2, 3, 6, 7, 8}

The symmetric difference can also be found as follows:

set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}

print(set_a ^ set_b)

Output

{1, 2, 3, 6, 7, 8}

Set Comparison

We can compare sets depending on the elements they have. This way, we can tell whether a set is a superset or a subset of another set. The result from such a comparison will be either True or False .

To check whether set A is a subset of set B, we can use the following operation:

A <= B

To check whether B is a superset of A, we can use the following operation:

B >= A

For example:

months_a = set(["Jan", "Feb", "March", "Apr", "May", "June"])
months_b = set(["Jan", "Feb", "March", "Apr", "May", "June", "July", "Aug", "Sep", "Oct", "Nov", "Dec"])

subset_check = months_a <= months_b
superset_check = months_b >= months_a

print(subset_check)
print(superset_check)

Output

True
True

The subset and superset can also be checked using issubset() and issuperset() methods as shown below:

months_a = set(["Jan", "Feb", "March", "Apr", "May", "June"])
months_b = set(["Jan", "Feb", "March", "Apr", "May", "June", "July", "Aug", "Sep", "Oct", "Nov", "Dec"])

subset_check = months_a.issubset(months_b)
superset_check = months_b.issuperset(months_a)

print(subset_check)
print(superset_check)

Output

True
True

In the next section, we will discuss some of the most commonly used set methods provided by Python that we have not already discussed.

Set Methods

Python comes with numerous built-in set methods, including the following:

copy()

This method returns a copy of the set in question. For example:

string_set = {"Nicholas", "Michelle", "John", "Mercy"}
x = string_set.copy()
print(x)

Output

{'John', 'Michelle', 'Nicholas', 'Mercy'}

The output shows that x is a copy of the set string_set .

isdisjoint()

This method checks whether the sets in question have an intersection or not. If the sets don't have common items, this method returns True , otherwise it returns False . For example:

names_a = {"Nicholas", "Michelle", "John", "Mercy"}
names_b = {"Jeff", "Bosco", "Teddy", "Milly"}

x = names_a.isdisjoint(names_b)
print(x)

Output

True

The two sets don't have common items, hence the output is True .

len()

This method returns the length of a set, which is the total number of elements in the set. For example:

names_a = {"Nicholas", "Michelle", "John", "Mercy"}
print(len(names_a))

Output

4

The output shows that the set has a length of 4.

Python Frozen Set

Frozenset is a class with the characteristics of a set, but once its elements have been assigned, they cannot be changed. Tuples can be seen as immutable lists, while frozensets can be seen as immutable sets.

Sets are mutable and unhashable, which means we cannot use them as dictionary keys. Frozensets are hashable and we can use them as dictionary keys.

To create frozensets, we use the frozenset() method. Let us create two frozensets, X and Y :

X = frozenset([1, 2, 3, 4, 5, 6])
Y = frozenset([4, 5, 6, 7, 8, 9])

print(X)
print(Y)

Output

frozenset({1, 2, 3, 4, 5, 6})
frozenset({4, 5, 6, 7, 8, 9})

Frozensets support Python set methods like copy() , difference() , symmetric_difference() , isdisjoint() , issubset() , intersection() , issuperset() , and union() .
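For example, reusing the X and Y defined above, set operations return new frozensets, and since frozensets are hashable, the results can serve as dictionary keys:

# Set operations on frozensets return new frozensets
print(X.union(Y))         # frozenset({1, 2, 3, 4, 5, 6, 7, 8, 9})
print(X.intersection(Y))  # frozenset({4, 5, 6})

# Unlike ordinary sets, frozensets can be used as dictionary keys
groups = {X: "first six", Y: "last six"}
print(groups[X])          # first six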

Conclusion

The article provides a detailed introduction to sets in Python. The mathematical definition of sets is the same as the definition of sets in Python. A set is simply a collection of items that are unordered. The set itself is mutable, but the set elements are immutable. However, we can add and remove elements from a set freely. In most data structures, elements are indexed. However, set elements are not indexed. This makes it impossible for us to perform operations that target specific set elements.

Crawling CNKI (China National Knowledge Infrastructure): pitfalls and technical notes

Related: [python2.7] crawling CNKI papers · implementing a CNKI crawler in Python · "Python 3 Web Crawler Development in Action" by Cui Qingcai

I recently had to build a data-analysis project that required crawling, by keyword, the key information of journal papers from the last ten years. Here is a record of the problems I ran into while crawling.

Analysis

CNKI has put up some defenses against crawlers. We want to crawl the title, abstract and other information from the detail pages of academic papers. The main steps are roughly the same as for other sites: first, search by keyword to reach the result list pages; second, request the detail pages from the list pages and extract the information we want from them.

Entry page: [ kns.cnki.net/kns/brief/d… ]. After searching, the list page rendered dynamically by JS is requested from: [ kns.cnki.net/kns/brief/b… ..]. Here we open the Developer Tools and inspect the request headers and parameters.
The key information here: ① request parameters : we can see that the search keyword is carried in the KeyValue field (a GET request); ② cookie and referer : without a referer in the request headers we cannot open this list page, and without a cookie in the headers the page content we get back is incomplete! Note: the iframe list page can only be obtained after requesting and rendering the entry page; this request must not be skipped! From the links on the list page we parse out the detail pages [ kns.cnki.net/KCMS/detail… ...].
We keep the Developer Tools open and look at the links in the page's HTML that jump to the detail pages.
Here we notice that the address in the link differs from the address we finally end up at; that is because the page redirects!

On the detail page we just need to parse the HTML; with XPath we can easily get the title, authors, keywords, abstract and other information.

Scrapy in practice

How to set cookies:

Set COOKIES_ENABLED = True in settings ; for the HTTP requests, see Scrapy - how to manage cookies/sessions . A further note: the cookiejar module mainly provides storable cookie objects; it can capture cookies and resend them on subsequent requests, which enables simulated logins. In Scrapy this can be set via the meta parameter when issuing a request, keeping a separate cookie record per session:
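A minimal sketch of that pattern, assuming the methods live in a Spider subclass with start_urls defined; each distinct 'cookiejar' value keeps an independent cookie session, and '/next' and parse_detail are placeholders:

def start_requests(self):
    for i, url in enumerate(self.start_urls):
        # Each start URL gets its own cookie session
        yield Request(url, meta={'cookiejar': i}, callback=self.parse_page)

def parse_page(self, response):
    # Pass the same cookiejar on so follow-up requests reuse that session's cookies
    yield Request(response.urljoin('/next'),
                  meta={'cookiejar': response.meta['cookiejar']},
                  callback=self.parse_detail)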

How to request the entry page (CJFQ stands for journals and can be changed as needed):

data = {
    "txt_1_sel": "SU$%=|",
    "txt_1_value1": self.key_word,
    "txt_1_special1": "%",
    "PageName": "ASP.brief_default_result_aspx",
    "ConfigFile": "SCDBINDEX.xml",
    "dbPrefix": "CJFQ",
    "db_opt": "CJFQ",
    "singleDB": "CJFQ",
    "db_codes": "CJFQ",
    "his": 0,
    "formDefaultResult": "",
    "ua": "1.11",
    "__": time.strftime('%a %b %d %Y %H:%M:%S') + ' GMT+0800 (中国标准时间)'
}
query_string = parse.urlencode(data)
yield Request(url=self.home_url + query_string,
              headers={"Referer": self.cur_referer},
              cookies={CookieJar: 1},
              callback=self.parse)

How to request the list page:

def parse(self, response):
    data = {
        'pagename': 'ASP.brief_default_result_aspx',
        'dbPrefix': 'CJFQ',
        'dbCatalog': '中国学术期刊网络出版总库',
        'ConfigFile': 'SCDBINDEX.xml',
        'research': 'off',
        't': int(time.time()),
        'keyValue': self.key_word,
        'S': '1',
        "recordsperpage": 50,
        # 'sorttype': ""
    }
    query_string = parse.urlencode(data)
    url = self.list_url + '?' + query_string
    yield Request(url=url,
                  headers={"Referer": self.cur_referer},
                  callback=self.parse_list_first)

How to parse the list page. Get the total number of list pages:

response.xpath('//span[@class="countPageMark"]/text()').extract_first()
max_page = int(page_link.split("/")[1])

Request each list page:

data = {
    "curpage": page_num,  # changes on each loop iteration
    "RecordsPerPage": 50,
    "QueryID": 0,
    "ID": "",
    "turnpage": 1,
    "tpagemode": "L",
    "dbPrefix": "CJFQ",
    "Fields": "",
    "DisplayMode": "listmode",
    "PageName": "ASP.brief_default_result_aspx",
    "isinEn": 1
}

Parse the list page (if the result is empty here, check that you set the cookies correctly):

tr_node = response.xpath("//tr[@bgcolor='#f6f7fb']|//tr[@bgcolor='#ffffff']")
for item in tr_node:
    paper_link = item.xpath("td/a[@class='fz14']/@href").extract_first()

How to parse the detail page (just one example; there are many ways to parse it):

title = response.xpath('//*[@id="mainArea"]/div[@class="wxmain"]/div[@class="wxTitle"]/h2/text()').extract()
author = response.xpath('//*[@id="mainArea"]/div[@class="wxmain"]/div[@class="wxTitle"]/div[@class="author"]/span/a/text()').extract()
abstract = response.xpath('//*[@id="ChDivSummary"]/text()').extract()
keywords = response.xpath('//*[@id="catalog_KEYWORD"]/following-sibling::*/text()').extract()

Feel free to fork my GitHub project!

Update 2018-12-15

The project above works fine for small crawls, but when I tried to crawl 200k papers I only ever got a bit more than 1,000 records, and I noticed the requests were being redirected to vericode.aspx , i.e. a captcha page. Since I don't really know how to handle captchas, I gave up on that route and switched to the mobile CNKI interface, wap.cnki.net/touch/web , which turned out to be so easy!

Requesting the list page

The first step is still to inspect the requests; open the DevTools.



We can simply construct the first request (a GET):

def start_requests(self):
    data = {
        "kw": self.key_word,
        "field": 5
    }
    url = self.list_url + '?' + parse.urlencode(data)
    yield Request(url=url,
                  headers=self.header,
                  meta={'cookiejar': 1},
                  callback=self.parse)

Get the first page of the list, apply the filter conditions, and capture the FormData of the request.

When we perform filtering operations on the page, we can see results like this:


After copying the FormData, we can modify a few variables in it to filter:

pageindex: which list page (1 ~ )
fieldtype: subject / title / full text / author / keywords / institution / abstract / source
sorttype: relevance / downloads / citations / newest / oldest
articletype: document type
starttime_sc: start year
endtime_sc: end year

def parse(self, response):
    self.header['Referer'] = response.request.url
    yield FormRequest(url=self.list_url,
                      headers=self.header,
                      method='POST',
                      meta={'cookiejar': 1},
                      formdata=self.myFormData,
                      callback=self.parse_list,
                      dont_filter=True)

Parse out the total number of list pages and construct the requests

# total number of pages
paper_size = int(response.xpath('//*[@id="totalcount"]/text()').extract_first())

# construct the requests
for page in range(1, paper_size):
    self.myFormData["pageindex"] = str(page)
    yield FormRequest(url=self.list_url,
                      headers=self.header,
                      method='POST',
                      meta={'cookiejar': page + 1, 'page': page},  # fresh session per page
                      formdata=self.myFormData,
                      callback=self.parse_list_link,
                      dont_filter=True)

Note: watching the request flow, in the browser the next page is loaded by clicking; looking at the LoadNextPage function we can see that the updated data is also requested by submitting the form, so we can construct the POST request data accordingly.



Requesting the detail pages

items = response.xpath('//a[@class="c-company-top-link"]/@href').extract()

# optionally record the number of list pages already crawled in a file
with open('../record_page.txt', 'a') as f:
    f.write(str(response.meta['page']) + '\n')

for item in items:
    yield Request(url=item,
                  meta={'cookiejar': response.meta['cookiejar']},  # matching session marker
                  headers=self.header,
                  callback=self.parse_item)

Parsing the detail page (an example)

baseinfo = response.xpath('/html/body/div[@class="c-card__paper2"]')
keywords = baseinfo.xpath('//div[contains(text(),"关键词")]/following-sibling::*/a/text()').extract()

Additional notes

To speed up crawling and reduce the chance of the IP being flagged, I recommend the Abuyun proxy service. After registering an account and requesting an HTTP dynamic tunnel, change settings :

DOWNLOAD_DELAY = 0.2

DOWNLOADER_MIDDLEWARES = {
    'myspider.middlewares.RandomUserAgentMiddleware': 401,
    'myspider.middlewares.ABProxyMiddleware': 1,
}

AB_PROXY_SERVER = {
    'proxyServer': "http://http-dyn.abuyun.com:9020",
    'proxyUser': "xxxxxxxxxxxxxxx",  # your username
    'proxyPass': "xxxxxxxxxxxxxxx"   # your password
}

Add the middleware:

proxyAuth = "Basic " + base64.urlsafe_b64encode(
    bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ABProxyMiddleware(object):
    """ Abuyun IP proxy configuration """

    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth

I'm a crawling rookie; if you find problems, please help point them out!

Python Numpy : Select rows / columns by index from a 2D ndarray | Multi Dimensio ...


In this article we will discuss how to select elements from a 2D ndarray. The elements to select can be a single element, single or multiple rows and columns, or another sub 2D array.

First of all, let’s import numpy module i.e.

import numpy as np

Now let’s create a 2d ndArray by passing a list of lists to numpy.array() i.e.

# Create a 2D Numpy ndArray with 3 rows & 3 columns | Matrix
nArr2D = np.array(([21, 22, 23], [11, 22, 33], [43, 77, 89]))

Contents of the 2D ndArray will be,

[[21 22 23]
 [11 22 33]
 [43 77 89]]

Now let’s see how to select elements from this 2D ndarray by index i.e.

Select a single element from 2D ndarray by index

We can use the [][] operator to select an element from a ndarray i.e.

ndArray[row_index][column_index]

Example 1:

Select the element at row index 1 and column index 2.

# Select element at row index 1 & column index 2
num = nArr2D[1][2]
print('element at row index 1 & column index 2 is : ', num)

Output:

element at row index 1 & column index 2 is : 33

Example 2:

Or we can pass a comma-separated pair of indices representing the row index & column index i.e.

# Another way to select element at row index 1 & column index 2
num = nArr2D[1, 2]
print('element at row index 1 & column index 2 is : ', num)

Output:

element at row index 1 & column index 2 is : 33

Select Rows by Index from a 2D ndarray

We can use the [] operator to select single or multiple rows. To select a single row use,

ndArray[row_index]

It will return a complete row at given index.

To select multiple rows use,

ndArray[start_index: end_index , :]

It will return rows from start_index to end_index - 1 and will include all columns.

Let’s use this,

Contents of the 2D ndArray nArr2D created above are,

[[21 22 23]
 [11 22 33]
 [43 77 89]]

Let's select the row at index 1 i.e.

# Select a Row at index 1
row = nArr2D[1]
print('Contents of Row at Index 1 : ', row)

Output:

Contents of Row at Index 1 : [11 22 33]

Select multiple rows from index 1 to 2 i.e.

# Select multiple rows from index 1 to 2
rows = nArr2D[1:3, :]
print('Rows from Index 1 to 2 :')
print(rows)

Output:

Rows from Index 1 to 2 :
[[11 22 33]
 [43 77 89]]

Select multiple rows from index 1 to the last index

# Select multiple rows from index 1 to last index
rows = nArr2D[1:, :]
print('Rows from Index 1 to last row :')
print(rows)

Output:

Rows from Index 1 to last row :
[[11 22 33]
 [43 77 89]]

Select Columns by Index from a 2D ndArray

To select a single column use,

ndArray[ : , column_index]

It will return a complete column at given index.

To select multiple columns use,

ndArray[ : , start_index: end_index]

It will return columns from start_index to end_index - 1.

Let’s use these,

Contents of the 2D ndArray nArr2D created above are,

[[21 22 23]
 [11 22 33]
 [43 77 89]]

Select a column at index 1

# Select a column at index 1
column = nArr2D[:, 1]
print('Contents of Column at Index 1 : ', column)

Output:

Contents of Column at Index 1 : [22 22 77]

Select multiple columns from index 1 to 2

# Select multiple columns from index 1 to 2
columns = nArr2D[:, 1:3]
print('Column from Index 1 to 2 :')
print(columns)

Output:

Column from Index 1 to 2 :
[[22 23]
 [22 33]
 [77 89]]

Select multiple columns from index 1 to the last index

# Select multiple columns from index 1 to last index
columns = nArr2D[:, 1:]

The output is the same as above, because there are only 3 columns (0, 1, 2), so columns from index 1 to the last means the columns at index 1 & 2.

Select a Sub Matrix or 2D ndarray from another 2D ndarray

To select a sub 2D ndArray we can pass the row & column index ranges to the [] operator i.e.

ndArray[start_row_index : end_row_index , start_column_index : end_column_index]

It will return a sub 2D ndArray for given row and column range.

Let’s use these,

Contents of the 2D ndArray nArr2D created at start of article are,

[[21 22 23] [11 22 33] [43 77 89]]

Select a sub 2D ndarray from row indices 1 to 2 & column indices 1 to 2

# Select a sub 2D array from row indices 1 to 2 & column indices 1 to 2 sub2DArr = nArr2D[1:3, 1:3] print('Sub 2d Array :') print(sub2DArr)

Output:

Sub 2d Array : [[22 33] [77 89]] Selected Row or Column or Sub Array is View only Contents of the ndarray selected using [] operator returns a View only i.e. any modification in returned sub array will be reflected in original ndarray.

Let’s check this,

Contents of the 2D ndArray nArr2D created at the start are,

[[21 22 23]
 [11 22 33]
 [43 77 89]]

Select a row at index 1 from the 2D array i.e.

# Select row at index 1 from 2D array
row = nArr2D[1]

Contents of row :

[11 22 33]

Now modify the contents of row i.e.

# Change all the elements in selected sub array to 100
row[:] = 100

New contents of the row will be

[100 100 100]

Modification of the sub array is reflected in the main ndArray too. Updated contents of the 2D ndArray nArr2D are,

[[ 21  22  23]
 [100 100 100]
 [ 43  77  89]]

Get a copy of 2D Sub Array from 2D ndArray using ndarray.copy()

To get a copy instead of a view of the sub array, use the copy() function.

Let’s check this,

Create a 2D Numpy ndArray with 3 rows & 3 columns | Matrix

# Create a 2D Numpy ndArray with 3 rows & 3 columns | Matrix
nArr2D = np.array(([21, 22, 23], [11, 22, 33], [43, 77, 89]))

Content of nArr2D is,

[[21 22 23]
 [11 22 33]
 [43 77 89]]

Select a copy of row at index 1 from 2D array and set all the elements in selected sub array to 100

# Select a copy of row at index 1 from 2D array
row = nArr2D[1].copy()

# Set all the elements in selected sub array to 100
row[:] = 100

Here, the sub array is a copy of the original array, so modifying it will not affect the original ndArray.

Contents of the modified sub array row is,

[100 100 100]

Contents of the original ndArray is,

[[21 22 23]
 [11 22 33]
 [43 77 89]]

Complete example is as follows,

import numpy as np

def main():
    # Create a 2D Numpy ndArray with 3 rows & 3 columns | Matrix
    nArr2D = np.array(([21, 22, 23], [11, 22, 33], [43, 77, 89]))
    print('Contents of 2D Array : ')
    print(nArr2D)

    print('*** Select an element by index from a 2D ndArray')
    # Select element at row index 1 & column index 2
    num = nArr2D[1][2]
    print('element at row index 1 & column index 2 is : ', num)

    # Another way to select element at row index 1 & column index 2
    num = nArr2D[1, 2]
    print('element at row index 1 & column index 2 is : ', num)

    print('*** Select Rows by Index from a 2D ndArray ***')
    # Select a Row at index 1
    row = nArr2D[1]
    print('Content

Create your own Telegram bot with Django on Heroku Part 10 Creating a view ...



In the previous part of this series, we created another database model named Message to hold the message data from our Telegram bot. I also explained the process of defining a SQL schema using a Django model, what to consider during that phase, and how to put Django's model field reference docs to good use during that process. Last but not least, we learned what a Heroku " One-Off Dyno " is and how it can be used to execute administrative tasks on our production site, like applying outstanding migrations to a database.

This time, I will provide you with the last piece of the puzzle to make your bot available to the world. You will learn how to write and wire the Python code to actually use all that we have prepared so far. At the end of this part, your bot will be able to receive and store each message sent to it by registered users. And since it's already more than a month since I published the previous article in this series, let's not waste any more time and jump right in!

A few more details on what we will do today

How do we link the outside world (the Telegram servers) with your bot's code and data(base) now? Actually, with Django, that's quite easy. But if you have never done something like this, you might feel a little lost here.

By now, you have already achieved to:

… register a Telegram bot using " BotFather ", the official interface provided by Telegram to register bots for their service.
… create a Django project to start your work in.
… prepare your project to be easily repeatable by creating a virtualenv and a Pipenv file, thus preparing it to be deployed to a containerized environment.
… register an account at your preferred code-hosting provider ( Heroku , if you followed the recommendations of this guide).
… register a project with your hosting provider to hold your Django project's code, served by a Git remote and hook.
… register a new app for your bot in the Django project.
… create a database and configure your remotely deployed Django application to utilize it.
… design your database models to hold the data needed from Telegram messages for your bot's purpose.
… apply the migrations, which are created from your model definitions, to your production database.

See? This is already a hell of a lot of complicated things! Managing to get this far is already a huge success! And since I did not receive a single question about anything from my previous articles of this series, you all did manage to achieve all of this successfully, didn't you? :yum:

Great!

So, the last missing piece we have left in front of us is somehow bringing all these efforts together by creating a view .

Linking the outside world and your bot

OK, given the fact that we have not progressed with this project for more than a month in the meantime (sorry for that, seriously :cry:), let's first talk about what it is that we are trying to achieve next. Personally, when I get hit by the " Developer's block " (which happens more commonly the more time I let pass before I sit down and continue with a project), it helps me to form an easy yet explicit sentence describing the task at hand.

For that, I’m trying not tothink about what it was that I’ve done last time or the details of my initial plan or something; it helps to establish a habit of not interrupting your work in the middle of something, really since this allows you to do so and not having to re-think where you left off! Always try to establish something SCRUM evangelists would call a “ potentially shippable artifact “, meaning: Something that could be added to a product already running in production without breaking anything but providing a new set of functionality or a good foundation to implement these in subsequent releases, at least.

In this project, that sentence could be something like:

Create an interface (called a " Webhook "), to which the Telegram bot can connect and submit the messages it received.

In the Django world, this pretty perfectly describes a view. Cited from the very first sentences of the Django views docs :

A view function, or view for short, is simply a Python function that takes a Web request and returns a Web response. This response can be the HTML contents of a Web page, or a redirect, or a 404 error, or an XML document, or an image . . . or anything, really. The view itself contains whatever arbitrary logic is necessary to return that response. This code can live anywhere you want, as long as it’s on your Python path. There’s no other requirement no “magic”, so to speak. For the sake of putting the code somewhere, the convention is to put views in a file called views.py , placed in your project or application directory.

So, let’s fire up our preferred editor and open the file bot/views.py (since we will make all our modifications inside our “bot”-application directory).

So far, only our static " Hello World " content view called index is in there. Another interesting file in this context is bot/urls.py ; open that in your editor as well.

So … let’s do the most obvious things first before we are diving into the logic. Think of a name for your view and create the basic skeleton for it in your bot/views.py file. The name is absolutely irrelevant; to prove that, I’ll continue with this called “ talkin_to_me_bruh “. You can come up with pretty much any name here as long as it’s unique. This won’t define the URL-path which the outside world is seeing or anything; it’s just the internal name of a function. So, my skeleton for“ talkin_to_me_bruh ” will look like this:

from django.http import HttpResponse


def talkin_to_me_bruh(request):
    # please insert magic here
    return HttpResponse('OK')

If we define a URLconf for this (which we will do in the next step) and navigate to it without inserting some magic at the " please insert magic here " marker, Django would nevertheless render an appropriate HTTP answer, containing the string OK in its body.
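Wiring the view up could look like this; a sketch assuming Django 2's path() (on 1.x you would use url() instead), with 'webhook/' as an arbitrary path you will later point Telegram at:

# bot/urls.py -- a minimal sketch
from django.urls import path

from . import views

urlpatterns = [
    path('', views.index, name='index'),
    path('webhook/', views.talkin_to_me_bruh, name='webhook'),
]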

… why trust me I’m not a trustworthy person at all :smiling_imp:, so let’s just do it (not sponsored by

Using Twilio to Build a Serverless SMS Raffle in Python



If you’re like me, you drool just a little bit over serverless architectures. When Rackspace and AWS EC2 made cloud-based computing a mainstream reality, that was awesome enough. But you still had to spin up and maintain your own virtual servers.

With the introduction of things like Twilio Functions or Lambda for truly serverless function execution, DynamoDB for cached state, and API Gateway for click-and-deploy routing, just to name a few, it's become deliciously easy to build and deploy powerful (and fun) services in minutes. Seriously, the IoT possibilities are endless!

With that in mind, let's build something fun with Python: a Serverless SMS Raffle . What if users could text a word to a phone number and be entered into a raffle? Then, when we were ready to choose a winner, we could execute a Lambda to pick some number of winners at random and close the raffle.



To do this, we’re going to need accounts on two platforms I’ve already mentioned:Twilioand AWS . Don’t worry, they’re both free, and when all is said and done, running this raffle will cost us just pennies.

So, let’s get started. First things first, we need to setup an endpoint in AWS for Twilio to use when a text is received. We’ll setup this endpoint using API Gateway, which will in turn execute a Lambda function that process entries into the raffle. Easy peasy.

Configure an AWS Role

AWS Roles give us a set of permissions to work with. Before we can do anything, we need to create a Role that allows our raffle Lambdas to create logs in CloudWatch and manage tables in DynamoDB.

In the AWS Console, under "My Security Credentials", create a new Role . Choose the "Lambda" service, then create a new Policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "dynamodb:CreateTable",
                "dynamodb:GetItem",
                "dynamodb:PutItem",
                "dynamodb:UpdateItem",
                "dynamodb:DescribeTable",
                "dynamodb:GetShardIterator",
                "dynamodb:GetRecords",
                "dynamodb:ListStreams",
                "dynamodb:Query",
                "dynamodb:Scan"
            ],
            "Resource": "*"
        }
    ]
}

Name this new Policy "AWSLambdaDynamo", then attach it to your new Role and name it the same.

Great, now let's make some Lambdas using this Role!

Create the AWS Lambda Functions

Actually, before we create our Lambdas, let's give them a place to store the raffle data. Create a DynamoDB table with a partition key of PartitionKey .



Alright, now let's make two Lambdas. The first one we'll attach to an API Gateway for receiving inbound text messages; this one will let people enter the raffle and manage all the housekeeping there. The second one will be a Lambda we execute manually; it will close the raffle and choose the winners.

Inbound Message Lambda

Go ahead and create a new Lambda function from scratch , naming it "Raffle_POST" and choosing "Python 3.6" for the runtime. When inbound text messages are sent to our API Gateway (which we'll set up next), they will be processed by this Lambda, and it'll store the sender's phone number in our DynamoDB table.



Before we plop a bunch of code in there, let's define some environment variables for the function.

SUPER_SECRET_PASSPHRASE (some phrase people must text in order to be entered in to the raffle)
BANNED_NUMBERS (JSON list of phone numbers formatted ["+15555555555","+15555555556"] )
DYNAMODB_REGION (like us-east-1 )
DYNAMODB_ENDPOINT (like https://dynamodb.us-east-1.amazonaws.com )
DYNAMODB_TABLE

Now that we have our environment variables defined, let's write some code in the Lambda.

First though: some housekeeping. Let's declare the imports we'll need and bring in those environment variables for easy access.

import os
import json
import logging
import boto3
from urllib.parse import parse_qs

SUPER_SECRET_PASSPHRASE = os.environ.get("SUPER_SECRET_PASSPHRASE", "Ahoy")
BANNED_NUMBERS = json.loads(os.environ.get("BANNED_NUMBERS", "[]"))
DYNAMODB_REGION = os.environ.get("DYNAMODB_REGION")
DYNAMODB_ENDPOINT = os.environ.get("DYNAMODB_ENDPOINT")
DYNAMODB_TABLE = os.environ.get("DYNAMODB_TABLE")

logger = logging.getLogger()
logger.setLevel(logging.INFO)

dynamodb = boto3.resource("dynamodb",
                          region_name=DYNAMODB_REGION,
                          endpoint_url=DYNAMODB_ENDPOINT)
table = dynamodb.Table(DYNAMODB_TABLE)


def lambda_handler(event, context):
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

Awesome. Now let's write some helper functions to determine if a raffle is closed (we know it's closed if any rows in the table have been marked as a winner), and to determine if the person texting us is Karen (who is, of course, banned from entering the raffle).

def _is_raffle_closed(table):
    winner_response = table.scan(
        FilterExpression=boto3.dynamodb.conditions.Attr("Winner").exists()
    )
    return winner_response["Count"] > 0


def _is_karen(phone_number):
    return phone_number in BANNED_NUMBERS

Let's also write a helper function that conveniently builds a proper TwiML response for us.

Why does this matter? Because whatever response our Lambda gives will be passed back to Twilio. If it's valid TwiML XML, we can task Twilio with an action in our response, in this case, sending a text message back to the sender.

def _get_response(msg):
    xml_response = "<?xml version='1.0' encoding='UTF-8'?><Response><Message>{}</Message></Response>".format(msg)
    logger.info("XML response: {}".format(xml_response))
    return {"body": xml_response}

Great. Now we're ready to write the lambda_handler.

We need to check the incoming text message for the super secret word before entering the sender into the raffle. Then we'll respond with TwiML to let the sender know they were successfully entered (or shame them accordingly if they try entering multiple times, or if they're Karen).
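The excerpt ends before the original post's handler code appears, so here is a minimal sketch of what a lambda_handler wired to the helpers above might look like. It assumes an API Gateway proxy integration (so Twilio's form-encoded payload arrives in event["body"]) and stores each entrant's number as the table's PartitionKey; the reply messages are invented for illustration, and one extra import is needed beyond those shown earlier.

from botocore.exceptions import ClientError  # extra import, at the top of the file

def lambda_handler(event, context):
    # Twilio sends the inbound SMS as form-encoded data in the request body
    params = parse_qs(event.get("body", ""))
    phone_number = params["From"][0]
    message = params.get("Body", [""])[0].strip()

    if _is_karen(phone_number):
        return _get_response("You know what you did, Karen.")

    if _is_raffle_closed(table):
        return _get_response("Sorry, this raffle is already closed!")

    if message.lower() != SUPER_SECRET_PASSPHRASE.lower():
        return _get_response("Hmm, that's not the secret phrase.")

    try:
        # Conditional put: fails if this number has already entered
        table.put_item(
            Item={"PartitionKey": phone_number},
            ConditionExpression="attribute_not_exists(PartitionKey)",
        )
    except ClientError:
        return _get_response("Nice try! You're already entered.")

    return _get_response("You're in! Good luck.")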



ESP32 Arduino: Brotli decompression


In this tutorial we will check how to decompress a string compressed with the Brotli algorithm, using the ESP32 and the Arduino core. The tests were performed using a DFRobot ESP32 module integrated in an ESP32 development board.

Introduction

In this tutorial we will check how to decompress a string compressed with the Brotli algorithm, using the ESP32 and the Arduino core.

We will be using this library, which has a C implementation that we can use on the Arduino core.

Since, at the time of writing, I did not find any online tool that allows compressing a string using the Brotli algorithm, we will be using a very simple Python script for that task.

So, the script will compress a string and give us the bytes of the compressed content, so we can use them in the Arduino code.

The script was based on this previous tutorial, which also contains the installation instructions for the Python Brotli module we are going to need. The script used here was tested on Python version 2.7.8.

Nonetheless, if you don’t want to run the Python code, the Arduino code already contains the byte array with the compressed data. The original string that was compressed corresponds to the "|hello world|" sentence repeated five times. So, that should be the expected output from the ESP32 decompression.

The tests were performed using a DFRobot ESP32 module integrated in an ESP32 development board.

Using the library

The first thing we are going to do is download the library from its GitHub page. To do so, simply click the green "Clone or Download" button and select "Download ZIP".

You should get a file called "brotli-master.zip" on your computer. Inside, there should be a folder called "c". Extract it to some location on your computer.

Then we need to do a find-and-replace operation to change some include paths in the library source code. The easiest way is to use a tool such as Notepad++ and its "Find in Files" feature, which allows you to search for a string in multiple files and replace it with some other content.

After opening Notepad++, press Ctrl+F to open the find menu. There, select the "Find in Files" tab. In the "Find what" text box, write "brotli/". Leave the "Replace with" text box empty, so the previous string is replaced with an empty string. Also, check the "In all sub-folders" option.

Then, in the "Directory" text box, put the path to the previously extracted "c" folder. Finally, click the "Replace in files" button to perform the replacement. This procedure is highlighted in figure 1 (my menu is in Portuguese).



Figure 1 Replacing paths in the library source code.
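If you'd rather script this replacement than use Notepad++, a small Python 3 sketch along these lines should achieve the same result (it assumes the extracted "c" folder sits next to the script; adjust the path as needed):

import pathlib

# Strip the "brotli/" prefix from the include paths in every source
# file under the extracted "c" folder
root = pathlib.Path("c")
for path in root.rglob("*"):
    if path.suffix not in {".c", ".h"}:
        continue
    text = path.read_text()
    if "brotli/" in text:
        path.write_text(text.replace("brotli/", ""))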

Then, go to your Arduino libraries folder and create a new folder there called "Brotli", as shown in figure 2.



Figure 2 Creating the brotli library folder.

Then, go back to the "c" folder where you performed the replacements and copy all the files from the "c\include\brotli" folder to the folder we just created in the Arduino libraries, as shown in figure 3. Note that these should all be header files.



Figure 3 Adding the header files.

Then, open a new Arduino sketch (the one you are going to use to code) and go to its folder.

You can access the sketch folder from the Arduino IDE by clicking the "Sketch" menu and selecting the "Show Sketch Folder" option. Alternatively, you can use the Ctrl+K shortcut from the Arduino IDE.

In the sketch folder, create a new folder called src. Then, go back to the "c" folder where we previously made the replacements and copy the following folders into the src folder you just created:

fuzz
enc
dec
common

Figure 4 shows what should be contained in the src folder after pasting the content.



Figure 4 Final src folder structure.

After that, we should be able to include the header files of the Brotli library in our Arduino code.

Note: This procedure was a quick hack to be able to compile the libraries under the Arduino core. If someone knows a better way of making the Brotli libraries work as a regular Arduino library and keeping the original include paths, please share in the comments below.

The Python code

The first thing we are going to do is to import the brotli Python module.

import brotli

Then, to do the actual compression, we simply need to call the compress function of the previously imported brotli module, passing as input the string we want to compress.

We will compress the string "|hello world|" repeated five times, as mentioned in the introductory section. We can achieve this repetition easily by using the Python * operator, which is the string repetition operator.

So, we simply use the * operator after the string, followed by the number of times we want to repeat it.

This function will return the compressed content as a string. We will store it in a variable, as shown below.

compressed = brotli.compress("|hello world|" * 5)

Finally, we will iterate through all the characters of the string and print the corresponding byte value. To obtain the byte value of a character, we can use the ord function.

To make our life easier when copying the bytes to the Arduino code, we will append a comma after each byte. That way, we can simply copy and paste the comma-separated bytes and enclose them in curly brackets to declare an array of bytes in the Arduino code, as we will see below.

Note that the ord function returns the value of the byte as an integer. Nonetheless, in Python, we cannot concatenate a string or a character to an integer.

So, before we add the comma, we need to convert the value of the byte to a string using the str function. Only then can we concatenate the comma to it.
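The excerpt is cut off at this point, but piecing together the steps described above gives a complete script along these lines (a sketch; like the rest of the article it assumes Python 2, where compress returns a str):

import brotli

# Compress "|hello world|" repeated five times
compressed = brotli.compress("|hello world|" * 5)

# Build a comma-separated list of byte values, ready to paste into a
# C byte array in the Arduino code: { 27, 12, ... }
output = ""
for char in compressed:
    output += str(ord(char)) + ","
print(output)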

A script for backing up Tumblr posts and likes →

backup_tumblr

This is a set of scripts for downloading your posts and likes from Tumblr.

The scripts try to download as much as possible, including:

Every post and like
All the metadata about a post that's available through the Tumblr API
Any media files attached to a post (e.g. photos, videos)

I've had these for private use for a while, and in the wake of Tumblr going on a deletion spree, I'm trying to make them usable by other people.



Pictured: a group of Tumblr users fleeing the new content moderation policies. Image credit: Wellcome Collection , CC BY.

Getting started

Install Python 3.6 or later. Instructions are on the Python website.

Check you have pip installed by running the following command at a command prompt:

$ pip3 --version
pip 18.1 (python 3.6)

If you don't have it installed or the command errors, follow the pip installation instructions.

Clone this repository:

$ git clone git@github.com:alexwlchan/backup_tumblr.git
$ cd backup_tumblr

Install the Python dependencies:

$ pip3 install -r requirements.txt

Get yourself a Tumblr API key by registering an app at https://www.tumblr.com/oauth/apps.

You need the OAuth Consumer Key from this screen:


Usage

There are three scripts in this repo:

1. save_posts_metadata.py
2. save_likes_metadata.py
3. save_media_files.py

They're split into separate scripts because saving metadata is much faster than media files.

You should run (1) and/or (2), then run (3). Something like:

$ python3 save_posts_metadata.py
$ python3 save_likes_metadata.py
$ python3 save_media_files.py

If you know what command-line flags are: you can pass arguments (e.g. API key) as flags. Use --help to see the available flags.

If that sentence meant nothing: don't worry, the scripts will ask you for any information they need.

Unanswered questions and notes

I have no idea how Tumblr's content blocks interact with the API, or if blocked posts are visible through the API.

I've seen mixed reports saying that ordering in the dashboard has been broken for the last few days. Again, no idea how this interacts with the API.

Media files can get big. I have ~12k likes which are taking ~9GB of disk space. The scripts will merrily fill up your disk, so make sure you have plenty of space before you start!

These scripts are provided "as is". File an issue if you have a problem, but I don't have much time for maintenance right now.

Sometimes the Tumblr API claims to have more posts than it actually returns, and the effect is that the script appears to stop early, e.g. at 96%.

I'm reading the total_posts parameter from the API responses, and paginating through it as expected -- I have no idea what causes the discrepancy.

Acknowledgements

Hat tip to @cesy for nudging me to post it, and providing useful feedback on the initial version.

Licence

MIT.

Gender Diversity in the R and Python Communities


Many (if not most) tech communities have far more representation from men than from women (and even fewer from nonbinary folk). This is a shame, because everybody uses software, and these projects would self-evidently benefit from the talent and expertise from across the entire community. Some projects are doing better than others, though, and data scientist Reshama Shaikh recently published an in-depth comparison of the representation of women in the R and Python communities.

Shaikh's analysis draws from several data sources, which provide evidence that women are better represented in the R community than in the Python community. These include:

The R-Ladies community has 29,500 members compared to PyLadies' 36,500, despite the Python community being 6x larger overall.
A 2016 study of GitHub contributors estimates that 9.3% of R contributors are women, compared to 2.0% of Python contributors.
In a 2017 R Consortium survey of R users, 14% of respondents identified as women.
The 2018 New York R conference had 45% women speakers; the 2016 useR! conference had 28% female attendees.

Several reasons are offered for the relative success of the R community in this regard, but in my opinion the most important of these is the vibrancy of the R-Ladies network, which now comprises 134 chapters worldwide. Shaikh lists some steps the Python community is taking to address the issue, and provides several additional suggestions as well. You can read the complete analysis and recommendations in the blog post linked below.

Reshama Shaikh: Why Women Are Flourishing In R Community But Lagging In Python

Book Memo: “Multiple Criteria Decision Aid”


Methods, Examples and Python Implementations

Multiple criteria decision aid (MCDA) methods are illustrated in this book through theoretical and computational techniques utilizing Python. Existing methods are presented in detail with a step-by-step learning approach. Theoretical background is given for TOPSIS, VIKOR, PROMETHEE, SIR, AHP, goal programming, and their variations. Comprehensive numerical examples are also discussed for each method in conjunction with easy-to-follow Python code. Extensions to multiple criteria decision making algorithms, such as fuzzy number theory and group decision making, are introduced and implemented through Python as well. Readers will learn how to implement and use each method based on the problem, the available data, the stakeholders involved, and the various requirements needed. Focusing on the practical aspects of the multiple criteria decision making methodologies, this book is designed for researchers, practitioners and advanced graduate students in the applied mathematics, information systems, operations research and business administration disciplines, as well as other engineers and scientists oriented toward interdisciplinary research.


Python: Brotli compression


In this tutorial we will check how to compress a string using the Brotli compression algorithm.

Introduction

In this tutorial we will check how to compress a string using the Brotli compression algorithm. You can check the full specification of the algorithm here.

We will be using this library, which includes a Python module. As explained in the Python section of the library's GitHub page, you can install it using pip with the following command:

pip install brotli

As we will see below, the code for this tutorial will be really simple and short, since this module offers a function that receives a string and returns the compressed content as output, without the need for any additional procedures.

This tutorial was tested on Python version 2.7.8.

The code

We will start the code by importing the previously installed brotli module. This module will make available the functionalities we need to perform the compression.

import brotli

Then, in order to perform the compression, we simply need to call the compress function of the brotli module, passing as input the string we want to compress.

Note that this function has four more optional parameters, as can be seen here. Nonetheless, those four parameters have default values that we will keep unchanged.
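For illustration, a call overriding one of those defaults might look like the sketch below; the quality keyword is taken from the library's Python bindings at the time of writing, so treat the exact parameter name as an assumption and confirm it in the linked documentation.

import brotli

# A lower quality value (the scale runs 0-11) trades compression
# ratio for speed
compressed_fast = brotli.compress("hello world!" * 100, quality=5)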

For testing purposes, we will pass as input the string "hello world!" repeated one hundred times. We can repeat a string in Python by using the * operator after the string, followed by the number of times we want it to be repeated.

We will store the result of the compression in a variable, as shown below.

compressed = brotli.compress("hello world!" * 100)

The result of the compress function call will be a string (you can confirm this by calling the type function on the variable).

So, we will iterate over all the characters of that string and obtain the corresponding byte values. We can obtain each individual character by using a for ... in loop.

For each iteration, we will use the ord function to obtain the value of the byte that represents the corresponding character.

You can check the loop and the printing of the values below. Note that the comma at the end of the print statement is a trick to avoid inserting a new line at the end of the printed content. This way, we print all the bytes on the same line.

for byte in compressed:
    print ord(byte),

The final complete code can be seen below.

import brotli

compressed = brotli.compress("hello world!" * 100)

for byte in compressed:
    print ord(byte),

Testing the code

To test the code, simply run it in a Python tool of your choice. I’ll be using IDLE, a Python IDE. You should get an output similar to figure 1, which shows the bytes of the compressed string.


Python: Brotli compression

Figure 1 Bytes of the compressed content.
