Cuni: PyPy for low-latency systems

January 3, 2019, 10:24 pm

Post Syndicated from corbet original https://lwn.net/Articles/775916/rss

Antonio Cuni writes about work to support running Python code with low-latency requirements under PyPy. “As we said, the total cost of memory management is less on PyPy than on CPython, and it’s one of the reasons why PyPy is so fast. However, one big disadvantage is that while on CPython the cost of memory management is spread all over the execution of the program, on PyPy it is concentrated into GC runs, causing observable pauses which interrupt the execution of the user program. To avoid excessively long pauses, the PyPy GC has been using an incremental strategy since 2013. The GC runs as a series of ‘steps’, letting the user program progress between each step.”

↧

PyPy Development: PyPy for low-latency systems

January 3, 2019, 10:26 pm

Recently I merged the gc-disable branch, which introduces a couple of features that are useful when you need to respond to certain events with the lowest possible latency. This work has been kindly sponsored by Gambit Research (which, by the way, is a very cool and geeky place to work, in case you are interested). Note also that this is a very specialized use case, so these features might not be useful for the average PyPy user unless you have the same problems as described here.

The PyPy VM manages memory using a generational, moving garbage collector. Periodically, the GC scans the whole heap to find unreachable objects and frees the corresponding memory. Although at first glance this strategy might sound expensive, in practice the total cost of memory management is far less than on, e.g., CPython, which is based on reference counting. While it may be counter-intuitive, the main advantage of a non-refcount strategy is that allocation is very fast (especially compared to malloc-based allocators), and deallocation of objects which die young is basically free. More information about the PyPy GC is available here.

As we said, the total cost of memory management is less on PyPy than on CPython, and it's one of the reasons why PyPy is so fast. However, one big disadvantage is that while on CPython the cost of memory management is spread all over the execution of the program, on PyPy it is concentrated into GC runs, causing observable pauses which interrupt the execution of the user program.

To avoid excessively long pauses, the PyPy GC has been using an incremental strategy since 2013. The GC runs as a series of "steps", letting the user program progress between each step.

The following chart shows the behavior of a real-world, long-running process:

[Chart: total memory usage (orange line) and maximum GC step time (purple line) of a long-running process]

The orange line shows the total memory used by the program, which increases linearly while the program progresses. Every ~5 minutes, the GC kicks in and the memory usage drops from ~5.2GB to ~2.8GB (this ratio is controlled by the PYPY_GC_MAJOR_COLLECT env variable).

The purple line shows aggregated data about the GC timing: the whole collection takes ~1400 individual steps over the course of ~1 minute, and each point represents the maximum time a single step took during the past 10 seconds. Most steps take ~10-20 ms, although we see a horrible peak of ~100 ms towards the end. We have not yet investigated what causes it, but we suspect it is related to the deallocation of raw objects.

These multi-millisecond pauses are a problem for systems where it is important to respond to certain events with a latency which is both low and consistent. If the GC kicks in at the wrong time, it might cause unacceptable pauses during the collection cycle.

Let's look again at our real-world example. This is a system which continuously monitors an external stream; when a certain event occurs, we want to take an action. The following chart shows the maximum time it takes to complete one such action, aggregated every minute:

You can clearly see that the baseline response time is ~20-30 ms. However, we can also see periodic spikes of ~50-100 ms, with peaks up to ~350-450 ms! After a bit of investigation, we concluded that most (although not all) of the spikes were caused by the GC kicking in at the wrong time.

The work I did in the gc-disable branch aims to fix this problem by introducing two new features to the gc module:

gc.disable(), which previously only inhibited the execution of finalizers without actually touching the GC, now disables the GC major collections. After a call to it, you will see the memory usage grow indefinitely.

gc.collect_step() is a new function which you can use to manually execute a single incremental GC collection step.

It is worth specifying that gc.disable() disables only the major collections, while minor collections still run. Moreover, thanks to the JIT's virtuals, many objects with a short and predictable lifetime are not allocated at all. The end result is that most objects with a short lifetime are still collected as usual, so the impact of gc.disable() on memory growth is not as bad as it could sound.

Combining these two functions, it is possible to take control of the GC to make sure it runs only when it is acceptable to do so. For an example of usage, you can look at the implementation of a custom GC inside pypytools. The peculiarity is that it also defines a "with nogc():" context manager which you can use to mark performance-critical sections where the GC is not allowed to run.
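As a rough illustration of how the two functions can be combined, the sketch below disables major collections around a latency-critical handler and runs incremental steps between events. It is only a sketch, not the pypytools implementation: serve, handle and collect_steps are hypothetical names, and the major_is_done attribute on the value returned by gc.collect_step() is an assumption based on the PyPy documentation. On interpreters without gc.collect_step() (e.g. CPython) it falls back to a full gc.collect() so that the example still runs.

```python
import gc


def collect_steps(max_steps=50):
    """Run up to max_steps incremental GC steps; return how many were run."""
    step = getattr(gc, "collect_step", None)
    if step is None:
        # No gc.collect_step() here (e.g. CPython): do one full collection.
        gc.collect()
        return 1
    for n in range(1, max_steps + 1):
        # Assumption: the returned stats object exposes major_is_done.
        if step().major_is_done:
            return n
    return max_steps


def serve(events, handle):
    """Handle events with major collections disabled, collecting in between."""
    was_enabled = gc.isenabled()
    gc.disable()
    results = []
    try:
        for event in events:
            results.append(handle(event))  # latency-critical: no major GC here
            collect_steps(max_steps=1)     # a pause is acceptable at this point
    finally:
        if was_enabled:
            gc.enable()
    return results
```

How many steps to run, and where, depends entirely on where your program can tolerate a pause; the single step per event used here is just a placeholder.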

The following chart compares the behavior of the default PyPy GC and the new custom GC, after careful placement of nogc() sections:

The yellow line is the same as before, while the purple line shows the new system: almost all the spikes are gone, and the baseline performance is about 10% better. There is still one spike towards the end, but after some investigation we concluded that it was not caused by the GC.

Note that this does not mean that the whole program became magically faster: we simply moved the GC pauses to some other place which is not shown in the graph. In this specific use case the technique was useful because it allowed us to shift the GC work to places where pauses are more acceptable.

All in all, a pretty big success, I think. These functionalities are already available in the nightly builds of PyPy, and will be included in the next release: take this as a New Year present :)

Antonio Cuni and the PyPy team

↧

Beautiful Soup is now part of the Tidelift Subscription

January 4, 2019, 2:36 am

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save programmers hours or days of work.

With over 24,000 open source repositories depending on it, the beautifulsoup4 package is one of the most widely used in the Python ecosystem. In fact, by dependent repository count, beautifulsoup4 is within the top 40 of the more than 171,000 packages distributed through PyPI. It is also one of the most downloaded packages in the Python ecosystem, with over 18 million downloads in the last year alone.

Leonard Richardson will be providing assurances for Beautiful Soup as part of the Tidelift Subscription.

You can learn more about Beautiful Soup by visiting the project site.

What we are up to and how you can learn more

Tidelift provides a way to bring maintainers together in a scalable model that makes open source work better for everyone. Those who build and maintain open source software get compensated for their effort, and those who use their creations get more dependable software.

If you work on professional software and are interested in getting security, maintenance, and licensing assurances for the open source software you already use from the people who create and maintain it, learn more about the Tidelift Subscription.

If you are an open source maintainer and are interested in getting paid for doing the work you love while attracting more users and creating the community you want to be a part of, learn more about partnering with Tidelift.

↧

Working with MongoDB from pyspark

January 4, 2019, 2:34 am

Basic pyspark operations on a MongoDB database

This is 崔斯特's 81st original article.

A few things to note:

Do not install the latest pyspark version; install pip3 install pyspark==2.3.2 instead.

The spark-connector URI is written differently from the usual MongoDB form. The format is: mongodb://127.0.0.1:27017/database.collection

If the amount of data to process is large, your computer may become sluggish ^_^

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    """
    @author: zhangslob
    @file: spark_count.py
    @time: 2019/01/03
    @desc:
        Do not install the latest pyspark version
        `pip3 install pyspark==2.3.2`
        For more on using pyspark with MongoDB see https://docs.mongodb.com/spark-connector/master/python-api/
    """

    import os
    from pyspark.sql import SparkSession

    # set PYSPARK_PYTHON to python36
    os.environ['PYSPARK_PYTHON'] = '/usr/bin/python36'

    # load mongodb data
    # format: "mongodb://127.0.0.1:27017/database.collection"
    input_uri = "mongodb://127.0.0.1:27017/spark.spark_test"
    output_uri = "mongodb://127.0.0.1:27017/spark.spark_test"

    # create the Spark session; runs locally by default, or use "spark://master:7077"
    spark = SparkSession \
        .builder \
        .master("local") \
        .appName("MyApp") \
        .config("spark.mongodb.input.uri", input_uri) \
        .config("spark.mongodb.output.uri", output_uri) \
        .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.0') \
        .getOrCreate()


    def except_id(collection_1, collection_2, output_collection, pipeline):
        """
        Compute the rows of collection 1 that are not in collection 2
        :param collection_1: first input collection
        :param collection_2: second input collection
        :param output_collection: collection to save the result to
        :param pipeline: MongoDB aggregation pipeline, as a str
        :return:
        """
        # You can specify the database to read from here; it overrides the
        # input_uri configured above. The same applies when saving data below.
        # .option("collection", "mongodb://127.0.0.1:27017/spark.spark_test")
        # .option("database", "people").option("collection", "contacts")

        df_1 = spark.read.format('com.mongodb.spark.sql.DefaultSource').option("collection", collection_1) \
            .option("pipeline", pipeline).load()

        df_2 = spark.read.format('com.mongodb.spark.sql.DefaultSource').option("collection", collection_2) \
            .option("pipeline", pipeline).load()

        # rows that are in df_1 but not in df_2; the reverse works the same way
        df = df_1.subtract(df_2)
        df.show()

        # allowed values for mode:
        # * `append`: Append contents of this :class:`DataFrame` to existing data.
        # * `overwrite`: Overwrite existing data.
        # * `error` or `errorifexists`: Throw an exception if data already exists.
        # * `ignore`: Silently ignore this operation if data already exists.

        df.write.format("com.mongodb.spark.sql.DefaultSource").option("collection", output_collection).mode("append").save()
        spark.stop()


    if __name__ == '__main__':
        # MongoDB query (aggregation pipeline); reduces the amount of data imported
        pipeline = "[{'$project': {'uid': 1, '_id': 0}}]"

        collection_1 = "spark_1"
        collection_2 = "spark_2"
        output_collection = 'diff_uid'
        except_id(collection_1, collection_2, output_collection, pipeline)
        print('success')

Full code: spark_count_diff_uid.py

↧

Backup of data extracted from Twitter using Python to a text file?

January 4, 2019, 2:32 am

Hello all, I am currently doing some research and am using the Twitter API to collect information. I wrote some Python code to query for specific tweets and would like to save the results to a text file, yet my code only saves the last of the returned tweets. Can anyone tell me what's wrong and how I can correct this? The following is a sample of my Python code, which saves only the last tweet instead of all of the returned tweets:

    u = urllib2.urlopen('http://search.twitter.com/search.json?geocode=29.762778,-95.383056,10.0mi&page=1&rpp=10')
    datares = json.load(u)
    pprint.pprint(datares)
    for tweet in datares['results']:
        print tweet['text']
        archive=tweet['text']
        unicodedata.normalize('NFKD', archive).encode('ascii','ignore')
        with codecs.open('HTXtweets.txt',mode='w', encoding='utf-8',errors='replace') as cache:
            cache.write(archive)
            cache.closed

You're opening the file in each iteration of the loop through the results. This recreates it from scratch each time.

You should open it before the loop - you don't need to close it at the end, as that will happen automatically when the with statement finishes.
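A minimal sketch of that fix, written for Python 3 and with a plain list of dicts standing in for datares['results'] (the search endpoint from the question is not called here). save_tweets is a hypothetical helper name, and the newline after each tweet is an addition so that the tweets don't run together in the file:

```python
def save_tweets(tweets, path):
    """Write the text of every tweet to path, one tweet per line."""
    # Open the file once, before the loop, so earlier tweets are not overwritten.
    with open(path, mode="w", encoding="utf-8", errors="replace") as cache:
        for tweet in tweets:
            cache.write(tweet["text"] + "\n")
    # The with statement has closed the file at this point.


if __name__ == "__main__":
    sample = [{"text": "first tweet"}, {"text": "second tweet"}]
    save_tweets(sample, "HTXtweets.txt")
```

If you really do want to open the file inside the loop, mode='a' (append) instead of 'w' would also keep the earlier tweets, at the cost of reopening the file for every tweet.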

↧

Notebooks from the Practical AI Workshop

January 4, 2019, 2:30 am

Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, but I didn't assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this GitHub repository (and feedback is welcome there, too).

The workshop was divided into five sections, each with its associated Notebook:

The AI behind Seeing AI. Use the web interfaces to Cognitive Services to learn about the AI services behind the "Seeing AI" app.

Computer Vision API with R. Use an R script to interact with the Computer Vision API and generate captions for random Wikimedia images.

Custom Vision with R. An R function to classify an image as a "Hot Dog" or "Not Hot Dog", using the Custom Vision service.

MNIST with scikit-learn. Use scikit-learn to build a digit recognizer for the MNIST data using a regression model.

MNIST with TensorFlow. Use TensorFlow (from Python) to build a digit recognizer for the MNIST data using a convolutional neural network.

The workshop was a practical version of a talk I also gave at AI Live, "Getting Started with Deep Learning", and I've embedded those slides below.

↧

How to Use Date Picker with Django

January 4, 2019, 4:54 am

In this tutorial we are going to explore three date/datetime picker options that you can easily use in a Django project. We are going to explore how to do it manually first, then how to set up a custom widget, and finally how to use a third-party Django app with support for datetime pickers.

Introduction

The implementation of a date picker is mostly done on the front-end.

The key part of the implementation is to ensure that Django receives the date input value in the correct format, and that Django is able to reproduce that format when rendering a form with initial data.
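To make the format requirement concrete, here is a small standard-library sketch of the round trip. It does not use Django itself; it only relies on the fact that Django's date and datetime form fields parse each input_formats pattern with strptime. The pattern is the one used in the forms later in this tutorial, while parse_picker_value and render_initial are hypothetical helper names:

```python
from datetime import datetime

# The same pattern that is passed to input_formats in this tutorial's forms.
DJANGO_FORMAT = "%d/%m/%Y %H:%M"


def parse_picker_value(raw):
    """Parse the string submitted by the picker, as strptime would on the back end."""
    return datetime.strptime(raw, DJANGO_FORMAT)


def render_initial(value):
    """Format a datetime the way the picker expects to receive initial data."""
    return value.strftime(DJANGO_FORMAT)


submitted = "04/01/2019 04:54"
parsed = parse_picker_value(submitted)
print(parsed.isoformat())
print(render_initial(parsed) == submitted)
```

If the JavaScript side is configured with a different pattern (say, month before day), the same string either fails to parse or silently yields a different date, which is why the front-end and back-end formats have to agree.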

We can also use custom widgets to provide a deeper integration between the front-end and back-end and also to promote better reuse throughout a project.

In the next sections we are going to explore the following date pickers:

Tempus Dominus Bootstrap 4

XDSoft DateTimePicker

Fengyuan Chen’s Datepicker

Tempus Dominus Bootstrap 4

This is a great JavaScript library and it integrates well with Bootstrap 4. The downside is that it requires moment.js and more or less needs Font-Awesome for the icons.

It only makes sense to use this library if you are already using Bootstrap 4 + jQuery; otherwise the list of CSS and JS files may look a little bit overwhelming.

To install it you can use their CDN or download the latest release from their GitHub Releases page.

If you downloaded the code from the releases page, grab the processed code from the build/ folder.

Below, a static HTML example of the datepicker:

    <!doctype html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
      <title>Static Example</title>

      <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.2.1/css/bootstrap.min.css" integrity="sha384-GJzZqFGwb1QTTN6wy59ffF1BuGJpLSa9DkKMp0DgiMDm4iYMj70gZWKYbI706tWS" crossorigin="anonymous">
      <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
      <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.6/umd/popper.min.js" integrity="sha384-wHAiFfRlMFy6i5SRaxvfOCifBUQy1xHdJ/yoi7FRNXMRBu5WHdZYu1hA6ZOblgut" crossorigin="anonymous"></script>
      <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.2.1/js/bootstrap.min.js" integrity="sha384-B0UglyR+jN6CkvvICOB2joaf5I4l3gm9GU6Hc1og6Ls7i6U/mkkaduKaBhlAXv9k" crossorigin="anonymous"></script>

      <link href="https://stackpath.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet" integrity="sha384-wvfXpqpZZVQGK6TAh5PVlGOfQNHSoD2xbE+QkPxCAFlNEevoEH3Sl0sibVcOQVnN" crossorigin="anonymous">

      <script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.23.0/moment.min.js" integrity="sha256-VBLiveTKyUZMEzJd6z2mhfxIqz3ZATCuVMawPZGzIfA=" crossorigin="anonymous"></script>

      <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/tempusdominus-bootstrap-4/5.1.2/css/tempusdominus-bootstrap-4.min.css" integrity="sha256-XPTBwC3SBoWHSmKasAk01c08M6sIA5gF5+sRxqak2Qs=" crossorigin="anonymous" />
      <script src="https://cdnjs.cloudflare.com/ajax/libs/tempusdominus-bootstrap-4/5.1.2/js/tempusdominus-bootstrap-4.min.js" integrity="sha256-z0oKYg6xiLq3yJGsp/LsY9XykbweQlHl42jHv2XTBz4=" crossorigin="anonymous"></script>
    </head>
    <body>
      <div class="input-group date" id="datetimepicker1" data-target-input="nearest">
        <input type="text" class="form-control datetimepicker-input" data-target="#datetimepicker1"/>
        <div class="input-group-append" data-target="#datetimepicker1" data-toggle="datetimepicker">
          <div class="input-group-text"><i class="fa fa-calendar"></i></div>
        </div>
      </div>

      <script>
        $(function () {
          $("#datetimepicker1").datetimepicker();
        });
      </script>
    </body>
    </html>

Direct Usage

The challenge now is to have this input snippet integrated with a Django form.

forms.py

    from django import forms

    class DateForm(forms.Form):
        date = forms.DateTimeField(
            input_formats=['%d/%m/%Y %H:%M'],
            widget=forms.DateTimeInput(attrs={
                'class': 'form-control datetimepicker-input',
                'data-target': '#datetimepicker1'
            })
        )

template

    <div class="input-group date" id="datetimepicker1" data-target-input="nearest">
      {{ form.date }}
      <div class="input-group-append" data-target="#datetimepicker1" data-toggle="datetimepicker">
        <div class="input-group-text"><i class="fa fa-calendar"></i></div>
      </div>
    </div>

    <script>
      $(function () {
        $("#datetimepicker1").datetimepicker({
          format: 'DD/MM/YYYY HH:mm',
        });
      });
    </script>

The script tag can be placed anywhere because the snippet $(function () { ... }); will run the datetimepicker initialization when the page is ready. The only requirement is that this script tag is placed after the jQuery script tag.

Custom Widget

You can create the widget in any app you want; here I’m going to assume we have a Django app named core.

core/widgets.py

    from django.forms import DateTimeInput

    class BootstrapDateTimePickerInput(DateTimeInput):
        template_name = 'widgets/bootstrap_datetimepicker.html'

        def get_context(self, name, value, attrs):
            datetimepicker_id = 'datetimepicker_{name}'.format(name=name)
            if attrs is None:
                attrs = dict()
            attrs['data-target'] = '#{id}'.format(id=datetimepicker_id)
            attrs['class'] = 'form-control datetimepicker-input'
            context = super().get_context(name, value, attrs)
            context['widget']['datetimepicker_id'] = datetimepicker_id
            return context

In the implementation above we generate a unique ID datetimepicker_id and also include it in the widget context.

Then the front-end implementation is done inside the widget HTML snippet.

widgets/bootstrap_datetimepicker.html

    <div class="input-group date" id="{{ widget.datetimepicker_id }}" data-target-input="nearest">
      {% include "django/forms/widgets/input.html" %}
      <div class="input-group-append" data-target="#{{ widget.datetimepicker_id }}" data-toggle="datetimepicker">
        <div class="input-group-text"><i class="fa fa-calendar"></i></div>
      </div>
    </div>

    <script>
      $(function () {
        $("#{{ widget.datetimepicker_id }}").datetimepicker({
          format: 'DD/MM/YYYY HH:mm',
        });
      });
    </script>

Note how we make use of the built-in django/forms/widgets/input.html template.

Now the usage:

core/forms.py

    from django import forms
    from .widgets import BootstrapDateTimePickerInput

    class DateForm(forms.Form):
        date = forms.DateTimeField(
            input_formats=['%d/%m/%Y %H:%M'],
            widget=BootstrapDateTimePickerInput()
        )

Now simply render the field:

template

    {{ form.date }}

The good thing about having the widget is that your form could have several date fields using the widget and you could simply render the whole form like:

    <form method="post">
      {% csrf_token %}
      {{ form.as_p }}
      <input type="submit" value="Submit">
    </form>

XDSoft DateTimePicker

The XDSoft DateTimePicker is a very versatile date picker and doesn’t rely on moment.js or Bootstrap, although it looks good in a Bootstrap website.

It is easy to use and it is very straightforward.

You can download the source from the GitHub releases page.

Below, a static example so you can see the minimum requirements and how all the pieces come together:

    <!doctype html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
      <title>Static Example</title>

      <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>

      <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/jquery-datetimepicker/2.5.20/jquery.datetimepicker.min.css" integrity="sha256-DOS9W6NR+NFe1fUhEE0PGKY/fubbUCnOfTje2JMDw3Y=" crossorigin="anonymous" />
      <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-datetimepicker/2.5.20/jquery.datetimepicker.full.min.js" integrity="sha256-FEqEelWI3WouFOo2VWP/uJfs1y8KJ++FLh2Lbqc8SJk=" crossorigin="anonymous"></script>
    </head>
    <body>
      <input id="datetimepicker" type="text">

      <script>
        $(function () {
          $("#datetimepicker").datetimepicker();
        });
      </script>
    </body>
    </html>

Direct Usage

A basic integration with Django would look like this:

forms.py

    from django import forms

    class DateForm(forms.Form):
        date = forms.DateTimeField(input_formats=['%d/%m/%Y %H:%M'])

Simple form, default widget, nothing special.

Now using it on the template:

template

    {{ form.date }}

    <script>
      $(function () {
        $("#id_date").datetimepicker({
          format: 'd/m/Y H:i',
        });
      });
    </script>

The id_date is the default ID Django generates for the form fields (id_ + name).

Custom Widget

core/widgets.py

    from django.forms import DateTimeInput

    class XDSoftDateTimePickerInput(DateTimeInput):
        template_name = 'widgets/xdsoft_datetimepicker.html'

widgets/xdsoft_datetimepicker.html

    {% include "django/forms/widgets/input.html" %}

    <script>
      $(function () {
        $("input[name='{{ widget.name }}']").datetimepicker({
          format: 'd/m/Y H:i',
        });
      });
    </script>

To have a more generic implementation, this time we select the field to initialize the component by its name instead of its id, in case the user changes the id prefix.

Now the usage:

core/forms.py

    from django import forms
    from .widgets import XDSoftDateTimePickerInput

    class DateForm(forms.Form):
        date = forms.DateTimeField(
            input_formats=['%d/%m/%Y %H:%M'],
            widget=XDSoftDateTimePickerInput()
        )

template

    {{ form.date }}

Fengyuan Chen’s Datepicker

This is a very beautiful and minimalist date picker. Unfortunately there is no time support. But if you only need dates this is a great choice.

To install this datepicker you can either use their CDN or download the sources from their GitHub releases page. Please note that they do not provide compiled/processed JavaScript files, but you can download those to your local machine from the CDN.

    <!doctype html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
      <title>Static Example</title>
      <style>body {font-family: Arial, sans-serif;}</style>

      <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>

      <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/datepicker/0.6.5/datepicker.min.css" integrity="sha256-b88RdwbRJEzRx95nCuuva+hO5ExvXXnpX+78h8DjyOE=" crossorigin="anonymous" />
      <script src="https://cdnjs.cloudflare.com/ajax/libs/datepicker/0.6.5/datepicker.min.js" integrity="sha256-/7FLTdzP6CfC1VBAj/rsp3Rinuuu9leMRGd354hvk0k=" crossorigin="anonymous"></script>
    </head>
    <body>
      <input id="datepicker">

      <script>
        $(function () {
          $("#datepicker").datepicker();
        });
      </script>
    </body>
    </html>

Direct Usage

A basic integration with Django (note that we are now using DateField instead of DateTimeField ):

forms.py

    from django import forms

    class DateForm(forms.Form):
        date = forms.DateField(input_formats=['%d/%m/%Y'])

template

    {{ form.date }}

    <script>
      $(function () {
        $("#id_date").datepicker({
          format:'dd/mm/yyyy',
        });
      });
    </script>

Custom Widget

core/widgets.py

    from django.forms import DateInput

    class FengyuanChenDatePickerInput(DateInput):
        template_name = 'widgets/fengyuanchen_datepicker.html'

widgets/fengyuanchen_datepicker.html

    {% include "django/forms/widgets/input.html" %}

    <script>
      $(function () {
        $("input[name='{{ widget.name }}']").datepicker({
          format:'dd/mm/yyyy',
        });
      });
    </script>

Usage:

core/forms.py

    from django import forms
    from .widgets import FengyuanChenDatePickerInput

    class DateForm(forms.Form):
        date = forms.DateField(
            input_formats=['%d/%m/%Y'],
            widget=FengyuanChenDatePickerInput()
        )

template

    {{ form.date }}

Conclusions

The implementation is very similar no matter which date/datetime picker you are using. Hopefully this tutorial provided some insights on how to integrate this kind of front-end library into a Django project.

As always, the best source of information about each of those libraries is its official documentation.

I also created an example project to show the usage and implementation of the widgets for each of the libraries presented in this tutorial. Grab the source code at github.com/sibtc/django-datetimepicker-example.

↧

Socorro in 2018

January 4, 2019, 4:52 am

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the crash reporter collects data about the crash, generates a crash report, and submits that report to Socorro. Socorro saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.

2018 was a big year for Socorro. In this blog post, I opine about our accomplishments.

Highlights 2018

2018 was a big year. I really can't overstate that. Some highlights:

Switched from Google sign-in to Mozilla's SSO.

Alexis (one of our summer interns) switched the Crash Stats site from Google sign-in to Mozilla's SSO. It fixed a ton of problems we've had with sign-in over the years and brings Crash Stats into the fold along with other Mozilla sites.

It also created a couple of new problems which I'm still working out. The big one is "Periodic 'An unexpected error occurred' when browsing reports and comments" ([bug 1473068]).

Redid our AWS infrastructure.

This was a huge project that reworked everything about Socorro's infrastructure. Now we have:

aggregated, centralized logs and log history

CI-triggered deploys

Docker-based services

a local development environment that matches stage and prod server environments

disposable nodes

version-control managed configuration

locked-down access to storage systems

automatic scaling

AWS S3 bucket names that don't have periods in them

This project took a year and a half to do and simplified deploying and maintaining the project significantly. It also involved rewriting a lot of stuff.

I talk more about this project in Socorro Smooth Mega-Migration 2018.

We did a fantastic job on this--it was super smooth!

Rewrote Socorro's signature generation system.

Early this summer, Will Lachance took on Ben Wu as an intern to look at Telemetry crash ping data. One of the things Ben wanted to do was generate Socorro-style signatures from the data. Then he could do analysis on crash ping data using Telemetry tools and do deep dives on specific crashes in Socorro.

I refactored and extracted Socorro's signature generation code into a Python library that could be used outside of Socorro.

I talk more about this project in Siggen (Socorro signature generator) v0.2.0 released!

After Ben finished up his internship, the project was shut down. I don't think anyone uses the Siggen library. Ted says if we make it a web API, then people could use it in other places. That's the crux of "Add a web API to generate a signature from a list of frames" ([bug 828452]). I want to work on that, but have to hone the signature generation API more first.

I also cleaned up a bunch of the signature generation code: removing one of the siglist files we had, generalizing some of the code, and improving signature generation in several cases.

Tried out React.

Mike and Alexis investigated switching the Crash Stats front end to React. Towards that, they tested out converting the report view to React to see how it felt, what problems it solved, and what new issues came up.

Alexis ended his summer internship and Mike switched to a different project, so I spent some time mulling over things and deciding that while I like React and there are some compelling reasons to React-ify Crash Stats, this isn't a good move right now.

Reworked Socorro to support new products.

I reworked processing and the web interface to allow Socorro to support products that don't have the same release management process as Firefox and Fennec.

Now Socorro supports Focus, FirefoxReality, and the GeckoView ReferenceBrowser.

Switched from FTPScraperCronApp to ArchiveScraperCronApp.

Incoming crash reports for the beta channel report the release version and not the beta version. For example, crash reports for "64.0b4" come in saying they're for "64.0". That's tough because then it's hard to group crashes by specific beta. Because of that, the processor has a BetaVersionRule which looks up the (product, channel, buildid) in a table and pulls out the version string for all incoming crash reports in the beta channel.
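To make the lookup concrete, here is a toy sketch of the idea. It is not Socorro's actual BetaVersionRule code: the table contents, the build id, and the resolve_version name are all made up for illustration.

```python
# Hypothetical lookup table: (product, channel, build id) -> full version string.
BUILD_VERSIONS = {
    ("Firefox", "beta", "20181101000000"): "64.0b4",
}


def resolve_version(product, channel, buildid, reported_version, table=BUILD_VERSIONS):
    """Return the specific beta version for a crash report, if the table knows it."""
    if channel != "beta":
        # Only beta reports carry the less specific release version.
        return reported_version
    # Fall back to what the crash report said when there is no matching row.
    return table.get((product, channel, buildid), reported_version)


print(resolve_version("Firefox", "beta", "20181101000000", "64.0"))
```

The hard part in practice is not the lookup but keeping the table populated correctly, which is what the rest of this section is about.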

Previously, "a table" was a set of tables containing product build/version data. It was populated by FTPScraperCronApp which scraped archive.mozilla.org every hour for build information. It would pass the build information through a series of stored procedures and magically data would appear in the table[1]. Most of this code was written many years ago and didn't work with recent changes to releases like release candidates and aurora.

I rewrote the BetaVersionRule to do a lookup on Buildhub. However, we hit a bunch of issues that I won't go into here. One of them is that the data in Buildhub doesn't have exactly what the BetaVersionRule needs to do its thing correctly.

So I wrote a new ArchiveScraperCronApp that scrapes archive.mozilla.org for the data the BetaVersionRule needs to correctly find the version string. It now handles release candidates correctly and also aurora.

[1] And sometimes, data wouldn't appear in the table for magical and inexplicable reasons, too.

Removed PostgreSQL from the processor; removed alembic, sqlalchemy, and everything they managed.

For years, the Socorro engineering team worked on cleaning up the Gormenghast-like sprawl that was postgres. For years, we've been generating PR after PR tweaking things and removing things to reduce the spaghetti morass. It was like removing a mountain with a plastic beach toy.

All that has come to an end.

https://github.com/mozilla-services/socorro/pull/4723

We now have one ORM. We now have one migration system. We no longer have stored procedures or other bits that lack unit tests and documentation. We also bid farewell to ftpscraper and that data flow of build/release information that could have been a character or a setting in a Clive Barker novel. This gets rid of a bunch of things that were really hard to maintain and never worked quite right.

While I did the final PR, all the work I did built upon work Adrian and Peter and other people did over the years. Yay us!

Migrated to Python 3.

I started the Python 3 migration project a couple of years ago because the death knell for Python 2 had sounded and time was ticking.

We did this work in a series of baby steps so that we could make progress incrementally without upsetting or blocking other development initiatives. In the process of doing this, we updated and rewrote a lot of code including most of the error handling in the processor.

I talk more about this project in Socorro: migrating to Python 3 .

This was a big deal. Python 3 is sooooo much easier to deal with. Plus some of the libraries we're using or are planning to use are dropping support for Python 2 and things were going to get increasingly irksome.

Big thanks to Ced, Lonnen, and Mike for their efforts on this!

Removed ADI and ADI-related things.

Socorro used ADI to normalize crash rates in a couple of reports. There were tons of problems with this. Now we have Mission Control which does a better job with rates and normalizing and has more representative crash data, too.

Thus, we removed the reports from Socorro and also all the code we had to fetch and manage ADI data.

Stopped saving crash reports that won't get processed.

Socorro was saving roughly 70% of incoming crash reports, over half of which it wasn't processing. That was problematic because it meant we had a whole bunch of crash report data in storage that we didn't know anything about. That's one of the reasons we had to drop all the crash report data back in December 2017--we couldn't figure out in a reasonable amount of time which crash reports were ok to keep and which had to go.

Now Socorro saves and processes roughly 20% of incoming crash reports and rejects everything else.

Note that this doesn't affect users--they can still go to about:crashes and submit crash reports and those will get processed just like before.

Removed a lot of code.

In 2017, we removed a lot of code. We did the same in 2018.

At the beginning of 2018, we had this:

--------------------------------------------------------------
Language                   files     blank   comment      code
--------------------------------------------------------------
Python                       401     12447     10881     61034
C++                           11       816       474      6052
HTML                          66       695        24      5167
JavaScript                    52       904       959      4926
JSON                          88        21         0      4432
LESS                          19       146        49      2614
SQL                           67       398       333      2242
C/C++ Header                  12       322       614      1259
Bourne Shell                  36       298       366      1094
CSS                           13        55        65      1012
MSBuild script                 3         0         0       463
YAML                           4        34        44       241
Markdown                       3        69         0       187
INI                            4        27         0       120
make                           3        31        14        96
Mako                           1        10         0        20
Bourne Again Shell             1         7        13        13
Dockerfile                     1         4         2        11
--------------------------------------------------------------
SUM:                         785     16284     13838     90983
--------------------------------------------------------------

At the end of 2018, we had this:

--------------------------------------------------------------
Language                   files     blank   comment      code
--------------------------------------------------------------
Python                       296      8493      6708     41107
C++                           11       827       474      6095
JSON                          92        21         0      4296
HTML                          50       484        19      4270
JavaScript                    37       624       773      3368
LESS                          36       287        51      2712
C/C++ Header                  12       322       614      1259
CSS                            3        27        53       704
MSBuild script                 3         0         0       463
Bourne Shell                  21       173       263       449
YAML                           3        28        33       226
make                           3        36        15       142
Dockerfile                     1        14        12        35
INI                            1         0         0         8
--------------------------------------------------------------
SUM:                         569     11336      9015     65134
--------------------------------------------------------------

We're doing roughly the same stuff, but with less code.
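As a quick back-of-the-envelope check on "less code", the snippet below computes the reduction from the "code" column of the two tables (the SUM rows and the Python rows). The helper name is just for illustration. It works out to roughly 28% fewer lines overall and roughly 33% fewer lines of Python.

```python
# Line counts copied from the "code" column of the two cloc tables above.
start_total, end_total = 90983, 65134      # SUM rows
start_python, end_python = 61034, 41107    # Python rows


def percent_reduction(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


total_cut = percent_reduction(start_total, end_total)
python_cut = percent_reduction(start_python, end_python)

print(f"All code: {total_cut:.1f}% fewer lines")
print(f"Python:   {python_cut:.1f}% fewer lines")
```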

I don't think we're going to have another year of drastic code reduction, but it's likely we'll remove some more in 2019 as we address the last couple of technical debt projects.

Improved documentation.

I documented data flows and services. That helps maintainers and future me going forward.

I documented how to request access to PII/memory dumps. That wasn't documented before, and it sure seemed like any time an engineer needed elevated access, they would stumble around to figure out how to get it. That stinks. Hopefully it's better now.

I also documented how to request a new product in Crash Stats. Socorro is effectively a service for other parts of the organization and it should have documentation covering the kinds of things services have: a list of what it does, how to use it, how to set your product up, etc. Getting there.

Lots of stuff happened. A lot of big multi-year projects were completed. It was a good year!

Thank you!

Thank you to everyone who helped out: Lonnen, Miles, Brian, Stephen, Greg, Mike, and Will, our two interns Ced and Alexis, and everyone who submits bugs, PRs, and helps out in their own ways!

We accomplished a ton this year. We're almost done with technical debt projects. 2019 will be fruitful.

Bugzilla and GitHub stats for 2018

Period (2018-01-01 -> 2018-12-31)
=================================

Bugzilla
========

  Bugs created: 623
  Creators: 67

    Will Kahn-Greene [:willkg] : 349
    Peter Bengtsson [:peterbe] : 38
    Michael Kelly [:mkelly,:Osmose : 29
    Stephen Donner [:stephend] : 16
    Alexis Deschamps [:alexisdesch : 16
    Brian Pitts : 13
    Marcia Knous [:marcia] : 13
    Miles Crabill [:miles] : 10
    Andy Mikulski [:amikulski] : 9
    Calixte Denizet (:calixte) : 8
    Kartikaya Gupta : 8
    Andrew McCreight [:mccr8] : 7
    [:philipp] : 6
    Wayne Mery (:wsmwk) : 4
    Ted Mielczarek [:ted] [:ted.mi : 4
    Lonnen :lonnen : 4
    Chris Peterson [:cpeterson] : 4
    Jonathan Watt [:jwatt] : 3
    Jan Andre Ikenmeyer [:darkspir : 3
    Cristi Fogel [:cfogel] : 3
    Aaron Klotz [:aklotz] : 2
    Jeff Muizelaar [:jrmuizel] : 2
    Markus Stange [:mstange] : 2
    Liz Henry (:lizzard) : 2
    cmiller : 2
    Paul Theriault [:pauljt] : 2
    Brian Hackett (:bhackett) : 2
    Julien Cristau [:jcristau] : 2
    Treeherder Bug Filer : 1
    Peter Van der Beken [:peterv] : 1
    Arun babu : 1
    Tristan Weir [:weir] : 1
    David Bolter [:davidb] : 1
    Eric Rescorla (:ekr) : 1
    Yasin Soliman : 1
    AJ Bahnken [:ajvb] : 1
    Dan Glastonbury (:kamidphish) : 1
    Worcester12345 : 1
    Ted Campbell [:tcampbell] : 1
    Matthew Gregan [:kinetik] : 1
    Suriti Singh : 1
    Johan Lorenzo [:jlorenzo] : 1
    Adolfo Jayme : 1
    Tom Prince [:tomprince] : 1
    Mike Hommey [:glandium] : 1
    David Baron :dbaron: : 1
    Marco Castelluccio [:marco] : 1
    Ehsan Akhgari : 1
    Stephen A Pohl [:spohl] : 1
    Tim Smith [:tdsmith] : 1
    Daosheng Mu[:daoshengmu] : 1
    Rob Wu [:robwu] : 1
    Randell Jesup [:jesup] : 1
    Hiroyuki Ikezoe (:hiro) : 1
    Cameron McCormack (:heycam) : 1
    Julien Vehent [:ulfr] : 1
    James Willcox (:snorp) (jwillc : 1
    kiavash.satvat : 1
    Jan Henning [:JanH] : 1
    Sebastian Kaspari (:sebastian) : 1
    Yaron Tausky [:ytausky] : 1
    Atoll : 1
    Andreas Farre [:farre] : 1
    Gabriele Svelto [:gsvelto] : 1
    Petru-Mugurel Lingurar[:petru] : 1
    Dragana Damjanovic [:dragana] : 1
    Tom Tung [:tt, :ttung] : 1

  Bugs resolved: 781

    WONTFIX : 93
    INCOMPLETE : 16
    FIXED : 597
    WORKSFORME : 23
    INVALID : 28
    DUPLICATE : 20
     : 4

  Resolvers: 50

    Will Kahn-Greene [:willkg] ET : 499
    Peter Bengtsson [:peterbe] : 70
    Miles Crabill [:miles] [also m : 50
    Michael Kelly [:mkelly,:Osmose : 35
    Brian Pitts : 22
    Alexis Deschamps [:alexisdesch : 17
    Stephen Donner [:stephend] : 16
    Andy Mikulski [:amikulski] : 9
    Issei Horie [:is2ei] : 7
    Lonnen :lonnen : 7
    Ted Mielczarek [:ted] [:ted.mi : 7
    mozilla+bugcloser : 5
    Andrew McCreight [:mccr8] : 3
    Kartikaya Gupta (email:kats : 3
    Calixte Denizet (:calixte) : 3
    madperson : 2
    vseerror : 2
    Marco Castelluccio [:marco] (P : 2
    cmiller : 2
    rhelmer : 1
    jimnchen+bmo : 1
    JP Schneider [:jp] : 1
    sarentz : 1
    gguthe : 1
    nfroyd : 1
    Aaron Klotz [:aklotz] : 1
    abahnken : 1
    lhenry : 1
    Mike Hommey [:glandium] : 1
    dbaron : 1
    [:philipp] : 1
    Chris Peterson [:cpeterson] : 2
    Sotaro Ikeda [:sotaro out of o : 1
    mstange : 1
    mozillamarcia.knous : 1
    Cameron McCormack (:heycam) : 1
    Jeff Muizelaar [:jrmuizel] : 1
    Julien Cristau [:jcristau] [PT : 1

  Commenters: 175

    willkg : 2297
    peterbe : 442
    mozilla+bugcloser : 435
    miles : 161
    mkelly : 123
    bpitts : 123
    ted : 93
    chris.lonnen : 77
    adrian : 59
    stephen.donner : 50
    etc...

  Tracker bugs: 17

    1083384: [tracker] deprecate /status/ telemetry machinery
    1257531: [tracker] Stop saving crash data to postgresql
    1316435: [tracker][e2e-tests] Find a remedy for the skipped and xfail'd e2e-tests
    1346883: [tracker] remove postgres usage from processor
    1361394: [tracker] Simplify and clean up postgresql schema
    1373997: [tracker] rewrite docs
    1391034: [tracker] switch to dockerized socorro in cloudops infra
    1395647: [tracker] Migrate uploaders from Socorro to Tecken
    1406703: [tracker] switch to python 3
    1408041: [tracker] expose MinidumpSha256Hash
    1433274: [tracker] Photon: Refactor webapp UI styling and structure
    1478110: [tracker] stop saving crash data we aren't processing
    1478351: [tracker] support rust
    1478353: [tracker] support new products on Socorro
    1497956: [tracker] upgrade postgres to 9.5
    1497957: [tracker] upgrade postgres to 9.6
    1505231: [tracker] rework error handling in processor

  Statistics

    Youngest bug : 0.0d: 1429209: Switch from msgpack-python to msgpack
    Average bug age : 207.8d
    Median bug age : 18.0d
    Oldest bug : 3028.0d: 578760: Allow (manual) annotation of system graphs with...

GitHub
======

  mozilla-services/antenna: 25 prs

    Committers:
      willkg : 22 (+944, -901, 22 files)
      milescrabill : 3 (+104, -102, 3 files)
      Total : (+1048, -1003, 25 files)

    Most changed files:
      antenna/throttler.py (12)
      tests/unittest/test_throttler.py (8)
      antenna/breakpad_resource.py (4)
      tests/unittest/test_breakpad_resource.py (4)
      requirements/default.txt (3)
      .circleci/config.yml (3)
      docs/breakpad_reporting.rst (2)
      tests/unittest/test_s3_crashstorage.py (2)
      Dockerfile (2)
      tests/unittest/test_crashstorage.py (2)

    Age stats:
      Youngest PR : 0.0d: 286: Update requests to 2.20.0
      Average PR age : 1.5d
      Median PR age : 0.0d
      Oldest PR : 20.0d: 260: Update docs on triggering a crash in Firefox

  mozilla-services/socorro: 453 prs

    Committers:
      willkg : 328 (+36325, -82912, 812 files)
      stephendonner : 27 (+686, -2426, 32 files)
      Osmose : 26 (+7429, -1830, 120 files)
      AlexisDeschamps : 19 (+14081, -9398, 166 files)
      pyup-bot : 9 (+779, -724, 8 files)
      is2ei : 7 (+110, -516, 16 files)
      andymikulski : 7 (+2468, -2182, 73 files)
      lonnen : 5 (+461, -6378, 71 files)
      milescrabill : 4 (+44, -74, 3 files)
      amccreight : 3 (+3, -0, 3 files)
      ceddy-cedd : 3 (+171, -74, 49 files)
      renovate[bot] : 3 (+1490, -814, 5 files)
      jcristau : 2 (+2, -1, 2 files)
      jrmuizel : 1 (+2, -0, 1 files)
      heycam : 1 (+1, -0, 1 files)
      sotaroikeda : 1 (+1, -0, 1 files)
      cpeterso : 1 (+1, -0, 1 files)
      philipp-sumo : 1 (+1, -0, 1 files)
      sciurus : 1 (+0, -2, 1 files)
      luser : 1 (+2, -1, 1 files)
      g-k : 1 (+1, -1, 1 files)
      dblohm7 : 1 (+172, -49, 3 files)
      chartjes : 1 (+0, -9, 2 files)
      Total : (+64230, -107391, 1015 files)

    Most changed files:
      socorro/processor/mozilla_transform_rules.py (44)
      webapp-django/crashstats/crashstats/models.py (37)
      requirements/default.txt (32)
      webapp-django/crashstats/settings/base.py (30)
      socorro/unittest/processor/test_mozilla_transform_rules.py (29)
      socorro/signature/rules.py (25)
      webapp-django/crashstats/crashstats/utils.py (25)
      socorro/cron/crontabber_app.py (23)
      Makefile (22)
      webapp-django/crashstats/settings/bundles.py (21)

    Age stats:
      Youngest PR : 0.0d: 4756: fix bug 1516010: add version flow docs
      Average PR age : 1.3d
      Median PR age : 0.0d
      Oldest PR : 72.0d: 4253: [ready] 1409648 gc rule sets part 2

  mozilla-services/socorro-pigeon: 10 prs

    Committers:
      willkg : 9 (+630, -225, 22 files)
      milescrabill : 1 (+1, -1, 1 files)
      Total : (+631, -226, 22 files)

    Most changed files:
      README.rst (4)
      pigeon.py (4)
      bin/build_artifact.sh (3)
      requirements-dev.txt (3)
      tests/conftest.py (3)
      Makefile (2)
      tests/test_pigeon.py (2)
      circle.yml (2)
      setup.cfg (2)
      .gitignore (1)

    Age stats:
      Youngest PR : 0.0d: 37: bug 1452681 - artefact 2
      Average PR age : 1.3d
      Median PR age : 0.0d
      Oldest PR : 6.0d: 34: bug 1432491 - redo aws lambda scaffolding

  All repositories:
    Total merged PRs: 488

Contributors
============

Atoll Ehsan Akhgari [:philipp] Aaron Klotz [:aklotz] abahnken acrichton adityamotwani Adolfo Jayme adrian afarre AJ Bahnken [:ajvb] ajones akimov.alex alex_mayorga alexbruceharley Alexis Deschamps [:alexisdeschamps] almametcal Andreas Farre [:farre] Andrew McCreight [:mccr8] Andy Mikulski [:amikulski] apavel april Arun babu arunbalu123 aryx.bugmail ayumiqmazaky bbirtles benjamin bewu bhackett1024 bhearsum bloodyhazel7 bobby.chien+bugzilla brad Brian Hackett (:bhackett) Brian Pitts bzhao Calixte Denizet (:calixte) cam Cameron McCormack (:heycam) catlee cdenizet ceddy-cedd chartjes Chris Peterson [:cpeterson] chutten cliang cmiller continuation cr Cristi Fogel [:cfogel] culucenk Dan Glastonbury (:kamidphish) Daosheng Mu[:daoshengmu] dave.hunt David Baron :dbaron: David Bolter [:davidb] dblohm7 dbrown dd.mozilla ddurst dmajor dmu Dragana Damjanovic [:dragana] dteller dthorn dustin dveditz dylan ehsan emilio Eric Rescorla (:ekr) fbraun felash g-k Gabriele Svelto [:gsvelto] gdestuynder gfritzsche gguthe gijskruitbosch+bugs giles gps heycam Hiroyuki Ikezoe (:hiro) hkirschner hutusoru.andrei i17gyp igoldan Issei Horie [:is2ei] James Willcox (:snorp) Jan Andre Ikenmeyer [:darkspirit] Jan Henning [:JanH] jbecerra jclaudius jdow Jeff Muizelaar [:jrmuizel] jh+bugzilla jimb jimnchen+bmo jld Johan Lorenzo [:jlorenzo] John99-bugs Jonathan Watt [:jwatt] JP Schneider [:jp] jrmuizel jteh Julien Cristau [:jcristau] Julien Vehent [:ulfr] jwalker kairo Kartikaya Gupta kbrosnan key-mozillabugzilla2939-contact kiavash.satvat kinetik lars larsberg lassey laura Liz Henry (:lizzard) (PTO Dec 28) Lonnen :lonnen ludovic luser m_kato madperson Marcia Knous [:marcia - needinfo? me] Marco Castelluccio [:marco] (PTO until Jan 2) Markus Stange [:mstange] (away until Jan 8) markus.vervier mats matt.woodrow Matthew Gregan [:kinetik] mbrandt mconley mdaly merwin Michael Kelly [:mkelly,:Osmose] Mike Hommey [:glandium] miket milaninbugzilla Miles Crabill [:miles] [also mcrabill mozilla mozilla+bugcloser n.nethercote ncsoregi nfroyd nhirata.bugzilla nitanwar nkochar nthomas orangefactor overholt Paul Theriault [:pauljt] pbone Peter Bengtsson [:peterbe] Peter Van der Beken [:peterv] Petru-Mugurel Lingurar[:petru] ptheriault pulgasaur rail Randell Jesup [:jesup] rares.doghi rbarker rhelmer rkothari Rob Wu [:robwu] s.kaspari sarentz schalk.neethling.bugs sciurus sdaswani sdeckelmann Sebastian Kaspari (:sebastian) secreport sespinoza skywalker333 sledru smani Sotaro Ikeda [:sotaro] sotaroikeda sphink Stephen A Pohl [:spohl] Stephen Donner [:stephend] Suriti Singh susingh svoisen Ted Campbell [:tcampbell] Ted Mielczarek [:ted] [:ted.mielczarek] Tim Smith [:tdsmith] Tobias.Besemer Tom Prince [:tomprince] Tom Tung [:tt, :ttung] Tristan Weir [:weir] -- use NEEDINFO for response videetssinghai viveknegi1 Wayne Mery (:wsmwk) Will Kahn-Greene [:willkg] ET needinfo? me wlachance Worcester12345 Yaron Tausky [:ytausky] Yasin Soliman yor

↧

Import Python: ImportPython Newsletter - Issue 188

January 4, 2019, 4:50 am

Clean Architectures in Python - by Leonardo Giordani (free book)

The clean architecture is the opposite of spaghetti code, where everything is interlaced and there are no single elements that can be easily detached from the rest and replaced without the whole system collapsing. The main point of the clean architecture is to make clear "what is where and why", and this should be your first concern while you design and implement a software system, whatever architecture or development methodology you want to follow.

The Python Graph Gallery - visualizing data

Welcome to the Python Graph Gallery. This website displays hundreds of charts, always providing the reproducible Python code! It aims to showcase the awesome dataviz possibilities of Python and to help you benefit from it. Feel free to propose a chart or report a bug. Any feedback is highly welcome. Get in touch with the gallery by following it on Twitter, Facebook, or by subscribing to the blog. Note that this online course is another good resource to learn dataviz with Python.

Episode #191: Python's journey at Microsoft - podcast

Join me along with Steve Dower (a core dev working at Microsoft), who just published an amazing retrospective of Python at Microsoft entitled: Python at Microsoft: flying under the radar.

Five languages - five stories, by Kari Marttila (Medium)

See how Python compared with other languages.

Visual parameter tuning with Facebook Prophet and Python

Facebook Prophet is by far my favorite Python package. It allows for quick and easy forecasting of many time series with a novel Bayesian model that estimates various parameters using a general additive model. More on Facebook Prophet can be found right here. Forecasting is important in many business situations, particularly supply chain management and demand planning.

Python and Django logging in plain English - Django Deconstructed

If you’ve ever written a program and printed out a value to see what’s going on during execution, then you understand at some level why logging is so valuable. Knowing what’s happening in your code at a point in time is enormously useful from both technical and business perspectives. This knowledge lets developers and product managers make smart choices about what systems to fix or modify and lets them see what actions users take when they use your software.

Programming FTDI devices in Python

FTDI chips are frequently used as USB-to-serial adaptors, but the newer devices have the ability to drive more complex protocols such as SPI and I2C. I like to use Python when first experimenting with new PC hardware, and there are some Python libraries for interfacing to FTDI chips, but I couldn’t find any real projects or complete worked examples.

Advent of code 2018 solutions

Advent of Code is an Advent calendar of small programming puzzles for a variety of skill sets and skill levels that can be solved in any programming language you like. People use them as a speed contest, interview prep, company training, university coursework, practice problems, or to challenge each other.

8 reasons Python sucks - The Hacker Factor Blog

And me, well... I just blurted it out: I hate Python. I hate it with a passion. If I have the choice between using some pre-existing Python code or rewriting it in C, I'd rather rewrite it in C.
Curator's Note - Someone give him a Django website to rewrite in C. ;) ... Nevertheless it's good to hear critique and ponder over it.

Python gets a new governance model

A new thread was started on the Python committers Discourse instance to discuss the pros and cons of various voting systems. Instant-runoff voting fell out of favor; there were concerns that it didn't truly represent the will of the electorate, as seen in a Burlington, Vermont mayoral election in 2009, for example. The fact that it was put in place by fiat under a self-imposed deadline based on in-person conversations at the core developer sprint, rather than being hashed out on the Discourse instance or the python-committers mailing list may have also been a factor.

First impressions of GPUs and PyData

Like many PyData developers, I’m loosely aware that GPUs are sometimes fast, but don’t deal with them often enough to have strong feelings about them. To get a more visceral feel for the performance differences, I logged into a GPU machine, opened up CuPy (a Numpy-like GPU library developed mostly by Chainer in Japan) and cuDF (a Pandas-like library in development at NVIDIA) and did a couple of small speed comparisons:

Blazing fast Python

Perhaps you’ve faced the fortunate challenge of scaling up a Python application to accommodate a steadily increasing user base. Though most cloud hosting providers make it easier than ever to throw more hardware at a problem, there comes a point when the cost outweighs convenience. Around the time scaling horizontally starts looking less attractive, developers turn to performance tuning to make applications more efficient. In the Python community there are a number of tools to help in this arena; from the built-in timeit module to profiling tools like cProfile, there are quick ways to test the difference between a particular line of code and any of its alternatives. Although profiling tools help you see important information about which calls in your application are time consuming, it’s difficult to exercise an application during local development the same way your users exercise it in real life. The solution to bridging this gap? Profile in production! Pyflame is a profiling tool

Python IO streams in examples

Python IO streams: BytesIO and StringIO in p

↧

How to Develop a Snapshot Ensemble Deep Learning Neural Network in Python With K ...

January 4, 2019, 4:48 am

Model ensembles can achieve lower generalization error than single models but are challenging to develop with deep learning neural networks given the computational cost of training each single model.

An alternative is to train multiple model snapshots during a single training run and combine their predictions to make an ensemble prediction. A limitation of this approach is that the saved models will be similar, resulting in similar predictions and prediction errors and not offering much benefit from combining their predictions.

Effective ensembles require a diverse set of skillful ensemble members that have differing distributions of prediction errors. One approach to promoting a diversity of models saved during a single training run is to use an aggressive learning rate schedule that forces large changes in the model weights and, in turn, the nature of the model saved at each snapshot.

In this tutorial, you will discover how to develop snapshot ensembles of models saved using an aggressive learning rate schedule over a single training run.

After completing this tutorial, you will know:

Snapshot ensembles combine the predictions from multiple models saved during a single training run.
Diversity in model snapshots can be achieved by aggressively cycling the learning rate used during a single training run.
How to save model snapshots during a single run and load snapshot models to make ensemble predictions.

Let’s get started.

How to Develop a Snapshot Ensemble Deep Learning Neural Network in Python With Keras

Photo by Jason Jacobs, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

Snapshot Ensembles
Multi-Class Classification Problem
Multilayer Perceptron Model
Cosine Annealing Learning Rate
MLP Snapshot Ensemble

Snapshot Ensembles

A problem with ensemble learning with deep learning methods is the large computational cost of training multiple models.

This is because of the use of very deep models and very large datasets that can result in model training times that may extend to days, weeks, or even months.

Despite its obvious advantages, the use of ensembling for deep networks is not nearly as widespread as it is for other algorithms. One likely reason for this lack of adaptation may be the cost of learning multiple neural networks. Training deep networks can last for weeks, even on high performance hardware with GPU acceleration.

― Snapshot Ensembles: Train 1, get M for free, 2017.

One approach to ensemble learning for deep learning neural networks is to collect multiple models from a single training run. This addresses the computational cost of training multiple deep learning models as models can be selected and saved during training, then used to make an ensemble prediction.

A key benefit of ensemble learning is in improved performance compared to the predictions from single models. This can be achieved through the selection of members that have good skill, but in different ways, providing a diverse set of predictions to be combined. A limitation of collecting multiple models during a single training run is that the models may be good, but too similar.

This can be addressed by changing the learning algorithm for the deep neural network to force the exploration of different network weights during a single training run, which will result, in turn, in models that have differing performance. One way that this can be achieved is by aggressively changing the learning rate used during training.

An approach to systematically and aggressively changing the learning rate during training to result in very different network weights is referred to as “Stochastic Gradient Descent with Warm Restarts,” or SGDR for short, described by Ilya Loshchilov and Frank Hutter in their 2017 paper “SGDR: Stochastic Gradient Descent with Warm Restarts.”

Their approach involves systematically changing the learning rate over training epochs, called cosine annealing. This approach requires the specification of two hyperparameters: the initial learning rate and the total number of training epochs.

The “cosine annealing” method has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being dramatically increased again. The model weights are subjected to dramatic changes during training, having the effect of using “good weights” as the starting point for the subsequent learning rate cycle, but allowing the learning algorithm to converge to a different solution.
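The shape of this schedule can be sketched in a few lines of plain Python. This is a minimal illustration that follows the general form of the schedule in the Snapshot Ensembles paper, not a definitive implementation. The function name and the n_cycles argument are assumptions made for the sketch: the number of cycles is needed in addition to the initial learning rate and the total number of epochs mentioned above.

```python
from math import cos, floor, pi


def cosine_annealing(epoch: int, n_epochs: int, n_cycles: int, lr_max: float) -> float:
    """Learning rate for a zero-based epoch under a cyclic cosine schedule.

    Within each cycle the rate starts at lr_max and decays towards zero,
    then jumps back to lr_max at the start of the next cycle.
    """
    epochs_per_cycle = floor(n_epochs / n_cycles)
    # position within the current cycle, mapped onto [0, pi)
    cos_inner = pi * (epoch % epochs_per_cycle) / epochs_per_cycle
    return lr_max / 2 * (cos(cos_inner) + 1)


# Example values chosen for illustration: 100 epochs split into 5 cycles.
schedule = [cosine_annealing(e, n_epochs=100, n_cycles=5, lr_max=0.01) for e in range(100)]
print(schedule[0], schedule[19], schedule[20])
```

In this example the rate falls from the maximum at epoch 0 to its lowest value at epoch 19, then resets to the maximum at epoch 20, which is the "warm restart".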

The resetting of the learning rate acts like a simulated restart of the learning process and the re-use of good weights as the starting point of the restart is referred to as a “warm restart,” in contrast to a “cold restart” where a new set of small random numbers may be used as a starting point.

The “good weights” at the bottom of each cycle can be saved to file, providing a snapshot of the model. These snapshots can be collected together at the end of the run and used in a model averaging ensemble. The saving and use of these models during an aggressive learning rate schedule is referred to as a “Snapshot Ensemble” and was described by Gao Huang, et al. in their 2017 paper titled “Snapshot Ensembles: Train 1, get M for free” and subsequently also used in an updated version of the Loshchilov and Hutter paper.

… we let SGD converge M times to local minima along its optimization path. Each time the model converges, we save the weights and add the corresponding network to our ensemble. We then restart the optimization with a large learning rate to escape the current local minimum.

― Snapshot Ensembles: Train 1, get M for free, 2017.

The ensemble of models is created during the course of training a single model; therefore, the authors claim that the ensemble forecast is provided at no additional cost.

[the approach allows] learning an ensemble of multiple neural networks without incurring any additional training costs.

― Snapshot Ensembles: Train 1, get M for free, 2017.
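To make the "model averaging ensemble" idea concrete, the sketch below averages the class probabilities predicted by several snapshots and picks the most likely class for each sample. It is an illustration only: plain Python lists stand in for the arrays a Keras model would return, and the function name and probability values are made up.

```python
from typing import List, Sequence


def ensemble_predict(member_probs: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average per-class probabilities over snapshots, then take the argmax.

    member_probs[m][i][c] is the probability that snapshot m assigns
    to class c for sample i.
    """
    n_members = len(member_probs)
    n_samples = len(member_probs[0])
    labels = []
    for i in range(n_samples):
        n_classes = len(member_probs[0][i])
        # mean probability of each class across all snapshots
        mean = [sum(m[i][c] for m in member_probs) / n_members for c in range(n_classes)]
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels


# Made-up predictions from three snapshots for two samples and three classes.
snapshots = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]],
    [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]],
]
print(ensemble_predict(snapshots))
```

Note that the second snapshot alone would label the first sample as class 1, while the first snapshot alone would label the second sample as class 1; averaging lets the other members outvote a single dissenting snapshot.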

Although a cosine annealing schedule is used for the learning rate, other aggressive learning rate schedules could be used, such as the simpler cyclical learning rate schedule described by Leslie Smith in the 2017 paper titled “Cyclical Learning Rates for Training Neural Networks.”

Now that we are familiar with the snapshot ensemble technique, we can look at how to implement it in Python with Keras.

Multi-Class Classification Problem

We will use a small multi-class classification problem as the basis to demonstrate the snapshot ensemble.

The scikit-learn library provides the make_blobs() function that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.

The problem has two input variables (to represent the x and y coordinates of the points) and a standard deviation of 2.0 for points within each group. We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.

# generate 2d classification dataset
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)

The result is the input and output elements of a dataset that we can model.

In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.

The complete example is listed below.

# scatter plot of blobs dataset
from sklearn.datasets import make_blobs
from matplotlib import pyplot
from numpy import where
# generate 2d classification dataset
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
# scatter plot for each class value
for class_value in range(3):
    # select indices of points with the class label
    row_ix = where(y == class_value)
    # scatter plot for points with a different color
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show plot
pyplot.show()

Running the example creates a scatter plot of the entire dataset. We can see that the standard deviation of 2.0 means that the classes are not linearly separable (separable by a line) causing many ambiguous points.

This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different “good enough” candidate solutions resulting in a high variance.

Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value

Multilayer Perceptron Model

Before we define a model, we need to contrive a problem that is appropriate for the ensemble.

In our problem, the training dataset is relatively small. Specifically, there is a 10:1 ratio of examples in the holdout dataset to the training dataset. This mimics a situation where we may have a vast number of unlabeled examples and a small number of labeled examples with which to train a model.

We will create 1,100 data points from the blobs problem. The model will be trained on the first 100 points and the remaining 1,000 will be held back in a test dataset, unavailable to the model.

The problem is a multi-class classification problem, and we will model it using a softmax activation function on the output layer. This means that the model will predict a vector with three elements with the probability that the sample belongs to each of the three classes. Therefore, we must one hot encode the class values before we split the rows into the train and test datasets. We can do this using the Keras to_categorical() function.

# generate 2d classification dataset
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
# one hot encode output variable
y = to_categorical(y)
# split into train and test
n_train = 100
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
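For readers who have not seen one hot encoding before, the following sketch shows the transformation on a handful of labels. It is a plain-Python illustration of the idea, not the Keras implementation; the one_hot() helper is a hypothetical name introduced here.

```python
from typing import List, Sequence


def one_hot(labels: Sequence[int], n_classes: int) -> List[List[float]]:
    """Turn integer class labels into rows with a single 1.0 per row."""
    return [[1.0 if c == label else 0.0 for c in range(n_classes)] for label in labels]


# Three samples belonging to classes 0, 2, and 1 of a three-class problem.
encoded = one_hot([0, 2, 1], n_classes=3)
for row in encoded:
    print(row)
```

Each row has the same length as the softmax output of the model, which is what lets the loss function compare the prediction to the target element by element.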

Next, we can define and compile the model.

The model will expect samples with two input variables. The model then has a single hidden layer with 25 nodes and a rectified linear activation function, then an output layer with three nodes to predict the probability of each of the three classes and a softmax activation function.

Because the problem is multi-class, we will use the categorical cross entropy loss function to optimize the model and stochastic gradient descent with a small learning rate and momentum.

# define model
model = Sequential()
model.add(Dense(25, input_dim=2, activation='relu'))
model.add(Dense(3, activation='softmax'))
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
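As a rough illustration of what the output layer and loss compute, the sketch below applies softmax to three raw scores and evaluates categorical cross entropy against a one hot target. The function names and the example scores are assumptions made for illustration; Keras performs the equivalent computation internally on batches.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def categorical_crossentropy(y_true, y_prob):
    """Cross entropy between a one hot target and predicted probabilities."""
    return float(-np.sum(y_true * np.log(y_prob)))

logits = np.array([1.0, 2.0, 3.0])   # hypothetical raw outputs for three classes
probs = softmax(logits)
target = np.array([0.0, 0.0, 1.0])   # the sample belongs to the third class
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

With a one hot target, the loss reduces to the negative log of the probability assigned to the true class, so confident correct predictions give a loss near zero.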

The model is fit for 200 training epochs and we will evaluate the model each epoch on the test set, using the test set as a validation set.

# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)

At the end of the run, we will evaluate the performance of the model on the train and test sets.

# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

Then finally, we will plot learning curves of the model accuracy over each training epoch on both the training and validation datasets.

# learning curves of model accuracy
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()

Tying all of this together, the complete example is listed below.

# develop an mlp for blobs dataset
from sklearn.datasets.samples_generator import make_blobs
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
# one hot encode output variable
y = to_categorical(y)
# split into train and test
n_train = 100
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(25, input_dim=2, activation='relu'))
model.add(Dense(3, activation='softmax'))
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# learning curves of model accuracy
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()

Running the example prints the performance of the final model on the train and test datasets.

Your specific results will vary (by design!) given the high variance nature of the model.

In this case, we can see that the model achieved about 84% accuracy on the training dataset, which we know is optimistic, and about 80% on the test dataset, which we would expect to be more realistic.

Train: 0.840, Test: 0.796

A line plot is also created showing the learning curves for the model accuracy on the train and test sets over each training epoch.

We can see that training accuracy is more optimistic over most of the run as we also noted with the final scores.

Line Plot Learning Curves of Model Accuracy on Train and Test Dataset over Each Training Epoch

Next, we can look at how to implement an aggressive learning rate schedule.

Cosine Annealing Learning Rate

An effective snapshot ensemble requires training a neural network with an aggressive learning rate schedule.

The cosine annealing schedule is an example of an aggressive learning rate schedule where learning rate starts high and is dropped relatively rapidly to a minimum value near zero before being increased again to the maximum.

We can implement the schedule as described in the 2017 paper “ Snapshot Ensembles: Train 1, get M for free .” The equation requires the total training epochs, maximum learning rate, and number of cycles as arguments as well as the current epoch number. The function then returns the learning rate for the given epoch.

Equation for the Cosine Annealing Learning Rate Schedule

Where a(t) is the learning rate at epoch t, a0 is the maximum learning rate, T is the total epochs, M is the number of cycles, mod is the modulo operation, and square brackets indicate a floor operation.

Taken from “Snapshot Ensembles: Train 1, get M for free”.
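Since the equation itself is not reproduced here, the following is a sketch written out from the symbol definitions above and from the cosine_annealing() function used in this tutorial. It assumes t is counted from zero, as in the code; the exact indexing and rounding used in the paper may differ.

```latex
a(t) = \frac{a_0}{2} \left( \cos\left( \frac{\pi \cdot \operatorname{mod}\left(t, \lfloor T/M \rfloor\right)}{\lfloor T/M \rfloor} \right) + 1 \right)
```

For example, with T = 100, M = 5 and a0 = 0.01, each cycle is 20 epochs long, so a(0) = 0.01, a(10) = 0.005, and a(20) returns to 0.01 at the start of the second cycle.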

The function cosine_annealing() below implements the equation.

# cosine annealing learning rate schedule
def cosine_annealing(epoch, n_epochs, n_cycles, lrate_max):
    epochs_per_cycle = floor(n_epochs/n_cycles)
    cos_inner = (pi * (epoch % epochs_per_cycle)) / (epochs_per_cycle)
    return lrate_max/2 * (cos(cos_inner) + 1)

We can test this implementation by plotting the learning rate over 100 epochs with five cycles (i.e. each 20 epochs long) and a maximum learning rate of 0.01. The complete example is listed below.

# example of a cosine annealing learning rate schedule
from matplotlib import pyplot
from math import pi
from math import cos
from math import floor
# cosine annealing learning rate schedule
def cosine_annealing(epoch, n_epochs, n_cycles, lrate_max):
    epochs_per_cycle = floor(n_epochs/n_cycles)
    cos_inner = (pi * (epoch % epochs_per_cycle)) / (epochs_per_cycle)
    return lrate_max/2 * (cos(cos_inner) + 1)
# create learning rate series
n_epochs = 100
n_cycles = 5
lrate_max = 0.01
series = [cosine_annealing(i, n_epochs, n_cycles, lrate_max) for i in range(n_epochs)]
# plot series
pyplot.plot(series)
pyplot.show()

Running the example creates a line plot of the learning rate schedule over 100 epochs.

We can see that the learning rate starts at the maximum value at epoch 0 and decreases rapidly to epoch 19, before being reset at epoch 20, the start of the next cycle. The cycle is repeated five times as specified in the argument.

Line Plot of Cosine Annealing Learning Rate Schedule
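The shape of the plot can also be checked numerically. The sketch below repeats the cosine_annealing() function so that it runs on its own and prints the learning rate at a few selected epochs of the 100-epoch, five-cycle configuration.

```python
from math import pi, cos, floor

def cosine_annealing(epoch, n_epochs, n_cycles, lrate_max):
    epochs_per_cycle = floor(n_epochs / n_cycles)
    cos_inner = (pi * (epoch % epochs_per_cycle)) / epochs_per_cycle
    return lrate_max / 2 * (cos(cos_inner) + 1)

# start of cycle, middle of cycle, last epoch of cycle, start of next cycle
for epoch in (0, 10, 19, 20):
    print(epoch, cosine_annealing(epoch, 100, 5, 0.01))
```

The rate is at its maximum at epoch 0, half the maximum at epoch 10, close to zero at epoch 19, and back at the maximum at epoch 20.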

We can implement this schedule as a custom callback in Keras. This allows the parameters of the schedule to be specified and for the learning rate to be logged so we can ensure it had the desired effect.

A custom callback can be defined as a Python class that extends the Keras Callback class.

In the class constructor, we can take the required configuration as arguments and save them for use, specifically the total number of training epochs, the number of cycles for the learning rate schedule, and the maximum learning rate.

We can use our cosine_annealing() defined above to calculate the learning rate for a given training epoch.

The Callback class allows an on_epoch_begin() function to be overridden that will be called prior to each training epoch. We can override this function to calculate the learning rate for the current epoch and set it in the optimizer. We can also keep track of the learning rate in an internal list.

The complete custom callback is defined below.

# define custom learning rate schedule
class CosineAnnealingLearningRateSchedule(Callback):
    # constructor
    def __init__(self, n_epochs, n_cycles, lrate_max, verbose=0):
        self.epochs = n_epochs
        self.cycles = n_cycles
        self.lr_max = lrate_max
        self.lrates = list()
    # calculate learning rate for an epoch
    def cosine_annealing(self, epoch, n_epochs, n_cycles, lrate_max):
        epochs_per_cycle = floor(n_epochs/n_cycles)
        cos_inner = (pi * (epoch % epochs_per_cycle)) / (epochs_per_cycle)
        return lrate_max/2 * (cos(cos_inner) + 1)
    # calculate and set learning rate at the start of the epoch
    def on_epoch_begin(self, epoch, logs=None):
        # calculate learning rate
        lr = self.cosine_annealing(epoch, self.epochs, self.cycles, self.lr_max)
        # set learning rate
        backend.set_value(self.model.optimizer.lr, lr)
        # log value
        self.lrates.append(lr)

We can create an instance of the callback and set the arguments. We will train the model for 400 epochs and set each cycle to be 50 epochs long, giving 400 / 50 = 8 cycles; a cycle length of 50 epochs is a suggestion made and configuration used throughout the snapshot ensembles paper.

We lower the learning rate at a very fast pace, encouraging the model to converge towards its first local minimum after as few as 50 epochs.

― Snapshot Ensembles: Train 1, get M for free , 2017.

The paper also suggests that the learning rate can be set each sample or each mini-batch instead of prior to each epoch to give more nuance to the updates, but we will leave this as a future exercise.

… we update the learning rate at each iteration rather than at every epoch. This improves the convergence of short cycles, even when a large initial learning rate is used.

― Snapshot Ensembles: Train 1, get M for free , 2017.
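As a starting point for that exercise, here is a minimal sketch of the same schedule indexed by update step (iteration) rather than by epoch. The function name, the batch size of 32, and the resulting four updates per epoch are assumptions made for illustration; this is not code from the paper.

```python
from math import pi, cos, floor

def cosine_annealing_per_iteration(iteration, n_iterations, n_cycles, lrate_max):
    """Same cosine schedule, but advanced once per weight update."""
    iters_per_cycle = floor(n_iterations / n_cycles)
    cos_inner = (pi * (iteration % iters_per_cycle)) / iters_per_cycle
    return lrate_max / 2 * (cos(cos_inner) + 1)

# assumed setup: 100 training rows with a batch size of 32 gives 4 updates per epoch
n_epochs, batches_per_epoch, n_cycles = 400, 4, 8
n_iterations = n_epochs * batches_per_epoch
schedule = [cosine_annealing_per_iteration(i, n_iterations, n_cycles, 0.01)
            for i in range(n_iterations)]
print(len(schedule), schedule[0], schedule[199], schedule[200])
```

In Keras, one way to apply such a schedule would be to compute the rate inside a callback's on_batch_begin() method while keeping a running count of updates, in the same way the callback above uses on_epoch_begin().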

Once the callback is instantiated and configured, we can specify it as part of the list of callbacks to the call to the fit() function to train the model.

# define learning rate callback
n_epochs = 400
n_cycles = n_epochs / 50
ca = CosineAnnealingLearningRateSchedule(n_epochs, n_cycles, 0.01)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=n_epochs, verbose=0, callbacks=[ca])

At the end of the run, we can confirm that the learning rate schedule was performed by plotting the contents of the lrates list.

# plot learning rate
pyplot.plot(ca.lrates)
pyplot.show()

Tying these elements together, the complete example of training an MLP on the blobs problem with a cosine annealing learning rate schedule is listed below.

# mlp with cosine annealing learning rate schedule on blobs problem
from sklearn.datasets.samples_generator import make_blobs
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import Callback
from keras.optimizers import SGD
from keras import backend
from math import pi
from math import cos
from math import floor
from matplotlib import pyplot
# define custom learning rate schedule
class CosineAnnealingLearningRateSchedule(Callback):
    # constructor
    def __init__(self, n_epochs, n_cycles, lrate_max, verbose=0):
        self.epochs = n_epochs
        self.cycles = n_cycles
        self.lr_max = lrate_max
        self.lrates = list()
    # calculate learning rate for an epoch
    def cosine_annealing(self, epoch, n_epochs, n_cycles, lrate_max):
        epochs_per_cycle = floor(n_epochs/n_cycles)
        cos_inner = (pi * (epoch % epochs_per_cycle)) / (epochs_per_cycle)
        return lrate_max/2 * (cos(cos_inner) + 1)
    # calculate and set learning rate at the start of the epoch
    def on_epoch_begin(self, epoch, logs=None):
        # calculate learning rate
        lr = self.cosine_annealing(epoch, self.epochs, self.cycles, self.lr_max)

↧

How to Level up Dev Teams

January 4, 2019, 4:02 am


One question that clients frequently ask: how do you effectively level up development teams? How do you take a group of engineers who have never written Python and make them effective Python developers? How do you take a group who has never built distributed systems and have them build reliable, fault-tolerant microservices? What about a team who has never built anything in the cloud that is now tasked with building cloud software?

Some say training will level up teams. Bring in a firm that can teach us how to write effective Python or how to build cloud software. Run developers through a bootcamp; throw raw, undeveloped talent in one end and out pop prepared and productive engineers on the other.

My question to those who advocate this is: when do you know you’re ready ? Once you’ve completed a training course? Is the two-day training enough or should we opt for the three-day one? The six-month pair-coding boot camp? You might be more ready than you were before, but you also spent piles of cash on training programs, not to mention the opportunity cost of having a team of expensive engineers sit in multi-day or multi-week workshops. Are the trade-offs worth it? Perhaps, but it’s hard to say. And what happens when the next new thing comes along? We have to start the whole process over again.

Others say tools will help level up teams. A CI/CD pipeline will make developers more effective and able to ship higher quality software faster. Machine learning products will make our on-call experience more manageable. Serverless will make engineers more productive. Automation will improve our company’s slow and bureaucratic processes.

This one’s simple: tools are often band-aids for broken or inefficient policies, and policies are organizational scar tissue . Tools can be useful, but they will not fix your broken culture and they certainly will not level up your teams, only supplement them at best.

Yet others say developer practices will level up teams. Teams doing pair programming or test-driven development (TDD) will level up faster and be more effective―or scrum, or agile, or mob programming. Teams not following these practices just aren’t ready, and it will take them longer to become ready.

These things can help, but they don’t actually matter that much . If this sounds like blasphemy to you, you might want to stop and reflect on that dogma for a bit. I have seen teams that use scrum, pair programming, and TDD write terrible software. I have seen teams that don’t write unit tests write amazing software. I have seen teams implement DevOps on-prem, and I have seen teams completely silo ops and dev in the cloud. These are tools in the toolbox that teams can choose to leverage, but they will not magically make a team ready or more effective. The one exception to this is code reviews by non-authors.

Code reviews are the one practice that helps improve software quality, and there is empirical data to support this. Pair programming can be a great way to mentor junior engineers and ensure someone else understands the code, but it’s not a replacement for code reviews. It’s just as easy to come up with a bad idea working by yourself as it is working with another person, but when you bring in someone uninvolved with outside perspective, they’re more likely to realize it’s a bad idea.

Code reviews are an effective way to quickly level up teams provided you have a few pockets of knowledgeable reviewers to bootstrap the process (which, as a corollary, means high-performing teams should occasionally be broken up to seed the rest of the organization). They provide quick feedback to developers who will eventually internalize it and then instill it in their own code reviews. Thus, it quickly spreads expertise. Leveling up becomes contagious.

I experienced this firsthand when I started working at Workiva. Having never written a single line of Python and having never used Google App Engine before, I joined a company whose product was predominantly written in Python and running on Google App Engine. Within the span of a few months, I became a fairly proficient Python developer and quite knowledgeable of App Engine and distributed systems practices. I didn’t do any training. I didn’t read any books. I rarely pair-coded. It was through code reviews (and, in particular, group code reviews!) alone that I leveled up. And it’s why we were ruthless on code reviews , which often caught new hires off guard. Using this approach, Workiva effectively took a team of engineers with virtually no Python or cloud experience, shipped a cloud-based SaaS product written in Python, and then IPO’d in the span of a few years.

Code reviews promote a culture which separates ego from code. People are naturally threatened by criticism, but with a culture of code reviews, we critique code, not people. Code reviews are also a good way to share context within a team. When other people review your code, they get an idea of what you’re up to and where you’re at. Code reviews provide a pulse to your team, and that can help when a teammate needs to context switch to something you were working on.

They are also a powerful way to scale other functions of product development. For example, one area many companies struggle with is security. InfoSec teams are frequently a bottleneck for R&D organizations and often resource-constrained. By developing a security-reviewer program, we can better scale how we approach security and compliance. Require security-sensitive changes to undergo a security review. In order to become a security reviewer, engineers must go through a security training program which must be renewed annually. Google takes this idea even further , having certifications for different areas like “JS readability.”

This is why our consulting at Real Kinetic emphasizes mentorship and building a culture of continuous improvement. It’s also why we bring a bias to action. We talk to companies who want to start adopting new practices and technologies but feel their teams aren’t prepared enough. Here’s the reality: you will never feel fully prepared because you can never be fully prepared. As John Gall points out, the best an army can do is be fully prepared to fight the previous war. This is where being agile does matter, but agile only in the sense of reacting and pivoting quickly.

Nothing is a replacement for experience. You don’t become a professional athlete by watching professional sports on TV. You don’t build reliable cloud software by reading about it in books or going to trainings. To be clear, these things can help , but they aren’t strategies. Similarly, developer practices can help, but they aren’t prerequisites. And more often than not, they become emotional or philosophical debates rather than objective discussions. Teams need to be given the latitude to experiment and make mistakes in order to develop that experience. They need to start doing .

The one exception is code reviews. This is the single most effective way to level up development teams. Through rigorous code reviews, quick iterations, and doing , your teams will level up faster than any training curriculum could achieve. Invest in training or other resources if you think they will help, but mandate code reviews on changes before merging into master. Along with regular retros, this is a foundational component to building a culture of continuous improvement. Expertise will start to spread like wildfire within your organization.

Follow @tyler_treat

↧

I Deliver a Machine Learning Workshop to 120 People

January 4, 2019, 9:04 am


I work in the Research division at a large tech company. One of the other divisions in my company asked me if I’d do a two-hour hands-on machine learning workshop. I would normally have to decline such a request because I’m super busy with my regular work duties. But in this case, I said yes because I have a similar workshop coming up and so I figured I’d use the requested workshop as a trial run for the other workshop.

Based on previous workshop experiences, I knew I wanted the attendees to install all the needed ML software during the workshop. Installing compatible versions of Python, TensorFlow, Keras, PyTorch, CNTK, NumPy, and the dozens of other necessary systems is an absolute nightmare. Also, the installation process itself has all kinds of valuable information associated with it.

But installing all this software is difficult because almost everything assumes you have a live Internet connection to pull files as needed. My work building has extraordinarily good wireless network connectivity but even so, the installation file sizes can be extremely large and anything more than about a dozen people installing at once will bring the network to its knees.


I explained what neural networks are during the 15 minutes it took to install Anaconda.

So, a couple of weeks before the workshop, I set out to create an installation process that doesn’t require an Internet connection. That was quite a chore but I finally figured it all out.

So, the day of the workshop came and the room had over 120 people. I was a bit nervous but these guys were extremely sharp and they were able to follow along with the installation directions.

In the second hour, I did the classic Iris Dataset example using Keras. Some of my experienced colleagues scoff when I tell them I always start with Iris. They say, “Sheesh, everyone uses Iris.” And I reply that, yes, that’s exactly the point. The Iris data is used for a good reason ― it’s not too large and it provides a common example.

I don’t do it consciously, but I tend to wave my hands around a lot when I give a talk.

Anyway, the workshop went surprisingly well. My impression was the same as it always is when I do a beginner’s workshop ― the key is not what to teach, but what to leave out. Even the Iris program, which is only about 100 lines of code, is incredibly dense conceptually. For example, initializing the weights is one or two lines of code. But a full explanation of weight and bias initialization, even the five basic algorithms, (uniform, Gaussian, Glorot uniform, Glorot normal, He) could easily be a two-hour lecture all by itself. But beginners will get bogged to a halt if an instructor goes into too much detail.
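To give a sense of how much sits behind those one or two lines, here is a minimal NumPy sketch of three of the initializers named above. The function names are hypothetical, and plain normal sampling is used for simplicity; library implementations may differ in detail (for example, by truncating the normal distribution).

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    # uniform in [-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out):
    # zero-mean normal with standard deviation sqrt(2 / (fan_in + fan_out))
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # zero-mean normal with standard deviation sqrt(2 / fan_in)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# weights for a layer with 4 inputs (as in Iris) and 5 hidden nodes
w = glorot_uniform(4, 5)
print(w.shape)
```

Each scheme only changes how the spread of the initial weights depends on the layer's fan-in and fan-out, which is exactly the kind of detail that is easy to leave out of a beginner session.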

Anyway, there’s no real moral to the story. Giving a training workshop or a lecture is formal education. But in some ways, most interactions we have with other people are informal education.

↧

Ugly soup with Python, requests & Beautiful Soup

January 4, 2019, 9:02 am


Web scraping has never been a coveted nor favorite discipline of mine; in fact, for me web scraping is an unfortunate, but sometimes necessary evil. Scraping web-pages, at least for me, is a very unstructured process, basically pure Trial & Error. Perhaps, with more experience, more interest in html and the www in general, it might become more likable, akin to the other types of programming I really enjoy doing… Who knows….

Anyway, I wanted to scrape some data for further analysis down the line. Normally, I’d put Pandas to heavy use for a lot of the data munging tasks, but since my neighborhood is currently experiencing the Mother of all Power outages, after the storm of the century this past Tuesday, I don’t have power to run my computer; thus, this stuff is done with Pythonista (2!) on my old & tired iPad. Btw, Pythonista is a really great Python environment for i*, and I guess some day I should upgrade to Pythonista3…. But I sure miss Pandas; it makes structured data manipulation so very much more convenient than using lists or even numpy arrays….!

So, for this exercise: I wanted to collect the results from a very famous long distance cross country ski race, the Marcialonga, which takes place each January in my favorite place on this earth, Val di Fiemme, in the Dolomites. I’d have preferred for the organisers of the race to publish the results in a more easy-to-grab format than HTML, but I couldn’t find anything else but the web page, which furthermore splits the 5600 result entries across 56 pages… It took a while to figure out how to scrape multiple linked pages.

Below are a couple of screenshots of an initial, very basic analysis. I’ll do quite a bit more statistical analysis if and when the power grid resumes operations…


# coding: utf-8
import requests
from requests.utils import quote
import re
from bs4 import BeautifulSoup as bs
import numpy as np
from datetime import datetime,timedelta
import matplotlib.pyplot as plt
comp_list = []
# results are in 56 separate pages
for page in range(1,57):
    print page
    url = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'
    payload = {'pagenum':page}
    print url
    print payload
    r = requests.get(url,params=payload)
    print r.status_code
    #print r.text
    c = r.content
    soup = bs(c,'html.parser')
    #print soup.prettify()
    tbl= soup.find('table')
    #print tbl
    main_table = tbl
    #print main_table
    #print main_table.get_text()
    competitors = main_table.find_all(class_='SP')
    for comp in competitors:
        comp_list.append(comp.get_text())
comp_list = list(map(lambda x : x.encode('utf-8'),comp_list))
#print comp_list
print len(comp_list)
def parse_item(i):
    res_pattern = r'[0-9]+'
    char_pattern = r'[A-Z]+'
    num_pattern = r'[0-9]+:[0-9]+:[0-9]+\.[0-9]'
    age_pattern = r'[0-9]+/'
    res =re.match(res_pattern,i).group()
    chars = re.findall(char_pattern,i,flags=re.IGNORECASE)
    nat = chars[-1][1:]
    nums = re.findall(num_pattern,i)
    age = re.findall(age_pattern,i)
    age_over=age[0][:-1]
    time_pattern = r'0[0-9]:[0-9]+:[0-9]+\.[0-9]$'
    time = re.findall(time_pattern,nums[0])
    t = datetime.strptime(time[0],'%H:%M:%S.%f')
    #t = datetime.strftime(t,'%H:%M:%S.%f')
    td = (t - datetime(1900,1,1)).total_seconds()
    name = chars[0] + ' ' + chars[1]
    gender = chars[-2]
    return (res, name,nat,gender,age_over,td)
results = []
for comp in comp_list:
    results.append(parse_item(comp))
'''
results = np.array(results,dtype=[('res','i4'),('name','U100'),('nat','U3'),('gender','U1'),('age','i4'),('time','datetime64[us]')])
'''
gender = np.array([results[i][3] for i in range(len(results))])
pos = np.array([results[i][0] for i in range(len(results))])
ages = np.array([results[i][4] for i in range(len(results))]).astype(int)
secs = np.array([results[i][5] for i in range(len(results))])
print ages.size
print secs.size
friend_1 = 18844
friend_2= 24446
male_mask = gender=='M'
male_secs = secs[male_mask]
female_secs = secs[~male_mask]
male_mean = male_secs.mean()
female_mean = female_secs.mean()
bins=range(10000,38000,1000)
plt.subplot(211)
plt.hist(male_secs,color='b',weights=np.zeros_like(male_secs) + 1. / male_secs.size,alpha=0.5,label='Men',bins=bins)
plt.hist(female_secs,color='r',weights=np.zeros_like(female_secs) + 1. / female_secs.size,alpha=0.5,label='Women',bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Relative Frequency')
plt.axvline(friend_1,ls='dashed',color='cyan',label='Friend_1',lw=5)
plt.axvline(friend_2,ls='dashed',color='magenta',label='Friend_2',lw=5)
plt.axvline(male_mean,ls='dashed',color='darkblue',label='Men mean',lw=5)
plt.axvline(female_mean,ls='dashed',color='darkred',label='Women mean',lw=5)
plt.legend(loc='upper right')
def colors(x):
    if gender[x] == 'M':
        return 'b'
    else:
        return 'r'
print secs.min(),secs.max()
colormap = list(map(colors,range(len(gender))))
plt.subplot(212)
plt.hist(male_secs,color='b',alpha=0.5,label='Men',bins=bins)
plt.hist(female_secs,color='r',alpha=0.5,label='Women',bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Nr of Skiers')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()
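The step in parse_item() that turns a finish time into seconds can be tried on its own. The sketch below isolates it, written for Python 3; the helper name and the two sample time strings are hypothetical, chosen so that the results equal the friend_1 and friend_2 values used in the script.

```python
from datetime import datetime

def finish_time_to_seconds(text):
    """Convert a time string such as '05:14:04.0' to seconds."""
    # strptime fills in 1900-01-01 as the date when none is given,
    # so subtracting that date leaves only the elapsed time
    t = datetime.strptime(text, '%H:%M:%S.%f')
    return (t - datetime(1900, 1, 1)).total_seconds()

print(finish_time_to_seconds('05:14:04.0'))
print(finish_time_to_seconds('06:47:26.0'))
```

Working in plain seconds is what lets the script feed the times directly into the histogram bins and the mean calculations.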

the end :wink:

↧

Python Data: Quick Tip: Comparing two pandas dataframes and getting the differen ...

January 4, 2019, 9:00 am


There are times when working with different pandas dataframes that you might need to get the data that is ‘different’ between the two dataframes (i.e., comparing two pandas dataframes and getting the differences). This seems like a straightforward issue, but apparently it’s still a popular ‘question’ for many people and is my most popular question on stackoverflow.

As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?

dataframe 1 (named df1):

Date        Fruit   Num   Color
2013-11-24  Banana  22.1  Yellow
2013-11-24  Orange   8.6  Orange
2013-11-24  Apple    7.6  Green
2013-11-24  Celery  10.2  Green

dataframe 2 (named df2):

Date        Fruit   Num   Color
2013-11-24  Banana  22.1  Yellow
2013-11-24  Orange   8.6  Orange
2013-11-24  Apple    7.6  Green
2013-11-24  Celery  10.2  Green
2013-11-25  Apple   22.1  Red
2013-11-25  Orange   8.6  Orange

The answer, it seems, is quite simple but I couldn’t figure it out at the time. Thanks to the generosity of stackoverflow users, the answer (or at least an answer that works) is simply to concat the dataframes then perform a group-by via columns and finally re-index to get the unique records based on the index.

Here’s the code (as provided by user alko on stackoverflow):

df = pd.concat([df1, df2])  # concat dataframes
df = df.reset_index(drop=True)  # reset the index
df_gpby = df.groupby(list(df.columns))  # group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # index labels of rows that occur only once
df.reindex(idx)  # reindex

This simple approach leads to the correct answer:

         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red

There are most likely more ‘pythonic’ answers (one suggestion is here ) and I’d recommend you dig into those other approaches, but the above works, is easy to read and is fast enough for my needs.
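As one possible alternative (not the approach from the stackoverflow answer), the same rows can be found by concatenating the frames and dropping every row that occurs more than once. The sketch below rebuilds the two example frames with Date as an ordinary column, which is an assumption made to keep it self-contained.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
extra = pd.DataFrame({
    'Date': ['2013-11-25'] * 2,
    'Fruit': ['Apple', 'Orange'],
    'Num': [22.1, 8.6],
    'Color': ['Red', 'Orange'],
})
# df2 holds every row of df1 plus two new rows
df2 = pd.concat([df1, extra], ignore_index=True)

# keep=False drops all copies of any row that appears more than once
diff = pd.concat([df1, df2], ignore_index=True).drop_duplicates(keep=False)
print(diff)
```

Note that this gives a symmetric difference: rows that exist only in df1 would also be returned, and rows duplicated within a single frame would be dropped, so it fits best when df2 is df1 plus some additions.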

Want more information about pandas for data analysis? Check out the book Python for Data Analysis by the creator of pandas, Wes McKinney.

Author: Eric Brown

Eric D. Brown , D.Sc. has a doctorate in Information Systems with a specialization in Data Sciences, Decision Support and Knowledge Management. He writes about utilizing python for data analytics at pythondata.com and the crossroads of technology and strategy at ericbrown.com View all posts by Eric Brown

↧

Transposing a matrix into Python

January 4, 2019, 8:58 am


I'm trying to create a matrix transpose function in python. A matrix is a two dimensional array, represented as a list of lists of integers. For example, the following is a 2X3 matrix (meaning the height of the matrix is 2 and the width is 3):

A=[[1, 2, 3], [4, 5, 6]]

To be transposed, the jth item in the ith index should become the ith item in the jth index. Here's how the above sample would look transposed:

>>> transpose([[1, 2, 3], [4, 5, 6]])
[[1, 4], [2, 5], [3, 6]]
>>> transpose([[1, 2], [3, 4]])
[[1, 3], [2, 4]]

How can I do this?

You can use zip with * to get the transpose of a matrix:

>>> A = [[ 1, 2, 3],[ 4, 5, 6]]
>>> zip(*A)
[(1, 4), (2, 5), (3, 6)]
>>> lis = [[1,2,3],
... [4,5,6],
... [7,8,9]]
>>> zip(*lis)
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]

If you want the returned list to be a list of lists:

>>> [list(x) for x in zip(*lis)]
[[1, 4, 7], [2, 5, 8], [3, 6, 9]]
#or
>>> map(list, zip(*lis))
[[1, 4, 7], [2, 5, 8], [3, 6, 9]]
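The sessions above show Python 2 output, where zip() and map() return lists. In Python 3 both return lazy iterators, so a transpose() function matching the one asked for in the question could be sketched as follows.

```python
def transpose(matrix):
    """Return the transpose of a list-of-lists matrix as a new list of lists."""
    # zip(*matrix) pairs up the ith items of every row; list() makes each column a list
    return [list(column) for column in zip(*matrix)]

print(transpose([[1, 2, 3], [4, 5, 6]]))
print(transpose([[1, 2], [3, 4]]))
```

Because zip() stops at the shortest input, rows of unequal length would be silently truncated; the sketch assumes a rectangular matrix.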

↧

Python Lands on the Windows 10 App Store

January 4, 2019, 8:56 am



python Software Foundation recently released Python 3.7 as an app on the official windows 10 app store. Python 3.7 is now available to install from the Microsoft Store, meaning you no longer need to manually download and install the app from the official Python website.

Python joins a bunch of other advanced apps and tools that have been coming to the Windows 10 app store over the last year or so. A bunch of different linux distros are already available through the Microsoft Store, and the addition of Python should be welcomed by many developers. The app listing was first noticed by WalkingCat on Twitter.

Since Python is quite easy to learn and write programs with, the launch of Python on the Microsoft Store could also help students. Python is widely used by schools (especially in the UK) to teach kids basic programming skills, and being available from the Microsoft Store could really make installing it much easier than before. And as kids that are just getting into learning about computers and programming, they could easily get confused with the traditionalPython installer, and the availability from the Microsoft Store helps change that.

For advanced users, Python 3.7 from the Store will work fairly well in most cases. However, there is one limitation. “Because of restrictions on Microsoft Store apps, Python scripts may not have full write access to shared locations such as TEMP and the registry. Instead, it will write to a private copy. If your scripts must modify the shared locations, you will need to install the full installer,” the Python Software Foundation stated in the official Python Docs.
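One way to stay clear of this limitation is to let the standard library choose the scratch location instead of hard-coding a shared path. The sketch below uses only the tempfile and os modules. The helper name write_scratch_file is made up for illustration, and the sketch does not detect whether the Store sandbox has redirected the path to a private copy.

```python
import os
import tempfile


def write_scratch_file(text):
    """Write text to a new file in the interpreter's temp directory and return its path."""
    # tempfile asks the platform where temporary files belong, so the script
    # never hard-codes a shared location of its own.
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path


path = write_scratch_file("hello from Python")
print("temp directory:", tempfile.gettempdir())
print("scratch file:  ", path)
os.remove(path)  # clean up the illustration file
```

A script written this way reads back whatever it wrote, wherever the file ends up. It will not help if another program must find the file at a fixed shared path, which is the case the quoted note reserves for the full installer.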

You can get Python from the Microsoft Store here.


↧

Building a Sentiment Analysis Python Microservice with Flair and Flask

January 4, 2019, 8:54 am


Flair is an NLP framework built on top of PyTorch. It delivers state-of-the-art performance on NLP problems such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and text classification.

In this post, I will cover how to build a sentiment analysis microservice with the Flair and Flask frameworks.

Step 1: Create a Python 3.6 virtualenv

To use Flair you need Python 3.6. We will start by creating a Python 3.6 virtualenv:

$ python3.6 -m venv pyeth

Next, we activate the virtualenv

$ source pyeth/bin/activate

Next, you can check the Python version:

(pyeth) $ python --version
Python 3.6.1

Step 2: Install the flair and flask packages

To install Flair and Flask we will use pip as shown below:

$ pip install flair flask

The above command will install all the packages needed to build our microservice. It will also install PyTorch, which Flair uses to do the heavy lifting.

Step 3: Create a REST API to analyse sentiments

Create a new file called app.py under the application directory.

$ touch app.py

Copy the following source code and paste it into the app.py source file:

from flask import abort, Flask, jsonify, request
from flair.models import TextClassifier
from flair.data import Sentence

app = Flask(__name__)

classifier = TextClassifier.load('en-sentiment')

@app.route('/api/v1/analyzeSentiment', methods=['POST'])
def analyzeSentiment():
    if not request.json or 'message' not in request.json:
        abort(400)
    message = request.json['message']
    sentence = Sentence(message)
    classifier.predict(sentence)
    print('Sentence sentiment: ', sentence.labels)
    label = sentence.labels[0]
    response = {'result': label.value, 'polarity':label.score}
    return jsonify(response), 200

if __name__ == "__main__":
    app.run()

The code shown above does the following:

- It imports the Flask classes and functions.
- Next, it imports the TextClassifier and Sentence classes from the flair package.
- Next, it loads the sentiment analysis model, en-sentiment. The model is based on the IMDB dataset. When the TextClassifier.load call (line 7) runs, it downloads the model and stores it in the .flair subfolder of the home directory. This will take a few minutes depending on your internet speed.
- Next, it defines a POST route mapped to the /api/v1/analyzeSentiment URL. This API endpoint receives the message in a JSON body.
- It creates an instance of Sentence and passes it to the classifier's predict method. The result is a label object with two fields, value and score, which the endpoint returns under the result and polarity keys.

You can now start the app using python app.py

Once the application has started, you can test the REST API using your favourite REST client. I will show how to call the REST API using cURL.

We will first check a positive review: "I could watch The Marriage over and over again. At 90 minutes, it's just so delightfully heartbreaking."

curl --request POST \
  --url http://localhost:5000/api/v1/analyzeSentiment \
  --header 'content-type: application/json' \
  --data '{ "message":"I could watch The Marriage over and over again. At 90 minutes, it'\''s just so delightfully heartbreaking." }'

The response returned by the API:

{
  "polarity": 1.0,
  "result": "POSITIVE"
}

Let’s now look at an example of a negative sentence: "Inoffensive and unremarkable."

curl --request POST \
  --url http://localhost:5000/api/v1/analyzeSentiment \
  --header 'content-type: application/json' \
  --data '{ "message":"Inoffensive and unremarkable." }'

The response returned by the API:

{
  "polarity": 1.0,
  "result": "NEGATIVE"
}

Finally, let’s look at a mixed review: "I don't think Destroyer is a good movie, but it is never boring and often hilarious."

curl --request POST \
  --url http://localhost:5000/api/v1/analyzeSentiment \
  --header 'content-type: application/json' \
  --data '{ "message":"I don'\''t think Destroyer is a good movie, but it is never boring and often hilarious." }'

The response returned by the API:

{ "polarity": 0.11292144656181335, "result": "NEGATIVE" }
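The same requests can be sent from Python instead of cURL. The following is a minimal sketch using only the standard library. It assumes the service is reachable at the default Flask development address used in the cURL commands, and the helper names build_request and analyze_sentiment are illustrative and not part of the original app.

```python
import json
import urllib.request

# Same address as in the cURL commands (Flask's default development server).
API_URL = "http://localhost:5000/api/v1/analyzeSentiment"


def build_request(message, url=API_URL):
    """Build the POST request that the cURL commands send."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_sentiment(message, url=API_URL):
    """Send the message to the running service and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(message, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network, so it can be inspected freely.
request = build_request("Inoffensive and unremarkable.")
print(request.get_method(), request.full_url)
```

With app.py running, calling analyze_sentiment("Inoffensive and unremarkable.") should return a dictionary with the same result and polarity keys as the JSON responses shown above.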

That’s it for today.

↧

Vasudev Ram: Multiple item search in an unsorted list in Python

January 4, 2019, 8:52 am


- By Vasudev Ram - Online Python training / SQL training / Linux training


Hi, readers,

I was reviewing simple algorithms with a view to using some as examples or exercises in my Python programming course. While doing so, I thought of enhancing simple linear search for one item in a list, to make it search for multiple items.

Here are a couple of program versions I wrote for that task. They use straightforward logic. There are just a few additional points:

- In both programs, I use a generator to yield the values found (the index and the item).

- In the first program, I print out the index and item for each item found.

- In the second program, I mark where the items are found with text "arrows".

This is the first program, mult_item_search_unsorted_list.py:

# mult_item_search_unsorted_list.py
# Purpose: To search for multiple items in an unsorted list.
# Prints each item found and its index.
# Author: Vasudev Ram
# Copyright 2019 Vasudev Ram
# Training: https://jugad2.blogspot.com/p/training.html
# Blog: https://jugad2.blogspot.com
# Web site: https://vasudevram.github.io
# Product store: https://gumroad.com/vasudevram

from __future__ import print_function
import sys
from random import sample, shuffle

def mult_item_search_unsorted_list(dlist, slist):
    for didx, ditem in enumerate(dlist):
        for sitem in slist:
            if sitem == ditem:
                yield (didx, ditem)

def main():
    # Create the search list (slist) with some items that will be found
    # and some that will not be found in the data list (dlist) below.
    slist = sample(range(0, 10), 3) + sample(range(10, 20), 3)
    # Create the data list.
    dlist = list(range(10))  # list() so that shuffle() also works on Python 3
    for i in range(3):
        # Mix it up, DJ.
        shuffle(slist)
        # MIX it up, DEK.
        shuffle(dlist)
        print("\nSearching for:", slist)
        print(" in:", dlist)
        for didx, ditem in mult_item_search_unsorted_list(dlist, slist):
            print(" found {} at index {}".format(ditem, didx))

main()

Output of a run:

$ python mult_item_search_unsorted_list.py

Searching for: [1, 18, 3, 15, 19, 4]
 in: [8, 9, 1, 2, 0, 7, 5, 3, 6, 4]
 found 1 at index 2
 found 3 at index 7
 found 4 at index 9

Searching for: [4, 19, 18, 15, 1, 3]
 in: [7, 5, 8, 2, 9, 4, 0, 3, 6, 1]
 found 4 at index 5
 found 3 at index 7
 found 1 at index 9

Searching for: [1, 3, 4, 18, 19, 15]
 in: [9, 6, 1, 8, 7, 4, 3, 0, 2, 5]
 found 1 at index 2
 found 4 at index 5
 found 3 at index 6

And this is the second program, mult_item_search_unsorted_list_w_arrows.py:

# mult_item_search_unsorted_list_w_arrows.py
# Purpose: To search for multiple items in an unsorted list.
# Marks the position of the items found with arrows.
# Author: Vasudev Ram
# Copyright 2019 Vasudev Ram
# Training: https://jugad2.blogspot.com/p/training.html
# Blog: https://jugad2.blogspot.com
# Web site: https://vasudevram.github.io
# Product store: https://gumroad.com/vasudevram

from __future__ import print_function
import sys
from random import sample, shuffle

def mult_item_search_unsorted_list(dlist, slist):
    for didx, ditem in enumerate(dlist):
        for sitem in slist:
            if sitem == ditem:
                yield (didx, ditem)

def main():
    # Create the search list (slist) with some items that will be found
    # and some that will not be found in the data list (dlist) below.
    slist = sample(range(10), 4) + sample(range(10, 20), 4)
    # Create the data list.
    dlist = list(range(10))  # list() so that shuffle() also works on Python 3
    for i in range(3):
        # Mix it up, DJ.
        shuffle(slist)
        # MIX it up, DEK.
        shuffle(dlist)
        print("\nSearching for: {}".format(slist))
        # The 8-character prefix puts item i at column 9 + 3 * i,
        # which is where the arrow printed below ends.
        print("    in: {}".format(dlist))
        for didx, ditem in mult_item_search_unsorted_list(dlist, slist):
            print("---------{}^".format('---' * didx))

main()

Output of a run:

$ python mult_item_search_unsorted_list_w_arrows.py

Searching for: [16, 0, 15, 4, 6, 1, 10, 12]
    in: [8, 9, 0, 1, 5, 4, 7, 2, 6, 3]
---------------^
------------------^
------------------------^
---------------------------------^

Searching for: [6, 16, 10, 0, 1, 4, 12, 15]
    in: [2, 7, 0, 8, 1, 4, 6, 3, 9, 5]
---------------^
---------------------^
------------------------^
---------------------------^

Searching for: [0, 12, 4, 10, 6, 16, 1, 15]
    in: [8, 1, 0, 7, 9, 6, 2, 5, 4, 3]
------------^
---------------^
------------------------^
---------------------------------^
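Both programs compare every data item against every search item. As one possible variant (not the version from the post), the inner loop can be replaced by a set membership test, assuming the items are hashable. The function name mult_item_search_with_set is made up for this sketch, and the sample lists are taken from the first run of the first program.

```python
def mult_item_search_with_set(dlist, slist):
    """Yield (index, item) for each item of dlist that also occurs in slist."""
    # Building the set once replaces the inner loop over slist with an
    # average constant-time membership test. Items must be hashable.
    wanted = set(slist)
    for didx, ditem in enumerate(dlist):
        if ditem in wanted:
            yield (didx, ditem)


dlist = [8, 9, 1, 2, 0, 7, 5, 3, 6, 4]
slist = [1, 18, 3, 15, 19, 4]
for didx, ditem in mult_item_search_with_set(dlist, slist):
    print("found {} at index {}".format(ditem, didx))
```

One behavioural difference: if the search list contained duplicates, the nested-loop version would yield a match once per duplicate, while this variant yields each matching position once.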

In a recent post, Dynamic function creation at run time with Python's eval built-in, I had said:

"Did you notice any pattern to the values of g(i)? The values are 1, 4, 9, 16, 25 - which are the squares of the integers 1 to 5. But the formula I entered for g was not x * x, rather, it was x * x + 2 * x + 1. Then why are squares shown in the output? Reply in the comments if you get it, otherwise I will answer next time."

No reader commented with a solution. So here is a hint to figure it out:

What is the expansion of (a + b) ** 2 (a plus b the whole squared) in algebra?

Heh.

The drawing of the magnifying glass at the top of the post is by yours truly. (The same one that I used in this post: Command line D utility - find files matching a pattern under a directory.)

I'll leave you with another question: What, if any, could be the advantage of using Python generators in programs like these?

Notice that I said "programs like these", not "these programs".

Enjoy.

- Vasudev Ram - Online Python training and consulting



↧

Using Python to Pull Data from MS Graph API Part 1

January 4, 2019, 1:18 pm


Welcome to 2019 fellow geeks! I hope each of you had a wonderful holiday with friends and family.

It’s been a few months since my last post. As some of you may be aware I made a career move last September and took on a new role with a different organization. The first few months have been like drinking from multiple fire hoses at once and I’ve learned a ton. It’s been an amazing experience that I’m excited to continue in 2019.

One area I’ve been putting some focus on is learning the basics of Python. I’ve been a PowerShell guy (with a bit of C# thrown in there) for the past six years, so diving into a new language was a welcome change. I picked up a few books on the language, watched a few videos, and it wasn’t clicking. At that point I decided it was time to jump into the deep end and come up with a use case to build out a script for. Thankfully I had one queued up that I had started in PowerShell.

Early last year my wife’s Office 365 account was hacked. Thankfully no real damage was done minus some spam email that was sent out. I went through the wonderful process of changing her passwords across her accounts, improving the complexity and length, getting her on-boarded with a password management service, and enabling Azure MFA (Multi-factor Authentication) on her Office 365 account and any additional services she was using that supported MFA options. It was not fun.

Curious about what the logs would have shown, I had begun putting together a PowerShell script that was going to pull down the logs from Azure AD (Active Directory), extract the relevant data, and export it to CSV (comma-separated values) where I could play around with it in whatever analytics tool I could get my hands on. Unfortunately life happened and I never had a chance to finish the script or play with the data. This would be my use case for my first Python script.

Azure AD offers a few different types of logs, which Microsoft divides into a security pillar and an activity pillar. For my use case I was interested in looking at the reports in the activity pillar, specifically the Sign-ins report. This report is available for tenants with an Azure AD Premium P1 or P2 subscription (I added P2 subscriptions to our family accounts last year). The sign-in logs have a retention period of 30 days and are available either through the Azure Portal or programmatically through the MS Graph API (Application Programming Interface).
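To make the "pull the sign-ins and export them to CSV" idea concrete, here is a rough standard-library sketch. It is not the script from this series. The endpoint URL and the field names are assumptions based on the MS Graph sign-in resource, the helper names are made up, and acquiring the access token (for example with ADAL) is left out.

```python
import csv
import io
import urllib.request

# Assumed endpoint: the sign-ins collection of the MS Graph audit log API.
SIGNINS_URL = "https://graph.microsoft.com/v1.0/auditLogs/signIns"
# Assumed subset of sign-in properties worth exporting.
FIELDS = ["createdDateTime", "userPrincipalName", "appDisplayName", "ipAddress"]


def build_signins_request(access_token, url=SIGNINS_URL):
    """Build an authenticated GET request for the sign-ins report."""
    return urllib.request.Request(
        url, headers={"Authorization": "Bearer " + access_token}
    )


def signins_to_csv(records, fields=FIELDS):
    """Render sign-in records (dicts from the response's "value" list) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        # Keep only the chosen fields; missing keys become empty cells.
        writer.writerow({name: record.get(name, "") for name in fields})
    return buffer.getvalue()


sample = [{
    "createdDateTime": "2019-01-01T12:00:00Z",
    "userPrincipalName": "user@example.com",
    "appDisplayName": "Office 365",
    "ipAddress": "203.0.113.10",
    "id": "not-exported",
}]
print(signins_to_csv(sample))
```

Keeping the request building and the CSV rendering in separate functions fits the goal of reusable modules, and it lets the CSV step be tried on sample data without a tenant or a token.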

My primary goals were to create as much reusable code as possible and experiment with as many APIs/SDKs (Software Development Kits) as I could. This was accomplished by breaking the code into various reusable modules and leveraging AWS (Amazon Web Services) services for secure storage of Azure AD application credentials and cloud-based storage of the exported data. Going this route forced me to use the MS Graph API, Microsoft’s Azure Active Directory Library for Python (or ADAL for short), and Amazon’s Boto3 Python SDK.

On the AWS side I used AWS Systems Manager Parameter Store to store the Azure AD credentials as secure strings encrypted with an AWS KMS (Key Management Service) customer-managed customer master key (CMK). For cloud storage of the log files I used Amazon S3.
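As a hedged illustration of that AWS side, the sketch below wraps the two Boto3 calls involved: get_parameter on the SSM client with decryption enabled, and put_object on the S3 client. The helper names and any parameter, bucket or key names you pass in are hypothetical, and this is not the code from the actual script.

```python
def load_secret(ssm_client, name):
    """Fetch and decrypt a SecureString parameter from Parameter Store."""
    # WithDecryption=True asks SSM to decrypt the value with its KMS key.
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


def upload_log(s3_client, bucket, key, text):
    """Store exported log text as an object in an S3 bucket."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))


def make_clients():
    """Create real AWS clients; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported here so the helpers above can be exercised without it

    return boto3.client("ssm"), boto3.client("s3")
```

Passing the clients in as arguments keeps the helpers reusable across scripts and makes them easy to exercise with stand-in objects. In a real run, the two clients would come from make_clients().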

Lastly I needed a development environment and source control. For about a day I simply used Sublime Text on my Mac and saved the file to a personal cloud storage account. This was obviously not a great idea so I decided to finally get my GitHub repository up and running. Additionally I moved over to using AWS’s Cloud9 for my IDE (integrated development environment). Cloud9 has the wonderful perk of being web based and has the capability of creating temporary credentials that can do most of what my AWS IAM user can do. This made it simple to handle permissions to the various resources I was using.

Once the instance of Cloud9 was spun up I needed to set the environment up for Python 3 and add the necessary libraries. The AMI (Amazon Machine Image) used by the Cloud9 service to provision new instances includes both Python 2.7 and Python 3.6. This fact matters when adding the ADAL and Boto3 modules via pip because if you simply run a pip install module_name it will be installed for Python 2.7. Instead you’ll want to execute the command python3 -m pip install module_name which ensures that the two modules are installed in the appropriate location.
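A quick way to confirm that the modules landed under the interpreter you intend to use is to ask that interpreter directly. This is a small sketch using only the standard library. The module names adal and boto3 are the ones mentioned above, and the helper name module_available is made up for illustration.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter can import the named top-level module."""
    return importlib.util.find_spec(name) is not None


print("interpreter:", sys.executable)
print("version:    ", "{}.{}".format(*sys.version_info[:2]))
for name in ("adal", "boto3"):
    print("{:<6} installed: {}".format(name, module_available(name)))
```

Running the file with python3 reports on the Python 3 installation. A False next to a module name suggests it was installed for a different interpreter.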

In my next post I’ll walk through and demonstrate the script.

Have a great week!

↧

Essential Python Libraries for Machine Learning Worth Bookmarking

January 4, 2019, 1:16 pm


This article is a technical blog post compiled by AI 研习社 (AI Yanxishe). Original title:

Essential libraries for Machine Learning in Python

Author | Shubhi Asthana

Translation | 就2

Proofreading | 就2 Editing | 菠萝妹

Original link:

https://medium.freecodecamp.org/essential-libraries-for-machine-learning-in-python-82a9ada57aeb

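Python is often the language of choice for people who apply statistical techniques or analyze data. Data scientists also use Python to integrate their work with web applications and production environments.

Python really shines in machine learning. Its consistent syntax, shorter development time and flexibility make it well suited to developing sophisticated models and prediction engines that can plug directly into production systems.

One of Python's greatest assets is its extensive set of libraries.

A library is a collection of routines and functions written in a given language. A robust set of libraries makes it easier for developers to perform complex tasks without rewriting many lines of code.

Machine learning is largely based on mathematics, specifically mathematical optimization, statistics and probability. Python libraries help researchers and mathematicians who lack developer knowledge to "do machine learning" easily.

Below are some of the most commonly used libraries in machine learning: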

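Scikit-learn for classic ML algorithms

Scikit-learn is one of the most popular ML libraries. It supports many supervised and unsupervised learning algorithms, for example linear regression, logistic regression, decision trees, clustering and k-means.

It builds on two Python libraries, NumPy and SciPy. It provides a set of algorithms for common machine learning and data mining tasks: clustering, regression and classification. Even tasks such as data transformation, feature selection and ensemble learning can be implemented in a few lines of code.

For a newcomer to machine learning, Scikit-learn is a more than sufficient tool until you start implementing more complex algorithms yourself.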

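TensorFlow for Deep Learning

If you are in the world of machine learning, you have probably heard about, tried or implemented some form of deep learning algorithm. Are they necessary? Probably not. But do they feel cool once you get them working? Yes, very cool.

The interesting thing about TensorFlow is that when you write a program in Python, you can compile and run it on either your CPU or your GPU, so you do not need to write C++ or CUDA code to run on GPU clusters.

It uses a system of multi-layered nodes that lets you quickly set up, train and deploy artificial neural networks with large datasets. This is what allows Google to identify objects in photos and to understand spoken words in its voice-recognition app.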

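Theano is also for Deep Learning

Theano is another good library for numerical computation, somewhat similar to NumPy. Theano lets you efficiently define, optimize and evaluate mathematical expressions involving multi-dimensional arrays.

What sets Theano apart is that it takes advantage of the computer's GPU. This allows it to perform data-intensive calculations up to 100 times faster than when running on the CPU alone. Theano's speed makes it especially valuable for deep learning and other computationally complex tasks.

The final release of the Theano library came last year, in 2017: version 1.0.0, with many new features, interface changes and improvements.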

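Pandas for data extraction and preparation

Pandas is a very popular library that provides high-level data structures which are simple to use and intuitive.

It has many built-in methods for grouping, combining and filtering data, as well as for performing time-series analysis.

Pandas can easily fetch data from different sources such as SQL databases, CSV, Excel and JSON files, and manipulate the data. The library has two main structures:

- Series: one-dimensional.

- DataFrames: two-dimensional.

For more details on how to use Series and DataFrames, see my other articles.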

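Matplotlib for data visualization

The best and most sophisticated ML is meaningless if you cannot communicate it to other people.

So how do you actually turn all this data into value? How do you inspire your business analysts and tell them "stories" full of "insights"?

This is where Matplotlib comes in. It is the standard Python library that data scientists use for creating 2D plots and graphs. It is fairly low-level, meaning it requires more commands to generate good-looking graphs and figures than some higher-level libraries.

However, that also brings flexibility. With enough commands, you can make just about any graph you want with Matplotlib. You can build diverse charts, from histograms and scatterplots to non-Cartesian coordinate graphs.

It supports different GUI backends on all operating systems, and can also export graphics to common vector and raster formats such as PDF, SVG, JPG, PNG, BMP and GIF.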

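Seaborn is another data visualization library

Seaborn is a popular visualization library built on top of Matplotlib. It is a higher-level library, meaning it is easier to generate certain kinds of plots, including heat maps, time series and violin plots.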

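Conclusion

This is a collection of the most important Python libraries for machine learning. If you plan to work with Python and data science, these libraries are worth a look and worth getting familiar with.

Did I miss any important Python ML library? If so, be sure to mention it in the comments below. Although I tried to cover the most useful libraries, I may still have left out others that deserve a look.

Questions or suggestions? I would love to hear from you, so feel free to leave a comment.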

