Quantcast
Channel: CodeSection,代码区,Python开发技术文章_教程 - CodeSec
Viewing all articles
Browse latest Browse all 9596

Top pandas, numpy and scipy functions and modules used in github repos

$
0
0
Introduction

Github recently introduced all files query-able on the Google BigQuery, SQL-like distributed query engine. See the announcement from github. You can execute the BigQuery queries at the BigQuery console .

numpy, scipy and pandas are popular python packages for data analysis. They are quite big, so it may be hard to distinguish between functions you would use all the time and functions that you use once in a while.

Using the new github data on BigQuery I calculated the most popular numpy, scipy and pandas functions and modules used in github repos. I have found a few functions that I should have been using that I have not been aware of. It would be very cool to sort code completions in python packages based on the frequency on github.

I will be using pre-extracted data with python file contents from https://bigquery.cloud.google.com/dataset/fh-bigquery:github_extracts .

URL to example usage assumes that file points to master branch, so it may be sometimes broken until https://twitter.com/kozikow/status/749016021852418048 gets resolved.

Revision history of this blog post is stored on github.

Top pandas functions

Results are approximate and based on the heuristic assumption that people usually prefix python pandas functions by “pd.” or “pandas.”.

SELECT REGEXP_EXTRACT(line, r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)') AS function, COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos, COUNT(*) as count_total, CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url, FROM ( SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name FROM [fh-bigquery:github_extracts.contents_py] WHERE content CONTAINS "import pandas" HAVING NOT LEFT(LTRIM(line),1)='#' AND REGEXP_MATCH(line, r'[^a-zA-Z](?:pd|pandas)\.') ) GROUP BY 1 ORDER BY 2 DESC LIMIT 500;

Full result list in google docs. Top 20 results:

function count_distinct_repos count_total example_url DataFrame 5486 47478 https://github.com/konchris/RunMeas/blob/master/RunMeas/Buffer.py read_csv 4056 17567 https://github.com/fcollman/MakeAT/blob/master/make_make_file.py Series 2248 19124 https://github.com/AllenDowney/ThinkBayes2/blob/master/code/thinkplot.py concat 1869 7456 https://github.com/mhallsmoore/qstrader/blob/master/price_handler/price_handler.py to_datetime 774 3176 https://github.com/cbyn/bitpredict/blob/master/model/features.py merge 650 2642 https://github.com/dmnfarrell/mirnaseq/blob/master/mirdeep2.py date_range 548 3233 https://github.com/and2egg/philharmonic/blob/master/philharmonic/simulator/environment.py read_table 499 1683 https://github.com/cdeboever3/cdpybio/blob/master/cdpybio/express.py util.testing 477 1856 https://github.com/sauloal/cnidaria/blob/master/scripts/venv/lib/python2.7/site-packages/pandas/tseries/tests/test_timeseries_legacy.py isnull 468 1459 https://github.com/Weissger/ext2rdf/blob/master/src/RDFConverter/TripleStructureConverter.py DataFrame.from_dict 399 1455 https://github.com/mdbartos/vic_utils/blob/master/deprecated/mohseni_reg.py Timestamp 387 7029 https://github.com/paulperry/quant/blob/master/vti_agg_7030.py DatetimeIndex 336 1629 https://github.com/readevalprint/zipline/blob/master/zipline/utils/tradingcalendar.py Index 322 2772 https://github.com/caseyclements/dask/blob/master/dask/dataframe/shuffle.py read_excel 302 946 https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Lab/SPF_forecasts.py notnull 284 713 https://github.com/DataViva/dataviva-scripts/blob/master/scripts/secex_monthly/_rdo_temp.py DataFrame.from_csv 265 802 https://github.com/idbedead/RNA-sequence-tools/blob/master/RNA_Seq_analysis/make_monocle_data_js.py HDFStore 251 783 https://github.com/konchris/TDMS2HDF5/blob/master/TDMS2HDF5/tdms2hdf5.py DataFrame.from_records 249 534 https://github.com/phaustin/A405/blob/master/notebooks/python/dropgrowC.py MultiIndex.from_tuples 237 744 https://github.com/ZoomerAnalytics/xlwings/blob/master/xlwings/tests/test_xlwings.py rolling_mean 233 651 https://github.com/Ernestyj/PyStudy/blob/master/finance/DaysTest/DaysDataPrepare.py Top pandas data frame functions

Results are again approximate and based on the heuristic assumption that data frames are usually named with the suffix “df”. To filter out noise, only files containing “import pandas” and matching regexp r”.*df\s=.*(?:pandas|pd)\.” are included.

SELECT REGEXP_EXTRACT(line, r"df\.([a-zA-Z-_\.]+)") AS pandas_function, COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos, CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url FROM ( SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name FROM [fh-bigquery:github_extracts.contents_py] WHERE content CONTAINS "import pandas" and REGEXP_MATCH(content, r".*df\s=.*(?:pandas|pd)\.") HAVING line CONTAINS "df.") GROUP BY 1 HAVING LENGTH(pandas_function) > 1 ORDER BY 2 DESC LIMIT 1000;

Full result list in Google Docs. Top 20 results:

pandas_function count_distinct_repos example_url columns 1290 https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py index 958 https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py to_csv 945 https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py loc 729 https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py groupby 614 https://github.com/fepz/AyCC/blob/master/process_results.py set_index 571 https://github.com/LinJM/clothesDetection/blob/master/caffe-fast-rcnn/python/detect.py drop 473 https://github.com/lukassnoek/skbold/blob/master/skbold/exp_model/parse_presentation_logfile.py ix 450 https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py iloc 418 https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py shape 387 https://github.com/sdpython/ensae_projects/blob/master/_unittests/ut_data/test_data_helper.py iterrows 348 https://github.com/rmhyman/DataScience/blob/master/Lesson1/titanic_data_heuristic1.py sort 341 https://github.com/CGATOxford/cgat/blob/master/scripts/data2spike.py append 340 https://github.com/MadsJensen/CAA/blob/master/calc_ali.py copy 298 https://github.com/wavelets/lifelines/blob/master/tests/test_estimation.py rename 288 https://github.com/Kirubaharan/hydrology/blob/master/Lake_bathymetry/dt_bathymetry/bathymetry_gps_merge.py reset_index 283 https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py apply 278 https://github.com/lukovkin/ufcnn-keras/blob/master/models/UFCNN_predict.py dropna 273 https://github.com/nelsonag/openmc/blob/master/openmc/filter.py head 263 https://github.com/Kirubaharan/hydrology/blob/master/Lake_bathymetry/dt_bathymetry/bathymetry_gps_merge.py values 259 https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py fillna 228 https://github.com/thesgc/cbh_chembl_ws_extension/blob/master/cbh_chembl_ws_extension/serializers.py plot 203 https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/bootcamp_pandas-input.py Top numpy functions

Results are again approximate and it’s a simple string replace from the pandas version.

SELECT REGEXP_EXTRACT(line, r'[^a-zA-Z](?:np|numpy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function, COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos, COUNT(*) as count_total, CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url, FROM ( SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name FROM [fh-bigquery:github_extracts.contents_py] WHERE content CONTAINS "import numpy" HAVING NOT LEFT(LTRIM(line),1)='#' AND REGEXP_MATCH(line, r'[^a-zA-Z](?:np|numpy)\.') ) GROUP BY 1 ORDER BY 2 DESC LIMIT 500;

Full result list in Google docs. Top 20 results:

function count_distinct_repos count_total example_url array 23877 604263 https://github.com/AlexBourassa/Generic_UI/blob/master/Widgets/GraphWidget/Fitter.py zeros 19406 280579 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py arange 13587 158705 https://github.com/jamesp/jpy/blob/master/jpy/maths/derive.py sqrt 10297 77810 https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py ones 10028 80998 https://github.com/iamtrask/keras/blob/master/keras/models.py sum 9829 85793 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py mean 9773 56402 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py linspace 8769 62970 https://github.com/Titan-C/learn-dmft/blob/master/examples/plot_ipt_coex.py asarray 7745 82563 https://github.com/ratnania/caid/blob/master/caid-gui/viewer.py ndarray 7617 71141 https://github.com/eirikgje/healpy/blob/master/healpy/pixelfunc.py dot 7386 90422 https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py exp 6979 42446 https://github.com/pkgw/pwkit/blob/master/pwkit/dulk_models.py abs 6979 43168 https://github.com/eirikgje/healpy/blob/master/healpy/pixelfunc.py where 6781 56778 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py empty 6632 51718 https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py max 6533 31860 https://github.com/live-clones/dolfin-adjoint/blob/master/tests_dolfin/mantle_convection/retrieve_demo.py concatenate 6425 36532 https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py log 5742 33105 https://github.com/pkgw/pwkit/blob/master/pwkit/dulk_models.py sin 5302 25481 https://github.com/jamesp/jpy/blob/master/jpy/maths/derive.py vstack 5251 25913 https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py min 5064 21231 https://github.com/gwpy/seismon/blob/master/seismon/psd.py Top scipy functions

Results are again approximate and it’s a simple string replace from the numpy version.

SELECT REGEXP_EXTRACT(line, r'[^a-zA-Z](?:sp|scipy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function, COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos, COUNT(*) as count_total, CONCAT("https://github.com/", FIRST(sample_repo_name), "/blob/master/", FIRST(sample_path)) AS example_url, FROM ( SELECT SPLIT(content, '\n') AS line, sample_path, sample_repo_name FROM [fh-bigquery:github_extracts.contents_py] WHERE content CONTAINS "import scipy" HAVING NOT LEFT(LTRIM(line),1)='#' AND REGEXP_MATCH(line, r'[^a-zA-Z](?:sp|scipy)\.') ) GROUP BY 1 ORDER BY 2 DESC LIMIT 500;

Full result list in google docs. Top 20 results:

function count_distinct_repos count_total example_url stats 2281 5717 https://github.com/geophysics/mtpy/blob/master/mtpy/modeling/occam2d.py sparse 1706 6500 https://github.com/tscholak/smbkmeans/blob/master/tfidf_smbkmeans.py optimize 1531 2788 https://github.com/cni/t1fit/blob/master/t1_fitter.py io 1218 3079 https://github.com/wojtekwalczak/FB_datalab/blob/master/lib/most_distinctive.py linalg 1199 3047 https://github.com/lesteve/scikit-learn/blob/master/sklearn/utils/arpack.py interpolate 972 2022 https://github.com/geophysics/mtpy/blob/master/mtpy/modeling/occam2d.py special 968 1792 https://github.com/liberatorqjw/scikit-learn/blob/master/sklearn/utils/fixes.py signal 915 1883 https://github.com/garibaldu/radioblobs/blob/master/code/code_1d/old_and_extra/score_GA.py ndimage 864 2196 https://github.com/cni/t1fit/blob/master/t1_fitter.py misc 650 1135 https://github.com/sillvan/hyperspy/blob/master/hyperspy/drawing/_markers/point.py integrate 574 986 https://github.com/kleskjr/scipy/blob/master/scipy/stats/tests/test_distributions.py sparse.linalg 495 1056 https://github.com/lesteve/scikit-learn/blob/master/sklearn/utils/arpack.py spatial.distance 469 721 https://github.com/wjchen84/rapprentice/blob/master/rapprentice/registration.py spatial 420 766 https://github.com/delmic/odemis/blob/master/src/odemis/acq/align/coordinates.py io.loadmat 414 1501 https://github.com/jdsika/TUM_SmartCardLab/blob/master/DPA/benchmark.py sparse.csr_matrix 401 1305 https://github.com/waterponey/scikit-learn/blob/master/scikits/learn/svm/tests/test_sparse.py org 369 894 https://github.com/chiotlune/ext/blob/master/gnuradio-3.7.0.1/gr-filter/examples/fir_filter_ccc.py csr_matrix 361 2541 https://github.com/tscholak/smbkmeans/blob/master/tfidf_smbkmeans.py array 352 3873 https://github.com/PMBio/limix/blob/master/limix/deprecated/io/data_util.py issparse 334 2309 https://github.com/thilbern/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py Attribution

Regular expression used to extract function have improved upon by Felipein the comment.


Viewing all articles
Browse latest Browse all 9596

Trending Articles