I just found a very weird behavior in PySpark. I will show it with an example. Who knows, maybe this can help someone else.
I am processing a list of text files containing data in jsonlines format. After some fiddling, I set up a basic class to process the files:
```python
from pyspark.sql import SparkSession

class TestClassProcessor(object):

    def __init__(self):
        self.spark = SparkSession...getOrCreate()

    @staticmethod
    def parse_record(record):
        # ... do something with record ...
        return record_updated

    def process_file(self, fname):
        data = self.spark.read.text(fname)
        data_processed = data.rdd.map(
            lambda r: self.parse_record(r.value)
        )
        df = data_processed.toDF()
```

The reason I set up `parse_record` as a static method was simple: I initially wrote this code as a set of plain functions, and when I switched to a class I wanted the existing code to keep working. So in the existing, function-based code I just changed
```python
data_processed = data.rdd.map(
    lambda r: parse_record(r.value)
)
```

to
```python
data_processed = data.rdd.map(
    lambda r: TestClassProcessor.parse_record(r.value)
)
```

I mean, that is the purpose of static methods, ain't it?
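And in regular Python that is exactly how it works. A quick sanity check (the `parse_record` body here is a toy stand-in, made up just for illustration):

```python
class TestClassProcessor(object):

    @staticmethod
    def parse_record(record):
        # toy stand-in for the real parsing logic
        return record.upper()

# both work fine in plain Python, no instance needed
print(TestClassProcessor.parse_record("foo"))                   # FOO
print(list(map(TestClassProcessor.parse_record, ["a", "b"])))   # ['A', 'B']
```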
Big was my surprise when I spark-submitted (is that a verb?) the script and got one of Spark's wonderful spaghetti stack traces, ending in the infamous:
```
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
```

A trace which, in my experience, can mean fkin anything.
Interestingly though, the old function-based code still worked.
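After staring at the trace for a while, here is my best guess at what is going on (hedged, since I have not dug through Spark's internals): the lambda in `process_file` closes over `self`, so when Spark pickles it to ship it to the executors, it has to pickle the whole instance, `self.spark` included. A SparkSession wraps a py4j proxy to a JVM object; pickle probes the proxy for a `__getnewargs__` method, py4j dutifully forwards the call to the JVM, and the JVM answers that no such method exists. The old function-based code survived because its lambda only referenced the class, never `self`. You can reproduce the capture problem without Spark at all; `Holder` and its file handle below are made up for illustration, and `cloudpickle` is the (standalone) library Spark uses to ship closures:

```python
import cloudpickle

class Holder(object):

    def __init__(self):
        # stand-in for the SparkSession: any attribute that refuses to pickle
        self.handle = open('/dev/null', 'w')

    @staticmethod
    def parse_record(record):
        return record

    def good_mapper(self):
        # only references the class -> nothing unpicklable is captured
        return lambda r: Holder.parse_record(r)

    def bad_mapper(self):
        # closes over self -> pickling drags in self.handle too
        return lambda r: self.parse_record(r)

cloudpickle.dumps(Holder().good_mapper())  # fine
cloudpickle.dumps(Holder().bad_mapper())   # TypeError: cannot pickle file object
```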
In the end, the way I fixed this baffling error was easy. I just had to change one line of `TestClassProcessor`:
```python
class TestClassProcessor(object):

    def __init__(self):
        self.spark = SparkSession...getOrCreate()

    @staticmethod
    def parse_record(record):
        # ... do something with record ...
        return record_updated

    def process_file(self, fname):
        data = TestClassProcessor.spark.read.text(fname)  # <============== TADAAA!
        data_processed = data.rdd.map(
            lambda r: self.parse_record(r.value)
        )
        df = data_processed.toDF()
```

Which I'm pretty sure is not required in regular Python.
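For what it's worth, another workaround I have seen for this family of errors is to make sure the closure captures a plain function instead of `self`, by binding the static method to a local variable before the `map`. A sketch (same class as above, untested against my exact setup):

```python
def process_file(self, fname):
    data = self.spark.read.text(fname)
    # bind the static method to a local name so the lambda below
    # captures a plain function instead of self
    parse = TestClassProcessor.parse_record
    data_processed = data.rdd.map(lambda r: parse(r.value))
    df = data_processed.toDF()
```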
Cheers!