I just found a very weird behavior in PySpark. I will show it with an example. Who knows, maybe this can help someone else.
I am processing a list of text files containing data in jsonlines format. After some fiddling, I set up a basic class to process the files:
```python
from pyspark.sql import SparkSession

class TestClassProcessor(object):

    def __init__(self):
        self.spark = SparkSession...getOrCreate()

    @staticmethod
    def parse_record(record):
        # ... do something with record ...
        return record_updated

    def process_file(self, fname):
        data = self.spark.read.text(fname)
        data_processed = data.rdd.map(
            lambda r: self.parse_record(r.value)
        )
        df = data_processed.toDF()
```

The reason I set up `parse_record` as a static method was simple: I initially wrote this code as a set of plain functions, and when I switched to a class I wanted the existing code to keep working. So in the existing, function-based code I just changed
```python
data_processed = data.rdd.map(
    lambda r: parse_record(r.value)
)
```

to
```python
data_processed = data.rdd.map(
    lambda r: TestClassProcessor.parse_record(r.value)
)
```

I mean, that is the purpose of static methods, ain't it?
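And in regular Python that is exactly how it works. A quick sanity check (the `parse_record` body here is a toy stand-in, made up just for illustration):

```python
class TestClassProcessor(object):

    @staticmethod
    def parse_record(record):
        # toy stand-in for the real parsing logic
        return record.upper()

# both work fine in plain Python, no instance needed
print(TestClassProcessor.parse_record("foo"))                   # FOO
print(list(map(TestClassProcessor.parse_record, ["a", "b"])))   # ['A', 'B']
```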
Big was my surprise when I spark-submitted (is that a verb?) the script and got one of Spark's wonderful spaghetti stack traces, ending in the infamous:
```
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
```

A trace which, in my experience, can mean fkin anything.
Interestingly though, the old function-based code still worked.
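After staring at the trace for a while, here is my best guess at what is going on (hedged, since I have not dug through Spark's internals): the lambda in `process_file` closes over `self`, so when Spark pickles it to ship it to the executors, it has to pickle the whole instance, `self.spark` included. A SparkSession wraps a py4j proxy to a JVM object; pickle probes the proxy for a `__getnewargs__` method, py4j dutifully forwards the call to the JVM, and the JVM answers that no such method exists. The old function-based code survived because its lambda only referenced the class, never `self`. You can reproduce the capture problem without Spark at all; `Holder` and its file handle below are made up for illustration, and `cloudpickle` is the (standalone) library Spark uses to ship closures:

```python
import cloudpickle

class Holder(object):

    def __init__(self):
        # stand-in for the SparkSession: any attribute that refuses to pickle
        self.handle = open('/dev/null', 'w')

    @staticmethod
    def parse_record(record):
        return record

    def good_mapper(self):
        # only references the class -> nothing unpicklable is captured
        return lambda r: Holder.parse_record(r)

    def bad_mapper(self):
        # closes over self -> pickling drags in self.handle too
        return lambda r: self.parse_record(r)

cloudpickle.dumps(Holder().good_mapper())  # fine
cloudpickle.dumps(Holder().bad_mapper())   # TypeError: cannot pickle file object
```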
In the end, the way I fixed this baffling error was easy. I just had to change one line of `TestClassProcessor`:
```python
class TestClassProcessor(object):

    def __init__(self):
        self.spark = SparkSession...getOrCreate()

    @staticmethod
    def parse_record(record):
        # ... do something with record ...
        return record_updated

    def process_file(self, fname):
        data = TestClassProcessor.spark.read.text(fname)  # <============== TADAAA!
        data_processed = data.rdd.map(
            lambda r: self.parse_record(r.value)
        )
        df = data_processed.toDF()
```

Which I'm pretty sure is not required in regular Python.
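For what it's worth, another workaround I have seen for this family of errors is to make sure the closure captures a plain function instead of `self`, by binding the static method to a local variable before the `map`. A sketch (same class as above, untested against my exact setup):

```python
def process_file(self, fname):
    data = self.spark.read.text(fname)
    # bind the static method to a local name so the lambda below
    # captures a plain function instead of self
    parse = TestClassProcessor.parse_record
    data_processed = data.rdd.map(lambda r: parse(r.value))
    df = data_processed.toDF()
```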
Cheers!