Not too long ago I was assigned the project of my dreams here at IBM's Spark Technology Center: create a super life-changing application that incorporates Apache Spark and Apache SystemML. If you've been following along, you'll know that I am barely a year into my data science career and, between my internship here at the STC and my master's degree at UC Berkeley, I have met with a steep learning curve. Because of this, I've decided to blog about every step along the way! That way every data enthusiast and fellow data scientist can follow along and build their own life-changing app. (After all, we might as well crowdsource saving the world.)
My last blog post was a tutorial on how to use the new SystemML API on the Spark Shell, but before that, I looked at the frustrating step of finding big, open data. On this quest for delightful data, my team and I came across a breast cancer research competition that was an ideal use case for SystemML and Spark. I mean, it was BIG data, life-changing, and interesting. What's not to love?
Let me elaborate. After entering the competition, we were given 500 digital images of breast cancer tissue on medical slides, taken from a microscope. These images are huge slides (approx. 7 GB each) at 20-40x zoom, measuring 50,000 to 100,000 pixels in both directions, so we can safely say we are dealing with really big data! Because of the size, it is an excellent challenge for Apache Spark and Apache SystemML, and our goal will be to develop an automatic way, or a SystemML solution, to determine the grade of cancer in any given tissue image. In order to solve this problem, we will need to use deep learning and neural networks, but first, we have to clean up our data. That's what this blog is for!
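To put those numbers in perspective, here is a back-of-the-envelope sketch of how much memory a single slide would need if it were fully decoded. It assumes 8-bit RGB, so 3 bytes per pixel, which is an assumption on my part and not something the competition states. The helper name `raw_rgb_bytes` is made up for illustration.

```python
def raw_rgb_bytes(width_px, height_px, bytes_per_pixel=3):
    """Bytes needed to hold a fully decoded image.

    Assumes 8-bit RGB (3 bytes per pixel) unless told otherwise.
    """
    return width_px * height_px * bytes_per_pixel


GB = 10**9  # decimal gigabytes

small = raw_rgb_bytes(50_000, 50_000)
large = raw_rgb_bytes(100_000, 100_000)

print(f"50,000 x 50,000 slide, decoded:   {small / GB:.1f} GB")
print(f"100,000 x 100,000 slide, decoded: {large / GB:.1f} GB")

# Rough size of the whole dataset on disk, using the ~7 GB per file figure.
print(f"500 slides at about 7 GB each:    {500 * 7 / 1000:.1f} TB")
```

Under that assumption a decoded slide lands somewhere between 7.5 GB and 30 GB, which is why working on small pieces of a slide at a time is the practical route.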
While in this pre-processing stage, I've been able to learn about a ravishing resource for viewing large images: OpenSlide and DeepZoom. Because of this, I'll first walk you through how to set up and use these tools. After that, we will go ahead and get started on some pre-processing steps! If you don't have access to images of your own, try this source.
First, update Homebrew.
    brew update
    brew upgrade
Install Python 3.
    brew install python3
Install the Python packages.
    pip3 install -U matplotlib numpy pandas scipy jupyter scikit-learn scikit-image flask
Install OpenSlide.
    brew install openslide
    pip3 install openslide-python
Now, create a new folder and work from there. I named mine AwesomeProject/.
*Note: Check where you installed SystemML in my first tutorial.
*Note #2: If you don't have tissue images lying around, use this source. Download the .svs files.
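Once the installs finish, a quick sanity check can save some head-scratching later. The sketch below only asks Python whether each package can be found; it does not import anything, so it won't tell you whether the OpenSlide C library itself loads. The helper name `missing_modules` is made up for this example, and Jupyter is left out of the list because you launch it as a command and don't import it.

```python
import importlib.util

# A few import names differ from their pip names:
# scikit-learn -> sklearn, scikit-image -> skimage, openslide-python -> openslide.
REQUIRED_MODULES = [
    "matplotlib",
    "numpy",
    "pandas",
    "scipy",
    "sklearn",
    "skimage",
    "flask",
    "openslide",
]


def missing_modules(names):
    """Return the module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Run it with python3 so that it checks the same interpreter pip3 installed into.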
    # Download a few images to get started.
    # Place them in a new folder within AwesomeProject/.
    # I called mine data/.
    # Start your Jupyter notebook.
    PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path $SYSTEMML_HOME/SystemML.jar
    # Leave this tab running and Jupyter open in your browser. We
    # will come back to it later.
Make sure your Jupyter notebook starts up with Python3 in the right-hand corner. If it doesn't show up automatically, go to Kernel -> Change Kernel -> Python3. If that doesn't work, you may need to make sure Python3 is the version being used.
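If you're not sure which Python your notebook picked up, a cell like the following is a quick way to check. It only uses the standard library; the function name `interpreter_summary` is just something I made up for this sketch.

```python
import sys


def interpreter_summary():
    """Describe the Python interpreter that is running this code."""
    return {
        "executable": sys.executable,  # path to the interpreter binary
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "is_python3": sys.version_info.major == 3,
    }


summary = interpreter_summary()
print(summary)
```

If `is_python3` comes back False, the kernel is not the one you set with PYSPARK_PYTHON, and the executable path should hint at which installation is being used instead.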
Now, in a new tab on terminal, go into your data/ folder. You'll now need to clone the OpenSlide Python repository and go into the folder to start it. Don't know git? Here's a great tutorial.
    git clone https://github.com/openslide/openslide-python.git
    cd openslide-python/examples/deepzoom
    python3 deepzoom_multiserver.py ../../../data/
Now you need to open OpenSlide in your browser.
    # After you press Enter, your terminal should say: * Running on http://address
    # Copy that http: address and paste it in your browser.
Now you should have two tabs on terminal occupied by Jupyter and OpenSlide. Leave both of them running. When you go to OpenSlide in your browser, you should see a list of the image files in your data/ folder.
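If that list comes up empty, it helps to confirm what is actually sitting in the folder you pointed the server at. Here is a minimal sketch that looks for .svs files by extension. That is a simplification: as far as I know the server relies on OpenSlide's own format detection and not on file extensions. The helper name `find_slide_files` and the "data" path are assumptions for this example.

```python
from pathlib import Path


def find_slide_files(folder, extensions=(".svs",)):
    """Return files under `folder` whose extension looks like a slide, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


# Assumed location: a data/ folder relative to where you run this.
for path in find_slide_files("data"):
    size_gb = path.stat().st_size / 10**9
    print(f"{path}  ({size_gb:.2f} GB)")
```

Nothing printed means no .svs files were found under that path, which usually points to a wrong folder or a wrong relative path.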
Click on one of the images to see it.

Once you are viewing the image, you can use your mouse or trackpad to zoom in and out.

Congrats! You've now looked at all of that tissue using OpenSlide and Python 3. Now, let's do our first pre-processing step using Jupyter.
Navigate back to the Jupyter Notebook that should be open in your browser. Remember, we are still in a bit of an exploratory phase, so our aim is to look at example tiles and experiment with them before applying anything to an entire slide, and most definitely before applying it to all 500 slides.
Our first step is to load everything we need.
    %load_ext autoreload
    %autoreload 2
    %matplotlib inline
    # Add SystemML PySpark API file.
    sc.addPyFile("https://raw.githubusercontent.com/apache/incubator-systemml/branch-0.10/src/main/java/org/apache/sysml/api/python/SystemML.py")
    from glob import glob
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    import numpy as np
    import openslide
    from openslide import open_slide
    from openslide.deepzoom import DeepZoomGenerator
    import pandas as pd
    from scipy.ndimage.morphology import binary_fill_holes, binary_closing, binary_dilation
    from skimage.color import rgb2gray
    # Note: skimage's binary_closing shadows the scipy one imported above.
    from skimage.morphology import closing, binary_closing, disk, remove_small_holes, dilation, remove_small_objects
    from skimage import color, morphology, filters, exposure, feature
    plt.rcParams['figure.figsize'] = (10, 6)
Now we can choose the slide we want to work with.
    # Start by getting your images from your data/ folder.
    # Sorting keeps the slide numbering stable, since glob returns files in arbitrary order.
    files = sorted(glob("data/*.svs"))
    files
    # Specify which image/slide it is. For this example I will
    # use slide 7.
    slide_num = 7
    slide = open_slide(files[slide_num-1])
Now we will generate tiles or, in other words, slice the image up into smaller squares. This will help us look at the image in more detail and will also help us process the content later. We do this because we can't process an entire image at once; instead, we need to process it tile by tile.
    tile_size = 1024
    # overlap adds pixels to each side
    tiles = DeepZoomGenerator(slide, tile_size=tile_size, overlap=0, limit_bounds=False)
    # See how many tiles there are for each level of magnification.
    tiles.level_tiles
    # Choose the tiles you want to look at. You can change around
    # the coordinates to get the tile you are looking for.
    # This is where OpenSlide helps.
    tile = tiles.get_tile(tiles.level_count-1, (85, 35))
    tile
Below are examples of what I did.
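To get a feel for what tiles.level_tiles reports, here is a minimal sketch of the Deep Zoom pyramid arithmetic. It assumes each level is half the size of the next one (rounded up) all the way down to a 1 x 1 image, with no overlap. The helper name `deepzoom_level_tiles` and the 100,000 x 50,000 dimensions are made up for illustration; for a real slide, trust tiles.level_tiles over this sketch.

```python
import math


def deepzoom_level_tiles(width, height, tile_size=1024):
    """Return (columns, rows) of tiles per level, smallest level first.

    Assumes a pyramid where each level halves the previous one,
    rounding up, until the image is 1 x 1 pixel.
    """
    dims = [(width, height)]
    w, h = width, height
    while w > 1 or h > 1:
        w = max(1, math.ceil(w / 2))
        h = max(1, math.ceil(h / 2))
        dims.append((w, h))
    dims.reverse()  # level 0 is the 1 x 1 image, the last level is full resolution
    return [(math.ceil(w / tile_size), math.ceil(h / tile_size)) for w, h in dims]


# Hypothetical slide dimensions, not taken from the competition data.
levels = deepzoom_level_tiles(100_000, 50_000, tile_size=1024)
print("number of levels:", len(levels))
print("tiles at full resolution (columns, rows):", levels[-1])
```

For this made-up slide the full-resolution level works out to a grid of 98 columns by 49 rows. That is why get_tile is called with level_count-1, the highest-resolution level, and a (column, row) address such as (85, 35) that has to fall inside the grid.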


Look at you! You have generated your tiles and visualized some examples! You are now officially an expert at OpenSlide after looking at images of tissue, loading your images, and visualizing some example tiles. Next up will be further pre-processing steps and exploration. Once we have finished our pre-processing on example tiles, we will be able to apply it to all of our slides and use our Spark cluster. This will be followed by our fancy SystemML steps. It seems we are well on our way to changing lives.
Stay tuned for more!
By Madison J. Myers