
Talkin’ ‘Bout Trucks, Beer, and Love in Country Songs ― Analyzing Genius Lyri ...


Trucks, beer, and love: all things that make country music go round. I’ve said before that country music is just pop music with a slide, plus lyrics about slightly different topics than what you’ll hear in hip hop or “normal” pop music on the radio.

In my continuing quest to validate my theory that all country songs can fit into one of four different topics, in this post I go through lyrics to see which artists talk about trucks, beer, and love the most. In my first post on this topic, I talked about how to get song lyrics from Genius and print them out on the command line.

The goal here, and what I’m going to walk you through, is how I stored info and lyrics for all the songs for the country artists, how I made sure that all the lyrics were unique, and then how I ran some stats on the songs. Another note before we go is that a lot of data work is just janitorial. The actual code for getting “interesting” results is fairly simple. The key is to enjoy doing the janitor-style coding and then you’ll be good.

If you’re interested in which country music people talk most about trucks, beer, alcohol, or small towns, skip to the end where I list out some stats. For the rest, here’s some code.


I wonder how they feel about beer trucks. I’m guessing they’d all be fans of them.

Step 1 ― Save the Lyrics!

When doing anything with web scraping, the one thing to always, always keep in mind is that you want to hit the server as little as possible. With that in mind, what we’re going to do here is assume the inputs are names of artists. For each of those artists, we find all of their songs, and then for each of those songs, grab the lyrics in the way that I did in the first post, and save them locally along with some meta information the API provides.
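To make the "save first, request rarely" idea concrete, here is a minimal sketch, assuming Python 3. The names `cached_fetch` and `fake_fetch` are made up for illustration and are not from the original script; `fetch` stands in for whatever code actually makes the request.

```python
import os
import tempfile


def cached_fetch(path, fetch):
    """Return the text stored at `path`, calling `fetch()` only when needed."""
    # A non-empty file counts as already scraped, so no request is made.
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch()  # the only place a network request would happen
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


if __name__ == "__main__":
    calls = []

    def fake_fetch():
        # Hypothetical stand-in for a real request; records each call.
        calls.append(1)
        return "some lyrics"

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "artists", "example_artist", "lyrics", "song.txt")
        cached_fetch(path, fake_fetch)
        cached_fetch(path, fake_fetch)  # served from disk, no second fetch
        print(len(calls))
```

Passing the fetching code in as a callable keeps the caching logic easy to test without touching the network.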

Now when I post the following code, don’t imagine that I knew what I wanted from the start. Everything in here was created iteratively. Here’s a list of the features of this piece of code that came about that way.

Directory structure: Within the folder that contains the main .py file, there’s a folder named artists. Within that folder, when the code runs, a folder with the artist’s name is created (if it doesn’t already exist). And within that folder, there are two more folders, info and lyrics. When we run the code, I put the lyrics in /artists/artist_name/lyrics/Song Title.txt and the info from the API (annotations, title, and the song’s API id, so we can grab it again if need be) in /artists/artist_name/info/Song Title.txt. The key, again, is saving all the info given to avoid unnecessary requests.

Redundancy Checking: Along with making sure to save all the info given, if we run an artist for a second time, we don’t want to fetch lyrics that we already have. So once we have all the songs for that artist, I run a check to see if we already have a file with the name of the song, and that the file isn’t empty. If the file is there, we continue to the next song.

Lyric Error Checking: Ahh, Unicode. While great for allowing multitudes of different characters beyond the standard English alphabet and a few specialty characters, it’s not ideal when I’m trying to deal with simple song lyrics. When saving the lyrics, I encountered more than a few random, unnecessary characters that made Python throw encoding errors. In a semi-janky rule-based solution (which isn’t great to use, see below), when I saw these errors being thrown, I would specifically replace the offending characters with the correct “normal” character. I assume there’s some library out there that would take care of all the encoding issues, but this worked for me. Also, on Genius’s end, it would be sweet if they, you know, checked for abnormal characters when lyrics were uploaded and didn’t have them in the first place. It would also be cool if they included the lyrics in the API.

def clean_lyrics(lyrics):
    lyrics = lyrics.replace(u"\u2019", "'")  # right single quotation mark
    lyrics = lyrics.replace(u"\u2018", "'")  # left single quotation mark
    lyrics = lyrics.replace(u"\u02bc", "'")  # modifier letter apostrophe
    lyrics = lyrics.replace(u"\xe9", "e")  # e with an acute accent
    lyrics = lyrics.replace(u"\xe8", "e")  # e with a grave (backwards) accent
    lyrics = lyrics.replace(u"\xe0", "a")  # a with a grave accent
    lyrics = lyrics.replace(u"\u2026", "...")  # ellipsis apparently
    lyrics = lyrics.replace(u"\u2012", "-")  # figure dash
    lyrics = lyrics.replace(u"\u2013", "-")  # en dash
    lyrics = lyrics.replace(u"\u2014", "-")  # em dash
    lyrics = lyrics.replace(u"\u201c", '"')  # left double quote
    lyrics = lyrics.replace(u"\u201d", '"')  # right double quote
    lyrics = lyrics.replace(u"\u200b", ' ')  # zero width space
    lyrics = lyrics.replace(u"\x92", "'")  # different quote (Windows-1252 leftover)
    lyrics = lyrics.replace(u"\x91", "'")  # still different quote (Windows-1252 leftover)
    lyrics = lyrics.replace(u"\xf1", "n")  # n with tilde!
    lyrics = lyrics.replace(u"\xed", "i")  # i with an acute accent
    lyrics = lyrics.replace(u"\xe1", "a")  # a with an acute accent
    lyrics = lyrics.replace(u"\xea", "e")  # e with circumflex
    lyrics = lyrics.replace(u"\xf3", "o")  # o with an acute accent
    lyrics = lyrics.replace(u"\xb4", "")  # just an accent, so remove
    lyrics = lyrics.replace(u"\xeb", "e")  # e with dots on top
    lyrics = lyrics.replace(u"\xe4", "a")  # a with dots on top
    lyrics = lyrics.replace(u"\xe7", "c")  # c with cedilla
    return lyrics
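On the "some library out there" point: Python's standard `unicodedata` module can cover much of this. The sketch below is one possible alternative, not the approach the original script took, and it assumes Python 3. The names `PUNCTUATION` and `to_ascii` are made up for illustration. Accented letters and the ellipsis are handled by NFKD normalization, while curly quotes and dashes have no ASCII decomposition and still need a small lookup table.

```python
import unicodedata

# Punctuation that NFKD normalization does not turn into ASCII on its own.
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'", "\u02bc": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2012": "-", "\u2013": "-", "\u2014": "-",
    "\u200b": " ",
}


def to_ascii(text):
    """Best-effort ASCII version of `text`; unmapped characters are dropped."""
    text = text.translate(str.maketrans(PUNCTUATION))
    # NFKD splits accented letters into base letter + combining mark
    # and expands compatibility characters such as the ellipsis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" silently discards whatever is still non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("caf\xe9 se\xf1or, don\u2019t wait\u2026"))
```

The trade-off is that anything without a mapping or decomposition disappears silently, so it is worth spot-checking a few cleaned lyrics before trusting the output.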

Check out most of the main function below. If you’re looking for the actual full file, check out this gist. It’s easier to post that on GitHub than to format the entire thing here.

import os

import requests

# base_url, headers, and artist_names are defined elsewhere in the full file (see the gist)


def song_ids_already_scraped(artist_folder_path, force=False):
    # check for ids already scraped so we don't redo
    if force:
        return []
    song_ids = []
    for file_name in os.listdir(artist_folder_path):
        # splitext copes with song titles that contain dots
        name, extension = os.path.splitext(file_name)
        if extension != '.txt':
            continue
        # sometimes the file is empty, we don't want to include if that's the case
        if os.path.getsize(os.path.join(artist_folder_path, file_name)) != 0:
            song_ids.append(name.split("_")[-1])
    return song_ids


def info_from_song_api_path(song_api_path):
    song_url = base_url + song_api_path
    response = requests.get(song_url, headers=headers)
    return response.json()


def songs_from_artist_api_path(artist_api_path):
    api_paths = []
    artist_url = base_url + artist_api_path + "/songs"
    # start on page 1 so the first page isn't requested twice
    params = {"per_page": 50, "page": 1}
    while True:
        # query-string parameters belong in params=, not data=
        response = requests.get(artist_url, params=params, headers=headers)
        songs = response.json()["response"]["songs"]
        for song in songs:
            api_paths.append(song["api_path"])
        if len(songs) < 50:
            break  # no more songs for artist
        params["page"] += 1
    return list(set(api_paths))


if __name__ == "__main__":
    for artist_name in artist_names:
        # setting up path to artist's directories
        artist_folder_path = "artists/%s" % artist_name.replace(' ', '_').lower()
        artist_lyrics_path = "%s/lyrics" % artist_folder_path
        artist_info_path = "%s/info" % artist_folder_path
        if not os.path.exists(artist_folder_path):
            os.makedirs(artist_folder_path)
        if not os.path.exists(artist_lyrics_path):
            os.makedirs(artist_lyrics_path)
        if not os.path.exists(artist_info_path):
            os.makedirs(artist_info_path)
        # only using lyrics since they're saved second
        prev_song_ids = song_ids_already_scraped(artist_lyrics_path)
        # find the artist!
        search_url = base_url + "/search"
        data = {'q': artist_name}
        response = requests.get(search
