This semester I am working on a web application to help USC students schedule their classes.
My job right now is to figure out how to get course information from the USC website. The way I went about that was using Python, specifically the Beautiful Soup library, to scrape data from the USC website into our database (Firebase).
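The core of that scraping loop looks roughly like the sketch below. The HTML fragment and the class names in it are made up for illustration, since the real markup on USC's pages is different; the point is just how Beautiful Soup turns page markup into rows ready for the database.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fragment of a course listing page;
# the real tags, class names, and fields will differ.
SAMPLE_HTML = """
<div class="course" id="CSCI-104">
  <span class="title">Data Structures</span>
  <span class="units">4.0</span>
</div>
<div class="course" id="CSCI-170">
  <span class="title">Discrete Methods</span>
  <span class="units">4.0</span>
</div>
"""

def parse_courses(html):
    """Return one dict per course block found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    for div in soup.select("div.course"):
        courses.append({
            "id": div["id"],
            "title": div.select_one(".title").get_text(strip=True),
            "units": div.select_one(".units").get_text(strip=True),
        })
    return courses
```

Each dict from `parse_courses` maps directly onto a document to write into Firebase.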
Now, some challenges I ran into. The first was the fact that USC's website is almost entirely rendered in JavaScript - i.e., in order to get the course times and section numbers, someone has to click on the class on the website. I found a really janky workaround to this, which essentially meant triggering all the JavaScript responses on the page before scraping it.
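One common way to trigger a page's JavaScript before scraping it (not necessarily the exact workaround used here) is to drive a real browser with Selenium, click every expandable row, and only then hand the rendered HTML to the parser. The CSS selector below is a placeholder, and this requires the `selenium` package plus a matching browser driver:

```python
import time

def fetch_rendered_html(url, click_selector="a.expand-course"):
    """Open the page in a headless browser, click every expandable
    course row to fire its JavaScript handler, then return the fully
    rendered HTML. The selector is a made-up placeholder; the real
    page's clickable elements would need to be inspected first."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        for el in driver.find_elements(By.CSS_SELECTOR, click_selector):
            el.click()
            time.sleep(0.2)  # crude wait for the section data to load
        return driver.page_source
    finally:
        driver.quit()
```

The returned `page_source` string can then be fed straight into Beautiful Soup like any static page.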
Another challenge I ran into was actually cleaning the data and standardizing it so it could be sent to the database. The course sections I scraped numbered in the high thousands, so of course there were going to be variations in the way the courses were displayed - "Wednesday" vs. "Wed" vs. "W", professor names that were hyperlinks to webpages, or stray commas that would mess up the delimiter as all this data was processed.

What I think made this a challenging project was the fact that one small problem I couldn't see could mess up how the data got read and sent to the database. That meant that on top of random spot checks, I had to devise a way to go through all the data I had collected in the database and verify that a stray delimiter or something had not shifted which field the data ended up in.
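The cleaning and verification steps above can be sketched with the standard library. The single-letter day codes here are an assumption, not the app's actual convention, and the field layout is made up; the ideas are (1) normalize every day-name variant to one canonical code, (2) quote every field when writing delimited output so a stray comma can never shift columns, and (3) keep a cheap per-row check that can be run over everything in the database afterward:

```python
import csv
import re

# Assumed canonical day codes - the real app's convention may differ.
DAY_MAP = {
    "monday": "M", "mon": "M", "m": "M",
    "tuesday": "T", "tue": "T", "tues": "T", "t": "T",
    "wednesday": "W", "wed": "W", "w": "W",
    "thursday": "Th", "thu": "Th", "thur": "Th", "thurs": "Th", "th": "Th",
    "friday": "F", "fri": "F", "f": "F",
}

def normalize_days(raw):
    """Collapse 'Wednesday' / 'Wed' / 'W' style variants to one code."""
    parts = re.split(r"[,/\s]+", raw.strip().lower())
    return "".join(DAY_MAP[p] for p in parts if p)

def write_sections(sections, fileobj):
    """Quote every field so a stray comma in, say, a professor's
    name can never be mistaken for a delimiter."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL)
    for row in sections:
        writer.writerow(row)

def validate_row(row, expected_fields):
    """Cheap sanity check: did this row keep the right field count?"""
    return len(row) == expected_fields
```

Running `validate_row` over every stored record is exactly the kind of bulk verification pass that catches a shifted field long after a spot check would have missed it.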