(UPDATED) PASS Summit 2017 Session Data (You know you want it…so go get it!)

UPDATE #1 –
The following changes were made to the PowerShell script for the PASS Summit 2017 session scrape:

  • Abstract field is now varchar(max). ‘Cause precons get all the chars.
  • Added SubmittedSession column.
  • Upper bound increased to include precons.

UPDATE #2 –
Before I could get update #1 posted, it appeared all the general sessions and lightning talks were removed from the submissions page and replaced with submitted precons. Don’t fret – the PowerShell script will still grab all the submitted sessions (for now). There were also some more precons added so the upper bound has been changed yet again.

 

The deadline for PASS Summit submissions came and went. Either you got yours in or you didn’t. What does this mean? Time to gather the data.

Go get it! You know you want to!

Never having done a web scrape before, this was the perfect subject for my first time – grabbing all the sessions submitted to PASS Summit 2017…and doing it with PowerShell! Here is the script I used for this. I have accounted for the following:

  • Apostrophes (aka single quote). They will break your insert unless you have two of them, and for some reason, people seem to use them all over the place.
  • Formatting the string data for insert. No, your data will not magically come out right in your insert with single quotes so you need to add them.
  • Additional ID and deleted fields.
  • Speaker URL and ID. Will be using this to scrape speaker details later.
  • Accurate lower and upper bounds. These were arrived at by trial and error (you’re welcome), as well as the clean up of the data I scraped. More on this later.

Now for the code –

 

There will also be some webids that error out – this means the session doesn’t exist for that specified webid. You will see an error in the output to the PowerShell window, but the script will keep running.

Once you run the script and it completes, you will have your table of session data. You will notice that there are more sessions there than have been submitted, according to the website. If you do a count of the sessions by speaker you will also find that there are more than three sessions (the maximum allowed) for quite a few speakers. Why is this? You are pulling from a data source that contains sessions other than the ones that were submitted to Summit (apparently). This means you will have to clean up the data РI included the deleted field for this purpose. Look at the duration and filter for the 75 minute and 10 minute sessions and update your deleted field accordingly.

After this, you will have sessions that meet the duration, but you will find the speaker is not listed on the submission site when doing an advanced search. Clean the data some more.

Are you done cleansing your data yet? You might think so, but you aren’t. You see sessions that look like they were submitted for Summit but are not on the site. These appear to be sessions that were created and saved, but not officially submitted. You can rule out quite a few of these if you look for speakers that have more than three sessions. Get those counts, check the website for those speakers, and clean your data.

“Good lord, am I done yet?”

No. No you’re not.

There will be speakers that have three sessions in your table but only submitted one or two. This sounds worse than it is, but it is a little painful. How do you do this? You’re going to search the speakers with the advanced search on the website. Do these in groups based on the number of sessions a speaker has, ordering by speaker name. You can exclude the the speakers you have already reviewed in the previous clean-up steps, deleting their names from your list as you go. When you find a speaker with three sessions in your list and fewer sessions submitted, mark the session that was not submitted as deleted. Next, do the speakers with two sessions as they may have only submitted one. Finally, look at the speakers with one session as there may be ones that didn’t officially submit anything.

“But what about the code for the clean-up?”

No. You know how to do this, and may write it differently (and better) than me. My code was ugly and I don’t feel like cleaning it up and posting it here. I gave you the script to grab it all. Stop complaining or I will send you to time-out.

The speaker info grab will be a separate blog post. With the script in this post you might be able to use the syntax and craft your own PowerShell script for this. Or you could wait for my blog on it.

Yeah…I haz the datas!

6 Responses to (UPDATED) PASS Summit 2017 Session Data (You know you want it…so go get it!)

  1. Great job. And if data quality sucks, welcome to every BI project ever.

    • Thanks! It was really cool to work through the data cleansing and see everything that came back…and all the sessions that weren’t quite submitted.

  2. Pingback: Scraping Session Data – Curated SQL

  3. Hi Amy,

    Does this still work? I’ve just tried and it did not work. Not sure if I’m the problem or if the records are not there anymore.

    Cheers

    • It is working for me as of this morning. Make sure you update the variables for your server and the database you are using and create the table – the script assumes this is already there.

  4. Thanks Amy.
    Fatty fingers… mispelled “locahost” and could not find the error…

Leave a Reply

Your email address will not be published. Required fields are marked *