Simple 'Webscraping' with Tableau and Regex

Simple 'Webscraping' with Tableau and Regex

7/13/2015


Okay, we're clearly not going to be webscraping with the functionality of writing a Python script or anything. Yes, using one of those technologies will probably result in collecting data in a more efficient manner. However, utilizing the new regex that was implemented in Tableau 9.0, we can still pull some pretty cool information from the web in a fairly simple way (and we're not bothering the site by sending a bunch of requests like you would with python!). This will require no programming experience aside from some regex, which could also be looked over if you're able the desired expression online.

So, the goal here is to develop a way to pull all the links from blog posts. Blog posts are always filled with great references to other sources of information and I want them all in a single location.

First thing we're going to do is go to the website that we want the information from. I'm going to use this blog post on using Regex in Tableau, because well... we are going to be using regex in Tableau. This method will work with any website, though (try it on this page!).

Once you get to the website, right click -> view page source. This will open up a new tab with all of the html, css, and javascript used on this page. Now, just ctrl+a to select it all, then ctrl+c to copy it all. Once it's all copied, open up an excel workbook and paste it into the first cell. When you paste your copied data in, it should automatically place each line into a different cell, so we don't have to do any reformatting before putting it into Tableau. I'm going to insert a new row at the top, just so I can name it something like Blog Code.

Next, we need to load this up into Tableau. Save the file, fire up Tableau, and open up our newly created spread sheet. Next, put the Blog Code dimension onto your rows. You will see all of the lines of html, css, and js divided out. What we want to do is write a regular expression that will pull out any url from the source code. To do this, create a new Calculated Field. For this instance, we will be using REGEXP_EXTRACT (read more on the other regex functions here). To start off the calc, we need to type:

REGEXP_EXTRACT([Blog Code], regex here)

Now, it's time for the intimidating part. If you're not familiar with regex, I highly recommend reading the article we are getting the data from. Follow all of the links and read and practice with those as well. Be sure to check out regexr.com, it's been the most help for me since I tried to start learning regex. Now, if you simply don't want to learn to write any regex, you can search regexlib.com and hopefully find what you want. This is how you get around the programming aspect; scouring the web and hoping someone has written the regex you are looking for (which they probably have).

For this case, here is the regular expression we are going to use:
 '(((http(s)?):\/\/|www\.)[^\s/$\.?#][^\s">]+)' 
Important: in Tableau, the regex needs to be surrounded by '(regex here)', meaning you need to put the regex inside of those parentheses and it all needs to be surrounded by quotes.

So, let's break down this scary line of text, shall we?

  • (http(s)?):\/\/ - this means that we are first searching for either http (the s is optional is denoted by the following '?'), followed by ://. The backslashes (\) before the forward slashes are required because forward slashes have use in actual regex syntax. Placing the backslash before "escapes" the forward slash, allowing us to actually search for that character. Here is a list of special characters that need to be escaped
  • |www\. - we start off with |, which represents a boolean or. This means that the first pattern we are searching for is now either one of http(s) OR www. This is followed by \., which is again using that escaping slash to allow us to search for a period
  • [^\s/$\.?#] - the starting [^ signifies a negated set. What this means is that anything following this will be excluded from the search pattern. In this case, this pattern will pick up on any character except whitespace (\s), /, $, ., ?, or #. This will be the part of the domain name before the ending .com (or whatever else it could be, as specified by the next snippet)
  • [^\s">]+ - finally, we have another negated set, finding anything but whitespace, ", or >. The plus sign at the end means that we are looking for 1 or more of the preceding pattern. Got all that?
The full calculation in Tableau will look like:



Once you have that click OK and drag the new field out next to the [Blog Code]. Now, you will see a bunch of nulls and some links scattered around there. Drag off [Blog Code] to see only the links! Please note that this regex will pull anything formatted like a link, so you might see some .css, .js, and image files in there. If you don't want those, go ahead and customize the regex to fit your needs!

Thanks for reading! Please post any comments or any corrections you may have in the section below.


Check out more posts in the Tableau category!