Can you structure the unstructured?

As soon as I saw that Candra’s challenge for this week was going to involve Regular Expressions (RegEx), I gave a little groan. RegEx just isn’t my thing 😦 I only ever seem to use them for these challenges, and not in my working life, so have minimal experience. I always think I should focus some time on learning them properly, but other things just end up taking priority. Ho Hum…

So most of my time was spent trying to wrangle the info I needed to identify ‘how many bedrooms’ each property had. I did a bit of googling to try to find the right expressions I think I needed, used the regex101 site to test my expression to find certain patterns of text against some of the data in the Description field, and then tried to plug that into a calculated field in Tableau to extract the data I needed.

But I couldn’t get it to work 😦 I could find matching text using the REGEXP_MATCH function, but when I then tried to use the REXP_EXTRACT functions I couldn’t get anything out…

So I ended up having to look at the solutions that had already been published by the time I started, Candra’s, Lorna Brown’s and Sam Epley’s. I just needed to get my head round what I was obviously doing wrong and give me some pointers. All 3 had slightly different approaches. I absorbed, then closed their workbooks and attempted again from memory. With a lot more trial and error I got somewhere… it isn’t perfect and has some mismatches from the others (but they don’t all match each other either…).

Once I’d got a grouping for each property, the actual Tableau stuff was quite straightforward…

  • Identifying the ‘Number of Bedrooms’
  • Building the Histogram
  • Adding the Average Price
  • Building the Map
  • Adding the Interactivity

Identifying the Number of Bedrooms

So the way I approached this, was to try to identify all the various permutations that represented the word ‘bedroom’ and replace it with the word ‘Bedroom’. But one of the options was BR or br, and the Description field contained html markup with the term <br />. I didn’t want all these to become ‘bedroom’, so I got rid of them all first,

Desc with Bedroom

REGEXP_REPLACE(LOWER(REPLACE([Description],'<br />’, ‘ ‘)),’bedroom|br |bdrm|bed|bd|br, |br/|rooms’,’ Bedroom’)

Firstly, replace any occurence of <br /> with a space, then replace any occurrence of the text bedroom or br<space> or bdrm or bed or bd or br<comma> or br<forward slash> or rooms with the word Bedroom.

I basically added more options to the or statement (identified by the | separator), as I went on examining the descriptions that were left. Using the LOWER function meant that bedroom or Bedroom or BedRoom etc would all be covered with one option.

Then I attempted to extract the number of bedrooms or identify as a studio

Studio | Beds

IF CONTAINS(LOWER([Desc with Bedroom]), ‘studio’) THEN ‘Studio’
ELSEIF REGEXP_MATCH(LOWER([Desc with Bedroom]),’\d bedroom’) THEN REGEXP_EXTRACT(LOWER([Desc with Bedroom]),'(\d+) bedroom’)
ELSEIF REGEXP_MATCH(LOWER([Desc with Bedroom]),’\d bedroom’) THEN REGEXP_EXTRACT(LOWER([Desc with Bedroom]),'(\d+) bedroom’)
ELSEIF CONTAINS(LOWER([Desc with Bedroom]), ‘six bedroom’) THEN ‘6’
END

If the revised description contains the word ‘studio’ then assume its a Studio.

Else if the revised description contains a number (\d) followed by 2 spaces then the word ‘bedroom’ then extract the numbers (\d+) that occur before the word bedroom. The brackets around the \d+ is what is used to identify what bit of the matching pattern to extract… this is the bit that I didn’t really know about and why I couldn’t get things to work.

Else if the revised description contains a number (\d) followed by 3 spaces then the word ‘bedroom’ then extract the numbers (\d+) that occur before the word bedroom. This just happened to be another pattern that occurred and meant some records didn’t get picked up by the prior statement. There’s probably a better way of doing this in one statement…

Finally, if the revised description contains the text ‘six bedroom’ then assume the property has 6 rooms.

This logic seemed to get a match against every record although it’s not 100% accurate, but it was close enough given my struggles.

I then wanted to get the rooms grouped

Room Grouping

CASE [Studio | Beds]
WHEN ‘Studio’ THEN ‘Studio’
WHEN ‘1’ THEN ‘1 Bedroom’
WHEN ‘2’ THEN ‘2 Bedrooms’
WHEN ‘3’ THEN ‘3 Bedrooms’
WHEN ‘4’ THEN ‘4 Bedrooms’
ELSE ‘5 or more Bedrooms’
END

I planned to use this field as my filter, but in doing so the value listed alphabetically, so Studio ended up at the bottom of the list.

To resolve this I created a parameter which meant I could define the order I wanted :

pBedroomSelector

And then I created a new field to use for the filter

Filter Room

[pBedroomSelector] = ‘All’ OR
[pBedroomSelector] = [Room Grouping]

I could then add this onto the filter shelf of the sheets I needed to build, setting the value to True.

Building the Histogram

For this chart, we need to ‘bin’ the Price of each property into groups of $100 ranges. However if we use the built in ‘bin’ function, the field created can’t be referenced in other calculations, and I needed to do this. So instead I determined the ‘lower’ value of the range by

Price per Night Min

FLOOR([Price]/100) *100

Divide the price by 100, round down to the nearest whole integer (so 1.9 will round down to 1), then multiply the result by 100.

And given that, I can then calculate

Price per Night Max

[Price per Night Min]+100

I also created a ‘friendlier’ field to store the number of properties

# of Listings

COUNT([listings copy_listings copy])

which is just a reference to the auto generated field created when you connect to the data source.

With these I can plot the histogram

  • Price per Night Min on Columns (set to discrete, continuous)
  • # of Listings on Rows
  • Mark type of Bar
  • Size set to be Fixed with a width of 100
  • Filter Room on the Filter shelf, set to True.
  • Adjust the colour via the Colour shelf and set a white border
  • Show the pBedroomSelector parameter
  • Add Price per Night Max to the Tooltip shelf and set to be an attribute.
  • Set the Tooltip accordingly and format gridlines, axes labels etc

Adding the Average Price

I wasn’t entirely sure what the average price on Candra’s solution represented, so I chose to go for the average price of the properties in the filtered selection; that is of all the 2-bedroom properties for example, find the average price per night, based on the total price per night of all the properties divided by the number of properties. ie I was looking for these values in the 3rd column.

But I couldn’t simply add the Price field aggregated to Avg to the bar chart. Doing so gave me different values per Price per Night Min grouping.

I just want the value on the grand total line spread across the all the data in the chart. So I created

Window Avg Price

WINDOW_SUM(SUM([Price])) / WINDOW_SUM([# of Listings])

This table calculation, set to compute by Price per Night Min gives the value I want across all rows of data

Add Window Avg Price to the Detail shelf of the histogram, set the calc to compute as above. Then you can add a reference line to the Price per Night Min axis.

Building the Map

To build maps you need fields that are geographic data types. For me, the Longitude field was already set, but I had to manually set the Latitude field (right click -> Geographic Role -> Latitude).

Once done, the map could be quickly built by double-clicking the Longitude field, then double clicking the Latitude field, then adding Name and Listing URL to the Detail shelf, and Price to the Tooltip shelf. Finally set Filter Room = True to the Filter shelf.

I then adjusted the colour of the circles, reduced the opacity to 50% and added a border (all via the Colour shelf).

I also added Area Code Boundaries via the Map -> Map Layers menu to get the map style Candra had used.

Adding the Interactivity

Add the 2 sheets to a dashboard. Each chart can be used to filter each other. This functionality can easily be added by clicking on the context menu of the dashboard object, and selecting Use as Filter. A filter dashboard action will automatically be added. Do this for both charts.

The final requirement, is for a link to the actual listing to be available from the map tooltip. This is a dashboard URL Action (Dashboard -> Actions -> Add Action -> Go to URL). Set as below

The words in the Name field will what is displayed on the tooltip.

The layout requires use of containers, background colours and a bit of padding. This is typically a bit of trial and error to get this right. You can check out my published version here.

Happy vizzin’! Stay Safe!

Donna

One thought on “Can you structure the unstructured?

Leave a comment