Can you structure the unstructured?

As soon as I saw that Candra’s challenge for this week was going to involve Regular Expressions (RegEx), I gave a little groan. RegEx just isn’t my thing 😦 I only ever seem to use them for these challenges, and not in my working life, so have minimal experience. I always think I should focus some time on learning them properly, but other things just end up taking priority. Ho Hum…

So most of my time was spent trying to wrangle the info I needed to identify ‘how many bedrooms’ each property had. I did a bit of googling to try to find the right expressions I think I needed, used the regex101 site to test my expression to find certain patterns of text against some of the data in the Description field, and then tried to plug that into a calculated field in Tableau to extract the data I needed.

But I couldn’t get it to work 😦 I could find matching text using the REGEXP_MATCH function, but when I then tried to use the REXP_EXTRACT functions I couldn’t get anything out…

So I ended up having to look at the solutions that had already been published by the time I started, Candra’s, Lorna Brown’s and Sam Epley’s. I just needed to get my head round what I was obviously doing wrong and give me some pointers. All 3 had slightly different approaches. I absorbed, then closed their workbooks and attempted again from memory. With a lot more trial and error I got somewhere… it isn’t perfect and has some mismatches from the others (but they don’t all match each other either…).

Once I’d got a grouping for each property, the actual Tableau stuff was quite straightforward…

  • Identifying the ‘Number of Bedrooms’
  • Building the Histogram
  • Adding the Average Price
  • Building the Map
  • Adding the Interactivity

Identifying the Number of Bedrooms

So the way I approached this, was to try to identify all the various permutations that represented the word ‘bedroom’ and replace it with the word ‘Bedroom’. But one of the options was BR or br, and the Description field contained html markup with the term <br />. I didn’t want all these to become ‘bedroom’, so I got rid of them all first,

Desc with Bedroom

REGEXP_REPLACE(LOWER(REPLACE([Description],'<br />’, ‘ ‘)),’bedroom|br |bdrm|bed|bd|br, |br/|rooms’,’ Bedroom’)

Firstly, replace any occurence of <br /> with a space, then replace any occurrence of the text bedroom or br<space> or bdrm or bed or bd or br<comma> or br<forward slash> or rooms with the word Bedroom.

I basically added more options to the or statement (identified by the | separator), as I went on examining the descriptions that were left. Using the LOWER function meant that bedroom or Bedroom or BedRoom etc would all be covered with one option.

Then I attempted to extract the number of bedrooms or identify as a studio

Studio | Beds

IF CONTAINS(LOWER([Desc with Bedroom]), ‘studio’) THEN ‘Studio’
ELSEIF REGEXP_MATCH(LOWER([Desc with Bedroom]),’\d bedroom’) THEN REGEXP_EXTRACT(LOWER([Desc with Bedroom]),'(\d+) bedroom’)
ELSEIF REGEXP_MATCH(LOWER([Desc with Bedroom]),’\d bedroom’) THEN REGEXP_EXTRACT(LOWER([Desc with Bedroom]),'(\d+) bedroom’)
ELSEIF CONTAINS(LOWER([Desc with Bedroom]), ‘six bedroom’) THEN ‘6’
END

If the revised description contains the word ‘studio’ then assume its a Studio.

Else if the revised description contains a number (\d) followed by 2 spaces then the word ‘bedroom’ then extract the numbers (\d+) that occur before the word bedroom. The brackets around the \d+ is what is used to identify what bit of the matching pattern to extract… this is the bit that I didn’t really know about and why I couldn’t get things to work.

Else if the revised description contains a number (\d) followed by 3 spaces then the word ‘bedroom’ then extract the numbers (\d+) that occur before the word bedroom. This just happened to be another pattern that occurred and meant some records didn’t get picked up by the prior statement. There’s probably a better way of doing this in one statement…

Finally, if the revised description contains the text ‘six bedroom’ then assume the property has 6 rooms.

This logic seemed to get a match against every record although it’s not 100% accurate, but it was close enough given my struggles.

I then wanted to get the rooms grouped

Room Grouping

CASE [Studio | Beds]
WHEN ‘Studio’ THEN ‘Studio’
WHEN ‘1’ THEN ‘1 Bedroom’
WHEN ‘2’ THEN ‘2 Bedrooms’
WHEN ‘3’ THEN ‘3 Bedrooms’
WHEN ‘4’ THEN ‘4 Bedrooms’
ELSE ‘5 or more Bedrooms’
END

I planned to use this field as my filter, but in doing so the value listed alphabetically, so Studio ended up at the bottom of the list.

To resolve this I created a parameter which meant I could define the order I wanted :

pBedroomSelector

And then I created a new field to use for the filter

Filter Room

[pBedroomSelector] = ‘All’ OR
[pBedroomSelector] = [Room Grouping]

I could then add this onto the filter shelf of the sheets I needed to build, setting the value to True.

Building the Histogram

For this chart, we need to ‘bin’ the Price of each property into groups of $100 ranges. However if we use the built in ‘bin’ function, the field created can’t be referenced in other calculations, and I needed to do this. So instead I determined the ‘lower’ value of the range by

Price per Night Min

FLOOR([Price]/100) *100

Divide the price by 100, round down to the nearest whole integer (so 1.9 will round down to 1), then multiply the result by 100.

And given that, I can then calculate

Price per Night Max

[Price per Night Min]+100

I also created a ‘friendlier’ field to store the number of properties

# of Listings

COUNT([listings copy_listings copy])

which is just a reference to the auto generated field created when you connect to the data source.

With these I can plot the histogram

  • Price per Night Min on Columns (set to discrete, continuous)
  • # of Listings on Rows
  • Mark type of Bar
  • Size set to be Fixed with a width of 100
  • Filter Room on the Filter shelf, set to True.
  • Adjust the colour via the Colour shelf and set a white border
  • Show the pBedroomSelector parameter
  • Add Price per Night Max to the Tooltip shelf and set to be an attribute.
  • Set the Tooltip accordingly and format gridlines, axes labels etc

Adding the Average Price

I wasn’t entirely sure what the average price on Candra’s solution represented, so I chose to go for the average price of the properties in the filtered selection; that is of all the 2-bedroom properties for example, find the average price per night, based on the total price per night of all the properties divided by the number of properties. ie I was looking for these values in the 3rd column.

But I couldn’t simply add the Price field aggregated to Avg to the bar chart. Doing so gave me different values per Price per Night Min grouping.

I just want the value on the grand total line spread across the all the data in the chart. So I created

Window Avg Price

WINDOW_SUM(SUM([Price])) / WINDOW_SUM([# of Listings])

This table calculation, set to compute by Price per Night Min gives the value I want across all rows of data

Add Window Avg Price to the Detail shelf of the histogram, set the calc to compute as above. Then you can add a reference line to the Price per Night Min axis.

Building the Map

To build maps you need fields that are geographic data types. For me, the Longitude field was already set, but I had to manually set the Latitude field (right click -> Geographic Role -> Latitude).

Once done, the map could be quickly built by double-clicking the Longitude field, then double clicking the Latitude field, then adding Name and Listing URL to the Detail shelf, and Price to the Tooltip shelf. Finally set Filter Room = True to the Filter shelf.

I then adjusted the colour of the circles, reduced the opacity to 50% and added a border (all via the Colour shelf).

I also added Area Code Boundaries via the Map -> Map Layers menu to get the map style Candra had used.

Adding the Interactivity

Add the 2 sheets to a dashboard. Each chart can be used to filter each other. This functionality can easily be added by clicking on the context menu of the dashboard object, and selecting Use as Filter. A filter dashboard action will automatically be added. Do this for both charts.

The final requirement, is for a link to the actual listing to be available from the map tooltip. This is a dashboard URL Action (Dashboard -> Actions -> Add Action -> Go to URL). Set as below

The words in the Name field will what is displayed on the tooltip.

The layout requires use of containers, background colours and a bit of padding. This is typically a bit of trial and error to get this right. You can check out my published version here.

Happy vizzin’! Stay Safe!

Donna

Profitability Spotlight – is your budget recovered?

For week 22 of #WOW2020, Ivett Kovacs set her first challenge as a guest poster.

At first glance it looked like it would be tricky, and it was! There were many head-scratching moments as I made my way through this, and I couldn’t complete it all without having to look at Ivett’s solution – more on that later.

Understanding the data

First up, a word on the data provided, as the requirements weren’t as clear as I’d have liked.

Ivett provided a small custom data set to build this challenge on

It contains 2 rows for each Sub-Category. The value to be plotted for Review is evident on the challenge – it’s an average. But the aggregation for Revenue & Budget isn’t that obvious, as there is no direct indication on a label or in the tooltip. Should the values be summed or averaged, or a combination of both?

The Profit value in the tooltip is the only clue you have. The Profit value for Tables is -$60, the value for Chairs is $50. From examining the values in the data, the assumption is Profit = AVG(Revenue) – AVG(Budget), so the values to plot for Budget and Revenue need to be averaged.

The assumption is this data set is a combination of a two datasets; one containing the Revenue & Budget per Sub-Category

the other containing the customer reviews

So when it comes to the viz, we’re looking to plot based on the following data

Understanding what numbers I need to be aiming for when doing these challenges is crucial to me to ensure I’m heading down the right path 🙂

Duplicating the Data

The key thing to realise about this chart is that the right hand line is a fake axis; ie not an axis at all, just a vertical line located so that it looks like it is an axis.

What we actually need to do is build a triangle for each Sub-Category, where each point of the triangle is a different (x,y) coordinate. This means we need 3 rows of data per Sub-Category, 1 row to represent each point being plotted.

So for this we need to Union the data twice, so there are 3 instances of the data – drag another instance of the table into the data pane and drop when the Union option appears. Do this twice.

The union duplicates the rows of data, and each set is identifiable by an automatically generated Table Name field containing values WW, WW1,WW2

Defining the points of the triangle

As I said above, we need to build a triangle for each Sub-Category, where each point of the triangle is a different (x,y) coordinate.

At point 1, we’re going to plot our Review value. At point 2, we’re going to plot our Revenue value, and Budget will be at point 3.

From the diagram above, you can see our x coordinates are one of two numbers, either 0 or x1 (ie the x coordinate for the Revenue & Budget points is the same).

The y coordinates vary per measure, and need to be ‘normalised’ to a common scale, as otherwise, the Revenue & Budget figures would be plotted significantly higher than the Review values (you might find this blog by Zen master Jonathan Drummey useful at this point).

When normalising, you’re typically looking to convert your values to a scale from 0-1; this means the values aren’t plotted at their absolute values, but are still plotted in the same relative order to each other ie lowest value at the bottom, highest at the top.

When normalising a set of values from 0 to 1, your lowest value would typically be at the 0 position , with your highest value at the 1 position. To calculate where your value would sit on this scale, you need to know 3 numbers; the value to convert, the maximum value in the range of values to plot and the minimum value in in the range of values to plot. The normalised value is then :

(Current value – min value) / (max value – min value)

ie the difference between your value and the lowest as a proportion of the difference between the whole range.

However, in this instance, I chose to simplify the normalisation, and my method only works due to the values in the example data provided.

Side Note I consider Revenue & Budget to be values that are directly related to each other ie if Budget = Revenue I’d expect them to be plotted at the same point. I therefore chose to normalise these values across the combined range. This gave me different results from the solution, where Budget and Revenue were normalised independently of each other.

The maximum average review value is 5, the maximum value for budget and revenue combined is 100. As 100 is exactly 20 times bigger than 5, I simply chose to normalise the revenue/budget values to be on a scale of 0-5 instead, rather than normalising all values (Review, Budget & Revenue) to be on a 0-1 scale.

First up, we want to identify a position for each point

Path

IF [Table Name] = ‘WW’ THEN 1
ELSEIF [Table Name] = ‘WW1’ THEN 2
ELSE 3
END

Then we’ll create the x coordinate for each point

x

IF [Path] = 1 THEN 0
ELSE 20 END

I’ve just chosen an arbitrary value to plot the 2nd & 3rd points at – could easily have been 1.

Then we’ll create the y coordinate for each point, which is where the normalisation comes in

y

IF MIN([Path]) = 1 THEN AVG([Review])
ELSEIF MIN([Path]) = 2 THEN AVG([Revenue])/20
ELSEIF MIN([Path]) = 3 THEN AVG([Budget])/20
END

Dividing by 20, normalises the values to a scale of 0-5 as discussed above.

Let’s put these values all out in a table, so we can see what’s going on

Building the Polygon Viz

You can see we have 3 points per Sub-Category, so we can plot the x & y measures on a sheet as follows :

  • Add Min(x) to Rows
  • Add y to Columns
  • Add Sub-Category to Detail
  • Add Table Name (or Path) to Detail
  • Change Mark Type to Polygon
  • Change Colour to #666666 with 30% opacity

Tooltips

We need 2 new calculated fields for the Tooltip

Profit

AVG([Revenue]) – AVG([Budget])

formatted to $ with 0 dp

Margin

[Profit]/AVG([Budget])

formatted to % with 1 dp.

Note this will differ from the solution, where I think Ivett inadvertently used SUM(Budget) rather than AVG.

Add these to Tooltip field and adjust text accordinly.

Colouring the Line

To make the line, I created a second instance of y

y2

IF MIN([Path]) = 1 THEN AVG([Review])
ELSEIF MIN([Path]) = 2 THEN AVG([Revenue])/20
END

which just plots 2 points rather than 3.

Add this to Columns next to y, make dual axis and synchronise axis. Things might have disappeared – you need to remove Measure Names from both marks cards. Change the mark type of the y2 card to Line.

You won’t see much difference at this point (or you shouldn’t). We need a field to define the colour

Colour : Line

IF [Profit] < 0 THEN ‘Non-Profitable’ ELSE ‘Profitable’ END

Add this to the Colour shelf of the y2 marks card and adjust to suit.

Labelling the line

On the y2 marks card, add Review (set to AVG) to Label shelf, and move Sub-Category to Label shelf. Set Label to only label Start of Line. Adjust format/layout of Label to suit.

Set the format of Review to Number Standard – this is a format little used, but will display a whole number or a decimal. I discovered this through an Andy Kriebel WoW from a long time ago and is a real gem!

Finalising the viz

To tidy up and get the viz looking like the solution

  • Format to remove all gridlines
  • Hide the y2 axis
  • Hide the ‘4 nulls’ indicator if you have it
  • Edit the y axis
    • Fix the axis from 0.5 to 5.25
    • Add a title
    • Set axis ticks to none
  • On the y2 marks card, edit the Colour shelf and change the markers to show to All (this gives the circles on each end of the line)
  • Change the Min(x) pill to be FLOAT(Min(x)) using the ‘type in pill’ function
  • Edit the x axis to be Fixed from -0.3 to 19.9
    • cutting off the end of the axis makes the end circle disappear. This is very sneaky, and I had to see the solution for this!
  • Hide the x axis
  • Remove the row divider lines

So we’ve now got the core viz – hooray! But we’re not quite done – boo!

Expand /Collapse

Ivett set the viz to expand to show a column by Sub-Category on click.

This is another requirement I couldn’t get however hard I tried. I duplicated my viz and created a version with Sub-Category on Columns and tried to use Parameter Actions to set a parameter that would control which viz would display using a sheet swap techinque. However when I added the necessary field to the Detail shelf, all the polygons disappeared and I don’t understand why…. I therefore published my first instance of this challenge with a parameter the user controlled to decide which view to show. This is here.

So I had to look at Ivett’s solution to see how she had achieved this, and it involved sets.

Right click on Sub-Category and Create – > Set

Selected Sub-Cats

Don’t select any values at this point.

Create a field

Expand

IF [Selected Sub-Cats] THEN [Sub-Category]
ELSE ‘Click to Expand’
END

If there are values in the set, then the set values will be returned, otherwise the text ‘Click to Expand’ will display.

Add this to the Columns shelf, and ‘Hide field labels for columns‘ so the text ‘Expand’ doesn’t show.

A dashboard action will ultimately drive the behaviour, but we can test the display by editing the set and selecting all values.

The view should expand as expected, showing a column per Sub-Category

But the name of the Sub-Category is still displayed on the label, and in the example this isn’t the case. To fix this I created another calculated field

Label: SubCat

IF [Selected Sub-Cats] THEN ” ELSE [Sub-Category] END

which I added to the Label shelf, and adjusted the display to suit.

‘Reset’ the view to show the collapsed version by re-editing the set and removing all values.

How to Read this chart legend

I simply duplicated the main viz, and filtered by Sub-Category = Chairs.

I removed the label, and then used the Annotate Point functionality to add the relevant labels against the points.

Profitable Legend

The Profitable / Non-Profitable legend makes use of our trusted friend MIN(0), with pills arranged as below.

Applying the Set Action

Create a dashboard to the size specified, and just use floating objects to position the text and various sheets.

To drive the expand / collapse function, create a Set Action on the main viz as below, that targets the Selected Sub-Cats set.

Revenue / Budget Label

This is just managed using Text objects on the dashboard, carefully positioned in the right locations. You might need to add additional objects to get the layout required.

The background of the whole dashboard is set to light grey and the sheets and objects need to be set to the same grey or white to get the same presentation.

So fingers crossed, you should have a complete solution now.

My final version (after I peeked at Ivett’s) is here.

Happy Vizzin’! Stay safe!

Donna