Let’s Practice RegEx in Tableau with Generative AI

Yoshi set this week’s challenge which focused on 2 areas : the use of tools such as CHAT GPT to help us create Tableau RegEx functions (which are always tricksy), and the build of the chart itself.

Creating the REGEX calculated fields

I chose to use CHATGPT and simply entered a prompt as

“give me a regex expression to use in Tableau which extracts a string from another string in a set format. The string to interrogate is”

and then I pasted in the string I’d copied from one of the Session Html fields. I got the following response

along with many more code snippet examples. I used these to create

Session Title

REGEXP_EXTRACT([Session Html], '<div class="title-text">(.*?)</div>')

Description

REGEXP_EXTRACT([Session Html], '<div class="description"[^>]*><div>(.*?)</div>')

Location

REGEXP_EXTRACT([Session Html], '<span class="session-location"[^>]*>(.*?)</span>')

Session Level

REGEXP_EXTRACT([Session Html], 'session-level-([a-zA-Z]+)')

Alias this field to show the values in proper case (ie ‘Advanced’ rather than ‘advanced’) and display Null as ‘All Levels’

Session Date

REGEXP_EXTRACT([Session Html], '<span class="semibold session-date">(.*?)</span>')

Session Time

REGEXP_EXTRACT([Session Html], '<span class="semibold session-time">(.*?)</span>')

Start Time

REGEXP_EXTRACT([Session Html], '<span class="semibold session-time">([0-9: ]+[AP]M)')

End Time

REGEXP_EXTRACT([Session Html], '- ([0-9: ]+[AP]M)')

Creating the additional fields

All of the RegEx functions return string fields. To build the viz, we need actual date fields and additional information

AM|PM

IIF(CONTAINS([Start Time],’AM’), ‘AM’, ‘PM’)

Date

DATE(DATEPARSE(“MMMM d yyyy”, SPLIT([Session Date], “, “, 2) + ” 2023″))

returns the day of the session as a proper date field, all hardcoded to 2023, since this is when the data is from.

Start Date

DATETIME(STR([Date]) + ” ” + [Start Time])

returns the actual day & start time as a proper datetime field.

End Date

DATETIME(STR([Date]) + ” ” + [End Time])

Duration

DATEDIFF(‘second’, [Start Date], [End Date])

Add fields to a table as below

Firstly, when we build the viz, we’ll need to sort it based on the Start Date, the Session Level (where Advanced is listed first, and All Levels last) and the Duration (where longer sessions take higher precedence) . For this we need a new field which is a combination of information related to all these values.

Sort

STR([Start Date]) + ‘ – ‘ + STR(
CASE [Session Level]
WHEN ‘advanced’ THEN 4
WHEN ‘intermediate’ THEN 3
WHEN ‘beginner’ THEN 2
ELSE 1
END
) + STR([Duration]/100000)

(this just took some trial and error)

Add this to the table after the Session ID pill and then apply a Sort to the Session ID pill so it is sorting by the Minimum value of the field Sort Ascending. You should see that when the start time and session level match, the sessions are sorted with the higher duration first.

The viz also lists the AM & PM sessions as separate instances, where the 1st session of the PM session is displayed on the same ‘row’ as the 1st session of the AM session. To handle this we need

Session Index

INDEX()

Make this discrete and add to the table after the AM|PM pill. Change the table calculation so it is computing by all fields except AM|PM and Session Date. The numbering should restart at each am/pm timeslot

Finally, the last field we need to build this viz, is another date field, so we can position the gantt chart bars in the right places. As the viz already segments the chart by Session Date and AM|PM, we can’t reuse the existing Start Date field as this will display the information as if staggered across the 3 days. Instead we want to create a date field, where the day is the same for al the sessions, but just the time differs

Baseline Date

DATETIME(STR(#2023-01-01#) + ” ” + [Start Time])

the date 01 Jan 2023 is arbitrary and could be set to any day. Add this into the table, so you can see what it’s doing. You will probably need to adjust the table calculation on Session Index again to include Baseline Date in the list of checked fields.

Building the viz

On a new sheet, add Session Date to Columns and manually re-order so Tuesday listed first. Then add AM|PM to Columns . Add Baseline Date as a continuous exact date (green pill) to Columns . Add Session Id to Detail then add Session Index to Rows and adjust the table calculation so it is computing by all fields except Session Date and AM|PM.

Create a new field

Size

SUM([Duration])/86400

and add this to the Size shelf. 86400 is the number of seconds in 24hrs, so adjusts the size of the gantt bars to fit the timeframe we’re displaying.

Add Session Level to Colour and adjust accordingly. You will need to readjust the Session Index table calc to include Session Level in the checked fields. Add a border to the bars. Edit the Baseline Date axis so the axis ranges are independent

Add Description and Session Title to Detail and again adjust the Session Index table calc to include these fields too. Apply a Sort to the Session Id field so the data is sorted by the Sort field descending

Add Location, Start Time and End Time to Tooltip, and update accordingly. Then apply formatting to

  • change font of the Session Date and AM|PM header values
  • remove the Session Date/ AM|PM column heading label (right click > hide field labels for columns)
  • Hide the Session Index field
  • Hide the Baseline Date axis
  • Add column banding so the Wednesday pane is coloured differently

Add the ability to highlight the Session Title and Description by selecting Show Highlighter from the context menu of each pill

Then add the sheet to a dashboard, using containers to display the colour legend and the highlighter fields.

And that should be it. My published viz is here.

Happy vizzin’!

Donna

Can you structure the unstructured?

As soon as I saw that Candra’s challenge for this week was going to involve Regular Expressions (RegEx), I gave a little groan. RegEx just isn’t my thing 😦 I only ever seem to use them for these challenges, and not in my working life, so have minimal experience. I always think I should focus some time on learning them properly, but other things just end up taking priority. Ho Hum…

So most of my time was spent trying to wrangle the info I needed to identify ‘how many bedrooms’ each property had. I did a bit of googling to try to find the right expressions I think I needed, used the regex101 site to test my expression to find certain patterns of text against some of the data in the Description field, and then tried to plug that into a calculated field in Tableau to extract the data I needed.

But I couldn’t get it to work 😦 I could find matching text using the REGEXP_MATCH function, but when I then tried to use the REXP_EXTRACT functions I couldn’t get anything out…

So I ended up having to look at the solutions that had already been published by the time I started, Candra’s, Lorna Brown’s and Sam Epley’s. I just needed to get my head round what I was obviously doing wrong and give me some pointers. All 3 had slightly different approaches. I absorbed, then closed their workbooks and attempted again from memory. With a lot more trial and error I got somewhere… it isn’t perfect and has some mismatches from the others (but they don’t all match each other either…).

Once I’d got a grouping for each property, the actual Tableau stuff was quite straightforward…

  • Identifying the ‘Number of Bedrooms’
  • Building the Histogram
  • Adding the Average Price
  • Building the Map
  • Adding the Interactivity

Identifying the Number of Bedrooms

So the way I approached this, was to try to identify all the various permutations that represented the word ‘bedroom’ and replace it with the word ‘Bedroom’. But one of the options was BR or br, and the Description field contained html markup with the term <br />. I didn’t want all these to become ‘bedroom’, so I got rid of them all first,

Desc with Bedroom

REGEXP_REPLACE(LOWER(REPLACE([Description],'<br />’, ‘ ‘)),’bedroom|br |bdrm|bed|bd|br, |br/|rooms’,’ Bedroom’)

Firstly, replace any occurence of <br /> with a space, then replace any occurrence of the text bedroom or br<space> or bdrm or bed or bd or br<comma> or br<forward slash> or rooms with the word Bedroom.

I basically added more options to the or statement (identified by the | separator), as I went on examining the descriptions that were left. Using the LOWER function meant that bedroom or Bedroom or BedRoom etc would all be covered with one option.

Then I attempted to extract the number of bedrooms or identify as a studio

Studio | Beds

IF CONTAINS(LOWER([Desc with Bedroom]), ‘studio’) THEN ‘Studio’
ELSEIF REGEXP_MATCH(LOWER([Desc with Bedroom]),’\d bedroom’) THEN REGEXP_EXTRACT(LOWER([Desc with Bedroom]),'(\d+) bedroom’)
ELSEIF REGEXP_MATCH(LOWER([Desc with Bedroom]),’\d bedroom’) THEN REGEXP_EXTRACT(LOWER([Desc with Bedroom]),'(\d+) bedroom’)
ELSEIF CONTAINS(LOWER([Desc with Bedroom]), ‘six bedroom’) THEN ‘6’
END

If the revised description contains the word ‘studio’ then assume its a Studio.

Else if the revised description contains a number (\d) followed by 2 spaces then the word ‘bedroom’ then extract the numbers (\d+) that occur before the word bedroom. The brackets around the \d+ is what is used to identify what bit of the matching pattern to extract… this is the bit that I didn’t really know about and why I couldn’t get things to work.

Else if the revised description contains a number (\d) followed by 3 spaces then the word ‘bedroom’ then extract the numbers (\d+) that occur before the word bedroom. This just happened to be another pattern that occurred and meant some records didn’t get picked up by the prior statement. There’s probably a better way of doing this in one statement…

Finally, if the revised description contains the text ‘six bedroom’ then assume the property has 6 rooms.

This logic seemed to get a match against every record although it’s not 100% accurate, but it was close enough given my struggles.

I then wanted to get the rooms grouped

Room Grouping

CASE [Studio | Beds]
WHEN ‘Studio’ THEN ‘Studio’
WHEN ‘1’ THEN ‘1 Bedroom’
WHEN ‘2’ THEN ‘2 Bedrooms’
WHEN ‘3’ THEN ‘3 Bedrooms’
WHEN ‘4’ THEN ‘4 Bedrooms’
ELSE ‘5 or more Bedrooms’
END

I planned to use this field as my filter, but in doing so the value listed alphabetically, so Studio ended up at the bottom of the list.

To resolve this I created a parameter which meant I could define the order I wanted :

pBedroomSelector

And then I created a new field to use for the filter

Filter Room

[pBedroomSelector] = ‘All’ OR
[pBedroomSelector] = [Room Grouping]

I could then add this onto the filter shelf of the sheets I needed to build, setting the value to True.

Building the Histogram

For this chart, we need to ‘bin’ the Price of each property into groups of $100 ranges. However if we use the built in ‘bin’ function, the field created can’t be referenced in other calculations, and I needed to do this. So instead I determined the ‘lower’ value of the range by

Price per Night Min

FLOOR([Price]/100) *100

Divide the price by 100, round down to the nearest whole integer (so 1.9 will round down to 1), then multiply the result by 100.

And given that, I can then calculate

Price per Night Max

[Price per Night Min]+100

I also created a ‘friendlier’ field to store the number of properties

# of Listings

COUNT([listings copy_listings copy])

which is just a reference to the auto generated field created when you connect to the data source.

With these I can plot the histogram

  • Price per Night Min on Columns (set to discrete, continuous)
  • # of Listings on Rows
  • Mark type of Bar
  • Size set to be Fixed with a width of 100
  • Filter Room on the Filter shelf, set to True.
  • Adjust the colour via the Colour shelf and set a white border
  • Show the pBedroomSelector parameter
  • Add Price per Night Max to the Tooltip shelf and set to be an attribute.
  • Set the Tooltip accordingly and format gridlines, axes labels etc

Adding the Average Price

I wasn’t entirely sure what the average price on Candra’s solution represented, so I chose to go for the average price of the properties in the filtered selection; that is of all the 2-bedroom properties for example, find the average price per night, based on the total price per night of all the properties divided by the number of properties. ie I was looking for these values in the 3rd column.

But I couldn’t simply add the Price field aggregated to Avg to the bar chart. Doing so gave me different values per Price per Night Min grouping.

I just want the value on the grand total line spread across the all the data in the chart. So I created

Window Avg Price

WINDOW_SUM(SUM([Price])) / WINDOW_SUM([# of Listings])

This table calculation, set to compute by Price per Night Min gives the value I want across all rows of data

Add Window Avg Price to the Detail shelf of the histogram, set the calc to compute as above. Then you can add a reference line to the Price per Night Min axis.

Building the Map

To build maps you need fields that are geographic data types. For me, the Longitude field was already set, but I had to manually set the Latitude field (right click -> Geographic Role -> Latitude).

Once done, the map could be quickly built by double-clicking the Longitude field, then double clicking the Latitude field, then adding Name and Listing URL to the Detail shelf, and Price to the Tooltip shelf. Finally set Filter Room = True to the Filter shelf.

I then adjusted the colour of the circles, reduced the opacity to 50% and added a border (all via the Colour shelf).

I also added Area Code Boundaries via the Map -> Map Layers menu to get the map style Candra had used.

Adding the Interactivity

Add the 2 sheets to a dashboard. Each chart can be used to filter each other. This functionality can easily be added by clicking on the context menu of the dashboard object, and selecting Use as Filter. A filter dashboard action will automatically be added. Do this for both charts.

The final requirement, is for a link to the actual listing to be available from the map tooltip. This is a dashboard URL Action (Dashboard -> Actions -> Add Action -> Go to URL). Set as below

The words in the Name field will what is displayed on the tooltip.

The layout requires use of containers, background colours and a bit of padding. This is typically a bit of trial and error to get this right. You can check out my published version here.

Happy vizzin’! Stay Safe!

Donna