Unlocking the Power of Geospatial Data for Insights

Over the last three geospatial-centric blog posts, we’ve covered the basics of what geospatial data is, how it works in the broader world of data and how it specifically works in Snowflake based on our native support for GEOGRAPHY, GEOMETRY and H3. Those articles are great for dipping your toe in, getting a feel for the water and maybe even wading into the shallow end of the pool. But there is so much more you can do with geospatial data in your Snowflake account! We get it, though — it can be tough to jump from “I now understand the concepts” to “How can I tackle my first use cases?” 

The world of geospatial data processing is vast and complex, and we’re here to simplify it for you. That’s why we created a Snowflake quickstart that walks you through various examples of geospatial capabilities in real-world contexts. While these examples use representative data, either from Snowflake Marketplace or data we’ve made freely available on Amazon S3, the techniques can be applied to your own Snowflake account, data and use cases to deliver valuable insights to your user community. This blog post discusses how to do that; refer to the quickstart itself if you want to see the actual code.


But let’s start with a quick recap: Location data is everywhere. Just like your Snowflake account almost certainly has time-based data, your Snowflake account almost certainly has location-based data as well, as most transactional data contains some element of who, what, when and where. 

In its simplest form, location data arrives as a collection of text and numeric fields that make up what we call “addresses” and include things such as streets, cities, states, counties, ZIP codes and countries. Humans can read and understand these strings fairly easily, but in their text form, they are largely informational values that we look up when we need them and nothing more. Sometimes these text fields are accompanied by latitude and longitude fields. These fields are great because now we can do more with the location data: we can place it on the Earth as a point. That’s useful because we can feed those points into a BI tool’s mapping capability to show which country or territory each point belongs to, or we can integrate them into a spatial grid such as H3. While various Snowflake BI partners support working with latitude and longitude data, Snowflake has built-in support to transform your longitude and latitude data into an H3 spatial grid for fast computation and visualization (as we’ll see later).

But we can go a step further and transform that latitude and longitude data into a GEOGRAPHY or GEOMETRY data type. We use the former when we need an ellipsoidal “globe” representation of the Earth, and the latter when we need a planar “map” representation of a more localized area. And we can construct these data types with a variety of functions.
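To make the idea concrete, one common input to those constructor functions is well-known text (WKT). Here’s a minimal Python sketch (the helper name is ours, not Snowflake’s) that formats a longitude/latitude pair into a WKT point string of the kind a constructor such as TO_GEOGRAPHY accepts:

```python
def wkt_point(longitude: float, latitude: float) -> str:
    """Format a lon/lat pair as a WKT POINT string (note: WKT order is lon lat)."""
    if not (-180.0 <= longitude <= 180.0 and -90.0 <= latitude <= 90.0):
        raise ValueError(f"invalid coordinates: {longitude}, {latitude}")
    return f"POINT({longitude} {latitude})"

# The resulting string can be passed to a geospatial constructor in SQL.
print(wkt_point(-74.006, 40.7128))  # POINT(-74.006 40.7128)
```

Note the longitude-first ordering: WKT (and most geospatial functions) take longitude before latitude, which is a frequent source of bugs when data arrives labeled “lat/long.”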

Now, that may not sound like much, but with these two data types we can start constructing a series of points into a line, connect multiple lines into a polygon and represent complex locations as a series of lines and polygons. We can perform complex calculations and relationship associations between objects of these types and start to unlock insights we didn’t know were available to us with a simple “address” field alone. You may ask, “But what if all I have is the textual address field?” Ahh, welcome, fair traveler, to our first swashbuckling adventure: geocoding. “Geo-what-ing??”


Geocoding is the act of taking your textual address data and transforming it into a geospatial type, unlocking use cases you wouldn’t otherwise be able to deliver. And of course, it’s also possible to go the other direction and turn your less readable geospatial data type into a more human-readable set of location fields. Both of these transformations can be done with specialized partners such as Mapbox and TravelTime, each available as a Snowflake Native App via Snowflake Marketplace. We recommend these providers for the most accurate geocoding and reverse geocoding. But sometimes you need to balance accuracy with cost, or you need to demonstrate ROI before you can spend money, so let’s talk about how you can get started with geocoding and reverse geocoding directly in Snowflake.

In this use case, we will use two data sets from Snowflake Marketplace: Worldwide Address Data, a free and open global address data collection, and a tutorial data set from our partner CARTO, which has a restaurant table with a single street_address column. If you were to follow this example for your data, you would replace the tutorial restaurant table with your table. You’ll do some data prep on these two tables, which you can see in the quickstart.

The DIY Snowflake geocoding involves three steps: 

1. Use an LLM to transform a complete address string into a JSON array with its parts as attributes. You’ll use the SNOWFLAKE.CORTEX.COMPLETE function in this step, which uses a Snowflake-hosted Mixtral-8x7B model to do the conversion, following a detailed set of instructions you provide as part of the function call. See the quickstart notes for choosing the proper warehouse size to run this at scale.

2. Create a simple geospatial column on the latitude and longitude in Worldwide Address Data.

  • ST_POINT is used to create the geospatial column from the latitude and longitude.
  • Unnecessary data is removed from the Worldwide Address Data table based on invalid latitude and longitude values.

3. Use JAROWINKLER_SIMILARITY to match addresses. You’ll join the two tables on a few columns, comparing the street name columns with the JAROWINKLER_SIMILARITY function rather than strict equality, because it’s not uncommon for street names to have subtle string differences between sources. A similarity threshold of 95 in the function call keeps only close matches.
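To build intuition for what that join predicate is doing, here is a pure-Python sketch of the standard Jaro-Winkler algorithm, scaled to 0–100 the way Snowflake’s JAROWINKLER_SIMILARITY reports it. This is an illustration of the algorithm, not Snowflake’s exact implementation, so scores may differ slightly at the margins:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    match_dist = max(len1, len2) // 2 - 1  # max offset for characters to count as matching
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - match_dist), min(i + match_dist + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    # Count transpositions between the matched characters of each string.
    transpositions, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler_0_100(s1: str, s2: str, scale: float = 0.1) -> int:
    """Jaro-Winkler similarity scaled to 0-100, boosting scores for a shared prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:  # prefix bonus is capped at 4 characters
            break
        prefix += 1
    return round((j + prefix * scale * (1 - j)) * 100)
```

For example, the classic transposed pair “MARTHA” and “MARHTA” scores 96, which clears the 95 threshold, while unrelated strings score near 0. That’s exactly the behavior you want for street names that differ only by a typo or swapped characters.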

This method isn’t flawless. The quickstart shows in more detail the level of accuracy achieved and some of the reasons why this method is imperfect, but also note the cost difference between this method and a dedicated geocoding service. It’s a great way to satisfy simpler requirements, or to prototype the justification for a dedicated geocoding service when requirements are more complicated.

To do reverse geocoding, or produce an address from a geospatial data type, we’re going to build a stored procedure that does three things: creates a results table, selects rows that haven’t yet been processed and finds the closest address match by searching within an increasing radius until a result is found. You can call this procedure like any procedure in Snowflake, passing in the appropriate arguments as defined at the top of the procedure code, which you can see in the quickstart.
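The expanding-radius search at the heart of that procedure can be sketched in Python. This is our own illustrative logic with an in-memory address list and assumed function names, not the quickstart’s SQL, but the shape is the same: start small, widen until a candidate appears, return the nearest one:

```python
import math

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def closest_address(lat, lon, addresses, start_radius_m=100.0, max_radius_m=10_000.0):
    """Find the nearest address, doubling the search radius until something is found."""
    with_dist = [(haversine_m(lat, lon, a["lat"], a["lon"]), a) for a in addresses]
    radius = start_radius_m
    while radius <= max_radius_m:
        hits = [(d, a) for d, a in with_dist if d <= radius]
        if hits:
            return min(hits, key=lambda da: da[0])[1]  # closest match in the current ring
        radius *= 2  # widen the search and try again
    return None  # nothing within the maximum radius
```

The cap on the radius matters: without it, a point in the middle of the ocean would keep widening forever (or match something absurdly far away).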

Geocoding your address data unlocks the potential to use it in more insightful ways. Let’s dig into one of them.


Forecasting is a common activity with time-series data, as there are sound ML models designed to take historical data as input and predict a window of future data based on historical trends. Snowflake includes a built-in ML forecast model called SNOWFLAKE.ML.FORECAST that you can easily use for this activity. While you can do time-series forecasting across any time-based data, enriching that forecasting with location data adds another dimension of value to the forecasting process.

There are two data sets used in the quickstart: New York City taxi ride data provided by CARTO and event data provided by PredictHQ. You can think of the two data sets like this: The taxi ride data is like any time-series data you have about something that happens at a certain time and place. You likely have that kind of data in your organization. The event data adds more context to the time-based element in your data; in this example, it may explain an increase or decrease in expected taxi rides in New York City. Other types of contextual data you may have include open or closed periods, recurring promotional periods, public events and more. Whatever it is, contextual data can improve a forecast’s accuracy by marking any given day or time with an expectation of increased or decreased activity.

The forecasting activity involves four steps:

1. Use a spatial grid to calculate a time series of taxi pickups for each cell in the grid.

  • You’ll use the H3 spatial grid to divide New York City into cells based on the pickup location.

  • You can pinpoint a pickup within a cell by taking the latitude and longitude and converting that to an H3 cell.

  • You can then aggregate the number of pickups within each New York City cell by hours of the day using Snowflake’s TIME_SLICE and H3_POINT_TO_CELL_STRING functions.

  • Lastly, you’ll need to fill in any hour gaps in a location with 0 (zero) pickup records. This is necessary for time-series forecasting.
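The aggregation and gap-filling in this step can be sketched in plain Python. In the quickstart, H3_POINT_TO_CELL_STRING and TIME_SLICE do this work in SQL; here the cell IDs are stand-ins and the function name is ours:

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_pickup_counts(pickups, start, end):
    """pickups: iterable of (cell_id, pickup_time) pairs.
    Returns {(cell_id, hour): count} for every hour in [start, end),
    with hours that had no pickups filled in as 0."""
    # Bucket each pickup into the hour it falls in, then count per (cell, hour).
    counts = Counter(
        (cell, ts.replace(minute=0, second=0, microsecond=0)) for cell, ts in pickups
    )
    cells = {cell for cell, _ in pickups}
    series = {}
    hour = start
    while hour < end:
        for cell in cells:
            # Zero-fill gaps so the forecast model sees a complete, regular series.
            series[(cell, hour)] = counts.get((cell, hour), 0)
        hour += timedelta(hours=1)
    return series
```

The zero-fill is the easy step to forget: a forecasting model interprets a missing hour very differently from an hour with zero pickups.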

2. Enrich the hourly time-series data with event data.

  • This adds “feature” columns to the data above by flagging each row as a school holiday, public holiday or sporting event, based on a join to the event data.

  • Note that you can add whatever “features” make sense for your event data. School holidays, public holidays and sporting events are just realistic examples for this scenario.

3. Build a model, a training data set and a prediction data set.

  • The SNOWFLAKE.ML.FORECAST model requires a training step before making predictions, so your data from Step 2 needs to be divided into two tables by time: one for training and one for prediction.

  • From there you’ll create a model, with one of the inputs being the training data table you established above. You’ll also identify key columns, like the time column, the metric to forecast and, in this instance, what location data you want to forecast by.

4. Run the model and visualize the accuracy of the predictions.

  • Once the model has been trained, it can be called, pointing to another data set to use for prediction.

  • The prediction can be output to a table and then compared to the actuals to assess accuracy. There are several ways to do this, but the quickstart suggests using a symmetric mean absolute percentage error (SMAPE).
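As an illustration of that error metric, here’s a minimal SMAPE implementation. This is our own sketch; SMAPE has a few published variants, and this one averages the absolute error over the mean of the absolute actual and forecast values, reported as a percentage:

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, as a percentage (0 = perfect forecast)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be equal-length, non-empty sequences")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Treat an all-zero pair as zero error rather than dividing by zero.
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return 100.0 * total / len(actual)
```

For instance, forecasting 110 rides in an hour that actually saw 100 yields a SMAPE of about 9.5%; a perfect forecast yields 0. The symmetry (penalizing over- and under-forecasting alike) is why it’s a reasonable default for comparing predictions to actuals per H3 cell.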

The chart below shows what the output of this comparison can look like for one H3 cell, with the forecast model achieving a pretty strong accuracy rate.
