Welcome, and a useful SQL device
01 Oct 2023Welcome to my blog! You can learn more about me in the About section, or view a detailed CV here. But in short:
- I like programming.
- I like maths.
- I (sometimes) like data.
This blog will be my avenue for sharing some things I find useful or interesting, and hopefully you feel the same way. Since a purely “welcome” post would be pretty dull, I’ll start the blog off by sharing with you a T-SQL device that can come in useful for helping to structure complex queries.
Local variables in SQL expressions
Not, I’m not talking about the likes of declare @var int
in T-SQL or declare var int
in PostgreSQL (or PL/SQL). I
mean a means of aliasing expressions in a SQL query, so you can refer to them by name rather than duplicating the
expression in multiple places.
To illustrate, let’s suppose we have some data representing the GPS positions of a vehicle. (I’ve done extensive analytics engineering with this sort of telematics data, so it tends to pop into mind when I’m thinking of examples.) The data is in a table that contains:
- An
id
field identifying each position. - A
vehicle_id
field indicating which vehicle the GPS position is for. - Three fields to hold the time and coordinates of the position:
timestamp
,latitude
, andlongitude
. Typically latitudes and longitudes are given in decimal form, so we’ll assume that’s the case here.
As a SQL DDL:
create table db.positions (
id int,
vehicle_id int,
timestamp datetime,
latitude float,
longitude float
);
What I have in mind is to calculate the point-to-point distance, hence the (average) speed from one point to the next. The distance is approximated by the Haversine formula for the shortest distance between two points on a sphere:
\[\text{d}(P, Q) = 2r \arcsin\left(\sqrt{\sin^2\left(\tfrac{\varphi_2 - \varphi_1}{2}\right) + \cos \varphi_1 \cdot \cos \varphi_2 \cdot \sin^2\left(\tfrac{\lambda_2 - \lambda_1}{2}\right)}\right)\]where $P = (\varphi_1, \lambda_1)$ and $Q = (\varphi_2, \lambda_2)$ are two positions expressed in radians (as $(\text{latitude},\,\text{longitude})$) and $r$ is the radius of the sphere under consideration. We’ll take $r = 6371\,\text{km}$, which is roughly the radius of the Earth. Of course, the Earth isn’t really a sphere, so this will just be an approximation—albeit a pretty accurate one.
To simplify things, let’s assume we’ve defined a Haversine SQL function:
db.haversine(lat_1, lon_1, lat_2, lon_2)
Armed with the above function, we can calculate the point-to-point speed of the vehicle at each position. Now typically this would be done in one of two ways:
- By using sub-queries (this will get messy very quickly).
- By using CTEs.
- By repeating (sub-)expressions.
(Or some combination of the three.)
Another approach—the one I’m proposing in this post—is to use cross apply
(also known as a lateral join
; for
instance, in Postgres). It looks like this:
select
pos.id,
pos.vehicle_id,
kmh.kmh_to_next_position
from dbo.positions pos
cross apply (
-- Convert latitudes and longitudes to radians
select
timestamp,
radians(latitude ) as latitude,
radians(longitude) as longitude
) cur
cross apply (
-- Get a vehicle's next timestamp and position coordinates
select top 1
cra.timestamp,
radians(cra.latitude ) as latitude,
radians(cra.longitude) as longitude
from dbo.positions cra
where 1=1
and cra.vehicle_id = pos.vehicle_id
and cra.timestamp > pos.timestamp
) nxt
cross apply (
-- Calculate the KM and hours to a vehicle's next position
select
dbo.haversine(
cur.latitude, cur.longitude,
nxt.latitude, nxt.longitude
) as km_to_next_position,
datediff(
second,
cur.timestamp,
nxt.timestamp
) / 3600.0 as hr_to_next_position
) tnp
cross apply (
-- Calculate the KM/hr to a vehicle's next position
select
tnp.km_to_next_position / tnp.hr_to_next_position as kmh_to_next_position
) kmh;
The calculations are now broken up into a sequence of cross apply
’s:
- The first one converts the latitudes and longitudes to radians.
- The second gets the required data for the “next” position.
- The third calculates the position-to-position distance and time.
- And lastly, the final one calculates the speed.
I’ve found this pattern to be especially useful when there are multiple complex formula, depending on the same
sub-expressions (or sub-calculations). The cross apply
technique ends up much shorter than using CTEs, as well as
being much clearer, because you don’t need to drag along the entire table along each time.
Of course, the obvious downside is that the calculations appear after the query that references them—but this reversal of expected order is par for the course with SQL, so we should all be accustomed to it!