Distance calculation with Impala (or Hive)
The haversine formula is an equation important in navigation, giving great-circle distances between two points on a sphere from their longitudes and latitudes. It is a special case of a more general formula in spherical trigonometry, the law of haversines, relating the sides and angles of spherical triangles. The first table of haversines in English was published by James Andrew in 1805. Florian Cajori credits an earlier use by Jose de Mendoza y Ríos in 1801 The term haversine was coined in 1835 by Prof James Inman.
- Source: wikipedia-haversine
My colleague Giovanni described the formula in a previous post and coded an implementation in Python with NumPy for fast results. In this post I show you how to use Impala
This also works in Hive. (but that is way slower ofcourse)
Impala query language
Impala has a lot of Math, String, Date and other functions you should checkout. Impala supports all the math functions which are required to implement the formula. The query I put together:
select 2 * asin( sqrt( cos(radians(lat1)) * cos(radians(lat2)) * pow(sin(radians((lon1 - lon2)/2)), 2) + pow(sin(radians((lat1 - lat2)/2)), 2) ) ) * 6371 distance_km from my_table;
note: for miles use, 3956 instead of 6371.
I created an example using Hoofddorp Station to Amsterdam Central Station. In HUE you can plot your query results with lat/lon values on the map. I assume you may not be familiar where Amsterdam is, but Hoofddorp is well known for it's..., well now you know where Hoofddorp and Amsterdam are located.
- Hoofddorp Station: 52.2909264998, 4.700868765513
- Amsterdam Station: 52.3773759354, 4.896747677825
Let's use these variables in the query:
select 2 * asin( sqrt( cos(radians(52.2909264998)) * cos(radians(52.3773759354)) * pow(sin(radians((4.700868765513 - 4.896747677825)/2)), 2) + pow(sin(radians((52.2909264998 - 52.3773759354)/2)), 2) ) ) * 6371 distance_km; -- result: 16.415929129056497
Verify the result
Using the google-maps-calculate-distance you can measure the distance between two points on the map.
That's good! Feel free to copy past the formula and use it in Impala or Hive to crunch some geo data.
Greetings, Alexander Bij
Follow us for more of this
Elitist shuffle for recommendation systems
May 13, 2018
GoDataDriven open source contribution: April 2018 edition
May 06, 2018
Event Report: Blockchain Ask Me Anything
May 03, 2018
Towards fairness in ML with adversarial networks
April 27, 2018
GDPR: "May 25th is when it really begins
April 10, 2018
GoDataDriven open source contribution: March 2018 edition
April 04, 2018