Distance calculation with Impala (or Hive)
The haversine formula is an equation important in navigation, giving great-circle distances between two points on a sphere from their longitudes and latitudes. It is a special case of a more general formula in spherical trigonometry, the law of haversines, relating the sides and angles of spherical triangles. The first table of haversines in English was published by James Andrew in 1805. Florian Cajori credits an earlier use by Jose de Mendoza y Ríos in 1801 The term haversine was coined in 1835 by Prof James Inman.
- Source: wikipedia-haversine
My colleague Giovanni described the formula in a previous post and coded an implementation in Python with NumPy for fast results. In this post I show you how to use Impala
This also works in Hive. (but that is way slower ofcourse)
Impala query language
Impala has a lot of Math, String, Date and other functions you should checkout. Impala supports all the math functions which are required to implement the formula. The query I put together:
select 2 * asin( sqrt( cos(radians(lat1)) * cos(radians(lat2)) * pow(sin(radians((lon1 - lon2)/2)), 2) + pow(sin(radians((lat1 - lat2)/2)), 2) ) ) * 6371 distance_km from my_table;
note: for miles use, 3956 instead of 6371.
I created an example using Hoofddorp Station to Amsterdam Central Station. In HUE you can plot your query results with lat/lon values on the map. I assume you may not be familiar where Amsterdam is, but Hoofddorp is well known for it's..., well now you know where Hoofddorp and Amsterdam are located.
- Hoofddorp Station: 52.2909264998, 4.700868765513
- Amsterdam Station: 52.3773759354, 4.896747677825
Let's use these variables in the query:
select 2 * asin( sqrt( cos(radians(52.2909264998)) * cos(radians(52.3773759354)) * pow(sin(radians((4.700868765513 - 4.896747677825)/2)), 2) + pow(sin(radians((52.2909264998 - 52.3773759354)/2)), 2) ) ) * 6371 distance_km; -- result: 16.415929129056497
Verify the result
Using the google-maps-calculate-distance you can measure the distance between two points on the map.
That's good! Feel free to copy past the formula and use it in Impala or Hive to crunch some geo data.
Greetings, Alexander Bij
Follow us for more of this
How to Find Blockchain Use Cases: Part I
September 17, 2018
Opening up some training material
September 05, 2018
GoDataDriven open source contribution: Augustus 2018 edition
September 05, 2018
Python Masterclass with Restart Network
August 21, 2018
August 15, 2018
Write less terrible code with Jupyter Notebook
August 05, 2018