Distance calculation with Impala (or Hive)
The haversine formula is an equation important in navigation, giving great-circle distances between two points on a sphere from their longitudes and latitudes. It is a special case of a more general formula in spherical trigonometry, the law of haversines, relating the sides and angles of spherical triangles. The first table of haversines in English was published by James Andrew in 1805. Florian Cajori credits an earlier use by Jose de Mendoza y Ríos in 1801 The term haversine was coined in 1835 by Prof James Inman.
- Source: wikipedia-haversine
My colleague Giovanni described the formula in a previous post and coded an implementation in Python with NumPy for fast results. In this post I show you how to use Impala
This also works in Hive. (but that is way slower ofcourse)
Impala query language
Impala has a lot of Math, String, Date and other functions you should checkout. Impala supports all the math functions which are required to implement the formula. The query I put together:
select 2 * asin( sqrt( cos(radians(lat1)) * cos(radians(lat2)) * pow(sin(radians((lon1 - lon2)/2)), 2) + pow(sin(radians((lat1 - lat2)/2)), 2) ) ) * 6371 distance_km from my_table;
note: for miles use, 3956 instead of 6371.
I created an example using Hoofddorp Station to Amsterdam Central Station. In HUE you can plot your query results with lat/lon values on the map. I assume you may not be familiar where Amsterdam is, but Hoofddorp is well known for it's..., well now you know where Hoofddorp and Amsterdam are located.
- Hoofddorp Station: 52.2909264998, 4.700868765513
- Amsterdam Station: 52.3773759354, 4.896747677825
Let's use these variables in the query:
select 2 * asin( sqrt( cos(radians(52.2909264998)) * cos(radians(52.3773759354)) * pow(sin(radians((4.700868765513 - 4.896747677825)/2)), 2) + pow(sin(radians((52.2909264998 - 52.3773759354)/2)), 2) ) ) * 6371 distance_km; -- result: 16.415929129056497
Verify the result
Using the google-maps-calculate-distance you can measure the distance between two points on the map.
That's good! Feel free to copy past the formula and use it in Impala or Hive to crunch some geo data.
Greetings, Alexander Bij
Follow us for more of this
Testing and debugging Apache Airflow
February 22, 2019
The Zen of Python and Apache Airflow
February 18, 2019
AWS Machine Learning Competency Status for GoDataDriven
February 14, 2019
GoDataDriven Open Source Contribution for January 2019, the Apache Edition
February 13, 2019
Our social responsibility as a company
February 08, 2019
Keras: multi-label classification with ImageDataGenerator
January 31, 2019