Mapping New York in Seconds: Geospatial Magic with Polars
Today, we will tackle an interesting library for Rust enthusiasts called Polars (pola.rs). Polars is a high-performance DataFrame library (if you have ever worked with similar libraries - the most famous is probably Pandas in the Python ecosystem - you will know straight away what it is and where it's useful). Polars is, of course, implemented in Rust but also has Python bindings, so it can be used in both worlds. It's designed for efficient data processing and manipulation with excellent performance.
In contrast to Pandas, which uses NumPy arrays, Polars is built on the Apache Arrow columnar memory format and is optimized for analytics workloads. The execution model emphasizes lazy execution, which enables query optimization and is generally faster than Pandas, especially for operations like group-bys, joins and merges, as well as filter operations on large data sets (you can check out some benchmarks here).
Because Polars is written in Rust, it has outstanding memory efficiency compared to Pandas. It uses more efficient columnar storage and has significantly less overhead in its data structures. On top of that, it has built-in multithreading for many operations, while Pandas stays single-threaded unless you explicitly reach for additional Python libraries like Dask.
Short tutorial on Pola.rs
For the purpose of this blog post, and to test Polars with a dataset that more closely resembles real-world usage, I have picked the NYC Taxi dataset, which consists of 3 files and 5.4 GB of data (the first quarter of 2016).
First, we need to load the data and merge it into one lazy dataset:
use polars::prelude::*;

let file_paths = vec![
    "nyc_taxi_data/yellow_tripdata_2016-01.csv",
    "nyc_taxi_data/yellow_tripdata_2016-02.csv",
    "nyc_taxi_data/yellow_tripdata_2016-03.csv",
];

// Create a vector to store individual LazyFrames
let mut lazy_frames: Vec<LazyFrame> = Vec::new();

println!("loading data...");

// Read each CSV file into a LazyFrame
for path in file_paths {
    let lf = LazyCsvReader::new(path)
        .with_has_header(true)
        .with_try_parse_dates(false)
        .with_low_memory(true)
        .finish()?;
    lazy_frames.push(lf);
}
The code above reads multiple CSV files into LazyFrames instead of plain DataFrames. This means that computations like parsing, filtering, etc. will not be executed immediately. Instead, we build a logical plan that will later be optimized and executed by the Polars engine.
With Polars, we can load data eagerly or lazily. The eager version is simpler and more straightforward for small datasets and quick prototyping. Still, there are drawbacks: no query optimizations and possibly higher memory use (Polars will not prune the columns you never actually use in your queries).
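For comparison, eagerly loading a single file could look like the sketch below (it uses the older CsvReader::from_path style API that also appears later in this post; newer Polars versions expose the same thing through CsvReadOptions instead):

// Eager: parse the whole file into a DataFrame right away,
// with no logical plan and no automatic column pruning
let df = CsvReader::from_path("nyc_taxi_data/yellow_tripdata_2016-01.csv")?
    .finish()?;

In this post, we stick with the lazy API.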
// Create UnionArgs for the concat function
let union_args = UnionArgs {
    parallel: false,
    ..Default::default()
};

// Concatenate all LazyFrames vertically
let combined_lf = concat(lazy_frames, union_args)?;

// Collect data
let df = combined_lf.collect()?;
Once we have the data loaded into our lazy frames from multiple sources, we can concatenate them into a single combined data frame. For this purpose, we use the UnionArgs struct, which configures how Polars should concatenate the frames for us.
Okay, so we are at the point where we have all of our data collected and ready to be processed. For the purpose of this short tutorial, I have decided that we will investigate only 2 columns, dropoff_latitude and dropoff_longitude, which indicate the locations where NYC taxis drop off passengers. Of course, there are other interesting columns like tip amount, totals, etc., but we want location data so we can play with it using the H3 library and show something on the map.

We pick the columns and try to get a typed representation of each one as f64.
let lat_series = df.column("dropoff_latitude")?;
let lng_series = df.column("dropoff_longitude")?;
let lat_vec = lat_series.f64()?;
let lng_vec = lng_series.f64()?;
At this point, our data is ready for manipulation. For our simple example, we’ll convert the lat/lng points into H3 cells while counting the occurrences of records falling into the same H3 cell.
use h3o::{LatLng, Resolution};
use std::collections::HashMap;

// Vector of H3 cell strings (one per row) and a HashMap
// to count the frequency of each H3 cell
let mut h3_cells: Vec<String> = Vec::with_capacity(lat_vec.len());
let mut h3_counts: HashMap<String, u32> = HashMap::new();

// Convert lat/lng to H3 cells and count occurrences
for i in 0..lat_vec.len() {
    if let (Some(lat), Some(lng)) = (lat_vec.get(i), lng_vec.get(i)) {
        let latlng = LatLng::new(lat, lng).expect("Invalid coordinates");
        let h3_index = latlng.to_cell(Resolution::Nine);
        let h3_str = h3_index.to_string();
        // Store the H3 index as a string
        h3_cells.push(h3_str.clone());
        // Increment count for this H3 cell
        *h3_counts.entry(h3_str).or_insert(0) += 1;
    } else {
        h3_cells.push(String::new()); // Handle missing values
    }
}
At the end of our simple data manipulation, we add the h3_index column to our data frame so that we can use it later for map generation with JavaScript.
// Add H3 cell indices to the dataframe
let h3_series = Series::new(PlSmallStr::from_str("h3_index"), &h3_cells);
let mut df_with_h3: DataFrame = df.clone();
df_with_h3.with_column(h3_series)?;
Once we have an index, we can populate a simple HTML page with leaflet.js and h3-js.umd.js to show our polygons on a New York City map. I have added some frequency heat-map functionality so that you can easily see which drop-off locations were used most often (red colour).
The H3 frequencies are converted to JSON and injected into a JS object to be used on the page later:
// Convert the HashMap into JSON key/value pairs for the page
let h3_frequency_json = h3_counts
    .iter()
    .map(|(h3, count)| format!("\"{}\":{}", h3, count))
    .collect::<Vec<_>>()
    .join(",\n ");
We also do some basic calculations to generate the legend, like the maximum frequency value across all generated H3 polygons:
let max_count = h3_counts.values().max().cloned().unwrap_or(1);
The script saves the generated HTML to a new file, and the output is a simple Leaflet map on an HTML page with some tooltips (frequency) and a legend for the colours used in the heatmap.
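The HTML generation itself is plain string templating; a minimal sketch of how the injection could look is below (the template file name and the {H3_FREQUENCIES} placeholder are hypothetical, not from the actual script):

use std::fs;

// Read a hypothetical template containing a {H3_FREQUENCIES} placeholder,
// inject the JSON pairs wrapped in braces, and save the final page
let template = fs::read_to_string("map_template.html")?;
let html = template.replace(
    "{H3_FREQUENCIES}",
    &format!("{{ {} }}", h3_frequency_json),
);
fs::write("nyc_dropoff_map.html", html)?;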
The demo page is available under the link.
Speeding things up
That was all fine and dandy, and Pola.rs gives a great developer experience, but it looks like things could be a bit faster. With the current code, the whole generation process takes about 5 minutes (00:05:00.105). Given that it chews through 5.4 GB of data, that's still impressive, but we can do better.
The first thing we can do is drop all the columns we won't use, as early as possible and before calling .collect() on our lazy data frame.
When using lazy data frames in Polars, some optimizations will be done for us automatically. Nevertheless, dropping unused columns as soon as possible is always a good practice.
// Read each CSV file into a LazyFrame
for path in file_paths {
    let lf = LazyCsvReader::new(path)
        .with_has_header(true)
        .with_try_parse_dates(false)
        .with_low_memory(true)
        .finish()?;
    // Keep only the two columns we actually need
    let lf_lat_lng = lf.select([
        col("dropoff_latitude"),
        col("dropoff_longitude"),
    ]);
    lazy_frames.push(lf_lat_lng);
}
With this simple trick, we saved about 16 seconds; the total elapsed time to run our script end-to-end is now 00:04:43.905.
If you are reading this post carefully, you have probably noticed the parallel: false field in the union_args struct we used when combining our lazy frames into a single entity. This was done on purpose, because changing this flag to true buys us another 19 seconds, as the Polars engine can utilize multiple cores on our machine for this operation.
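Flipping the flag is a one-line change to the code from earlier:

// Let Polars concatenate the LazyFrames on multiple cores
let union_args = UnionArgs {
    parallel: true,
    ..Default::default()
};
let combined_lf = concat(lazy_frames, union_args)?;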
UnionArgs is just one simple example, but when using Polars, watch out for other operations that can utilise parallelism and speed up your code significantly. Some of those include:
1. .par_apply() on ChunkedArrays
let result = some_chunked_array.par_apply(|v| expensive_fn(v));
2. .with_streaming(true) on joins and group-bys
lazy_df
    .join(other, ...)
    .with_streaming(true)
3. CsvReader::with_n_threads(n) (eager only)
let df = CsvReader::from_path("file.csv")?
    .with_n_threads(4)
    .finish()?;
For most internal operations like .groupby(), .join(), .agg(), etc., Polars will use parallel execution by default unless it's explicitly disabled, but you should always check the docs and use the parallel options where they are available.
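For example, an aggregation like the sketch below is parallelized out of the box (it reuses the h3_index column we built earlier; depending on your Polars version the method is called group_by or groupby):

// Count rows per H3 cell; Polars parallelizes the group-by automatically
let per_cell = df_with_h3
    .clone()
    .lazy()
    .group_by([col("h3_index")])
    .agg([col("h3_index").count().alias("count")])
    .collect()?;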
After enabling parallel in our union args, we are now at 00:04:24.839.
If you run the script yourself, you will quickly notice that most of the time is spent converting our lat/lng pairs into H3 cells. This is because we are using a simple for loop to iterate over all the records in the data frame. This definitely can be done better.
If you are not completely new to Polars, you probably already know this: Polars uses the Rayon data-parallelism library under the hood. We can easily utilise Rayon to make our own code parallel too and speed things up considerably, instead of using a simple for loop and going over each record one by one.
Rayon is a high-level abstraction over threads. It makes parallel for-loops and map/filter/collect chains run in parallel across CPU cores, and it's very safe and ergonomic: there is no manual thread management, no Mutexes, no RwLocks, no unsafe and no pinning.
The small challenge in our code when trying to make it parallel is that we count occurrences of each cell as we go. For this purpose, we need to split the computation into two separate parts: one converting the data to cells and another counting the occurrences.
use rayon::prelude::*;

// Walk the row indices in parallel, converting each lat/lng pair
// into an H3 cell string (None for missing or invalid coordinates)
let results: Vec<Option<String>> = (0..lat_vec.len())
    .into_par_iter()
    .map(|i| {
        match (lat_vec.get(i), lng_vec.get(i)) {
            (Some(lat), Some(lng)) => {
                let latlng = LatLng::new(lat, lng).ok()?;
                let h3_index = latlng.to_cell(Resolution::Nine);
                Some(h3_index.to_string())
            }
            _ => None,
        }
    })
    .collect();
Counting occurrences and collecting data into h3_cells works like before:
let mut h3_cells: Vec<String> = Vec::with_capacity(results.len());
let mut h3_counts: HashMap<String, u32> = HashMap::new();

for maybe_h3 in results {
    if let Some(h3_str) = maybe_h3 {
        *h3_counts.entry(h3_str.clone()).or_insert(0) += 1;
        h3_cells.push(h3_str);
    } else {
        h3_cells.push(String::new());
    }
}
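If the sequential counting loop ever became the bottleneck, it could be parallelized as well with Rayon's fold/reduce pattern. This is a hypothetical variant, not part of the original script:

// Alternative: build the per-cell counts in parallel by folding
// thread-local HashMaps and then merging them
let h3_counts: HashMap<String, u32> = results
    .par_iter()
    .filter_map(|maybe_h3| maybe_h3.as_ref())
    .fold(
        || HashMap::new(),
        |mut acc, h3_str| {
            *acc.entry(h3_str.clone()).or_insert(0) += 1;
            acc
        },
    )
    .reduce(
        || HashMap::new(),
        |mut a, b| {
            for (k, v) in b {
                *a.entry(k).or_insert(0) += v;
            }
            a
        },
    );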
Currently, we are at: Elapsed time: 00:00:50.339, a 3 minute 34 second improvement over the last result and a 4 minute 10 second improvement since the beginning.
Final thoughts
What I wanted to show with this blog post is:
- Pola.rs is fun and can be used similarly to Pandas, but with much faster, type-safe Rust scripting.
- Whatever data manipulation library you use, think about possible optimizations from the start; they can make your script run much faster, and sometimes make it run at all.
- Check out more about H3, a powerful geospatial indexing system that makes working with location data surprisingly elegant and efficient.