Using programming and data engineering techniques and modules on a geospatial (GPS location) dataset
Hi everyone 👋👋,
Python is a powerful, flexible, and beginner-friendly language. But let’s be honest — it’s not always the fastest. Especially when dealing with large data sets, Python’s performance can become a bottleneck. This is especially true when using pandas’ .apply() or loops — tasks that can take a painfully long time to run.
The good news? There are many faster alternatives — and in this post, I’ll walk you through them.
You’ll learn how to:
- Generate synthetic geospatial data (like GPS coordinates) in Python,
- Apply distance calculations to each row of a DataFrame,
- Replace apply() with faster alternatives (like vectorization and parallelism),
- Benchmark different strategies — including Python’s built-ins, NumPy, and parallel computing libraries like pandarallel and swifter.
🧪 Step 1: Setting Up a Clean Python Environment
To keep things tidy, let’s start with a virtual environment:
python3 -m pip install virtualenv
python3 -m venv my_env
source my_env/bin/activate
If you’re using Jupyter (e.g., with VS Code), select the `my_env` kernel when prompted.
my_env is activated on zsh (fino theme) — Image by author
Whether you write your code in Jupyter Notebook or the VS Code Jupyter extension, the thing to note is that you should choose the my_env kernel for the notebook; the VS Code Jupyter extension will show a pop-up asking to install a kernel the first time.
You can remove the my_env folder when you no longer need it and recreate a fresh virtual environment whenever you like.
🛠️ Step 2: The Problem — Processing Every Row in a DataFrame
Let’s say you have a DataFrame with random distributions of GPS-like coordinates (latitude, longitude), and you want to apply a function (like calculating distance) to each row.
This is a common scenario in data science, and how you process rows matters a lot when performance is key.
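Before comparing the options, here is a minimal sketch of the kind of setup the rest of the post assumes: a DataFrame of synthetic GPS-like coordinates and a toy per-row function foo(). The column names, the reference latitude, and the body of foo() are illustrative assumptions; the exact data generation and function used for the benchmarks are in the gist linked at the end.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000

# Synthetic GPS-like coordinates; column "A" plays the role of latitude below
df = pd.DataFrame({
    "A": rng.uniform(-90.0, 90.0, n),    # latitude-like values
    "B": rng.uniform(-180.0, 180.0, n),  # longitude-like values
})

REF_LAT = 41.0  # hypothetical reference latitude

def foo(lat):
    # Toy per-row operation: rough north-south distance (km) to REF_LAT
    return abs(lat - REF_LAT) * 111.0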
🐌 The Usual Way — Loops & .apply() (The Slow Lane)
Sometimes we need to perform the same operation on every row of a DataFrame. The first way to do that is with a plain loop, as in the example below. We have a DataFrame and want to apply the operation in the foo() function to each row; iterrows() yields the rows one at a time, so we can process them easily inside a for loop.
1. iterrows(): The Classic Loop
for index, row in df.iterrows():
    df.loc[index, "processed_feature"] = foo(row["A"])
🔻 This method is readable but very slow. It processes rows one at a time in pure Python.
2. apply(): Cleaner, but Still Slow
df["processed_feature"] = df.apply(lambda row: foo(row["A"]), axis=1)
✅ Shorter syntax
🔻 Still slow — especially with large DataFrames.
🏃♂️ Step 3: Faster Alternatives to .apply()
Here are better options — faster, more efficient, and just as readable.
3. itertuples(): Faster Than iterrows()
The third method is itertuples(). It is faster than the two methods mentioned above. To use it, loop over df.itertuples(), which yields each row as a namedtuple, much like the iterrows() approach.
df["processed_feature"] = [foo(row.A) for row in df.itertuples(index=False)]
✅ Much faster than iterrows()
🔻 Still not vectorized — performance may suffer on very large datasets.
4. List Comprehension: Pythonic & Lean
The list comprehension method takes a more targeted approach: rather than feeding the whole DataFrame through row-wise machinery, we iterate over only the column we actually need. This speeds up the calculation because the interpreter no longer has to pick one column out of a full row object on every iteration. Usage of this approach can be seen in the code below.
df["processed_feature"] = [foo(x) for x in df["A"]]
✅ Efficient and compact
✅ Avoids overhead of row-wise operations
🔻 Only works on a single column
5. map(): Simple and Fast
The map() function is one of the better approaches. It is faster than plain loops and the row-wise methods above because it works directly on a single column, passing each value of that Series to the function. For example:
df["processed_feature"] = df["A"].map(foo)
✅ Very fast
✅ Cleaner syntax
🔻 Limited to single-column functions
⚡ Step 4: Supercharged Speed with Vectorization
6. numpy.vectorize(): Vectorized, But Not Always Faster
NumPy is a numerical array library for Python, and it is used by almost every programmer and data scientist who works with DataFrames. It provides a function called numpy.vectorize(), which wraps a scalar function so it can be called on a whole DataFrame column at once. Genuine array-level computation is fast, and modules such as numba or cffi can beat the other options because they compile a code block down to a low-level language such as C or C++ (a small numba sketch follows the pros and cons below). The numpy.vectorize method can be used as shown here.
import numpy as np
df["processed_feature"] = np.vectorize(foo)(df["A"])
✅ Vectorized syntax
🔻 Doesn’t provide true vectorized performance — it’s still a loop under the hood
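Since numba was mentioned above as a genuinely compiled option, here is a minimal sketch of what it could look like for the toy foo() from earlier. The function body and the 41.0 reference value are assumptions carried over from the setup sketch; the point is that @njit compiles the loop to machine code on first call.

from numba import njit
import numpy as np

@njit
def foo_numba(values):
    # Compiled loop over a NumPy array: same toy distance computation as foo()
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        out[i] = abs(values[i] - 41.0) * 111.0
    return out

df["processed_feature"] = foo_numba(df["A"].to_numpy())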
7. True Pandas Vectorization 💥 (The Gold Standard)
Pandas column-level vectorization is a very simple way to solve this problem, and it is extremely fast. The only thing to do is to express the operation as a function over whole columns and call that function from the main block. The function takes the DataFrame as a parameter, computes the result directly on the column, and returns the processed column. Please check the code below.
def func(df):
    return df["A"] * 10

df["processed_feature"] = func(df)
✅ Fastest option for numeric or column-wise operations
✅ Clean and elegant
🔻 Limited to operations that can be vectorized
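The benchmark results below also mention a numpy_direct_vectorized variant. As a hedged sketch of what that might look like for the toy foo() from the setup (the 41.0 reference value is again an assumption), the idea is to run the whole computation on the underlying NumPy array in one shot, with no Python-level loop at all:

import numpy as np

values = df["A"].to_numpy()
# Whole-column NumPy arithmetic instead of a per-row function call
df["processed_feature"] = np.abs(values - 41.0) * 111.0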
🔥 Step 5: Go Parallel! (With All Your CPU Cores)
For even more power, let’s harness parallel computing with specialized libraries.
8. pandarallel: Parallel .apply() with Ease
pandarallel is a module that can use more than one CPU core, much like swifter. If you have several cores, you can distribute the calculation across them. To install it, use pip and the PyPI repository:
pip install pandarallel
Basic Usage:
from pandarallel import pandarallel
pandarallel.initialize()
df["processed_feature"] = df.parallel_apply(lambda row: foo(row["A"]), axis=1)
✅ Takes advantage of multiple CPU cores
✅ Easy drop-in replacement for .apply()
🔻 Slight setup overhead
9. swifter: Smarter, Parallel .apply()
swifter is another great tool for accelerating pandas operations. It automatically decides the best execution path (plain pandas, Dask, or a parallelized approach) based on the size and complexity of your data. This makes it extremely user-friendly: just swap .apply() with .swifter.apply() and let swifter optimize for speed behind the scenes.
To install:
pip install swifter
Basic Usage:
import swifter
df["processed_feature"] = df.swifter.apply(lambda row: foo(row["A"]), axis=1)
✅ Automatically chooses the optimal computation strategy
✅ No manual parallel setup needed
🔻 Slightly higher memory usage on small data
🔻 May fall back to single-threaded mode on small DataFrames
💰🎁 Step 6 (Bonus): Use R or Julia Engines to Accelerate Programs
The methods in this section were not included in the benchmark tests.
10. Apply Functions in R
The R language has a whole family of apply functions, and they can be used from a Python environment with a few basic tricks. To do this, we need the rpy2 module and the r-base package.
apt install r-base -y   # or use brew install r on macOS
pip install rpy2
Then, in your notebook:
%load_ext rpy2.ipython
# Push the pandas DataFrame 'df' to the R environment
%Rpush df
And define an R function to operate on your DataFrame:
%%R
foo <- function(df) {
  # Example operation: double each element of every column
  df[] <- lapply(df, function(col) col * 2)
  return(df)
}
df <- foo(df)
Then, you can pull the modified df back into Python if needed with %Rpull df.
✅ Unlocks R’s performance and functions
🔻 Requires extra setup
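If you are working in a plain Python script rather than a notebook, the same idea can be expressed through rpy2's converter API instead of the %%R magic. A minimal sketch, assuming rpy2 3.x with its pandas converter activated (the doubling function mirrors the example above):

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # automatic pandas <-> R data.frame conversion

r_foo = ro.r("""
function(df) {
  df[] <- lapply(df, function(col) col * 2)
  df
}
""")

df_doubled = r_foo(df)  # df is a pandas DataFrame; the result converts back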
11. Use Julia Lang Functions With pyJulia
Just like rpy2 allows you to use R functions inside Python, you can use PyJulia
to run Julia code seamlessly within your Python environment. This means you can write high-performance Julia functions for your heavy computations, then call them directly from Python — unlocking Julia’s speed without leaving your Python workflow.
!apt-get install julia -y
!pip install julia
Then, initialize Julia inside Python and define your Julia function:
from julia import Main
# Define a Julia function from Python
Main.eval("""
function haversine(lat1, lon1, lat2, lon2)
    R = 6371.0  # Earth radius in kilometers
    dlat = deg2rad(lat2 - lat1)
    dlon = deg2rad(lon2 - lon1)
    a = sin(dlat/2)^2 + cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * sin(dlon/2)^2
    c = 2 * atan(sqrt(a), sqrt(1 - a))  # two-argument atan (formerly atan2) in Julia 1.x
    return R * c
end
""")
Using the Julia function on your data:
Assuming you have a pandas DataFrame df with columns lat1, lon1, lat2, lon2, apply the Julia function like this:
import pandas as pd
# Example DataFrame
df = pd.DataFrame({
    "lat1": [34.05, 40.71], "lon1": [-118.25, -74.01],
    "lat2": [36.12, 42.36], "lon2": [-115.17, -71.06],
})

# Call the Julia haversine function row-wise
df["distance_km"] = [Main.haversine(*row) for row in df.itertuples(index=False)]
print(df)
✅ Unlocks Julia’s performance inside Python
🔻 Requires Julia installation and initial setup
Summary and Conclusion
100K Records Test Results
In this study, we explored and benchmarked various methods for performing row-wise operations in Pandas, ranging from basic iterrows() and apply() to more optimized techniques like vectorization, swifter, and pandarallel.
Using a sample of 100,000 geospatial records, we measured and visualized the execution time of each method at consistent intervals to understand their computational efficiency.
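The full benchmark code is in the gist linked at the end; as a rough sketch of how such a comparison can be set up, the repeat count and the particular methods timed below are illustrative choices rather than the exact harness used:

import time

def benchmark(name, fn, repeats=3):
    # Run fn() several times and report the best wall-clock time
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    print(f"{name}: {min(times):.4f}s")

benchmark("apply", lambda: df.apply(lambda row: foo(row["A"]), axis=1))
benchmark("map", lambda: df["A"].map(foo))
benchmark("list_comprehension", lambda: [foo(x) for x in df["A"]])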
The results reveal a stark contrast in performance:
- 🔵 iterrows() was by far the slowest, with a total execution time exceeding 2 seconds. This confirms its reputation as a method best avoided in performance-critical applications.
- 🟠 apply() fared better, but still demonstrated significant overhead compared to more efficient alternatives, clocking in at nearly 1 second.
- 🟢 Methods like itertuples, map, and list comprehension showed substantial speed improvements, processing the entire dataset in under 0.1 seconds, making them solid options when vectorization isn't feasible.
- 🟣 np.vectorize performed comparably to the above, but didn't provide any distinct advantage over native Python techniques.
- 🔴 numpy_direct_vectorized was the fastest of all, completing the task in just ~0.02 seconds and demonstrating the power of true vectorized computation with NumPy.
- ⚫ swifter and 🟡 pandarallel delivered decent speedups by leveraging parallel computing, but didn't outperform NumPy vectorization in this single-node setup. Their true potential likely shines with larger datasets or more complex row operations.
- If maximum performance is critical and the operation can be vectorized mathematically, NumPy vectorization is the clear winner.
- If vectorization isn't feasible (e.g., complex logic per row), itertuples, map, or list comprehension offer a clean trade-off between readability and speed.
- For larger data or multicore machines, modules like swifter and pandarallel can significantly accelerate row-wise operations by utilizing parallel execution.
10M Records Test Results
In this extended benchmark using 10 million rows processed in 10,000-row chunks, we evaluated the performance of several popular Pandas-compatible row-wise operation techniques. The goal was to assess their scalability and practical efficiency under heavier computational load.
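The 10,000-row chunking mentioned above can be implemented in several ways; here is a minimal sketch of one possibility, assuming the toy foo() from the setup and a plain slice-based loop (the real benchmark harness is in the gist):

import pandas as pd

chunk_size = 10_000
results = []

# Process the large DataFrame in 10,000-row slices
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    results.append(chunk["A"].map(foo))

df["processed_feature"] = pd.concat(results)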
The results clearly show that choice of method matters even more at scale:
- 🔵 iterrows() once again proved the slowest by a wide margin, with a total runtime of 211.48 seconds. Its inefficiency becomes ever more pronounced as data volume grows, reinforcing that it should be avoided in all performance-sensitive applications.
- 🟠 apply() delivered better performance than iterrows, but still clocked in at 102.22 seconds, indicating high overhead and moderate scalability.
- 🟢 itertuples, map, and list comprehension significantly outperformed both, with times ranging from 6.49s to 8.5s. These methods provide a good balance between readability and efficiency for non-vectorizable logic.
- 🟣 np.vectorize achieved 5.75 seconds, slightly ahead of native Python methods, but still fell short of true vectorization performance.
- 🔴 numpy_direct_vectorized was again the fastest, completing the task in just 1.03 seconds. This underscores the unmatched speed of low-level, array-based operations when applicable.
- ⚫ swifter and 🟡 pandarallel clocked in at 129.28s and 81.55s, respectively. pandarallel's parallel execution beat plain apply() (102.22s), while swifter's automatic strategy selection added overhead that left it slightly slower than apply() in this test. Both still lagged behind optimized NumPy-based logic; their benefits are likely more pronounced with complex row-wise computations or multi-core environments handling larger workloads.
Ultimately, choosing the right method depends on your use case — and this benchmark provides an actionable reference to make informed decisions for efficient data processing in Python.
All of the code is available for review in this public gist:
https://gist.github.com/nuhyurdev/53249123ff9dacb7cc3935016abe15ea