Using programming and data engineering techniques and modules on a geospatial (GPS location) dataset
Hi everyone 👋👋,
Python is a powerful, flexible, and beginner-friendly language. But let’s be honest — it’s not always the fastest. Especially when dealing with large data sets, Python’s performance can become a bottleneck. This is especially true when using pandas’ .apply() or loops — tasks that can take a painfully long time to run.
The good news? There are many faster alternatives — and in this post, I’ll walk you through them.
You’ll learn how to:
- Generate synthetic geospatial data (like GPS coordinates) in Python,
- Apply distance calculations to each row of a DataFrame,
- Replace apply() with faster alternatives (like vectorization and parallelism),
- Benchmark different strategies — including Python’s built-ins, NumPy, and parallel computing libraries like pandarallel and swifter.
🧪 Step 1: Setting Up a Clean Python Environment
To keep things tidy, let’s start with a virtual environment:
python3 -m pip install virtualenv
python3 -m venv my_env
source my_env/bin/activate
If you’re using Jupyter (e.g., with VS Code), select the `my_env` kernel when prompted.
my_env is activated on zsh (fino theme) — Image by author
Whether you write your code in Jupyter Notebook or the VS Code Jupyter extension, the thing to note is that you should choose the my_env kernel for the notebook; the VS Code Jupyter extension will show a pop-up asking to install a kernel the first time.
You can remove the my_env folder when you no longer need it and recreate a fresh virtual environment whenever you like.
🛠️ Step 2: The Problem — Processing Every Row in a DataFrame
Let’s say you have a DataFrame with random distributions of GPS-like coordinates (latitude, longitude), and you want to apply a function (like calculating distance) to each row.
This is a common scenario in data science, and how you process rows matters a lot when performance is key.
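Before comparing the options, here is a minimal sketch of the kind of setup the rest of the post assumes: a DataFrame of synthetic GPS-like coordinates and a toy per-row function foo(). The column names, the reference latitude, and the body of foo() are illustrative assumptions; the exact data generation and function used for the benchmarks are in the gist linked at the end.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000

# Synthetic GPS-like coordinates; column "A" plays the role of latitude below
df = pd.DataFrame({
    "A": rng.uniform(-90.0, 90.0, n),    # latitude-like values
    "B": rng.uniform(-180.0, 180.0, n),  # longitude-like values
})

REF_LAT = 41.0  # hypothetical reference latitude

def foo(lat):
    # Toy per-row operation: rough north-south distance (km) to REF_LAT
    return abs(lat - REF_LAT) * 111.0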
🐌 The Usual Way — Loops & .apply() (The Slow Lane)
Sometimes we need to perform the same operation on every row of a DataFrame. The first way to do that is with a plain loop, as in the example below. We have a DataFrame and want to apply the operation in the foo() function to each row; iterrows() yields the rows one at a time, so we can process them easily inside a for loop.
1. iterrows(): The Classic Loop
for index, row in df.iterrows():
    df.loc[index, "processed_feature"] = foo(row["A"])
🔻 This method is readable but very slow. It processes rows one at a time in pure Python.
2. apply(): Cleaner, but Still Slow
df["processed_feature"] = df.apply(lambda row: foo(row["A"]), axis=1)
✅ Shorter syntax
🔻 Still slow — especially with large DataFrames.
🏃♂️ Step 3: Faster Alternatives to .apply()
Here are better options — faster, more efficient, and just as readable.
3. itertuples(): Faster Than iterrows()
The third method is itertuples(). It is faster than the two methods mentioned above. To use it, loop over df.itertuples(), which yields each row as a namedtuple, much like the iterrows() approach.
df["processed_feature"] = [foo(row.A) for row in df.itertuples(index=False)]
✅ Much faster than iterrows()
🔻 Still not vectorized — performance may suffer on very large datasets.
4. List Comprehension: Pythonic & Lean
The list comprehension method takes a more targeted approach: rather than feeding the whole DataFrame through row-wise machinery, we iterate over only the column we actually need. This speeds up the calculation because the interpreter no longer has to pick one column out of a full row object on every iteration. Usage of this approach can be seen in the code below.
df["processed_feature"] = [foo(x) for x in df["A"]]
✅ Efficient and compact
✅ Avoids overhead of row-wise operations
🔻 Only works on a single column
5. map(): Simple and Fast
The map() function is one of the better approaches. It is faster than plain loops and the row-wise methods above because it works directly on a single column, passing each value of that Series to the function. For example:
df["processed_feature"] = df["A"].map(foo)
✅ Very fast
✅ Cleaner syntax
🔻 Limited to single-column functions
⚡ Step 4: Supercharged Speed with Vectorization
6. numpy.vectorize(): Vectorized, But Not Always Faster
NumPy is a numerical array library for Python, and it is used by almost every programmer and data scientist who works with DataFrames. It provides a function called numpy.vectorize(), which wraps a scalar function so it can be called on a whole DataFrame column at once. Genuine array-level computation is fast, and modules such as numba or cffi can beat the other options because they compile a code block down to a low-level language such as C or C++ (a small numba sketch follows the pros and cons below). The numpy.vectorize method can be used as shown here.
import numpy as np
df["processed_feature"] = np.vectorize(foo)(df["A"])
✅ Vectorized syntax
🔻 Doesn’t provide true vectorized performance — it’s still a loop under the hood
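Since numba was mentioned above as a genuinely compiled option, here is a minimal sketch of what it could look like for the toy foo() from earlier. The function body and the 41.0 reference value are assumptions carried over from the setup sketch; the point is that @njit compiles the loop to machine code on first call.

from numba import njit
import numpy as np

@njit
def foo_numba(values):
    # Compiled loop over a NumPy array: same toy distance computation as foo()
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        out[i] = abs(values[i] - 41.0) * 111.0
    return out

df["processed_feature"] = foo_numba(df["A"].to_numpy())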
7. True Pandas Vectorization 💥 (The Gold Standard)
Pandas column-level vectorization is a very simple way to solve this problem, and it is extremely fast. The only thing to do is to express the operation as a function over whole columns and call that function from the main block. The function takes the DataFrame as a parameter, computes the result directly on the column, and returns the processed column. Please check the code below.
def func(df):
    return df["A"] * 10

df["processed_feature"] = func(df)
✅ Fastest option for numeric or column-wise operations
✅ Clean and elegant
🔻 Limited to operations that can be vectorized
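The benchmark results below also mention a numpy_direct_vectorized variant. As a hedged sketch of what that might look like for the toy foo() from the setup (the 41.0 reference value is again an assumption), the idea is to run the whole computation on the underlying NumPy array in one shot, with no Python-level loop at all:

import numpy as np

values = df["A"].to_numpy()
# Whole-column NumPy arithmetic instead of a per-row function call
df["processed_feature"] = np.abs(values - 41.0) * 111.0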
🔥 Step 5: Go Parallel! (With All Your CPU Cores)
For even more power, let’s harness parallel computing with specialized libraries.
8. pandarallel: Parallel .apply() with Ease
pandarallel is a module that can use more than one CPU core, much like swifter. If you have several cores, you can distribute the calculation across them. To install it, use pip and the PyPI repository:
pip install pandarallel
Basic Usage:
from pandarallel import pandarallel
pandarallel.initialize()
df["processed_feature"] = df.parallel_apply(lambda row: foo(row["A"]), axis=1)
✅ Takes advantage of multiple CPU cores
✅ Easy drop-in replacement for .apply()
🔻 Slight setup overhead
9. swifter: Smarter, Parallel .apply()
swifter is another great tool for accelerating pandas operations. It automatically decides the best execution path (plain pandas, Dask, or a parallelized approach) based on the size and complexity of your data. This makes it extremely user-friendly: just swap .apply() with .swifter.apply() and let swifter optimize for speed behind the scenes.
To install:
pip install swifter
Basic Usage:
import swifter
df["processed_feature"] = df.swifter.apply(lambda row: foo(row["A"]), axis=1)
✅ Automatically chooses the optimal computation strategy
✅ No manual parallel setup needed
🔻 Slightly higher memory usage on small data
🔻 May fall back to single-threaded mode on small DataFrames
💰🎁 Step 6 (Bonus): Use R or Julia Engines to Accelerate Programs
The methods in this section were not included in the benchmark tests.
10. Apply Functions in R
The R language has a whole family of apply functions, and they can be used from a Python environment with a few basic tricks. To do this, we need the rpy2 module and the r-base package.
apt install r-base -y   # or use brew install r on macOS
pip install rpy2
Then, in your notebook:
%load_ext rpy2.ipython
# Push the pandas DataFrame 'df' to the R environment
%Rpush df
And define an R function to operate on your DataFrame:
%%R
foo <- function(df) {
  # Example operation: double each element of every column
  df[] <- lapply(df, function(col) col * 2)
  return(df)
}
df <- foo(df)
Then, you can pull the modified df back into Python if needed with %Rpull df.
✅ Unlocks R’s performance and functions
🔻 Requires extra setup
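If you are working in a plain Python script rather than a notebook, the same idea can be expressed through rpy2's converter API instead of the %%R magic. A minimal sketch, assuming rpy2 3.x with its pandas converter activated (the doubling function mirrors the example above):

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # automatic pandas <-> R data.frame conversion

r_foo = ro.r("""
function(df) {
  df[] <- lapply(df, function(col) col * 2)
  df
}
""")

df_doubled = r_foo(df)  # df is a pandas DataFrame; the result converts back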
11. Use Julia Lang Functions With pyJulia
Just like rpy2 allows you to use R functions inside Python, you can use PyJulia
to run Julia code seamlessly within your Python environment. This means you can write high-performance Julia functions for your heavy computations, then call them directly from Python — unlocking Julia’s speed without leaving your Python workflow.
!apt-get install julia -y
!pip install julia
Then, initialize Julia inside Python and define your Julia function:
from julia import Main
# Define a Julia function from Python
Main.eval("""
function haversine(lat1, lon1, lat2, lon2)
    R = 6371.0  # Earth radius in kilometers
    dlat = deg2rad(lat2 - lat1)
    dlon = deg2rad(lon2 - lon1)
    a = sin(dlat/2)^2 + cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * sin(dlon/2)^2
    c = 2 * atan(sqrt(a), sqrt(1 - a))  # two-argument atan (formerly atan2) in Julia 1.x
    return R * c
end
""")
Using the Julia function on your data:
Assuming you have a pandas DataFrame df with columns lat1, lon1, lat2, lon2, apply the Julia function like this:
import pandas as pd
# Example DataFrame
df = pd.DataFrame({
    "lat1": [34.05, 40.71], "lon1": [-118.25, -74.01],
    "lat2": [36.12, 42.36], "lon2": [-115.17, -71.06],
})

# Call the Julia haversine function row-wise
df["distance_km"] = [Main.haversine(*row) for row in df.itertuples(index=False)]
print(df)
✅ Unlocks Julia’s performance inside Python
🔻 Requires Julia installation and initial setup
Summary and Conclusion
100K Records Test Results
In this study, we explored and benchmarked various methods for performing row-wise operations in Pandas, ranging from basic iterrows() and apply() to more optimized techniques like vectorization, swifter, and pandarallel.
Using a sample of 100,000 geospatial records, we measured and visualized the execution time of each method at consistent intervals to understand their computational efficiency.
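The full benchmark code is in the gist linked at the end; as a rough sketch of how such a comparison can be set up, the repeat count and the particular methods timed below are illustrative choices rather than the exact harness used:

import time

def benchmark(name, fn, repeats=3):
    # Run fn() several times and report the best wall-clock time
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    print(f"{name}: {min(times):.4f}s")

benchmark("apply", lambda: df.apply(lambda row: foo(row["A"]), axis=1))
benchmark("map", lambda: df["A"].map(foo))
benchmark("list_comprehension", lambda: [foo(x) for x in df["A"]])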
The results reveal a stark contrast in performance:
- 🔵 iterrows() was by far the slowest, with a total execution time exceeding 2 seconds. This confirms its reputation as a method best avoided in performance-critical applications.
- 🟠 apply() fared better, but still demonstrated significant overhead compared to more efficient alternatives, clocking in at nearly 1 second.
- 🟢 Methods like itertuples, map, and list comprehension showed substantial speed improvements, processing the entire dataset in under 0.1 seconds, making them solid options when vectorization isn't feasible.
- 🟣 np.vectorize performed comparably to the above, but didn't provide any distinct advantage over native Python techniques.
- 🔴 numpy_direct_vectorized was the fastest of all, completing the task in just ~0.02 seconds and demonstrating the power of true vectorized computation with NumPy.
- ⚫ swifter and 🟡 pandarallel delivered decent speedups by leveraging parallel computing, but didn't outperform NumPy vectorization in this single-node setup. Their true potential likely shines with larger datasets or more complex row operations.
- If maximum performance is critical and the operation can be vectorized mathematically, NumPy vectorization is the clear winner.
- If vectorization isn't feasible (e.g., complex logic per row), itertuples, map, or list comprehension offer a clean trade-off between readability and speed.
- For larger data or multicore machines, modules like swifter and pandarallel can significantly accelerate row-wise operations by utilizing parallel execution.
10M Records Test Results
In this extended benchmark using 10 million rows processed in 10,000-row chunks, we evaluated the performance of several popular Pandas-compatible row-wise operation techniques. The goal was to assess their scalability and practical efficiency under heavier computational load.
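The 10,000-row chunking mentioned above can be implemented in several ways; here is a minimal sketch of one possibility, assuming the toy foo() from the setup and a plain slice-based loop (the real benchmark harness is in the gist):

import pandas as pd

chunk_size = 10_000
results = []

# Process the large DataFrame in 10,000-row slices
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    results.append(chunk["A"].map(foo))

df["processed_feature"] = pd.concat(results)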
The results clearly show that choice of method matters even more at scale:
- 🔵 iterrows() once again proved the slowest by a wide margin, with a total runtime of 211.48 seconds. Its inefficiency becomes ever more pronounced as data volume grows, reinforcing that it should be avoided in all performance-sensitive applications.
- 🟠 apply() delivered better performance than iterrows, but still clocked in at 102.22 seconds, indicating high overhead and moderate scalability.
- 🟢 itertuples, map, and list comprehension significantly outperformed both, with times ranging from 6.49s to 8.5s. These methods provide a good balance between readability and efficiency for non-vectorizable logic.
- 🟣 np.vectorize achieved 5.75 seconds, slightly ahead of native Python methods, but still fell short of true vectorization performance.
- 🔴 numpy_direct_vectorized was again the fastest, completing the task in just 1.03 seconds. This underscores the unmatched speed of low-level, array-based operations when applicable.
- ⚫ swifter and 🟡 pandarallel clocked in at 129.28s and 81.55s, respectively. pandarallel's parallel execution beat plain apply() (102.22s), while swifter's automatic strategy selection added overhead that left it slightly slower than apply() in this test. Both still lagged behind optimized NumPy-based logic; their benefits are likely more pronounced with complex row-wise computations or multi-core environments handling larger workloads.
Ultimately, choosing the right method depends on your use case — and this benchmark provides an actionable reference to make informed decisions for efficient data processing in Python.
All of the code is available for review in this public gist:
https://gist.github.com/nuhyurdev/53249123ff9dacb7cc3935016abe15ea