developing: python is fine, you guys
Here's why I'm using python instead of R to develop swishbook
The R programming language is and always will be my first love. When I was first picking up programming, the ease of getting up and running made R inviting to me as a newcomer. I’ve yet to run into a problem that isn’t solvable within the language. It’s the first tool I reach for when I need a quick analysis or start a personal project, and it’s by far the language I’m most proficient in.
All that being said, I’m using python for all the backend work on swishbook.
Python Pros
As much as I am an R shill, python has some alluring features that make it the right choice for this project.
- Dependency management: Python has many (too many?) tools for safely managing language and package versions. The language implicitly encourages users to maintain clean setups with in-project virtual environments. R, on the other hand, implicitly encourages use of a single global package library. While you can manage packages with renv (possibly the only tool for doing so in R?), renv doesn’t manage R versions, and to my knowledge there aren’t great options for doing so.
- Classes are first-class citizens (see what I did there?): Python makes class creation accessible and intuitive. While R has classes, I’ve honestly only ever made direct use of S3 generics in all the years I’ve been using the language. R’s function-first approach is great most of the time. Whenever you have sub-functions that depend on variables scoped to the top-level function, however, you can end up with argument bloat. So, the natural way to do_something() in R can end up being a lot more verbose than the natural way to do_something() in python.
# subfunctions here require us to pass arguments forward!
do_something <- function(a, b) {
	c <- some_scraping_function(a, b)
	d <- do_something_else(c, a = a, b = b)
	e <- do_another_thing(d, a = a, b = b, c = c)
	return(e)
}
do_something_else <- function(c, ..., a, b) {
	# ~something happens~
}
do_another_thing <- function(d, ..., a, b, c) {
	# ~another thing happens~
}
# class methods can reference self.var without argument bloat!
class Something:
	def __init__(self, a, b):
		self.a = a
		self.b = b
				
	def do_something(self):
		self.c = self._some_scraping_function()
		self.d = self._do_something_else()
		e = self._do_another_thing()
		return e
				
	def _some_scraping_function(self):
		# ~scraping happens, using self.a and self.b~
		pass

	def _do_something_else(self):
		# ~something happens~
		pass

	def _do_another_thing(self):
		# ~another thing happens~
		pass
- Interactive web scraping: The python packages for web scraping are a bit more robust than their R equivalents. Namely, Beautiful Soup and Selenium let you do things like set a User-Agent in your HTTP headers and collect HTML that’s rendered only after the site’s JavaScript has run. The best-in-class option in R, rvest, is great for scraping static sites, but doesn’t include some of these ~fancier~ features (as far as I can tell).
- Straightforward execution of arbitrary SQL: Both python and R have a myriad of packages available that make it easy to run select statements against a database and return a dataframe. A number of python packages make it similarly easy to run arbitrary SQL queries (like table creation, inserts, views, etc.), whereas I’ve yet to find an easy equivalent in R (though this may just be because I’m mirroring methods I use in my day job). 
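To make that last point concrete, here's a rough sketch of the kind of thing I mean. It uses the standard library's sqlite3 module with an in-memory database purely as a stand-in (it isn't necessarily what swishbook runs on), and the scraped_pages table, ok_pages view, and run_statements helper are all made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real backend.
# The table, view, and helper names below are hypothetical.
def run_statements(conn, statements):
    """Execute each (sql, params) pair in order inside one transaction."""
    with conn:  # commits if everything succeeds, rolls back on an exception
        for sql, params in statements:
            conn.execute(sql, params)

conn = sqlite3.connect(":memory:")

run_statements(conn, [
    # table creation
    ("CREATE TABLE scraped_pages (url TEXT PRIMARY KEY, status INTEGER)", ()),
    # inserts, with parameters bound rather than pasted into the string
    ("INSERT INTO scraped_pages VALUES (?, ?)", ("https://example.com/a", 200)),
    ("INSERT INTO scraped_pages VALUES (?, ?)", ("https://example.com/b", 404)),
    ("INSERT INTO scraped_pages VALUES (?, ?)", ("https://example.com/c", 200)),
    # a view over the table
    ("CREATE VIEW ok_pages AS SELECT url FROM scraped_pages WHERE status = 200", ()),
])

# plain old select against the view
rows = conn.execute("SELECT url FROM ok_pages ORDER BY url").fetchall()
print(rows)
```

The same execute call handles the DDL, the inserts, and the select, which is the convenience I'm after.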
There’s still just something about R
All that being said, using python for this project means I won’t have access to R’s strengths and benefits.
- Nothing beats ggplot2: I’ve never had to make a plot using matplotlib (and God willing, I never will). The best python package for static visualization, plotnine, is just a port of ggplot2.
- The flow of dplyr has yet to be replicated: dplyr is just an absolute godsend in terms of making data transformation work human-readable. The de facto python equivalent, pandas, is nigh-unreadable for folks coming from R. Even the best alternative, polars, which is close to dplyr in terms of readability, has a bit more friction.
# within a single dplyr::mutate call, you can use a column you just created
data %>%
	mutate(b = log(a),
		   c = 2 * b)
# with polars, you need to use a separate .with_columns call
# to use a variable created by .with_columns
from polars import col

(
	data
	.with_columns(col('a').log().alias('b'))
	.with_columns((2 * col('b')).alias('c'))
)
- You can do so much quirky shit with nested lists: Nested lists are magic. Want to perform a computationally costly separate() on a grouping column? Nest by your grouping column, and now separate() runs way faster. Want to store separate model objects per row in a dataframe? Add a nested list. Want to put a dataframe inside a dataframe inside a dataframe? Nested list.
Closing thoughts
As much as this post is a comparison of features between languages, it’s not meant to drum up the R vs. python flame war. Nor am I trying to plant my flag in the ground and declare one language to be wholly superior to the other. I just use whichever tool makes the task at hand easiest for me. 90% of the time, this results in me using R. For swishbook, however, it turns out that python is the better tool for the job.
