One-time data transformation scripts are underrated
An example of transforming a spreadsheet with Python by writing a script that you will never look at again
In a past life, I was an accountant (my career path to where I am today is another article). Accounting is basically stuck in the 1980s in terms of technology: the spreadsheet happened, and the profession has been coasting ever since. I worked at a pretty tech-forward accounting firm, and most of what we did would not be unfamiliar to accountants forty years ago, just with more technology in the way.
There is a widespread belief in accounting that coding is the way forward. Many of my former colleagues moved into careers where they code regularly, and I think everyone understands that coding is the way of the future. Even so, I think coding is still underrated in accounting.
Coding is a tool, just like any other tool. There's a tendency to want to replace entire work processes with code: something that might be used on a daily, weekly, or monthly basis. Accounting automation projects talk about automatically generating bank reconciliations and things like that, which is great, but I think coding is an underrated tool for quick, ad hoc, one-time scripts that accomplish a task faster than you otherwise could.
Spreadsheet Transformation
Let’s say we have a spreadsheet that looks something like this:
This is a pretty standard spreadsheet, nothing really unusual going on. Let’s say we want it to look something like this though instead:
This looks more like a proper 'database', even if it's just an Excel file. After transforming the table, it might be easier to add new rows for different things, and it might be easier to work with in an automation project. This might take you a few minutes to do in Excel; it might be easiest to create a new file and copy everything over manually. But what about when your file looks like this:
The difference here is that you have x rows instead of four. The complexity has exploded, and suddenly it's not so easy to manually plug away at in Excel. This sort of thing comes up all the time in my experience, and it's where coding is a very useful skill to have.
The following is an ad hoc script I wrote in Python to transform the first spreadsheet into the second:
import csv

input = "/home/nick/Desktop/Salary History 2020-2022.csv"
output = "/home/nick/Desktop/output.csv"

# input
data = []
with open(input) as f:
    reader = csv.reader(f)
    for line in reader:
        data.append(line)

# processing
data.pop(0)  # We don't want to iterate over the headers
results = []
results.append(["Person Name", "Salary", "Year"])
for line in data:
    results.append([line[0], line[1], "2020"])
    results.append([line[0], line[2], "2021"])
    results.append([line[0], line[3], "2022"])

# output
with open(output, 'w') as f:
    writer = csv.writer(f)
    for line in results:
        writer.writerow(line)
There are three sections to the code, marked by the comments (the lines starting with #): input, processing, and output. The idea is that you get the information (input), you do something to it (process), and you spit it back out into Excel (output).
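To make the three-step idea concrete, here is the processing step run on a couple of in-memory rows (the names and salary figures below are invented for illustration, not from the original spreadsheet):

```python
# Made-up sample rows in the same wide layout as the original
# spreadsheet: name, then one salary column per year.
data = [
    ["Alice", "50000", "52000", "55000"],
    ["Bob", "48000", "48000", "51000"],
]

# The processing step from the script: one wide row becomes
# three narrow (name, salary, year) rows.
results = [["Person Name", "Salary", "Year"]]
for line in data:
    results.append([line[0], line[1], "2020"])
    results.append([line[0], line[2], "2021"])
    results.append([line[0], line[3], "2022"])
```

Each input row produces three output rows, so two people become six data rows plus the header.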
I don’t think this is the best code ever written. In fact, I’m positive you could find a better way to do it. I didn’t even try to improve the code before writing this post because it doesn’t matter. The point is to take your coding tool out, do something with it, and move on with your life.
For example, I couldn't figure out how to properly use csv.reader, so I iterated it into an array first, touching the data twice. That's not ideal, but it still ran in about a second because my data wasn't really that large. What does matter is that it works and that I wrote it in about ten minutes, which is quicker than I could have done the job manually bashing away at it in Excel. I might never look at the script again, but I saved some time and did it in a way that rules out manual copy-paste mistakes.
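For what it's worth, csv.reader can be looped over directly while the file is still open, which avoids the extra array. A minimal sketch of that single-pass version, using an in-memory string in place of the real file so it stands on its own:

```python
import csv
import io

# A small stand-in for the input file, so the snippet runs on its own
# (the row contents are made up).
sample = "Person Name,2020,2021,2022\nAlice,50000,52000,55000\n"

results = [["Person Name", "Salary", "Year"]]
with io.StringIO(sample) as f:   # with open(input) as f: in the real script
    reader = csv.reader(f)
    next(reader)                 # skip the header row instead of pop(0)
    for line in reader:          # iterate the reader directly: one pass
        results.append([line[0], line[1], "2020"])
        results.append([line[0], line[2], "2021"])
        results.append([line[0], line[3], "2022"])
```

Not that it matters much for small data, but it is one less copy of everything in memory.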
Excel vs CSV
You might have noticed that I said I had a spreadsheet, but the code works with a CSV. Excel files are not very simple to work with in Python, while CSVs are very easy[1]. Instead of figuring out how to use Python's libraries to work with Excel files, I used Excel to save them as CSVs. This is using Python for what it's good for (manipulating data) and Excel for what it's good for (converting Excel files to other formats).
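As a quick illustration of why CSVs are so easy to deal with, the output of Python's csv.writer is just lines of comma-separated text (the row contents here are made up):

```python
import csv
import io

# Write two rows with the csv module into an in-memory buffer
# (a stand-in for a real file on disk).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Person Name", "Salary", "Year"])
writer.writerow(["Alice", "50000", "2020"])

# The result is ordinary text, one line per row.
text = buffer.getvalue()
```

There is no binary format to decode; any program that can read text can read a CSV.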
Size and complexity of data matters
The benefit of this approach is that while the manual effort skyrockets the more rows the spreadsheet has, code doesn't care how many rows there are. Whether there are 100 or 100,000, computers are fast enough nowadays that it's going to take a few seconds at most to run. And because you're only planning to run this once as an ad hoc script, even if it took ten minutes to run you'd still come out ahead.
This is because the time it takes to write the code is unrelated to the number of rows in the data (though not necessarily to the complexity of the data and the transformations needed). Working manually, on the other hand, scales at least linearly with the size of the data: more rows, more time.
The following graph shows this visually, plotting data size against time taken for writing code (blue line) versus working manually (green line).
This is the essential heuristic: if your data is small, it probably doesn't make sense time-wise to code; if your data is large, it probably doesn't make sense to do it manually.
Conclusions
Coding is a really powerful tool for ad hoc data processing, in ways that are not intuitive to many people. If you're trying to transform data in Excel, consider whether a few lines of code could do it for you instead. It's worth weighing the size of the job, though: for a few rows of data, coding might not make sense, while for thousands or hundreds of thousands of rows the calculus is very different.
The key is to quickly write something that works, get your transformed data, and move on. Speed of iteration is important: it doesn't matter if the code is imperfect, as long as the output is correct.
As a last thought, even if your data is small, you might want to write a script anyway. Making yourself do something hard means you're learning something, and you might be able to apply it better in the future.
[1] Excel files have a special structure that Excel understands. CSV files are "plain text" with a very simple structure, so almost any program can read them.