A dummy’s framework for code optimisation.

A step-by-step guide on how to improve software efficiency

Oyugo Obonyo
12 min read · Jan 19, 2023

Introduction

Everyone seems to have a strong opinion on when exactly optimization becomes a critical issue during a software project’s lifetime and on how exactly the optimization process should be executed. The process can seem random at times because those involved often pick techniques that worked in the past and recycle them on the assumption that they will work again. That assumption is correct in most cases, but preferring ad hoc tweaks (they are of course not random to the person making them, who has done it all before) over a structured methodology familiar to everyone on the team can come across as either too complex or too superficial. Engineering purists would also cringe at the thought of throwing tweaks at an issue without a consistent framework to guide the process or metrics to confirm that the tweaks are indeed warranted. This article explores a framework that I personally use to guide my optimization routine as a backend engineer.

The framework

Philosophy

The development of the framework is guided by two mantras:

1. “Make it work, make it right, then make it fast” — The framework insists that code optimization should not be a major concern in the initial stages of a project or task, but should be tackled as soon as the code works as intended and adheres to language-specific and design best practices. Personally, I double down on optimization in the last two days before a task’s deadline.

2. “What can’t be measured can’t be managed” — The framework also insists on collecting metrics and on preferring that data over intuition when deciding what to optimise. Code is considered optimised not because we assume performance gains after intuitive changes, but when the final metrics show an improvement over the initial ones.

Stages

Let’s now explore the framework’s stages. Although the code snippets are written in Python, the methods can be applied in any other language, with different tooling and libraries of course. Also, the article is heavily opinionated and will tend to prefer certain libraries and methods:

1. The pre-optimization stage (measuring):

This stage involves collecting metrics before making any changes to the code we intend to optimise. Its essence is not only to help identify the parts that may need optimization but also to give us a point of reference against which we can compare the metrics collected after optimisation. The metrics I consider crucial, and therefore usually collect, are time (how long the program takes to execute) and space (its memory usage).

i) Time

The total time taken by a function or compute process to run is often considered a major indicator of algorithmic performance. Although a function’s Big O notation is often treated as a sufficient data point to warrant performance tuning, it can fail to tell the whole story. It is surely useful in technical interviews, but it is less insightful in industry-grade code where milliseconds matter and one implementation with a time complexity of O(n) can be far more efficient than a different implementation with the same time complexity.
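
That claim is easy to check for yourself with the standard library’s timeit module. Below is a minimal sketch; match_with_loop and match_with_comprehension are hypothetical helpers (not part of the script we are about to write), both O(n), yet they typically produce measurably different timings:

import timeit

collection = ["Python is a dynamically typed language",
              "C is a strongly typed language",
              "JS is just like Python when it comes to types"] * 1000


def match_with_loop(collection, word):
    # Explicit loop with repeated .append() calls.
    matches = []
    for sentence in collection:
        if word in sentence:
            matches.append(sentence)
    return matches


def match_with_comprehension(collection, word):
    # The same O(n) scan expressed as a list comprehension.
    return [sentence for sentence in collection if word in sentence]


# Each candidate is run 1000 times; timeit reports the total time in seconds.
print(timeit.timeit(lambda: match_with_loop(collection, "Python"), number=1000))
print(timeit.timeit(lambda: match_with_comprehension(collection, "Python"), number=1000))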

Assume we intend to write a script containing a function that matches a word against a collection of sentences and returns every sentence containing that word; think of the script as a very lightweight search engine. First, let us write a test for the function in a separate file, test.py. A test will help us swiftly detect any breaking changes we introduce to our codebase and guarantees stability through numerous changes. It is therefore highly recommended to write tests for any function you intend to optimize before commencing the optimization process. Here is our test:

from main import match_word


def test_match_word():
    collection = ["Python is a dynamically typed language",
                  "C is a strongly typed language",
                  "JS is just like Python when it comes to types"]
    result = match_word(collection, "Python")
    expected_result = ['Python is a dynamically typed language',
                       'JS is just like Python when it comes to types']
    assert result == expected_result

Upon installing pytest and running the test by invoking python3 -m pytest test.py on the terminal, the test obviously fails on the first run. We can now proceed to writing the actual code in a separate script, main.py:

def match_word(collection, word):
    matches = [sentence for sentence in collection if word in sentence]
    return matches


def main():
    collection = ["Python is a dynamically typed language",
                  "C is a strongly typed language",
                  "JS is just like Python when it comes to types",
                  "Java is a strongly typed language",
                  ]
    matches = match_word(collection, "Python")
    print(matches)


if __name__ == "__main__":
    main()

Upon rerunning the test, it should now pass.

Before developing an execution profile for our function, it is important to benchmark our program, that is, to simulate its runtime a number of times and gather actionable data from the simulations. This ought to be done before as well as after code optimization so that we can compare and quantify our performance gains or losses. Python lets you do this in different ways, but my preferred way is a pytest plugin, pytest-benchmark. It is a third-party plugin and has to be installed by running this command:

pip install pytest-benchmark

After installing the plugin, we can execute our test as a benchmark by passing the benchmark fixture as an argument to the test we wish to benchmark, like so:

from main import match_word


def test_match_word(benchmark):
    collection = ["Python is a dynamically typed language",
                  "C is a strongly typed language",
                  "JS is just like Python when it comes to types"]
    result = benchmark(match_word, collection, "Python")
    expected_result = ['Python is a dynamically typed language',
                       'JS is just like Python when it comes to types']
    assert result == expected_result

The benchmark runs when we rerun our test on the terminal, and the following results are shown on the standard output:


----------------------------------------------------- benchmark: 1 tests -----------------------------------------------------
Name (time in ns)        Min            Max       Mean     StdDev     Median     IQR    Outliers  OPS (Mops/s)  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------
test_match_word     530.9994    30,465.0021   567.4974   135.6055   559.0118  8.0181   729;28323        1.7621  172533           1
-------------------------------------------------------------------------------------------------------------------------------

Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean

The function within our benchmark is executed many times and a statistical summary is displayed on the standard output. Currently, match_word() takes a minimum of roughly 531 nanoseconds (‘Min’ column) and a maximum of roughly 30,465 nanoseconds (‘Max’ column) to run. The benchmark was run 172533 times, as indicated in the ‘Rounds’ column.
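
As a side note, pytest-benchmark can also persist runs and diff them for us, which will come in handy at the comparison stage; roughly (check the plugin’s documentation for the exact flags your version supports):

pytest --benchmark-autosave
pytest --benchmark-compare=0001

The first command saves the run under a .benchmarks directory; the second compares the current run against saved run 0001.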

To develop an execution profile for our function, we will use the popular general-purpose profiler, cProfile. To run the cProfile module on our script and sort the output by the tottime column, run the following command on the terminal:

python3 -m cProfile -s tottime main.py

Our output will be as follows:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.000    0.000    0.000    0.000  {built-in method builtins.print}
     1    0.000    0.000    0.000    0.000  main.py:1(match_word)
     1    0.000    0.000    0.000    0.000  main.py:9(main)
     1    0.000    0.000    0.000    0.000  main.py:1(<module>)
     1    0.000    0.000    0.000    0.000  {built-in method builtins.exec}
     2    0.000    0.000    0.000    0.000  {method 'append' of 'list' objects}
     1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

The column names are neither random nor cryptic but mean the following, based on official documentation:

ncalls — the number of calls.

tottime — the total time spent in the given function (and excluding time made in calls to sub-functions)

percall — is the quotient of tottime divided by ncalls

cumtime — is the cumulative time spent in this and all subfunctions (from invocation till exit). This figure is accurate even for recursive functions.

percall — is the quotient of cumtime divided by primitive calls

filename:lineno(function) — provides the respective data of each function

The time unit is seconds, rounded to three decimal places. A tottime of 0.000 therefore does not mean our script literally ran in no time; it simply means that rounding tottime to three decimal places gives 0.000. The real value could be 0.00001, 0.000002 or any number that rounds to 0.000; we do not know the exact figure. What we do know is that within our script, the built-in print function took the most time, followed by the match_word function, and so on.
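
If you would rather keep the profile around than read it off the terminal, cProfile can also dump its data to a file, which the standard library’s pstats module can then sort and slice. A minimal sketch, where profile.out is just an arbitrary filename:

python3 -m cProfile -o profile.out main.py

and then, in a separate script or an interactive session:

import pstats

# Load the saved profile, sort it by total time and print the five most expensive entries.
stats = pstats.Stats("profile.out")
stats.sort_stats("tottime").print_stats(5)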

Now that we know which functions to look out for in our script, our next goal is to go even deeper. We will do this by analyzing the expensive functions line by line, to learn exactly which statements to look out for. This can be done using a third-party module, line_profiler, which is installed as follows:

pip install line_profiler

Upon installation, we need to decorate whatever function we intend to profile with @profile. We obviously will not decorate the print function, despite it hogging the most time in our script, because we will delete it in the future! We should not be printing a function’s return value to standard output in production anyway; printing to standard output is expensive, as illustrated in the profiling statistics above, and doing so is often considered a rookie mistake. Let us assume that the print statement was a temporary debugging measure.
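
If you still want that kind of temporary visibility, the standard library’s logging module is the usual substitute; a minimal sketch (the logger configuration below is just an example):

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

matches = ["Python is a dynamically typed language"]

# Unlike print(), this line can be silenced in production simply by raising
# the log level, so it never has to be hunted down and deleted.
logger.debug("matches: %s", matches)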

Making match_word() faster and deleting the print statement will make the main function faster in turn, since all main() does is call match_word() and print its output. We will therefore ignore main() for now and instead target match_word(), decorating it with @profile to obtain profiling output:

@profile
def match_word(collection, word):
    matches = [sentence for sentence in collection if word in sentence]
    return matches


def main():
    collection = ["Python is a dynamically typed language",
                  "C is a strongly typed language",
                  "JS is just like Python when it comes to types",
                  "Java is a strongly typed language",
                  ]
    matches = match_word(collection, "Python")
    print(matches)


if __name__ == "__main__":
    main()

We do not have to import the profile decorator; it is injected automatically by the profiling script when we run it. To profile our script line by line and immediately print the results to standard output, run the following command on the terminal:

kernprof -l -v main.py

The results will be as follows:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def match_word(collection, word):
     3         1          3.7      3.7     93.7      matches = [sentence for sentence in collection if word in sentence]
     4         1          0.2      0.2      6.3      return matches

The column names are interpreted as follows:

Line # — The number of the line that was run

Hits — The number of times that line was run

Time — The total time spent executing the line, in the timer’s units (microseconds by default)

Per Hit — Time/hits

% Time — The percentage of the function’s total time spent executing that line

Line Contents — The content of the line

From the results, we can deduce that most of the time in our function is spent in the list comprehension that scans the collection.
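
If you prefer not to scatter @profile decorators through the codebase (and then remember to remove them), line_profiler also exposes a LineProfiler class that can be driven from a separate script; a rough sketch, assuming main.py defines match_word without the decorator:

from line_profiler import LineProfiler

from main import match_word

collection = ["Python is a dynamically typed language",
              "C is a strongly typed language",
              "JS is just like Python when it comes to types"]

# Wrap the function we care about, call it as usual, then print the per-line stats.
lp = LineProfiler()
profiled_match_word = lp(match_word)
profiled_match_word(collection, "Python")
lp.print_stats()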

ii) Memory usage (space)

The memory consumed by our program is often overlooked because memory optimizations are hardly ever a priority. However, they can become one in some cases, such as when processing, storing or reading a dataset whose size is not constant or bounded and can grow rapidly, or when working with embedded devices. When optimising programs whose memory usage does not grow at runtime, this segment of metric collection can be skipped as it is not really necessary. In our case, the size of the collection variable is known before we enter match_word() and it is never expanded, so memory usage is not an issue for us. Nonetheless, a demonstration of how to gather memory usage data is warranted.

To know how much memory our program uses, we can use Python’s memory_profiler module, which is installed as follows:

pip install memory-profiler

Just like line_profiler, the memory_profiler module also works by decorating the function we intend to profile with @profile. We can then run this command to have the memory usage statistics printed on screen:

python3 -m memory_profiler main.py

The output will be as follows:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     1   20.223 MiB   20.223 MiB           1   @profile
     2                                         def match_word(collection, word):
     3   20.223 MiB    0.000 MiB           7       matches = [sentence for sentence in collection if word in sentence]
     4   20.223 MiB    0.000 MiB           1       return matches

The memory profiler tracks memory increments across the program and measures them against the memory allocated at the start of the program. The data collected confirms our earlier conclusion that our program maintains constant memory usage throughout.
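
memory_profiler is not the only option either; when all you need are the current and peak allocation figures for a block of code, the standard library’s tracemalloc module is enough. A minimal sketch:

import tracemalloc

from main import match_word

collection = ["Python is a dynamically typed language",
              "C is a strongly typed language",
              "JS is just like Python when it comes to types"]

# Record allocations made while the function runs and report current/peak usage in bytes.
tracemalloc.start()
match_word(collection, "Python")
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current} bytes, peak: {peak} bytes")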

2. The optimization stage (implementing):

This is a data-driven stage aimed at addressing the bottlenecks that were measured and identified earlier. There is no single method to apply here; the optimization approach varies widely from problem to problem and person to person, as it relies heavily on a number of factors including the scale of the bottleneck, the backwards compatibility of the proposed changes, time-space trade-offs and the depth of a person’s language-specific knowledge. The upside of being methodical and strictly relying on metrics over intuition is that you can continuously refine your program and rank the proposed optimizations based on the collected metrics.

For match_word(), the obvious concern is how efficiently the function can query a vast collection multiple times. Its inefficiency is hard to notice in one-off queries on small datasets but becomes very obvious when querying a large dataset (the collection variable in our case) multiple times. An intuitive approach might be a caching system, but cache misses would be far more common than cache hits, as each query has a high likelihood of being different from the previous one. Imagine a collection with 200 different words. At some point the misses might be so many that we would be forced to cache every word along with either its positions in the collection or the sentences containing it; oh my, space!
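
For the record, a per-query cache along those lines only takes a few lines with functools.lru_cache (match_word_cached is a hypothetical name, and the collection has to be a tuple because lru_cache needs hashable arguments), and it makes the miss problem easy to see:

from functools import lru_cache


@lru_cache(maxsize=None)
def match_word_cached(collection, word):
    # Every previously unseen word still triggers a full linear scan.
    return [sentence for sentence in collection if word in sentence]


collection = ("Python is a dynamically typed language",
              "C is a strongly typed language",
              "JS is just like Python when it comes to types")

match_word_cached(collection, "Python")    # miss: full scan
match_word_cached(collection, "Python")    # hit: no scan
match_word_cached(collection, "JS")        # miss again: another full scan
print(match_word_cached.cache_info())      # CacheInfo(hits=1, misses=2, ...)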

A saner approach would be, you guessed right, a hash map! Building a hash map, which in this case can also be called an index, that associates each word in the collection with the list of sentences in which it appears, before querying the collection, is more efficient, especially when an unchanged collection is queried multiple times. For example, to query “Python” from our current collection with the current algorithm, we would have to perform a linear search. To then query “JS”, we would have to perform a linear search again. In short, a linear search over the collection would be repeated on every query. This makes the current algorithm so expensive we might as well rename it to louis_vuitton()!

Having an index in place allows us to obtain results without continuously performing a linear search. Just like in databases, building an index is a significant upfront overhead, since it costs both time and memory to build. However, the performance benefit of querying it outweighs the overhead incurred in building it. Let us optimise our code:

def build_index(collection):
    index = {}
    for i, sentence in enumerate(collection):
        for word in sentence.split():
            if word not in index:
                index[word] = [i]
            else:
                index[word].append(i)
    return index


def match_word(collection, word):
    index = build_index(collection)
    match_positions = index[word]
    matches = [collection[i] for i in match_positions]
    return matches


def main():
    collection = ["Python is a dynamically typed language",
                  "C is a strongly typed language",
                  "JS is just like Python when it comes to types",
                  "Java is a strongly typed language",
                  ]
    matches = match_word(collection, "Python")
    print(matches)


if __name__ == "__main__":
    main()

The code works, the tests pass but is the program faster?

3. The post-optimization stage (comparing):

After the optimization, we again develop a benchmark and compare it with our initial benchmark to assess our performance gains or losses. Let’s run pytest again:


------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in us)       Min           Max     Mean   StdDev   Median     IQR    Outliers  OPS (Kops/s)  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------
test_match_word      4.1440    1,086.6380   4.6463   3.4269   4.3510  0.1500  4575;15050      215.2244  172533           1
-----------------------------------------------------------------------------------------------------------------------

Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean

Before celebrating, note the units: the first benchmark was reported in nanoseconds and this one is in microseconds, so the mean per-call time has actually gone from roughly 567 ns to roughly 4.6 µs. The micro-benchmark is telling us something real: match_word() now rebuilds the index on every call, and for a three-sentence collection queried once, the build cost dwarfs the single linear scan it replaces. The gains we argued for earlier only materialise when the index is built once and then queried many times over a large collection, which is exactly the scenario this optimization targets (a sketch of such reuse follows the memory profile below). This is precisely why the framework insists on measuring rather than assuming. Meanwhile, we have maintained constant memory usage, and again, we can prove it:

Filename: main.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     1   19.961 MiB   19.961 MiB           1   @profile
     2                                         def build_index(collection):
     3   19.961 MiB    0.000 MiB           1       index = {}
     4   19.961 MiB    0.000 MiB           5       for i, sentence in enumerate(collection):
     5   19.961 MiB    0.000 MiB          32           for word in sentence.split():
     6   19.961 MiB    0.000 MiB          28               if word not in index:
     7   19.961 MiB    0.000 MiB          17                   index[word] = [i]
     8                                                     else:
     9   19.961 MiB    0.000 MiB          11                   index[word].append(i)
    10   19.961 MiB    0.000 MiB           1       return index


Filename: main.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    12   19.961 MiB   19.961 MiB           1   @profile
    13                                         def match_word(collection, word):
    14   19.961 MiB   19.961 MiB           1       index = build_index(collection)
    15   19.961 MiB    0.000 MiB           1       match_positions = index[word]
    16   19.961 MiB    0.000 MiB           5       matches = [collection[i] for i in match_positions]
    17   19.961 MiB    0.000 MiB           1       return matches
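
As promised above, here is a rough sketch of what reusing the index across queries could look like; the SearchIndex class is a hypothetical illustration rather than part of the article’s script, but it pays the build cost once and answers every subsequent query with a dictionary lookup:

class SearchIndex:
    def __init__(self, collection):
        # Build the word-to-positions index exactly once.
        self.collection = collection
        self.index = {}
        for i, sentence in enumerate(collection):
            for word in sentence.split():
                self.index.setdefault(word, []).append(i)

    def match_word(self, word):
        # A dictionary lookup plus a small list build; no linear scan per query.
        return [self.collection[i] for i in self.index.get(word, [])]


collection = ["Python is a dynamically typed language",
              "C is a strongly typed language",
              "JS is just like Python when it comes to types",
              "Java is a strongly typed language"]

searcher = SearchIndex(collection)       # the build overhead is paid here, once
print(searcher.match_word("Python"))     # every query after construction is cheap
print(searcher.match_word("JS"))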

Conclusion

When you think about it, the framework’s core is quite simple: measure your performance, improve where you must and then where you can, and finally compare your current performance against your initial performance. The underlying procedures within each stage are admittedly more involved, but with time and practice they become second nature, and code optimization turns from a tedious process into a fun activity that finally becomes an obsession you can’t shake off. Off to add an “Optimized a search algorithm, and I have the measurements to prove it” entry on my resume.
