# Performance

Follow these tips to get the best out of your programs, both in speed and memory terms.

## Premature optimization

Donald Knuth once said:

> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

However, if you are writing a program and you realize that writing a semantically equivalent, faster version involves just minor changes, you shouldn’t miss that opportunity.

And always be sure to profile your program to learn what its bottlenecks are. For profiling, on macOS you can use Instruments' Time Profiler, which comes with Xcode, or one of the sampling profilers. On Linux, any program that can profile C/C++ programs, like `perf` or Callgrind, should work. On both Linux and macOS, you can detect most hotspots by running your program within a debugger, hitting Ctrl+C occasionally to interrupt it, and issuing the gdb `backtrace` command to look for patterns in the backtraces (or use the gdb "poor man's profiler", which does the same thing for you, or the macOS `sample` command).

Make sure to always profile programs by compiling or running them with the `--release` flag, which turns on optimizations.

## Avoiding memory allocations

One of the best optimizations you can do in a program is avoiding extra/useless memory allocation. A memory allocation happens when you create an instance of a class, which ends up allocating heap memory. Creating an instance of a struct uses stack memory and doesn’t incur a performance penalty. If you don’t know the difference between stack and heap memory, be sure to read this.

Allocating heap memory is slow, and it puts more pressure on the Garbage Collector (GC) as it will later have to free that memory.

There are several ways to avoid heap memory allocations. The standard library is designed in a way to help you do that.

### Don’t create intermediate strings when writing to an IO

To print a number to the standard output you write:

```crystal
puts 123
```

In many programming languages what will happen is that `to_s`, or a similar method for converting the object to its string representation, will be invoked, and then that string will be written to the standard output. This works, but it has a flaw: it creates an intermediate string, in heap memory, only to write it and then discard it. This involves a heap memory allocation and gives a bit of work to the GC.

In Crystal, `puts` will invoke `to_s(io)` on the object, passing it the IO to which the string representation should be written.

So, you should never do this:

```crystal
puts 123.to_s
```

as it will create an intermediate string. Always append an object directly to an IO.

When writing custom types, always be sure to override `to_s(io)`, not `to_s`, and avoid creating intermediate strings in that method. For example:

```crystal
class MyClass
  # Good
  def to_s(io)
    # appends "1, 2" to IO without creating intermediate strings
    x = 1
    y = 2
    io << x << ", " << y
  end

  # Bad
  def to_s(io)
    x = 1
    y = 2
    # using a string interpolation creates an intermediate string.
    # this should be avoided
    io << "#{x}, #{y}"
  end
end
```

This philosophy of appending to an IO instead of returning an intermediate string results in better performance than handling intermediate strings. You should use this strategy in your API definitions too.
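As a sketch of applying this strategy to your own APIs (the `Point` type and its fields here are hypothetical, not from the original text), prefer writing to an IO over returning a String:

```crystal
class Point
  def initialize(@x : Int32, @y : Int32)
  end

  # Good: writes directly to the given IO, no intermediate String
  def to_s(io : IO) : Nil
    io << '(' << @x << ", " << @y << ')'
  end
end

point = Point.new(1, 2)
puts point # writes "(1, 2)" straight to STDOUT, no String allocated
```

When a `String` is actually needed, the default `Object#to_s` builds one on top of `to_s(io)`, so callers lose nothing by your type only implementing the IO variant.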

Let’s compare the times:

!!! example "io_benchmark.cr"

    ```crystal
    require "benchmark"

    io = IO::Memory.new

    Benchmark.ips do |x|
      x.report("without to_s") do
        io << 123
        io.clear
      end

      x.report("with to_s") do
        io << 123.to_s
        io.clear
      end
    end
    ```

Output:

```console
$ crystal run --release io_benchmark.cr
without to_s  77.11M ( 12.97ns) (± 1.05%)  fastest
   with to_s  18.15M ( 55.09ns) (± 7.99%)  4.25× slower
```

Always remember that it’s not just the time that has improved: memory usage is also decreased.

### Use string interpolation instead of concatenation

Sometimes you need to work directly with strings built from combining string literals with other values. You shouldn’t just concatenate these strings with `String#+(String)` but rather use string interpolation, which allows you to embed expressions into a string literal: `"Hello, #{name}"` is better than `"Hello, " + name.to_s`.

Interpolated strings are transformed by the compiler into appends to a string IO, so they automatically avoid intermediate strings. The example above translates to:

```crystal
String.build do |io|
  io << "Hello, " << name
end
```

### Avoid IO allocation for string building

Prefer to use the dedicated `String.build` optimized for building strings, instead of creating an intermediate `IO::Memory` allocation.

```crystal
require "benchmark"

Benchmark.ips do |bm|
  bm.report("String.build") do
    String.build do |io|
      99.times do
        io << "hello world"
      end
    end
  end

  bm.report("IO::Memory") do
    io = IO::Memory.new
    99.times do
      io << "hello world"
    end
    io.to_s
  end
end
```

Output:

```console
$ crystal run --release str_benchmark.cr
String.build 597.57k (  1.67µs) (± 5.52%)  fastest
  IO::Memory 423.82k (  2.36µs) (± 3.76%)  1.41× slower
```

### Avoid creating temporary objects over and over

Consider this program:

```crystal
lines_with_language_reference = 0
while line = gets
  if ["crystal", "ruby", "java"].any? { |string| line.includes?(string) }
    lines_with_language_reference += 1
  end
end
puts "Lines that mention crystal, ruby or java: #{lines_with_language_reference}"
```

The above program works but has a big performance problem: on every iteration a new array is created for ["crystal", "ruby", "java"]. Remember: an array literal is just syntax sugar for creating an instance of an array and adding some values to it, and this will happen over and over on each iteration.

There are two ways to solve this:

1. Use a tuple. If you use `{"crystal", "ruby", "java"}` in the above program it will work the same way, but since a tuple doesn’t involve heap memory it will be faster, consume less memory, and give more chances for the compiler to optimize the program.

    ```crystal
    lines_with_language_reference = 0
    while line = gets
      if {"crystal", "ruby", "java"}.any? { |string| line.includes?(string) }
        lines_with_language_reference += 1
      end
    end
    puts "Lines that mention crystal, ruby or java: #{lines_with_language_reference}"
    ```

2. Move the array to a constant.

    ```crystal
    LANGS = ["crystal", "ruby", "java"]

    lines_with_language_reference = 0
    while line = gets
      if LANGS.any? { |string| line.includes?(string) }
        lines_with_language_reference += 1
      end
    end
    puts "Lines that mention crystal, ruby or java: #{lines_with_language_reference}"
    ```

Using tuples is the preferred way.

Explicit array literals in loops are one way to create temporary objects, but they can also be created via method calls. For example `Hash#keys` will return a new array with the keys each time it’s invoked. Instead of doing that, you can use `Hash#each_key`, `Hash#has_key?` and other methods.
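As a minimal sketch of the difference (the hash contents are illustrative):

```crystal
counts = {"crystal" => 1, "ruby" => 2}

# Bad: Hash#keys allocates a new Array on every invocation
if counts.keys.includes?("crystal")
  # ...
end

# Good: Hash#has_key? checks membership without allocating
if counts.has_key?("crystal")
  # ...
end

# Good: Hash#each_key iterates the keys without building an Array
counts.each_key do |key|
  # ...
end
```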

### Use structs when possible

If you declare your type as a struct instead of a class, creating an instance of it will use stack memory, which is much cheaper than heap memory and doesn’t put pressure on the GC.

You shouldn’t always use a struct, though. Structs are passed by value, so if you pass one to a method and the method makes changes to it, the caller won’t see those changes, so they can be bug-prone. The best thing to do is to only use structs with immutable objects, especially if they are small.
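The pass-by-value pitfall can be seen in a small sketch (the `Counter` type is hypothetical): the method mutates its own copy, and the caller’s value is unchanged.

```crystal
struct Counter
  property value = 0
end

def bump(counter : Counter)
  # mutates the method's local copy, not the caller's struct
  counter.value += 1
end

counter = Counter.new
bump(counter)
counter.value # => 0, the caller's copy was not modified
```

Had `Counter` been a class, `bump` would have received a reference and the caller would see `1`, which is why mutable types are usually safer as classes.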

For example:

!!! example "class_vs_struct.cr"

    ```crystal
    require "benchmark"

    class PointClass
      getter x
      getter y

      def initialize(@x : Int32, @y : Int32)
      end
    end

    struct PointStruct
      getter x
      getter y

      def initialize(@x : Int32, @y : Int32)
      end
    end

    Benchmark.ips do |x|
      x.report("class") { PointClass.new(1, 2) }
      x.report("struct") { PointStruct.new(1, 2) }
    end
    ```

Output:

```console
$ crystal run --release class_vs_struct.cr
 class  28.17M (± 2.86%)  15.29× slower
struct 430.82M (± 6.58%)  fastest
```

## Iterating strings

Strings in Crystal always contain UTF-8 encoded bytes. UTF-8 is a variable-length encoding: a character may be represented by several bytes, although characters in the ASCII range are always represented by a single byte. Because of this, indexing a string with String#[] is not an O(1) operation, as the bytes need to be decoded each time to find the character at the given position. There’s an optimization that Crystal’s String does here: if it knows all the characters in the string are ASCII, then String#[] can be implemented in O(1). However, this isn’t generally true.

For this reason, iterating a String in this way is not optimal, and in fact has a complexity of O(n^2):

```crystal
string = "foo"

i = 0
while i < string.size
  char = string[i]
  # ...
  i += 1
end
```

There’s a second problem with the above: computing the size of a String is also slow, because it’s not simply the number of bytes in the string (the bytesize). However, once a String’s size has been computed, it is cached.

The way to improve performance in this case is to either use one of the iteration methods (`each_char`, `each_byte`, `each_codepoint`), or use the more low-level `Char::Reader` struct. For example, using `each_char`:

```crystal
string = "foo"
string.each_char do |char|
  # ...
end
```