Performance
Follow these tips to get the best out of your programs, both in speed and memory terms.
Premature optimization
Donald Knuth once said:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
However, if you are writing a program and you realize that writing a semantically equivalent, faster version involves just minor changes, you shouldn’t miss that opportunity.
And always be sure to profile your program to learn what its bottlenecks are. For profiling, on macOS you can use Instruments Time Profiler, which comes with XCode, or one of the sampling profilers. On Linux, any program that can profile C/C++ programs, like perf or Callgrind, should work. For both Linux and OS X, you can detect most hotspots by running your program within a debugger then hitting “ctrl+c” to interrupt it occasionally and issuing a gdb backtrace
command to look for patterns in backtraces (or use the gdb poor man’s profiler which does the same thing for you, or OS X sample
command.
Make sure to always profile programs by compiling or running them with the --release
flag, which turns on optimizations.
Avoiding memory allocations
One of the best optimizations you can do in a program is avoiding extra/useless memory allocation. A memory allocation happens when you create an instance of a class, which ends up allocating heap memory. Creating an instance of a struct uses stack memory and doesn’t incur a performance penalty. If you don’t know the difference between stack and heap memory, be sure to read this.
Allocating heap memory is slow, and it puts more pressure on the Garbage Collector (GC) as it will later have to free that memory.
There are several ways to avoid heap memory allocations. The standard library is designed in a way to help you do that.
Don’t create intermediate strings when writing to an IO
To print a number to the standard output you write:
puts 123
In many programming languages what will happen is that to_s
, or a similar method for converting the object to its string representation, will be invoked, and then that string will be written to the standard output. This works, but it has a flaw: it creates an intermediate string, in heap memory, only to write it and then discard it. This, involves a heap memory allocation and gives a bit of work to the GC.
In Crystal, puts
will invoke to_s(io)
on the object, passing it the IO to which the string representation should be written.
So, you should never do this:
puts 123.to_s
as it will create an intermediate string. Always append an object directly to an IO.
When writing custom types, always be sure to override to_s(io)
, not to_s
, and avoid creating intermediate strings in that method. For example:
class MyClass
# Good
def to_s(io)
# appends "1, 2" to IO without creating intermediate strings
x = 1
y = 2
io << x << ", " << y
end
# Bad
def to_s(io)
x = 1
y = 2
# using a string interpolation creates an intermediate string.
# this should be avoided
io << "#{x}, #{y}"
end
end
This philosophy of appending to an IO instead of returning an intermediate string results in better performance than handling intermediate strings. You should use this strategy in your API definitions too.
Let’s compare the times:
require "benchmark"
io = IO::Memory.new
Benchmark.ips do |x|
x.report("without to_s") do
io << 123
io.clear
end
x.report("with to_s") do
io << 123.to_s
io.clear
end
end
Output:
$ crystal run --release io_benchmark.cr
without to_s 77.11M ( 12.97ns) (± 1.05%) fastest
with to_s 18.15M ( 55.09ns) (± 7.99%) 4.25× slower
Always remember that it’s not just the time that has improved: memory usage is also decreased.
Use string interpolation instead of concatenation
Sometimes you need to work directly with strings built from combining string literals with other values. You shouldn’t just concatenate these strings with String#+(String)
but rather use string interpolation which allows to embed expressions into a string literal: "Hello, #{name}"
is better than "Hello, " + name.to_s
.
Interpolated strings are transformed by the compiler to append to a string IO so that it automatically avoids intermediate strings. The example above translates to:
String.build do |io|
io << "Hello, " << name
end
Avoid IO allocation for string building
Prefer to use the dedicated String.build
optimized for building strings, instead of creating an intermediate IO::Memory
allocation.
require "benchmark"
Benchmark.ips do |bm|
bm.report("String.build") do
String.build do |io|
99.times do
io << "hello world"
end
end
end
bm.report("IO::Memory") do
io = IO::Memory.new
99.times do
io << "hello world"
end
io.to_s
end
end
Output:
$ crystal run --release str_benchmark.cr
String.build 597.57k ( 1.67µs) (± 5.52%) fastest
IO::Memory 423.82k ( 2.36µs) (± 3.76%) 1.41× slower
Avoid creating temporary objects over and over
Consider this program:
lines_with_language_reference = 0
while line = gets
if ["crystal", "ruby", "java"].any? { |string| line.includes?(string) }
lines_with_language_reference += 1
end
end
puts "Lines that mention crystal, ruby or java: #{lines_with_language_reference}"
The above program works but has a big performance problem: on every iteration a new array is created for ["crystal", "ruby", "java"]
. Remember: an array literal is just syntax sugar for creating an instance of an array and adding some values to it, and this will happen over and over on each iteration.
There are two ways to solve this:
Use a tuple. If you use
{"crystal", "ruby", "java"}
in the above program it will work the same way, but since a tuple doesn’t involve heap memory it will be faster, consume less memory, and give more chances for the compiler to optimize the program.lines_with_language_reference = 0
while line = gets
if {"crystal", "ruby", "java"}.any? { |string| line.includes?(string) }
lines_with_language_reference += 1
end
end
puts "Lines that mention crystal, ruby or java: #{lines_with_language_reference}"
Move the array to a constant.
LANGS = ["crystal", "ruby", "java"]
lines_with_language_reference = 0
while line = gets
if LANGS.any? { |string| line.includes?(string) }
lines_with_language_reference += 1
end
end
puts "Lines that mention crystal, ruby or java: #{lines_with_language_reference}"
Using tuples is the preferred way.
Explicit array literals in loops is one way to create temporary objects, but these can also be created via method calls. For example Hash#keys
will return a new array with the keys each time it’s invoked. Instead of doing that, you can use Hash#each_key
, Hash#has_key?
and other methods.
Use structs when possible
If you declare your type as a struct instead of a class, creating an instance of it will use stack memory, which is much cheaper than heap memory and doesn’t put pressure on the GC.
You shouldn’t always use a struct, though. Structs are passed by value, so if you pass one to a method and the method makes changes to it, the caller won’t see those changes, so they can be bug-prone. The best thing to do is to only use structs with immutable objects, especially if they are small.
For example:
require "benchmark"
class PointClass
getter x
getter y
def initialize(@x : Int32, @y : Int32)
end
end
struct PointStruct
getter x
getter y
def initialize(@x : Int32, @y : Int32)
end
end
Benchmark.ips do |x|
x.report("class") { PointClass.new(1, 2) }
x.report("struct") { PointStruct.new(1, 2) }
end
Output:
$ crystal run --release class_vs_struct.cr
class 28.17M (± 2.86%) 15.29× slower
struct 430.82M (± 6.58%) fastest
Iterating strings
Strings in Crystal always contain UTF-8 encoded bytes. UTF-8 is a variable-length encoding: a character may be represented by several bytes, although characters in the ASCII range are always represented by a single byte. Because of this, indexing a string with String#[]
is not an O(1)
operation, as the bytes need to be decoded each time to find the character at the given position. There’s an optimization that Crystal’s String
does here: if it knows all the characters in the string are ASCII, then String#[]
can be implemented in O(1)
. However, this isn’t generally true.
For this reason, iterating a String in this way is not optimal, and in fact has a complexity of O(n^2)
:
string = "foo"
while i < string.size
char = string[i]
# ...
end
There’s a second problem with the above: computing the size
of a String is also slow, because it’s not simply the number of bytes in the string (the bytesize
). However, once a String’s size has been computed, it is cached.
The way to improve performance in this case is to either use one of the iteration methods (each_char
, each_byte
, each_codepoint
), or use the more low-level Char::Reader
struct. For example, using each_char
:
string = "foo"
string.each_char do |char|
# ...
end