How I sped up my XML parser by 15 times

In the previous blog post we showcased possible optimizations for binary parsing implementations in Elixir. This article describes how I applied those techniques to drastically speed up XML parsing.

[Image: Son Doong Cave]

XML is one of the most widely used formats for data interchange on the web today. At Forza Football, in addition to JSON and MessagePack, we use XML in several facets of our work: receiving football data from providers and obtaining headlines from RSS feeds. Ensuring that we have an efficient method to parse the data we receive boosts the performance of the application as a whole and thus improves the user experience. With speed and usability as my two ultimate goals, I started writing the first implementation of the Saxy project.

Not so “saxy”

After two months of work and two aborted proofs of concept, the first version was published to the world. Since speed had been the desired goal of this project from the beginning, I decided to run some benchmarks of Saxy against other XML parsing libraries.

Benchmarking XML parsers is hard because the result depends highly on the complexity of the document, so a relatively simple XML file was picked. The chosen contestants were xmerl (the standard XML parser in Erlang/OTP), Erlsom, and fast_xml. In the end I could not get fast_xml to work as intended, so only xmerl and Erlsom went into the benchmark script.

Library        Microseconds per run
Saxy 0.3.0     24.4232
xmerl 1.3.16   299.242
Erlsom 1.4.1   65.8912

“Eureka! Eureka! Saxy is 2.7 times faster than Erlsom and 12 times faster than xmerl 🎉🎉🎉” I excitedly screamed out.

But after taking some time to reflect and cool down, it turned out that I had made a fatal mistake right in the benchmark script. When the script was fixed, reality slapped me really hard in the face.

Library        Microseconds per run
Saxy 0.3.0     943.401
xmerl 1.3.16   285.5068
Erlsom 1.4.1   66.8736

“Focus on speed … 😏, good that now you know how ignorant you are!”, I told my disappointed self.

Shortcomings

I decided to give it another try, but with a different strategy this time. Instead of throwing the proof of concept away and immediately writing a new implementation, I spent a few days pondering why the previous parser was so unacceptably slow. What I did was: 1) studied how Erlsom does it (thus learning from the winner), 2) read the Erlang Efficiency Guide, and 3) used our previous blog post as a guideline.

It became clear that there were some shortcomings in my original approach:

Sub-binary creation

Let’s look at a common rule matching function in Saxy.

def match(buffer, position, :element, state) do
    with {:ok, {:<, _token_value}, {new_buffer, new_pos}, new_state} <-
           match(buffer, position, :<, state),
         {:ok, {:Name, tag_name}, {new_buffer, new_pos}, new_state} <-
           match(new_buffer, new_pos, :Name, new_state),
         {:ok, {:SAttribute, _attributes}, {new_buffer, new_pos}, new_state} <-
           zero_or_more(new_buffer, new_pos, :SAttribute, new_state, []),
         {:ok, {:S, _s_char}, {new_buffer, new_pos}, new_state} <-
           zero_or_one(new_buffer, new_pos, :S, new_state) do
      {:ok, {:element, tag_name}, {new_buffer, new_pos}, new_state}
    end
end

A quick summary of what the code does: it tries to match the < token, then the :Name rule, :SAttribute, and finally :S. For each token/rule that is matched, it returns a four-element tuple containing the matched token/rule, the buffer and its current position, and the parsing state.

But why does it make the whole parsing flow slow? Because it prevents the compiler from keeping the initial match context: every match function returns its sub-binary, forcing a new one to be created. This topic has been covered extensively in the binary parsing optimization post.
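
To make the contrast concrete, here is a minimal sketch (my own illustration, not Saxy's actual code). A helper that returns the rest of the binary forces a sub-binary to be created on every call, while a loop that threads the binary straight into the next match lets the VM keep a single match context:

# Slow shape: `rest` escapes next_byte/1 as part of the return value,
# so a sub-binary must be materialized on every call.
defp next_byte(<<byte, rest::bits>>), do: {byte, rest}

defp sum_slow(<<>>, acc), do: acc

defp sum_slow(binary, acc) do
  {byte, rest} = next_byte(binary)
  sum_slow(rest, acc + byte)
end

# Fast shape: the binary flows directly from one match into the next,
# so the VM can reuse the match context and delay sub-binary creation.
defp sum_fast(<<>>, acc), do: acc
defp sum_fast(<<byte, rest::bits>>, acc), do: sum_fast(rest, acc + byte)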

Pattern Matching Mania

In Erlang, pattern matching in function heads, as well as in case and receive expressions, is usually optimized.

defmodule Foo do
  def bar([item]), do: item
  def bar({item}), do: item
  def bar([item | _rest]), do: item
end

The generated Erlang VM instructions below show how the compiler rearranges the clauses and produces more optimized code. See here for more information on how to generate this assembler code.

{:function, :bar, 1, 10,
 [
   {:label, 9},
   {:line, [{:location, 'lib/foo.ex', 2}]},
   {:func_info, {:atom, Foo}, {:atom, :bar}, 1},
   {:label, 10},
   {:test, :is_nonempty_list, {:f, 12}, [x: 0]}, % <—————————————
   {:get_list, {:x, 0}, {:x, 1}, {:x, 2}},
   {:test, :is_nil, {:f, 11}, [x: 2]}, % <—————————————
   {:move, {:x, 1}, {:x, 0}},
   :return,
   {:label, 11},
   {:move, {:x, 1}, {:x, 0}},
   :return,
   {:label, 12},
   {:test, :is_tuple, {:f, 9}, [x: 0]}, % <—————————————
   {:test, :test_arity, {:f, 9}, [{:x, 0}, 1]},
   {:get_tuple_element, {:x, 0}, 0, {:x, 0}},
   :return
 ]}
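
For reference, one way to produce a dump like the one above (an assumption of mine, not necessarily how this one was generated) is to disassemble the compiled module with the :beam_disasm module that ships with Erlang:

# Compile lib/foo.ex first (for example with `elixirc lib/foo.ex`),
# which writes Elixir.Foo.beam to the current directory.
{:beam_file, Foo, _exports, _attributes, _compile_info, code} =
  :beam_disasm.file('Elixir.Foo.beam')

# `code` holds one {:function, name, arity, entry, instructions}
# tuple per function, like the one shown above.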

But unfortunately, binary matching is one of the exceptions: the Erlang compiler does not rearrange such clauses yet.

defp match_token(<<"-->", _rest::bits>>, :"-->") do
  {:ok, {"-->", 3}}
end

defp match_token(<<_::bits>>, :"-->") do
  :error
end

defp match_token(<<"-->", _rest::bits>>, :CommentChar) do
  :error
end

defp match_token(<<char::utf8, _rest::bits>>, :CommentChar) do
  {:ok, {<<char::utf8>>, byte_size(<<char::utf8>>)}}
end

# 50 more token-matching clauses.

defp match_token(<<charcode::utf8, _rest::bits>>, :HexChar) do
  if hex_char?(charcode) do
    char = <<charcode::utf8>>
    {:ok, {char, byte_size(char)}}
  else
    :error
  end
end

Pattern matching is absolutely one of the coolest features of Elixir/Erlang and makes writing conditional code easier than ever, especially for someone with a background in other languages. However, like everything else in the Universe, using it excessively eventually comes back to bite us.

In the code above, every token type is passed as the second argument to match_token. In order to match a :HexChar token, the Erlang VM has to try every clause of the function from top to bottom (because the first argument is matched against a binary) before it reaches the clause that genuinely does the matching work. That costs a great deal of overhead and gains nothing in return.
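
The fix is mechanical: give each token type its own function, so the VM jumps straight to the one relevant binary match. A hedged sketch mirroring the :HexChar clause above (not Saxy's exact code):

defp match_hex_char(<<charcode::utf8, _rest::bits>>) do
  if hex_char?(charcode) do
    char = <<charcode::utf8>>
    {:ok, {char, byte_size(char)}}
  else
    :error
  end
end

defp match_hex_char(<<_::bits>>), do: :error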

Binary construction

For better usability, Saxy was designed to emit SAX event data as binaries, instead of the charlists used in other libraries. This makes all the operations afterwards easier for the library's users.
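
For readers coming from Erlang, the difference in a nutshell (purely illustrative):

"goal"   # binary: contiguous bytes, works directly with the String module
'goal'   # charlist: a linked list of code points, i.e. [103, 111, 97, 108]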

I am going to demonstrate how Saxy previously operated:

defp zero_or_more(buffer, position, rule, state, acc) do
  case match(buffer, position, rule, state) do
    {:ok, {^rule, value}, {new_buffer, current_pos}, new_state} ->
      zero_or_more(new_buffer, current_pos, rule, new_state, acc(acc, value))

    {:error, ^rule, {new_buffer, mismatch_pos}, new_state} ->
      {:ok, {rule, acc}, {new_buffer, mismatch_pos}, new_state}
  end
end

defp acc(acc, value) when is_binary(acc), do: acc <> value

defp match_token(<<charcode::utf8, _rest::bits>>, :NameChar) do
  if name_char?(charcode) do
    char = <<charcode::utf8>>
    {:ok, {char, byte_size(char)}}
  else
    :error
  end
end

At first, I suspected that binary appending operations were a bottleneck because of allocations, but it turned out that the Erlang VM is way smarter than I expected. Unlike binary matching, which is optimized by the compiler, the optimization of binary appending (primarily avoiding copying) is done by the runtime system, so it fails in only a very few circumstances.
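
A minimal sketch of the shape the runtime can optimize (in the spirit of the Efficiency Guide, not Saxy's code): as long as the accumulator is used only as the append target and threaded through the loop, each append extends the same underlying buffer instead of copying it.

# Optimized by the runtime: `acc` is only ever extended and passed on,
# so acc <> <<byte>> does not copy the whole accumulated binary.
defp collect(<<byte, rest::bits>>, acc), do: collect(rest, acc <> <<byte>>)
defp collect(<<>>, acc), do: acc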

The good thing about open source is that you can learn from any random person on the Internet by looking at their code. I peeked at how other libraries implemented this and learned that Jason takes a fairly smart approach. As mentioned, binaries in Erlang are designed so that they can be referenced internally: a sub-binary is only a reference into part of another binary. So instead of constructing the result binary, we can slice it out of the original binary with Kernel.binary_part/3.
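
A minimal sketch of the slicing idea, with hypothetical values:

original = "<streamer>Saxy</streamer>"
pos = 1   # byte offset just past "<"
len = 8   # byte length of "streamer"

binary_part(original, pos, len)
# => "streamer", a sub-binary referencing `original`; nothing is copied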

Make Saxy great again

Taking all of these learnings into account (continuous binary matching, restrained pattern matching, and sub-binaries instead of new binary construction), I implemented the next version of Saxy.

def parse_open_tag_name(<<charcode, rest::bits>>, cont, original, pos, state, len)
    when is_name_char(charcode) do
  parse_open_tag_name(rest, cont, original, pos, state, len + 1)
end

def parse_open_tag_name(<<charcode::utf8, rest::bits>>, cont, original, pos, state, len)
    when is_name_char(charcode) do
  parse_open_tag_name(rest, cont, original, pos, state, len + compute_unicode_len(charcode))
end

def parse_open_tag_name(<<buffer::bits>>, cont, original, pos, state, len) do
  name = binary_part(original, pos, len)
  state = %{state | stack: [name | state.stack]}
  parse_sattribute(buffer, cont, original, pos + len, state, [])
end
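
The clauses above lean on two small helpers. Hedged sketches of what they might look like (my guesses, not necessarily Saxy's exact code):

# A guard-safe check for XML name characters; a minimal ASCII-only version.
defguardp is_name_char(c)
          when c in ?a..?z or c in ?A..?Z or c in ?0..?9 or
               c == ?_ or c == ?- or c == ?: or c == ?.

# Byte length of a code point once encoded as UTF-8.
defp compute_unicode_len(code) when code <= 0x7F, do: 1
defp compute_unicode_len(code) when code <= 0x7FF, do: 2
defp compute_unicode_len(code) when code <= 0xFFFF, do: 3
defp compute_unicode_len(code), do: 4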

This implementation has many improvements in comparison to the previous one:

  1. Sub-binaries in each parsing function are now passed continuously to the next one without being returned. This prevents new sub-binaries from being created and keeps the original match context.
  2. Every rule now gets its own function, instead of the rule name being passed as the second argument and forcing the VM into crazy clause evaluations.
  3. Data can be sliced out of the original binary with binary_part/3, without having to be constructed. Sometimes binary construction cannot be completely avoided, for example when character references are mixed into XML content ("Tom &#x26; Jerry"); in that case we can avoid binary appending by building iodata (["Tom ", 0x26, " Jerry"]), as sketched right after this list.
  4. It also improves error handling. For example, if the look-ahead code point is not an expected token, we can immediately return a parsing error, instead of relying on a throw/catch block as in the previous approach.
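
As referenced in point 3, a small sketch of the iodata trick: collect plain slices and decoded character references in a list, and flatten only once at the end (or hand the iodata straight to whatever consumes it).

chunks = ["Tom ", 0x26, " Jerry"]   # slices of the original plus a decoded byte

IO.iodata_to_binary(chunks)
# => "Tom & Jerry"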

And a happy ending

After applying all the optimizations above, I ran the benchmark again (with the correct script 🙈), and the results started to look very promising:

Library        Microseconds per run
Saxy 0.4.0     64.401
xmerl 1.3.16   285.5068
Erlsom 1.4.1   66.8736

THAT IS A 14.7x SPEED-UP (yeah, I rounded it up a bit in the blog title).

Here are the results with other samples and a proper benchmarking library.
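
The IPS and 99th-percentile numbers below are the kind of metrics Benchee reports; assuming Benchee and a no-op SAX handler (both assumptions of mine, NoopHandler being hypothetical), such a run could look like:

sample_xml = File.read!("sample.xml")

Benchee.run(%{
  "Saxy" => fn -> Saxy.parse_string(sample_xml, NoopHandler, nil) end,
  "Erlsom" => fn -> :erlsom.parse_sax(sample_xml, nil, fn _event, acc -> acc end) end
})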

Benchmarking against Hackernoon RSS, Saxy is 1.59 times faster than Erlsom.

Library        IPS      99th percentile
Saxy 0.4.0     437.79   3.21 ms
Erlsom 1.4.1   275.23   5.03 ms

The speedup is particularly noticeable with deeply nested XML: 4.35 times faster.

Library        IPS     99th percentile
Saxy 0.4.0     15.37   76.18 ms
Erlsom 1.4.1   3.53    294.30 ms

More detailed benchmark results can be found on GitHub.

Summary

I had many ups and downs and learned a bunch of things while building the library. I hope this blog post is a fun read and provides a real-world example of optimizing binary parsing.