How to build PCRE2 with Zig


I’ve been learning Zig for a little while, but by no means am I an expert. Karl Seguin is a prolific writer on Zig. His articles are extremely educational and I owe him a debt of gratitude for his posts. His blog, openmymind.net, is a treasure trove of knowledge on not just Zig, but other technical topics and is very much worth your time to visit and bookmark.

As DHH said in a recent interview with the Primeagen and Teej, and I’m paraphrasing here, “a coding language that makes you want to just write code is probably the language you would want to learn.” For me, this language has been Zig. I used it to stream-parse large files in my examination of the forensics tool Cybertriage, but I haven’t had a chance to write a follow-up blog post yet. I love Zig. Besides being a really fun language to program in for me personally, it also has some really straightforward C interop:

const c = @cImport({
    @cInclude("stdio.h");
});

pub fn main() void {
    const str = "this is a string";
    _ = c.printf("this is a number: %d \n", @as(c_int, 42));
    _ = c.printf("this is a pointer: %p\n", &str);
    _ = c.printf("this is the 6th char of str: %c\n", str[5]);
}
Here’s a contrived example of how you can use the C standard library in Zig.
➜  sheran@leonov /tmp zig run ccall.zig
this is a number: 42
this is a pointer: 0x10418f380
this is the 6th char of str: i
➜  sheran@leonov /tmp

I’m writing some text processing tools and one area I had to focus on was text tokenization. For one part of the tokenization, I needed to use regex. While Zig does not have a regex component built into the standard library, it does offer some rudimentary data matching features in std.mem. Since Zig has not reached 1.0 yet, it is still a language in active development, and there is a high likelihood that if the thing you want isn’t found in the language, then you’re going to have to build it yourself. When I’m learning a new language, I tend to build my own features rather than find a dependency for what I need. But build a regex library? I think not.

The reason I shared that snippet of code is to show how easy it is to work with C. So I thought, if I needed a regex component, why not use one from C? Some active searching brought me to Karl’s excellent post on Regular Expressions in Zig. In it, he shows how to use POSIX’s regex.h and, at the same time, points out an existing open issue with Zig wherein you need to do some prep beforehand to use regex.h. Now I could use regex.h, but again, since I was learning, I set myself a goal of trying to do something not previously attempted. So I decided to use PCRE2 instead.

I managed to figure it all out over one weekend, which surprised even me. Source is below if you want to have a look. I’ve put it all in a GitHub repo as well, which contains an interesting aside: the PCRE2 team have very graciously accepted a build.zig file, which is now in their source tree. So it is an absolute breeze to integrate with your Zig project. You can check it out in the repo.
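To give a rough idea of what that integration can look like, here is a minimal build.zig sketch. Treat every name in it as an assumption rather than something taken from my repo: it assumes the Zig 0.14-era build API, a dependency entry called "pcre2" in build.zig.zon, an upstream artifact called "pcre2-8" that also installs pcre2.h, and a hypothetical executable named regex-demo.

```zig
// build.zig: a minimal sketch, assuming the Zig 0.14-era build API.
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // Hypothetical executable name and source path.
    const exe = b.addExecutable(.{
        .name = "regex-demo",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    // Assumes build.zig.zon lists a dependency named "pcre2".
    const pcre2_dep = b.dependency("pcre2", .{
        .target = target,
        .optimize = optimize,
    });

    // Assumes the upstream build.zig exposes the 8-bit library as "pcre2-8".
    exe.linkLibrary(pcre2_dep.artifact("pcre2-8"));

    b.installArtifact(exe);
}
```

The build API still changes between Zig releases, so check the field names against the version you are running before copying this.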

After we’ve added pcre2 as a dependency and built it, then we can import the header.

const std = @import("std");
const re = @cImport({
    @cDefine("PCRE2_CODE_UNIT_WIDTH", "8");
    @cInclude("pcre2.h");
});

const PCRE2_ZERO_TERMINATED = ~@as(re.PCRE2_SIZE, 0);

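Some C macros and defines do not translate well when imported into Zig. PCRE2_ZERO_TERMINATED is one example: in pcre2.h it is the macro (~(PCRE2_SIZE)0), so instead of relying on the imported version I redefine it in Zig with the equivalent cast, as in the last line of the code above.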
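If the cast looks odd, this small standalone sketch shows what it evaluates to. It does not import pcre2.h; the local PCRE2_SIZE alias is a stand-in, on the assumption that the header's PCRE2_SIZE is size_t and therefore maps to usize.

```zig
const std = @import("std");

// Stand-in for re.PCRE2_SIZE (assumed to be size_t, i.e. usize in Zig).
const PCRE2_SIZE = usize;

// Hand-written equivalent of the C macro (~(PCRE2_SIZE)0).
const PCRE2_ZERO_TERMINATED = ~@as(PCRE2_SIZE, 0);

pub fn main() void {
    // Flipping every bit of zero gives the largest value the type can hold.
    std.debug.assert(PCRE2_ZERO_TERMINATED == std.math.maxInt(PCRE2_SIZE));
    std.debug.print("PCRE2_ZERO_TERMINATED = {d}\n", .{PCRE2_ZERO_TERMINATED});
}
```

In other words, the constant is just the maximum usize, used as a sentinel meaning "work out the length from the terminating NUL".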

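// pat must be NUL-terminated because we pass PCRE2_ZERO_TERMINATED as its length.
const pattern: [*]const u8 = pat;
var errornumber: i32 = undefined;
var erroroffset: usize = undefined;
const regeex = re.pcre2_compile_8(
    pattern,
    PCRE2_ZERO_TERMINATED, // let PCRE2 find the length via the terminating NUL
    0, // no compile options
    &errornumber,
    &erroroffset,
    null, // default compile context
);

if (regeex == null) {
    // Compilation failed: turn the error number into a readable message.
    var errormessage: [256]u8 = undefined;
    const msgLen: c_int = re.pcre2_get_error_message_8(errornumber, &errormessage, errormessage.len);
    std.debug.print("Error compiling: {s}\n", .{errormessage[0..@intCast(msgLen)]});
    return;
}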

Then we have to compile our pattern. Here I’ve used Zig primitives instead of C types and they work fine so far. As before, there is an issue with macro expansion, however. You may notice that in the code I’ve used functions like re.pcre2_compile_8 or re.pcre2_get_error_message_8. In the C version (which you can see in the repo) the functions are plain pcre2_compile or pcre2_get_error_message without the trailing _8. The expansion happens in the pcre2.h header. Unfortunately, Zig’s C translation cannot handle a #define that builds an identifier out of bare text, and it will return: error: unable to translate macro: undefined identifier 'pcre2_compile_'. Further, even if you do manage to patch the header file, you will find that Zig will not support the C token-pasting operator ##, so you will eventually have to resort to calling the raw functions, which are suffixed _8, _16, and _32.
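If you miss the unsuffixed names, one option is to rebuild them on the Zig side. The sketch below uses comptime string concatenation with @field to do roughly what ## does in the header. It is only an illustration: fake_re is a hypothetical stand-in for the @cImport namespace with made-up function bodies, and I have not tried this against the real pcre2.h import.

```zig
const std = @import("std");

// Hypothetical stand-in for the namespace returned by @cImport.
// The real functions take many more arguments; these only return a length.
const fake_re = struct {
    pub fn pcre2_compile_8(pattern: []const u8) usize {
        return pattern.len;
    }
    pub fn pcre2_compile_16(pattern: []const u16) usize {
        return pattern.len;
    }
};

// The code unit width, kept as a string so it can be appended to a name.
const width = "8";

// Comptime concatenation plays the role of the ## operator:
// "pcre2_compile_" ++ "8" names the declaration pcre2_compile_8.
const pcre2_compile = @field(fake_re, "pcre2_compile_" ++ width);

pub fn main() void {
    std.debug.print("{d}\n", .{pcre2_compile("a+b")});
}
```

A plain alias such as const pcre2_compile = re.pcre2_compile_8; gets you the same short name with less ceremony; the @field version only pays off if you want the width to live in one place.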

Why these suffixes? Because PCRE2 is built as separate libraries for 8-bit, 16-bit, and 32-bit code units, which is how it handles UTF-8, UTF-16, and UTF-32 respectively. That’s why you have to define PCRE2_CODE_UNIT_WIDTH before including the header. So with the code I’ve written, I have to be extremely sure that I am feeding the functions UTF-8 strings.
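One cheap way to be sure is to validate the bytes before they ever reach PCRE2. This is a small sketch using std.unicode.utf8ValidateSlice from the standard library; the helper name isValidUtf8Subject is hypothetical and not part of my repo.

```zig
const std = @import("std");

// Hypothetical helper: returns true only if bytes is well-formed UTF-8.
fn isValidUtf8Subject(bytes: []const u8) bool {
    return std.unicode.utf8ValidateSlice(bytes);
}

pub fn main() void {
    const good = "héllo";
    // 0xff can never appear in well-formed UTF-8.
    const bad = [_]u8{ 'h', 0xff, 'i' };

    std.debug.print("good: {}\n", .{isValidUtf8Subject(good)});
    std.debug.print("bad: {}\n", .{isValidUtf8Subject(&bad)});
}
```

Running the check once at the boundary means the matching code can assume its input is clean.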

const subject: []const u8 = sub;
const subject_length: usize = subject.len;

const match_data = re.pcre2_match_data_create_from_pattern_8(regeex, null);
const rc = re.pcre2_match_8(regeex, &subject[0], subject_length, 0, 0, match_data, null);

if (rc < 0) {
    switch (rc) {
        re.PCRE2_ERROR_NOMATCH => {
            std.debug.print("No match found\n", .{});
        },
        else => {
            std.debug.print("Matching error: {}\n", .{rc});
        },
    }
    re.pcre2_match_data_free_8(match_data);
    re.pcre2_code_free_8(regeex);
    // Return here so we don't read from, or free, the released data again below.
    return;
}

// The ovector holds pairs of start and end offsets into the subject.
const ovector = re.pcre2_get_ovector_pointer_8(match_data);
if (rc == 1) {
    std.debug.print("Match found at offset: {}\n", .{ovector.*});
} else if (rc > 1) {
    for (0..@intCast(rc)) |i| {
        std.debug.print("{}: {s}\n", .{ i, subject[ovector[2 * i]..ovector[2 * i + 1]] });
    }
}

re.pcre2_match_data_free_8(match_data);
re.pcre2_code_free_8(regeex);

Then we specify our haystack, or subject, and call re.pcre2_match_8. This is near enough to how you would write the C equivalent. Lastly, a return value greater than one (rc > 1) means the pattern’s capture groups matched as well as the pattern itself, so we loop from 0 up to rc and read each pair of offsets from the ovector to find where the full match and each group start and end.
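The pair layout of the ovector is easier to see without PCRE2 in the way. In this sketch the ovector contents and rc are hard-coded, hypothetical values of the kind you would get from matching (\w+)=(\w+) against "key=value"; the helper group is also made up for the illustration.

```zig
const std = @import("std");

// Hypothetical helper: slice number i is described by the pair
// ovector[2 * i] (start offset) and ovector[2 * i + 1] (end offset).
fn group(subject: []const u8, ovector: []const usize, i: usize) []const u8 {
    return subject[ovector[2 * i]..ovector[2 * i + 1]];
}

pub fn main() void {
    const subject = "key=value";
    // Hard-coded stand-in for what pcre2_get_ovector_pointer_8 would return:
    // pair 0 is the whole match, pairs 1 and 2 are the capture groups.
    const ovector = [_]usize{ 0, 9, 0, 3, 4, 9 };
    const rc: usize = 3; // whole match plus two groups

    for (0..rc) |i| {
        std.debug.print("{d}: {s}\n", .{ i, group(subject, &ovector, i) });
    }
}
```

Pair 0 always describes the whole match, which is why the rc == 1 case in my code only prints a single offset.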

You’ll notice that we did no Zig memory allocations; that was all handled by C. We called the C free functions just as the original C source does. This is quite fresh code, so if you find better ways to do what I did, then PRs are very much welcome!