
This module contains a scanf macro that can be used for extracting substrings from an input string. This is often easier than regular expressions. Some examples as an appetizer:

  # check if input string matches a triple of integers:
  const input = "(1,2,4)"
  var x, y, z: int
  if scanf(input, "($i,$i,$i)", x, y, z):
    echo "matches and x is ", x, " y is ", y, " z is ", z

  # check if input string matches an ISO date followed by an identifier followed
  # by whitespace and a floating point number:
  var year, month, day: int
  var identifier: string
  var myfloat: float
  if scanf(input, "$i-$i-$i $w$s$f", year, month, day, identifier, myfloat):
    echo "yes, we have a match!"

As can be seen from the examples, strings are matched verbatim except for substrings starting with $. These constructions are available:

  $b        Matches a binary integer. This uses parseutils.parseBin.
  $o        Matches an octal integer. This uses parseutils.parseOct.
  $i        Matches a decimal integer. This uses parseutils.parseInt.
  $h        Matches a hex integer. This uses parseutils.parseHex.
  $f        Matches a floating-point number. Uses parseFloat.
  $w        Matches an ASCII identifier: [A-Za-z_][A-Za-z_0-9]*.
  $c        Matches a single ASCII character.
  $s        Skips optional whitespace.
  $$        Matches a single dollar sign.
  $.        Matches if the end of the input string has been reached.
  $*        Matches until the token following the $* was found. The match is allowed to be of 0 length.
  $+        Matches until the token following the $+ was found. The match must consist of at least one char.
  ${foo}    User defined matcher. Uses the proc foo to perform the match. See below for more details.
  $[foo]    Call user defined proc foo to skip some optional parts in the input string. See below for more details.
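
As a minimal sketch of a few of these constructions, the following snippet uses $$, $i, $c and $h on made-up inputs. The variable names and input strings are illustrative assumptions, not part of the module.

```nim
import std/strscans

var price: int
var unit: char
# "$$" matches a literal dollar sign, "$i" a decimal integer,
# and "$c" a single character.
doAssert scanf("$15k", "$$$i$c", price, unit)
doAssert price == 15
doAssert unit == 'k'

var mask: int
# "$h" parses hexadecimal digits via parseutils.parseHex.
doAssert scanf("ff", "$h", mask)
doAssert mask == 255
```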

Even though $* and $+ look similar to the regular expressions .* and .+, they work quite differently. There is no non-deterministic state machine involved and the matches are non-greedy. For example, the pattern [$*] matches the input [xyz] via parseutils.parseUntil.
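
The non-greedy behaviour can be sketched as follows. The inputs are made up for illustration; the point is that the capture ends at the first occurrence of the token that follows $* or $+ in the pattern.

```nim
import std/strscans

var inner: string
# The capture stops at the first "]", so only "a" is bound.
doAssert scanf("[a]b]", "[$*]", inner)
doAssert inner == "a"

# "$*" may match zero characters, while "$+" needs at least one.
doAssert scanf("[]", "[$*]", inner)
doAssert inner == ""
doAssert not scanf("[]", "[$+]", inner)
```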

Furthermore, no backtracking is performed: if parsing fails after a value has already been bound to a matched subexpression, this value is not restored to its original value. This rarely causes problems in practice, and if it does, it is easy enough to bind to a temporary variable first.
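
A small sketch of this effect, using hypothetical variables a and b: the first $i binds a value before the scan fails on the second one, and the binding is kept.

```nim
import std/strscans

var a = -1
var b = -1
# The first "$i" binds 12 to `a`; the second "$i" then fails on "x".
doAssert not scanf("12,x", "$i,$i", a, b)
doAssert a == 12  # `a` is not reset to -1

# Binding to temporaries first keeps the real variables untouched on failure.
var first = -1
var second = -1
var tmpFirst, tmpSecond: int
if scanf("12,x", "$i,$i", tmpFirst, tmpSecond):
  first = tmpFirst
  second = tmpSecond
doAssert first == -1 and second == -1
```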

Startswith vs full match

scanf returns true if the input string starts with the specified pattern. If instead it should only return true if there is also nothing left in the input, append $. to your pattern.
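
The difference can be illustrated with a short sketch; the input strings are assumptions chosen for the example.

```nim
import std/strscans

var n: int
# Prefix match: the trailing "abc" is simply ignored.
doAssert scanf("42abc", "$i", n)
doAssert n == 42

# With "$." the pattern must consume the whole input.
doAssert not scanf("42abc", "$i$.", n)
doAssert scanf("42", "$i$.", n)
```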

User definable matchers

One very nice advantage over regular expressions is that scanf is extensible with ordinary Nim procs. The proc is either enclosed in ${} or in $[]. ${} matches and binds the result to a variable (that was passed to the scanf macro) while $[] merely matches optional tokens without any result binding.

In this example, we define a helper proc someSep that skips some separators which we then use in our scanf pattern to help us in the matching process:

  proc someSep(input: string; start: int; seps: set[char] = {':','-','.'}): int =
    # Note: The parameters and return value must match what ``scanf`` requires
    result = 0
    while start+result < input.len and input[start+result] in seps: inc result

  if scanf(input, "$w$[someSep]$w", key, value):
    ...
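
A self-contained variant of this example might look as follows. The input string "host::example" and the declarations of key and value are assumptions added for illustration.

```nim
import std/strscans

proc someSep(input: string; start: int; seps: set[char] = {':', '-', '.'}): int =
  # Returns the number of separator characters found at `start` (possibly 0).
  result = 0
  while start + result < input.len and input[start + result] in seps:
    inc result

var key, value: string
# "$w" binds an identifier, "$[someSep]" skips the separators without binding.
doAssert scanf("host::example", "$w$[someSep]$w", key, value)
doAssert key == "host"
doAssert value == "example"
```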

It is also possible to pass arguments to a user definable matcher:

  proc ndigits(input: string; intVal: var int; start: int; n: int): int =
    # matches exactly ``n`` digits. Matchers need to return 0 if nothing
    # matched or otherwise the number of processed chars.
    var x = 0
    var i = 0
    while i < n and i+start < input.len and input[i+start] in {'0'..'9'}:
      x = x * 10 + input[i+start].ord - '0'.ord
      inc i
    # only overwrite if we had a match
    if i == n:
      result = n
      intVal = x

  # match an ISO date extracting year, month, day at the same time.
  # Also ensure the input ends after the ISO date:
  var year, month, day: int
  if scanf("2013-01-03", "${ndigits(4)}-${ndigits(2)}-${ndigits(2)}$.", year, month, day):
    ...

The scanp macro

This module also implements a scanp macro, whose syntax somewhat resembles an EBNF or PEG grammar, except that it uses Nim’s expression syntax and so has to use prefix instead of postfix operators.

  (E)           Grouping
  *E            Zero or more
  +E            One or more
  ?E            Zero or One
  E{n,m}        From n up to m times E
  ~E            Not predicate
  a ^* b        Shortcut for ?(a *(b a)). Usually used for separators.
  a ^+ b        Shortcut for ?(a +(b a)). Usually used for separators.
  'a'           Matches a single character
  {'a'..'b'}    Matches a character set
  "s"           Matches a string
  E -> a        Bind matching to some action
  $_            Access the currently matched character

Note that unordered or ordered choice operators (/, |) are not implemented.
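
A minimal sketch of a few of these operators, using a made-up key=value input. The names input, key and value are assumptions for illustration; the action bound with -> runs for every matched character.

```nim
import std/strscans

let input = "name=nim"
var idx = 0
var key = ""
var value = ""
# One or more lowercase letters, a literal '=', then one or more letters again.
let ok = scanp(input, idx,
               +{'a'..'z'} -> key.add($_),
               '=',
               +{'a'..'z'} -> value.add($_))
doAssert ok
doAssert key == "name"
doAssert value == "nim"
doAssert idx == input.len  # `idx` was advanced past everything that matched
```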

Simple example that parses the /etc/passwd file line by line:

  const
    etc_passwd = """root:x:0:0:root:/root:/bin/bash
  daemon:x:1:1:daemon:/usr/sbin:/bin/sh
  bin:x:2:2:bin:/bin:/bin/sh
  sys:x:3:3:sys:/dev:/bin/sh
  nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
  messagebus:x:103:107::/var/run/dbus:/bin/false
  """

  proc parsePasswd(content: string): seq[string] =
    result = @[]
    var idx = 0
    while true:
      var entry = ""
      if scanp(content, idx, +(~{'\L', '\0'} -> entry.add($_)), '\L'):
        result.add entry
      else:
        break

The scanp macro maps the grammar code into Nim code that performs the parsing. The parsing is performed with the help of 3 helper templates that can be implemented for a custom type.

These templates need to be named atom and nxt. atom should be overloaded to handle both single characters and sets of characters.

  import std/streams

  template atom(input: Stream; idx: int; c: char): bool =
    ## Used in scanp for the matching of atoms (usually chars).
    peekChar(input) == c

  template atom(input: Stream; idx: int; s: set[char]): bool =
    peekChar(input) in s

  template nxt(input: Stream; idx, step: int = 1) =
    inc(idx, step)
    setPosition(input, idx)

  if scanp(content, idx, +( ~{'\L', '\0'} -> entry.add(peekChar($input))), '\L'):
    result.add entry

Calling ordinary Nim procs inside the macro is possible:

  proc digits(s: string; intVal: var int; start: int): int =
    var x = 0
    while result+start < s.len and s[result+start] in {'0'..'9'} and s[result+start] != ':':
      x = x * 10 + s[result+start].ord - '0'.ord
      inc result
    intVal = x

  proc extractUsers(content: string): seq[string] =
    # Extracts the username and home directory
    # of each entry (with a UID of at least 1000)
    const
      digits = {'0'..'9'}
    result = @[]
    var idx = 0
    while true:
      var login = ""
      var uid = 0
      var homedir = ""
      if scanp(content, idx, *(~ {':', '\0'}) -> login.add($_), ':', * ~ ':', ':',
              digits($input, uid, $index), ':', *`digits`, ':', * ~ ':', ':',
              *('/', * ~{':', '/'}) -> homedir.add($_), ':', *('/', * ~{'\L', '/'}), '\L'):
        if uid >= 1000:
          result.add login & " " & homedir
      else:
        break

When used for matching, keep in mind that, as with scanf, no backtracking is performed.

  proc skipUntil(s: string; until: string; unless = '\0'; start: int): int =
    # Skips all characters until the string `until` is found. Returns 0
    # if the char `unless` is found first or the end is reached.
    var i = start
    var u = 0
    while true:
      if i >= s.len or s[i] == unless:
        return 0
      elif s[i] == until[0]:
        u = 1
        while i+u < s.len and u < until.len and s[i+u] == until[u]:
          inc u
        if u >= until.len: break
      inc(i)
    result = i+u-start

  iterator collectLinks(s: string): string =
    const quote = {'\'', '"'}
    var idx, old = 0
    var res = ""
    while idx < s.len:
      old = idx
      if scanp(s, idx, "<a", skipUntil($input, "href=", '>', $index),
              `quote`, *( ~`quote`) -> res.add($_)):
        yield res
        res = ""
      idx = old + 1

  for r in collectLinks(body):
    echo r

In this example both macros are combined seamlessly in order to maximise efficiency and perform different checks.

  iterator parseIps*(soup: string): string =
    ## ipv4 only!
    const digits = {'0'..'9'}
    var a, b, c, d: int
    var buf = ""
    var idx = 0
    while idx < soup.len:
      if scanp(soup, idx, (`digits`{1,3}, '.', `digits`{1,3}, '.',
               `digits`{1,3}, '.', `digits`{1,3}) -> buf.add($_)):
        discard buf.scanf("$i.$i.$i.$i", a, b, c, d)
        if (a >= 0 and a <= 254) and
           (b >= 0 and b <= 254) and
           (c >= 0 and c <= 254) and
           (d >= 0 and d <= 254):
          yield buf
      buf.setLen(0) # need to clear `buf` each time, cause it might contain garbage
      idx.inc

Imports

macros, parseutils, since

Macros

  macro scanf(input: string; pattern: static[string]; results: varargs[typed]): bool

See top level documentation of this module about how scanf works.

  macro scanp(input, idx: typed; pattern: varargs[untyped]): bool

See top level documentation of this module about how scanp works.

  macro scanTuple(input: untyped; pattern: static[string];
                  matcherTypes: varargs[untyped]): untyped

Works like scanf, but instead of requiring predeclared variables it returns a tuple. The tuple starts with a bool that indicates whether the scan was successful, followed by the requested data. If using a user defined matcher, provide the matcher types after the pattern, in the order the matchers appear: line.scanTuple("${yourMatcher()}", int)

Example:

  let (success, year, month, day, time) = scanTuple("1000-01-01 00:00:00", "$i-$i-$i$s$+")
  if success:
    assert year == 1000
    assert month == 1
    assert day == 1
    assert time == "00:00:00"

Templates

  template atom(input: string; idx: int; c: char): bool

Used in scanp for the matching of atoms (usually chars). EOF is matched as '\0'.

  template atom(input: string; idx: int; s: set[char]): bool

  template hasNxt(input: string; idx: int): bool

  template nxt(input: string; idx, step: int = 1)

  template success(x: int): bool