- SRL is case insensitive. Thus, LITERALLY "test" is exactly the same as literally "test". Be aware, however, that everything inside a string is case sensitive: LITERALLY "TEST" does NOT equal literally "test".
- The comma separating statements is completely optional and has no effect whatsoever. In fact, it is removed before interpreting. It is allowed because it helps the human eye distinguish between different statements.
- Strings are interpreted as literal characters and will be escaped. They can either be defined using 'single' or "double" quotation marks. Escaping them using a backslash is possible as well.
- Parentheses should only be used when building a sub-query, for example when using a capture or non-capture group to apply a quantifier to multiple items.
- Comments are currently not supported and may be implemented in the future.
A frame is anything that can match a tile in our model. An SRL statement that describes one frame, or a sequence of frames, takes the following form:
<character-set-name> [specification] [quantifier] [anchor]
As you can see, the <character-set-name> almost always comes first (there can be another '[anchor]' before it). It starts a new statement, and everything that follows defines or refines the frame(s) it introduces. Some <character-set-name>s allow a specification. For example, 'LETTER' allows you to specify a span of allowed letters, e.g.: 'from a to f'.
Every frame or frame sequence can be quantified. You may want to match exactly four letters from a to f. This would match abcd, but not abcg. You can do that by supplying 'exactly 4 times' as a quantifier:
letter from a to f exactly 4 times
Note: this adds 4 frames to our regex, four copies of 'letter from a to f'
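To get a feel for what this statement amounts to, here is a minimal sketch using Python's re module. The pattern [a-f]{4} is written by hand and is an assumption about the regex this query corresponds to, not output taken from an SRL library.

```python
import re

# Assumed regex equivalent of: letter from a to f exactly 4 times
four_letters = re.compile(r"[a-f]{4}")

matches_abcd = four_letters.fullmatch("abcd") is not None
matches_abcg = four_letters.fullmatch("abcg") is not None

print(matches_abcd)  # True
print(matches_abcg)  # False
```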
Okay, let's dive into the different <character-set-name>s. Below, you can find a list of all available <character-set-name>s, each with an example query.
literally "string"
'literally' allows you to build up a sequence of frames with one statement. It passes a string to the query that will be interpreted as exactly what you've requested. Nothing else will match besides your string. Any special character will automatically be escaped.
literally "sample"
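The automatic escaping is the important part. The sketch below imitates it with Python's re.escape; the helper name literally_pattern is hypothetical and only serves this illustration.

```python
import re

def literally_pattern(text: str) -> str:
    """Hypothetical helper: escape text so it only matches itself."""
    return re.escape(text)

escaped = re.compile(literally_pattern("1+1=2"))
unescaped = re.compile("1+1=2")  # here "+" keeps its regex meaning

escaped_finds_literal = escaped.search("so 1+1=2, right?") is not None
escaped_finds_other = escaped.search("11=2") is not None
unescaped_finds_other = unescaped.search("11=2") is not None

print(escaped_finds_literal)  # True
print(escaped_finds_other)    # False
print(unescaped_finds_other)  # True
```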
format:
one of "characters"
So 'literally' (above) comes in handy if the string is known. But if there is an unknown string which may only contain certain characters, using ONE OF makes much more sense. This will match one of the supplied characters.
one of "a%1"
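In regex terms this behaves like a character class. A small sketch, assuming the query maps to [a%1] (the class is built by hand here, not by an SRL library):

```python
import re

allowed = "a%1"
# Assumed regex equivalent of: one of "a%1"
one_of = re.compile("[" + re.escape(allowed) + "]")

results = {ch: one_of.fullmatch(ch) is not None for ch in "a%1b"}
print(results)  # {'a': True, '%': True, '1': True, 'b': False}
```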
format:
letter [from a to z]
This will help you match a letter within a specific span, if the exact word isn't known. If you know you're expecting a letter, then go for it. If you don't supply anything, a normal letter between a and z will be matched. Of course, you can define a span using the from x to y syntax.
Please note, that this will only match one letter. If you expect more than one letter, use a quantifier.
Note: LETTER would be called an alphabetic character in computer science class.
letter from a to f
uppercase letter
or use in this format:
uppercase letter [from A to Z]
This, of course, behaves just like the normal letter, with the only difference being that uppercase letter only matches letters written in uppercase. If the case insensitive flag is applied to the query, the two act exactly the same.
uppercase letter from A to F
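The effect of the case insensitive flag can be sketched with Python's re.IGNORECASE. The pattern [A-F] is an assumed equivalent of the query above.

```python
import re

# Assumed regex equivalent of: uppercase letter from A to F
upper = re.compile(r"[A-F]")
upper_ignoring_case = re.compile(r"[A-F]", re.IGNORECASE)

strict_accepts_lower = upper.fullmatch("c") is not None
flagged_accepts_lower = upper_ignoring_case.fullmatch("c") is not None

print(strict_accepts_lower)   # False
print(flagged_accepts_lower)  # True
```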
format:
any character
Just like a letter, any character matches anything from A to Z (case insensitive), plus 0 to 9 and _. This way you can validate, for example, whether someone entered a valid username.
In many computer languages, including Python, these are the characters from which you can form valid identifiers.
starts with any character once or more, must end
Note: this example shows an anchor in front.
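A username check along these lines might look as follows in Python. The character class is spelled out by hand as an assumption about what the query stands for; note that Python's \w would also accept non-ASCII letters unless re.ASCII is set. The function name is_valid_username is hypothetical.

```python
import re

# Assumed regex equivalent of:
#   starts with any character once or more, must end
username = re.compile(r"^[A-Za-z0-9_]+$")

def is_valid_username(text: str) -> bool:
    """Hypothetical helper for this example."""
    return username.search(text) is not None

print(is_valid_username("john_doe42"))  # True
print(is_valid_username("john doe"))    # False
print(is_valid_username(""))            # False
```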
no character
The inverse of any character is no character. This will match everything except a to z, A to Z, 0 to 9 and _.
Example query:
starts with no character once or more, must end
format:
digit [from 0 to 9]
When expecting a digit, but not a specific one, this comes in handy. Each digit statement matches only one digit from 0 to 9; to match several, use a quantifier. Limiting the span isn't a problem either. So if you're searching for a digit from 5 to 7, go for it!
Note: number is an alias for digit.
starts with digit from 5 to 7 exactly 2 times, must end
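A quick sketch of how this query behaves, using a hand-written pattern that is assumed to be equivalent:

```python
import re

# Assumed regex equivalent of:
#   starts with digit from 5 to 7 exactly 2 times, must end
two_digits = re.compile(r"^[5-7]{2}$")

checks = {text: two_digits.search(text) is not None
          for text in ("56", "58", "567")}
print(checks)  # {'56': True, '58': False, '567': False}
```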
format:
anything
Any character whatsoever, except for line breaks. And, of course, it matches only once. So don't forget to apply a quantifier, if necessary.
anything
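The line break exception is easy to see in Python, assuming anything corresponds to the regex dot:

```python
import re

# Assumed regex equivalent of: anything
anything = re.compile(r".")

matches_letter = anything.fullmatch("x") is not None
matches_newline = anything.fullmatch("\n") is not None

# With a quantifier, the match stops at the first line break.
first_line = re.search(r".+", "first line\nsecond line").group()

print(matches_letter)   # True
print(matches_newline)  # False
print(first_line)       # first line
```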
format:
new line
Match a new line. Forgive us, if we can't provide an example for that one, but you can check it out on the builder.
[no] whitespace
This matches any whitespace character. This includes a space, tab or new line. If you use no whitespace, everything except a whitespace character will match.
whitespace
tab
If you want to match tabs, but no other whitespace characters, this might be for you. It will only match the tab character, and nothing else.
tab
backslash
Matching a backslash with literally would work, but requires escaping, since the backslash is the escaping character. Thus, you'd have to use literally "\\" to match one backslash. Or you could just write backslash.
backslash
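The same escaping issue shows up in plain Python regexes. A minimal sketch, assuming backslash corresponds to the pattern below:

```python
import re

# Assumed regex equivalent of: backslash
# In a raw string, two backslashes form a pattern for one literal backslash.
backslash = re.compile(r"\\")

path = r"C:\temp"  # contains exactly one backslash
count = len(backslash.findall(path))
print(count)  # 1
```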
format:
raw "expression"
Sometimes, you may want to enforce a specific part of a regular expression. You can do this by using raw. This will append the given string without escaping it.
literally "an", whitespace, raw "[a-zA-Z]"
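To illustrate the difference between escaped and raw parts, the sketch below assembles a pattern by hand. The concatenation is an assumption about how the three statements combine, not the output of an SRL library.

```python
import re

# Assumed regex equivalent of:
#   literally "an", whitespace, raw "[a-zA-Z]"
pattern = re.compile(re.escape("an") + r"\s" + "[a-zA-Z]")

hit = pattern.search("an apple")
miss = pattern.search("an 1")

print(hit.group())  # an a
print(miss)         # None
```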
====================================================
Quantifiers are probably one of the most important things here. If you've specified a character or a group in your query and now want to multiply it, you don't have to copy and paste all of it. Just tell them how many copies to allow.
Oh, and don't be confused if a quantifier seems to accept more than the example suggests. That's okay, since we're not forcing the string to start or end there. Thus, even if only part of the string matches, the expression counts as a match.
Remember: You can execute every Python cell in this notebook by clicking it and then pressing shift-enter!
format:
exactly <x> times
You're sure. You don't guess, you dictate. exactly 4 times. Not more, not less. The statement before has to match exactly x times.
Note: since exactly x times is a lot to write, shorthand exists. Instead of exactly 1 time, you can write once, and for 2, write twice.
Example query:
digit exactly 3 times, letter twice
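Because this query has no anchors, a longer string can still contain a match. The sketch below shows this with a hand-written pattern that is assumed to be equivalent:

```python
import re

# Assumed regex equivalent of: digit exactly 3 times, letter twice
pattern = re.compile(r"[0-9]{3}[a-z]{2}")

# Unanchored search: part of the string is enough.
partial = pattern.search("1234abc")
# Requiring the whole string to match is stricter.
whole = pattern.fullmatch("1234abc")

print(partial.group())  # 234ab
print(whole)            # None
```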
format:
between <x> and <y> times
For a specific number of repetitions between a span of <x> to <y>, you may use this quantifier. It will make sure the previous character exists between x and y times.
Note: since between x and y times is a lot to write, you can drop the times:
between 1 and 5
Example query:
starts with digit between 3 and 5 times, letter twice
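A short sketch of the span quantifier, again with a hand-written pattern assumed to be equivalent to the query:

```python
import re

# Assumed regex equivalent of:
#   starts with digit between 3 and 5 times, letter twice
pattern = re.compile(r"^[0-9]{3,5}[a-z]{2}")

checks = {text: pattern.search(text) is not None
          for text in ("123ab", "12345ab", "12ab", "123456ab")}
print(checks)
# {'123ab': True, '12345ab': True, '12ab': False, '123456ab': False}
```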
optional
You can't always be sure that something exists. Sometimes it's okay if something is missing. In that case, the optional quantifier comes in handy. It will match the sub-query if it's there, and ignore it if it's missing.
Example query:
digit optional, letter twice
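In regex terms, optional behaves like the ? quantifier. A minimal sketch under that assumption:

```python
import re

# Assumed regex equivalent of: digit optional, letter twice
pattern = re.compile(r"[0-9]?[a-z]{2}")

with_digit = pattern.fullmatch("1ab") is not None
without_digit = pattern.fullmatch("ab") is not None
two_digits = pattern.fullmatch("12ab") is not None

print(with_digit)     # True
print(without_digit)  # True
print(two_digits)     # False
```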
format:
once or more
never or more
If something has to exist at least once, or may be missing entirely, but can also repeat any number of times, the quantifiers once or more and never or more will do the job.
Example query:
starts with letter once or more, must end
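The difference between the two quantifiers only shows on empty input. A sketch, assuming they correspond to + and * respectively:

```python
import re

# Assumed regex equivalents of:
#   starts with letter once or more, must end
#   starts with letter never or more, must end
once_or_more = re.compile(r"^[a-z]+$")
never_or_more = re.compile(r"^[a-z]*$")

results = {
    "abc": (once_or_more.search("abc") is not None,
            never_or_more.search("abc") is not None),
    "": (once_or_more.search("") is not None,
         never_or_more.search("") is not None),
}
print(results)  # {'abc': (True, True), '': (False, True)}
```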
format:
at least <x> times
Something may repeat without limit, but must exist at least x times.
Example query:
letter at least 10 times
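An open-ended lower bound can be sketched like this, assuming the query corresponds to the {10,} quantifier:

```python
import re

# Assumed regex equivalent of: letter at least 10 times
pattern = re.compile(r"[a-z]{10,}")

nine = pattern.fullmatch("abcdefghi") is not None
ten = pattern.fullmatch("abcdefghij") is not None
twenty = pattern.fullmatch("abcdefghij" * 2) is not None

print(nine, ten, twenty)  # False True True
```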
===========================================================
Groups are a powerful tool of regular expressions. With them, you can capture matches, join and/or summarize them.
To make things easier for you, think of groups as sub-queries. Everything inside a group could be a standalone expression, which will later be combined with the rest.
Each group allows you to form a sub-query either using parentheses, or just a literal string by using quotes instead.
format:
capture (condition) [as "name"]
To go beyond simply validating input, a capture group comes in handy. You can capture any condition and have the engine return it. This helps you to filter inputs and only get the parts you care about.
If you're trying to get more than one match, capture names are useful, too. This is completely optional, but you can supply a name for a capture group using the as "name" syntax.
Example query:
capture (anything once or more) as "first", literally " - ", capture "second part" as "second"
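Named captures have a direct counterpart in Python's re module. The pattern below is a hand-written sketch of what this query is assumed to express:

```python
import re

# Assumed regex equivalent of:
#   capture (anything once or more) as "first", literally " - ",
#   capture "second part" as "second"
pattern = re.compile(r"(?P<first>.+) - (?P<second>second part)")

match = pattern.search("first part - second part")
print(match.group("first"))   # first part
print(match.group("second"))  # second part
```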
format:
any of (<condition>)
If you're not exactly sure which part of the condition will match, use any of. Every statement you supply in that sub-query could be a match.
As you can see, you can feel free to nest multiple groups and even parentheses.
In the example for CAPTURE ...
repeated here:
capture (anything once or more) as "first", literally " - ", capture "second part" as "second"
Note: either of is a synonym of any of.
Example query:
capture (any of (literally "sample", (digit once or more)))
If you removed the parentheses around digit once or more, the expression would be invalid, since you can't match either a digit or "once or more".
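The nesting can be mirrored with regex alternation. A sketch, assuming any of corresponds to a non-capturing group of alternatives inside the capture:

```python
import re

# Assumed regex equivalent of:
#   capture (any of (literally "sample", (digit once or more)))
pattern = re.compile(r"((?:sample|[0-9]+))")

# With a single capture group, findall returns the captured text.
found = pattern.findall("sample 42 other")
print(found)  # ['sample', '42']
```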
format:
until (<condition>)
Sometimes you want to match or capture a specific expression until some other condition is met. This can be achieved using the until group.
In the example below, we'll provide a string as a condition. However, this would work as well using a more complex expression, just like above.
Example query:
begin with capture (anything once or more) until "m"
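One way to picture until is as a lazy quantifier that stops at the first occurrence of the condition. The sketch below assumes that reading and contrasts it with a greedy match:

```python
import re

# Assumed regex equivalent of:
#   begin with capture (anything once or more) until "m"
until_m = re.compile(r"^(.+?)m")
# Greedy variant, shown only for comparison.
greedy = re.compile(r"^(.+)m")

text = "hammer time"
lazy_capture = until_m.search(text).group(1)
greedy_capture = greedy.search(text).group(1)

print(lazy_capture)    # ha
print(greedy_capture)  # hammer ti
```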