The preceding section covered the syntax of regular expressions. It used the simplest possible interface to the matcher: sending #matchesRegex: message to the sample string, with regular expression string as the argument. This section explains hairier ways of using the matcher.
A CharacterArray (an EsString in VA) also understands these messages:
#prefixMatchesRegex: regexString | |
#matchesRegexIgnoringCase: regexString |
|
#prefixMatchesRegexIgnoringCase: regexString |
The last two messages are case-insensitive versions of matching.
#prefixMatchesRegex: is just like #matchesRegex, except that the whole receiver is not expected to match the regular expression passed as the argument; matching just a prefix of it is enough. For example:
'abcde' matchesRegex: '(a|b)+' | -- false | |
'abcde' prefixMatchesRegex: '(a|b)+' |
-- true |
An application can be interested in all matches of a certain regular expression within a String. The matches are accessible using a protocol modelled after the familiar Collection-like enumeration protocol:
#regex: regexString matchesDo: aBlock |
Evaluates a one-argument <aBlock> for every match of the regular expression within the receiver string.
#regex: regexString matchesCollect: aBlock |
Evaluates a one-argument <aBlock> for every match of the regular expression within the receiver string. Collects results of evaluations and anwers them as a SequenceableCollection.
#allRegexMatches: regexString |
Returns a collection of all matches (substrings of the receiver string) of the regular expression. It is an equivalent of <aString regex: regexString matchesCollect: [:each | each]>.
It is possible to replace all matches of a regular expression with a certain string using the message:
#copyWithRegex: regexString matchesReplacedWith: aString |
For example:
'ab cd ab' copyWithregex: '(a|b)+' matchesReplacesWith: 'foo' |
A more general substitution is match translation:
#copyWithRegex: regexString matchesTranslatedUsing: aBlock |
This message evaluates a block passing it each match of the regular expression in the receiver string and answers a copy of the receiver with the block results spliced into it in place of the respective matches. For example:
'ab cd ab' copyWithregex: '(a|b)+' matchesTranslatedUsing: [:each| each asUppercase] |
All messages of enumeration and replacement protocols perform a case-sensitive match. Case-insensitive versions are not provided as part of a CharacterArray protocol. Instead, they are accessible using the lower-level matching interface.
Internally, #matchesRegex: works as follows:
If you repeatedly match a number of strings against the same regular expression using one of the messages defined in String, the regular expression string is parsed and a matcher is created anew for every match. You can avoid this overhead by building a matcher for the regular expression, and then reusing the matcher over and over again. You can, for example, create a matcher at a class or instance initialization stage, and store it in a variable for future use.
You can create a matcher using one of the following methods:
A more convenient way is using one of the two matcher-created messages understood by CharacterArray.
Here are four examples of creating a matcher:
hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+' | |
hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+' ignoreCase: false | |
hexRecognizer := '16r[0-9A-Fa-f]+' asRegex | |
hexRecognizer := '16r[0-9A-F]+' asRegexIngnoringCase |
The matcher understands these messages (all of them return true to indicate successful match or search, and false otherwise):
matches: aString |
True if the whole target string (aString) matches. |
matchesPrefix: aString |
True if some prefix of the string (not necessarily the whole string) matches. |
search: aString |
Search the string for the first occurrence of a matching substring. (Note that the first two methods only try matching from the very beginning of the string). Using the above example with a matcher for `a+', this method would answer success given a string `baaa', while the previous two would fail. |
matchesStream: aStream | |
matchesStreamPrefix: aStream | |
searchStream: aStream |
Respective analogs of the first three methods, taking input from a stream instead of a string. The stream must be positionable and peekable. |
All these methods answer a boolean indicating success. The matcher also stores the outcome of the last match attempt and can report it:
lastResult |
Answers a Boolean -- the outcome of the most recent match attempt. If no matches were attempted, the answer is unspecified. |
After a successful match attempt, you can query the specifics of which part of the original string has matched which part of the whole expression.
A subexpression is a parenthesized part of a regular expression, or the whole expression. When a regular expression is compiled, its subexpressions are assigned indices starting from 1, depth-first, left-to-right. For example, `((ab)+(c|d))?ef' includes the following subexpressions with these indices:
1 | ((ab)+(c|d))?ef | |||
2 |
(ab)+(c|d) |
|||
3 | ab | |||
4 | c|d |
After a successful match, the matcher can report what part of the original string matched what subexpression. It understandards these messages:
subexpressionCount |
Answers the total number of subexpressions: the highest value that can be used as a subexpression index with this matcher. This value is available immediately after initialization and never changes. |
subexpressionCount: unIndex |
An index must be a valid subexpression index, and this message must be sent only after a successful match attempt. The method answers a substring of the original string the corresponding subexpression has matched to. |
subBeginning: unIndex | |
subEnd: unIndex |
Answer positions within the original string or stream where the match of a subexpression with the given index has started and ended, respectively. |
This facility provides a convenient way of extracting parts of input strings of complex format. For example, the following piece of code uses the 'MMM DD, YYYY' date format recognizer example from the `Syntax' section to convert a date to a three-element array with year, month, and day strings (you can select and evaluate it right here):
|matcher|
matcher := Rxmatcher forString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*19(:isDigit::isDigit:)'.
(matcher matches: 'Aug 6, 1996')
ifTrue:
[Array
with: (matcher subexpression: 4)
with: (matcher subexpression: 2)
with: (matcher subexpression: 3)]
ifTrue: ['no match']
(should answer ` #('96' 'Aug' '6')').
The enumeration and replacement protocols exposed in CharacterArray are actually implemented by the matcher. The following messages are understood:
#matchesIn: aString | |
#matchesIn: aString do: aBlock | |
#matchesIn: aString collect: aBlock | |
#copy: uneChaine replacingMatchesWith: replacementString | |
#copy: uneChaine translatingMatchesUsing: aBlock | |
#matchesOnStream: aStream | |
#matchesOnStream: aStream do: aBlock | |
#matchesOnStream: aStream collect: aBlock | |
#copy: streamSource to: targetStream replacingMatchesWith: replacementString | |
#copy: streamSource to: targetStream translatingMatchesWith: aBlock |
[CU: Note that the following has been modified since I've changed the VW-style exception system to an ANSI-style]
If a syntax error is detected while parsing expression, an RxSyntaxError is raised...
If an error is detected while building a matcher, an RxCompilationError is raised.
If an error is detected while matching (for example, if a bad selector was specified using `:<selector>:' syntax, or because of the matcher's internal error), an RxMatchError is raised.
RxError is the parent of all three. Since any of three signals can be raised within a call to #matchesRegex:, it is handy if you want to catch them all. For example:
'abc' matchesRegex: '))garbage['
on: RxError
do: [:ex | ex return: false]Updated on 2002-03-10