The preceding section covered the syntax of regular expressions. It used the simplest possible interface to the matcher: sending the #matchesRegex: message to the sample string, with the regular expression string as the argument. This section explains hairier ways of using the matcher.
A CharacterArray (an EsString in VA) also understands these messages:
The last two messages are case-insensitive versions of matching.
#prefixMatchesRegex: is just like #matchesRegex:, except that the whole receiver is not expected to match the regular expression passed as the argument; matching just a prefix of it is enough. For example:
|'abcde' matchesRegex: '(a|b)+'|-- false|
|'abcde' prefixMatchesRegex: '(a|b)+'|-- true|
An application can be interested in all matches of a certain regular expression within a String. The matches are accessible using a protocol modelled after the familiar Collection-like enumeration protocol:
|#regex: regexString matchesDo: aBlock|
Evaluates a one-argument <aBlock> for every match of the regular expression within the receiver string.
|#regex: regexString matchesCollect: aBlock|
Evaluates a one-argument <aBlock> for every match of the regular expression within the receiver string. Collects the results of the evaluations and answers them as a SequenceableCollection.
Returns a collection of all matches (substrings of the receiver string) of the regular expression. It is an equivalent of <aString regex: regexString matchesCollect: [:each | each]>.
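The enumeration messages described above can be sketched as follows. This is a minimal illustration that uses only the selectors named in this section; the variable names (count, lengths) are hypothetical, and the exact class of the answered collection may differ between dialects.

```smalltalk
| count lengths |

"Count the matches of '(a|b)+' in the receiver.
 The block is evaluated once per match; here the matches are 'ab' and 'ab'."
count := 0.
'ab cd ab' regex: '(a|b)+' matchesDo: [:each | count := count + 1].

"Collect the size of every match into a SequenceableCollection."
lengths := 'ab cd ab' regex: '(a|b)+' matchesCollect: [:each | each size].
```

After evaluation, count should be 2 and lengths should hold two elements, both 2, because 'cd' contains no character accepted by the expression.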
It is possible to replace all matches of a regular expression with a certain string using the message:
|#copyWithRegex: regexString matchesReplacedWith: aString|
|'ab cd ab' copyWithRegex: '(a|b)+' matchesReplacedWith: 'foo'|
A more general substitution is match translation:
|#copyWithRegex: regexString matchesTranslatedUsing: aBlock|
This message evaluates a block passing it each match of the regular expression in the receiver string and answers a copy of the receiver with the block results spliced into it in place of the respective matches. For example:
|'ab cd ab' copyWithRegex: '(a|b)+' matchesTranslatedUsing: [:each| each asUppercase]|
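To make the difference between plain replacement and translation concrete, here is a small sketch. It assumes only the two messages described above; the variable names are illustrative.

```smalltalk
| replaced wrapped |

"Every match is replaced with the same fixed string."
replaced := 'ab cd ab' copyWithRegex: '(a|b)+' matchesReplacedWith: 'foo'.

"Every match is passed to the block, and the block's result is spliced in,
 so the replacement can depend on the matched text."
wrapped := 'ab cd ab'
	copyWithRegex: '(a|b)+'
	matchesTranslatedUsing: [:each | '<', each, '>'].
```

Under these assumptions, replaced should be 'foo cd foo' and wrapped should be '<ab> cd <ab>'.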
All messages of enumeration and replacement protocols perform a case-sensitive match. Case-insensitive versions are not provided as part of a CharacterArray protocol. Instead, they are accessible using the lower-level matching interface.
Internally, #matchesRegex: works as follows:
If you repeatedly match a number of strings against the same regular expression using one of the messages defined in String, the regular expression string is parsed and a matcher is created anew for every match. You can avoid this overhead by building a matcher for the regular expression, and then reusing the matcher over and over again. You can, for example, create a matcher at a class or instance initialization stage, and store it in a variable for future use.
You can create a matcher using one of the following methods:
A more convenient way is to use one of the two matcher-creating messages understood by CharacterArray.
Here are four examples of creating a matcher:
|hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+'|
|hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+' ignoreCase: false|
|hexRecognizer := '16r[0-9A-Fa-f]+' asRegex|
|hexRecognizer := '16r[0-9A-F]+' asRegexIgnoringCase|
The matcher understands these messages (all of them return true to indicate successful match or search, and false otherwise):
|True if the whole target string (aString) matches.|
|True if some prefix of the string (not necessarily the whole string) matches.|
|Search the string for the first occurrence of a matching substring. (Note that the first two methods only try matching from the very beginning of the string). Using the above example with a matcher for `a+', this method would answer success given a string `baaa', while the previous two would fail.|
|Respective analogs of the first three methods, taking input from a stream instead of a string. The stream must be positionable and peekable.|
All these methods answer a boolean indicating success. The matcher also stores the outcome of the last match attempt and can report it:
|Answers a Boolean -- the outcome of the most recent match attempt. If no matches were attempted, the answer is unspecified.|
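The advice about building a matcher once and reusing it might look like the sketch below. It relies on #asRegex and #matches: as they are used elsewhere in this section; the candidates array is made up for illustration.

```smalltalk
| hexRecognizer candidates results |

"Parse the expression and build the matcher a single time."
hexRecognizer := '16r[0-9A-Fa-f]+' asRegex.

"Reuse the same matcher for every string instead of sending
 #matchesRegex: repeatedly, which would rebuild it each time."
candidates := #('16rFF' '16r' 'FF' '16r1a2B').
results := candidates collect: [:each | hexRecognizer matches: each].
```

Here results should contain true, false, false, true: '16r' lacks the required digit, and 'FF' lacks the '16r' prefix.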
After a successful match attempt, you can query the specifics of which part of the original string has matched which part of the whole expression.
A subexpression is a parenthesized part of a regular expression, or the whole expression. When a regular expression is compiled, its subexpressions are assigned indices starting from 1, depth-first, left-to-right. For example, `((ab)+(c|d))?ef' includes the following subexpressions with these indices:
After a successful match, the matcher can report what part of the original string matched what subexpression. It understands these messages:
|Answers the total number of subexpressions: the highest value that can be used as a subexpression index with this matcher. This value is available immediately after initialization and never changes.|
|The index must be a valid subexpression index, and this message must be sent only after a successful match attempt. The method answers the substring of the original string that the corresponding subexpression matched.|
|Answer positions within the original string or stream where the match of a subexpression with the given index has started and ended, respectively.|
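As a small illustration of subexpression numbering, the sketch below matches the string 'abcef' against the expression `((ab)+(c|d))?ef' discussed above and queries each index with #subexpression:. The input string is an assumption chosen so that every subexpression takes part in the match exactly once.

```smalltalk
| matcher whole outer inner choice |
matcher := '((ab)+(c|d))?ef' asRegex.

(matcher matches: 'abcef') ifTrue: [
	"Index 1 is the whole expression; the parenthesized parts
	 follow depth-first, left-to-right."
	whole := matcher subexpression: 1.	"whole expression"
	outer := matcher subexpression: 2.	"((ab)+(c|d))"
	inner := matcher subexpression: 3.	"(ab)"
	choice := matcher subexpression: 4]	"(c|d)"
```

With this input the four values should be 'abcef', 'abc', 'ab', and 'c', respectively.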
This facility provides a convenient way of extracting parts of input strings of complex format. For example, the following piece of code uses the 'MMM DD, YYYY' date format recognizer example from the `Syntax' section to convert a date to a three-element array with year, month, and day strings (you can select and evaluate it right here):
| matcher |
matcher := RxMatcher forString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*19(:isDigit::isDigit:)'.
(matcher matches: 'Aug 6, 1996')
	ifTrue: [Array
		with: (matcher subexpression: 4)
		with: (matcher subexpression: 2)
		with: (matcher subexpression: 3)]
	ifFalse: ['no match']
(should answer ` #('96' 'Aug' '6')').
The enumeration and replacement protocols exposed in CharacterArray are actually implemented by the matcher. The following messages are understood:
|#matchesIn: aString do: aBlock|
|#matchesIn: aString collect: aBlock|
|#copy: uneChaine replacingMatchesWith: replacementString|
|#copy: uneChaine translatingMatchesUsing: aBlock|
|#matchesOnStream: aStream do: aBlock|
|#matchesOnStream: aStream collect: aBlock|
|#copy: streamSource to: targetStream replacingMatchesWith: replacementString|
|#copy: streamSource to: targetStream translatingMatchesWith: aBlock|
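Because the CharacterArray-level enumeration and replacement messages are always case-sensitive, a case-insensitive variant has to go through the matcher. The following is a minimal sketch of that idea, assuming the #asRegexIgnoringCase creation message and the matcher messages listed above; the sample string is invented for the example.

```smalltalk
| matcher replaced found |

"Build a case-insensitive matcher once."
matcher := '(a|b)+' asRegexIgnoringCase.

"Replace every match, regardless of case, with a fixed string."
replaced := matcher copy: 'AB cd ab' replacingMatchesWith: 'foo'.

"Collect the matches themselves, normalized to lowercase."
found := matcher matchesIn: 'AB cd ab' collect: [:each | each asLowercase].
```

Under these assumptions, replaced should be 'foo cd foo', and found should hold two elements, both 'ab'.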
[CU: Note that the following has been modified since I've changed the VW-style exception system to an ANSI-style]
If a syntax error is detected while parsing an expression, an RxSyntaxError is raised...
If an error is detected while building a matcher, an RxCompilationError is raised.
If an error is detected while matching (for example, if a bad selector was specified using `:<selector>:' syntax, or because of the matcher's internal error), an RxMatchError is raised.
RxError is the parent of all three. Since any of the three errors can be raised within a call to #matchesRegex:, RxError is handy if you want to catch them all. For example:
['abc' matchesRegex: '))garbage[']
	on: RxError
	do: [:ex | ex return: false]
Updated on 2002-03-10