|
New 16-bit string processing classes TDIPerlRegEx16 and TDIDfaRegEx16. Both work on UnicodeStrings and WideStrings natively with no prior conversion. Full UTF-16 Unicode processing is optional.
Fixed a bug in fixed-length calculation for lookbehinds that would show up only in quite long subpatterns.
For a non-anchored pattern, if (*SKIP) was given with a name that did not match a (*MARK), and the match failed at the start of the subject, a reference to memory before the start of the subject could occur.
A reference to an unset group with zero minimum repetition was giving totally wrong answers (in non-JavaScript-compatibility mode). For example, (another)?(\1?)test matched against "hello world test".
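The expected outcome of the example above can be checked with any engine that follows the Perl rule for unset groups. The sketch below uses Python's standard re module rather than DIRegEx, on the assumption that it applies the same rule here: a back reference to an unset group fails, but a zero minimum repetition lets matching continue.

```python
import re

# Perl-style rule: \1 fails while group 1 is unset, but the "?" allows
# zero repetitions, so the overall match can still succeed on "test".
pattern = re.compile(r"(another)?(\1?)test")
match = pattern.search("hello world test")

matched_text = match.group(0)  # text matched by the whole pattern
group_one = match.group(1)     # None: group 1 never participated
group_two = match.group(2)     # empty string: \1? matched zero times
print(matched_text, group_one, repr(group_two))
```

Only the trailing "test" is matched; neither "hello" nor "world" becomes part of the result.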
Ovector size of 2 is also supported by JIT based pcre_exec and pcre16_exec (the ovector size rounding is not applied in this particular case).
Remove deprecated pcre_info. Use pcre_fullinfo instead.
New Just-In-Time Compiler (JIT) optimization, which can greatly speed up pattern matching. Available as auto-option poAutoJit or by passing soJIT to TDIRegEx.Study.
A possessively repeated conditional subpattern such as (?(?=c)c|d)++ was being incorrectly compiled and would have given unpredictable results.
A possessively repeated subpattern with minimum repeat count greater than one behaved incorrectly. For example, (A){2,}+ behaved as if it was (A)(A)++ which meant that, after a subsequent mismatch, backtracking into the first (A) could occur when it should not.
In non-UTF-8 mode, \C is now supported in lookbehinds and DFA matching.
Perl does not support \N without a following name in a [] class; DIRegEx now also gives an error.
Removed the fixed limit of repeated forward references. Additional workspace is now dynamically allocated and limited to about 200000 repeats for safety. At the same time, the filling in of repeated forward references has been sped up.
A repeated forward reference in a pattern such as (a)(?2){2}(.) was incorrectly expecting the subject to contain another "a" after the start.
When (*SKIP:name) is activated without a corresponding (*MARK:name) earlier in the match, the SKIP should be ignored. This was not happening; instead the SKIP was being treated as NOMATCH. For patterns such as A(*MARK:A)A+(*SKIP:B)Z|AAC this meant that the AAC branch was never tested.
The behaviour of (*MARK), (*PRUNE), and (*THEN) has been reworked and is now much more compatible with Perl, in particular in cases where the result is a non-match for a non-anchored pattern. For example, if b(*:m)f|a(*:n)w is matched against "abc", the non-match returns the name "m", where previously it did not return a name. A side effect of this change is that for partial matches, the last encountered mark name is returned, as for non-matches. The refactoring has had the pleasing side effect of reducing stack requirements.
Added support for retrieving the executable code size of the JIT compiler, and fixed some warnings.
A caseless match of a UTF-8 character whose other case uses fewer bytes did not work when the shorter character appeared right at the end of the subject string.
Computation of memory usage for the table of capturing group names was giving an unnecessarily large value.
Fixed that the following items were rejected as fixed length: (*ACCEPT), (*COMMIT), (*FAIL), (*MARK), (*PRUNE), (*SKIP), (*THEN), \h, \H, \v, \V, and single character negative classes with fixed repetitions, e.g. [^a]{3}, with and without coCaseLess.
Support \x, \U, and \u in JavaScript compatibility mode, based on the ECMA-262 standard.
Lookbehinds such as (?<=a{2}b) that contained a fixed repetition were erroneously being rejected as "not fixed length" if coCaseLess was set.
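As an illustration of the fixed-length rule, the sketch below uses Python's re module, which also requires fixed-width lookbehinds. It is not DIRegEx code, and re.IGNORECASE only stands in for coCaseLess. A fixed repetition such as a{2} keeps the lookbehind length constant, whereas an unbounded repeat does not.

```python
import re

# a{2}b always consumes exactly three characters, so the lookbehind has a
# fixed length and must be accepted, with or without caseless matching.
fixed = re.search(r"(?<=a{2}b)c", "AABc", re.IGNORECASE)
fixed_start = fixed.start()

# (a)+ can consume any number of characters, so it has no fixed length
# and the pattern is rejected at compile time.
try:
    re.compile(r"(?<=(a)+)b")
    variable_rejected = False
except re.error:
    variable_rejected = True

print(fixed_start, variable_rejected)
```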
Support Delphi XE2 Win64. Caution: DFA matching may cause access violations in 64-bit. Unfortunately, there is no way to locate their cause because *.obj file debugging is not yet available for Delphi XE2 64-bit (confirmed by Embarcadero in https://forums.embarcadero.com/thread.jspa?threadID=62631). Perl matching tests, however, pass without errors.
Changed some type names in DIRegEx_Api.pas so that they more closely resemble the PCRE original. The TDIRegEx classes (TDIPerlRegEx, TDIDFARegEx) are not affected, but applications using the low level PCRE API might need small adjustments.
(*MARK) settings inside atomic groups that do not contain any capturing parentheses, for example, (?>a(*:m)), were not being passed out. This bug was introduced in DIRegEx 6.0.0.
Support Delphi XE2 Win32.
If a pattern such as (a)b|ac is matched against "ac", there is no captured substring, but while checking the failing first alternative, substring 1 is temporarily captured. If the output vector supplied to pcre_exec was not big enough for this capture, the yield of the function was still zero ("insufficient space for captured substrings"). This cannot be totally fixed without adding another stack variable, which seems a lot of expense for an edge case. However, the situation is now improved in cases such as (a)(b)x|abc matched against "abc", where the return code indicates that fewer than the maximum number of slots in the ovector have been set.
Related to above: when there are more back references in a pattern than slots in the output vector, pcre_exec uses temporary memory during matching, and copies in the captures as far as possible afterwards. It was using the entire output vector, but this conflicts with the specification that only 2/3 is used for passing back captured substrings. Now it uses only the first 2/3, for compatibility. This is, of course, another edge case.
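The two-thirds rule can be expressed as simple arithmetic. The helper below is a hypothetical illustration, not part of DIRegEx or PCRE, and it ignores special cases such as very small vectors. It assumes only what the entry states: two thirds of the output vector carries captured substrings, stored as pairs of offsets.

```python
def capture_pairs(ovector_size: int) -> int:
    """Number of (start, end) offset pairs that can be passed back.

    Hypothetical helper: two thirds of the vector hold offset pairs, so
    each pair effectively costs three slots of the vector.
    """
    return ovector_size // 3


pairs_for_30 = capture_pairs(30)  # whole match plus nine capture groups
pairs_for_9 = capture_pairs(9)    # whole match plus two capture groups
print(pairs_for_30, pairs_for_9)
```

Under this assumption, a pattern with more capturing groups than the vector has pairs still matches, but only the lowest-numbered captures are passed back.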
When the number of matches in a pcre_dfa_exec run exactly filled the ovector, the return from the function was zero, implying that there were other matches that did not fit. The correct "exactly full" value is now returned.
If a subpattern that was called recursively or as a subroutine contained (*PRUNE) or any other control that caused it to give a non-standard return, invalid errors such as PCRE_ERROR_RECURSELOOP or even infinite loops could occur.
If a pattern such as a(*SKIP)c|b(*ACCEPT)| was studied, it stopped computing the minimum length on reaching *ACCEPT, and so ended up with the wrong value of 1 rather than 0. Further investigation indicates that computing a minimum subject length in the presence of *ACCEPT is difficult (think back references, subroutine calls), and so the code was changed so that no minimum is registered for a pattern that contains *ACCEPT.
If (*THEN) was present in the first (true) branch of a conditional group, it was not handled as intended.
A pathological pattern such as (*ACCEPT)a was miscompiled, thinking that the first byte in a match must be "a".
If (*THEN) appeared in a group that was called recursively or as a subroutine, it did not work as intended.
Consider the pattern A (B(*THEN)C) | D where A, B, C, and D are complex pattern fragments (but not containing any | characters). If A and B are matched, but there is a failure in C so that it backtracks to (*THEN), PCRE was behaving differently to Perl. PCRE backtracked into A, but Perl goes to D. In other words, Perl considers parentheses that do not contain any | characters to be part of a surrounding alternative, whereas PCRE was treating (B(*THEN)C) the same as (B(*THEN)C|(*FAIL)) – which Perl handles differently. PCRE now behaves in the same way as Perl, except in the case of subroutine/recursion calls such as (?1) which have in any case always been different (but PCRE had them first).
Perl does not treat the | in a conditional group as creating alternatives. Such a group is treated in the same way as an ordinary group without any | characters when processing (*THEN). PCRE has been changed to match Perl's behaviour.
A change in DIRegEx 5.3.3 caused atomic groups to use more stack. This is inevitable for groups that contain captures, but it can lead to a lot of stack use in large patterns. The old behaviour has been restored for atomic groups that do not contain any capturing parentheses.
Fix an off-by-one error in TDIRegEx.SubStrMatched.
Mark pcre_info as deprecated. Use pcre_fullinfo instead.
The Unicode data tables have been updated to Unicode 6.0.0.
There were a number of related bugs in the code for matching back references caselessly in UTF-8 mode when codes for the characters concerned were different numbers of bytes. For example, U+023A and U+2C65 are an upper and lower case pair, using 2 and 3 bytes, respectively. The main bugs were: (a) A reference to 3 copies of a 2-byte code matched only 2 of a 3-byte code. (b) A reference to 2 copies of a 3-byte code would not match 2 of a 2-byte code at the end of the subject (it thought there wasn't enough data left).
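The byte lengths quoted for this case pair can be verified directly. The sketch below uses Python rather than DIRegEx, and its final line assumes that Python's re module compares back references case-insensitively per character under re.IGNORECASE.

```python
import re

upper = "\u023A"  # LATIN CAPITAL LETTER A WITH STROKE
lower = "\u2C65"  # LATIN SMALL LETTER A WITH STROKE

# The two cases of the same letter need different numbers of UTF-8 bytes.
upper_len = len(upper.encode("utf-8"))
lower_len = len(lower.encode("utf-8"))
is_case_pair = upper.lower() == lower

# A caseless back reference must accept the other case even though its
# encoded length differs from that of the captured text.
caseless = re.fullmatch(r"(\u023A)\1", upper + lower, re.IGNORECASE)
print(upper_len, lower_len, is_case_pair, caseless is not None)
```

An engine that counts bytes instead of characters when comparing a back reference gets exactly this kind of pair wrong.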
Comprehensive information about what went wrong is now returned by pcre_exec and pcre_dfa_exec when the UTF-8 string check fails, as long as the output vector has at least 2 elements. The offset of the start of the failing character and a reason code are placed in the vector.
When the UTF-8 string check fails for pcre_compile, the offset that is now returned is for the first byte of the failing character, instead of the last byte inspected. This is an incompatible change, but it should be small enough not to be a problem. It makes the returned offset consistent with pcre_exec and pcre_dfa_exec.
When \R was used with a maximizing quantifier it failed to skip backwards over a #13#10 pair if the subsequent match failed. Instead, it just skipped back over a single character (#10). This seems wrong (because it treated the two characters as a single entity when going forwards), conflicts with the documentation that \R is equivalent to (?>\r\n|\n|…etc), and makes the behaviour of \R* different to (\R)*, which also seems wrong. The behaviour has been changed.
Extensive internal refactoring has drastically reduced the number of recursive calls and the amount of stack used for possessively repeated groups such as (abc)++ when using pcre_exec.
Fix a number of bugs in the handling of groups:
(?<=(a)+) was not diagnosed as invalid (non-fixed-length lookbehind).
(a|)*(?1) gave a compile-time internal error.
((a|)+)+ did not notice that the outer group could match an empty string.
(^a|^)+ was not marked as anchored.
(.*a|.*)+ was not marked as matching at start or after a newline.
When (*ACCEPT) was used in a subpattern that was called recursively, the restoration of the capturing data to the outer values was not happening correctly.
If a recursively called subpattern ended with (*ACCEPT) and matched an empty string, and PCRE_NOTEMPTY was set, pcre_exec thought the whole pattern had matched an empty string, and so incorrectly returned a no match.
There was optimizing code for the last branch of non-capturing parentheses, and also for the obeyed branch of a conditional subexpression, which used tail recursion to cut down on stack usage. Unfortunately, now that there is the possibility of (*THEN) occurring in these branches, tail recursion is no longer possible because the return has to be checked for (*THEN). These two optimizations have therefore been removed.
If a pattern containing \R was studied, it was assumed that \R always matched two bytes, thus causing the minimum subject length to be incorrectly computed because \R can also match just one byte.
If a pattern containing (*ACCEPT) was studied, the minimum subject length was incorrectly computed.
When (*ACCEPT) was used in an assertion that matched an empty string and PCRE_NOTEMPTY was set, PCRE applied the non-empty test to the assertion.
When an atomic group that contained a capturing parenthesis was successfully matched, but the branch in which it appeared failed, the capturing was not being forgotten if a higher numbered group was later captured. For example, (?>(a))b|(a)c when matching "ac" set capturing group 1 to "a", when in fact it should be unset. This applied to multi-branched capturing and non-capturing groups, repeated or not, and also to positive assertions (capturing in negative assertions does not happen in PCRE) and also to nested atomic groups.
The way atomic groups are processed by pcre_exec has been changed so that if they are repeated, backtracking one repetition now resets captured values correctly. For example, if ((?>(a+)b)+aabab) is matched against "aaaabaaabaabab" the value of captured group 2 is now correctly recorded as "aaa". Previously, it would have been "a". As part of this code refactoring, the way recursive calls are handled has also been changed.
If an assertion condition captured any substrings, they were not passed back unless some other capturing happened later. For example, if (?(?=(a))a) was matched against "a", no capturing was returned.
When studying a pattern that contained subroutine calls or assertions, the code for finding the minimum length of a possible match was handling direct recursions such as (xxx(?1)|yyy) but not mutual recursions (where group 1 called group 2 while simultaneously a separate group 2 called group 1). A stack overflow occurred in this case. This is now fixed by limiting the recursion depth to 10.
An instance of \X with an unlimited repeat could fail if at any point the first character it looked at was a mark character.
Some minor code refactoring concerning Unicode properties and scripts should reduce the stack requirement slightly.
If \k was not followed by a braced, angle-bracketed, or quoted name, PCRE compiled something random. Now it gives a compile-time error (as does Perl).
A *MARK encountered during the processing of a positive assertion is now recorded and passed back (compatible with Perl).
Previously, PCRE did not allow quantification of assertions. However, Perl does, and because of capturing effects, quantifying parenthesized assertions may at times be useful. Quantifiers are now allowed for parenthesized assertions.
\g was being checked for fancy things in a character class, when it should just be a literal "g".
PCRE was rejecting [:a[:digit:]] whereas Perl was not. It seems that the appearance of a nested POSIX class supersedes an apparent external class. For example, [:a[:digit:]b:] matches "a", "b", ":", or a digit. Also, unescaped square brackets may also appear as part of class names. For example, [:a[:abc]b:] gives unknown class "[:abc]b:]". PCRE now behaves more like Perl.
PCRE was giving an error for \N with a braced quantifier such as {1,} (this was because it thought it was \N{name}, which is not supported).
PCRE tries to detect cases of infinite recursion at compile time, but it cannot analyze patterns in sufficient detail to catch mutual recursions such as ((?1))((?2)). There is now a runtime test that gives an error if a subgroup is called recursively as a subpattern for a second time at the same position in the subject string. In previous releases this might have been caught by the recursion limit, or it might have run out of stack.
A pattern such as (?(R)a+|(?R)b) is quite safe, as the recursion can happen only once. PCRE was, however, incorrectly giving a compile time error "recursive call could loop indefinitely" because it cannot analyze the pattern in sufficient detail. The compile time test no longer happens when PCRE is compiling a conditional subpattern, but actual runaway loops are now caught at runtime.
It seems that Perl allows any characters other than a closing parenthesis to be part of the NAME in (*MARK:NAME) and other backtracking verbs. PCRE has been changed to be the same.
Add a pointer to the latest mark to the callout data block.
The pattern .(*F), when applied to "abc" with PCRE_PARTIAL_HARD, gave a partial match of an empty string instead of no match. This was specific to the use of ".".
The pattern f.*, if compiled with PCRE_UTF8 and PCRE_DOTALL and applied to "for" with PCRE_PARTIAL_HARD, gave a complete match instead of a partial match. This bug was dependent on both the PCRE_UTF8 and PCRE_DOTALL options being set.
For a pattern such as \babc|\bdef pcre_study was failing to set up the starting byte set, because \b was not being ignored.
(*THEN) was not working properly if there were untried alternatives prior to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it backtracked to try for "b" instead of moving to the next alternative branch at the same level (in this case, to look for "c"). The Perl documentation is clear that when (*THEN) is backtracked onto, it goes to the "next alternative in the innermost enclosing group".
(*COMMIT) was not overriding (*THEN), as it does in Perl. In a pattern such as (A(*COMMIT)B(*THEN)C|D) any failure after matching A should result in overall failure. Similarly, (*COMMIT) now overrides (*PRUNE) and (*SKIP), (*SKIP) overrides (*PRUNE) and (*THEN), and (*PRUNE) overrides (*THEN).
If \s appeared in a character class, it removed the VT character from the class, even if it had been included by some previous item, for example in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part of \s, but is part of the POSIX "space" class.)
A partial match never returns an empty string (because you can always match an empty string at the end of the subject); however the checking for an empty string was starting at the "start of match" point. This has been changed to the "earliest inspected character" point, because the returned data for a partial match starts at this character. This means that, for example, /(?<=abc)def/ gives a partial match for the subject "abc" (previously it gave "no match").
Changes have been made to the way PCRE_PARTIAL_HARD affects the matching of $, \z, \Z, \b, and \B. If the match point is at the end of the string, previously a full match would be given. However, setting PCRE_PARTIAL_HARD has an implication that the given string is incomplete (because a partial match is preferred over a full match). For this reason, these items now give a partial match in this situation. [Aside: previously, the one case /t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial match rather than a full match, which was wrong by the old rules, but is now correct.]
There was a bug in the handling of #-introduced comments, recognized when PCRE_EXTENDED is set, when PCRE_NEWLINE_ANY and PCRE_UTF8 were also set. If a UTF-8 multi-byte character included the byte 0x85 (e.g. U+0445, whose UTF-8 encoding is 0xd1,0x85), this was misinterpreted as a newline when scanning for the end of the comment. (*Character* 0x85 is an "any" newline, but *byte* 0x85 is not, in UTF-8 mode). This bug was present in several places in pcre_compile.
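The difference between the character 0x85 and the byte 0x85 can be made concrete with a short Python sketch. It is independent of DIRegEx and only inspects the UTF-8 encodings involved.

```python
cyrillic_ha = "\u0445"  # CYRILLIC SMALL LETTER HA
next_line = "\u0085"    # NEL, the character that counts as an "any" newline

ha_bytes = cyrillic_ha.encode("utf-8")
nel_bytes = next_line.encode("utf-8")

# A scanner that works byte by byte sees 0x85 inside the Cyrillic letter,
# while a scanner that works character by character sees no newline there.
byte_scan_sees_0x85 = 0x85 in ha_bytes
char_scan_sees_nel = next_line in cyrillic_ha
print(ha_bytes, nel_bytes, byte_scan_sees_0x85, char_scan_sees_nel)
```

In UTF-8 mode the newline character itself is the two-byte sequence 0xc2,0x85, so a lone 0x85 byte is always a continuation byte of some other character.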
When pcre_compile was skipping #-introduced comments when looking ahead for named forward references to subpatterns, the only newline sequence it recognized was NL. It now handles newlines according to the set newline convention.
Neither pcre_exec nor pcre_dfa_exec was checking that the value given as a starting offset was within the subject string. There is now a new error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is negative or greater than the length of the string. In order to test this, pcretest is extended to allow the setting of negative starting offsets.
Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a bad UTF-8 sequence and one that is incomplete.
If \c was followed by a multibyte UTF-8 character, bad things happened. A compile-time error is now given if \c is not followed by an ASCII character, that is, a byte less than 128.
Recognize (*NO_START_OPT) at the start of a pattern to set the PCRE_NO_START_OPTIMIZE option, which is now allowed at compile time – but just passed through to pcre_exec or pcre_dfa_exec. This makes it available to pcregrep and other applications that have no direct access to PCRE options. The new /Y option in pcretest sets this option when calling pcre_compile.
Groups containing recursive back references were forced to be atomic, but in the case of named groups, the amount of memory required was incorrectly computed, leading to "Failed: internal error: code overflow". This has been fixed.
Added support for (*MARK:ARG) and for ARG additions to PRUNE, SKIP, and THEN.
(*ACCEPT) was not working when inside an atomic group.
Inside a character class, \R and \X were always treated as literals, whereas Perl faults them if its -w option is set. Changed so that they fault when coExtra is set.
Added support for \N which always matches any character other than newline. (It is the same as "." when coDotAll is not set.)
Added four artificial Unicode properties to help with an option to make \s etc use properties. The new properties are: Xan (alphanumeric), Xsp (Perl space), Xps (POSIX space), and Xwd (word).
Added coUCP to make \b, \d, \s, \w, and certain POSIX character classes use Unicode properties. (*UCP) at the start of a pattern can be used to set this option.
In coUtf8 mode, if a pattern that was compiled with coCaseLess was studied, and the match started with a letter with a code point greater than 127 whose first byte was different to the first byte of the other case of the letter, the other case of this starting letter was not recognized.
TDIRegEx.Study now recognizes \h, \v, and \R when constructing a bit map of possible starting bytes for non-anchored patterns.
Extended the "auto-possessify" recognition during pattern compilation. Now \R and a number of cases that involve Unicode properties are recognized, both explicit and implicit when coUCP is set.
Fix a Study problem in UTF-8 mode if a pattern starts with certain non-ASCII characters.
A pattern such as (?&t)(?#()(?(DEFINE)(?<t>a)) which has a forward reference to a subpattern on the other side of a comment that contains an opening parenthesis caused either an internal compiling error, or a reference to the wrong subpattern.
The Unicode data tables have been updated to Unicode 5.2.0.
A pattern such as (?&t)*+(?(DEFINE)(?<t>.)) which has a possessive quantifier applied to a forward-referencing subroutine call, could compile incorrect code or give the error "internal error: previously-checked referenced subpattern not found".
Fixed possible memory access outside allocated memory.
Hold memory texts as one long string to avoid too much relocation at load time.
Fix for \K giving a compile-time error if it appeared in a lookbehind assertion.
\K was not working if it appeared in an atomic group or in a group that was called as a "subroutine", or in an assertion. Perl 5.11 documents that \K is "not well defined" if used in an assertion. DIRegEx now accepts it if the assertion is positive, but not if it is negative.
A pattern such as (?P<L1>(?P<L2>0)|(?P>L2)(?P>L1)) in which the only other item in a branch that calls a recursion is a subroutine call – as in the second branch in the above example – was incorrectly given the compile-time error "recursive call could loop indefinitely" because pcre_compile was not correctly checking the subroutine for matching a non-empty string.
Completely revised the help generator to ease navigation and improve readability. Send your feedback!
A pattern such as ^(?!a(*SKIP)b) where a negative assertion contained one of the verbs SKIP, PRUNE, or COMMIT, did not work correctly. When the assertion pattern did not match (meaning that the assertion was true), it was incorrectly treated as false if the SKIP had been reached during the matching. This also applied to assertions used as conditions.
If an item that is not supported by pcre_dfa_exec() was encountered in an assertion subpattern, including such a pattern used as a condition, unpredictable results occurred, instead of the error return PCRE_ERROR_DFA_UITEM.
A subtle bug concerned with back references has been fixed by a change of specification, with a corresponding code fix. A pattern such as ^(xa|=?\1a)+$ which contains a back reference inside the group to which it refers, was giving matches when it shouldn't. For example, xa=xaaa would match that pattern. Interestingly, Perl (at least up to 5.11.3) has the same bug. Such groups have to be quantified to be useful, or contained inside another quantified group. (If there's no repetition, the reference can never match.) The problem arises because, having left the group and moved on to the rest of the pattern, a later failure that backtracks into the group uses the captured value from the final iteration of the group rather than the correct earlier one. This is now fixed by forcing any group that contains a reference to itself to be an atomic group; that is, there cannot be any backtracking into it once it has completed. This is similar to recursive and subroutine calls.
If a pattern contained a conditional subpattern with only one branch (in particular, this includes all (DEFINE) patterns), studying this pattern computed the wrong minimum data length and resulted in matching failures.
For patterns such as (?i)a(?-i)b|c where an option setting at the start of the pattern is reset in the first branch, compilation failed with "internal error: code overflow at offset…". This happened only when the reset was to the original external option setting.
Change published TDIRegEx.MatchPattern property back to AnsiString. This type change was unfortunately required to fix a Delphi 2009 / 2010 RawByteString streaming problem.
Add new public TDIRegEx.MatchPatternRaw: RawByteString property to allow Unicode Delphis to set the MatchPattern without automatic codepage conversion. This is now the recommended MatchPattern runtime property.
Improve Unicode support in DIRegEx_MaskControls.pas. TDIRegExMaskEdit and TDIRegExMaskComboBox now automatically encode text to UTF-8 when their RegEx component is in UTF-8 mode.
The maximum size of a compiled regular expression is now 16 MB. This should please users who had hit the old 64 KB limit.
A UTF-8 pattern such as \x{123}{2,2}+ was incorrectly compiled; the trigger was a minimum greater than 1 for a wide character in a possessive repetition. The same bug could also affect UTF-8 patterns like (\x{ff}{0,2})* which had an unlimited repeat of a nested, fixed maximum repeat of a wide character. Chaos in the form of incorrect output or a compiling loop could result.
The restrictions on what a pattern can contain when partial matching is requested for pcre_exec() have been removed. All patterns can now be partially matched by this function. In addition, if there are at least two slots in the offset vector, the offset of the earliest inspected character for the match and the offset of the end of the subject are set in them when PCRE_ERROR_PARTIAL is returned.
Partial matching has been split into two forms: PCRE_PARTIAL_SOFT, which is synonymous with PCRE_PARTIAL, for backwards compatibility, and PCRE_PARTIAL_HARD, which causes a partial match to supersede a full match, and may be more useful for multi-segment matching.
Partial matching with pcre_exec() is now more intuitive. A partial match used to be given if ever the end of the subject was reached; now it is given only if matching could not proceed because another character was needed. This makes a difference in some odd cases such as Z(*FAIL) with the string "Z", which now yields "no match" instead of "partial match". In the case of pcre_dfa_exec(), "no match" is given if every matching path for the final character ended with (*FAIL).
Restarting a match using pcre_dfa_exec() after a partial match did not work if the pattern had a "must contain" character that was already found in the earlier partial match, unless partial matching was again requested. For example, with the pattern dog.(body)?, the "must contain" character is "g". If the first part-match was for the string "dog", restarting with "sbody" failed. This bug has been fixed.
The string returned by pcre_dfa_exec() after a partial match has been changed so that it starts at the first inspected character rather than the first character of the match. This makes a difference only if the pattern starts with a lookbehind assertion or \b or \B (\K is not supported by pcre_dfa_exec()). It's an incompatible change, but it was required to make it compatible with pcre_exec().
If an odd number of negated classes containing just a single character interposed, within parentheses, between a forward reference to a named subpattern and the definition of the subpattern, compilation crashed with an internal error, complaining that it could not find the referenced subpattern. An example of a crashing pattern is (?&A)(([^m])(?<A>)).
Added moNotEmptyAtStart which makes it possible to have an empty string match not at the start, even when the pattern is anchored.
If the maximum number of capturing subpatterns in a recursion was greater than the maximum at the outer level, the higher number was returned, but with unset values at the outer level. The correct (outer level) value is now given.
If (*ACCEPT) appeared inside capturing parentheses, previous releases did not set those parentheses. The string so far is captured, making this feature compatible with Perl.
DIRegEx now allows subroutine calls in lookbehinds, as long as the subroutine pattern matches a fixed length string. Recursion is not allowed.
The minimum length of subject string that was needed in order to match a given pattern is now provided. This code has now been added to pcre_study(); it finds a lower bound to the length of subject needed. It is not necessarily the greatest lower bound, but using it to avoid searching strings that are too short does give some useful speed-ups. The value is available to calling programs via pcre_fullinfo().
If (?| is used to create subpatterns with duplicate numbers, they are now allowed to have the same name, even if PCRE_DUPNAMES is not set. However, on the other side of the coin, they are no longer allowed to have different names, because these cannot be distinguished.
When duplicate subpattern names are present (necessarily with different numbers), and a test is made by name in a conditional pattern, either for a subpattern having been matched, or for recursion in such a pattern, all the associated numbered subpatterns are tested, and the overall condition is true if the condition is true for any one of them. This is the way Perl works, and is also more like the way testing by number works.
The pattern (?(?=.*b)b|^) was incorrectly compiled as "match must be at start or after a newline", because the conditional assertion was not being correctly handled. The rule now is that both the assertion and what follows in the first alternative must satisfy the test.
If auto-callout was enabled in a pattern with a conditional group whose condition was an assertion, DIRegEx could crash during matching, both with pcre_exec() and pcre_dfa_exec().
The PCRE_DOLLAR_ENDONLY option was not working when pcre_dfa_exec() was used for matching.
Unicode property support in character classes was not working for characters (bytes) greater than 127 when not in UTF-8 mode.
Added the PCRE_NO_START_OPTIMIZE match-time option.
A conditional group that had only one branch was not being correctly recognized as an item that could match an empty string. This meant that an enclosing group might also not be so recognized, causing infinite looping (and probably a segfault) for patterns such as ^"((?(?=[a])[^"])|b)*"$ with the subject "ab", where knowledge that the repeated group can match nothing is needed in order to break the loop.
If a pattern that was compiled with callouts was matched using pcre_dfa_exec(), but without supplying a callout function, matching went wrong.
If PCRE_ERROR_MATCHLIMIT occurred during a recursion, there was a memory leak if the size of the offset vector was greater than 30. When the vector is smaller, the saved offsets during recursion go onto a local stack vector, but for larger vectors malloc() is used. It was failing to free when the recursion yielded PCRE_ERROR_MATCHLIMIT (or any other "abnormal" error, in fact).
Forward references, both numeric and by name, in patterns that made use of duplicate group numbers, could behave incorrectly or give incorrect errors, because when scanning forward to find the reference group, PCRE was not taking into account the duplicate group numbers. A pattern such as ^X(?3)(a)(?|(b)|(q))(Y) is an example.
Added support for (*UTF8) at the start of a pattern.
Work around an unexpected Delphi 2009 automatic numeric AnsiChar Unicode conversion in DIUtils.pas which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.) codepage.
Delphi 2009 support.
Fix an expression study bug when a pattern contained a group with a zero quantifier.
Optimize Unicode Character Property searching, giving speed ups of 2 to 5 times on some simple patterns.
Updated the Unicode datatables to Unicode 5.1.0. This adds yet more scripts.
Fix caseless matching for non-ASCII characters in back references.
Fix overwriting or crash if the start of a pattern had top-level alternatives.
Fix a few cases where matching could read past the end of the subject.
Fix lazy quantifiers which were not working in some cases in UTF-8 mode.
Improve compatibility for parallel installation with other DI packages.
Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n', which, however, unlike Perl's \g{…}, are subroutine calls, not back references. DIRegEx supports relative numbers with this syntax.
Previously, a group with a zero repeat such as (…){0} was completely omitted from the compiled regex. However, this means that if the group was called as a subroutine from elsewhere in the pattern, things went wrong (an internal error was given). Such groups are now left in the compiled pattern, with a new opcode that causes them to be skipped at execution time.
Added the PCRE_JAVASCRIPT_COMPAT option. This makes the following changes to the way DIRegEx behaves:
A lone ] character is disallowed (Perl treats it as data).
A back reference to an unmatched subpattern matches an empty string (Perl fails the current match path).
A data ] in a character class must be notated as \] because if the first data character in a class is ], it defines an empty class. (In Perl it is not possible to have an empty class.) The empty class [] never matches; it forces failure and is equivalent to (*FAIL) or (?!). The negative empty class [^] matches any one character, independently of the DOTALL setting.
A pattern such as /(?2)[]a()b](abc)/ which had a forward reference to a non-existent subpattern following a character class starting with ']' and containing () gave an internal compiling error instead of "reference to non-existent subpattern". This is now corrected.
Accept (*FAIL) for DFA matching.
DIRegEx 4.6 did not update the internal PCRE version number.
Fixed a problem with TDIRegEx.Format and duplicate substring names.
Removed conditional directives from DIRegEx_Workbench_Form.pas which caused problems for some Delphi versions.
$(PRODUCT_NAME_VERSION) is mainly a bug-fix release:
Negative specials like \S did not work in character classes in UTF-8 mode. Characters greater than 255 were excluded from the class instead of being included. The same bug also applied to negated POSIX classes such as [:^space:].
The construct (?&) was not diagnosed as a syntax error (it referenced the first named subpattern) and a construct such as (?&a) would reference the first named subpattern whose name started with "a" (in other words, the length check was missing). Both these problems are fixed. "Subpattern name expected" is now given for (?&) (a zero-length name), and this patch also makes it give the same error for \k'' (previously it complained that that was a reference to a non-existent subpattern).
The erroneous patterns (?+-a) and (?-+a) give different error messages; this is right because (?- can be followed by option settings as well as by digits. I have, however, made the messages clearer.
Patterns such as (?(1)a|b) (a pattern that contains fewer subpatterns than the number used in the conditional) now cause a compile-time error. This is actually not compatible with Perl, which accepts such patterns, but treats the conditional as always being FALSE (as DIRegEx used to), but it seems that giving a diagnostic is better.
Correct some Unicode character properties which were in the wrong script.
The pattern (?=something)(?R) was not being diagnosed as a potentially infinitely looping recursion. The bug was that positive lookaheads were not being skipped when checking for a possible empty match (negative lookaheads and both kinds of lookbehind were skipped).
Specifying a possessive quantifier with a specific limit for a Unicode character property caused pcre_compile() to compile bad code, which led at runtime to PCRE_ERROR_INTERNAL (-14). Examples of patterns that caused this are: '\p{Zl}{2,3}+' and '\p{Cc}{2}+'. It was the possessive "+" that caused the error; without that there was no problem.
In UTF-8 mode, with newline set to "any", a pattern such as .*a.*=.b.* crashed when matching a string such as a\x{2029}b (note that \x{2029} is a UTF-8 newline character). The key issue is that the pattern starts .*; this means that the match must be either at the beginning, or after a newline. The bug was in the code for advancing after a failed match and checking that the new position followed a newline. It was not taking account of UTF-8 characters correctly.
DIRegEx was behaving differently from Perl in the way it recognized POSIX character classes. DIRegEx was not treating the sequence [:…:] as a character class unless the … were all letters. Perl, however, seems to allow any characters between [: and :], though of course it rejects as unknown any "names" that contain non-letters, because all the known class names consist only of letters. Thus, Perl gives an error for [[:1234:]], for example, whereas DIRegEx did not – it did not recognize a POSIX character class. This seemed a bit dangerous, so the code has been changed to be closer to Perl. The behaviour is not identical to Perl, because DIRegEx will diagnose an unknown class for, for example, [[:l\ower:]] where Perl will treat it as [[:lower:]]. However, DIRegEx does now give "unknown" errors where Perl does, and where it didn't before.
Correct a potential one byte overflow by ansi_mbtowc and oem_mbtowc in DIRegEx_SearchStream.pas.
Extend TDIRegEx.MatchNext to match empty result strings. The new algorithm detects potential infinite loops and advances the search position as necessary.
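The loop-avoidance idea can be sketched as follows. This is a minimal illustration in Python, not the DIRegEx implementation; `iter_matches` is a hypothetical helper, and the rule "step one position past an empty match" is an assumption about what "advances the search position as necessary" means.

```python
import re

def iter_matches(pattern, subject):
    """Yield (start, end) spans of successive matches, including empty ones."""
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        yield m.span()
        # An empty match would be found again at the same position,
        # so step past it; otherwise continue where the match ended.
        pos = m.end() + 1 if m.end() == m.start() else m.end()

spans = list(iter_matches(r"a*", "baac"))
```

Without the extra step after an empty match, the loop would report the same zero-length match at position 0 forever.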
Do not count [\s] as an explicit reference to CR or LF. So now DIRegEx will match single CR and LF only if the pattern contains \r or \n (or a literal CR or LF).
The appearance of (?J) was not reflected by the PCRE_INFO_JCHANGED facility.
Added options (at compile time and exec time) to change \R from matching any Unicode line ending sequence to just matching CR, LF, or CRLF.
The pattern .*$ when run in not-DOTALL UTF-8 mode with newline=any failed when the subject happened to end in the byte 0x85 (e.g. if the last character was \x{1ec5}). *Character* 0x85 is one of the "any" newline characters but of course it shouldn't be taken as a newline when it is part of another character. The bug was that, for an unlimited repeat of . in not-DOTALL UTF-8 mode, DIRegEx was advancing by bytes rather than by characters when looking for a newline.
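Why a byte-wise scan goes wrong here can be shown with a small Python sketch. The two helper functions are hypothetical and only illustrate the difference between scanning bytes and scanning characters; they are not DIRegEx code.

```python
NEL = "\u0085"  # character 0x85, one of the "any" newline characters

def has_nel_bytewise(data: bytes) -> bool:
    # Flawed approach: treats any 0x85 byte as a newline,
    # even when it is a trailing byte of a longer UTF-8 sequence.
    return 0x85 in data

def has_nel_charwise(data: bytes) -> bool:
    # Decodes first, so only a real U+0085 character counts.
    return NEL in data.decode("utf-8")

# U+1EC5 is encoded as the three bytes E1 BB 85.
subject = "abc\u1ec5".encode("utf-8")
```

For this subject the byte-wise check reports a newline while the character-wise check does not, which is the mismatch the fix removes.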
A small performance improvement in the DOTALL UTF-8 mode .* case.
Remove the explicit limit of non-capturing parentheses at the expense of using more stack.
Remove the artificial limitation on group length – now there is only the limit on the total length of the compiled pattern, which is set at 65535.
Because Perl interprets \Q…\E at a high level, and ignores orphan \E instances, patterns such as [\Q\E] or [\E] or even [^\E] cause an error, because the ] is interpreted as the first data character and the terminating ] is not found. DIRegEx has been made compatible with Perl in this regard. Previously, it interpreted [\Q\E] as an empty class, and [\E] could cause memory overwriting.
Like Perl, DIRegEx automatically breaks an unlimited repeat after an empty string has been matched (to stop an infinite loop). It was not recognizing a conditional subpattern that could match an empty string if that subpattern was within another subpattern. For example, it looped when trying to match (((?(1)X|))*) but it was OK with ((?(1)X|)*) where the condition was not nested. This bug has been fixed.
A pattern like \X?\d or \P{L}?\d in non-UTF-8 mode could cause a backtrack past the start of the subject in the presence of bytes with the top bit set, for example "\x8aBCD".
Added Perl 5.10 experimental backtracking controls (*FAIL), (*F), (*PRUNE), (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT).
Optimized (?!) to (*FAIL).
Updated the test for a valid UTF-8 string to conform to the later RFC 3629. This restricts code points to be within the range 0 to 0x10FFFF, excluding the surrogate range 0xD800 to 0xDFFF. Previously, DIRegEx allowed the full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still does: it's just the validity check that is more restrictive.
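The range rule stated above amounts to the following check, sketched in Python. The function name is hypothetical, and the sketch covers only the code point range, not the byte-level well-formedness checks that a full UTF-8 validator also needs.

```python
def is_valid_rfc3629_code_point(cp: int) -> bool:
    """True if cp may appear in a valid UTF-8 string under RFC 3629."""
    in_range = 0 <= cp <= 0x10FFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    return in_range and not is_surrogate
```

Values above 0x10FFFF, which the older RFC 2279 range permitted, are rejected by this check.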
Inserted checks for integer overflows during escape sequence (backslash) processing, and also fixed erroneous offset values for syntax errors during backslash processing.
Fixed another case of looking too far back in non-UTF-8 mode for patterns like [\PPP\x8a]{1,}\x80 with the subject "A\x80".
An unterminated class in a pattern like (?1)\c[ with a "forward reference" caused an overrun.
A pattern like (?:[\PPa*]*){8,} which had an "extended class" (one with something other than just ASCII characters) inside a group that had an unlimited repeat caused a loop at compile time (while checking to see whether the group could match an empty string).
An orphan \E inside a character class could cause a crash.
A repeated capturing bracket such as (A)? could cause a wild memory reference during compilation.
There are several functions in pcre_compile() that scan along a compiled expression for various reasons (e.g. to see if it's fixed length for look behind). There were bugs in these functions when a repeated \p or \P was present in the pattern. These operators have additional parameters compared with \d, etc, and these were not being taken into account when moving along the compiled data. Specifically:
An item such as \p{Yi}{3} in a lookbehind was not treated as fixed length.
An item such as \pL+ within a repeated group could cause crashes or loops.
A pattern such as \p{Yi}+(\P{Yi}+)(?1) could give an incorrect "reference to non-existent subpattern" error.
A pattern like (\P{Yi}{2}\277)? could loop at compile time.
A repeated \S or \W in UTF-8 mode could give wrong answers when multibyte characters were involved (for example /\S{2}/8g with "A\x{a3}BC").
Patterns such as [\P{Yi}A] which include \p or \P and just one other character were causing crashes (broken optimization).
Patterns such as (\P{Yi}*\277)* (group with possible zero repeat containing \p or \P) caused a compile-time loop.
More problems have arisen in unanchored patterns when CRLF is a valid line break. For example, the unstudied pattern [\r\n]A does not match the string "\r\nA". However, the pattern \nA *does* match, because it doesn't start till \n, and if [\r\n]A is studied, the same is true. There doesn't seem any very clean way out of this, but to make sense for the common cases DIRegEx now takes note of whether there can be an explicit match for \r or \n anywhere in the pattern, and if so, does not advance CRLF by two bytes. As part of this change, there's a new PCRE_INFO_HASCRORLF option for finding out whether a compiled pattern has explicit CR or LF references.
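The resulting start-position rule can be sketched like this. It is a simplified Python illustration under the assumption that the decision depends only on the two flags shown; `next_start` and its parameter names are hypothetical, not part of DIRegEx or PCRE.

```python
def next_start(subject: str, pos: int,
               crlf_is_newline: bool, has_cr_or_lf: bool) -> int:
    """Return the next position to try after a failed match attempt at pos."""
    # Skip the whole CRLF pair only when CRLF counts as a newline and the
    # pattern contains no explicit reference to CR or LF.
    if crlf_is_newline and not has_cr_or_lf and subject.startswith("\r\n", pos):
        return pos + 2
    return pos + 1
```

With `has_cr_or_lf` set, a pattern such as [\r\n]A gets a second attempt starting at the \n of "\r\nA" and can therefore match.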
Added (*CR) etc for changing newline setting at start of pattern.
Fix spelling of DIRegEx_Reg.dcr in the DIRegEx packages which caused a problem during IDE installation.
Documentation updates and fixes.
Add more features from Perl 5.10:
(?-n) (where n is a string of digits) is a relative subroutine or recursion call. It refers to the nth most recently opened parentheses.
(?+n) is also a relative subroutine call; it refers to the nth next to be opened parentheses.
Conditions that refer to capturing parentheses can be specified relatively, for example, (?(-2)… or (?(+3)…
\K resets the start of the current match so that everything before is not part of it.
\k{name} is synonymous with \k<name> and \k'name' (.NET compatible).
\g{name} is another synonym – part of Perl 5.10's unification of reference syntax.
(?| introduces a group in which the numbering of parentheses in each alternative starts with the same number.
\h, \H, \v, and \V match horizontal and vertical whitespace.
Fix: Matching a pattern such as (.*(.)?)* failed by either not terminating or by crashing.
Fix: A pattern with a very large number of alternatives (more than several hundred) was running out of internal workspace during the pre-compile phase. A bit of new cunning has reduced the workspace needed for groups with alternatives. The 1000-alternative test pattern now uses 12 bytes of workspace instead of running out of the 4096 that are available.
Fix: If \p or \P was used in non-UTF-8 mode on a character greater than 127 it matched the wrong number of bytes.
Added new method TDIRegEx.SubStrPtr.
Added two new info methods to TDIRegEx: InfoOkPartial and InfoJChanged.
Speed up performance of TDIRegEx.CompiledRegExpArray.
Added new menu entry to the DIRegEx Workbench to copy the pattern as a well formatted Pascal string.
Added a new demo DIRegEx_PreCompiled_Pattern.dpr which shows how to use precompiled regular expressions.
Delphi 2007 support.
Added coNewLineAnyCrLf which is like coNewLineAny, but matches only CR, LF, or CRLF as a newline sequence. The match-option equivalent is moNewLineAnyCrLf. Only a single newline option may be set at the same time. Invalid combinations of newline options will raise an exception.
New classes to search for regular expressions in data / streams / files of arbitrary size by loading only a small portion of data into memory at a single time:
TDICustomRegExSearch
TDIRegExSearchStream
TDIRegExSearchStream_Enc
TDIRegExSearchStream_ANSI
TDIRegExSearchStream_Binary
TDIRegExSearchStream_Binary16BE
TDIRegExSearchStream_Binary16LE
TDIRegExSearchStream_OEM
TDIRegExSearchStream_UTF16BE
TDIRegExSearchStream_UTF16LE
There is a new example project demonstrating the usage of these new classes.
Fixed a fairly obscure bug concerned with quantified caseless matching with Unicode property support: For a maximizing quantifier, if the two different cases of the character were of different lengths in their UTF-8 codings, and the matching function had to back up over a mixture of the two cases, it incorrectly assumed they were both the same length.
In multiline mode when the newline sequence was set to "any", the pattern ^$ would give a match between the CR and LF of a subject such as 'A'#13#10'B'. This doesn't seem right; it now treats the CRLF combination as the line ending, and so does not match in that case. It's only a pattern such as ^$ that would hit this one: something like ^ABC$ would have failed after CR and then tried again after CRLF.
SubStrCount returns the actual count of captured substrings, even for descendent classes. Fixed a problem where the wrong value was returned for TDIDfaRegEx. Likewise improved the regular expression workbench.
Fixed TDIRegExInspector to handle Windows XP themes.
Added XP Theme support to the GUI demo projects. Also increased the demo projects' maximum stack size to {$MAXSTACKSIZE $00200000} in order to reduce the potential of stack overflow when matching very demanding regular expressions.
New and improved List2 and Replace2 functions: They are different from the old List and Replace in that they return the number of matches listed / replaced and also work on empty matches. This can be useful for replacing empty lines, for example.
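The empty-line use case can be illustrated with Python's `re.subn`, which likewise returns the replacement count and handles zero-length matches. This is an analogy only; it does not show the List2 or Replace2 signatures, and the replacement text is made up for the example.

```python
import re

text = "a\n\nb"

# In multiline mode ^$ matches the zero-length position of an empty line.
result, count = re.subn(r"(?m)^$", "<empty>", text)
```

Here the single empty line between "a" and "b" is replaced, and the returned count tells the caller that exactly one replacement took place.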
In response to the growing importance of Unicode, the default character set for caseless matching and character classes is now Latin 1, a subset of Unicode. Use the poUserLocale Option if you are matching ANSI strings in the user's default locale.
Major re-factoring of the way pcre_compile computes the amount of memory needed for a compiled pattern. It now runs the real compile function in a "fake" mode that enables it to compute how much memory it would need, while actually only ever using a few hundred bytes of working memory and without too many tests of the mode. A side effect of this work is that the limit of 200 on the nesting depth of parentheses has been removed. However, there is a downside: pcre_compile now runs more slowly than before (30% or more, depending on the pattern). There is no effect on runtime performance.
Extended pcre_study to be more clever in cases where a branch of a subpattern has no definite first character. For example, (a*|b*)[cd] would previously give no result from pcre_study. Now it recognizes that the first character must be a, b, c, or d.
There was an incorrect error "recursive call could loop indefinitely" if a subpattern (or the entire pattern) that was being tested for matching an empty string contained only one non-empty item after a nested subpattern.
A new optimization is now able automatically to treat some sequences such as a*b as a*+b. More specifically, if something simple (such as a character or a simple class like \d) has an unlimited quantifier, and is followed by something that cannot possibly match the quantified thing, the quantifier is automatically "possessified".
A recursive reference to a subpattern whose number was greater than 39 went wrong under certain circumstances in UTF-8 mode. This bug could also have affected the operation of pcre_study.
Possessive quantifiers such as a++ were previously implemented by turning them into atomic groups such as (?>a+). Now they have their own opcodes, which improves performance. This includes the automatically created ones from above.
A pattern such as (?=(\w+))\1: which simulates an atomic group using a lookahead was broken if it was not anchored. DIRegEx was mistakenly expecting the first matched character to be a colon. This applied both to named and numbered groups.
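The lookahead idiom behaves the same way in other backtracking engines, so it can be tried out in Python. This sketch uses Python's `re` module purely as a stand-in; the subject strings are made up for the example.

```python
import re

# The lookahead captures the longest run of word characters; \1 then
# consumes exactly that text, and the engine does not backtrack into
# the lookahead to try a shorter capture.
atomic_like = re.compile(r"(?=(\w+))\1:")

m = atomic_like.search("foo bar: baz")
found = m.group(0) if m else None

# Contrast: a plain greedy \w+ can give back a character, the emulation cannot.
plain = re.search(r"\w+\w", "abc")
emulated = re.search(r"(?=(\w+))\1\w", "abc")
```

The unanchored search has to skip over "foo" before it finds "bar:", which is the situation the fixed bug affected: the first matched character is a word character, not a colon.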
Forward references to subpatterns in conditions such as (?(2)…) where subpattern 2 is defined later cause pcre_compile to search forwards in the pattern for the relevant set of parentheses. This search went wrong when there were unescaped parentheses in a character class, parentheses escaped with \Q…\E, or parentheses in a #-comment in /x mode.
"Subroutine" calls and backreferences were previously restricted to referencing subpatterns earlier in the regex. This restriction has now been removed.
Added a number of extra features that are going to be in Perl 5.10. On the whole, these are just syntactic alternatives for features that DIRegEx had previously implemented using the Python syntax or my own invention. The other formats are all retained for compatibility.
Named groups can now be defined as (?<name>…) or (?'name'…) as well as (?P<name>…). The new forms, as well as being in Perl 5.10, are also .NET compatible.
A recursion or subroutine call to a named group can now be defined as (?&name) as well as (?P>name).
A backreference to a named group can now be defined as \k<name> or \k'name' as well as (?P=name). The new forms, as well as being in Perl 5.10, are also .NET compatible.
A conditional reference to a named group can now use the syntax (?(<name>) or (?('name') as well as (?(name).
A "conditional group" of the form (?(DEFINE)…) can be used to define groups (named and numbered) that are never evaluated inline, but can be called as "subroutines" from elsewhere. In effect, the DEFINE condition is always false. There may be only one alternative in such a group.
A test for recursion can be given as (?(R1)… or (?(R&name)… as well as the simple (?(R). The condition is true only if the most recent recursion is that of the given number or name. It does not search out through the entire recursion stack.
The escape \gN or \g{N} has been added, where N is a positive or negative number, specifying an absolute or relative reference.
Updated the Unicode property tables to Unicode version 5.0.0. Amongst other things, this adds five new scripts.
Perl ignores orphaned \E escapes completely. DIRegEx now does the same. There were also incompatibilities regarding the handling of \Q..\E inside character classes, for example with patterns like [\Qa\E-\Qz\E] where the hyphen was adjacent to \Q or \E. I hope I've cleared all this up now.
Like Perl, DIRegEx detects when an indefinitely repeated parenthesized group matches an empty string, and forcibly breaks the loop. There were bugs in this code in non-simple cases. For a pattern such as ^(a()*)* matched against aaaa the result was just "a" rather than "aaaa", for example. Two separate and independent bugs (that affected different cases) have been fixed.
Implemented PCRE_NEWLINE_ANY and coNewLineAny to recognize any of the Unicode newline sequences as "newline" when processing dot, circumflex, or dollar metacharacters, or #-comments in /x mode.
Added \R to match any Unicode newline sequence, as suggested in the Unicode report.
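For engines without \R, a rough equivalent can be spelled out by hand. The Python sketch below assumes the usual set of Unicode newline sequences (CRLF, LF, VT, FF, CR, NEL, LS, PS); unlike \R it is not atomic, so it may backtrack from CRLF to a lone CR. The second pattern mirrors the restricted CR, LF, or CRLF behaviour.

```python
import re

# CRLF is listed first so that it is preferred over a lone CR.
ANY_NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

# Restricted variant: only CR, LF, or CRLF.
ANYCRLF_NEWLINE = re.compile("\r\n|\r|\n")

sample = "a\r\nb\nc\u2028d\x85e"
all_breaks = ANY_NEWLINE.findall(sample)
crlf_breaks = ANYCRLF_NEWLINE.findall(sample)
```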
For an unanchored pattern, if a match attempt fails at the start of a newline sequence, and the newline setting is CRLF or ANY, and the next two characters are CRLF, advance by two characters instead of one.
products/regex/history.txt · Last modified: 2012/01/19 17:13 (external edit)
|