Regex vs. string.IndexOf
I send a piece of code to Justin, which dealt with doing some simple text parsing. His comment was:
text.Substring(lastIndex, currentIndex - lastIndex);
Dude, Regex, dude!
This code reminds me of when I wrote an XML parser in ASP3
The reason that I used IndexOf there was performance, this piece of code is in the critical path, and I don't think that Regex would give me much there. But Justin said that compiled Regex is more efficient than IndexOf, so I decided to check it.
Here is my quick perf test:
static void Main(string[] args) { string testStr = "select foo, bar, x, y, z, 5 from Items"; int count = 500000; DateTime start = DateTime.Now; for (int i = 0; i < count; i++) { int last = 0, current = 0; while ((current = testStr.IndexOf(',', current)) != -1) { string x = testStr.Substring(last, current - last); current = last = current + 1; } } Console.WriteLine(DateTime.Now -start); start = DateTime.Now; Regex r = new Regex(",", RegexOptions.Compiled); for (int i = 0; i < count; i++) { int last = 0, current = 0; Match match = r.Match(testStr, current); while (match.Success) { current = match.Index; string y = testStr.Substring(last, current - last); current = last = current + 1; match = r.Match(testStr, current); } } Console.WriteLine(DateTime.Now - start); }
The results, on my machine:
String.IndexOf: 00.2343750
Compiled Regex: 01.4687500
Comments
Try running it for longer input string and longer pattern to be found. Regular expressions are faster. Of course everything depends on the scenario for which you're optimizing.
Data I used:
string testStr = "THIS_IS_JUST_A_TEST_abcdefghijklmnopqrstuvwxyz0123456789!987654321zyxwvutsrqpomnlkjihgfedcba_TSET_A_TSUJ_SI_SIHT";
pattern: "TSET_"
Results:
String.IndexOf: 00:00:00.4200000
Compiled Regex: 00:00:00.2970000
Can you share the full source code?
When I added this:
StringBuilder sb = new StringBuilder(testStr);
for (int i = 0; i < 50; i++)
{
}
testStr = sb.ToString();
The results where:
00:00:12.3750000
00:01:11.2500000
After changing the pattern to "from", with the same change as above:
00:00:11.4843750
00:00:15.9687500
Yeah, Regex for a simple split is overkill. It only starts showing value when things get complex. I'm also thinking it might be better for really-large-strings too. I wonder if IndexOf starts breaking down if the string is large.
One question: Why didn't you use testStr.Split()? I added that to your test and found that it was slightly slower (0.301 vs. 0.177) but it's much clearer in the code.
Using in the IndexOf:
StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase
you can speed up even more the IndexOf operation, that way you avoid all the Culture specific code =)
Cheers
Chad,
Because the code that I really have there is doing string parsing, not splitting, for instance, I need to handle ',' in a different way.
Ayende: sorry, I skimmed over your code without actually reading it :). I didn't notice the string joining part, and my "_TSET" pattern was matched only once, so the benchmark wasn't very fair. Updated version of the code:
00:00:02.9800000 (IndexOf) vs 00:00:02.0430000 (Regex)
On my machine:
00:00:03.1250000 - string.IndexOf
00:00:03.8750000 - RegEx
Interesting... What CPU / architecture / OS / .NET runtime / build config? Mine is: Intel Core 2 CPU / T7400 @ 2.16GHz / Windows Vista / .NET 2.0.50727.312 / default VS 2005 Debug build config.
@Marcos, you are correct, using the StringComparison.Ordinal or OrdinalIgnoreCase options for IndexOf(), StartsWith(), etc. can be quite a savings. There are a LOT of strings in most programs that don't require culturally-sensitive handling. Often, virtually all of the strings in the critical path will be straight ASCII. It's a great optimization and much under-used.
@Ayende, I would maintain that for most real-world applications outside of large scale text searching or transformations, Regex tends to add complexity without giving back all that much. Regex tends to add a "WTF?" factor into quickly reading and understanding code. Avoiding that is often worth the modest real-world performance penalty of not using regex, even when it exists.
Perntium D Dual Core, 3.00 Ghz, Windows 2003, .Net 2.0.50727, debug build
"But Justin said that compiled Regex is more efficient than IndexOf, so I decided to check it."
I think the problem is that the term "efficient" is ambiguous: In this context, it can either mean:
a) raw machine speed, as in "takes 7.538 ms on machine A, with input data B"
b) computational complexity, as in "O(n^2)"
System.Text.RegularExpressions (AFAIK) implements the Boyer-Moore-Algorithm, which is of course better in terms of computational complexity, but can be slower for short search patterns. So if speed is your only concern, you need to know how long your search patterns will probably be to make a decision. Everything else is voodoo.
@Bob: I'd argue that Regex's are better for more complex parsing jobs because they're declarative: You don't tell the computer how to parse a string, but what you're looking for. Of course there's a learning curve, but once you understood how they work they're a blessing (IMO). If I had to choose between a 20-line parsing function written in C#, with nested loops and hundreds of places to hide programming errors, and a 1-line-regex, I'd take the regex (almost) every time. It might take you five seconds longer to understand it for the first time, but it might take hours to find that filthy little bug in the 20-line parsing code...
It'll be nice to put a regex extension on the string class in c# 3.0...
Niki,
The string to search is a SQL string, which can be up to 1Kb in size (very rarely).
The string to search is ","
In this context, efficient means that I can't see a way for a regex impl to do something faster than simple string search.
Ayende: I have no idea what's causing the difference in the results we're getting... We're using pretty much the same architecture, if I'm not mistaken.
That regex performance observation is pretty interesting, though.
On an Athlon 64 3000+, using the updated code and "Release" mode outside VS.NET:
IndexOf: around 6 sec. (5.6 to 6.2 spread)
Regex: around 4 sec (3.8 to 4.5 spread)
same executable run under mono:
IndexOf: 13.641
Regex: 36.297
Running with a debugger attached can cause odd performance quirks, esp. if the code has a managed/unmanaged boundary (i.e. generally not with such simple code), as can running a debug build.
I would generally prefer regular expressions, were it not that they're not integrated into C# particularly intuitively. Using explicit capture groups as so:
new Regex("^((?<group>[^,]),)([^,]*)$", RegexOptions.Compiled | RegexOptions.CultureInvariant|RegexOptions.ExplicitCapture);
is in my opinion the most maintainable usage pattern, which you'ld extract information from using Groups:
Of course, it's not exactly always obvious how to modify such regular expression, but modifying complex parsers written in hand-coded procedural style isn't ever easier either, irrespective of the language.
I'd use string.Split if that's suitable, which it is in this example, and otherwise a Regex for maintainability.
I ran into a performance problem with Cuyahoga and Regexes.
You should always use static methods of the Regex like so:
Regex.IsMatch
The test and I seem to remember the RegexOptions.Compiled flag is bad for performance.....
I want to add this one to be complete. I think that it is much more readable but it is slow :-)
i think you accounted for the overhead of compiling the regex incorrectly.
you have "start = DateTime.Now" before the regex compile. A better test would be to compile the regex before setting the start time.
the reason i suggest this, is to show that you only have to compile the regex once, during the initializing of your application (during a static constructor or something on the object that needs it). after that, the regex will likely be much faster.
but as many previous comments said - it all depends on your scenario and how large the string is.
also - be careful about constantly regenerating compiled regex. i have heard that there is a memory leak caused by this (same problem as using codedom to compile code on the fly). not sure about this, though.
I've posted an optimized version of the regex so that it properly uses regex functionality instead of trying to mimic what IndexOf does. see my blog:
http://www.avocadosoftware.com/csblogs/dredge/archive/2007/11/05/776.aspx
looks like the IndexOf is faster in this case... just found a huge bug in the code i posted on my blog, and have corrected it...
Comment preview