Regex vs. string.IndexOf

time to read 2 min | 237 words

I send a piece of code to Justin, which dealt with doing some simple text parsing. His comment was:

text.Substring(lastIndex, currentIndex - lastIndex);
Dude, Regex, dude!
This code reminds me of when I wrote an XML parser in ASP3

The reason that I used IndexOf there was performance, this piece of code is in the critical path, and I don't think that Regex would give me much there. But Justin said that compiled Regex is more efficient than IndexOf, so I decided to check it.

Here is my quick perf test:

static void Main(string[] args)
{
	string testStr = "select foo, bar, x, y, z, 5 from Items";
	int count = 500000;
	DateTime start = DateTime.Now;
	for (int i = 0; i < count; i++)
	{
		int last = 0, current = 0;
		while ((current = testStr.IndexOf(',', current)) != -1)
		{
			string x = testStr.Substring(last, current - last);
			current = last = current + 1;
		}
	}
	Console.WriteLine(DateTime.Now -start);
	start = DateTime.Now;
	Regex r = new Regex(",", RegexOptions.Compiled);
	for (int i = 0; i < count; i++)
	{
		int last = 0, current = 0;
		Match match = r.Match(testStr, current);
		while (match.Success)
		{
			current = match.Index;
			string y = testStr.Substring(last, current - last);
			current = last = current + 1;
			match = r.Match(testStr, current);
		}
	}
	Console.WriteLine(DateTime.Now - start);
}

The results, on my machine:

String.IndexOf: 00.2343750
Compiled Regex: 01.4687500

Tweet Share Share 20 comments

Tags:

Performance

Comments

03 Nov 2007
20:45 PM

Marcin Seredynski

Try running it for longer input string and longer pattern to be found. Regular expressions are faster. Of course everything depends on the scenario for which you're optimizing.

Data I used:

string testStr = "THIS_IS_JUST_A_TEST_abcdefghijklmnopqrstuvwxyz0123456789!987654321zyxwvutsrqpomnlkjihgfedcba_TSET_A_TSUJ_SI_SIHT";

pattern: "TSET_"

Results:

String.IndexOf: 00:00:00.4200000

Compiled Regex: 00:00:00.2970000

03 Nov 2007
20:53 PM

Ayende Rahien

Can you share the full source code?

When I added this:

StringBuilder sb = new StringBuilder(testStr);

for (int i = 0; i < 50; i++)

{

sb.Append(testStr);

}

testStr = sb.ToString();

The results where:

00:00:12.3750000

00:01:11.2500000

After changing the pattern to "from", with the same change as above:

00:00:11.4843750

00:00:15.9687500

03 Nov 2007
20:59 PM

Chad Myers

Yeah, Regex for a simple split is overkill. It only starts showing value when things get complex. I'm also thinking it might be better for really-large-strings too. I wonder if IndexOf starts breaking down if the string is large.

One question: Why didn't you use testStr.Split()? I added that to your test and found that it was slightly slower (0.301 vs. 0.177) but it's much clearer in the code.

03 Nov 2007
21:05 PM

Marcos

Using in the IndexOf:

StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase

you can speed up even more the IndexOf operation, that way you avoid all the Culture specific code =)

Cheers

03 Nov 2007
21:12 PM

Ayende Rahien

Chad,

Because the code that I really have there is doing string parsing, not splitting, for instance, I need to handle ',' in a different way.

03 Nov 2007
21:25 PM

Marcin Seredynski

Ayende: sorry, I skimmed over your code without actually reading it :). I didn't notice the string joining part, and my "_TSET" pattern was matched only once, so the benchmark wasn't very fair. Updated version of the code:

    static void Main(string[] args)

    {

        string testStr = "THIS_IS_JUST_A_TEST_abcdefghijklmnopqrstuvwxyz0123456789!987654321zyxwvutsrqpomnlkjihgfedcba_TSET_A_TSUJ_SI_SIHT";

        for (int q = 0; q < 3; q++)

            testStr += testStr;


        int count = 500000;

        DateTime start = DateTime.Now;

        for (int i = 0; i < count; i++)

        {

            int last = 0, current = 0;

            while ((current = testStr.IndexOf("TSET_", current)) != -1)

            {

                string x = testStr.Substring(last, current - last);

                current = last = current + 1;

            }

        }

        Console.WriteLine(DateTime.Now - start);

        Regex r = new Regex("TSET_", RegexOptions.Compiled);

        start = DateTime.Now;

        for (int i = 0; i < count; i++)

        {

            int last = 0, current = 0;

            Match match = r.Match(testStr, current);

            while (match.Success)

            {

                current = match.Index;

                string y = testStr.Substring(last, current - last);

                current = last = current + 1;

                match = r.Match(testStr, current);

            }

        }

        Console.WriteLine(DateTime.Now - start);

    }

00:00:02.9800000 (IndexOf) vs 00:00:02.0430000 (Regex)

03 Nov 2007
21:35 PM

Ayende Rahien

On my machine:

00:00:03.1250000 - string.IndexOf

00:00:03.8750000 - RegEx

03 Nov 2007
21:45 PM

Marcin Seredynski

Interesting... What CPU / architecture / OS / .NET runtime / build config? Mine is: Intel Core 2 CPU / T7400 @ 2.16GHz / Windows Vista / .NET 2.0.50727.312 / default VS 2005 Debug build config.

03 Nov 2007
21:47 PM

Bob Grommes

@Marcos, you are correct, using the StringComparison.Ordinal or OrdinalIgnoreCase options for IndexOf(), StartsWith(), etc. can be quite a savings. There are a LOT of strings in most programs that don't require culturally-sensitive handling. Often, virtually all of the strings in the critical path will be straight ASCII. It's a great optimization and much under-used.

@Ayende, I would maintain that for most real-world applications outside of large scale text searching or transformations, Regex tends to add complexity without giving back all that much. Regex tends to add a "WTF?" factor into quickly reading and understanding code. Avoiding that is often worth the modest real-world performance penalty of not using regex, even when it exists.

03 Nov 2007
21:55 PM

Ayende Rahien

Perntium D Dual Core, 3.00 Ghz, Windows 2003, .Net 2.0.50727, debug build

03 Nov 2007
23:00 PM

Niki

"But Justin said that compiled Regex is more efficient than IndexOf, so I decided to check it."

I think the problem is that the term "efficient" is ambiguous: In this context, it can either mean:

a) raw machine speed, as in "takes 7.538 ms on machine A, with input data B"

b) computational complexity, as in "O(n^2)"

System.Text.RegularExpressions (AFAIK) implements the Boyer-Moore-Algorithm, which is of course better in terms of computational complexity, but can be slower for short search patterns. So if speed is your only concern, you need to know how long your search patterns will probably be to make a decision. Everything else is voodoo.

@Bob: I'd argue that Regex's are better for more complex parsing jobs because they're declarative: You don't tell the computer how to parse a string, but what you're looking for. Of course there's a learning curve, but once you understood how they work they're a blessing (IMO). If I had to choose between a 20-line parsing function written in C#, with nested loops and hundreds of places to hide programming errors, and a 1-line-regex, I'd take the regex (almost) every time. It might take you five seconds longer to understand it for the first time, but it might take hours to find that filthy little bug in the 20-line parsing code...

03 Nov 2007
23:10 PM

Evan

It'll be nice to put a regex extension on the string class in c# 3.0...

03 Nov 2007
23:15 PM

Ayende Rahien

Niki,

The string to search is a SQL string, which can be up to 1Kb in size (very rarely).

The string to search is ","

In this context, efficient means that I can't see a way for a regex impl to do something faster than simple string search.

03 Nov 2007
23:47 PM

Marcin Seredynski

Ayende: I have no idea what's causing the difference in the results we're getting... We're using pretty much the same architecture, if I'm not mistaken.

That regex performance observation is pretty interesting, though.

04 Nov 2007
14:48 PM

Eamon Nerbonne

On an Athlon 64 3000+, using the updated code and "Release" mode outside VS.NET:

IndexOf: around 6 sec. (5.6 to 6.2 spread)

Regex: around 4 sec (3.8 to 4.5 spread)

same executable run under mono:

IndexOf: 13.641

Regex: 36.297

Running with a debugger attached can cause odd performance quirks, esp. if the code has a managed/unmanaged boundary (i.e. generally not with such simple code), as can running a debug build.

I would generally prefer regular expressions, were it not that they're not integrated into C# particularly intuitively. Using explicit capture groups as so:

new Regex("^((?<group>[^,]),)([^,]*)$", RegexOptions.Compiled | RegexOptions.CultureInvariant|RegexOptions.ExplicitCapture);

is in my opinion the most maintainable usage pattern, which you'ld extract information from using Groups:

            foreach(Capture cap in r2.Match(testStr).Groups["group"].Captures) {

                string z = cap.Value;

            }

Of course, it's not exactly always obvious how to modify such regular expression, but modifying complex parsers written in hand-coded procedural style isn't ever easier either, irrespective of the language.

I'd use string.Split if that's suitable, which it is in this example, and otherwise a Regex for maintainability.

05 Nov 2007
12:56 PM

Paul Cowan

I ran into a performance problem with Cuyahoga and Regexes.

You should always use static methods of the Regex like so:

Regex.IsMatch

The test and I seem to remember the RegexOptions.Compiled flag is bad for performance.....

05 Nov 2007
14:02 PM

Ramon

I want to add this one to be complete. I think that it is much more readable but it is slow :-)

    start = DateTime.Now;

    r = new Regex("([^,]+),", RegexOptions.Compiled);

    for (int i = 0; i < count; i++){

        MatchCollection matches = r.Matches(testStr);

        for(int j=0;j<matches.Count; ++j){

            string z = matches[j].Value;

        }

    }

    Console.WriteLine(DateTime.Now - start);

05 Nov 2007
22:54 PM

Derick Bailey

i think you accounted for the overhead of compiling the regex incorrectly.

you have "start = DateTime.Now" before the regex compile. A better test would be to compile the regex before setting the start time.

the reason i suggest this, is to show that you only have to compile the regex once, during the initializing of your application (during a static constructor or something on the object that needs it). after that, the regex will likely be much faster.

but as many previous comments said - it all depends on your scenario and how large the string is.

also - be careful about constantly regenerating compiled regex. i have heard that there is a memory leak caused by this (same problem as using codedom to compile code on the fly). not sure about this, though.

05 Nov 2007
23:26 PM

Derick Bailey

I've posted an optimized version of the regex so that it properly uses regex functionality instead of trying to mimic what IndexOf does. see my blog:

http://www.avocadosoftware.com/csblogs/dredge/archive/2007/11/05/776.aspx

05 Nov 2007
23:32 PM

Derick Bailey

looks like the IndexOf is faster in this case... just found a huge bug in the code i posted on my blog, and have corrected it...

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB