String Interning: The Garbage Collectible Way
Since I know people will want the actual implementation, here is a simple way of handling string interning that still allows you to GC the results at some point. The issue is simple: I want to intern strings (so a string value is held only once throughout my entire app), but I don't want to be stuck with them once the profiler state has been cleared, for example.
using System.Collections.Generic;
using System.Threading;

public class GarbageCollectibleStringInterning
{
    private static readonly IDictionary<string, string> strings = new Dictionary<string, string>();
    private static readonly ReaderWriterLockSlim locker = new ReaderWriterLockSlim();

    public static void Clear()
    {
        locker.EnterWriteLock();
        try
        {
            strings.Clear();
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }

    public static string Intern(string str)
    {
        string val;

        // Fast path: most strings are already interned, so a shared read lock is enough.
        locker.EnterReadLock();
        try
        {
            if (strings.TryGetValue(str, out val))
                return val;
        }
        finally
        {
            locker.ExitReadLock();
        }

        // Slow path: take the write lock, check again (another thread may have
        // added the string in the meantime), then add it.
        locker.EnterWriteLock();
        try
        {
            if (strings.TryGetValue(str, out val))
                return val;
            strings.Add(str, str);
            return str;
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }
}
This is a fairly simple implementation; a more complex one might try to respond dynamically to GC notifications, but I think this would be useful enough on its own.
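As a quick, purely illustrative usage sketch (the input strings are made up; new string(...) is used only to guarantee distinct instances going in), two Intern calls for equal strings resolve to the same reference:

// Illustrative only: both calls return the first instance that was added to the table.
string a = GarbageCollectibleStringInterning.Intern(new string('x', 4));
string b = GarbageCollectibleStringInterning.Intern(new string('x', 4));
System.Console.WriteLine(object.ReferenceEquals(a, b)); // True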
Using this approach, I was able to reduce used memory in the profiler by over 50%. I gave up on that approach, however, because while it may reduce the memory footprint, it doesn't actually solve the problem, it only delays it.
Comments
Any reason why you don't use EnterUpgradeableReadLock in the Intern method?
Robert,
I'll answer that in a separate post.
Robert,
The short answer is that only one thread at a time can hold the upgradeable read lock, so EnterUpgradeableReadLock would produce far more lock contention than the separate read/write locking.
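For reference, here is a rough sketch (not from the actual code base) of what the upgradeable variant would look like if it were added inside the class above, reusing its strings and locker fields. Since ReaderWriterLockSlim admits only one thread into upgradeable mode at a time, even lookups of strings that are already interned would serialize on this path:

// Hypothetical alternative, assuming it lives inside GarbageCollectibleStringInterning.
public static string InternWithUpgradeableLock(string str)
{
    string val;
    // Only one thread at a time may hold the upgradeable read lock,
    // so concurrent callers queue up here even for pure lookups.
    locker.EnterUpgradeableReadLock();
    try
    {
        if (strings.TryGetValue(str, out val))
            return val;
        locker.EnterWriteLock();
        try
        {
            strings.Add(str, str);
            return str;
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }
    finally
    {
        locker.ExitUpgradeableReadLock();
    }
}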
Another thing you can do to reduce the memory footprint further is UTF-8 encoding. Store the strings as byte[] instead, by calling Encoding.UTF8.GetBytes(string), to circumvent the default .NET UTF-16 encoding. You can use Encoding.UTF8.GetString to convert back to a string as needed. Some characters will still need more than one byte, but for mostly-ASCII strings this gives a memory benefit. As you said before, though, it doesn't solve your problem, it just delays it.
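A minimal sketch of the byte[] idea described in this comment; the class and method names are illustrative, not from the profiler:

using System.Text;

public static class Utf8StringStore
{
    // Store the UTF-8 bytes instead of the in-memory UTF-16 string;
    // mostly-ASCII text takes roughly half the memory this way.
    public static byte[] ToUtf8(string value)
    {
        return Encoding.UTF8.GetBytes(value);
    }

    // Converting back allocates a brand new string on every call, which is
    // the extra GC work pointed out in the reply below.
    public static string FromUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}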
Why Dictionary<string,string> and not HashSet<string>?
I know, I know, delay vs. solve.
Imran,
That would keep allocating new strings every time I do the conversion back. I might use less memory in total, but I would allocate a lot more (more GC work).
Ken,
Because there is no way with a HashSet to get the matching value back out, and I care about the particular reference that I end up using.
Why not leave the string interning code in? Even if it only delays the eventual problem, a 50% decrease in memory sounds like a great thing.
Jon,
Because when you do perf testing, you want to make as few changes as you possibly can, fixing one problem at a time.
The worst thing you can do is try to patch over the root problem.
Have you considered persisting the strings to temporary storage instead of keeping them in memory?
+1 to Dmitry, for example SQLite.
Have you considered a fixed-size buffer of strings, backed by disk storage? E.g. a circular buffer of some sort.
I'm assuming your massive collection is append-only (i.e. no one comes along and removes a string from the middle).
Your disk storage would then be indexed by line number, as would the entries in your buffer. When someone scrolls outside the start and end bounds of your buffer, you load in as many strings as appropriate from storage.
You'll obviously want to optimize the disk storage for reading. If you're using some form of list virtualization, you would probably want a mapping between line number and offset in the disk file, so you can Seek() to the starting line and start reading.
Constrained memory usage, at the expense of a bit of a lag when scrolling far outside the current bounds.
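Here is a rough, purely illustrative sketch of the line-number-to-file-offset index described in the comment above; the class name and layout are made up, and real code would need to handle encodings, line-ending variations, and very large files more carefully:

using System.Collections.Generic;
using System.IO;
using System.Text;

public class LineOffsetIndex
{
    private readonly List<long> offsets = new List<long>();
    private readonly string path;

    public LineOffsetIndex(string path)
    {
        this.path = path;
        // Record the byte offset at which each line starts.
        offsets.Add(0);
        using (var stream = File.OpenRead(path))
        {
            int b;
            long position = 0;
            while ((b = stream.ReadByte()) != -1)
            {
                position++;
                if (b == '\n')
                    offsets.Add(position);
            }
        }
    }

    // Seek() straight to the requested line and read a page of lines from there.
    public IEnumerable<string> ReadLines(int startLine, int count)
    {
        using (var stream = File.OpenRead(path))
        {
            stream.Seek(offsets[startLine], SeekOrigin.Begin);
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                for (var i = 0; i < count; i++)
                {
                    var line = reader.ReadLine();
                    if (line == null)
                        yield break;
                    yield return line;
                }
            }
        }
    }
}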
@Leon & others
It doesn't make sense to 'scroll' through the buffer of text messages if the buffer contains a gigabyte of data or more. You won't find anything by just scrolling; you need a search engine. So keeping the raw text in memory doesn't make sense either; it is much better to keep just the index in memory and retrieve the raw text only when drilling down into the details.
@Ayende,
My colleague recently built an OLAP cube over IIS log files and uses that cube to analyze the application's behavior. It was astonishing how much information he retrieved: he was able to create application performance graphs, analyze user activity, identify application URLs with excessive execution times or data transfer, find errors in integration interfaces, and identify several bottlenecks, all by just slicing and dicing the cube.
Maybe something similar would be useful in your profiler? Statistical analysis is a very effective profiling method, especially when you have to analyze large data sets, and today we are talking about amounts like millions of rows of data per day. We use OLAP as a capacity management tool: we process large amounts of information (think months of IIS logs) and analyze the patterns and trends that emerge over time, so we have a chance to identify performance problems before customers notice them.
There is a problem in this case: if you intend to have only one string reference for a given string, you will fail.
If you do something like this, for example:
lock(String.Intern(product.Id))
{
}
The "Clear" method will break this kind of trick.
Perhaps WeakReference would be useful here.
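As an illustration of the concern raised above, here is a hypothetical snippet against the GarbageCollectibleStringInterning class (the Make helper is made up and simply produces equal but distinct string instances):

// Hypothetical helper: returns a fresh, non-literal instance on each call.
static string Make() { return new string("product-42".ToCharArray()); }

string before = GarbageCollectibleStringInterning.Intern(Make());
GarbageCollectibleStringInterning.Clear();
string after = GarbageCollectibleStringInterning.Intern(Make());

// Equal strings, but different references once the table has been cleared,
// so lock(before) and lock(after) would guard different monitors.
System.Console.WriteLine(object.ReferenceEquals(before, after)); // False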
Hernique,
I would NEVER do something like that