String Interning: The Garbage Collectible Way
Since I know people will want the actual implementation, here is a simple way of handling string interning that still allows you to GC the results at some point. The issue is simple: I want to intern strings (so a string value is held only once throughout my entire app), but I don't want to be stuck with them once the profiler state has been cleared, for example.
using System.Collections.Generic;
using System.Threading;

public class GarbageCollectibleStringInterning
{
    private static readonly IDictionary<string, string> strings = new Dictionary<string, string>();
    private static readonly ReaderWriterLockSlim locker = new ReaderWriterLockSlim();

    public static void Clear()
    {
        locker.EnterWriteLock();
        try
        {
            strings.Clear();
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }

    public static string Intern(string str)
    {
        string val;

        // Fast path: most strings are already interned, so a shared read lock is enough.
        locker.EnterReadLock();
        try
        {
            if (strings.TryGetValue(str, out val))
                return val;
        }
        finally
        {
            locker.ExitReadLock();
        }

        // Slow path: take the write lock, check again (another thread may have
        // added the string in the meantime), then add it.
        locker.EnterWriteLock();
        try
        {
            if (strings.TryGetValue(str, out val))
                return val;
            strings.Add(str, str);
            return str;
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }
}
This is a fairly simple implementation; a more complex one might try to respond dynamically to GC notifications, but I think this would be useful enough on its own.
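As a quick, purely illustrative usage sketch (the input strings are made up; new string(...) is used only to guarantee distinct instances going in), two Intern calls for equal strings resolve to the same reference:

// Illustrative only: both calls return the first instance that was added to the table.
string a = GarbageCollectibleStringInterning.Intern(new string('x', 4));
string b = GarbageCollectibleStringInterning.Intern(new string('x', 4));
System.Console.WriteLine(object.ReferenceEquals(a, b)); // True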
Using this approach, I was able to reduce used memory in the profiler by over 50%. I gave up on that approach, however, because while it may reduce the memory footprint, it doesn't actually solve the problem, it only delays it.
Comments
Any reason why you don't use EnterUpgradeableReadLock in the Intern method?
Robert,
I'll answer that in a separate post.
Robert,
The short answer is that only one thread at a time can hold the upgradeable read lock, so EnterUpgradeableReadLock would produce far more lock contention than the separate read/write locking.
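For reference, here is a rough sketch (not from the actual code base) of what the upgradeable variant would look like if it were added inside the class above, reusing its strings and locker fields. Since ReaderWriterLockSlim admits only one thread into upgradeable mode at a time, even lookups of strings that are already interned would serialize on this path:

// Hypothetical alternative, assuming it lives inside GarbageCollectibleStringInterning.
public static string InternWithUpgradeableLock(string str)
{
    string val;
    // Only one thread at a time may hold the upgradeable read lock,
    // so concurrent callers queue up here even for pure lookups.
    locker.EnterUpgradeableReadLock();
    try
    {
        if (strings.TryGetValue(str, out val))
            return val;
        locker.EnterWriteLock();
        try
        {
            strings.Add(str, str);
            return str;
        }
        finally
        {
            locker.ExitWriteLock();
        }
    }
    finally
    {
        locker.ExitUpgradeableReadLock();
    }
}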
Another thing you can do to reduce the memory footprint further is UTF-8 encoding. Store the strings as byte[] instead, by calling Encoding.UTF8.GetBytes(string), to circumvent the default .NET UTF-16 encoding. You can use Encoding.UTF8.GetString to convert back to a string as needed. Some characters will still need more than one byte, but for mostly-ASCII strings this gives a memory benefit. As you said before, though, it doesn't solve your problem, it just delays it.
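A minimal sketch of the byte[] idea described in this comment; the class and method names are illustrative, not from the profiler:

using System.Text;

public static class Utf8StringStore
{
    // Store the UTF-8 bytes instead of the in-memory UTF-16 string;
    // mostly-ASCII text takes roughly half the memory this way.
    public static byte[] ToUtf8(string value)
    {
        return Encoding.UTF8.GetBytes(value);
    }

    // Converting back allocates a brand new string on every call, which is
    // the extra GC work pointed out in the reply below.
    public static string FromUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}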
Why Dictionary<string,string> and not HashSet<string>?
I know, I know, delay vs. solve.
Imran,
That would keep allocating new strings every time I do the conversion back. I might use less memory in total, but I would allocate a lot more (more GC work).
Ken,
Because there is no way with a HashSet to get the matching value back out, and I care about the particular reference that I end up using.
Why not leave the string interning code in? Even if it only delays the eventual problem, a 50% decrease in memory sounds like a great thing.
Jon,
Because when you do perf testing, you want to make as few changes as you possibly can, fixing one problem at a time.
The worst thing you can do is try to patch over the root problem.
Have you considered persisting the strings to temporary storage instead of keeping them in memory?
+1 to Dmitry, for example SQLite.
Have you considered a fixed-size buffer of strings, backed by disk storage? E.g. a circular buffer of some sort.
I'm assuming your massive collection is append-only (i.e. no one comes along and removes a string from the middle).
Your disk storage would then be indexed by line number, as would the entries in your buffer. When someone scrolls outside the start and end bounds of your buffer, you load in as many strings as appropriate from storage.
You'll obviously want to optimize the disk storage for reading. If you're using some form of list virtualization, you would probably want a mapping between line number and offset in the disk file, so you can Seek() to the starting line and start reading.
Constrained memory usage, at the expense of a bit of a lag when scrolling far outside the current bounds.
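Here is a rough, purely illustrative sketch of the line-number-to-file-offset index described in the comment above; the class name and layout are made up, and real code would need to handle encodings, line-ending variations, and very large files more carefully:

using System.Collections.Generic;
using System.IO;
using System.Text;

public class LineOffsetIndex
{
    private readonly List<long> offsets = new List<long>();
    private readonly string path;

    public LineOffsetIndex(string path)
    {
        this.path = path;
        // Record the byte offset at which each line starts.
        offsets.Add(0);
        using (var stream = File.OpenRead(path))
        {
            int b;
            long position = 0;
            while ((b = stream.ReadByte()) != -1)
            {
                position++;
                if (b == '\n')
                    offsets.Add(position);
            }
        }
    }

    // Seek() straight to the requested line and read a page of lines from there.
    public IEnumerable<string> ReadLines(int startLine, int count)
    {
        using (var stream = File.OpenRead(path))
        {
            stream.Seek(offsets[startLine], SeekOrigin.Begin);
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                for (var i = 0; i < count; i++)
                {
                    var line = reader.ReadLine();
                    if (line == null)
                        yield break;
                    yield return line;
                }
            }
        }
    }
}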
@Leon & others
It doesn't make sense to 'scroll' through the buffer of text messages if the buffer contains a gigabyte of data or more. You won't find anything by just scrolling; you need a search engine. So keeping the raw text in memory doesn't make sense either; it is much better to keep just the index in memory and retrieve the raw text only when drilling down into the details.
@Ayende,
My colleague recently built an OLAP cube over IIS log files and uses that cube to analyze the application's behavior. It was astonishing how much information he retrieved: he was able to create application performance graphs, analyze user activity, identify application URLs with excessive execution times or data transfer, find errors in integration interfaces, and identify several bottlenecks, all by just slicing and dicing the cube.
Maybe something similar would be useful in your profiler? Statistical analysis is a very effective profiling method, especially when you have to analyze large data sets, and today we are talking about amounts like millions of rows of data per day. We use OLAP as a capacity management tool: we process large amounts of information (think months of IIS logs) and analyze the patterns and trends that emerge over time, so we have a chance to identify performance problems before customers notice them.
There is a problem in this case: if you intend to have only one string reference for a given string, you will fail.
If you do something like this, for example:
lock(String.Intern(product.Id))
{
}
The "Clear" method will break this kind of trick.
Perhaps WeakReference would be useful here.
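As an illustration of the concern raised above, here is a hypothetical snippet against the GarbageCollectibleStringInterning class (the Make helper is made up and simply produces equal but distinct string instances):

// Hypothetical helper: returns a fresh, non-literal instance on each call.
static string Make() { return new string("product-42".ToCharArray()); }

string before = GarbageCollectibleStringInterning.Intern(Make());
GarbageCollectibleStringInterning.Clear();
string after = GarbageCollectibleStringInterning.Intern(Make());

// Equal strings, but different references once the table has been cleared,
// so lock(before) and lock(after) would guard different monitors.
System.Console.WriteLine(object.ReferenceEquals(before, after)); // False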
Hernique,
I would NEVER do something like that