Subversion is Xmlish
You know that it is a bad day when you start parsing XML using regular expressions.
When I started working on SvnBridge, I expected to have a lot of issues with TFS. What I didn't expect was to get hit by a Subversion WTF of gigantic magnitude.
Take a look at the following XML:
<?xml version="1.0" encoding="utf-8" ?>
<D:propertyupdate xmlns:D="DAV:" xmlns:V="http://subversion.tigris.org/xmlns/dav/" xmlns:C="http://subversion.tigris.org/xmlns/custom/" xmlns:S="http://subversion.tigris.org/xmlns/svn/">
<D:set>
<D:prop>
<C:bugtraq:label>Work Item:</C:bugtraq:label>
<C:bugtraq:url>http://www.codeplex.com/SvnBridge/WorkItem/View.aspx?WorkItemId=%BUGID%</C:bugtraq:url>
<C:bugtraq:message> Work Item: %BUGID%</C:bugtraq:message>
<C:bugtraq:number>true</C:bugtraq:number>
<C:bugtraq:warnifnoissue>true</C:bugtraq:warnifnoissue>
</D:prop>
</D:set>
</D:propertyupdate>
Check the properties, and look closely. Despite the so-called XML header, this is not valid XML, yet this is produced (and consumed) by Subversion. This raises some interesting questions about what parser they are using, but that is beside the point.
This is wrong, period.
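To give an idea of what parsing this with regular expressions can look like, here is a minimal sketch in Python. It is not the code SvnBridge uses; the pattern, the function name extract_custom_props, and the hard-coded C prefix are all assumptions made for illustration, and the values are returned without any entity unescaping.

```python
import re

# Hypothetical pattern: matches <C:name>value</C:name>, where the name may
# itself contain colons. The backreference \1 requires the closing tag to
# repeat the opening name exactly. The "C" prefix is hard-coded (assumption).
PROP_RE = re.compile(r"<C:([A-Za-z_][\w.:-]*)>(.*?)</C:\1>", re.DOTALL)


def extract_custom_props(body: str) -> dict:
    """Return a {property name: raw text value} mapping from a request body."""
    return {name: value for name, value in PROP_RE.findall(body)}


# Shortened version of the request body shown above.
sample = (
    "<D:set><D:prop>"
    "<C:bugtraq:label>Work Item:</C:bugtraq:label>"
    "<C:bugtraq:number>true</C:bugtraq:number>"
    "</D:prop></D:set>"
)

props = extract_custom_props(sample)
for name, value in props.items():
    print(name, "=", value)
```

A pattern like this only copes with flat, non-nested properties, which is exactly why resorting to it makes for a bad day.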
Comments
Moreover, it has long been easy to get the svn command line to spit out invalid XML. Failing operations can leave open tags behind. This is especially frustrating in automated builds that consume the svn output to get status information.
Did you find any documentation on the svn wire protocol? I wanted to work with it a while back but couldn't find anything useful, even after talking to several svn developers.
Rik,
What SvnBridge did is reverse engineer the protocol.
You can take a look at TestsProtocol to see how it was done.
Basically, it started from a TCP-level sniffer, and the tests were built from there.
I was tempted to do this myself, but it worried me that I'd make too many assumptions - or that the protocol would change 'under' me too rapidly.
Rik,
Why are you trying to simulate the protocol?
You don't have to worry about it changing, I would say. Too much relies on it.
You can also take the SvnBridge source code and use that as a base. You would need to supply an implementation of ISourceCodeProvider, but that's about it.
I was going to write a managed library for talking to svn servers, then use svn as the backend for a website - the idea of revision history being part of the storage was attractive.
Why not use SVN itself for that?
You mean the command line program? Because its output is not good enough to parse.
The output of svn.exe is explicitly designed to be parsed by machines.
But I meant using the SVN server
The svn developers I talked to told me that parsing the output of svn.exe is painful. Perhaps it has similar problems to the one you are seeing.
I'm not sure what you mean by 'using' the SVN server. If I'm not checking in / checking out / diffing etc. using svn.exe or the wire protocol, what else is there?
Storing the information in Subversion itself.
By using svn.exe or talking to the server over the wire, I am storing the information in Subversion itself.
If it is the namespace peculiarity that bothers you (DAV: not exactly being a URI), it goes back into WebDAV history, where the not-yet-solid namespace specifications were misunderstood.
I didn't look for other problems, but I suspect they have an origin in a misunderstanding of WebDAV (and any misunderstandings that still lurk in WebDAV).
Dennis,
Take a look at the element name:
<C:bugtraq:label>
SQL Server 2005 has something similar for its thesaurus.
It does not make life any easier.
That last one is a perfectly valid uri; there's only one colon and both the left and right are valid names.
The <C:bugtraq:label> certainly looks nasty, but I do believe it is syntactically allowable as an element name, as the XML specification seems to permit it.
It does note that colons have a reserved meaning in the XML namespaces specification, so authors shouldn't use them in element names, but it also says that parsers must accept them.
A very quick look at the relevant parts of the specification is here:
http://blog.roblevine.co.uk/?p=11
That is not to say I think this sample of XML is actually nice...
Rob.
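The distinction can be sketched in a few lines of Python. This is an ASCII-only simplification of the real name rules (the full productions also allow many non-ASCII characters), and the function names is_xml_name and is_qname are made up for the example. The idea is that a plain XML 1.0 name may contain any number of colons, while a namespace-qualified name is an optional prefix, one colon, and a local part, neither of which may contain a colon.

```python
import re

# ASCII-only approximations (assumption: non-ASCII name characters ignored).
# XML 1.0 Name: the colon is an ordinary name character.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:-]*")
# Namespaces in XML NCName: same as Name, but without any colon.
NCNAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_xml_name(name: str) -> bool:
    """True if name fits the (simplified) XML 1.0 Name rule."""
    return NAME_RE.fullmatch(name) is not None


def is_qname(name: str) -> bool:
    """True if name fits the (simplified) namespace QName rule."""
    prefix, colon, local = name.partition(":")
    if not colon:
        # Unprefixed name: must be a single colon-free NCName.
        return NCNAME_RE.fullmatch(name) is not None
    # Prefixed name: both halves must be colon-free NCNames.
    return (NCNAME_RE.fullmatch(prefix) is not None
            and NCNAME_RE.fullmatch(local) is not None)


for candidate in ("D:prop", "C:bugtraq:label"):
    print(candidate, is_xml_name(candidate), is_qname(candidate))
```

Under these simplified rules, C:bugtraq:label passes the first check and fails the second, which would explain why it trips up namespace-aware parsers.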
No XML parser that I tried could handle them, and I tried 3 different ones.
Interestingly enough, it turns out that you can get the .Net XmlTextReader to accept this format (not that I knew this before about an hour ago).
More info here:
http://blog.roblevine.co.uk/?p=12
I don't know if that is any help in your quest to read Subversion's xml, but maybe...
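For anyone outside .NET, a roughly analogous idea can be sketched in Python: use a parser that does no namespace processing, so that colons are treated as ordinary name characters. The sketch below uses the standard library's expat binding; the helper name element_names and the shortened sample document are assumptions for illustration, and this is not the XmlTextReader technique from the linked post.

```python
import xml.parsers.expat


def element_names(document: str) -> list:
    """Collect element names as written, without namespace processing."""
    names = []
    # No namespace_separator argument, so expat does not interpret prefixes
    # and hands back each element name exactly as it appears in the text.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(document, True)
    return names


# Shortened, hypothetical sample in the style of the request in the post.
sample = (
    '<D:prop xmlns:D="DAV:" '
    'xmlns:C="http://subversion.tigris.org/xmlns/custom/">'
    "<C:bugtraq:label>Work Item:</C:bugtraq:label>"
    "</D:prop>"
)

names = element_names(sample)
print(names)
```

The price is that the caller has to resolve the prefixes by hand, since the xmlns declarations come back as plain attributes.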