Sunday, November 10, 2013

Determine the Exact Position of an XmlReader

I was dealing with a scenario today where I wanted to read through an XML file, capturing the locations of various elements so that I could come back on a second pass and process the file using random access. To construct the XmlReader, I first created a StreamReader on the file and then passed it to the XmlReader.Create method.

Initially I looked at the StreamReader’s BaseStream property, but that stream reports its Position in increments of 1024 bytes because of the StreamReader’s buffering. I poked around for a solution, and when I couldn’t find one I decided to roll my own.
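To make the buffering effect concrete, here is a minimal sketch. It uses a MemoryStream as a stand-in for the file and passes the 1024-byte buffer size explicitly; the class name and the 5000-character payload are made up for illustration. After consuming a single character, the underlying stream's Position should already reflect a whole buffer's worth of data rather than one byte.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferedPositionDemo
{
    public static void Main()
    {
        // 5000 single-byte characters standing in for a small XML file (assumption).
        byte[] data = Encoding.ASCII.GetBytes(new string('x', 5000));

        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.ASCII, false, 1024))
        {
            reader.Read(); // consume exactly one character

            // The reader filled its internal buffer from the stream, so the
            // stream position tracks the buffer fill, not the single character.
            Console.WriteLine("Characters consumed: 1");
            Console.WriteLine("BaseStream.Position: " + reader.BaseStream.Position);
        }
    }
}
```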

Here is what I came up with, but be warned that your mileage may vary because I’m doing non-future-proof things like reflecting on private properties of the .NET Framework classes.

The key to the approach is that the XmlReader is backed by a StreamReader, which is itself backed by a FileStream. Having access to the underlying FileStream gives us visibility into how much of the file has been read into the internal buffers of the XmlReader and the StreamReader.

Since the XmlReader uses an internal buffer of 4096 bytes, it reads the first 4096 bytes to fill that buffer when it is first initialized from the StreamReader. Since the StreamReader uses an internal buffer of 1024 bytes, the XmlReader’s initialization forces the StreamReader to retrieve four chunks of 1024 bytes. With its own internal buffer exhausted, the StreamReader then reads ahead another 1024 bytes in the FileStream.

The difficulty comes in determining where in the original file the XmlReader is positioned at any given moment since there are no public properties that report that information.  As it turns out, we can calculate it by reading a few additional private fields on the StreamReader and XmlReader implementation classes.  The basic formula looks (almost) like this: 

Actual XmlReader Position =
    FileStream Position - StreamReader Buffer Size - XmlReader Buffer Size
    + XmlReader Buffer Position + StreamReader Buffer Position
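As a quick sanity check of the basic formula, here is a sketch that plugs in the state described earlier: right after initialization the FileStream has advanced 4096 + 1024 = 5120 bytes, and the StreamReader has just refilled, so its buffer position is 0. The helper name ComputePosition and the XmlReader buffer position of 100 are made up for illustration, and the example assumes single-byte characters and no preamble.

```csharp
using System;

public static class PositionFormulaDemo
{
    // Hypothetical helper: the basic formula written as plain arithmetic.
    public static long ComputePosition(
        long fileStreamPos,
        long streamReaderBufferSize,
        long xmlReaderBufferSize,
        long xmlReaderBufferPos,
        long streamReaderBufferPos)
    {
        return fileStreamPos - streamReaderBufferSize - xmlReaderBufferSize
               + xmlReaderBufferPos + streamReaderBufferPos;
    }

    public static void Main()
    {
        // 5120 - 1024 - 4096 + 100 + 0
        long pos = ComputePosition(5120, 1024, 4096, 100, 0);
        Console.WriteLine(pos); // prints 100
    }
}
```

In this state the two buffer sizes cancel the bytes that were read ahead, leaving just the XmlReader's offset into its own buffer.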

Here’s the code for the XmlReader extension method:


using System.IO;
using System.Reflection;
using System.Xml;

public static class XmlReaderExtensions
{
    private const long DefaultStreamReaderBufferSize = 1024;

    public static long GetPosition(this XmlReader xr, StreamReader underlyingStreamReader)
    {
        // Get the position of the FileStream
        long fileStreamPos = underlyingStreamReader.BaseStream.Position;

        // Get current XmlReader state
        long xmlReaderBufferLength = GetXmlReaderBufferLength(xr);
        long xmlReaderBufferPos = GetXmlReaderBufferPosition(xr);

        // Get current StreamReader state
        long streamReaderBufferLength = GetStreamReaderBufferLength(underlyingStreamReader);
        int streamReaderBufferPos = GetStreamReaderBufferPos(underlyingStreamReader);
        long preambleSize = GetStreamReaderPreambleSize(underlyingStreamReader);

        // Calculate the actual file position.
        // Note: the StreamReader buffer is subtracted only when it is completely
        // full (1024 bytes); a partially filled buffer is not subtracted at all.
        long pos = fileStreamPos
            - (streamReaderBufferLength == DefaultStreamReaderBufferSize ? DefaultStreamReaderBufferSize : 0)
            - xmlReaderBufferLength
            + xmlReaderBufferPos + streamReaderBufferPos - preambleSize;

        return pos;
    }

    #region Supporting methods

    // Each helper reads a non-public property via reflection and caches the
    // PropertyInfo. These are implementation details of the .NET Framework
    // classes and may be missing (GetProperty returns null) on other versions.

    private static PropertyInfo _xmlReaderBufferSizeProperty;

    // Length of the XmlReader's internal parsing buffer.
    private static long GetXmlReaderBufferLength(XmlReader xr)
    {
        if (_xmlReaderBufferSizeProperty == null)
        {
            _xmlReaderBufferSizeProperty = xr.GetType()
                .GetProperty("DtdParserProxy_ParsingBufferLength",
                    BindingFlags.Instance | BindingFlags.NonPublic);
        }

        return (int) _xmlReaderBufferSizeProperty.GetValue(xr);
    }

    private static PropertyInfo _xmlReaderBufferPositionProperty;

    // Current position within the XmlReader's internal parsing buffer.
    private static int GetXmlReaderBufferPosition(XmlReader xr)
    {
        if (_xmlReaderBufferPositionProperty == null)
        {
            _xmlReaderBufferPositionProperty = xr.GetType()
                .GetProperty("DtdParserProxy_CurrentPosition",
                    BindingFlags.Instance | BindingFlags.NonPublic);
        }

        return (int) _xmlReaderBufferPositionProperty.GetValue(xr);
    }

    private static PropertyInfo _streamReaderPreambleProperty;

    // Size in bytes of the encoding preamble (byte order mark), if any.
    private static long GetStreamReaderPreambleSize(StreamReader sr)
    {
        if (_streamReaderPreambleProperty == null)
        {
            _streamReaderPreambleProperty = sr.GetType()
                .GetProperty("Preamble_Prop",
                    BindingFlags.Instance | BindingFlags.NonPublic);
        }

        return ((byte[]) _streamReaderPreambleProperty.GetValue(sr)).Length;
    }

    private static PropertyInfo _streamReaderByteLenProperty;

    // Number of bytes currently held in the StreamReader's byte buffer.
    private static long GetStreamReaderBufferLength(StreamReader sr)
    {
        if (_streamReaderByteLenProperty == null)
        {
            _streamReaderByteLenProperty = sr.GetType()
                .GetProperty("ByteLen_Prop",
                    BindingFlags.Instance | BindingFlags.NonPublic);
        }

        return (int) _streamReaderByteLenProperty.GetValue(sr);
    }

    private static PropertyInfo _streamReaderBufferPositionProperty;

    // Current position within the StreamReader's character buffer.
    private static int GetStreamReaderBufferPos(StreamReader sr)
    {
        if (_streamReaderBufferPositionProperty == null)
        {
            _streamReaderBufferPositionProperty = sr.GetType()
                .GetProperty("CharPos_Prop",
                    BindingFlags.Instance | BindingFlags.NonPublic);
        }

        return (int) _streamReaderBufferPositionProperty.GetValue(sr);
    }

    #endregion
}
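For completeness, here is a rough sketch of how the extension method might be called during the first pass. It assumes the XmlReaderExtensions class above is compiled into the same project and that the private properties it reflects on exist in the runtime being used; "data.xml" and the class name are placeholders. Exactly which offset is reported relative to an element's start tag is worth verifying against a file with known contents.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class GetPositionUsage
{
    public static void Main()
    {
        var positions = new List<KeyValuePair<string, long>>();

        // "data.xml" is a placeholder path (assumption).
        using (var sr = new StreamReader("data.xml"))
        using (var xr = XmlReader.Create(sr))
        {
            while (xr.Read())
            {
                if (xr.NodeType == XmlNodeType.Element)
                {
                    // Record the position reported once the reader is on the element.
                    positions.Add(new KeyValuePair<string, long>(xr.Name, xr.GetPosition(sr)));
                }
            }
        }

        foreach (var entry in positions)
        {
            Console.WriteLine(entry.Key + " @ " + entry.Value);
        }
    }
}
```

The same StreamReader instance has to be passed to both XmlReader.Create and GetPosition, since the calculation depends on that reader's buffer state.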

4 comments:

Unknown said...

Do we require this?

- (streamReaderBufferLength == DefaultStreamReaderBufferSize ? DefaultStreamReaderBufferSize : 0)

I was also running some tests reading XML. Suppose streamReaderBufferLength is 540 and the default is 1024: couldn't we simply subtract the 540, since we add streamReaderBufferPos back afterwards? Please let me know your thoughts on it.

Unknown said...

Looking at this line now, months after writing the code, it does look a little odd. Are you seeing that it needs to subtract the 540 to be reporting the correct value in your scenario?

Also, what Windows OS and version of .NET are you using?

Kilian Hekhuis said...

Just encountered this (I know it's old, but I have the same problem), but it doesn't seem to work. Apart from the StreamReader not being needed (an XmlReader can be created directly from a FileStream), it seems the XmlReader copies part of the end of its previous buffer to the new one, skipping stuff in the process (like < and \). I'm currently investigating whether it's possible to use DtdParserProxy_LineStartPosition (it's the virtual start of the current line and can be negative if it's a long line needing a buffer refresh).