3

How can I extract attribute value from XML element using XML Extractor in U-SQL for my Azure data lake analytic job.

Update: More details about the issue

My XML file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<testelement testatr="xyz">
</testelement>

Here is my U-SQL Script:

DECLARE @testfile string = "sample2.xml";
@logText =
EXTRACT log string            
FROM @testfile
USING Extractors.Tsv();

@gethID = SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(@logText.log, "testelement/attribute::testatr").ElementAt(0) AS siteName FROM @logText;
OUTPUT @gethID TO "result.out" USING Outputters.Tsv(); 

After debugging I observed, exception occurred when Load method of XPath class try to load:

"<?xml version=1.0 encoding=utf-8?>"

Here is an exception:

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: '1.0' is an unexpected token. The expected token is '\"' or '''. Line 1, position 15.\nCurrent row dump: \tlog:\t\"<?xml version=1.0 encoding=utf-8?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n   at System.Xml.XmlTextReaderImpl.ParseXmlDeclaration(Boolean isTextDecl)
\n   at System.Xml.XmlTextReaderImpl.Read()
\n   at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)
\n   at System.Xml.XmlDocument.Load(XmlReader reader)
\n   at System.Xml.XmlDocument.LoadXml(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n   at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

Update 2:

After using quoting:false I am getting another exception:

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: Root element is missing.\nCurrent row dump: \tlog:\t\"<?xml version=\"1.0\" encoding=\"utf-8\"?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n   at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
\n   at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
\n   at System.Xml.XmlDocument.Load(XmlReader reader)
\n   at System.Xml.XmlDocument.LoadXml(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n   at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

2 Answers 2

3

You identify the values with XPath expressions. An attribute is queried with @attr_name (or the full axis expression attribute::attr_name).

UPDATE based on question update:

It looks like that the parser somehow gets confused by the " inside the XML declaration. I see that you use the built-in Tsv() extractor that per default currently processes " inside a field as a quoting character and then drops it. This is a bug we are planning on fixing.

Until then I suggest that you use Extractors.Tsv(quoting:false).

Also please make sure that your XML document does not contain any CR/LF if you are using any of the built-in text extractors (Extractors.*) and that it does not contain a tab-value if you are using .Tsv.

If your XML will contain CR and/or LF then you will have to use a custom extractor to use a different row-delimiter. If you need to do that, please leave me a message, because I am currently tracking such requests to see what we can improve in the built-in extractors.

If your file only contains a single XML document (as opposed to several rows of XML documents) I would suggest to use the XML extractor that is also part of the XML sample on GitHub.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks Michael, I tried this approach but got an exception. Please see updated questions details.
Thanks Jamil. I updated my answer based on your additional details.
0

On the new error message: It looks like the XML document contains a CR or LF or both after the XML declaration and thus the Tsv() extractor splits the XML document. See my comment in the previous answer:

Also please make sure that your XML document does not contain any CR/LF if you are using any of the built-in text extractors (Extractors.*) and that it does not contain a tab-value if you are using .Tsv.

If your XML will contain CR and/or LF then you will have to use a custom extractor to use a different row-delimiter. If you need to do that, please leave me a message, because I am currently tracking such requests to see what we can improve in the built-in extractors.

4 Comments

So I guess, currently, I don't have any solution in default extractors to get attribute value from XML that contains CR/LF, right?
Correct. You should instead use the extractors provided in the sample library. Or remove the CR/LF (if they are so called "insignificant whitespace" in the XML).
Hi @michael-rys, Could you please give me some example for xaml extractors provided in the sample library for extracting attribute value from XML? I can see example for extracting element value but not for attribute value.
Hi Jamil... did you try to write @name to reference the attribute (instead of name to reference the element)?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.