I am trying to parse an XML column in a SQL Server table, extracting its contents into new columns for the dataframe I am trying to create. I keep getting the error
Msg 9420, Level 16, State 1, Line 1
XML parsing: line 20, character 2005, illegal xml character
and I don't know how to resolve it. The illegal character does not exist in every row's XML column.
My SQL code parsed 570,000 rows before it hit a row with an illegal character and stopped running. My WHERE clause is supposed to select 1,200,000 rows, so the query got through just under half of the needed rows before quitting. The XML column is stored as varchar, so I need to CAST it to XML in order to parse its contents.
The SQL code itself does work: it runs successfully against the raw table, which contains a mix of production data and fake testing data. I then got access to the production-only table, and that is the table where I hit the error. Something must have happened to the data when it was transferred to the production-only table.
I tried searching posts for something that could help, but I couldn't find anything. I don't know how to locate the error within the 1.2M records I am working with or which of the parsed columns is causing the problem. Is there a way for the parsing algorithm to skip over offending rows and continue to parse the remaining records?
My code is:
SELECT [Id]
,[EventDateTime]
,[TenantId]
,[EventType]
,[EventXml]
,[InsertDateTime]
,[AppInstanceId]
,[TokenCorrelationId]
,[AuditCorrelationId]
,[AuditId]
,CAST([EventXml] as XML).value('/PrescriptionEvent[1]/DateTimeStamp[1]','NVARCHAR(max)') AS xml_DateTimeStamp
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/AuditCorrelationId[1]','NVARCHAR(max)')) AS xml_AuditCorrelationId
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/TokenCorrelationId[1]','NVARCHAR(max)')) AS xml_TokenCorrelationId
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/ActingUserId[1]/Value[1]','NVARCHAR(max)')) AS xml_ActingUserId
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/ActingUserId[1]/LegacyId[1]','NVARCHAR(max)')) AS xml_ActingUserId_LegacyId
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/TenantId[1]/Value[1]','NVARCHAR(max)')) AS xml_TenantId
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/TenantId[1]/LegacyId[1]','NVARCHAR(max)')) AS xml_TenantId_LegacyId
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/AppInstanceId[1]/Value[1]','NVARCHAR(max)')) AS xml_AppInstanceId
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/AppInstanceId[1]/LegacyId[1]','NVARCHAR(max)')) AS xml_AppInstanceId_LegacyId
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/ActionType[1]','NVARCHAR(max)')) AS xml_ActionType
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/Outcome[1]','NVARCHAR(max)')) AS xml_Outcome
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/OutcomeReason[1]','NVARCHAR(max)')) AS xml_OutcomeReason
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/RxSigningWorkflowActivity[1]','NVARCHAR(max)')) AS xml_RxSigningWorkflowActivity
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/Waypoint[1]','NVARCHAR(max)')) AS xml_Waypoint
,UPPER(CAST([EventXml] as XML).value('/PrescriptionEvent[1]/PrescriptionReferenceId[1]','NVARCHAR(max)')) AS xml_PrescriptionReferenceId
FROM [EpcsAuditDB].[dbo].[EpcsAuditEventData]
WHERE [EventType] = 4 AND [EventDateTime] >= '2020-03-24'
Example of the XML (this one does not contain the illegal character; I don't know how to find a row that does):
<?xml version="1.0" encoding="utf-8"?>
<PrescriptionEvent xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <DateTimeStamp>2020-03-24T19:54:33.0169582Z</DateTimeStamp>
  <Outcome>true</Outcome>
  <OutcomeReason />
  <AuditCorrelationId>3a4fb1cd-c39c-4e84-bfc4-dee98b29be2e</AuditCorrelationId>
  <TokenCorrelationId>d80bbd23-2e1d-44b3-9452-972b54f35cc9</TokenCorrelationId>
  <ActingUserId>
    <Value>91f78a00-ce26-4088-88eb-11x5565910d7</Value>
  </ActingUserId>
  <TenantId>
    <Value>00000000-0000-0000-0000-000000000000</Value>
    <LegacyId>10051804</LegacyId>
  </TenantId>
  <AppInstanceId>
    <Value>00000000-0000-0000-0000-000000000000</Value>
    <LegacyId>Hospital</LegacyId>
  </AppInstanceId>
  <PrescriptionReferenceId>ecf5fd42-096e-ea11-a852-005056a9ea50</PrescriptionReferenceId>
  <AdditionalPrescriptionReferenceId />
  <ActionType>Received</ActionType>
  <RxSigningWorkflowActivity>RxArchive</RxSigningWorkflowActivity>
  <Waypoint>SMS</Waypoint>
</PrescriptionEvent>
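
To make the "skip the offending rows" idea concrete, here is a rough sketch of the behavior I am after. It assumes SQL Server 2012 or later, where TRY_CAST is available and (as far as I understand) returns NULL instead of raising an error when a varchar value cannot be converted to XML. The aliases t, x, and Doc are placeholders I made up, only two of the parsed columns are shown, and I have not been able to confirm this against the production-only table.

```sql
-- Sketch only: assumes SQL Server 2012+ so that TRY_CAST exists.
-- TRY_CAST yields NULL where a plain CAST would fail with Msg 9420.

-- 1) Locate the rows whose EventXml cannot be converted to XML.
SELECT [Id]
,[EventXml]
FROM [EpcsAuditDB].[dbo].[EpcsAuditEventData]
WHERE [EventType] = 4 AND [EventDateTime] >= '2020-03-24'
AND [EventXml] IS NOT NULL
AND TRY_CAST([EventXml] AS XML) IS NULL;

-- 2) Parse only the rows that convert cleanly, casting once per row.
--    t, x and Doc are hypothetical aliases used for illustration.
SELECT t.[Id]
,x.[Doc].value('/PrescriptionEvent[1]/DateTimeStamp[1]','NVARCHAR(max)') AS xml_DateTimeStamp
,UPPER(x.[Doc].value('/PrescriptionEvent[1]/Outcome[1]','NVARCHAR(max)')) AS xml_Outcome
FROM [EpcsAuditDB].[dbo].[EpcsAuditEventData] AS t
CROSS APPLY (SELECT TRY_CAST(t.[EventXml] AS XML) AS [Doc]) AS x
WHERE t.[EventType] = 4 AND t.[EventDateTime] >= '2020-03-24'
AND x.[Doc] IS NOT NULL;
```

The first query should list the rows to inspect for the illegal character, and the second should keep running past them rather than stopping at the first bad row.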