I have the sample XML below and need to extract the distinct invalid email addresses from it. I assume that fragments such as "nested exception is: com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt" and ": Recipient address rejected: User unknown in virtual alias table ;" are always constant.

<?xml version = "1.0" encoding = "UTF-8"?>
<root>
    <Error_Message>Error sending mail message. Cause: javax.mail.SendFailedException: Invalid Addresses;
  nested exception is:
    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;[email protected]>: Recipient address rejected: User unknown in virtual alias table
;
  nested exception is:
    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;[email protected]>: Recipient address rejected: User unknown in virtual alias table
;
  nested exception is:
    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;[email protected]>: Recipient address rejected: User unknown in virtual alias table
;
  nested exception is:
    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;[email protected]>: Recipient address rejected: User unknown in virtual alias table
;
  nested exception is:
    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;[email protected]>: Recipient address rejected: User unknown in virtual alias table
;
  nested exception is:
    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;[email protected]>: Recipient address rejected: User unknown in virtual alias table
;
  nested exception is:
    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;[email protected]>: Recipient address rejected: User unknown in virtual alias table
    </Error_Message>
    <err_mesage>5</err_mesage>
</root>

Expected Output is:

<root>
<EMAILID>[email protected]</EMAILID>
<EMAILID>[email protected]@gmail.com</EMAILID>
<EMAILID>[email protected]</EMAILID>
</root>
  • Interesting use for an XML file. If you create those XML files yourself, consider changing them so that they contain actual structured data, not multi-line plain text. If you don't create the XML, use a different tool that is better suited to handle lumps of plain text - i.e. a programming language other than XSLT. Commented Feb 16, 2013 at 18:09
  • It's from a JavaMail exception, and I have to generate the invalid email IDs in the expected format. Commented Feb 16, 2013 at 18:14
  • Can you at least use an XSLT 2.0 processor like Saxon 9? In that case you could try your luck with xsl:analyze-string. Commented Feb 16, 2013 at 18:36

1 Answer

As Martin Honnen suggests, analyze-string is a good bet here. But the format of your message is so simple that you do not need anything more complicated than the simple string manipulation functions of XSLT 1.0 and a recursive named template. Here is an XSLT 1.0 stylesheet with embedded comments to explain what is going on.

The beginning of the stylesheet is perfectly conventional:

<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">

  <xsl:output method="xml" indent="yes"/>

We declare two variables for some of the constant text in the error message (for no particular reason except wanting to avoid giving these long constant strings more than once):

  <xsl:variable name="prefix"
                select="'    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;'"/>
  <xsl:variable name="suffix"
                select="'>: Recipient address rejected: User unknown in virtual alias table'"/>

The root element replicates itself:

  <xsl:template match="root">
    <root>
      <xsl:apply-templates/>
    </root>
  </xsl:template>

The Error_Message element hands its string value over to the named template extract-email-addresses, which does what its name suggests (details further below).

  <xsl:template match="Error_Message">
    <xsl:call-template 
        name="extract-email-addresses">
      <xsl:with-param name="s" 
                      select="string(.)"/>
    </xsl:call-template>
  </xsl:template>

The err_mesage element and text nodes are suppressed:

  <xsl:template match="err_mesage | text()"/>

The extract-email-addresses template accepts a string as parameter, which defaults to the empty string.

  <xsl:template name="extract-email-addresses">
    <xsl:param name="s" select="''"/>

We are going to bite off a bit of the string s at a time, handle the part we've bitten off, and recur on the rest. So the first thing we do is check to see whether we are finished. If $s is the empty string, there is nothing left to do; we stop the recursion and allow the stack to pop.

    <xsl:choose>
      <xsl:when test="$s = ''">
        <!--* end of string, we are done. *-->
      </xsl:when>

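When the string is not empty, we split $s at the first newline, assigning the part before it to $s1 and the part after it to $rest. (If $s contains no newline at all, both substring-before and substring-after return the empty string, so a final line without a trailing newline would be dropped silently. In the input shown, the last address line is followed by a newline, so nothing is lost.)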

      <xsl:otherwise>
        <xsl:variable name="s1" 
            select="substring-before($s,'&#xA;')"/>
        <xsl:variable name="rest" 
            select="substring-after($s,'&#xA;')"/>

Now we look for various forms the line can take. Most of the lines in the error message are boilerplate to be ignored:

        <xsl:choose>
          <xsl:when test="$s1 = 'Error sending mail message. Cause: javax.mail.SendFailedException: Invalid Addresses;'">
            <!--* this line is of no 
                * interest, continue *-->    
          </xsl:when>
          <xsl:when test="$s1 = '  nested exception is:'">
            <!--* skip this line *-->    
          </xsl:when>
          <xsl:when test="$s1 = ';'">
            <!--* skip this line *-->    
          </xsl:when>
          <xsl:when test="$s1 = ''">
            <!--* skip this line *-->    
          </xsl:when>

When we see a line starting with the label for the SMTPAddressFailedException and ending with the boilerplate about the rejection of the recipient address, we take the substring that occurs after the prefix and before the suffix, and wrap it in an EMAILID element:

          <xsl:when test="starts-with($s1,$prefix)
                          and
                          contains($s1,$suffix)">
            <EMAILID>
              <xsl:value-of select="
                substring-before(
                  substring-after($s1,$prefix),
                  $suffix)
                "/>
            </EMAILID>
            <xsl:text>&#xA;</xsl:text>
          </xsl:when>

If we see any other form of line, then the input is not as expected, so we emit a diagnostic message and keep going:

          <xsl:otherwise>
            <xsl:message>Unrecognized line: |<xsl:value-of
              select="$s1"/>|</xsl:message>
          </xsl:otherwise>
        </xsl:choose>

Whatever we did with the first line, we now recur to handle the remainder of the lines in the string:

        <xsl:call-template name="extract-email-addresses">
          <xsl:with-param name="s" select="$rest"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>

The XSLT 2.0 analyze-string instruction, of course, will be more compact than this, and the regular expressions of XSLT 2.0 make it much more convenient to do complicated things than the XSLT 1.0 library does. (But if you knew how to use analyze-string, you wouldn't have asked your question. One advantage of the smaller library and language in XSLT 1.0 is that it's sometimes faster to solve a problem with 1.0 than it is to understand the more complicated constructs of XSLT 2.0 and how to apply them to a simple problem. This is a general fact about small and large languages, of course.)
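
For comparison, here is a minimal sketch of what an XSLT 2.0 version built on xsl:analyze-string might look like. It is separate from the XSLT 1.0 stylesheet listed above and assumes an XSLT 2.0 processor such as Saxon 9. The regular expression and the variable name addresses are illustrative choices, not something taken from the question. Since distinct-values is available in 2.0, the sketch also drops duplicates; the order in which distinct-values returns its items is implementation-dependent.

```xml
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs"
  version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/root">
    <root>
      <!--* collect, as a sequence of strings, whatever stands
          * between "550 5.1.1 <" and ">: Recipient address rejected" *-->
      <xsl:variable name="addresses" as="xs:string*">
        <xsl:analyze-string
            select="string(Error_Message)"
            regex="550 5\.1\.1 &lt;([^&gt;]+)&gt;: Recipient address rejected">
          <xsl:matching-substring>
            <xsl:sequence select="regex-group(1)"/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:variable>

      <!--* emit each distinct address once *-->
      <xsl:for-each select="distinct-values($addresses)">
        <EMAILID>
          <xsl:value-of select="."/>
        </EMAILID>
      </xsl:for-each>
    </root>
  </xsl:template>

</xsl:stylesheet>
```

The as="xs:string*" declaration matters: without it the variable would hold a temporary tree whose text is all the matches run together, and distinct-values would see a single value.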

Applied to the input you show, the XSLT 1.0 stylesheet above produces almost exactly the output you show:

<?xml version="1.0"?>
<root><EMAILID>[email protected]</EMAILID>
<EMAILID>[email protected]</EMAILID>
<EMAILID>[email protected]</EMAILID>
<EMAILID>[email protected]</EMAILID>
<EMAILID>[email protected]</EMAILID>
<EMAILID>[email protected]</EMAILID>
<EMAILID>[email protected]</EMAILID>
</root> 

It does not include a line for [email protected]@gmail.com; I conjecture that perhaps that's a cut/paste error in the question.

It also does not check to see whether the email address in a given line has already been emitted; if that is essential in practice, I hope it is obvious to you how to pass a second argument containing all the email addresses extracted thus far (delimited by blanks or by U+A0 or any character you like that can't occur in an email address) and use it to test for duplicates before emitting an EMAILID element.
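
In case it is not obvious, here is one way the duplicate check could be sketched in XSLT 1.0. This is a condensed variant of the stylesheet above, not a drop-in patch: the template name extract-distinct-email-addresses, the parameter seen, and the variables email and is-new are names invented for this illustration, and for brevity the boilerplate lines are skipped silently instead of being checked one by one. It assumes a blank can safely delimit addresses in $seen.

```xml
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:variable name="prefix"
                select="'    com.sun.mail.smtp.SMTPAddressFailedException: 550 5.1.1 &lt;'"/>
  <xsl:variable name="suffix"
                select="'>: Recipient address rejected: User unknown in virtual alias table'"/>

  <xsl:template match="root">
    <root>
      <xsl:apply-templates/>
    </root>
  </xsl:template>

  <xsl:template match="Error_Message">
    <xsl:call-template name="extract-distinct-email-addresses">
      <xsl:with-param name="s" select="string(.)"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template match="err_mesage | text()"/>

  <xsl:template name="extract-distinct-email-addresses">
    <xsl:param name="s" select="''"/>
    <!--* $seen holds the addresses emitted so far, each one
        * preceded and followed by a blank *-->
    <xsl:param name="seen" select="' '"/>

    <xsl:if test="$s != ''">
      <xsl:variable name="s1" select="substring-before($s,'&#xA;')"/>
      <xsl:variable name="rest" select="substring-after($s,'&#xA;')"/>

      <!--* the address on this line, or nothing for any other line *-->
      <xsl:variable name="email">
        <xsl:if test="starts-with($s1,$prefix) and contains($s1,$suffix)">
          <xsl:value-of select="substring-before(substring-after($s1,$prefix),$suffix)"/>
        </xsl:if>
      </xsl:variable>

      <!--* true only for an address we have not emitted before *-->
      <xsl:variable name="is-new"
          select="string($email) != ''
                  and not(contains($seen, concat(' ', $email, ' ')))"/>

      <xsl:if test="$is-new">
        <EMAILID>
          <xsl:value-of select="$email"/>
        </EMAILID>
        <xsl:text>&#xA;</xsl:text>
      </xsl:if>

      <!--* recur on the remaining lines, extending $seen if needed *-->
      <xsl:call-template name="extract-distinct-email-addresses">
        <xsl:with-param name="s" select="$rest"/>
        <xsl:with-param name="seen">
          <xsl:value-of select="$seen"/>
          <xsl:if test="$is-new">
            <xsl:value-of select="concat($email, ' ')"/>
          </xsl:if>
        </xsl:with-param>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Wrapping each stored address in blanks on both sides keeps one address from matching merely because it is a substring of a longer one already in $seen.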

1 Comment

+1 A suggestion: use XML comments and format the stylesheet in the answer as one document. It will make it easier for the OP (and others) to copy the stylesheet and execute, and will retain the comments in the stylesheet for reference.
