3

I am writing a mixed-language script whose parent script is bash (don't ask why, it's a long story). Part of my script pulls the source of an XML page into a variable. I want to use bash to split the XML in that variable into several arrays. The XML is set up as follows:

<event>
    <id>34287352</id>
    <what>New Post</what>
    <when>1 Minute Ago 03:50 PM</when>
    <title>This is a title</title>
    <preview>sdfasd</preview>
    <poster>
            <![CDATA[ USERNAME ]]>
    </poster>
    <threadid>2346566</threadid>
    <postid>34287352</postid>
    <lastpost>1360021837</lastpost>
    <userid>3291696</userid>
    <forumid>2</forumid>
    <forumname>General Discussion</forumname>
    <views>201,913</views>
    <replies>6,709</replies>
    <statusicon>images/statusicon/thread.gif</statusicon>
</event>

There are 20 <event> elements in the XML file. I want to pull the what, title and preview values from the XML and put each set into its own array.

I followed an example here on SO:

for tag in  what title preview 
do
OUT=`grep  $tag $source | tr -d '\t' | sed 's/^<.*>\([^<].*\)<.*>$/\1/' `

# This is what I call the eval_trick, difficult to explain in words.
eval ${tag}=`echo -ne \""${OUT}"\"`
done

W_ARRAY=( `echo ${what}` )
T_ARRAY=( `echo ${title}` )
P_ARRAY=( `echo ${preview}` )

echo ${W_ARRAY[0]}
echo ${T_ARRAY[0]}
echo ${P_ARRAY[0]}

But with the above, my script always freaks out and repeatedly prints grep: <part of the xml>: No such file or directory

Thoughts?

EDIT:

Well, it is ugly as hell, but I managed to get the pseudo-XML into arrays:

windex=0
tindex=0
pindex=0
while read -r line
do
# Store the value that was just extracted (not the leftover $OUT from the first attempt).
WHAT=$(echo "${line}" | awk -F "</?what>" '{ print $2 }')
if [ "$WHAT" != "" ]; then
    W_ARRAY[$windex]=$WHAT
    let windex+=1
fi
TITLE=$(echo "${line}" | awk -F "</?title>" '{ print $2 }')
if [ "$TITLE" != "" ]; then
    T_ARRAY[$tindex]=$TITLE
    let tindex+=1
fi
PREVIEW=$(echo "${line}" | awk -F "</?preview>" '{ print $2 }')
if [ "$PREVIEW" != "" ]; then
    P_ARRAY[$pindex]=$PREVIEW
    let pindex+=1
fi
done <<< "$source"
  • 1) This is not valid XML. 2) For parsing XML, use xmllint or xmlstarlet. Commented Feb 5, 2013 at 0:26
  • The XML is from the vBulletin mod VAISPY. I have no control over its validity; I can only work with what it shows. =( Also, I'm not that familiar with bash, so the proper context of xmllint and xmlstarlet escapes me. Commented Feb 5, 2013 at 0:31
  • is your $source variable set ? Commented Feb 5, 2013 at 0:39
  • Yes my $source variable is set. Commented Feb 5, 2013 at 0:43
  • You said the XML is in the variable, but grep expects a filename. So you would have to use echo "$source" | grep, with - as the filename. Commented Feb 5, 2013 at 0:43
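
To illustrate that last comment: grep treats every argument after the pattern as a file name, which is why `grep $tag $source` complains about pieces of the XML not being files. Here is a minimal sketch of feeding the variable on stdin instead. The two-line `source` value is made-up sample data, not the real page.

```shell
#!/bin/bash
# Hypothetical sample data standing in for the real page source.
source='<event>
    <what>New Post</what>
    <title>This is a title</title>
</event>'

# Pipe the variable into grep; with no file operand, grep reads stdin.
from_pipe=$(printf '%s\n' "$source" | grep '<what>')

# Equivalent bash here-string form, which avoids the extra process.
from_herestring=$(grep '<what>' <<< "$source")

printf '%s\n' "$from_pipe"
```

Both forms keep the double quotes around "$source", so the line breaks in the variable survive and grep can pick out single lines.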

4 Answers

1

I had something very similar, parsing-wise; here's a hacked version.

I use xsltproc (which is available on Ubuntu, though I can't remember whether I had to install it specifically).

Command line

xsltproc tfile.xslt tfile.xml

tfile.xml is your example copied 3 times and wrapped in events tags, i.e.:

<events>
     <event> ... </event>
     <event> ... </event>
     <event> ... </event>
</events>

tfile.xslt:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method='text'/>
<!-- ================================================================== -->
<xsl:template match="/">
    <xsl:apply-templates select="//event"/>
</xsl:template>

<xsl:template match="event">
 <xsl:text>event[</xsl:text><xsl:value-of select="position()"/><xsl:text>]['id']=</xsl:text>
 <xsl:value-of select="id"/> <xsl:text> </xsl:text>

 <xsl:text>event[</xsl:text><xsl:value-of select="position()"/><xsl:text>]['what']=</xsl:text>
 <xsl:value-of select="what"/><xsl:text> </xsl:text>

 <xsl:text>event[</xsl:text><xsl:value-of select="position()"/><xsl:text>]['preview']=</xsl:text>
 <xsl:value-of select="preview"/><xsl:text> </xsl:text>

 <xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>

Output

event[1]['id']=34287352 event[1]['what']=New Post event[1]['preview']=sdfasd 
event[2]['id']=34287353 event[2]['what']=New Post3 event[2]['preview']=sdfasd 
event[3]['id']=34287354 event[3]['what']=New Post4 event[3]['preview']=sdfasd

I hope you know a bit of XSLT processing; change the output as you want.

0

Well, now this is completely unhelpful, but I'm currently working on a command-line XML parser. If it were finished (it would be already, if I weren't distracted by a TopCoder marathon match...), you could write it as simply as:

eval $(echo "$source" | xidel - -e '<event>
    <what>{$W_ARRAY}</what>
    <title>{$T_ARRAY}</title>
    <preview>{$P_ARRAY}</preview>
</event>*' --output-format bash)

Looks amazing, doesn't it?

3 Comments

ZOOM! That is the sound of that flying over my head. I threw it in my bash script but I get xidel: command not found.
Yeah, you cannot use it, since I haven't written the program yet. But it would look like that if I had.
I appreciate the gesture. I derped out and didn't read "well, now this completely unhelpful". Thanks though!
0

To recap my comments, here's what's wrong with your code:

1- Since your $source variable holds the XML itself and not a filename, feed it to grep on stdin:

OUT=`echo $source | grep  $tag | tr -d '\t' | sed 's/^<.*>\([^<].*\)<.*>$/\1/' `

2- Your tr command deletes all the tabs in your XML-like variable. However, your variable is not indented with tabs but with 4 spaces.

So instead you need to have:

... | tr -d '    ' | ...

(be aware that this deletes every space, including the ones inside values such as "New Post")

3- An alternative solution would be:

OUT=`echo $source | grep  $tag | sed 's/<.*>\([^<].*\)<.*>$/\1/' `

(note that the ^ in the sed is removed)
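
One possible explanation for the `</event>` result reported in the comments below is quoting. The commands above pass `$source` unquoted, so the shell splits it into words and echo joins them with single spaces; grep then sees the whole document as one line. The following is only a sketch on made-up sample data, and the stricter sed expression in it is my own assumption, not part of the original suggestion.

```shell
#!/bin/bash
# Hypothetical sample data standing in for the real page source.
source='<event>
    <what>New Post</what>
</event>'

# Unquoted: newlines and indentation collapse to single spaces,
# so grep matches (and returns) the entire document as one line.
unquoted_match=$(echo $source | grep 'what')

# Quoted: line breaks survive, grep selects only the <what> line,
# and sed strips the indentation and the surrounding tags.
quoted_match=$(echo "$source" | grep '<what>' \
    | sed 's/^[[:space:]]*<what>\([^<]*\)<\/what>.*$/\1/')

echo "unquoted: $unquoted_match"
echo "quoted:   $quoted_match"
```

With the quotes in place there is no need for tr at all, because the sed pattern absorbs the leading whitespace without touching spaces inside the value.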

4 Comments

So close, so so very close. When I used both suggested solutions when I echo ${W_ARRAY[1]} I get </event>.
that's what she said :) one of the possible reasons is that you do not handle multi-line contents.. another potential problem is that you grep your $tag which is "what" and that could be in any line even when it's not a <what> tag.. (use grep "<"$tag">" instead)..
while true; do face=desk done, Even when I add "<"$tag">" I keep getting </event>
I tried to break it apart a little and take a different approach. while read -r line do echo ${line} | grep "<what>" | sed -e 's/<what>\(.*\)<\/what>/\1/' echo ${line} | grep "<title>" | sed -e 's/<title>\(.*\)<\/title>/\1/' echo ${line} | grep "<preview>" | sed -e 's/<preview>\(.*\)<\/preview>/\1/' done <<< "$source" I successfully output the data between the tags. But adding the data to its own array is proving to be difficult.
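
For the "adding the data to its own array" part, here is a rough sketch that skips grep and sed entirely and uses bash's own regex matching. It assumes each tag sits on one line with no nested markup in the value (a CDATA block like the one in <poster> would not match). The two-event `source` value is invented sample data.

```shell
#!/bin/bash
# Hypothetical sample data: two events in the same layout as the question.
source='<event>
    <what>New Post</what>
    <title>This is a title</title>
    <preview>sdfasd</preview>
</event>
<event>
    <what>New Thread</what>
    <title>Another title</title>
    <preview>qwerty</preview>
</event>'

W_ARRAY=()
T_ARRAY=()
P_ARRAY=()

# Patterns live in variables so =~ treats them as regexes
# consistently across bash versions.
re_what='<what>([^<]*)</what>'
re_title='<title>([^<]*)</title>'
re_preview='<preview>([^<]*)</preview>'

while IFS= read -r line; do
    if [[ $line =~ $re_what ]]; then
        W_ARRAY+=("${BASH_REMATCH[1]}")      # append the captured value
    elif [[ $line =~ $re_title ]]; then
        T_ARRAY+=("${BASH_REMATCH[1]}")
    elif [[ $line =~ $re_preview ]]; then
        P_ARRAY+=("${BASH_REMATCH[1]}")
    fi
done <<< "$source"

echo "what[1]:    ${W_ARRAY[1]}"
echo "title[1]:   ${T_ARRAY[1]}"
echo "preview[0]: ${P_ARRAY[0]}"
```

The `+=( ... )` append keeps values with spaces intact as single elements, which is where the backtick-and-echo array construction from the question falls apart.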
0

Got everything working. For anyone who ever plans to do something similar, here is the haps:

Create the AppleScript (saved as FRcli.scpt, the name the bash script passes to osascript):

on run argv
set region to item 1 of argv
set XML_URL to "http://" & region & ".<URL REMOVED>.com/board/vaispy-secret.php?do=xml"
try
    tell application "Safari"
        set URL of tab 1 of front window to XML_URL
        my waitforload()
        --delay 5
        -- Get page source
        set currentTab to current tab of front window
        set currentSource to currentTab's source
        return currentSource
    end tell
on error err
    log "Could not retrieve source."
    log err
    display dialog err
    --return "NULL"
end try

end run

on waitforload()
--check if page has loaded
local loadflag, zarg, test_html
set loadflag to 0
repeat until loadflag is 1
    delay 0.5
    tell application "Safari"
        set test_html to source of document 1
    end tell
    try
        set zarg to text ((count of characters in test_html) - 10) thru (count of characters in test_html) of test_html
        if "</events>" is in text ((count of characters in test_html) - 10) thru (count of characters in test_html) of test_html then
            set loadflag to 1
        end if
    end try
end repeat
end waitforload

Create bash script:

#!/bin/bash
clear

if [ "$1" == "na" ]; then
region="na"
elif [ "$1" == "eu" ]; then
region="euw"
else
echo "FRcli requires an argument."
echo "usage: [eu|na]"
echo "[eu scans EUW & EUNE]"
echo "[na scans NA]"
exit 1
fi


while true; do
clear
echo "Region: $region"
echo "...Importing Naughty"

declare -a NAUGHTY=()
nindex=0
while read line
do
    NAUGHTY[$nindex]=$line
    let nindex+=1

done < $HOME/Desktop/naughty.txt
NC=${#NAUGHTY[@]}
let NC-=1
echo "...Pulling Source"

source=$(osascript FRcli.scpt $region)

echo "...Extracting Arrays"

windex=0
tindex=0
pindex=0
dindex=0
while read -r line
do
    #WHAT=$(echo ${line} | awk -F "</?what>" '{ print $2 }')
    WHAT=$(echo ${line} | sed -n 's/^.*<what>\([^<]*\).*/\1/p')
    if [ "$WHAT" != "" ]; then
        W_ARRAY[$windex]=$WHAT
        let windex+=1
    fi

    #TITLE=$(echo ${line} | awk -F "</?title>" '{ print $2 }')
    TITLE=$(echo ${line} | sed -n 's/^.*<title>\([^<]*\).*/\1/p')
    if [ "$TITLE" != "" ]; then
        T_ARRAY[$tindex]=$TITLE
        let tindex+=1
    fi

    #PREVIEW=$(echo ${line} | awk -F "</?preview>" '{ print $2 }')
    #PREVIEW=$(echo ${line} | sed -n '/<preview*/,/<\/preview>/p')
    PREVIEW=$(echo ${line} | sed -n 's/^.*<preview>\([^<]*\).*/\1/p')
    if [ "$PREVIEW" != "" ]; then
        P_ARRAY[$pindex]=$PREVIEW
        let pindex+=1
    fi

    POSTID=$(echo ${line} | sed -n 's/^.*<postid>\([^<]*\).*/\1/p')
    if [ "$POSTID" != "" ]; then
        D_ARRAY[$dindex]=$POSTID
        let dindex+=1
    fi


done <<< "$source"

echo "What: ${#W_ARRAY[@]}"
echo "Title: ${#T_ARRAY[@]}"
echo "Preview: ${#P_ARRAY[@]}"
echo "PostID: ${#D_ARRAY[@]}"

for ((i=0; i <= 19; i++))
do
    found=0
    fpid=""
    if [ "${W_ARRAY[$i]}" = "New Thread" ]; then
        echo "Scanning Thread"
        scan=$(echo ${T_ARRAY[$i]} ${P_ARRAY[$i]})
        echo "Title: ${T_ARRAY[$i]}"
        echo "Post: ${P_ARRAY[$i]}"
    else
        echo "Scanning Post"
        scan=$(echo ${P_ARRAY[$i]})
        echo "Post: ${scan}"        
    fi
    sleep .5
    for ((n=0; n<=$NC; n++))
    do
        nw=${NAUGHTY[$n]}
        a=$(echo "${scan}" | tr '[:lower:]' '[:upper:]')
        b=$(echo "${nw}" | tr '[:lower:]' '[:upper:]')
        echo "Checking: $b"
        #echo "$a"

        if [[ $a == *$b* ]]; then
        ## Change != to == in release
            echo "Found: $b"
            found=1
            echo "...Loading PID"
            declare -a PID=()
            pindex=0
            while read line
            do
                PID[$pindex]=$line
                let pindex+=1

            done < $HOME/Desktop/pid.txt
            PIDC=${#PID[@]}

            # Assume the post is new until a matching line is found in pid.txt.
            fpid=0
            for (( p=0; p<PIDC; p++ ))
            do
                lpid=${PID[$p]}
                if [ "$region ${D_ARRAY[$i]}" == "$lpid" ]; then
                    # Already recorded in pid.txt: skip this post.
                    echo "Found: $lpid"
                    echo "Ignoring Flag"
                    fpid=1
                    break
                fi
            done

            if [ "$fpid" == "0" ]; then
                echo "PID not found, opening URL."
            fi


            if [ "$found" == "1" -a "$fpid" == "0" ]; then
                FFURL="http://$region.<URL REMOVED>.com/board/showthread.php?p=${D_ARRAY[$i]}&highlight=$nw"
                open -a Firefox "$FFURL"
                echo $region ${D_ARRAY[$i]} >> $HOME/Desktop/pid.txt            
                found=0
                fpid=""
            fi
        fi
    done
    sleep .5
done

if [ "$1" == "eu" ]; then
    if [ "$region" == "euw" ]; then
        region="eune"
    else
        region="euw"
    fi
fi
clear

done

I'm sure there are far more efficient ways of doing this. Using cURL in the bash script would have made this a one-script deal (I couldn't with this script because of the security in place for this board's iSpy). But this works and it is pretty zippy. It uses only 32.7 Mem on average and, as far as I can tell, doesn't have any memory leaks (unlike my 100% AppleScript version of this, which did).
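
On the "more efficient" point, one candidate is the pid.txt lookup: instead of reading the file into an array and looping over it, grep can answer "is this exact line already in the file?" directly. This is just a sketch; the `already_seen` helper, the temp file standing in for $HOME/Desktop/pid.txt, and the sample IDs are all made up for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for $HOME/Desktop/pid.txt.
pidfile=$(mktemp "${TMPDIR:-/tmp}/pid.XXXXXX")
printf '%s\n' 'na 34287352' 'euw 11111111' > "$pidfile"

# Succeeds if the exact line "<region> <postid>" is already recorded.
# -F: fixed string, -x: match the whole line, -q: no output, just the status.
already_seen() {
    grep -Fxq -- "$1 $2" "$pidfile"
}

if already_seen na 34287352; then first="seen"; else first="new"; fi
if already_seen na 99999999; then second="seen"; else second="new"; fi

# Record the new post so the next pass skips it.
if [ "$second" = "new" ]; then
    echo "na 99999999" >> "$pidfile"
fi
if already_seen na 99999999; then third="seen"; else third="new"; fi

echo "$first $second $third"
rm -f "$pidfile"
```

Because -x requires the whole line to match, a post ID that merely contains another ID as a substring does not count as a hit.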
