updated parser


In my last post I used if and else to look for each tag and act on it. Tonight I'm going to convert that ugly mess to a case statement, which is easier to read and doesn't have the hackish feel of a long if/elif/else chain.

I also learned a new trick that helps with parsing tags when the author mixes case. We'll add a 'shopt -s nocasematch' to the top of the shell script. This makes pattern matches in case statements and [[ ]] tests (including =~ regexp matches) case insensitive. The similarly named nocaseglob only affects filename expansion, so it is not the one we want here. Neither option changes the behaviour of commands like sed and grep. I've already seen a couple of feeds where the tags were all upper case, and was worried about having to write some ugly regexp to match the tags.

#!/bin/bash
shopt -s nocasematch
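To see what the option does to a case statement, here is a minimal sketch. In bash it is nocasematch that governs case and [[ ]] matching. The classify_tag function is a made-up helper for this illustration and is not part of the parser.

```shell
#!/bin/bash
# classify_tag is a hypothetical helper: it reports which branch a tag name hits.
classify_tag() {
    case "$1" in
        'title')   echo "title" ;;
        'pubDate') echo "date" ;;
        *)         echo "other" ;;
    esac
}

shopt -u nocasematch
before=$(classify_tag "TITLE")     # patterns are case sensitive by default

shopt -s nocasematch
after=$(classify_tag "TITLE")      # 'title' now matches TITLE, Title, ...
date_tag=$(classify_tag "PUBDATE") # 'pubDate' now matches PUBDATE too

echo "without nocasematch: ${before}"
echo "with nocasematch:    ${after}, ${date_tag}"
```

The option is set once and applies to every case and [[ ]] match that follows, including those inside functions.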

Now to get rid of the if elif else blocks.

while read -r URL; do
    CHANNEL=""
    while read -r LINE; do
        TAG=$(echo "${LINE}" | sed -n 's/<\([^>\ ]*\).*/\1/p')

We start out the same, except that I've updated TAG to match just the tag name itself, and not everything between the <>'s. In each of the following strings, only the word "this" is matched: <this>, <this name="x1">
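The tag extraction can be tried on its own. This is a small sketch that wraps the same sed expression in a function; the name tag_name is made up for the example.

```shell
#!/bin/bash
# tag_name is a hypothetical wrapper around the sed expression used for TAG.
# It prints the tag name found after the first '<', or nothing if there is none.
tag_name() {
    printf '%s\n' "$1" | sed -n 's/<\([^>\ ]*\).*/\1/p'
}

tag_name '<this>'             # prints: this
tag_name '<this name="x1">'   # prints: this
tag_name '</item>'            # prints: /item
tag_name 'no tag here'        # prints nothing, since sed -n only prints on a match
```

Closing tags keep their leading slash, which is what lets a pattern such as '/item' detect the end of an item.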

        if [[ "${CHANNEL}" = "" ]] && [[ "${TAG}" =~ "title" ]]; then
            CHANNEL=$(echo "${LINE}" | sed -n -e 's/<title>\([^<]*\)<\/title>/\1/pi' | sed -e 's/[\r\n]//')
        fi

Since we aren't building a true xml parser, we have to make some assumptions about the order and hierarchy of the tags. In all of the feeds I've looked at, the channel tag is near the top, and it has a title tag for the name of the podcast. Following that we run into the item tags, each of which contains another title tag. If CHANNEL is not yet set and we see a title tag, it should be safe to assume that it's the channel name. If the xml is malformed or omits the channel title tag, we will end up using the first item title as the channel title.
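Here is a rough sketch of that assumption in action, run against a made-up feed fragment instead of a real download. It is a simplified variation rather than the script itself: the feed text and the TEXT variable are invented for the example, and the i flag on the sed substitution assumes GNU sed.

```shell
#!/bin/bash
shopt -s nocasematch

# Made-up feed fragment, for illustration only.
feed='<channel>
<title>Example Podcast</title>
<item>
<title>Episode 1</title>
</item>
</channel>'

CHANNEL=""
TITLE=""
while read -r LINE; do
    TAG=$(printf '%s\n' "${LINE}" | sed -n 's/<\([^>\ ]*\).*/\1/p')
    if [[ "${TAG}" == "title" ]]; then
        TEXT=$(printf '%s\n' "${LINE}" | sed -n 's/<title>\([^<]*\)<\/title>/\1/pi')
        if [[ -z "${CHANNEL}" ]]; then
            CHANNEL="${TEXT}"   # first title seen: assume it names the channel
        else
            TITLE="${TEXT}"     # later titles belong to items
        fi
    fi
done <<< "${feed}"

echo "channel: ${CHANNEL}"
echo "last item title: ${TITLE}"
```

If the fragment had no channel title, "Episode 1" would be picked up as the channel name, which is exactly the failure mode described above.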

        case "${TAG}" in
            'title')
                TITLE=$(echo "${LINE}" | sed -n 's/<title>\([^<]*\)<\/title>/\1/pi')
                ;;
            'link')
                LINK=$(echo "${LINE}" | sed -n 's/.*<link>\([^<]*\)<\/link>/\1/pi')
                ;;
            'pubDate')
                DATE=$(echo "${LINE}" | sed -n 's/.*<pubDate>\([^<]*\)<\/pubDate>/\1/pi')
                ;;
            'enclosure')
                ENCL=$(echo "${LINE}" | sed -n 's/.*<enclosure url=["'\'']\([^"'\'']*\)["'\''].*/\1/pi')
                ;;
            '/item')
                getmp3
                ;;
        esac

The case statement makes this look much cleaner than the if/elif/else chain. With nocasematch set at the top of the script (the bash option that makes case patterns case insensitive), we don't have to worry about whether someone uses TITLE or title; any combination of upper and lowercase letters will match. In every branch we use sed to pull out the contents of the tag. This assumes that each of these tags always sits on a line by itself. I haven't found a case where this isn't true in about 100 feeds. Let me know if you find one. I've tossed in a call to getmp3, which is a stub function that just prints the variables for now. Soon it will fetch the files into appropriately named directories.
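The stub itself isn't shown here, so the following is only a guess at what a print-only getmp3 could look like. It assumes the variable names used in the loop (CHANNEL, TITLE, LINK, DATE, ENCL, HREF); clearing the per-item values at the end is an addition of this sketch.

```shell
#!/bin/bash
# Hypothetical print-only stub; the real getmp3 is not shown in the post.
getmp3() {
    printf 'channel: %s\n' "${CHANNEL}"
    printf 'title:   %s\n' "${TITLE}"
    printf 'link:    %s\n' "${LINK}"
    printf 'date:    %s\n' "${DATE}"
    printf 'encl:    %s\n' "${ENCL}"
    printf 'href:    %s\n' "${HREF}"
    # Clear per-item values so one item's data cannot leak into the next.
    # CHANNEL is kept because it applies to the whole feed.
    TITLE="" LINK="" DATE="" ENCL="" HREF=""
}
```

Because it is called from the '/item' branch, the function runs once per item, after all of that item's tags have been seen.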

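        # Keep the regexp in a variable: in bash 3.2 and later, a quoted
        # right-hand side of =~ is matched as a literal string, not a regexp.
        MP3_RE='href=["'\''][^"'\'']*[.]mp3["'\'']'
        if [[ "${LINE}" =~ ${MP3_RE} ]]; then
            HREF=$(echo "${LINE}" | sed -n 's/.*href=["'\'']\([^"'\'']*\)\.mp3["'\''].*/\1.mp3/pi')
        fi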

I did find a few feeds that did not use the enclosure tag, but included an anchor tag pointing to the mp3 in the description. This is an attempt to find mp3 urls and set HREF. We will use it as a fallback in the event that neither enclosure nor link points to an mp3 file. The downside to this method is that we will get the last mp3 linked in the description, even if it's not the right one.
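The fallback order could be sketched like this. The pick_url function is hypothetical and not part of the script yet; it assumes that a usable url is one ending in .mp3, which will miss urls that carry a query string.

```shell
#!/bin/bash
shopt -s nocasematch

# pick_url is a hypothetical helper: it prints the first argument that ends
# in .mp3 and returns non-zero if none does.
# Intended call order: enclosure url, link url, href from the description.
pick_url() {
    local candidate
    for candidate in "$@"; do
        if [[ "${candidate}" == *.mp3 ]]; then
            printf '%s\n' "${candidate}"
            return 0
        fi
    done
    return 1
}

# Example: no enclosure, link points at a web page, href points at the audio.
pick_url "" "http://example.com/show/ep1.html" "http://example.com/audio/ep1.MP3"
```

Passing the candidates in priority order keeps the preference for enclosure over link over HREF in one place.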

    done < <(wget -q -O - "${URL}")
done < <(grep -v -e '^[;#]' -e '^$' "${FEEDS}")


Expect this last part to be cleaned up and replaced with functions before we're done. It would be nice to cache the xml, and we could use a function to read the feeds.lst instead of how we're doing it now.
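As a rough idea of where that could go, reading the feed list might move into a small function like the one below. The name read_feeds is hypothetical; it reuses the same grep filter as the loop and takes the list file as its argument.

```shell
#!/bin/bash
# read_feeds is a hypothetical helper: it prints the feed urls from a list
# file, skipping blank lines and lines that start with ';' or '#'.
read_feeds() {
    grep -v -e '^[;#]' -e '^$' "$1"
}

# Usage sketch, assuming FEEDS holds the path to feeds.lst:
#   while read -r URL; do
#       ...
#   done < <(read_feeds "${FEEDS}")
```

Keeping the filter in one function means the comment syntax of the list file only has to be changed in one place.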

That should do for tonight. Next time I will add caching for the xml files, and flesh out the getmp3 function.
