So far so good. The next exercise is to actually make this script useful by fetching the MP3s and arranging them into easy-to-understand directories.
The first order of business is to determine the name for the folder. I chose to use the channel name, and if that fails, the title of the episode. The latter isn't a very good choice, as we could easily end up with a directory for each item in the feed.
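Here is a minimal sketch of that fallback. The function name folder_name and the slash replacement are my own illustration, not part of the script so far, and it assumes the two titles have already been parsed out of the feed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose a directory name for an episode.
# Prefers the channel title and falls back to the item title if it is empty.
folder_name() {
    local channel_title=$1 item_title=$2
    local name=${channel_title:-$item_title}
    # A slash cannot appear inside a directory name, so swap it for an underscore
    name=${name//\//_}
    printf '%s\n' "$name"
}

folder_name "Example Show" "Episode 12"   # prints: Example Show
folder_name "" "Episode 12"               # prints: Episode 12
```

The second call shows the problem with the fallback: every episode title is different, so each one would get its own directory.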
Last night we looked at parsing URLs out of the feeds. Tonight I'm going to look at parsing a little more information: specifically the channel title, item title, item enclosure, and item pubDate.
Tonight I'm going to start off the script by reading a list of feeds and fetching them for parsing.
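To make those four fields concrete, here is a rough sketch of how they could be pulled out with the same sed trick. The function name parse_feed and the sample feed are made up for illustration, and it assumes every element sits on its own line, which real feeds don't guarantee. It is not a real XML parser.

```bash
#!/usr/bin/env bash
# Sketch: print channel title, item title, item pubDate and item enclosure
# from an RSS feed on stdin. Assumes one element per line.
parse_feed() {
    local line value in_item=0
    while IFS= read -r line; do
        # Remember whether we are inside an <item>, so the two titles differ
        case $line in
            *'<item>'*)  in_item=1 ;;
            *'</item>'*) in_item=0 ;;
        esac
        value=$(printf '%s\n' "$line" | sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p')
        if [ -n "$value" ]; then
            if [ "$in_item" -eq 1 ]; then
                echo "item title: $value"
            else
                echo "channel title: $value"
            fi
        fi
        value=$(printf '%s\n' "$line" | sed -n 's/.*<pubDate>\([^<]*\)<\/pubDate>.*/\1/p')
        if [ -n "$value" ] && [ "$in_item" -eq 1 ]; then
            echo "item pubDate: $value"
        fi
        value=$(printf '%s\n' "$line" | sed -n 's/.*<enclosure[^>]*url="\([^"]*\)".*/\1/p')
        if [ -n "$value" ]; then
            echo "item enclosure: $value"
        fi
    done
}

# Made-up sample feed
parse_feed <<'EOF'
<channel>
<title>Example Show</title>
<item>
<title>Episode 1</title>
<pubDate>Mon, 01 Jan 2007 00:00:00 GMT</pubDate>
<enclosure url="http://example.com/ep1.mp3" length="123" type="audio/mpeg"/>
</item>
</channel>
EOF
```

Running it prints one labelled line per field, with the channel title first because it appears before the first item.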
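# Read each feed URL, skipping comments and blank lines
while read -r URL; do
    # Fetch the feed and print the contents of any <link> element
    while read -r LINE; do
        echo "$LINE" | sed -n 's/.*<link>\([^<]*\)<\/link>.*/\1/p'
    done < <(wget -q -O - "$URL")
done < <(grep -v -e '^[;#]' -e '^$' "$FEEDS")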
We're using grep to filter out lines starting with ; and #, as well as blank lines. We could get fancy and validate the URL, but this will suffice for now.
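To see what the filter keeps, here is a small self-contained sketch that runs it against a throwaway feeds file. The function name list_feeds and the example.com URLs are placeholders I made up for the demonstration.

```bash
#!/usr/bin/env bash
# Sketch: show which lines of a feeds file survive the grep filter.
list_feeds() {
    # Drop lines starting with ; or #, and drop empty lines
    grep -v -e '^[;#]' -e '^$' "$1"
}

# Build a temporary feeds file with comments and a blank line in it
FEEDS=$(mktemp)
cat > "$FEEDS" <<'EOF'
# my podcasts
http://example.com/show1.xml

; disabled for now
http://example.com/show2.xml
EOF

list_feeds "$FEEDS"   # prints only the two URLs
rm -f "$FEEDS"
```

Note that a line containing only spaces is not matched by '^$', so it would still get through and be handed to wget.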
If all we really wanted was a list of MP3 URLs, we could pipe wget directly through the sed command, but I have plans to parse out more than just the MP3. To keep our files organized and minimize network traffic, I plan to also parse out the titles of the feed and show, and the pubDates. We'll delve into the parser more tomorrow; for now, good night, and happy bashing.
sed -n 's/.*href="\([^"]*\)".*/\1/p'
-n suppresses sed's default printing of every line.
's/.../.../p' is a command. s is search and replace, in the form s/pattern/replacement/. The trailing p is a flag that prints the line when a substitution was made. Since we used -n to suppress normal printing, sed prints only the lines where the substitution happened, after the replacement. In most cases the replacement text will be static, but you can also use \1 through \9 to insert the text matched by the corresponding parenthesized group in the pattern.
The pattern in this case is: .*href="([^"]*)".*
.*href=" matches the beginning of the line, up to and including the href="
([^"]*) matches everything except a quote (the url itself).
".* matches the quote and the rest of the line.
Using \1 as the replacement text causes the URL, and only the URL, to be printed. If the line doesn't contain a matching pattern, sed continues on silently to the next line.
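The same command can be tried without touching the network by feeding it a couple of hand-written lines. The wrapper name extract_href and the sample HTML are just for illustration.

```bash
#!/usr/bin/env bash
# Sketch: run the sed command from above on canned input.
extract_href() {
    # -n: print nothing by default; p: print lines where the substitution matched
    sed -n 's/.*href="\([^"]*\)".*/\1/p'
}

printf '%s\n' \
    '<p>no link on this line</p>' \
    '<a href="http://example.com/show.mp3">Listen</a>' | extract_href
# prints: http://example.com/show.mp3
```

The first line has no href, so sed stays silent on it; the second is reduced to the URL alone.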
This method only catches one URL per line, ignoring the rest. Because the leading .* is greedy, the one it catches is actually the last href on the line. I will attempt to address that in a later article.
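One possible way around the one-URL-per-line limit, offered as a sketch and not necessarily the approach a later article will take, is to let grep -o split out every match first and then trim each one with sed. The -o option is supported by GNU and BSD grep but is not part of POSIX; the function name extract_all_hrefs is my own.

```bash
#!/usr/bin/env bash
# Sketch: print every href value on stdin, one per output line.
extract_all_hrefs() {
    # grep -o emits each match on its own line; sed strips the href="..." wrapper
    grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//'
}

printf '%s\n' '<a href="a.mp3">A</a> <a href="b.mp3">B</a>' | extract_all_hrefs
# prints a.mp3 and b.mp3 on separate lines
```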
Here's a simple example:
# wget -q http://lr2.com/ -O - |sed -n 's/.*href="\([^"]*\)".*/\1/p'
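Notice we escaped the parens. In sed's basic regular expressions, \( and \) mark a group, while unescaped parens match literal parentheses. The single quotes around the expression keep bash from interpreting any of it, so there's no risk of it being mistaken for a subshell.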
Tomorrow I'll start to build this into a smarter parser that can be used to harvest both web pages and xml feeds for mp3 links. The end goal will be a simple script to fetch podcasts, add them to my library, and automatically dump them on my iPod if it's connected. Along the way I expect to learn a few more bash tricks.