Friday, September 7, 2007

RSS Newsfeed Workbench - read RSS feeds and OPML feeds into SQL Server 2005


RSS Newsfeed Workbench - read RSS feeds and OPML feeds into SQL Server 2005


06 July 2007


by Robyn Page and Phil Factor


Have you ever fancied being able to put Simple-Talk's content into a database? We haven't either, but we'll do it just for practice. The objective of this workbench is to show just how easy it is to implement an RSS-based datafeed in SQL Server 2005. The purpose is to try out downloading an XML file from the internet, checking it for validity, and if not, seeing if it is a redirect. We'll show a cool SQL Server 2005 way of reading a text file into a VARCHAR(MAX) variable. We'll do some shredding of an XML file too. By the time we've finished we hope we will have illustrated some useful techniques and given you an application that you can turn into a production-quality newsreader.
It is very easy to read RSS feeds and OPML feeds into SQL Server 2005. This is because the files are XML and are easily eaten by the XML functions. RSS, which doesn't actually stand for anything other than RDF Site Summary, is a simple way of publishing, or broadcasting, frequently updated Web content such as articles and blog entries. OPML, originally conceived for transferring outlines, is a format that has been used widely for distributing collections of RSS feeds. However, it has a large number of uses in transferring simple structured list-based information. As it is used for exchanging lists of RSS feeds, we'll show how to read in an OPML list of feeds
Please remember before we start that this code is for illustration and is not 'production quality'. You will need to sort out the security issues (see the process delegation workbench) and add logging and more error-checking.
We won't attempt to store all the information that RSS2 defines, simply because we can't find any sites that use all this information. In fact, most feeds use only a small part of the information and few aggregators seem to display the extra information anyway. We'll boil it down to two relational tables


/* Firstly, create a new database called RSSFeedWorkbench. Make sure you have the rights to execute xp_cmdshell too!*/
.USE RSSFeedWorkbench
GO
IF EXISTS ( SELECT *
FROM sysobjects --delete it if it exists
WHERE id = OBJECT_ID('RSSitem') )
DROP TABLE [RSSitem]
/*
We create two tables. RSSItem contains every news item that we collect.
It is linked to an RSSFeed table which contains every feed that we want to
read from. We have not included every parameter of a feed item because
they are rarely used. */
CREATE TABLE dbo.RSSitem
(
RSSitem_ID INT IDENTITY(1, 1)
NOT NULL,
RSSFeed_ID INT NULL,
title VARCHAR(80) NOT NULL,
link VARCHAR(200) NULL,
[description] VARCHAR(MAX) NULL,
PubDate VARCHAR(40) NULL,
[GUID] VARCHAR(80) NULL,
insertiondate DATETIME NOT NULL
CONSTRAINT dfRssItemDate DEFAULT ( GETDATE() ),
CONSTRAINT PK_RSSitem PRIMARY KEY CLUSTERED ( RSSitem_ID ASC )
)
ON
[PRIMARY]
/****** Object: Table dbo . RSSFeed Script Date: 07/02/2007 10:32:19 ******/
IF EXISTS ( SELECT *
FROM sysobjects --delete it if it exists
WHERE id = OBJECT_ID('RSSFeed') )
DROP TABLE [RSSFeed]
/* the RSSFeed table contains every RSS feed that we want to read from
.
It is used to poll around the feeds. This routine checks to see whether
the TYPE field is set to RSS.(or RSS2 etc) If it isn't then it isn't
read. */
CREATE TABLE dbo.RSSFeed
(
RSSFeed_ID INT IDENTITY(1, 1)
NOT NULL,
title VARCHAR(80) NOT NULL,
link VARCHAR(200) NOT NULL,
[description] VARCHAR(2000) NOT NULL
DEFAULT '',
[language] VARCHAR(10) NOT NULL
DEFAULT 'en-US',
Category VARCHAR(100) NOT NULL
DEFAULT 'Generic',
docs VARCHAR(80) NULL,
Generator VARCHAR(80) NULL,
ManagingEditor VARCHAR(80) NULL,
Webmaster VARCHAR(80) NULL,
actualURL VARCHAR(200) NULL,
[type] VARCHAR(80) NOT NULL
DEFAULT 'RSS',--is it RSS or ATOM?
insertiondate DATETIME NOT NULL
DEFAULT ( GETDATE() ),
CONSTRAINT PK_RSSFeed PRIMARY KEY CLUSTERED ( RSSFeed_ID ASC )
)
ON
[PRIMARY]
GO
ALTER TABLE dbo.RSSitem WITH CHECK
ADD
CONSTRAINT FK_RSSitem_RSSFeed FOREIGN KEY ( RSSFeed_ID )
REFERENCES dbo.RSSFeed ( RSSFeed_ID )
GO
ALTER TABLE dbo.RSSitem CHECK CONSTRAINT FK_RSSitem_RSSFeed
GO
CREATE NONCLUSTERED INDEX [idxRSSFeedTitle] ON [dbo].[RSSFeed] ( [title] ASC )
CREATE NONCLUSTERED INDEX [idxTitle] ON [dbo].[RSSitem] ( [title] ASC )
GO


/* Although there are more portable ways of getting the contents of the URL, we'll use CURL.EXE, which is free, and very easy to install, and useful for a wide range of purposes. You'll need to download and install this to get the Workbench to run. You'll need to install the OpenSSL package too from here .
Some RSS feeds will try to stop v non-browser software from reading an RSS feed. Curl has useful facilities that allow it to mimic a browser to the extent of allowing cookies and letting you specify the user agent. */
/* this is the stored procedure that grabs the contents of the feed, checks to see if is one we know about and, if not, saves its attributes. then it saves any items it has not already saved. As it saves the url of the feed, you can use it to add the feed to your repository, and then it will automatically refresh it (if you install the spUpdateAllFeeds stored procedure in the right place. This stored procedure is more complex than it might be because one can try to access a feed only to find it is a redirection to the actual feed. If this is the case then one has to read the file as an HTML file, get the anchor, and use the contents of the HREF as the source of the newsfeed, before trying again! */


IF EXISTS ( SELECT *
FROM sysobjects
WHERE id = OBJECT_ID('spUpdateRSSArchiveFrom') )
DROP PROCEDURE [spUpdateRSSArchiveFrom]
GO
CREATE PROCEDURE [dbo].[spUpdateRSSArchiveFrom]
@url VARCHAR(200),--URL of the RSS datafeed to register and read from
@tempfile VARCHAR(80) = 'c:\rssfeed.xml'--optional file to save as
AS /*
execute spUpdateRSSArchiveFrom
'http://www.simple-talk.com/community/forums/rss.aspx?ForumID=142&Mode=0',
'C:\forums.rss'
*/
SET nocount ON
DECLARE @ErrorMessage VARCHAR(100)
DECLARE @command NVARCHAR(255) --the command string used for spExecuteSQL
DECLARE @ExitCode INT--the code returned by the xp_cmdShell procedure
DECLARE @ExitCodeASCII VARCHAR(10)--the ascii version of above code
DECLARE @Badfile VARCHAR(MAX)--we use this to pull in any HTML files to
--examine if there was an error in reading in the XML file
DECLARE @Retries INT--the number of retries we allow
DECLARE @Ref VARCHAR(1000)--the reference supplied by the feed redirector
SELECT @retries = 0--set up the retry count
CREATE TABLE #lines ( line VARCHAR(2000) )--table used to remember what is
--passed back from the command line when executing xp_cmdShell
retry:--yes we use GOTOs. We sometimes do...
RAISERROR ( 'processing %s', 0, 1, @url )--a way of getting an immediate
--progress report
/* now we make up the command line for CURL.EXE. You may want to change the
parameters here for various reasons such as a proxy. If an error happens
then whatever CURL prints out is returned to the procedure and printed out
by the error processing */
SELECT @command = 'curl "' + @url + '" -o"' + @tempfile
+ '" -A"Mozilla/4.0 (compatible; MSIE 6.0;'
+' Windows NT 5.0; .NET CLR 1.1.4322"'
/*Firstly we read the RSSfeed into a file*/
INSERT INTO #lines ( line )
EXECUTE @exitcode= xp_cmdshell @command--and execute it
SELECT @ErrorMessage = COALESCE(@errorMessage, '') + COALESCE(line, '')
FROM #lines--put the entire result into one string
IF EXISTS ( SELECT 1
FROM #lines
WHERE line LIKE '%is not recognized as an%' )
BEGIN--the silly moo hasn't installed CURL!
RAISERROR ( 'Sorry. You must have CURL installed first!', 16, 1 )
RETURN 1
END
IF EXISTS ( SELECT 1--an error was returned from CURL
FROM #lines
WHERE line LIKE '%Could not resolve host%' )
OR EXISTS ( SELECT 1
FROM #lines
WHERE line LIKE 'curl: (%' )
BEGIN
RAISERROR ( 'Sorry. Could not get RSS feed because %s', 16, 1,
@ErrorMessage )
RETURN 1
END
IF @Exitcode <> 0 --ah CURL command line returned an error code
BEGIN
SELECT @ExitcodeASCII = CONVERT(VARCHAR(5), @Exitcode)
SELECT @ErrorMessage = COALESCE(@errorMessage, '') + COALESCE(line, '')
FROM #lines
RAISERROR ( 'Sorry. Errorcode %s. Curl reports %s', 16, 1,
@ExitcodeASCII, @ErrorMessage )
RETURN 1
END
/* ...then we read it in...*/
DECLARE @RSSfeed XML
DECLARE @Feed_ID INT
BEGIN TRY--we will catch errors at this point as they are usually XML
SELECT @Command = 'SELECT @RSSfeed = BulkColumn
FROM OPENROWSET(BULK ''' + @TempFile + ''', SINGLE_BLOB) AS x '
EXEC sp_executesql @command, N'@RSSfeed xml output', @RSSfeed OUTPUT
END TRY
BEGIN CATCH
IF @@Error IN (9413,9422)--if it was an XML parsing error
BEGIN
RAISERROR ( '%s was not valid XML. Was it a redirect?', 0, 1, @url )
SELECT @Command = 'SELECT @Badfile = BulkColumn
FROM OPENROWSET(BULK ''' + @TempFile + ''', SINGLE_BLOB) AS x '
EXEC sp_executesql @command, --have a look at it
--read the file as a TEXT file rather than an XML
N'@BadFile varchar(MAX) output', @BadFile OUTPUT
--we examine it to see if it is a redirect file
SELECT @ref = SUBSTRING(@BadFille,
CHARINDEX('<a href="', @BadFile + '<a href="', 1)
+ 9, 1000)
SELECT @ref = LEFT(@Ref, CHARINDEX('"', @ref + '"') - 1)
IF LEN(@Ref) < 10--check for obvious signs of problems
OR @ref NOT LIKE 'http://%'
BEGIN
RAISERROR ( 'The url %s was not a valid redirect!', 16, 1, @URL )
RETURN 1
END
UPDATE RSSFeed SET type='BAD' WHERE ActualURL=@URL
SELECT @URL = @ref,
@Retries = @Retries + 1
IF @retries < 3
GOTO retry
END
RAISERROR ( 'The url %s was not a valid RSS feed!', 16, 1, @URL )
RETURN 1
END CATCH
/*Now we see if this is a new feed. If it is we add it to the feed
database.
*/
DECLARE @RSSFeedAttributes TABLE
(
title VARCHAR(80),
link VARCHAR(200),
[description] VARCHAR(2000),
[language] VARCHAR(10) DEFAULT 'en-US',
Category VARCHAR(100) DEFAULT 'Generic',
docs VARCHAR(80),
Generator VARCHAR(80),
ManagingEditor VARCHAR(80),
Webmaster VARCHAR(80)
)
/*
<title>My Title</title>
<link>My Link</link>
<description>My Description</description>
<language>My Language e.g. en-us</language>
<category>Newspapers</category><docs>My Docs URL</docs>
<generator>My RSS Generator</generator>
<managingEditor>My.Editorial@Email.Address</managingEditor>
<webMaster>My.Webmaster@Email.Address</webMaster>
*/
INSERT INTO @RSSFeedAttributes
(
title,
link,
[description],
[language],
Category,
docs,
Generator,
ManagingEditor,
Webmaster
)
SELECT
x.feed.value('title[1]', 'varchar(80)') AS title,
x.feed.value('link[1]', 'varchar(200)') AS link,
x.feed.value('description[1]', 'varchar(2000)') AS [description],
COALESCE(x.feed.value('language[1]', 'Varchar(10)'), 'en-US')
AS [language],
COALESCE(x.feed.value('Category[1]', 'Varchar(100)'),
'Generic') AS [Category],
x.feed.value('docs[1]', 'Varchar(80)') AS [docs],
x.feed.value('generator[1]', 'Varchar(80)') AS [Generator],
x.feed.value('managingeditor[1]', 'Varchar(80)') AS [ManagingEditor],
x.feed.value('webmaster[1]', 'Varchar(80)') AS [Webmaster]
FROM @RSSfeed.nodes('//rss/channel') AS x ( feed )
--insert the feed if it doesn't exist
INSERT INTO RSSFeed
(
title,
link,
[description],
[language],
Category,
docs,
Generator,
ManagingEditor,
Webmaster,
ActualURL
)
SELECT f.title,
f.link,
f.[description],
f.[language],
f.Category,
f.docs,
f.Generator,
f.ManagingEditor,
f.Webmaster,
@URL
FROM @RSSFeedAttributes f
LEFT OUTER JOIN RSSFeed
ON RSSfeed.title = f.title
AND rssFeed.link = rssfeed.link
WHERE rssFeed.rssFeed_ID IS NULL
--..and get the ID of the feed
SELECT TOP 1
@Feed_ID = rssFeed.rssFeed_ID
FROM @RSSFeedAttributes f
INNER JOIN RSSFeed ON RSSfeed.title = f.title
AND rssFeed.link = rssfeed.link
--and add any RSSfeed item that doesn't exist
INSERT INTO RSSitem
(
RSSFeed_ID,
title,
link,
[description],
PubDate,
[GUID]
)
SELECT @feed_ID,
f.title,
f.link,
f.description,
COALESCE(f.pubdate, '-'),
COALESCE(f.[GUID], '-')
FROM ( SELECT
x.feed.value('title[1]', 'varchar(80)') AS title,
x.feed.value('link[1]', 'varchar(200)') AS link,
x.feed.value('description[1]', 'varchar(max)')
AS [description],
x.feed.value('pubdate[1]', 'Varchar(40)') AS [pubdate],
CONVERT(VARCHAR(80), x.feed.value('guid[1]',
'Varchar(2000)')) AS [GUID]
FROM @RSSfeed.nodes('//rss/channel/item') AS x ( feed )
) f
LEFT OUTER JOIN RSSitem ON COALESCE(f.title, '-') = RSSitem.title
AND COALESCE(f.GUID, '-') = RSSitem.GUID
AND COALESCE(f.PubDate, '-') = rssitem.pubdate
WHERE rssitem.RSSitem_ID IS NULL
GO
IF EXISTS ( SELECT *
FROM sysobjects
WHERE id = OBJECT_ID('spReadOPMLFile') )
DROP PROCEDURE [spReadOPMLFile]
GO
/*
Now we provide a routine for reading in any ompl FILE. This is just a list
so it is relatively easy to read, but note that the list may be
hierarchical. We are not too concerned about the OPML hierarchy so we
won't bother to record the hierarchy.
*/
CREATE PROCEDURE spReadOPMLFile @Filename VARCHAR(100)
/*
spReadOPMLFile 'S:MyFavouriteRSSFeeds.opml'
*/
AS
DECLARE @opmlfeed XML
DECLARE @command NVARCHAR(255)
SELECT @Command = 'SELECT @opmlfeed = BulkColumn
FROM OPENROWSET(BULK ''' + @filename + ''', SINGLE_BLOB) AS x '
EXEC sp_executesql @command, N'@opmlfeed xml output', @opmlfeed OUTPUT
INSERT INTO RSSfeed
(
title,
link,
[description],
[language],
Category,
actualURL,
[type]
)
SELECT f.*
FROM
( SELECT
x.opml.value('@title', 'nvarchar(80)') AS title,
x.opml.value('@xmlUrl', 'nvarchar(200)') AS link,
COALESCE(x.opml.value('@description',
'nvarchar(2000)'), '') AS [description],
COALESCE(x.opml.value('@language',
'nvarchar(2000)'), 'Een-US') AS [language],
'Generic' AS [category],
x.opml.value('@xmlUrl', 'nvarchar(200)') AS actualURL,
COALESCE(x.opml.value('@version', 'nvarchar(80)'),
'RSS') AS [type]
FROM @opmlfeed.nodes('/descendant::outline[@xmlUrl]')
AS x ( opml )
) f
LEFT OUTER JOIN rssfeed ON f.link = rssfeed.link
WHERE rssfeed.rssFeed_ID IS NULL
GO
IF EXISTS ( SELECT *
FROM sysobjects
WHERE id = OBJECT_ID('spUpdateAllFeeds') )
DROP PROCEDURE [spUpdateAllFeeds]
GO
/* this is the stored procedure that polls all your existing RSS feeds
and reads in any new news entries*/
CREATE PROCEDURE spUpdateAllFeeds
AS
DECLARE @command VARCHAR(MAX)
SELECT @Command = COALESCE(@command, '') + '
execute spUpdateRSSArchiveFrom ''' + actualURL + '''
'
FROM rssFeed
WHERE type LIKE 'RSS%'
EXECUTE ( @command )--execute them as a string
GO


/* In the list of files to download will be some sample OMPL files to try out. Now when you've loaded in a few RSS feeds, all you have to do is to put this stored procedure on the scheduler and this will keep your News up-to-date. Instant Blogrolls! The first person to write an OMPL exporter for this will get a Simple-Talk goodie bag! */




Technorati :
Del.icio.us :