This morning, I found myself on Baseline Scenario, a well-known site which discusses the economic crisis. I noticed that the authors of the site had laboured over producing a PDF version for each month of their archive by copying and pasting the posts into Word and printing to PDF. There’s a nicer way of doing this, I think. When you’ve done it once, it should take you no more than ten minutes to go through the whole process any other time.
- WordPress provides a way to filter content by date. In our example, we’ll grab the RSS feed from the first month of publications: http://baselinescenario.com/2008/09/feed The permalink structure is clear enough on WordPress. For Blogger, it’s nowhere near as intuitive.
- The feed will display the articles in descending date order. When you are reading the PDF or eBook version, you don’t want to read the last article first, as you would on the website. To reverse the order of the feed, use Yahoo Pipes (or for WordPress, see @mhawksey’s comment below). You can clone my example. If you’ve not used Yahoo Pipes before, don’t worry. You just need a Yahoo account. The example I give is as simple a pipe as you will see and should make sense as soon as you look at it.
- Once you’ve created the pipe of the feed in ascending order, save and run the pipe. Look for the RSS icon and copy the pipe’s RSS link, which should look like this: http://pipes.yahoo.com/pipes/pipe.run?_id=cb438b51b2819eb1f4f5ec6f10daf09e&_render=rss
- Next, go to FeedBooks. Sign up for an account if you don’t already have one. Now, we create a Newspaper.
- Click on News in the menu and then on Create a newspaper. Give it a name and tag it. In our example, we’ll call it Baseline Scenario Archive.
- Click on ‘Add a RSS feed’. Give it a name (in our case ‘September 2008’) and paste your RSS feed into the box. Once it’s found and accepted your feed, click ‘Publish’.
- You can now click on the name of the specific feed and you’ll be presented with a page that offers ePub, Kindle and PDF versions of your feed. Here’s the Baseline Scenario September 2008 example.
- That’s it. You can do it with whole sites, too, if you like. Here’s one I did earlier (Blogger). The only thing you need to remember is to ensure that the RSS feed contains all the items you’re looking for. For the Blogger site, the source feed looks like this: http://www.blogger.com/feeds/27481991/posts/default?max-results=1000 A thousand items is more than enough to capture this site for quite some time. For WordPress, the site owner has to change their Reading Settings so that the feed includes enough items. For the Baseline Scenario, they need to set this to a number high enough to ensure that a full month’s worth of posts is included. I would just set it at 3000 and then forget about it. It would mean the entire site could be captured this way for the next year or so.
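For anyone who would rather script the reversal step than build a pipe, here is a minimal Python sketch of the same idea: take an RSS 2.0 document and put its items in the opposite order. It is not what the Yahoo Pipe does internally. The function name `reverse_rss_items` and the sample feed are made up for illustration, and the sketch assumes a plain RSS 2.0 structure with a single `<channel>`.

```python
import xml.etree.ElementTree as ET


def reverse_rss_items(rss_xml: str) -> str:
    """Return the RSS document with its <item> elements in reverse order."""
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")
    items = channel.findall("item")
    # Detach every item, then re-attach them last-to-first.
    for item in items:
        channel.remove(item)
    for item in reversed(items):
        channel.append(item)
    return ET.tostring(root, encoding="unicode")


# Hypothetical three-post feed, newest first, as a blog would serve it.
SAMPLE = """<rss version="2.0"><channel><title>Example</title>
<item><title>Newest</title></item>
<item><title>Older</title></item>
<item><title>Oldest</title></item>
</channel></rss>"""

if __name__ == "__main__":
    flipped = reverse_rss_items(SAMPLE)
    titles = [i.findtext("title") for i in ET.fromstring(flipped).iter("item")]
    print(titles)  # ['Oldest', 'Older', 'Newest']
```

You would still need to host the resulting XML somewhere FeedBooks can fetch it, which is the part the pipe takes care of for you.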
Having problems? Got a question? Leave a comment.
Should we offer pub variants of docs that appear on writetoreply (if license appears to allow it?)
Yes, I first did this for the Digital Britain – Interim Report and then for one or two others, but not consistently.
http://writetoreply.org/actually/2009/03/28/ways-to-read-and-navigate-documents/
I’m sure we’ve talked about the flexibility of WordPress query strings before ;). For your readers with WP you can skip yahoo pipes by adding to the query string e.g. http://baselinescenario.com/2008/09/feed/?orderby=post_date&order=ASC
More WP fun at http://codex.wordpress.org/Template_Tags/query_posts
[I also like using tabbloid.com for pdfing rss]
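If you have many months to do, the query-string trick above can be generated rather than typed. This is only a sketch: the helper name `monthly_feed_url` is invented here, and it assumes the site uses the same `/yyyy/mm/feed/` permalink pattern as the Baseline Scenario example.

```python
from urllib.parse import urlencode


def monthly_feed_url(site: str, year: int, month: int, ascending: bool = True) -> str:
    """Build a WordPress monthly-archive feed URL with an explicit sort order."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Assumes date-based permalinks of the form /yyyy/mm/feed/.
    base = f"{site.rstrip('/')}/{year:04d}/{month:02d}/feed/"
    query = urlencode({"orderby": "post_date", "order": "ASC" if ascending else "DESC"})
    return f"{base}?{query}"


if __name__ == "__main__":
    print(monthly_feed_url("http://baselinescenario.com", 2008, 9))
    # http://baselinescenario.com/2008/09/feed/?orderby=post_date&order=ASC
```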
That’s fantastic. I had no idea. Thanks!
Thanks for doing this. I have been looking for an easy way to do this for a long time. I actually did those Baseline Scenario archives by pointing Google Reader to my blog and then copy/pasting into Word (so I did it one month at a time, not one post at a time), but still there must be a better way.
I still have a stumbling block, however. When I do FeedBooks, the resulting PDF only has post title and text; it doesn’t have date and author, which I consider must-haves. These must be in the feed from WordPress, since I can see them in Google Reader, but for some reason (choice?) FeedBooks is leaving them out and I can’t find a setting to include them. Does anyone know a solution to this? (Maybe there’s a way to pre-process the feed so the date and author are sucked out of the XML elements and placed at the beginning of the text content?) Otherwise I think it just won’t work for me.
A friend of mine suggested using BlogBooker, but that operates on my WordPress export file, which is 42 MB, and to do it month-by-month it looks like I would have to edit the RSS by hand (or write a program to do it, which is beyond my abilities at the moment).
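The pre-processing idea in the comment above (pull the date and author out of the XML and put them at the start of the text) could look roughly like this in Python. It is a sketch under assumptions: that the feed carries the author in `dc:creator` and the date in `pubDate`, as WordPress feeds normally do, and that the post body is in `description` and/or `content:encoded`. The function name `prepend_byline` and the sample item are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}
# Keep the familiar prefixes when the feed is written back out.
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def prepend_byline(rss_xml: str) -> str:
    """Copy each item's author and date to the start of its body text."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        author = item.findtext("dc:creator", default="", namespaces=NS).strip()
        date = item.findtext("pubDate", default="").strip()
        byline = " | ".join(part for part in (author, date) if part)
        if not byline:
            continue  # nothing to add for this item
        prefix_html = f"<p><em>{html.escape(byline)}</em></p>"
        for tag in ("description", "content:encoded"):
            body = item.find(tag, NS)
            if body is not None:
                body.text = prefix_html + (body.text or "")
    return ET.tostring(root, encoding="unicode")


# Hypothetical single-item feed for illustration.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel>
<item><title>Post</title><pubDate>Mon, 01 Sep 2008 10:00:00 +0000</pubDate>
<dc:creator>Example Author</dc:creator><description>Body text</description></item>
</channel></rss>"""

if __name__ == "__main__":
    result = ET.fromstring(prepend_byline(SAMPLE))
    print(result.find("channel/item").findtext("description"))
```

Because the byline ends up inside the post body itself, any converter that only keeps title and text should carry it through.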
Hi James, I can see how the author and date are significant for your site. I’ve looked at Tabbloid and a couple of other RSS2PDF online tools but they all ignore either or both the PubDate and Creator elements. I’ve not used BlogBooker, but like the look of it. As a reader of your site, I wouldn’t mind downloading an updated version of the entire ‘book’ every month, rather than simply monthly instalments.
I’m optimistic about Tabbloid, because Simon and I started manually adding our names to posts several months ago (because FeedBurner similarly doesn’t pick up author names). But I’m having trouble testing it; it seems to require a day’s cycle every time I want to try something.
With Tabbloid, you could just set up a subscription to your latest month’s archive and once you’ve received your first delivery (no need for it to be any more than an hour later), unsubscribe from it and place the PDF on your site for others to download. If you’ve started including the author’s name in the actual article content, then I can see this working quite nicely.
@james Yahoo pipes could be used to extract the author and date.
Here is a pipe http://pipes.yahoo.com/mashe/baseline I’ve created for Baseline Scenario
You can just run the pipe for the current RSS feed or you can specify an archive path string. So if you want to get an RSS feed for Sep 2009 enter ‘2009/09’.
By running the pipe and then clicking ‘get as RSS’ you could submit this to FeedBooks. You can shortcut a lot of the button clicking for archives by modifying this URL:
http://pipes.yahoo.com/mashe/baseline?_render=rss&path=yyyy/mm
where yyyy would be replaced with the year
and mm is month
So Sep 2009 would be:
http://pipes.yahoo.com/mashe/baseline?_render=rss&path=2009/09
I’ve started an example here http://feedbooks.com/newspaper/8130 which has the current RSS and archive for Dec 2009
Hope this helps
Martin
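To save editing that URL by hand for every month, the `path=yyyy/mm` values can be generated in a loop. A rough Python sketch follows; the helper names `month_paths` and `pipe_feed_urls` are invented for illustration, and the URL template is simply the one quoted in the comment above, which may or may not still resolve.

```python
from typing import List

# URL pattern quoted in the comment above; {path} stands in for yyyy/mm.
PIPE_URL = "http://pipes.yahoo.com/mashe/baseline?_render=rss&path={path}"


def month_paths(start_year: int, start_month: int,
                end_year: int, end_month: int) -> List[str]:
    """List 'yyyy/mm' strings for every month in the range, inclusive."""
    for m in (start_month, end_month):
        if not 1 <= m <= 12:
            raise ValueError("months must be between 1 and 12")
    paths = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        paths.append(f"{year:04d}/{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return paths


def pipe_feed_urls(start_year: int, start_month: int,
                   end_year: int, end_month: int) -> List[str]:
    """Build one pipe RSS URL per month in the range."""
    return [PIPE_URL.format(path=p)
            for p in month_paths(start_year, start_month, end_year, end_month)]


if __name__ == "__main__":
    for url in pipe_feed_urls(2008, 11, 2009, 2):
        print(url)
```

Each URL printed can then be pasted into FeedBooks as a separate feed, one per month.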
Hi,
I’ve been trying to make an ebook of Jordan Mechner’s excellent old journals:
http://jordanmechner.com/old-journals/feed/?orderby=post_date&order=ASC&
But I’ve been unable to publish more than 10 posts at once. I know all posts are accessible, because when I switch the sort order to DESC I can see the last 10 (and with ASC the first 10).
I’ve tried googling (‘rss parameters’ and similar searches) and lots of RSS parameters (e.g. show_count=1, max-results=500 and posts_per_page=-1), but to no avail.
help?
I think this is simply because the site’s settings are to show no more than ten items in any feed. For example, this site is set up to show ten items in the feed: http://chemistryfm.dev.lincoln.ac.uk
If you sort ASC or DESC, you only ever get ten items.
If I were you, I’d contact the blog owner and ask them to increase the number of items from ten to 1000 or whatever you think might cover all the content.
Thanks for your reply, that explains why it didn’t work before.
There are about 436 items in the category for this feed.
Before bothering the blog owner I’ll try to look into other ways first (for instance, google reader can download everything, probably by making multiple requests, maybe I can get it to export).
Thanks for the help!
Really?
I’m surprised GReader can grab everything. I can’t make it pull everything from the chemistryfm feed.
Anyway, if you can view it all in GReader, then yes, you can create a feed from that, simply by creating a ‘bundle’. I’ve written about that here: http://joss.dev.lincoln.ac.uk/2009/05/27/using-google-reader-as-an-opml-editor-and-feed-blender/
Each bundle with one or more feeds in it, has an Atom feed, which you can then run through Yahoo Pipes if it needs reversing.
I also seem to observe that Google Reader can go arbitrarily far back, although I don’t know how it does it. Even when my blog was limiting feeds to 15 posts, I could scroll back to the beginning in Google Reader (it just takes a long time). That is how I did the original PDF blog archives.
My guess is that this only works because I have been subscribed to the blog for that whole time, so Google has been caching the old items somewhere.
Yes, I bet that’s it. Camiel, have you been subscribed to the Jordan Mechner blog since the first ten posts were published?
No, I’ve only recently discovered RSS.
But I have no doubt that Google has been 😀
Strange, I can see all the posts when I go to the subscriptions (like James said, by scrolling until all items are loaded) but when I drop it in a bundle only 11 items are there. And in the bundle framework (‘preview’), there is no way to load the other items.
So using the google reader feed link for the bundle doesn’t work either.
http://www.google.com/reader/public/atom/user%2F03115315646288341412%2Fbundle%2FJordan%20Mechner%27s%20old%20journals
James, how did you extract the PDF from that list?
Um, the old-fashioned way. I scrolled down in Google Reader until I got all the items I wanted, copied all of it with the mouse, pasted into Word, did a couple global find-and-replaces, and printed to PDF.
Hence the search for a better way of doing things.
James, Here’s an answer to your problem of including the author’s name and pub date:
http://www.rsc-ne-scotland.org.uk/mashe/2010/01/feedbooks-pipe/
Thanks, Martin!
Thanks for your help on this. I was looking for a similar solution but hadn’t joined the dots.
Have you considered turning a blog into a proper book – it’s one way to solve all them digital preservation problems! 😉
http://blogbooker.com/
Have you tried http://www.tabbloid.com/? It is a service from HP, and it lets you choose whether to exclude some of the items.
Great post!
Thanks for the info, been looking into this myself and it is great to hear what you have done!
TOM
The site now says that the service has been discontinued.