
Combining RSS and Atom Feeds with Python

I wanted to set up a primary feed for my entire site that would include posts from all the subsites. But when I looked for tools to combine multiple RSS or Atom feeds into one, I didn’t come up with anything useful: mostly posts about feed reader programs, services that will combine feeds for you, or how to create your own aggregator page for reading other sites. Nothing about combining feeds to publish.

I did find articles about how to read feeds in Python using feedparser, and how to build feeds in Python using feedgen, and I figured – why not just connect them?

It took a little fiddling, but here’s what I managed to put together:

Getting started

You’ll need to install feedparser and feedgen from pip or your distribution’s packages. I found feedparser in Fedora as python3-feedparser, but installed feedgen from pip.
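On Fedora, for example, the install step might look something like this (package names and pip options will vary with your distribution and Python setup):

sudo dnf install python3-feedparser
pip install --user feedgen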

Now import both libraries into your script, plus time so you can sort the individual posts.

#!/usr/bin/python
import feedparser
import time
from feedgen.feed import FeedGenerator

Then make a list of all the feed URLs you want to pull from.

sources = [
  'https://example.com/feed/',
  'https://anotherexample.com/path/to/feed.xml'
]

Reading the Source Feeds

Create a new list to store all the posts, then loop over each feed, adding each item to the list.

fullList = []
for url in sources:
  feed = feedparser.parse(url)
  for item in feed['items']:
    fullList.append(item)

Sort the posts by time and trim the list down to only the most recent posts.

fullList.sort(key=lambda item: item['updated_parsed'], reverse=True)
outList = fullList[:15]

Note that we’re sorting on updated_parsed here - that’s the date parsed into Python’s internal time representation, so it sorts reliably. But we’re using the original updated field (which could be in any format, as long as it’s unambiguous) when building the new feed. You may want to re-build a consistently formatted timestamp from the updated_parsed value.
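One way to do that - a minimal sketch, assuming the parsed times are UTC (which is how feedparser normalizes the *_parsed fields) - would be to swap this in for fe.updated(item['updated']) in the feed-building loop below:

# Rebuild an RFC 3339-style timestamp from the parsed struct_time (UTC).
fe.updated(time.strftime('%Y-%m-%dT%H:%M:%SZ', item['updated_parsed']))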

Generate the New Feed

First you need to create a new feed object and assign its overall properties, like the site title, the URLs, the main author, etc.

fg = FeedGenerator()
fg.id('https://example.com/allthethings')
fg.title('ALL The Things!')
fg.subtitle('Everything you might want to follow from this site.')
fg.author({'name': 'Jane Doe', 'email': 'jane@example.com'})
fg.link(href='https://example.com/', rel='alternate')
fg.link(href='https://example.com/feed.xml', rel='self')
fg.language('en')

Now you need to loop through the list you got from feedparser and copy each item’s properties onto a new entry in the generator. Conveniently, feedparser uses the same property names whether the source is RSS or Atom, so you can mix both source types easily.

You may want to include other properties on the feed or the individual items. To do that, look up the names in both the feedparser reference and the feedgen API documentation. But be sure to test for the existence of anything that’s optional!
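For example, if you also wanted to carry over the original published date where the source provides one, a guarded block like this (published is an optional feedparser field, and feedgen entries have a matching published() setter) could go inside the loop that follows:

# Optional extra property: copy the published date only when the source has one.
if 'published' in item:
  fe.published(item['published'])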

for item in outList:
  fe = fg.add_entry(order='append')
  fe.id(item['id'])
  fe.title(item['title'])
  fe.link(href=item['link'])
  fe.updated(item['updated'])
  # Add the author if the original item has one.
  if 'author' in item:
    fe.author(name=item['author'])
  if 'summary' in item:
    # If the summary contains HTML code, set its type.
    summary_type = 'html' if '<' in item['summary'] else None
    fe.summary(summary=item['summary'], type=summary_type)
  # Tags and full content are optional too, so don't assume they exist.
  for tag in item.get('tags', []):
    fe.category(term=tag.term)
  for content in item.get('content', []):
    fe.content(content=content.value, type='html')

NOTE: You may not want to hard-code the content type. When I tried copying it from the source, everything came through typed as text/html instead, and the content wouldn’t display when I tried to read the feed.
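If you do want to carry the type over rather than hard-coding it, one approach that might work is translating feedparser’s MIME-style type (like text/html) into the short values Atom expects. A rough sketch - the mapping here is my assumption, so check it against your own feeds:

# Hypothetical mapping from feedparser's MIME-style content types
# to the short type values Atom uses.
TYPE_MAP = {
  'text/plain': 'text',
  'text/html': 'html',
  'application/xhtml+xml': 'xhtml',
}

for content in item.get('content', []):
  fe.content(content=content.value, type=TYPE_MAP.get(content.type, 'html'))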

Update: Thanks to @FiXato@toot.cat for pointing out that I should check the type of the summary too, reminding me to look up how to preserve the order in feedgen, and suggesting some Python tips to clean up the code!

Anyway, now you just need to write the feed to a file. FeedGenerator makes this super-easy:

fg.atom_file('feed.xml')
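The same FeedGenerator object can also produce RSS output, or hand the XML back as a string if you’d rather not write a file at this point - both are part of the standard feedgen API:

fg.rss_file('rss.xml')               # write an RSS 2.0 version of the same feed
atom_xml = fg.atom_str(pretty=True)  # or keep the Atom XML in memory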

Now you have a script you can run manually, or as part of a build process, or as a cron job to keep the combined feed up to date!

For a working example, you can take a look at my full site feed, which is built from the feeds of three 11ty subsites and my WordPress blog.

IndieWeb

Since I already had the data, I also built an IndieWeb-friendly microformats2 feed of the most recent posts as an HTML list.

indiefeed = []
indiefeed.append("<ul class='h-feed'>\n")
for item in outList:
  shortdate = time.strftime('%b %d', item['updated_parsed'])
  # The datetime attribute wants a machine-readable timestamp;
  # feedparser's *_parsed times are normalized to UTC.
  fulldate = time.strftime('%Y-%m-%dT%H:%M:%SZ', item['updated_parsed'])
  indiefeed.append(
    f"<li class='h-entry'><time class='dt-updated' datetime='{fulldate}'>{shortdate}</time>: "
    f"<a class='u-url p-name' href='{item['link']}'>{item['title']}</a></li>\n"
  )
indiefeed.append("</ul>\n")

with open('latest.html', 'w', encoding='utf-8') as LatestFile:
  LatestFile.writelines(indiefeed)

I’m including this list on the site’s home page, so the latest posts are visible both to people reading it and to any IndieWeb reader software!