Add disclaimer, and accidentally deleted files

robalexdev · Nov 16, 2024 · 2788464 · 2788464
1 parent febeefd
commit 2788464
Show file tree

Hide file tree

Showing 10 changed files with 379 additions and 0 deletions.
diff --git a/content/cats/_index.md b/content/cats/_index.md
diff --git a/content/languages/_index.md b/content/languages/_index.md
diff --git a/content/opml/opml-archive.md b/content/opml/opml-archive.md
@@ -0,0 +1,9 @@
+---
+title: "Blogroll Network: Feeds for archive"
+description: All the feeds in the blogroll network (excluding noarchive feeds).
+url: /opml-archive.xml
+outputs:
+  - xml
+params:
+  isArchiveOnly: true
+---
diff --git a/content/opml/opml.md b/content/opml/opml.md
@@ -0,0 +1,9 @@
+---
+title: "Blogroll Network: All Feeds"
+description: All the feeds in the blogroll network
+url: /opml.xml
+outputs:
+  - xml
+params:
+  isArchiveOnly: false
+---
diff --git a/content/rss/rss-archive.md b/content/rss/rss-archive.md
@@ -0,0 +1,9 @@
+---
+title: "Blogroll Network: Latest posts from each feed for archive"
+description: The most recent post from each feed in the network (excluding noarchive feeds).
+url: /rss-archive.xml
+outputs:
+  - xml
+params:
+  isArchiveOnly: true
+---
diff --git a/content/rss/rss.md b/content/rss/rss.md
@@ -0,0 +1,9 @@
+---
+title: "Blogroll Network: Latest posts from each feed"
+description: The most recent post from each feed in the network
+url: /rss.xml
+outputs:
+  - xml
+params:
+  isArchiveOnly: false
+---
diff --git a/content/standalone/aggregate-feeds.md b/content/standalone/aggregate-feeds.md
@@ -0,0 +1,23 @@
+---
+title: Aggregate Feeds
+url: /aggregate-feeds/index.html
+---
+
+For compatibility with existing tools, several aggregate feeds are available.
+You may import these into RSS feed readers, although they contain many feeds across a variety of topics, so you probably won't want to.
+
+Feeds are available in two variants.
+The first is a full feed, which contains all feeds that have been discovered in the Blogroll Network.
+The second is feed excludes any feeds which have been served with a `X-Robots-Tag: noarchive`.
+As such, the second feed is suitable for building a feed archive while respecting the `noarchive` tag.
+
+## RSS Feeds
+
+* [Full feed](/rss.xml)
+* [Feed for archive](/rss-archive.xml)
+
+
+## OPML Reading Lists
+
+* [Full list](/opml.xml)
+* [List for archive](/opml-archive.xml)
diff --git a/content/standalone/privacy.md b/content/standalone/privacy.md
@@ -0,0 +1,205 @@
+---
+title: Privacy
+url: /privacy/index.html
+---
+
+Web scraping and indexing is an inherently privacy invasive practice.
+As such, this project uses several methods to reduce unwanted impacts while achieving the goal
+of improving discoverability of personal blogs, and their feeds.
+
+The privacy impacts of this project and controls are described here.
+
+## Terms
+
+The Website - this website at https://alexsci.com/rss-blogroll-network/.
+
+The Crawler - the web crawler that collects data displayed on The Website.
+
+The Project - The Website, The Crawler, and associated works.
+
+
+## Data collected by The Website
+
+The Website collects metrics about visitors using
+[GoatCounter](https://www.goatcounter.com) (a third-party).
+GoatCounter was selected as it is open source, collects metrics in a privacy preserving way, and does not use any cookies.
+The [GoatCounter docs](https://www.goatcounter.com) explain that GDPR consent is likely not needed as it doesn't collect any
+personally identifying information.
+
+Here's the settings being used on GoatCounter:
+
+{{< figure
+    src="/GoatCounterSettings.png"
+    title="From the GoatCounter settings page" 
+    alt="Checked items: sessions, referrer, user-agent, size, country. Unchecked: region, language."
+>}}
+
+This collection is helpful in understanding the technology used to access The Website and
+the popularity of various pages.
+
+You can opt-out of this data collection by installing a web browser extension like
+[uBlock Origin](https://github.com/gorhill/uBlock#ublock-origin).
+The default uBO configuration blocks GoatCounter (and many other types of content).
+Extensions like uBO work best on [Firefox](https://www.mozilla.org/en-US/firefox/new/).
+
+
+## Hosting
+
+The Website is hosted on GitHub Pages, which has it's own [privacy policy](https://docs.github.com/en/pages/getting-started-with-github-pages/about-github-pages#data-collection).
+
+
+## Cookies, tracking pixels, local storage, CDNs 
+
+These are not used on The Website.
+
+
+## Data collected by The Crawler
+
+The Crawler may collect the following data, as provided by a public available web page.
+The following types of information are collected:
+
+* Title
+* Descriptions
+* Category
+* Links
+
+Body content of web pages, like the text of a blog post, is not collected.
+
+The Crawler will collect Personally Identifiable Information if it is present in a collected field.
+For example, my blog is named "Robert Alexander's Blog", which has my name in the title.
+
+
+## Opt-outs for The Crawler
+
+You may use the following methods to opt-out your web content out of being indexed by
+The Project.
+These are industry standards and are useful to implement if you'd like to control how
+standards compliant web crawlers interact with your website.
+
+
+### Manual request
+
+You can always manually request for your site to be excluded from the project.
+Your domain will be listed under [`block_domains`](https://github.com/ralexander-phi/rss-blogroll-network/blob/main/feeds.yaml).
+Content related to your web pages will be removed after the crawlers next run.
+
+Contact methods:
+
+* [Open a GitHub Issue](https://github.com/ralexander-phi/rss-blogroll-network)
+* [DM me on Mastodon](https://indieweb.social/@robalex)
+* Send me an email: robert at robalexdev dot com
+
+You will need to demonstrate that you are the owner the requested domain.
+This is a personal project, so I'll process any requests as I am available.
+No timeline is provided, although I consider opt-outs as priority incidents.
+
+
+### robots.txt
+
+A `robots.txt` file hosted at the root of your domain (I.E. https://example.com/robots.txt) can
+be used to control what content various automated user agents are allowed to access.
+
+For example, if you don't want any web crawlers to access your site, you can block them all using:
+
+    User-agent: *
+    Disallow: /
+
+The Crawler (and any other well-behaved web crawler) will not access any content on
+your site (except the robots.txt file) when you use this setting.
+
+If you'd like to block every crawler other than The Crawler, you can write:
+
+    User-agent: *
+    Disallow: /
+    User-agent: Feed2Pages/0.1
+    Disallow:
+
+If you'd like to only block The Crawler use:
+
+    ...
+    User-agent: Feed2Pages/0.1
+    Disallow: /
+
+More fine-grained control of access is also possible.
+For example, if you'd like The Crawler to process your Atom feed but not your RSS feed you can
+write something like:
+
+    ...
+    User-agent: Feed2Pages/0.1
+    Disallow: /rss.xml
+
+
+### noindex tag
+
+The `noindex` tag instructs web crawlers not to index a page.
+You can ask all crawlers not to index your page by placing the following HTML inside your `<head>` section:
+
+    <meta name="robots" content="noindex">
+
+You can selectively ask The Project not to index your page using:
+
+    <meta name="feed2pages/0.1" content="noindex">
+
+Or you can indicate that you only want certain crawlers to index your page:
+
+    <meta name="robots" content="noindex">
+    <meta name="feed2pages/0.1" content="">
+
+You can also put `noindex` in an HTTP header:
+
+    X-Robots-Tag: noindex
+
+Any page with `noindex` set will not be shown on The Website.
+As crawling is an intermittent process, pages may remain on The Website until after the next crawl is completed.
+
+You can read more about [`noindex` in Google's documentation](https://developers.google.com/search/docs/crawling-indexing/block-indexing).
+
+
+### noarchive tag
+
+While The Project doesn't operate an archive,
+[other sites](https://github.com/ralexander-phi/rss-blogroll-network/issues/8)
+may use The Project's data as part of their archival process.
+
+When The Crawler sees the `noarchive` tag on a feed, the `noarchive` tag will be used on any pages on
+The Website that were generated for that feed.
+Note that RSS and Atom feeds are not HTML documents, so you'll need to use the HTTP header approach mentioned above.
+Archivers like the [Internet Archive](https://archive.org/post/31561/robots-archive-noarchive-meta-tags)
+respect the `noarchive` tag.
+
+
+### HTTP status codes
+
+The Crawler will not index content that is restricted.
+A HTTP request that returns a [401 Unauthorized](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
+status code, for example, will not be indexed.
+
+You can selectively block The Crawler by detecting the User Agent String
+([`Feed2Pages/0.1`](https://github.com/ralexander-phi/feed2pages-action/blob/main/const.go#L3))
+and returning
+[one of the supported status codes](https://github.com/ralexander-phi/feed2pages-action/blob/ffc36d4ff5827d3f8db2ad2f7ec9a47dc30ff2a3/crawler.go#L27).
+
+
+## Use of collected data
+
+The Website displays collected data for ease of browsing.
+The public data collected by the crawler is an [open data set and is publicly available](https://alexsci.com/rss-blogroll-network/index.json) on The Website.
+Expected additional uses include recommendation and discovery systems for RSS readers.
+As an open data set, others may use the data in other ways.
+
+
+## Open source code
+
+The crawler is open source and the code is available for review:
+
+* [The Crawler](https://github.com/ralexander-phi/feed2pages-action)
+* [The Website](https://github.com/ralexander-phi/rss-blogroll-network)
+
+You may inspect the behavior in detail.
+
+
+## Errata and changes
+
+This page will be updated as privacy impacting changes occur.
+This project is run by a human, I occasionally make mistakes, let me know if you see any
+bugs, errors or omissions.
diff --git a/content/standalone/scoring.md b/content/standalone/scoring.md
@@ -0,0 +1,107 @@
+---
+title: Scoring Criteria
+url: /scoring/index.html
+---
+
+The current scoring methodology aims to promote adoption of features that will support the social aspects of
+the blogroll network.
+A higher score doesn't mean the blog is more interesting to read, or has quality content,
+just that it has adopted features that support the blogroll network.
+Future recommendation system-like features are planned.
+
+
+## Promotes others
+
+***Up to 10 points***
+
+A blog should have an OPML blogroll that promotes other blogs which author enjoys reading.
+Peer recommendations are the core of the blogroll network.
+Each recommendation earns a point.
+
+See [the blog post](https://alexsci.com/blog/blogroll-network/#deploy-your-own-discoverable-blogroll) for more.
+
+
+## Promoted by others
+
+***Up to 5 points***
+
+Being promoted by others is a strong quality signal and ensures your blog is discoverable from the network.
+Peer recommendations are the core of the blogroll network.
+Each recommendation earns one point.
+
+You'll need other sites to adopt OPML blogrolls and promote your site to earn these points.
+Ask your friends.
+
+
+## Has a linked website
+
+***Up to 2 points***
+
+A blog should link to it's feed using the `<link rel="alternate" ...>` syntax (1 point).
+The feed should backlink to the website (1 point).
+These links help readers identify feeds and helps crawlers associate feeds with their websites.
+
+
+For example, to link your RSS feed from your website you'd include this code in the head of your pages:
+
+    <link rel="alternate" type="application/rss+xml" href="https://example.com/feed.xml" title="RSS Feed">
+
+Then in your RSS feed you'd link back to the website:
+
+    <link>https://example.com</link>
+
+
+If you are using a blogging framework, these should be automatically handled.
+
+
+## Has rel=me links
+
+***Up to 2 points***
+
+A website should link to other related websites using the `<link rel="me" ...>` syntax.
+This helps readers find your content across any websites or social media platforms you use.
+A `rel=me` link earns a point and a backlink (which verifies the link) earns another point.
+
+[Learn about rel=me](https://microformats.org/wiki/rel-me).
+
+
+## Has feed categories
+
+***Up to 5 points***
+
+A feed should include categories to help readers understand the themes of the blog.
+Each category earns one point.
+
+[Learn about post categories](https://www.rssboard.org/rss-specification#ltcategorygtSubelementOfLtitemgt)
+and read the [blog post](https://alexsci.com/blog/rss-categories/) about how categories are used.
+
+## Has post categories
+
+***Up to 3 points***
+
+A post should include categories to help readers understand the themes of each post.
+Each category tag on the latest post earns one point.
+
+[Learn about post categories](https://www.rssboard.org/rss-specification#ltcategorygtSubelementOfLtitemgt)
+and read the [blog post](https://alexsci.com/blog/rss-categories/) about how categories are used.
+
+
+## Has a feed title
+
+***3 points***
+
+A feed should have a title to help readers identify the blog.
+
+
+## Has a feed description
+
+***3 points***
+
+A feed should have a description to help readers understand what the blog is about.
+
+
+## Has a feed language
+
+***1 point***
+
+A feed should specify the language it uses.
diff --git a/layouts/index.html b/layouts/index.html
@@ -64,4 +64,12 @@ <h2 class="subtitle is-4 pt-6">Create your own OPML blogroll</h2>
 If your site is not at the edge of the network, you can <a href="https://github.com/ralexander-phi/rss-blogroll-network/issues/new/choose">request that your feed be added as a seed</a>.
 </p>
 
+<h2 class="subtitle is-4 pt-6">Disclaimer</h2>
+
+<p class="block">
+This project aggregates content from across the web in an inclusive fashion.
+Minimal filtering is applied.
+Linked content is not endorsed and does not reflect the opinions of the project owner or his employer.
+</p>
+
 {{- end -}}