Cannon is an idea for a project attempting to compute canonical/normalised URLs and extract some information from them (‘entities’), merely by looking at the URL, and ideally without using the "rel=canonical" metadata.
I describe the problem it tries to solve here: "urls are broken", also see "motivation".
At the moment it’s a subproject of promnesia: see cannon.py and tests/cannon.py.
If anyone knows of similar efforts/prior art, please let me know! I’d really like to avoid reinventing the wheel here.
Once you are sold on the motivation in this section and are wondering why this would require a separate library/database, check out the "testcases" section.
[2020-04-04]
I want urls that represent information, regardless of the way it’s presented
let alone all the tracking/etc crap
[2020-05-23]
"document equivalence" is a good term: How to establish (or avoid) document equivalence in the Hypothesis system : Hypothesis
[2020-05-27]
Google no longer providing original URL in AMP for image search results
[2019-10-11]
mobile versions of sites sometimes have different "canonical", e.g. mobile.twitter.com
No one would dispute that a tweet is the same regardless of where it’s presented, yet there is no easy way to unify this
[2020-05-28]
archive.org is messing with canonical [[cannon]]
[2019-11-02]
e.g. this link doesn’t have ‘canonical’ even though it’s a mirror: https://solar.lowtechmagazine.com/2016/11/the-curse-of-the-modern-office.html
[2019-11-08]
no canonical on gist https://gist.github.com/dneto/2258454
same as https://gist.github.com/2258454 – hmm, this thing redirects now..
[2019-08-19]
parent and sibling relations can be determined from the URL [[cannon]] [[promnesia]]
e.g. subreddit-post/user-comment/user-tweet, etc.
[2019-11-01]
if the original page is gone I can still easily link my saved annotations (Instapaper/Pocket/Hypothesis) to archived page
https://web.archive.org/web/20090902224414/http://reason.com/news/show/119237.html
[2019-09-07]
urls are a good candidate to determine ‘entities’ because they are at least somewhat curated [[cannon]]
[2019-02-24]
normalization is tricky.. for some urls, stuff after # is important https://en.wikipedia.org/wiki/Tendon#cite_note-14 . for some, it’s utter garbage
however we can sort of get away with normalizing on server only?
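A minimal sketch of one way to treat fragments, assuming a hand-maintained set of hosts where the part after # actually identifies content. KEEP_FRAGMENT_HOSTS and drop_useless_fragment are hypothetical names for illustration, not something cannon.py is known to contain, and example.org is just a placeholder host.

```python
from urllib.parse import urlsplit

# Assumption: a curated allow-list of hosts whose fragments are meaningful.
KEEP_FRAGMENT_HOSTS = {'en.wikipedia.org'}


def drop_useless_fragment(url: str) -> str:
    """Strip the fragment unless the host is known to use it for real content."""
    parts = urlsplit(url)
    if parts.hostname in KEEP_FRAGMENT_HOSTS:
        return url
    # SplitResult is a namedtuple, so _replace gives a copy with an empty fragment
    return parts._replace(fragment='').geturl()


wiki = 'https://en.wikipedia.org/wiki/Tendon#cite_note-14'
other = 'https://example.org/page#random-anchor'
print(drop_useless_fragment(wiki))
print(drop_useless_fragment(other))
```

The allow-list is exactly the kind of ‘curated’ data that would have to be collected per site; a generic rule alone can’t tell the two cases apart.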
[2019-08-07]
The Problem With URLs https://blog.codinghorror.com/the-problem-with-urls/
[2019-08-27]
not very insightful, example of msdn with weird characters in urls
[2020-01-02]
motivation: siloing: instapaper ‘imports’ pages and assigns an id: https://www.instapaper.com/read/1265139707
so you can’t connect your annotations on instapaper to notes etc
[2021-03-07]
could normalize historic URLs which are already down? [[linkrot]]
perhaps not super useful if we can’t access them, but still
Apart from Promnesia, I believe it could be quite useful for other projects.
[2019-06-27]
Hmm could be helpful for hypothesis? [[hypothesis]]
[2020-04-29]
write about it? the future?
[2021-01-16]
discuss cannon (maybe on Slack)? [[hypothesis]] [[cannon]]
[2019-05-24]
Annotation of content on sites like Facebook or Twitter? - Google Groups [[hypothesis]]
kinda related since they basically want canonical urls
[2021-01-30]
Ignore URL parameters - Feature Requests - Memex Community [[worldbrain]]
[2021-01-22]
wonder if we could cooperate? [[agora]] [[cannon]]
[2021-01-24]
would be useful to use the same normalising engine for #archivebox for example? [[webarchive]]
[2021-03-10]
although I guess it needs to fetch the page anyway so "rel=canonical" works well enough
[2021-02-07]
could be useful for surfingkey/nyxt browser to hint ‘interesting’ urls?
[2019-12-26]
archive.org [[linkrot]]
e.g. if the link is not present in archive.org, it doesn’t mean it’s not archived under a different canonical
e.g. blockers, various highlighters, hypothesis, etc
[2020-12-07]
einaregilsson/Redirector: Browser extension (Firefox, Chrome, Opera, Edge) to redirect urls based on regex patterns, like a client side mod_rewrite
[2020-11-20]
could reuse URL underlying etc with ampie? [[ampie]]
URL normalization algorithm should be shared with other projects to the maximum extent possible.
If not the exact algorithm, at least the ‘curated’ parts of it like regexes, testcases, etc should be shared.
It’s crap, boring work that should only be done once (e.g. like the timezones database).
[2020-06-30]
ClearURLs / Addon: looks super super promising
Once ClearURLs has cleaned the address, it will look like this: https://www.amazon.com/dp/exampleProduct
[2021-03-10]
https://github.com/ClearURLs/Addon/wiki/Rules: Not super convinced JSON would work well in general, but anyway it’s already pretty good.
[2020-11-22]
WorldBrain/memex-url-utils: Shared URL processing utilities for Memex extension and mobile apps. [[worldbrain]]
[2019-07-09]
h/uri.py at 0fc8a0d345741d43b4f80856a7cbb8f5afa70f80 · hypothesis/h https://github.com/hypothesis/h/blob/0fc8a0d345741d43b4f80856a7cbb8f5afa70f80/h/util/uri.py [[hypothesis]]
[2019-07-09]
excluded query params!
[2019-07-09]
right, I could probably reuse hypothesis’s canonify and contribute back. looks very similar to mine
[2020-05-12]
coleifer/micawber: a small library for extracting rich content from urls
[2021-03-10]
ok, pretty interesting. it probably uses network, but could at least use it for testing (or maybe even ‘enriching’?)
[2019-03-27]
sindresorhus/compare-urls: Compare URLs by first normalizing them
compareUrls('HTTP://sindresorhus.com/?b=b&a=a', 'sindresorhus.com/?a=a&b=b');
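For comparison, a rough Python sketch of the same "normalise first, then compare" idea, using only the standard library. rough_normalise and same_url are hypothetical helpers written for this note; they are not the compare-urls implementation and not what cannon.py does. Dropping the scheme and the www. prefix is an assumption about what counts as ‘the same’.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def rough_normalise(url: str) -> str:
    """Very rough generic normalisation: no scheme, lowercase host, sorted query."""
    if '://' not in url:
        # assume http so that urlsplit can find the host
        url = 'http://' + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith('www.'):
        host = host[len('www.'):]
    # sort the parameters so that their order doesn't matter
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    result = host + parts.path.rstrip('/')
    if query:
        result += '?' + query
    return result


def same_url(a: str, b: str) -> bool:
    return rough_normalise(a) == rough_normalise(b)


print(same_url('HTTP://sindresorhus.com/?b=b&a=a', 'sindresorhus.com/?a=a&b=b'))
```

Generic rules like these don’t know anything about individual sites, so they can’t tell a meaningful query parameter from a tracking one.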
[2019-12-25]
sindresorhus/normalize-url
stripWWW
can’t handle amp etc
[2019-07-09]
hypothesis: h/normalize_uris_test.py
[2019-04-16]
niksite/url-normalize: URL normalization for Python
[2020-04-27]
john-kurkowski/tldextract: Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
hmm could use this for better extraction…
[2019-03-27]
rbaier/python-urltools: Some functions to parse and normalize URLs.
[2021-03-07]
maybe we can achieve 95% accuracy with generic rules and by handling the most popular websites
for the rest
Could also be useful for Archive.org/archivebox/etc. But a bit out of scope for this project..
e.g. twitter.com/user/status/statusid
maybe normalise to this?
twitter.com/i/web/status/1053151870791835649
reddit.com/comments/5ombk8 – huh, normalise to this?
TODO m.reddit/old.reddit
en.m.wikipedia/ru.m.wikipedia
maybe strip off the subdomain completely?
youtube.com/watch?v=xAy—wpDQ&list=PL0kyDgrqAiUEF5d7krLIds1ebhTxCjm&shuffle=221
youtube.com/watch?v=Woa3MPijE3s&list=PL0kyDgrqAiXKspaa1GIS0jbbLrsAa3sk&spfreload=10
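One possible middle ground for the subdomain question, as a sketch: drop only the labels that describe presentation (mobile/old/www) and keep the rest, so language subdomains survive. PRESENTATION_LABELS and strip_presentation_labels are hypothetical names, and the set of labels is a guess, not a list taken from cannon.py.

```python
# Assumption: labels that only select a presentation of the same content
PRESENTATION_LABELS = {'www', 'm', 'mobile', 'old'}


def strip_presentation_labels(host: str) -> str:
    labels = host.lower().split('.')
    # Naive assumption: the last two labels form the registered domain.
    # That is wrong for suffixes like co.uk; tldextract would handle those properly.
    subdomains, registered = labels[:-2], labels[-2:]
    kept = [label for label in subdomains if label not in PRESENTATION_LABELS]
    return '.'.join(kept + registered)


for host in ['en.m.wikipedia.org', 'ru.m.wikipedia.org', 'mobile.twitter.com', 'old.reddit.com']:
    print(host, '->', strip_presentation_labels(host))
```

Stripping the subdomain completely would be simpler, but it would also merge en.wikipedia and ru.wikipedia, which are different documents.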
[2019-11-09]
also this to summarize
sqlite3 promnesia.sqlite 'select domain, count(domain) from (select substr(normurl, 0, instr(normurl, "/")) as domain from visits) group by domain order by count(domain)'
consider https://www.youtube.com/watch?v=wHrCkyoe72U&list=WL
basically
ok so how do we generalize from two examples?
e.g. say we also have
youtube.ru/watch?v=abacaba -> youtube/abacaba
we get
youtube | keep
ru | drop
watch | drop
v abacaba | keep
I suppose it could guess that if we keep a query parameter once, we’ll keep it always?
and if we extracted a certain substring without a query parameter, we’ll also always keep it as is?
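The "keep it once, keep it always" guess could be sketched like this: look at hand-written (raw, normalised) pairs and remember which query parameter names had their value survive into the normalised form. learn_kept_params and keep_known_params are hypothetical helpers, and the normalised form of the first example is an assumption, since only the youtube.ru pair is spelled out above.

```python
from urllib.parse import urlsplit, parse_qsl


def learn_kept_params(examples):
    """examples: (raw url without scheme, hand-written normalised form) pairs."""
    kept = set()
    for raw, normalised in examples:
        segments = normalised.split('/')
        # '//' prefix makes urlsplit treat the first component as the host
        for name, value in parse_qsl(urlsplit('//' + raw).query):
            if value in segments:
                # the value survived normalisation, so remember the parameter name
                kept.add(name)
    return kept


def keep_known_params(raw, kept):
    """Apply the learned rule to a new url: keep only the remembered parameters."""
    return [(n, v) for n, v in parse_qsl(urlsplit('//' + raw).query) if n in kept]


examples = [
    # assumed normalised form, by analogy with the second pair
    ('www.youtube.com/watch?v=wHrCkyoe72U&list=WL', 'youtube/wHrCkyoe72U'),
    ('youtube.ru/watch?v=abacaba', 'youtube/abacaba'),
]
kept = learn_kept_params(examples)
print(kept)
print(keep_known_params('youtube.com/watch?v=Woa3MPijE3s&list=PL0kyDgrqAiXKspaa1GIS0jbbLrsAa3sk&spfreload=10', kept))
```

This only covers the query-parameter half of the guess; learning which host and path components to keep or drop would need more examples per site.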
TODO how about this?
https://news.ycombinator.com/reply?id=25100810&goto=item%3Fid%3D25099862%2325100810
it’s a reply to https://news.ycombinator.com/item?id=25100035
which is a comment to https://news.ycombinator.com/item?id=25099862
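For the reply link, the nested target is at least mechanically recoverable with the standard library, since goto is just a percent-encoded relative URL. A small sketch (unpack_goto is a hypothetical name); which of the ids should count as the parent is still the open question.

```python
from urllib.parse import urlsplit, parse_qs


def unpack_goto(url: str):
    """Return (reply id, thread id, fragment) from a HN reply link."""
    outer = parse_qs(urlsplit(url).query)
    # parse_qs already percent-decodes, so goto becomes 'item?id=...#...'
    inner = urlsplit(outer['goto'][0])
    inner_query = parse_qs(inner.query)
    return outer['id'][0], inner_query['id'][0], inner.fragment


url = 'https://news.ycombinator.com/reply?id=25100810&goto=item%3Fid%3D25099862%2325100810'
print(unpack_goto(url))
```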
[2019-09-03]
should be idempotent?
http get 'http://archive.org/wayback/available?url=https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories'
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "status": "200",
            "timestamp": "20210219235548",
            "url": "http://web.archive.org/web/20210219235548/https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories"
        }
    },
    "url": "https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories"
}
Some tricky cases which would be nice to get right
[2020-11-15]
Wendover Productions - YouTube
[2020-04-19]
roam links
[2021-02-07]
https://app.element.io/#/room/#blockchain:fosdem.org [[cannon]]
[2021-02-16]
A Relational Turn for Data Protection? by Neil M. Richards, Woodrow Hartzog :: SSRN [[cannon]]
abstractid
[2019-06-23]
A Brief Intro to Topological Quantum Field Theories. - YouTube https://www.youtube.com/watch?v=59uLGIrkMxM&list=WL&index=61&t=0s
eh, rules might be a bit complicated. E.g. if both v and list are present, we wanna ditch list, otherwise keep list
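The v/list rule on its own is small enough to sketch. youtube_key is a hypothetical helper, and the shape of the output strings is an assumption for illustration; the point is only the precedence between the two parameters.

```python
from urllib.parse import urlsplit, parse_qs


def youtube_key(url: str):
    """If both v and list are present, ditch list; otherwise keep whichever is there."""
    query = parse_qs(urlsplit(url).query)
    if 'v' in query:
        # a specific video wins; list/index/t only describe how it was reached
        return 'youtube.com/watch?v=' + query['v'][0]
    if 'list' in query:
        return 'youtube.com/playlist?list=' + query['list'][0]
    return None


print(youtube_key('https://www.youtube.com/watch?v=59uLGIrkMxM&list=WL&index=61&t=0s'))
print(youtube_key('https://www.youtube.com/playlist?list=PL0kyDgrqAiUEF5d7krLIds1ebhTxCjm'))
```

Dropping t is debatable: a timestamp is arguably a fragment of the video, similar to the # cases elsewhere in these notes.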
[2020-11-16]
[normalise DOI](https://twitter.com/amogh_jalihal/status/1328393853599059970)
Ah sure: This DOI: https://doi.org/10.1073/pnas.1211902109 should lead to this paper: https://pnas.org/content/109/48/E3324 .
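The part of DOI handling that doesn’t need the network is unifying the different spellings of the same DOI. A minimal sketch, assuming the usual resolver prefixes and relying on DOI names being case-insensitive; normalise_doi and the doi.org/ output form are hypothetical choices. Mapping the publisher page (pnas.org/content/...) back to the DOI can’t be done by looking at the URL alone.

```python
import re

# resolver prefixes that are commonly put in front of a bare DOI
_DOI_PREFIX = re.compile(r'^(?:https?://(?:dx\.)?doi\.org/|doi:)', re.IGNORECASE)


def normalise_doi(text: str) -> str:
    bare = _DOI_PREFIX.sub('', text.strip())
    # DOI names are case-insensitive, so lowercase for comparison
    return 'doi.org/' + bare.lower()


print(normalise_doi('https://doi.org/10.1073/pnas.1211902109'))
print(normalise_doi('doi:10.1073/PNAS.1211902109'))
```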
[2019-07-23]
X.m.wikipedia.org
[2019-07-23]
mm, it’s got canonical though..
[2019-07-23]
perhaps promnesia should respond both to canonical and its own idea of normalised (preferring canonical)
[2019-04-20]
fragments: Aharonov-Bohm Experiment https://physicstravelguide.com/experiments/aharonov-bohm#tab__concrete
url normalising… this is an example where fragments are important
[2019-08-26]
here I guess it could yield url with hash + parent url?
[2019-08-26]
always assume that parents in uri hierarchy are actual parents? I guess that’s fairly reasonable
[2019-08-25]
stuff like this: youtu.be/1TKSfAkWWN0
[2019-08-25]
this is also motivation for canonifying. this is a redirect link in tweet, and there is no way to associate it with canonical
[2020-05-02]
https://hubs.mozilla.com/#/ [[cannon]]
[2020-04-30]
Writing well | defmacro
support for archive.org and test on this page
[2020-05-28]
Wayback Machine https://web.archive.org/web/2019*/http://www.defmacro.org/2016/12/22/writing-well.html
[2019-11-15]
maybe https://youtu.be/zRxI0DaQrag?t=1380 ?
[2019-11-09]
github: https://twitter.com/i/web/status/928602151286386688 this ends up trimmed with … :(
[2019-11-07]
github: https://twitter.com/i/web/status/1156086851633131520
[2021-01-24]
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941827 [[cannon]]
https://wiki.debian.org/SecureBoot#MOK_-_Machine_Owner_Key
canonical: wiki.debian.org/SecureBoot
sources : notes
[[https://wiki.debian.org/SecureBoot][SecureBoot - Debian Wiki]]
[2021-02-28]
https://undeadly.org/cgi?action=article;sid=20170930133438 [[cannon]]
‘sid’ matters here
ru.wikipedia.org/wiki/Грамматикализация
[2019-12-23]
https://cstheory.stackexchange.com/questions/1920/examples-of-unrelated-mathematics-playing-a-fundamental-role-in-tcs/1925#1925: need parent link to trigger on this in cannon
[2020-06-16]
https://news.ycombinator.com/item?id=23537243#23540421 hmm, both id and # ?
[2020-02-08]
https://bugzilla.mozilla.org/show_bug.cgi?id=1411873 : ugh need to keep id
[2020-01-12]
old.reddit and new reddit
[2019-06-02]
handle google.com/search
[2020-11-30]
https://www.c-span.org/video/?c4808083/rust-language-chosen the ? is sneaky
[2020-11-22]
https://melpa.org/#/async # is just redundant?
[2019-08-25]
Lisp Language http://wiki.c2.com/?LispLanguage ? is sneaky
eh, urls can have commas… e.g. http://adit.io/posts/2013-04-17-functors,_applicatives,_and_monads_in_pictures.html
so, for csv need a separate extractor.
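A sketch of what such a csv extractor could look like, assuming the file is properly quoted csv: let the csv module split the cells instead of cutting on commas, then pick the cells that look like urls. urls_from_csv is a hypothetical helper, not promnesia’s actual extractor.

```python
import csv
import io


def urls_from_csv(text: str):
    """Collect cells that look like urls, leaving comma handling to the csv parser."""
    urls = []
    for row in csv.reader(io.StringIO(text)):
        for cell in row:
            cell = cell.strip()
            if cell.startswith(('http://', 'https://')):
                urls.append(cell)
    return urls


url = 'http://adit.io/posts/2013-04-17-functors,_applicatives,_and_monads_in_pictures.html'
buf = io.StringIO()
# csv.writer quotes the cell because it contains commas
csv.writer(buf).writerow(['some note', url])
print(urls_from_csv(buf.getvalue()))
```

If the file was written without quoting, the url is already broken at the source and no parser can recover it reliably.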
[2020-11-18]
Vanquishing ‘Monsters’ in Foundations of Computer Science: Euclid, Dedekind, Frege, Russell, Gödel, Wittgenstein, Church, Turing, and Jaśkowski didn’t get them all … by Carl Hewitt :: SSRN
ValueError: netloc ' +79869929087, mak34@gmail.com' contains invalid characters under NFKC normalization
[2019-08-26]
did I do it?
[2020-12-09]
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=955208 ‘bug’ parameter
[2020-12-04]
https://unix.stackexchange.com/questions/117609/capture-error-of-ls-to-file#comment183614_117609
[2019-02-18]
make sure ? extracted correctly https://play.google.com/store/apps/details?id=com.faultexception.reader
[2019-05-04]
https://news.ycombinator.com/item?id=12973788
id here is important
[2021-03-15]
wiki.c2.com pages don’t even have canonical? [[cannon]]
[2019-09-03]
potential pypi project? https://pypi.org/project/cannon
e.g. this is very likely to be mapped to normal py docs
file:///usr/share/doc/python3/html/library/contextlib.html
[2020-05-11]
Vision, Mission & Values — 2020 Update - WorldBrain.io - Medium
fragments are often random and useless
even default org-mode is guilty
[2019-07-09]
Changed how threading works. by JakeHartnell · Pull Request 952 · hypothesis/h https://github.com/hypothesis/h/pull/952 [[hypothesis]] [[reddit]]
huh, so reddit seems to normalise to the main page, and displays annotations as ‘orphaned’ for comment views?
[2019-07-09]
so looks like reddit refers to the ‘post’ page as canonical. Right.
[2021-03-26]
URLTeam - Archiveteam [[cannon]]
[2021-03-25]
seomoz/url-py: URL Transformation, Sanitization [[cannon]]
[2021-03-03]
Jon Borichevskiy (@jondotbo) / Twitter [[promnesia]] [[cannon]]
hmm how to resolve twitter renames?…