A Web Clipper of Sorts for Org-Mode
1. An Org-Mode Web Clipper
1.1. What Is It?
I wanted a way to quickly capture web pages, or selections from web pages, when using either the W3M or EWW Emacs internal browsers (think: Evernote web clipper and the like). Now, there are things out there that allow for capturing web pages, but they weren’t the fast and simple sort of thing that I was looking for.
For instance, one of them requires you to set up a file of links, with URLs and other information stored as properties on a headline describing the website being archived. This is really powerful, but it’s multi-step and relies on an intermediate file.
There are also clever methods using org-protocol, but I wanted to work with an internal browser, not an external one. Again, I was looking for speed and simplicity.
So I rolled my own from existing functions and components. It’s the Emacs way.
1.2. Org-Clipper Coding
First, I set up a special org-capture template:
("w" "Website" plain (function org-website-clipper) "* %a\n%T\n" :immediate-finish t)
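Getting org-capture to call a custom function turned out to be less than easy. The only place I found to do it was the template’s target, the (function org-website-clipper) entry, which org-capture calls to position itself in the capture file. So I made use of that and essentially ‘overloaded’ it: the function does the positioning, but it also pastes in the page content.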
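For context, here is one way the template might be registered. This is only a sketch: wrapping it in `with-eval-after-load` and using `add-to-list` are my assumptions about how your init file is organized, and it assumes `org-website-clipper` is defined elsewhere in that file.

```elisp
;; Sketch: register the "w" template once org-capture has loaded.
;; Assumes `org-website-clipper' is defined elsewhere in your init file.
(with-eval-after-load 'org-capture
  (add-to-list 'org-capture-templates
               '("w" "Website" plain
                 ;; Target: a function that org-capture calls to find
                 ;; the capture location.
                 (function org-website-clipper)
                 "* %a\n%T\n"
                 :immediate-finish t)))
```

If you already set `org-capture-templates` in one big list, just add the entry there instead.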
Then I put together the code that makes it work.
;; org-eww and org-w3m should be in your org distribution, but see
;; note below on patch level of org-eww.
(require 'org-eww)
(require 'org-w3m)

(defvar org-website-page-archive-file "~/organize/website/websites.org")

(defun org-website-clipper ()
  "When capturing a website page, go to the right place in capture file,
but do sneaky things. Because it's a w3m or eww page, we go
ahead and insert the fixed-up page content, as I don't see a
good way to do that from an org-capture template alone. Requires
Emacs 25 and the 2017-02-12 or later patched version of org-eww.el."
  (interactive)

  ;; Check for acceptable major mode (w3m or eww) and copy the page
  ;; (or region) to the kill ring as org text. Error if unknown mode.
  (cond
   ((eq major-mode 'w3m-mode)
    (org-w3m-copy-for-org-mode))
   ((eq major-mode 'eww-mode)
    (org-eww-copy-for-org-mode))
   (t
    (error "Not valid -- must be in w3m or eww mode")))

  ;; Check if the archive file exists yet.
  ;; If not, create any missing directories (the `t' also creates parents).
  (unless (file-exists-p org-website-page-archive-file)
    (let ((dir (file-name-directory org-website-page-archive-file)))
      (unless (file-exists-p dir)
        (make-directory dir t))))

  ;; Open the archive file and yank in the content.
  ;; Headers are fixed up later by org-capture.
  (find-file org-website-page-archive-file)
  (goto-char (point-max))
  ;; Leave a blank line for org-capture to fill in
  ;; with a timestamp, URL, etc.
  (insert "\n\n")
  ;; Insert the web content but keep our place.
  (save-excursion (yank))
  ;; Don't keep the page info on the kill ring.
  ;; Also fix the yank pointer.
  (setq kill-ring (cdr kill-ring))
  (setq kill-ring-yank-pointer kill-ring)
  ;; Final repositioning.
  (forward-line -1))
This works for both EWW and W3M. You’ll want to change the variable ’org-website-page-archive-file’ to something suitable for you.
1.3. Doing It
It’s simplicity itself. In EWW or W3M, when you’re on a page you want to capture, you can mark out a capture region. If you don’t, the default is to save the whole page. Then invoke org-capture, probably with ’C-c c’. Select the ’w’ template and that’s it.
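If even the template menu feels like one keystroke too many, a small wrapper can go straight to the ’w’ template. This is a sketch only: the command name `my/org-clip-page` and the `C-c w` binding are hypothetical choices of mine, and only EWW is bound here.

```elisp
;; Hypothetical helper: jump straight to the "w" template,
;; skipping the template-selection menu.
(defun my/org-clip-page ()
  "Clip the current EWW or W3M page with the \"w\" capture template."
  (interactive)
  (org-capture nil "w"))

;; Example bindings; the keys are arbitrary choices.
(global-set-key (kbd "C-c c") #'org-capture)
(with-eval-after-load 'eww
  (define-key eww-mode-map (kbd "C-c w") #'my/org-clip-page))
```

Marking a region first should still work the same way, since the wrapper only saves you the template selection.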
On pages with a lot of links, it’s not as speedy as I might wish, as all those links get converted to org-mode compatible links (but you really want that). I imagine that a page with far too many links could make the thing blow up, but I haven’t seen that yet.
Your pages get continuously concatenated in your single archive file. This might get pretty big after a while (okay, it will get pretty big). Every so often you might want to do a little killing and yanking, moving entries to other files, or getting rid of cruft that you don’t want. Emacs does fine with large files up to a point, but if you’re starting to look at zillions of megabytes, you might want to do something about it.
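To know when it’s time for that housekeeping, something like the following could report how big the archive has grown. It is a rough sketch, not part of the clipper: the names `my/org-website-archive-warn-mb`, `my/org-website-archive-size-mb` and `my/org-website-archive-check`, and the 50 MB threshold, are all made up for illustration.

```elisp
;; Same default as in the clipper code; `defvar' will not override
;; a value you have already set.
(defvar org-website-page-archive-file "~/organize/website/websites.org")

;; Hypothetical threshold; pick whatever "too big" means to you.
(defvar my/org-website-archive-warn-mb 50
  "Warn when the clipping archive grows past this many megabytes.")

(defun my/org-website-archive-size-mb (file)
  "Return the size of FILE in megabytes, or nil if FILE does not exist."
  (let ((attrs (file-attributes (expand-file-name file))))
    (when attrs
      ;; Element 7 of the attribute list is the size in bytes.
      (/ (float (nth 7 attrs)) (* 1024 1024)))))

(defun my/org-website-archive-check ()
  "Report the archive size and warn when it exceeds the threshold."
  (interactive)
  (let ((mb (my/org-website-archive-size-mb org-website-page-archive-file)))
    (cond
     ((null mb) (message "No archive file yet"))
     ((> mb my/org-website-archive-warn-mb)
      (message "Archive is %.1f MB; time to prune or split it" mb))
     (t (message "Archive is %.1f MB" mb)))))
```

Run `M-x my/org-website-archive-check` now and then, or call it at the end of the clipper function if you’d rather be nagged automatically.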