Hi Emacs community,
I’m an elisp noob, and I recently wrote a function to get the references on a Wikipedia page. I plan on using it with org-mode/org-roam so I can do research faster (even though there’s probably already a package for that sort of thing). Unfortunately, it’s probably not as robust as I would like to think it is: some of the DOIs/ISBNs appear to be missing on some of the Wikipedia pages I’ve tested. Here it is for reference:
(defun get-wikipedia-references (subject)
  "Gets references for a wikipedia article"
  (let ((wikipedia-prefix-url "https://en.wikipedia.org/wiki/"))
    (with-current-buffer
        (url-retrieve-synchronously (concat wikipedia-prefix-url subject))
      (let* ((html-start (progn (goto-char (point-min))
                                (re-search-forward "^$")))
             (dom (libxml-parse-html-region (1+ (point)) (point-max)))
             (result))
        (dolist (cite-tag (dom-by-tag dom 'cite) result)
          (let ((cite-class (dom-attr cite-tag 'class)))
            (cond ((string-search "journal" cite-class)
                   (let ((a-tag (dom-search cite-tag (lambda (tag) (string-prefix-p "https://doi.org" (dom-attr tag 'href))))))
                     (setq result (cons (cons (concat "doi:" (dom-text a-tag))
                                              (let* ((cite-texts (dom-texts cite-tag))
                                                     (title-beg (1+ (string-search "\"" cite-texts)))
                                                     (title-end (string-search "\"" cite-texts (1+ title-beg))))
                                                (substring cite-texts title-beg title-end)
                                                ))
                                        result))))
                  ((string-search "book" cite-class)
                   (let ((a-tag (dom-search cite-tag (lambda (tag) (string-prefix-p "/wiki/Special:BookSources" (dom-attr tag 'href))))))
                     (setq result (cons (cons (concat "isbn:" (dom-text (dom-child-by-tag a-tag 'bdi)))
                                              (dom-text (dom-child-by-tag cite-tag 'i)))
                                        result))))
                  (t
                   (let ((a-tag (assoc 'a cite-tag)))
                     (setq result (cons (cons (dom-attr a-tag 'href) (dom-text a-tag)) result))))
                  ))
          )))))
(get-wikipedia-references "Graph_traversal")
(("doi:10.1109/SFCS.1979.34" . "Random walks, universal traversal sequences, and the complexity of maze problems")
("doi:10.1016/j.tcs.2015.11.017" . "Lower and upper competitive bounds for online directed graph exploration")
("doi:10.1016/j.tcs.2020.06.007" . "Online graph exploration on a restricted graph class: Optimal solutions for tadpole graphs")
("doi:10.1587/transinf.E92.D.1620" . "The Online Graph Exploration Problem on Restricted Graphs")
("doi:10.1016/j.tcs.2021.04.003" . "An improved lower bound for competitive graph exploration")
("doi:10.1137/0206041" . "An Analysis of Several Heuristics for the Traveling Salesman Problem"))
And yes, I know that I could probably use a library like s, dash, seq, or cl, but I try to keep my elisp functions free of those kind of things. I would appreciate any criticism from the Emacs community about my elisp!
Having dangling parentheses like )) ))))) is not very lispy.
My first suggestion would be to use plz for HTTP. Then I’d use cl-loop and pcase to simplify the rest of the code. Here’s a partial rewrite with a TODO for further exercise. :)
(require 'cl-lib)
(require 'dom)
(require 'plz) ; third-party package, available from GNU ELPA

(defun wikipedia-article-references (subject)
  (let* ((url (format "https://en.wikipedia.org/wiki/%s" (url-hexify-string subject)))
         (dom (plz 'get url :as #'libxml-parse-html-region)))
    (cl-loop for cite-tag in (dom-by-tag dom 'cite)
             for cite-class = (dom-attr cite-tag 'class)
             collect (pcase cite-class
                       ((rx "journal")
                        (let ((a-tag (dom-search cite-tag
                                                 (lambda (tag)
                                                   (string-prefix-p "https://doi.org" (dom-attr tag 'href))))))
                          (cons (concat "doi:" (dom-text a-tag))
                                ;; TODO: Use `string-match' with `rx' and `match-string' here.
                                (let* ((cite-texts (dom-texts cite-tag))
                                       (title-beg (1+ (string-search "\"" cite-texts)))
                                       (title-end (string-search "\"" cite-texts (1+ title-beg))))
                                  (substring cite-texts title-beg title-end)))))
                       ((rx "book")
                        (let ((a-tag (dom-search cite-tag
                                                 (lambda (tag)
                                                   (string-prefix-p "/wiki/Special:BookSources" (dom-attr tag 'href))))))
                          (cons (concat "isbn:" (dom-text (dom-child-by-tag a-tag 'bdi)))
                                (dom-text (dom-child-by-tag cite-tag 'i)))))
                       (_
                        (let ((a-tag (assoc 'a cite-tag)))
                          (cons (dom-attr a-tag 'href) (dom-text a-tag))))))))
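For anyone curious about the TODO in the rewrite, here is a minimal sketch of the title extraction done with string-match, rx and match-string. The function name my/cite-title is made up for illustration, and it assumes the title is the first run of text between straight double quotes, which is the same assumption the string-search version makes.

```elisp
(require 'rx)

;; Hypothetical helper (not part of the thread's code): pull the first
;; double-quoted run of text out of a citation string.
(defun my/cite-title (cite-texts)
  "Return the first double-quoted substring of CITE-TEXTS, or nil if none."
  (when (string-match (rx "\"" (group (+ (not (any "\"")))) "\"")
                      cite-texts)
    ;; Group 1 is the text between the quotes, without the quotes.
    (match-string 1 cite-texts)))
```

Unlike the substring version, this returns nil instead of signalling an error when the citation contains no quoted title, so the caller can decide what to do with such entries.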
Regarding this:
And yes, I know that I could probably use a library like s, dash, seq, or cl, but I try to keep my elisp functions free of those kind of things
First of all, cl and seq are built into Emacs and are used in core Emacs code. There’s no reason not to use them. Second, dash and s are on ELPA and are widely used; it’s largely a matter of style, but they are solid libraries, so again, no reason not to use them. They don’t have cooties. ;)
I read a Reddit post saying that using cl-lib was kind of a bad thing, and I think I’ve always had a fear that using libraries in my config would just make it more bloated or slow Emacs down. But after all the comments here, I think I’ll change my stance on that.
You don’t have anything to guard against a bad response from the server, e.g.:
(unless (equal url-http-response-status 200)
  (error "Server responded with status: %S" url-http-response-status))
To position point at the end of the headers:
(goto-char url-http-end-of-headers)
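Putting those two suggestions together, a rough sketch might look like the following. The names my/response-buffer-dom and my/fetch-dom are hypothetical, and the code assumes an Emacs built with libxml2 and a response buffer produced by url-retrieve-synchronously (which is what sets url-http-response-status and url-http-end-of-headers buffer-locally).

```elisp
(require 'dom)

;; These are set buffer-locally by url-http in the response buffer;
;; declaring them here only silences byte-compiler warnings.
(defvar url-http-response-status)
(defvar url-http-end-of-headers)

;; Hypothetical helper: parse the current buffer as an HTTP response.
(defun my/response-buffer-dom ()
  "Return the parsed HTML body of the URL response in the current buffer.
Signal an error unless the server answered with status 200."
  (unless (equal url-http-response-status 200)
    (error "Server responded with status: %S" url-http-response-status))
  ;; Skip the headers instead of searching for the first empty line.
  (goto-char url-http-end-of-headers)
  (libxml-parse-html-region (point) (point-max)))

;; Hypothetical wrapper: fetch URL and clean up the response buffer.
(defun my/fetch-dom (url)
  "Fetch URL synchronously and return its parsed HTML as a DOM."
  (let ((buffer (url-retrieve-synchronously url)))
    (unless buffer
      (error "No response from %s" url))
    (unwind-protect
        (with-current-buffer buffer
          (my/response-buffer-dom))
      (kill-buffer buffer))))
```

With something like this, the original function could start from (my/fetch-dom (concat wikipedia-prefix-url subject)) and would no longer need the html-start binding.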
This:
(setq result (cons (cons ...) result))
is more clearly expressed as:
(push (cons ...) result)
Better yet, you could map over the elements you’re interested in and accumulate the results via mapcar or cl-loop. That would obviate the need for the “result” variable.
You could probably shorten things by using the dom-elements function to search directly for the hrefs you’re interested in, in combination with dom-parent to get at the parent elements.
Overall your function gets a 65 out of 130 ERU (elisp rating units).
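To illustrate the dom-elements and dom-parent idea together with cl-loop accumulation, here is a small sketch. The function name my/doi-references is made up, and pairing each DOI with the class of the link's immediate parent is only an assumption for the example; on a real Wikipedia page the cite element may sit further up the tree, so you might need to call dom-parent more than once.

```elisp
(require 'cl-lib)
(require 'dom)

;; Hypothetical helper: collect DOI links without a `result' variable.
(defun my/doi-references (dom)
  "Return a list of (DOI . PARENT-CLASS) pairs for DOI links in DOM."
  (cl-loop
   ;; `dom-elements' matches a regexp against the given attribute,
   ;; so this finds every node whose href starts with the DOI resolver.
   for a-tag in (dom-elements dom 'href "\\`https://doi\\.org")
   ;; `dom-parent' walks DOM to find the node that contains A-TAG.
   for parent = (dom-parent dom a-tag)
   collect (cons (concat "doi:" (dom-text a-tag))
                 (dom-attr parent 'class))))
```

Because cl-loop's collect builds the list in document order, there is no need to push onto an accumulator or reverse anything afterwards.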