Slimdown your HTML to Markdown


Ever needed to convert a basic wiki page to Markdown? Well, recently that was the situation I was in. With a desire to learn more Python, I decided to ignore great alternatives such as Markdownify, Python-Markdownify, and html2text, and instead opted to take a stab at writing a basic tool to do the job myself.


What is Slimdown?

slimdown@github

Slimdown was a weekend project, born from my need to convert some HTML documents to Markdown.

Whilst I could have used the established tools I listed above, since I had a free weekend and a yearning to learn more Python, I thought why not.

The goal was simple: hand the script an HTML file and get a Markdown file in return. But as we all know, things are never that simple...

Future Development

Currently, the version on GitHub is 1.2. It's not perfect, and there are a few minor annoyances to resolve, but at the end of the day it performs its purpose well. I'll revisit it from time to time to resolve outstanding issues and to add more functionality and use cases I'd like to support.

At this point, I feel it's stable and functional enough not to require too much time and attention. I'm working on one last update (refactoring, adding tests, and making the code more PEP 8 compliant), but it's nothing game-changing for functionality; it's mostly for my own sanity. Besides, the beauty of open source is that those who value it can give their time and expertise when I can't, right?

Problems Faced

One of the key problems I faced during the development of Slimdown was the reliance on Python's HTMLParser library.

The trouble was that feed() is the only public entry point to the library's functionality: you hand it HTML, and it invokes the handle_*() methods itself, in document order. Ultimately, this made parsing HTML hyperlinks into their Markdown equivalent less straightforward than expected.

Take the following example:

<a href="https://www.hooli.xyz">Link Text</a>

Although the library offers further handle_*() methods for parsing other components of an element, feed() will invoke whichever ones you override in your subclass, in order. In our case, let's say we override handle_starttag(), handle_data(), and handle_endtag(). This means the three parts of the hyperlink above are processed in that order.
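
This dispatch order is easy to see with a minimal HTMLParser subclass (a throwaway sketch, not Slimdown's actual code) that simply records which callback fires for each part of the link:

```python
from html.parser import HTMLParser

class LinkEventRecorder(HTMLParser):
    """Record the order in which feed() dispatches to our handlers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

parser = LinkEventRecorder()
parser.feed('<a href="https://www.hooli.xyz">Link Text</a>')
for event in parser.events:
    print(event)
```

Feeding the anchor tag above produces the start tag (with its attributes), then the data, then the end tag, in exactly that order, and there is no way to reorder the calls from outside.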

Let's see how the processing is done. The anchor tag and its href attribute are handled together by handle_starttag(), so that's where the Markdown translation starts. First, it writes out the Markdown equivalent, [], when it detects the <a> tag. Second, it retrieves and writes the href attribute. So our Markdown translation so far is:

[](https://www.hooli.xyz)

Cool. Along the right lines, no?

Here's the issue. Because feed() calls these handlers procedurally, in document order, and there appears to be no way to call them manually (they only receive their arguments as feed() digests the HTML), the link text is processed after the start tag has already been written. Meaning our Markdown link output ends up as:

[](https://www.hooli.xyz)Link Text
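
Putting the pieces together, a stripped-down translator (a hypothetical sketch, not Slimdown's actual implementation) reproduces exactly this broken output:

```python
from html.parser import HTMLParser

class NaiveMarkdownLink(HTMLParser):
    """Naive sketch of the problem: the whole link skeleton is written in
    handle_starttag(), so the link text can only be appended afterwards."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            self.out.append(f"[]({href})")  # brackets are empty here...

    def handle_data(self, data):
        self.out.append(data)  # ...because the text only arrives later

parser = NaiveMarkdownLink()
parser.feed('<a href="https://www.hooli.xyz">Link Text</a>')
print("".join(parser.out))  # [](https://www.hooli.xyz)Link Text
```

One could buffer the pending href in handle_starttag() and only emit the finished link in handle_endtag(), but that means carrying state across callbacks for every nested element; the post-processing approach below sidesteps that.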

While the rest of the program translates the markup into Markdown without issue, as we can see, this part of its core functionality fails. We could simply open the resulting Markdown file afterwards and manually copy and paste the link text inside the [], but imagine a user wanting to translate a document with 500 external links. In scenarios like this, even though most of the functionality is there, the most fundamental requirement of being fit for purpose is far from achieved.

A Solution

While I may one day come to see that the solution I've implemented is bogus or overly complicated, it is, for now, a solution nonetheless, and I'm quite pleased with it. It adds the overhead of rewriting the file, but it now handles these externally linking anchor tags correctly, and it does so using regular expressions.
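
To give a flavour of the approach, here's a minimal sketch of such a post-processing pass. The pattern and helper below are my own illustrative reconstruction, not the regex Slimdown actually uses, and the "text runs to the end of the line" heuristic is a simplification:

```python
import re

# Match an empty Markdown link skeleton followed by the stray link text.
# Heuristic: treat everything up to the next bracket or newline as the text.
BROKEN_LINK = re.compile(r"\[\]\((?P<href>[^)]+)\)(?P<text>[^\n\[]+)")

def fix_links(markdown: str) -> str:
    """Move trailing link text back inside the empty brackets."""
    return BROKEN_LINK.sub(
        lambda m: f"[{m.group('text')}]({m.group('href')})", markdown
    )

print(fix_links("[](https://www.hooli.xyz)Link Text"))
# [Link Text](https://www.hooli.xyz)
```

Run over the generated file as a second pass, this turns the half-built links from the parsing stage into well-formed Markdown links.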

It was the first time I'd properly delved into regex with a real-world problem to solve and I'm rather pleased with the outcome. Again, in time, I may well come to see flaws in it or find a way around the interface constraints. However, for now, it was good fun and seems to do the job!

I think I'll leave the dissection of the regex I designed for another post where I can get into detail about them, as I'm still learning them myself, and I find them exciting and powerful.

No Use Crying Over Spilt Milk

Better to clean it up and be more careful not to knock it over again, in my view. While I've long made the effort to manage and be wary of dependencies, this one escaped me: it was down to the constraints of the library's interface rather than concerns about potential breaking changes, tight coupling, or dependency hell. This was an implementation oversight rather than a design oversight, I think.


Ultimately, the lesson learned here was to be more conscious and aware of dependencies on other libraries and services. While the possibility of having the implementation or interface for this library changed under my feet is fairly low, the more prominent issue turned out to be using the library before doing a little more upfront planning and investigation.

While I had experimented with a prototype to evaluate whether it was fit for purpose in this use case, the oversight was in the constraints of the interface to the library's functionality and the effect this had on my overall architecture.

It's a tough balance to strike: how much planning and investigation is enough? However much it is for a given problem, set of constraints, or developer, I can admit there wasn't quite enough this time. On the bright side, after the annoyance this caused, it'll certainly be in the back of my mind from here on in.