This page shows the source for this entry, with WebCore formatting language tags and attributes highlighted.

Title

Earthli now uses the UTF8 encoding 🍾

Description

Earthli started out in 1997 on an Apache 1.2.x server running on a Windows desktop in my office cubicle in New York. I set it up at the time with PHP 2.x and MySQL 3.x. <h>History</h> The default encoding at the time was to use the <c>latin1</c> character set, which mapped to <c>Windows-1252</c> and roughly corresponds to <c>ANSI</c>, which is somewhat standardized, but is platform-dependent for some characters (hence the moniker). Over the years, I migrated to PHP 3.x, then 4.x. The migration to 5.x was a lot of work because they changed the entire reference/copy semantics. From there, though the upgrade to 7.2.x was relatively easy. The jump to 8.x shouldn't pose too many issues, either.<fn> The move through MySql versions to the current 5.7.33 and Apache versions to the current 2.4.29, all running on Ubuntu 18.04 was more or less without incident. <h>Fixing the Encoding</h> One last vestige of that original setup was the encoding. I'd opened an issue called <a href="https://www.earthli.com/projects/view_job.php?id=2084">Add support for UTF8 as the default encoding</a> almost 8 years ago, but never found the time or inspiration to actually make the change. I did end up making some workarounds for missing characters, but it was always less work to just add another one. See the section on "proprietary markup" below. You have to be careful when changing encodings, making the sure that the tool that makes the conversion is aware of both the source and target encodings. You have to be dead-certain of the source encoding or the conversion will not succeed. You'll end up with garbled characters. The earthli database isn't large by any enterprise standards, but it does contain almost 27k pictures and almost 4k articles<fn>, some of them quite long. There is a lot of text, created with various browsers and tools over almost a quarter of a century. A search online for tips on converting to UTF-8 turned up a lot of useless advice for older versions or based on an inadequate understanding of encodings or the available tools. Luckily, my database dump was already in UTF-8, but it contained SQL commands to use the <c>latin1</c> collation instead of <c>utf8</c> or, to be more precise, <c>utf8mb4</c> (which is the version of UTF8 that supports 4-byte characters, e.g. 🧐). With <c>utf8</c>, you can store anything you want in MySql other than emojis. It's unclear why they have two formats, but MySql has always had its idiosyncrasies.<fn> <h>Converting <a href="{app}text_formatting.php">earthli's proprietary markup</a></h> The most fiddly bit I had to deal with was actually optional and involved converting some of the ad-hoc character combinations that I'd added to my markup language in order to use characters not supported in the <c>latin1</c> encoding (e.g. I would write Slavoj Žižek as "Slavoj (Z-)i(z-)ek). I used <c>sed</c> to replace these character combinations. Obviously, no-one else uses these, but for the sake of posterity---and to serve as an example for other replacements---here they are: <code> sed \ -e 's@(a_)@ā@g' \ -e 's@(c,)@ç@g' \ -e 's@(C,)@Ç@g' \ -e 's@(c-)@č@g' \ -e 's@(C-)@Č@g' \ -e 's@(g-)@ğ@g' \ -e 's@(i-)@ı@g' \ -e 's@(I.)@İ@g' \ -e 's@(l-)@ł@g' \ -e 's@(L-)@Ł@g' \ -e 's@(s,)@ş@g' \ -e 's@(S,)@Ş@g' \ -e 's@(s-)@š@g' \ -e 's@(S-)@Š@g' \ -e 's@(u-)@ū@g' \ -e 's@(z-)@ž@g' \ -e 's@(Z-)@Ž@g' \ -e "s@(a\\\')@á@g" \ -e "s@(A\\\')@Á@g" \ -e "s@(C\\\')@Ć@g" \ -e "s@(n\\\')@ń@g" \ -e "s@(N\\\')@Ń@g" \ -e "s@(o\\\')@ő@g" \ -e "s@(O\\\')@Ő@g" \ earthli.sql > earthli_utf8.sql </code> I used @ as the separator character and had to escape the backslash twice (once for <c>sed</c> and once for <c>bash</c>). Also, you have to use a different output file because <c>sed</c> truncates the output before it does anything. If you use the same file, then you just end up with an empty file. Neat. It's not super-efficient, but it was done in a few seconds. A bonus to doing these replacements for me is that a full-text search for "Zizek" or "Žižek" now finds all articles where I mention the Slovenian philosopher. That didn't work before because MySql was indexing "(Z-)i(z-)ek" instead. <h>Working with the dump file</h> If you need to open the dump file, be aware that the lines are <i>very long</i>. <c>vim</c> does a good job of searching and editing and jumping to locations (e.g. <c>+normal 15G25|</c> jumps to line 15, column 25. <c>nano</c> can also find text (<kbd>(-cmd)</kbd> + <kbd>W</kbd>) pretty well and quickly. Both edit the text without a problem, once you've found the location you're interested in. Desktop editors (e.g. <i>Visual Studio Code</i> or <i>Sublime Text</i>) and differs (e.g. <i>BeyondCompare</i>) were mostly overwhelmed by both the file size and the line lengths. Luckily, I only ended up needing to make one edit to avoid an error creating an index because the UTF-8 encoding considered "bugin" and "bügin" to be equivalent. <h>Commands</h> I made most of the following changes from the command line, but made one change using <i>PHPMyAdmin</i>. Here's what I ended up doing: <ol> <div><p>Dump the current database. MySQL dumps to UTF-8 by default and converts all text. <code><macro convert="-punctuation"> mysqldump --user=earthli -p --add-drop-table earthli > earthli.sql <macro convert="+punctuation"></code></p></div> <div><p>Verify that the dump file is in UTF8 format. If it's not, then you can use <c>iconv</c> to change the encoding (example from <a href="https://en.wikipedia.org/wiki/Iconv">Wikipedia</a>): <code>iconv -f iso-8859-1 -t utf-8 <infile> -o <outfile></code></p></div> <div><p>Search/replace the character set for each table with the following command: <code> sed -e 's@CHARSET=latin1@CHARSET=utf8mb4@g' earthli.sql > earthli_utf8.sql </code></p></div> Use <i>PHPMyAdmin</i> to change the default encoding for the database to <div><c>utf8mb4</c> in the <i>Operations</i> pane for the database.</div> <div>Import the database. <code><macro convert="-punctuation"> cat earthli_utf8.sql | mysql --user=earthli -p earthli <macro convert="+punctuation"></code></div> </ol> In PHP and the configuration, I made the following changes: <ol> Call <c>mysqli_set_charset ($this->_connection, 'utf8mb4');</c> after opening the connection to the database Change the encoding in all generated pages by including the tag <c>< meta charset="utf-8"></c> Change the default charset in the Apache config files <c>php_value default_charset UTF-8</c> (it's possible that this is already the default by now) </ol> <h>Conclusion</h> It took a bunch of research and preparation and nerves to dump, globally modify, and re-import a database that contains the last quarter-century of my writing. In the end, though, it wasn't even that much work and it went smoothly. As always with encodings, it serves you well to understand <i>exactly</i> what you're doing---it often saves a lot of steps. And, now, because I can: ✊🏼. <hr> <ft>The <c>zip</c> functions will be removed in favor the object-based API.</ft> <ft>2.6k of which are published ... I write a lot of drafts that I never end up publishing. Some of them are quite long, as well and serve as notes for myself.</ft> <ft>It's probably a legacy thing or a desire to provide an option that uses one less byte if you know you're not going to want emojis?</ft>