The Great Blog Migration

richard edited this page Jun 4, 2017 · 5 revisions

Migration In Detail

1 - get old blog

The old blog is a Movable Type install. I found that the following wget command crawls the site and brings down all the blog entries, thumbnails, and JPGs.

wget -p -P rcb --convert-links -m -nH http://www.richardcampbell.com/blog/

So now I have all the files safe. DONE

Notes

  • 199 main HTML blog entries
  • main entries are named 000NNN.html

2 - get new file names

Each existing MT HTML file has a boilerplate header, scripts, and other markup before the blog entry itself. Each entry has a "title" H3 element and a "posted by" element with the date of the post.

As Jekyll requires the posts to have a YYYY-MM-DD-title-text.html format, I need to change 000NNN.html to the new format.

I use good old awk to find the H3 element and the "posted by" element and build a new filename from them. Tweaks: the occasional special character ('/', "'", '*', '#', '"', '!', '?', '&') has to be dealt with. I don't believe the title part of the Jekyll file name is important; it's just used to make the file unique, so I'll just delete those characters. DONE
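For illustration, here is roughly what that renaming step looks like in Python rather than awk. It is a sketch, not the script actually used: the H3 and "Posted by ... on Month D, YYYY" patterns are guesses at the Movable Type markup, and `jekyll_filename` is a made-up name.

```python
import re
from datetime import datetime
from html import unescape

# Hypothetical patterns: the real Movable Type template may differ.
TITLE_RE = re.compile(r"<h3[^>]*>(.*?)</h3>", re.IGNORECASE | re.DOTALL)
POSTED_RE = re.compile(r"Posted by .*? on (\w+ \d{1,2}, \d{4})", re.IGNORECASE)

# The special characters listed above; they are deleted, not replaced.
SPECIAL_CHARS = "/'*#\"!?&"


def jekyll_filename(page: str) -> str:
    """Build a YYYY-MM-DD-title-text.html name from one old entry."""
    title_match = TITLE_RE.search(page)
    posted_match = POSTED_RE.search(page)
    if not title_match or not posted_match:
        raise ValueError("title or posted-by element not found")

    # Drop any tags nested inside the H3 and decode entities like &amp;.
    title = unescape(re.sub(r"<[^>]+>", "", title_match.group(1))).strip()
    date = datetime.strptime(posted_match.group(1), "%B %d, %Y")

    cleaned = title.translate({ord(c): None for c in SPECIAL_CHARS})
    slug = "-".join(cleaned.lower().split())
    return f"{date:%Y-%m-%d}-{slug}.html"
```

Lowercasing and joining words with hyphens is an extra choice made here for tidy names. Entries that share a date and title would still collide and need a manual check.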

3 - create first draft of markdown post

Some more info on Jekyll format at https://jekyllrb.com/docs/posts/

I've got to add YAML front matter. The same awk script that puts together the new HTML filename can also keep the original title handy and figure out the exact post time.
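A minimal sketch of the front matter step, again in Python rather than awk. The `layout: post` value is an assumption (it depends on the theme), and `front_matter` is a hypothetical helper; `title` and `date` are standard Jekyll front matter keys.

```python
from datetime import datetime


def front_matter(title: str, posted: datetime) -> str:
    """Return a YAML front matter block for one post."""
    # Double-quote the title and escape backslashes and quotes so that
    # colons, '#', and similar characters do not confuse the YAML parser.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "---\n"
        "layout: post\n"  # assumed layout name
        f'title: "{escaped}"\n'
        f"date: {posted:%Y-%m-%d %H:%M:%S}\n"
        "---\n"
    )
```

The returned block goes at the very top of each converted post, ahead of the markdown body.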

There's a well-known html2text.py Python script that produces good-enough markdown.

Tweaks: there are still some odd characters, such as 0xa0, to take out with a simple sed script.

Ack, some more odd hex chars, 0x92, 0x85... I've taken to editing a couple of files by hand, since I have to check the original anyway to see what MT did with my text. DONE
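Those byte values are consistent with Windows-1252 text, where 0xa0 is a non-breaking space, 0x92 a curly apostrophe, and 0x85 an ellipsis. Assuming that is the encoding (not verified against the actual files), a Python sketch of the cleanup could look like this. It stands in for the sed script, and `clean_bytes` is a made-up name.

```python
# Assumption: the stray bytes are Windows-1252. After decoding, the three
# characters seen so far are mapped to plain ASCII equivalents.
REPLACEMENTS = {
    "\u00a0": " ",    # 0xa0, non-breaking space
    "\u2019": "'",    # 0x92, right single quotation mark
    "\u2026": "...",  # 0x85, horizontal ellipsis
}


def clean_bytes(raw: bytes) -> str:
    """Decode raw file bytes and replace the known odd characters."""
    # Bytes with no Windows-1252 meaning become U+FFFD instead of raising.
    text = raw.decode("cp1252", errors="replace")
    for odd, plain in REPLACEMENTS.items():
        text = text.replace(odd, plain)
    return text
```

Any other odd character passes through unchanged, so it still shows up when checking the converted file against the original.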

4 - draft 2 - fix image URLs

This includes the popup full-size image HTML fragments and their images.

5 - draft 3 - fix categories