instructions for custom dataset #23

Open · SB2020-eye opened this issue Oct 12, 2021 · 8 comments
@SB2020-eye
Contributor

Hi. How would I run this on one person's handwriting? I have images of hundreds of pages of this person's handwriting, but I don't understand what to do. (Sorry!) For example, how do I make a dataset? Then I need to train on that dataset, right? And then how do I get it to write out the text I want in that handwriting style?

Also, if the original writing is in Latin, do I need to add a Latin word dataset (text file)?

Thanks so much!

@herobd
Owner

herobd commented Oct 13, 2021

You just need to define a custom dataset: copy datasets/author_hw_dataset.py and fit it to your data.
Now there is a problem in that my method is focused on extracting a style vector from example images and using this to generate. In your case this is not needed, as the "style" should always be the same. To do it properly you'd want to rewrite things. Perhaps learn a single (randomly initialized) style vector (as part of the model)? This means you'd throw out the style extractor as well (as it isn't needed). A fair amount would need to be adjusted in the trainer (trainer/hw_with_style_trainer.py) to accommodate this, although it's mostly removing things. I'm not sure whether you'd want the reconstruction loss or not.
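Roughly, a learned single-author style vector would look something like this minimal sketch (the class name and style_dim are placeholders, not the repo's actual interface):

```python
import torch
import torch.nn as nn

class SingleAuthorStyle(nn.Module):
    """Replaces the style extractor with one learned vector.

    style_dim is a placeholder; it should match whatever the
    generator expects as its style input.
    """
    def __init__(self, style_dim=256):
        super().__init__()
        # Randomly initialized, optimized jointly with the generator.
        self.style = nn.Parameter(torch.randn(style_dim))

    def forward(self, batch_size):
        # Every sample in the batch shares the same author's style.
        return self.style.unsqueeze(0).expand(batch_size, -1)
```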

If you don't want to go to the work of adjusting everything, training on a custom dataset and then just extracting the mean style vector from the training data should work.

As far as the text file, it should match what you want the model to produce (do you want it to produce Latin?). Its character set, of course, needs to match the one used throughout the model.
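A quick way to check that coverage (file names and the plain-text char-file format here are just placeholders; use whatever your char_file actually contains):

```python
# Every character you want generated must appear in the model's char set.
with open('data/my_char_set.txt', encoding='utf-8') as f:
    known = set(f.read())

with open('data/latin_text.txt', encoding='utf-8') as f:
    wanted = set(f.read())

missing = wanted - known - {'\n', ' '}
if missing:
    print('Characters the model cannot produce:', sorted(missing))
```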

@SB2020-eye
Contributor Author

Thank you so much, @herobd !

I'm afraid I require a little (or perhaps a lot of) extra hand-holding. 👶

  1. Dataset: How do I define a custom dataset? I have a bunch of images of full pages of writing from a manuscript, plus a little extra around the edges of the background against which the picture was taken. Should I cut out a big rectangle of just the text on the page, or each line, or each word, or something else? And the page itself is not white; it's more off-white or beige. Does that need to be removed, so it's just the ink (on white)? Are there any particular dimensions of images that are acceptable vs. unacceptable? Then do I just stick all those images in a folder? Are any annotations or typed-in transcriptions of the text on the page needed in any way?

  2. datasets/author_hw_dataset.py: How do I fit this to my data? (I looked at it, but it wasn't apparent what I would be fitting.)

  3. Rewriting things: While this is the way I'd prefer to go, I am rather sure I do not have the know-how to do this.

  4. Training a custom dataset: Do I follow the instructions found in "Reproducability instructions" in the README.md to do this? If so, I see everything pointing to a .json file. What .json file should I be pointing to?

  5. Extracting the mean style vector: How do I do this?

  6. Text file: This I think I understand. I think you're saying that the original language of the dataset doesn't matter as long as the character set (glyphs, letters of the alphabet) is the same between languages. If I want the model to spit out "Hello world" in English, I just need the txt file you already have in the repo. Is that correct? And if so, do I need to do anything anywhere in this whole process to reference that file? (How do I make it "know" I want English?)

  7. CPU: If I change the config file to read "gpu": -1, will it work (on only CPU)? (If not, I'll try it on Google Colab.)

  8. config files: Apart from GPU/CPU, what changes do I need to make in the config files?

Thanks in advance for any 🍼 you're willing to offer me.

@herobd
Owner

herobd commented Oct 14, 2021

--1. & 2. This is expecting the data to be cropped at the line level. (There are better handwriting synthesis methods than mine, but they all expect words.) You'll need to find the bounding box for each text line, or crop each line into a separate image, and then save the associated text for that line. You can put this in whatever format is easy for you to create and read in with your dataset code. The dataset handles resizing the images (to a height of 64), so you don't need to do that.
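For reference, that resizing amounts to scaling to a fixed height while preserving aspect ratio, roughly like this (the dataset code already does the equivalent, so this is just illustrative; the path is a placeholder):

```python
import cv2

def resize_to_height(img, target_h=64):
    """Scale a line image to a fixed height, keeping aspect ratio."""
    h, w = img.shape[:2]
    new_w = max(1, round(w * target_h / h))
    return cv2.resize(img, (new_w, target_h), interpolation=cv2.INTER_AREA)

line = cv2.imread('lines/page01_line03.png', cv2.IMREAD_GRAYSCALE)
line = resize_to_height(line)
```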

The different-colored background should be fine, although you may want to set no_bg_loss in the config to false. It uses a foreground mask extracted by binarizing the image, which may not turn out as well.
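If you want to eyeball how well binarization separates the ink from your beige page, a quick check with Otsu thresholding (a standard method; not necessarily the exact one the repo uses) is:

```python
import cv2

img = cv2.imread('lines/page01_line03.png', cv2.IMREAD_GRAYSCALE)
# Otsu picks a global threshold automatically; ink becomes white (255).
_, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
cv2.imwrite('mask_preview.png', mask)
```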

A custom dataset is just a new Dataset object. The easiest way is to copy datasets/author_hw_dataset.py and change it ("fit it to your data"). You should only need to change the __init__ function and then a couple of things in __getitem__. What you are changing is how the data is read in. Alternatively, you can change the format of your dataset to match the IAM (or RIMES) dataset and then datasets/author_hw_dataset.py/TODO will work fine. I personally think changing the code that parses the data is easier.

__init__ reads in what lines are available and indexes information to be retrieved in __getitem__. I do this with the set_list, which is an index of the image names (which have corresponding XMLs). You could have a text file with the paths to the cropped line images and their associated text, read this in, and store it in some list: self.images = [(image1_path, image1_text), (image2_path, image2_text), ...]. Lines 136-231 need to be rewritten (or mostly removed). The mask-related things can be removed if you're setting no_bg_loss to false. I think everything else stays the same.
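A stripped-down __init__ along those lines could look like this (the class name and the tab-separated index file format are assumptions; the real class in the repo takes more config than this):

```python
import os
from torch.utils.data import Dataset

class MyAuthorDataset(Dataset):  # hypothetical name
    def __init__(self, dirPath, split, config):
        # Index file (an assumption): one "image_path<TAB>transcription"
        # per line, e.g. train.tsv in the dataset directory.
        self.dirPath = dirPath
        self.a_batch_size = config.get('a_batch_size', 1)
        self.images = []
        with open(os.path.join(dirPath, split + '.tsv'), encoding='utf-8') as f:
            for line in f:
                path, text = line.rstrip('\n').split('\t', 1)
                self.images.append((os.path.join(dirPath, path), text))

    def __len__(self):
        return len(self.images)
```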

__getitem__ takes an index for the line it is to return. You need to load the line image, either from a separate cropped image or by reading the whole page image and cropping it down. Throw out the triplet stuff. It's going to be building a set of lines by the same author; you have one author, so I'd just give the next self.batch_size lines after the given index. Line 365 is where it gets the image path and ground truth text for the line (however you've stored that). You'll see I'm cropping from the original image (lb is the bounding box), but it might be easier just to have the cropped images separate.
I think you can also remove the call to makeMask and set its results to None or something. They shouldn't be used in the current training configuration.
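Continuing the class sketch above, the loading/cropping logic boils down to something like this (the real __getitem__ returns a dict with more fields; the normalization to [-1, 1] is an assumption):

```python
import cv2
import numpy as np
import torch

# Method body for the MyAuthorDataset class sketched above.
def __getitem__(self, index):
    # One author: return the next a_batch_size consecutive lines so the
    # style is computed over a whole batch of the same hand.
    end = min(index + self.a_batch_size, len(self.images))
    batch = []
    for i in range(index, end):
        path, text = self.images[i]
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        h, w = img.shape
        img = cv2.resize(img, (max(1, round(w * 64 / h)), 64))
        img = img.astype(np.float32) / 128.0 - 1.0  # roughly [-1, 1]
        batch.append((torch.from_numpy(img)[None], text))  # add channel dim
    return batch
```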

You'll also need to add your dataset to data_loader/data_loader.py. Just copy the author_hw_dataset things and put in your name.
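That registration is essentially another branch on the dataset name from the config; a sketch (all names hypothetical, and the actual file's structure differs):

```python
def getDataset(config, split):
    name = config['data_loader']['data_set_name']
    if name == 'MyAuthorDataset':
        from datasets import my_author_dataset  # your new module
        return my_author_dataset.MyAuthorDataset(
            dirPath=config['data_loader']['data_dir'],
            split=split,
            config=config['data_loader'])
    raise ValueError('unknown dataset: ' + name)
```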

Be sure to also copy datasets/testauthor_hw_dataset.py to test your dataset. This will run through the dataset and display the images and ground truth text so you can be sure everything is looking right.
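Or, if you'd rather sanity-check with a quick loop of your own (this assumes __getitem__ returns (image_tensor, text) pairs as in the sketch above, and that `dataset` is an instance of your new class):

```python
import cv2

for idx in range(5):
    for img, text in dataset[idx]:
        print(repr(text), tuple(img.shape))
        preview = ((img[0].numpy() + 1.0) * 128).clip(0, 255).astype('uint8')
        cv2.imshow('line', preview)
        cv2.waitKey(0)  # press any key to advance
```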

--3. Yeah, I didn't have time to clean up the code as much as I would have liked, so it is a bit spaghetti-like.

--4., 7. & 8. Yes, although you'll make your own cf_*.jsons. Just copy the ones in the instructions, but make a few changes:

  • Change the data_loader to have your dataset name and point it to the dataset directory.
  • For handwriting recognition and encoder training you'll want a_batch_size to be 1.
  • For the generator training, set the batch size to 1 and set a_batch_size (author batch size) to the actual batch size. This last one is to extract the style vector from the entire batch, which is correct since it is all one author.
  • Set cuda to false if you aren't using a GPU (although training will take a long time...).
  • You may want to define your own char_file to match the characters in your dataset. You can just edit the one I have. Any characters not in this file are invisible to the model.
  • You'll need to adjust all the models' num_class to be the number of characters in your char_file + 1 (for the blank token).
  • As auto-encoding was intended to help learn the style extractor (which is not really needed in your case), I'd experiment with changing the generator training curriculum to have some extra ["gen"] / ["disc"] steps or something (the "auto" steps are the auto-encoder training). A sketch of the relevant config fields follows this list.
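Pulling those together, the single-author fragment of a generator config might look roughly like this (field placement and values are assumptions for illustration; num_class here assumes 79 characters in the char file plus the blank token):

```json
{
  "data_loader": {
    "data_set_name": "MyAuthorDataset",
    "data_dir": "data/my_manuscript",
    "batch_size": 1,
    "a_batch_size": 8,
    "char_file": "data/my_char_set.json"
  },
  "cuda": false,
  "model": {
    "num_class": 80
  }
}
```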

--5. get_styles.py extracts the style vector for each image. You can just average all of them or something.
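The averaging itself is one line once you have the per-image vectors; assuming they end up as an (N, style_dim) NumPy array (the path and file format here are assumptions):

```python
import numpy as np

styles = np.load('saved/styles.npy')   # hypothetical output of get_styles.py
mean_style = styles.mean(axis=0)       # one vector for your single author
np.save('saved/mean_style.npy', mean_style)
```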

--6. Yes, you are correct. In the generator config it's set in trainer: text_data.

@SB2020-eye
Contributor Author

Many, many thanks!

I'll come back around once I manage to give all this a shot and let you know how I fare.

@cuonghv0298

Hi sir, in training the autoencoder, why did you also train the hwr when we had already trained another hwr? What happens if we just train the encoder network?

@herobd
Owner

herobd commented Dec 5, 2021

It should work fine to just train a normal autoencoder. The thought was that including hwr as part of the encoder's task would force it to learn character features. Including hwr in the encoder does improve performance, but I don't remember how much. It was probably fairly small.

@cuonghv0298

Yup, thanks sir. There are many good ideas from you. I will try training both once I handle the NaN issue in the hwr model of the encoder's task, and benchmark them again.

@Candyman2222

Did you make any progress, @SB2020-eye ?
