Hi,
This post is a short overview of a work project where I trained a language model for invoices. This so-called base model is then fine-tuned for text classification on customer data. Due to data privacy, a non-disclosure agreement, ISO 27001 and SOC 2, I’m not allowed to publish any results. Believe me, it works like 🚀✨🪐.
A language model is trained on large amounts of textual data to understand the patterns and structure of language. The primary goal of a language model is to predict the probability of the next word or sequence of words in a sentence given the previous words.
Language models can be used for a variety of natural language processing (NLP) tasks, such as text classification, machine translation, text summarization, speech recognition, and sentiment analysis. There are many types of language models, ranging from simple n-gram models to more complex neural network-based models such as recurrent neural networks (RNNs) and transformers.
The transformer architecture is currently the most widely used for language models and, depending on the specific task, can be divided into an encoder and/or decoder architecture. In general, transformers are trained on a large quantity of unlabeled text using self-supervised learning. Training a transformer model on that much data takes a lot of computational effort, and training language models can get expensive very quickly. So, often the best way to get a task-specific transformer model is to take a pre-trained model from Hugging Face and fine-tune it on your data.
Based on my work experience with invoices, fine-tuning a pre-existing model didn’t work well. I got the best results for text classification after fine-tuning a French base model on German invoices. Nevertheless, the overall F1-score wasn’t worth the effort. I assume that the content and structure of an invoice differ too much from the original training data (e.g. no continuous text and many numbers). Additionally, the tokenizers of the pre-trained models are not optimized for invoices, so the context window of a transformer contains less text, which makes the training less effective.
I worked on text classification of invoices for multiple clients. I trained a base model on a few million invoices (mostly German and English) and fine-tuned the base model for each client with around 2,000 to 50,000 invoices and 70 to 2,000 labels. Initially I used the Longformer architecture (Beltagy et al. 2020), but a bug prevented the model deployment. Despite its limitations, I used the BERT architecture (Devlin et al. 2019). Hugging Face also provides a tutorial for training language models.
Tokenizer
A tokenizer converts raw text into smaller units, such as words or subwords, that can be used for training machine learning models. The tokenizer takes a string of text as input and outputs a sequence of tokens, each of which represents a distinct unit of meaning. A subword tokenizer breaks words down into smaller subword units. This is useful for handling out-of-vocabulary (OOV) words, i.e. words that are not in the tokenizer's vocabulary.
The Byte-Pair Encoding (BPE) tokenizer iteratively replaces the most common pair of consecutive bytes with a byte that does not occur in that data (Gage 1994, Sennrich et al. 2016).
First, we define our BPE tokenizer with the preprocessing steps for the incoming text data. As normalization, we use Unicode normalization and lowercase the text. Further preprocessing steps are a ByteLevel representation of the text, followed by splitting the text on whitespace. As a last step, a decoder maps a tokenized input back to the original text.
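A minimal sketch of this setup with the Hugging Face tokenizers library; the NFKC normalization form and the special token name are assumptions, not the exact production choices:

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalization: Unicode normalization (NFKC assumed) and lowercasing
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]
)

# Pre-tokenization: byte-level representation, then split on whitespace
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.ByteLevel(add_prefix_space=False), pre_tokenizers.Whitespace()]
)

# Decoder maps byte-level tokens back to the original string
tokenizer.decoder = decoders.ByteLevel()
```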
We define the vocabulary size of the tokenizer, add the special tokens and define the initial alphabet. The provided batch iterator trains the tokenizer from our streaming data.
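A sketch of that training step; the vocabulary size, the special tokens, and the streaming `train_dataset` with a "text" column are assumptions:

```python
from tokenizers import pre_tokenizers
from tokenizers.trainers import BpeTrainer

# Vocabulary size and special tokens are placeholders, not the production values
trainer = BpeTrainer(
    vocab_size=30_522,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

def batch_iterator(dataset, batch_size=1_000):
    """Yield batches of raw invoice text from a streaming dataset."""
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# `train_dataset` is assumed to be a streaming dataset with a "text" column
tokenizer.train_from_iterator(batch_iterator(train_dataset), trainer=trainer)
tokenizer.save("invoice-tokenizer.json")
```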
Here is an example of the tokenizer output:
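For illustration only; the invoice line is made up and the resulting splits depend on the trained vocabulary:

```python
encoding = tokenizer.encode("Rechnung Nr. 2021-042, Betrag: 119,00 EUR")
print(encoding.tokens)                 # e.g. ['rechnung', 'nr', '.', '2021', ...]
print(encoding.ids)                    # the corresponding token ids
print(tokenizer.decode(encoding.ids))  # maps back to the (lowercased) text
```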
Data pipeline
The training data is stored in multiple parquet files and split into a training and an evaluation dataset in a preprocessing step. I used a train-test split of 0.01. Since the data doesn’t fit into memory, it is streamed from disk. The text is padded or truncated to the defined context length. The data collator for masked language modeling masks the incoming text data for the model training.
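A sketch of this pipeline with the datasets and transformers libraries; the file patterns and the fast-tokenizer wrapper are assumptions:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast

CONTEXT_LENGTH = 512

# Wrap the trained tokenizer for use with the transformers library
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="invoice-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Stream the parquet files from disk instead of loading them into memory
dataset = load_dataset(
    "parquet",
    data_files={"train": "train/*.parquet", "eval": "eval/*.parquet"},
    streaming=True,
)

def tokenize(batch):
    # Pad or truncate every invoice to the context length
    return hf_tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=CONTEXT_LENGTH,
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 20 % of the tokens for masked language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=hf_tokenizer, mlm=True, mlm_probability=0.2
)
```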
Model Training
So far, the data is processed, the data streaming is set up, and a tokenizer is trained. Finally, the model training can start. I follow the BERT architecture (Devlin et al. 2019) and use their initial setup and hyperparameters. The model is trained via masked language modelling, where 20 % of the tokens are randomly selected for masking. Of those selected tokens, 80 % are replaced with the mask token, 10 % are replaced with random tokens, and 10 % keep their original token. Hugging Face provides an implementation for it. Wettig et al. 2023 scrutinized the impact of the MLM parameters on the model results.
Here is an example that shows some randomly masked tokens from an incoming text. The model is trained to predict the masked tokens based on the context of the whole sentence.
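A sketch of what the collator produces for a single tokenized invoice; the variable names are illustrative and the masked positions change on every call:

```python
# One tokenized invoice from the streaming training set
sample = next(iter(tokenized["train"]))
batch = data_collator([sample])

input_ids = batch["input_ids"][0]
labels = batch["labels"][0]

# The masked input the model sees, e.g. "... rechnung [MASK] 2021 ..."
print(hf_tokenizer.decode(input_ids))

# Labels are -100 everywhere except at the selected positions, where they
# hold the original token ids the model has to predict
masked_positions = (labels != -100).nonzero(as_tuple=True)[0]
print(hf_tokenizer.decode(labels[masked_positions]))
```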
I’m not a big fan of using too many libraries, but I didn’t have enough time to set up the BERT model in plain PyTorch. I took the happy dependency path and used the Transformers library. Probably, I will write another post describing the transition from the Transformers library to plain PyTorch.
I use the standard BERT configuration with eight attention layers and eight attention heads per layer. A context size of 512 truncates many invoices, but some experiments indicate that the overall effect on model performance is negligible. To understand the attention mechanism better, please see my short blog post.
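A sketch of the configuration with the transformers library; the hidden and intermediate sizes follow BERT base and are assumptions that land roughly at the reported parameter count:

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=hf_tokenizer.vocab_size,
    hidden_size=768,                 # assumption, BERT-base default
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=3072,          # assumption, BERT-base default
    max_position_embeddings=512,
    pad_token_id=hf_tokenizer.pad_token_id,
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```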
The model has 82 million parameters. Depending on the data size and GPUs, it trains in less than 1.5 weeks on 4x T4 GPUs. The model trains for five epochs with the AdamW optimizer (Loshchilov & Hutter 2019), using the learning rate and weight decay published in the BERT paper. The batch size is optimized for maximum utilization of the GPU memory, and gradient accumulation updates the model weights with an effective batch size of 64. To speed up training, we use fp16.
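A sketch of the corresponding Trainer setup; the step counts, per-device batch size, accumulation steps, and output path are placeholders chosen to match the description above:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="invoice-bert",
    max_steps=500_000,               # streaming datasets have no length, so the
                                     # Trainer needs max_steps instead of epochs
    learning_rate=1e-4,              # BERT pre-training learning rate
    weight_decay=0.01,               # BERT weight decay
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,   # effective batch size of 64 per device
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=10_000,
    save_steps=10_000,
    logging_steps=1_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    data_collator=data_collator,
    tokenizer=hf_tokenizer,
)
```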
Finally, everything is set up, and we can train our model. Depending on the data, model, and budget size, you can enjoy your holidays, and hopefully the model training is finished when you come back.
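The launch itself is a one-liner; the output path is a placeholder:

```python
# Launch the pre-training run; checkpoints are written to the output_dir above
trainer.train()
trainer.save_model("invoice-bert/final")
hf_tokenizer.save_pretrained("invoice-bert/final")
```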
As a final step, we can evaluate the model output. Since I can’t share any data, I use the output from my Kaggle notebook. For example:
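Since I can’t reproduce that notebook here, the snippet below only sketches how such a qualitative check could look with the fill-mask pipeline; the model path and the masked invoice line are placeholders, and the predictions depend entirely on the trained model:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask", model="invoice-bert/final", tokenizer="invoice-bert/final"
)
for prediction in fill_mask("rechnungsbetrag : 119 , 00 [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 3))
```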
Fine-tuning
For fine-tuning the language model, you can reuse the script above. The pre-trained model weights can be loaded into a classification model: BertForSequenceClassification only swaps the head from a masked-language-modeling head to a classification head, while all the other model weights stay the same. Also, the data collator has to be adapted, and we output some metrics for the evaluation. That’s all.
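A sketch of that adaptation, assuming tokenized, labeled client datasets (`client_train_dataset`, `client_eval_dataset`) and a placeholder label count:

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (
    BertForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Load the pre-trained weights; only the head is newly initialized
model = BertForSequenceClassification.from_pretrained(
    "invoice-bert/final", num_labels=70  # placeholder label count
)

# Classification uses plain padding instead of the MLM collator
collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, predictions, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="invoice-classifier", num_train_epochs=3, fp16=True
    ),
    train_dataset=client_train_dataset,  # assumed tokenized client data with labels
    eval_dataset=client_eval_dataset,
    data_collator=collator,
    tokenizer=hf_tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```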
Thank you for your attention.