A federated LM
Clone from the original source:
git clone git@github.com:google-research/electra.git
Then use the docker-compose files in the test folder to run the preprocessing and the other necessary steps.
First move them into the electra folder, then build and run.
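For example (a minimal sketch; the exact file names shipped in the test folder are assumed here):
# assumed contents of the test folder: a docker-compose.yaml and the Dockerfile it builds from
mv test/docker-compose.yaml test/Dockerfile electra/
cd electra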
Build the Docker image:
docker-compose -f docker-compose.yaml build
and run it interactively:
docker-compose run -u $(id -u):$(id -g) --rm electra bash
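The -u $(id -u):$(id -g) part runs the container as your host user, so files written to any mounted volumes end up owned by you rather than root.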
Create shards and put everything into TFRecord format:
python3 build_pretraining_dataset.py --corpus-dir ../data/${lang} --vocab-file ../data/vocab.${lang}.txt --output-dir ./data/ --max-seq-length 128 --num-processes 15 --blanks-separate-docs True --do-lower-case
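The command above assumes ${lang} is set and the data is laid out roughly like this (a minimal sketch; the ISO 639-1 codes da/no/sv are an assumption, not taken from these notes):
lang=sv                               # da on kb-labb-1, no on kb-labb-2, sv on kb-labb-3 (see the machine list below)
ls ../data/${lang}/                   # plain-text corpus files; blank lines separate documents (matches --blanks-separate-docs True)
head -n 5 ../data/vocab.${lang}.txt   # WordPiece vocabulary, one token per line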
Then run the pre-training:
python3 run_pretraining.py --data-dir data/ --model-name electra_small_${lang} --hparams '{"debug": false, "do_train": true, "do_eval": false, "vocab_size": 31000, "vocab_file": "vocab.'"${lang}"'.txt"}'
(Note the quoting: ${lang} has to be expanded by the shell, so it cannot sit inside the single-quoted JSON.)
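Checkpoints and event logs should end up under data/models/electra_small_${lang}/ (ELECTRA's default <data-dir>/models/<model-name> layout), so progress can be watched with TensorBoard, assuming it is installed in the container:
tensorboard --logdir data/models/electra_small_${lang}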
One language per machine:
- kb-labb-1: Danish
- kb-labb-2: Norwegian
- kb-labb-3: Swedish
Run sizes, as fractions of a full 1M run:
- 50%: 500k
- 25%: 250k
- 12.5%: 125k
- 6.25%: 62.5k
Training speed strongly depends on the GPU (yay RTX 3090):
- RTX 3090: 13.5k / h
- RTX 2080: 7.8k / h
- RTX 2060: ?.?k / h
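A rough worked example, assuming the run sizes above and these throughput figures are both counted in training steps: the 50% run of 500k takes about 500k / 13.5k ≈ 37 h on an RTX 3090 and about 500k / 7.8k ≈ 64 h on an RTX 2080.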