bio_embedding

installation

pip install bio_embeddings[all]

usage

bio_embeddings /scratch/ch29576/common-scripts/bio_embedding/parameters_blueprint.yml 

worth mentioning

  1. the fasta input must not contain duplicated sequences, even under different ids
  2. no * is allowed in the sequences; strip them with
sed -i 's/\*//g' filename.fasta
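
Both input rules above can be handled in one pass before running the pipeline. The following is a minimal sketch, not part of bio_embeddings: the function name clean_fasta is hypothetical, it assumes the whole fasta file fits in memory, and it keeps the first record whenever two ids share the same sequence. Unlike the sed command, it removes * only from sequence lines and leaves header lines untouched.

```python
def clean_fasta(text):
    """Return FASTA text with '*' stripped from sequences and
    duplicated sequences dropped (the first id seen is kept)."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            # Sequence line: remove every '*' character.
            chunks.append(line.replace("*", ""))
    if header is not None:
        records.append((header, "".join(chunks)))

    seen = set()
    kept = []
    for header, seq in records:
        # Skip empty sequences and sequences that were already emitted.
        if seq and seq not in seen:
            seen.add(seq)
            kept.append(f"{header}\n{seq}")
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    demo = ">a\nMKV*\n>b\nMKV\n>c\nMK\nLL*\n"
    # Record b duplicates a once the '*' is removed, so it is dropped.
    print(clean_fasta(demo), end="")
```

Multi-line sequences are joined into a single line per record, and duplicates are compared case-sensitively.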

docker was not actually found; the command is kept here for reference

docker run --rm --gpus all \
    -v "$(pwd)/dev":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml

seff

about 483 short peptide sequences


State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:03:42
CPU Efficiency: 40.66% of 00:09:06 core-walltime
Job Wall-clock time: 00:04:33
Memory Utilized: 2.27 GB
Memory Efficiency: 5.66% of 40.00 GB
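
The CPU efficiency line above is CPU time divided by core-walltime (cores times wall-clock time). A small sketch that reproduces the figure from the other lines of this report follows; the helper names are hypothetical, and it assumes the plain HH:MM:SS form shown here (no leading day count).

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def cpu_efficiency(cpu_used, wall_clock, cores):
    """CPU time as a percentage of cores * wall-clock time."""
    return 100.0 * to_seconds(cpu_used) / (cores * to_seconds(wall_clock))


if __name__ == "__main__":
    # Values copied from the report: 222 s of CPU over 2 * 273 s of core-walltime.
    print(f"{cpu_efficiency('00:03:42', '00:04:33', 2):.2f}%")
```

Most of the requested 40 GB of memory went unused for this input size as well, so a smaller memory request would likely be enough for similar runs.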

stdout and stderr

out

[t-SNE] Computing 19 nearest neighbors...
[t-SNE] Indexed 483 samples in 0.000s...
[t-SNE] Computed neighbors for 483 samples in 0.110s...
[t-SNE] Computed conditional probabilities for sample 483 / 483
[t-SNE] Mean sigma: 0.104177
[t-SNE] KL divergence after 250 iterations with early exaggeration: 141.740021
[t-SNE] KL divergence after 15000 iterations: 1.613045

err

2021-02-26 00:25:32,697 INFO Created the prefix directory unassigned
2021-02-26 00:25:32,721 INFO Created the file unassigned/input_parameters_file.yml
2021-02-26 00:25:33,176 INFO Created the file unassigned/sequences_file.fasta
2021-02-26 00:25:33,195 INFO Created the file unassigned/mapping_file.csv
2021-02-26 00:25:33,209 INFO Created the file unassigned/remapped_sequences_file.fasta
2021-02-26 00:25:33,287 INFO Created the stage directory unassigned/stage_1
2021-02-26 00:25:33,311 INFO Created the file unassigned/stage_1/input_parameters_file.yml
2021-02-26 00:25:33,353 INFO Loading weights_file for seqvec from cache at '/home/ch29576/.cache/bio_embeddings/seqvec/weights_file'
2021-02-26 00:25:33,358 INFO Loading options_file for seqvec from cache at '/home/ch29576/.cache/bio_embeddings/seqvec/options_file'
2021-02-26 00:25:33,359 INFO CUDA available, using the GPU
2021-02-26 00:25:33,360 INFO Initializing ELMo.
2021-02-26 00:25:46,571 INFO Running ELMo warmup
2021-02-26 00:25:47,197 INFO The minimum expected size for the reduced_embedding_file is 1.978MB.
2021-02-26 00:25:47,197 INFO The minimum expected size for the embedding_file is 683.532MB.
2021-02-26 00:25:47,197 INFO You are going to generate a total of 685.511MB of embeddings, and have 1020691378.041MB available at unassigned.
2021-02-26 00:25:47,199 INFO Created the file unassigned/stage_1/embeddings_file.h5
2021-02-26 00:25:47,226 INFO Created the file unassigned/stage_1/reduced_embeddings_file.h5
  0%|          | 0/483 [00:00<?, ?it/s]
2021-02-26 00:25:49,122 ERROR Error processing batch of 36 sequences: CUDA out of memory. Tried to allocate 17.05 GiB (GPU 0; 15.90 GiB total capacity; 1.81 GiB already allocated; 12.99 GiB free; 2.09 GiB reserved in total by PyTorch). You might want to consider adjusting the `batch_size` parameter. Will try to embed each sequence in the set individually on the GPU.
100%|█████████▉| 482/483 [00:29<00:00, 16.34it/s] 
2021-02-26 00:26:16,830 INFO Created the file unassigned/stage_1/ouput_parameters_file.yml
2021-02-26 00:26:16,840 INFO Created the stage directory unassigned/stage_2
2021-02-26 00:26:16,924 INFO Created the file unassigned/stage_2/input_parameters_file.yml
2021-02-26 00:27:35,811 INFO Created the file unassigned/stage_2/projected_embeddings_file.csv
2021-02-26 00:27:35,872 INFO Created the file unassigned/stage_2/ouput_parameters_file.yml
2021-02-26 00:27:35,883 INFO Created the stage directory unassigned/stage_3
2021-02-26 00:27:35,906 INFO Created the file unassigned/stage_3/input_parameters_file.yml
2021-02-26 00:28:28,721 INFO Created the file unassigned/stage_3/plot_file.html
2021-02-26 00:28:29,028 INFO Created the file unassigned/stage_3/ouput_parameters_file.yml
2021-02-26 00:28:29,041 INFO Created the stage directory unassigned/stage_4
2021-02-26 00:28:29,063 INFO Created the file unassigned/stage_4/input_parameters_file.yml
2021-02-26 00:28:29,101 INFO Loading secondary_structure_checkpoint_file for seqvec_from_publication_annotations_extractors from cache at '/home/ch29576/.cache/bio_embeddings/seqvec_from_publication_annotations_extractors/secondary_structure_checkpoint_file'
2021-02-26 00:28:29,106 INFO Loading subcellular_location_checkpoint_file for seqvec_from_publication_annotations_extractors from cache at '/home/ch29576/.cache/bio_embeddings/seqvec_from_publication_annotations_extractors/subcellular_location_checkpoint_file'
2021-02-26 00:28:29,194 INFO Created the file unassigned/stage_4/DSSP3_predictions_file.fasta
2021-02-26 00:28:29,207 INFO Created the file unassigned/stage_4/DSSP8_predictions_file.fasta
2021-02-26 00:28:29,209 INFO Created the file unassigned/stage_4/disorder_predictions_file.fasta
2021-02-26 00:28:29,244 INFO Created the file unassigned/stage_4/per_sequence_predictions_file.csv
2021-02-26 00:28:32,944 INFO Created the file unassigned/stage_4/ouput_parameters_file.yml
2021-02-26 00:28:33,001 INFO Created the file unassigned/ouput_parameters_file.yml