Price’s Protein Puzzle: 2023 update

One of the joys (?) of having been online for…quite some time now…is watching topics reappear every few years or so.

"What is the longest coherent word or phrase present in the amino acid sequence of a real protein?"
(Dr. Caroline Bartman, @Caroline_Bartma, July 21, 2023)

Yes, it’s Price’s Protein Puzzle, which I last wrote about back in 2019. The good news is that my code still runs, so I’ve updated the results of an English word search versus the UniProt Reviewed (Swiss-Prot) protein database. Just for fun I threw in a few other languages too.

So what’s new? In terms of English word matches: not much. Some new proteins but no new 9-letter words.

The Twitter thread, above, contains an interesting reply about an approach using generative AI:

"I found PARALEGALS in addition to AGGRAVATES for 10 with the algorithm:
(1) Ask chat-gpt for a list of x-letter words likely to be found in a coding sequence
(2) blastp and check it's not an obvious artifact
(3) check chat-gpt didn't hallucinate a word (TRAVELGAGS?)"
(Zach Hensel, @alchemytoday, July 22, 2023)

Note that the match is to the NR protein database, not Swiss-Prot. I’d like to work with this locally but I believe it’s on the order of 150 GB now, so it would take some work to optimise.

Other languages are somewhat constrained by (1) the quality of the word lists that I could find online (on GitHub) and (2) for some languages, the presence of characters not found in ...
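For readers who want to try the idea, here is a minimal sketch of a word search against protein sequences. It is not the code behind this post; the function names and the two toy sequences are hypothetical and are not real UniProt entries. It assumes that only the 20 standard one-letter amino acid codes count, so any word containing B, J, O, U, X or Z is dropped up front.

```python
from collections import defaultdict

# Assumption: restrict matches to the 20 standard amino acid letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def usable_words(words, min_len=3):
    """Upper-case the word list and keep words spellable with amino acid letters."""
    kept = set()
    for word in words:
        word = word.strip().upper()
        if len(word) >= min_len and set(word) <= AMINO_ACIDS:
            kept.add(word)
    return kept


def find_words(sequences, words, min_len=3):
    """Map each word found as a substring to the set of sequence IDs containing it.

    sequences: dict of {sequence_id: amino_acid_string}
    """
    vocab = usable_words(words, min_len)
    lengths = sorted({len(w) for w in vocab}, reverse=True)
    hits = defaultdict(set)
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        # Slide a window of each candidate length along the sequence.
        for n in lengths:
            for i in range(len(seq) - n + 1):
                piece = seq[i:i + n]
                if piece in vocab:
                    hits[piece].add(seq_id)
    return hits


def longest_hits(hits):
    """Return the longest matched words, sorted alphabetically."""
    if not hits:
        return []
    best = max(len(w) for w in hits)
    return sorted(w for w in hits if len(w) == best)


# Toy data for illustration only (hypothetical IDs and sequences).
toy_sequences = {
    "P1": "MKTPARALEGALSQ",
    "P2": "GGAGGRAVATESLL",
}
toy_words = ["paralegals", "aggravates", "legal", "box", "rave"]

toy_hits = find_words(toy_sequences, toy_words)
print(longest_hits(toy_hits))
```

On a real database the sequences would be read from a FASTA file, and the sliding-window set lookup keeps the cost proportional to sequence length times the number of distinct word lengths, rather than to the size of the word list.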
Source: What You're Doing Is Rather Desperate - Category: Bioinformatics Authors: Tags: bioinformatics statistics algorithm amino acid rstats search words Source Type: blogs