The Protein Expression Problem
You have designed the perfect protein. The sequence looks right, the structure prediction shows a well-folded domain, and the stability score is excellent. You order the gene, clone it into your expression vector, transform E. coli, induce with IPTG, and run an SDS-PAGE gel. The band is there – in the insoluble pellet fraction. Your protein has expressed as inclusion bodies.
This is the most common failure mode in recombinant protein production. Estimates vary, but roughly 30–50% of heterologous proteins expressed in E. coli end up as insoluble aggregates. Each failed expression attempt costs days of work and hundreds of dollars in reagents. Multiply that by a protein engineering campaign with 20 variants, and the cost of expression failures becomes a major bottleneck.
Solubility prediction tools like SoluProt can flag problematic sequences before you clone them, saving weeks of wasted effort. In this guide, we explain why proteins aggregate and how SoluProt works, then walk through hands-on predictions for 5 real proteins.
Why Do Proteins Aggregate?
Protein aggregation during recombinant expression is driven by a combination of thermodynamic and kinetic factors. Understanding these mechanisms helps you interpret solubility predictions and design better-expressing variants.
Hydrophobic Exposure
The primary driving force for aggregation is the exposure of hydrophobic surfaces. In a correctly folded protein, hydrophobic residues are buried in the core. During expression, if the protein folds slowly or incompletely, hydrophobic patches on partially folded intermediates can interact with other molecules, nucleating aggregates. Proteins with large hydrophobic cores relative to their surface area are more prone to this failure mode.
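A rough, sequence-only view of hydrophobicity can be sketched with the Kyte-Doolittle hydropathy scale. The example below computes the mean hydropathy (GRAVY) and flags windows whose average exceeds a cutoff. The window length of 9 and the cutoff of 1.6 are illustrative assumptions, and a sequence profile cannot tell buried residues from exposed ones, so treat hits as hints rather than confirmed surface patches.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Mean hydropathy over the whole sequence (standard residues only)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KD[aa] for aa in seq) / len(seq)


def hydrophobic_windows(seq, window=9, cutoff=1.6):
    """Return (1-based start, mean hydropathy) for windows at or above cutoff."""
    hits = []
    for i in range(len(seq) - window + 1):
        mean = sum(KD[aa] for aa in seq[i:i + window]) / window
        if mean >= cutoff:
            hits.append((i + 1, round(mean, 2)))
    return hits
```

Calling `hydrophobic_windows(sequence)` on a candidate returns an empty list when no window crosses the cutoff; a long run of hits is a reason to look at the predicted structure before cloning.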
Folding Kinetics
E. coli grows fast and produces protein rapidly. If the protein's folding rate cannot keep up with its translation rate, unfolded chains accumulate and aggregate. This is why lowering induction temperature (from 37 to 16–20 degrees Celsius) often rescues solubility – it slows translation, giving each chain more time to fold.
Missing Modifications
Many eukaryotic proteins require post-translational modifications (glycosylation, disulfide bonds) for correct folding. E. coli's cytoplasm is reducing, so disulfide bonds do not form unless you use specialized strains (SHuffle, Origami). Glycosylation is absent entirely. Proteins that depend on these modifications will often aggregate in standard bacterial expression.
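A quick sequence check can hint at whether a construct might depend on these modifications. The sketch below counts cysteines and lists N-glycosylation sequons (Asn, any residue except Pro, then Ser or Thr). The function name is hypothetical, and both signals are only heuristics: a sequon is not necessarily glycosylated in the native host, and cysteines do not necessarily form disulfides.

```python
import re


def modification_flags(seq):
    """Flag sequence features that hint at a dependence on eukaryotic PTMs."""
    # A lookahead is used so that overlapping sequons are all reported.
    sequons = [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", seq)]
    return {
        "cysteines": seq.count("C"),   # candidate disulfide partners
        "n_glyc_sequons": sequons,     # 1-based positions of the Asn
    }
```

A protein with many cysteines or several sequons is a candidate for a specialized strain or a eukaryotic host rather than standard cytoplasmic expression.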
Charge and Isoelectric Point
Proteins with extreme isoelectric points (pI very high or very low) can have solubility issues at cellular pH. Additionally, proteins with large surface patches of opposite charge can self-associate through complementary electrostatic interactions between molecules.
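Net charge at a given pH and an approximate pI can be estimated directly from sequence with the Henderson-Hasselbalch relation. The sketch below is a minimal version under stated assumptions: the pKa values are one commonly used approximate set (published tables differ by a few tenths of a pH unit), and residues are treated as independent, which ignores the local environment in a folded protein.

```python
# Assumed approximate pKa values; other published sets differ slightly.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(seq, ph):
    """Estimated net charge of the sequence at the given pH."""
    positive = 1 / (1 + 10 ** (ph - PKA_N_TERM))
    negative = 1 / (1 + 10 ** (PKA_C_TERM - ph))
    for aa in seq:
        if aa in PKA_POSITIVE:
            positive += 1 / (1 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1 / (1 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(seq, tol=0.01):
    """Estimate pI by bisection; net charge falls monotonically with pH."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return round((lo + hi) / 2, 2)
```

Comparing `net_charge(seq, 7.5)` across variants gives a quick way to see how a proposed charge-altering mutation shifts the estimate, even though the absolute numbers are approximate.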
SoluProt: ML-Based Solubility Prediction
SoluProt is a gradient-boosted tree model trained on experimental solubility data from the TargetTrack and PSI:Biology structural genomics databases. It takes a protein amino acid sequence as input and outputs a solubility probability score between 0 and 1.
How SoluProt Works
- Feature extraction: Calculates sequence-derived features including amino acid composition, dipeptide composition, physicochemical properties (molecular weight, charge, hydrophobicity), and disorder propensity
- Model: Gradient-boosted decision tree ensemble trained on ~14,000 proteins with known solubility outcomes
- Output: Probability score from 0 (predicted insoluble) to 1 (predicted soluble)
- Speed: Sub-second inference – sequence features only, no structure required
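To make the feature-extraction step concrete, here is a minimal sketch of two of the feature families listed above: amino acid composition and dipeptide composition. It illustrates the general idea of turning a sequence into a fixed-length numeric vector and is not SoluProt's actual feature code; the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair that occurs in the sequence."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return {pair: n / total for pair, n in pairs.items()}


def feature_vector(seq):
    """Concatenate 20 composition and 400 dipeptide features (420 values)."""
    comp = aa_composition(seq)
    dipep = dipeptide_composition(seq)
    vector = [comp[aa] for aa in AMINO_ACIDS]
    vector += [dipep.get(a + b, 0.0) for a in AMINO_ACIDS for b in AMINO_ACIDS]
    return vector
```

A vector like this, extended with physicochemical descriptors, is the kind of input a gradient-boosted tree model can consume without any structural information.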
Hands-On: Predicting Solubility for 5 Real Proteins
We will test SoluProt on 5 well-characterized proteins with known expression behavior in E. coli. The set includes proteins that express well in soluble form, proteins known to form inclusion bodies, and borderline cases.
import scirouter
client = scirouter.SciRouter()
# 5 proteins with known E. coli expression behavior
proteins = [
    {
        "name": "GFP (1EMA)",
        "sequence": (
            "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTT"
            "GKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFF"
            "KDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNV"
            "YIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHY"
            "LSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
        ),
        "expected": "Soluble (well-known high expressor)",
    },
    {
        "name": "T4 Lysozyme (2LZM)",
        "sequence": (
            "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELD"
            "KAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAV"
            "RRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYND"
            "QTPNRAKRVITTFRTGTWDAYKNL"
        ),
        "expected": "Soluble (standard model protein)",
    },
    {
        "name": "Human p53 DBD (2OCJ)",
        "sequence": (
            "SSSVPSQKTYPQGLNGTVNLPGRNSFEVRVCA"
            "CPHERCTEGVSVGAGSAHCIPHETFNRAISEN"
            "RRMHCQHTGCHFCQECEQCPSCNRRGECCTSC"
            "PNTDCATYPLCAASECIDPSVK"
        ),
        "expected": "Low solubility (aggregation-prone)",
    },
    {
        "name": "MBP (1ANF)",
        "sequence": (
            "MKIKTGARILALSALTTMMFSASALAKIEEGKLVIWINGDKGYNGLAEVGK"
            "KFEKDTGIKVTVEHPDKLEEKFPQVAATGDGPDIIFWAHDRFGGYAQSGL"
            "LAEITPDKAFQDKLYPFTWDAVRYNGKLIAYPIAVEALSLIYNKDLLPNPP"
            "KTWEEIPALDKELKAKGKSALMFNLQEPYFTWPLIAADGGYAFKYENGKYF"
            "DAAALKGEAPDGYLAIKTYNGALDNQKGIPVRGCAALNLCPYSSVWG"
        ),
        "expected": "Highly soluble (common fusion tag)",
    },
    {
        "name": "Human Insulin (P01308)",
        "sequence": (
            "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYT"
            "PKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICS"
            "LYQLENYCN"
        ),
        "expected": "Insoluble in E. coli (requires refolding)",
    },
]
print(f"{'Protein':<25} {'Score':>6} {'Prediction':<15} {'Expected'}")
print("-" * 80)
for protein in proteins:
    result = client.design.solubility(sequence=protein["sequence"])
    prediction = "Soluble" if result.score > 0.5 else "Insoluble"
    print(
        f"{protein['name']:<25} {result.score:>6.2f} {prediction:<15} "
        f"{protein['expected']}"
    )
Interpreting Solubility Scores
SoluProt scores should be interpreted as relative rankings rather than absolute probabilities. Here is a practical interpretation guide:
- Score above 0.7: High confidence soluble. Proceed with standard expression protocols.
- Score 0.5 to 0.7: Borderline. Consider using solubility tags (MBP, SUMO) or lowering expression temperature as a precaution.
- Score 0.3 to 0.5: Likely problematic. Use fusion tags, co-express chaperones, and optimize expression conditions.
- Score below 0.3: High risk of inclusion bodies. Consider switching to a eukaryotic expression host or redesigning the construct.
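The interpretation guide above can be captured in a small helper so that every score in a batch is labeled consistently. This is a sketch: the function name is hypothetical, and assigning the exact boundary values (0.7, 0.5, 0.3) to the lower-risk side of each tier is an assumption, since the guide does not say which tier a boundary score belongs to.

```python
def solubility_tier(score):
    """Map a solubility score in [0, 1] to the tiers from the guide above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score > 0.7:
        return "high confidence soluble"
    if score >= 0.5:
        return "borderline"
    if score >= 0.3:
        return "likely problematic"
    return "high risk of inclusion bodies"
```

Because the scores are best read as relative rankings, a helper like this is most useful for triaging a batch of variants, not for making a final call on a single protein.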
Improving Solubility with Rational Mutations
When SoluProt flags a protein as likely insoluble, you can try computational mutagenesis to find variants with improved solubility. The strategy is to replace surface-exposed hydrophobic residues with hydrophilic alternatives:
# Start with a protein that has low predicted solubility
# (reuses the `client` created in the previous example)
problem_sequence = (
    "SSSVPSQKTYPQGLNGTVNLPGRNSFEVRVCA"
    "CPHERCTEGVSVGAGSAHCIPHETFNRAISEN"
    "RRMHCQHTGCHFCQECEQCPSCNRRGECCTSC"
    "PNTDCATYPLCAASECIDPSVK"
)
baseline = client.design.solubility(sequence=problem_sequence)
print(f"Baseline solubility score: {baseline.score:.3f}")
# Scan every Leu, Val, and Ile position with a polar substitution:
# Leucine -> Glutamine, Valine -> Threonine, Isoleucine -> Asparagine.
# This sequence-only scan cannot tell surface residues from buried ones,
# so check the top hits against a structure before ordering constructs.
substitutions = {
    "L": "Q",  # Leu -> Gln (similar size, polar)
    "V": "T",  # Val -> Thr (similar size, polar)
    "I": "N",  # Ile -> Asn (similar size, polar)
}
improvements = []
for i, aa in enumerate(problem_sequence):
    if aa in substitutions:
        mutant = problem_sequence[:i] + substitutions[aa] + problem_sequence[i + 1:]
        result = client.design.solubility(sequence=mutant)
        delta = result.score - baseline.score
        if delta > 0.01:
            improvements.append({
                "position": i + 1,
                "mutation": f"{aa}{i + 1}{substitutions[aa]}",
                "score": result.score,
                "delta": delta,
            })
improvements.sort(key=lambda x: x["delta"], reverse=True)
print(f"\nFound {len(improvements)} improving mutations:")
for imp in improvements[:10]:
    print(f"  {imp['mutation']}: {imp['score']:.3f} (+{imp['delta']:.3f})")
Combining Solubility with Structure and Stability
The most robust protein engineering workflow combines three computational checks before any wet lab work:
1. Solubility (SoluProt): Will the protein express as soluble? Screen the sequence first since it requires no structure.
2. Structure (ESMFold): Does the protein fold correctly? Check that the predicted structure is well-folded (high pLDDT).
3. Stability (ThermoMPNN): Are the mutations stabilizing? Verify that designed variants improve or maintain thermostability.
import scirouter
client = scirouter.SciRouter()
# Your candidate sequence
candidate_seq = (
    "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELD"
    "KAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAV"
    "RRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYND"
    "QTPNRAKRVITTFRTGTWDAYKNL"
)
# Check 1: Solubility
sol = client.design.solubility(sequence=candidate_seq)
print(f"Solubility score: {sol.score:.2f} {'PASS' if sol.score > 0.5 else 'FAIL'}")
# Check 2: Structure
fold_job = client.proteins.fold(sequence=candidate_seq, model="esmfold")
fold = client.proteins.fold_result(fold_job.job_id, poll=True)
print(f"Mean pLDDT: {fold.mean_plddt:.1f} {'PASS' if fold.mean_plddt > 70 else 'FAIL'}")
# Check 3: Stability (relative to a reference mutation)
stab = client.design.stability(
    pdb=fold.pdb,
    mutations=["A98V"],  # Example mutation to test
)
print(f"Mutation DDG: {stab.ddg:+.2f} {'PASS' if stab.ddg < 0 else 'REVIEW'}")
# Overall verdict
checks_passed = sum([
    sol.score > 0.5,
    fold.mean_plddt > 70,
    stab.ddg < 0,
])
print(f"\nOverall: {checks_passed}/3 checks passed")
Practical Tips for Improving Protein Solubility
When computational predictions indicate solubility problems, several experimental strategies can help rescue expression:
Construct Design
- Solubility tags: MBP (maltose-binding protein) is the most effective solubility-enhancing fusion tag, improving solubility for roughly 80% of partners. SUMO and Trx are alternatives.
- Truncations: Remove disordered N- or C-terminal extensions that may promote aggregation without contributing to function.
- Domain boundaries: If the full-length protein is insoluble, express individual domains separately. Use structure predictions to identify domain boundaries.
Expression Conditions
- Lower temperature: 16–20 degrees Celsius instead of 37 degrees. Slower translation gives proteins more time to fold.
- Reduce inducer: Lower IPTG concentration (0.1 mM instead of 1 mM) reduces expression rate and aggregation.
- Co-expression: Co-express chaperones like GroEL/GroES or DnaK/DnaJ/GrpE to assist folding.
- Specialized strains: SHuffle (NEB) for disulfide bonds, ArcticExpress for cold-adapted chaperones.
Sequence Engineering
- Surface mutations: Replace exposed hydrophobic residues (Leu, Val, Ile, Phe) with polar ones (Gln, Thr, Asn, Ser)
- Charge optimization: Add surface lysines or glutamates to increase net charge and electrostatic repulsion between molecules
- Consensus mutations: Align homologous sequences and substitute rare residues with consensus amino acids, which are statistically more likely to fold correctly
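The consensus approach in the last bullet can be sketched in a few lines once you have a multiple sequence alignment. The example below assumes the target and its homologs are already aligned to the same length with "-" for gaps; the function name and the 0.6 agreement threshold are illustrative assumptions, and the alignment itself must come from a separate tool.

```python
from collections import Counter


def consensus_mutations(target, homologs, min_fraction=0.6):
    """Suggest target -> consensus substitutions from gap-aligned sequences."""
    if any(len(h) != len(target) for h in homologs):
        raise ValueError("all aligned sequences must have the same length")
    suggestions = []
    position = 0  # residue number in the ungapped target
    for col, target_aa in enumerate(target):
        if target_aa == "-":
            continue  # alignment column with no target residue
        position += 1
        column = [h[col] for h in homologs if h[col] != "-"]
        if not column:
            continue
        aa, count = Counter(column).most_common(1)[0]
        # Only suggest a change when the homologs clearly agree on a
        # residue that differs from the target.
        if aa != target_aa and count / len(column) >= min_fraction:
            suggestions.append(f"{target_aa}{position}{aa}")
    return suggestions
```

Each suggested mutation can then be scored with the solubility predictor, so that consensus substitutions are kept only when they do not lower the predicted score.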
Next Steps
Solubility prediction is a fast, sequence-only screen that should be your first check in any protein engineering workflow. Combine it with:
- Stability prediction with ThermoMPNN to ensure your variants are thermostable
- ESMFold for structure validation of designed sequences
- ProteinMPNN for automated sequence design that optimizes multiple properties simultaneously
Get a free SciRouter API key and check whether your proteins will express before you spend a week cloning them. 500 free credits per month, no credit card needed.