Protein Solubility Prediction: Will Your Protein Express? Check Before You Clone

Predict protein solubility before cloning with SoluProt ML. Why proteins aggregate, hands-on predictions, and combining with stability for optimal expression.

Ryan Bethencourt
April 8, 2026

The Protein Expression Problem

You have designed the perfect protein. The sequence looks right, the structure prediction shows a well-folded domain, and the stability score is excellent. You order the gene, clone it into your expression vector, transform E. coli, induce with IPTG, and run an SDS-PAGE gel. The band is there – in the insoluble pellet fraction. Your protein has expressed as inclusion bodies.

This is the most common failure mode in recombinant protein production. Estimates vary, but roughly 30–50% of heterologous proteins expressed in E. coli end up as insoluble aggregates. Each failed expression attempt costs days of work and hundreds of dollars in reagents. Multiply that by a protein engineering campaign with 20 variants, and the cost of expression failures becomes a major bottleneck.

Solubility prediction tools like SoluProt can flag problematic sequences before you clone them, saving weeks of wasted effort. In this guide, we explain why proteins aggregate, how SoluProt works, and demonstrate hands-on prediction for 5 real proteins.

Why Do Proteins Aggregate?

Protein aggregation during recombinant expression is driven by a combination of thermodynamic and kinetic factors. Understanding these mechanisms helps you interpret solubility predictions and design better-expressing variants.

Hydrophobic Exposure

The primary driving force for aggregation is the exposure of hydrophobic surfaces. In a correctly folded protein, hydrophobic residues are buried in the core. During expression, if the protein folds slowly or incompletely, hydrophobic patches on partially folded intermediates can interact with other molecules, nucleating aggregates. Proteins with large hydrophobic cores relative to their surface area are more prone to this failure mode.
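
As a rough, sequence-only proxy for this effect, you can compute the mean hydropathy of a sequence and look for strongly hydrophobic stretches. The sketch below uses the Kyte-Doolittle scale; the function names, window size, and threshold are illustrative assumptions, and a sequence scan cannot tell buried residues from exposed ones.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy (GRAVY) over standard residues."""
    values = [KD[aa] for aa in sequence.upper() if aa in KD]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)


def hydrophobic_windows(sequence, window=9, threshold=2.0):
    """Return (1-based start, mean hydropathy) for windows at or above threshold.

    The window size and threshold are illustrative defaults, not tuned values.
    """
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        # Non-standard residues contribute 0.0 to the window mean.
        score = sum(KD.get(aa, 0.0) for aa in chunk) / window
        if score >= threshold:
            hits.append((i + 1, round(score, 2)))
    return hits
```

A higher GRAVY value or many flagged windows suggests more hydrophobic character overall; whether those residues are actually exposed has to be checked against a structure.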

Folding Kinetics

E. coli grows fast and produces protein rapidly. If the protein's folding rate cannot keep up with its translation rate, unfolded chains accumulate and aggregate. This is why lowering induction temperature (from 37 to 16–20 degrees Celsius) often rescues solubility – it slows translation, giving each chain more time to fold.

Missing Modifications

Many eukaryotic proteins require post-translational modifications (glycosylation, disulfide bonds) for correct folding. E. coli's cytoplasm is reducing, so disulfide bonds do not form unless you use specialized strains (SHuffle, Origami). Glycosylation is absent entirely. Proteins that depend on these modifications will often aggregate in standard bacterial expression.

Charge and Isoelectric Point

Proteins with extreme isoelectric points (pI very high or very low) can have solubility issues at cellular pH. Additionally, proteins with large surface patches of opposite charge can self-associate through complementary charge interactions between molecules.
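
To get a feel for where a sequence sits, net charge and pI can be estimated from composition alone using the Henderson-Hasselbalch relation. This is a minimal sketch: the pKa values are approximate textbook-style numbers assumed for illustration (published sets differ by a few tenths), and the function names are hypothetical.

```python
# Approximate pKa values, assumed for illustration only.
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERM_PKA = 9.0
C_TERM_PKA = 3.5


def net_charge(sequence, ph=7.0):
    """Estimate net charge at a given pH from residue counts.

    Treats every cysteine as a free thiol and ignores the local
    environment of each residue, so the result is only an estimate.
    """
    seq = sequence.upper()
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # C-terminal carboxyl
    for aa, pka in POSITIVE_PKA.items():
        charge += seq.count(aa) / (1.0 + 10 ** (ph - pka))
    for aa, pka in NEGATIVE_PKA.items():
        charge -= seq.count(aa) / (1.0 + 10 ** (pka - ph))
    return charge


def estimate_pi(sequence, tolerance=0.001):
    """Find the pH where the estimated net charge crosses zero (bisection)."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive, so the pI lies at higher pH
        else:
            high = mid
    return round((low + high) / 2, 2)
```

Bisection works here because the estimated charge decreases monotonically as pH rises.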

SoluProt: ML-Based Solubility Prediction

SoluProt is a gradient-boosted tree model trained on experimental solubility data from the TargetTrack and PSI:Biology structural genomics databases. It takes a protein amino acid sequence as input and outputs a solubility probability score between 0 and 1.

How SoluProt Works

  • Feature extraction: Calculates sequence-derived features including amino acid composition, dipeptide composition, physicochemical properties (molecular weight, charge, hydrophobicity), and disorder propensity
  • Model: Gradient-boosted decision tree ensemble trained on ~14,000 proteins with known solubility outcomes
  • Output: Probability score from 0 (predicted insoluble) to 1 (predicted soluble)
  • Speed: Sub-second inference – sequence features only, no structure required
Note
SoluProt is a sequence-based predictor, which means it does not require a 3D structure. This is a significant advantage over structure-based solubility methods – you can screen sequences before you even predict their structures.
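
To make the feature extraction step concrete, here is a minimal sketch of two of the feature families listed above: amino acid composition and dipeptide composition. It is not SoluProt's own code or its exact feature set; the function names and the 420-value layout are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard residues."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def dipeptide_composition(sequence):
    """Fraction of each of the 400 overlapping residue pairs."""
    seq = sequence.upper()
    valid = set(DIPEPTIDES)
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    counts = Counter(p for p in pairs if p in valid)
    total = sum(counts.values())
    return {dp: (counts[dp] / total if total else 0.0) for dp in DIPEPTIDES}


def feature_vector(sequence):
    """Concatenate both compositions into a fixed-length list of 420 floats."""
    aac = amino_acid_composition(sequence)
    dpc = dipeptide_composition(sequence)
    return [aac[aa] for aa in AMINO_ACIDS] + [dpc[dp] for dp in DIPEPTIDES]
```

Because the vector has the same length for every sequence, it can be fed directly to a tree-based classifier, which is what makes sequence-only prediction so fast.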

Hands-On: Predicting Solubility for 5 Real Proteins

We will test SoluProt on 5 well-characterized proteins with known expression behavior in E. coli. This includes proteins that express well as soluble, proteins known to form inclusion bodies, and borderline cases.

Predict solubility for 5 proteins
import scirouter

client = scirouter.SciRouter()

# 5 proteins with known E. coli expression behavior
proteins = [
    {
        "name": "GFP (1EMA)",
        "sequence": (
            "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTT"
            "GKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFF"
            "KDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNV"
            "YIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHY"
            "LSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
        ),
        "expected": "Soluble (well-known high expressor)",
    },
    {
        "name": "T4 Lysozyme (2LZM)",
        "sequence": (
            "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELD"
            "KAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAV"
            "RRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYND"
            "QTPNRAKRVITTFRTGTWDAYKNL"
        ),
        "expected": "Soluble (standard model protein)",
    },
    {
        "name": "Human p53 DBD (2OCJ)",
        "sequence": (
            "SSSVPSQKTYPQGLNGTVNLPGRNSFEVRVCA"
            "CPHERCTEGVSVGAGSAHCIPHETFNRAISEN"
            "RRMHCQHTGCHFCQECEQCPSCNRRGECCTSC"
            "PNTDCATYPLCAASECIDPSVK"
        ),
        "expected": "Low solubility (aggregation-prone)",
    },
    {
        "name": "MBP (1ANF)",
        "sequence": (
            "MKIKTGARILALSALTTMMFSASALAKIEEGKLVIWINGDKGYNGLAEVGK"
            "KFEKDTGIKVTVEHPDKLEEKFPQVAATGDGPDIIFWAHDRFGGYAQSGL"
            "LAEITPDKAFQDKLYPFTWDAVRYNGKLIAYPIAVEALSLIYNKDLLPNPP"
            "KTWEEIPALDKELKAKGKSALMFNLQEPYFTWPLIAADGGYAFKYENGKYF"
            "DAAALKGEAPDGYLAIKTYNGALDNQKGIPVRGCAALNLCPYSSVWG"
        ),
        "expected": "Highly soluble (common fusion tag)",
    },
    {
        "name": "Human Insulin (P01308)",
        "sequence": (
            "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYT"
            "PKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICS"
            "LYQLENYCN"
        ),
        "expected": "Insoluble in E. coli (requires refolding)",
    },
]

print(f"{'Protein':<25} {'Score':>6} {'Prediction':<15} {'Expected'}")
print("-" * 80)

for protein in proteins:
    result = client.design.solubility(sequence=protein["sequence"])
    prediction = "Soluble" if result.score > 0.5 else "Insoluble"
    print(
        f"{protein['name']:<25} {result.score:>6.2f} {prediction:<15} "
        f"{protein['expected']}"
    )

Interpreting Solubility Scores

SoluProt scores should be interpreted as relative rankings rather than absolute probabilities. Here is a practical interpretation guide:

  • Score above 0.7: High confidence soluble. Proceed with standard expression protocols.
  • Score 0.5 to 0.7: Borderline. Consider using solubility tags (MBP, SUMO) or lowering expression temperature as a precaution.
  • Score 0.3 to 0.5: Likely problematic. Use fusion tags, co-express chaperones, and optimize expression conditions.
  • Score below 0.3: High risk of inclusion bodies. Consider switching to a eukaryotic expression host or redesigning the construct.
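
The guide above maps naturally onto a small helper that you can drop into a screening script. This is a sketch: the function name is hypothetical, and which band the exact boundary values (0.3, 0.5, 0.7) fall into is an assumption, since the guide does not specify it.

```python
def interpret_score(score):
    """Map a solubility score in [0, 1] to a (label, suggested action) pair.

    Boundary handling is an assumption: 0.7 counts as borderline,
    0.5 as borderline, and 0.3 as likely problematic.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > 0.7:
        return ("likely soluble", "Proceed with standard expression protocols")
    if score >= 0.5:
        return ("borderline", "Consider a solubility tag or lower temperature")
    if score >= 0.3:
        return ("likely problematic", "Use fusion tags, chaperones, optimized conditions")
    return ("high risk", "Consider a eukaryotic host or a construct redesign")


# Example with made-up scores (not real SoluProt output)
for example_score in (0.85, 0.62, 0.41, 0.12):
    label, action = interpret_score(example_score)
    print(f"{example_score:.2f}: {label} -> {action}")
```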

Improving Solubility with Rational Mutations

When SoluProt flags a protein as likely insoluble, you can try computational mutagenesis to find variants with improved solubility. The strategy is to replace surface-exposed hydrophobic residues with hydrophilic alternatives:

Screen mutations to improve solubility
import scirouter

client = scirouter.SciRouter()

# Start with a protein that has low predicted solubility
problem_sequence = (
    "SSSVPSQKTYPQGLNGTVNLPGRNSFEVRVCA"
    "CPHERCTEGVSVGAGSAHCIPHETFNRAISEN"
    "RRMHCQHTGCHFCQECEQCPSCNRRGECCTSC"
    "PNTDCATYPLCAASECIDPSVK"
)

baseline = client.design.solubility(sequence=problem_sequence)
print(f"Baseline solubility score: {baseline.score:.3f}")

# Scan every Leu, Val, and Ile position. The sequence alone cannot tell
# which of these are surface-exposed, so check any hits against a structure.
# Leucine -> Glutamine, Valine -> Threonine, Isoleucine -> Asparagine
substitutions = {
    "L": "Q",  # Leu -> Gln (similar size, polar)
    "V": "T",  # Val -> Thr (similar size, polar)
    "I": "N",  # Ile -> Asn (similar size, polar)
}

improvements = []
for i, aa in enumerate(problem_sequence):
    if aa in substitutions:
        mutant = problem_sequence[:i] + substitutions[aa] + problem_sequence[i+1:]
        result = client.design.solubility(sequence=mutant)
        delta = result.score - baseline.score
        if delta > 0.01:
            improvements.append({
                "position": i + 1,
                "mutation": f"{aa}{i+1}{substitutions[aa]}",
                "score": result.score,
                "delta": delta,
            })

improvements.sort(key=lambda x: x["delta"], reverse=True)
print(f"\nFound {len(improvements)} improving mutations:")
for imp in improvements[:10]:
    print(f"  {imp['mutation']}: {imp['score']:.3f} (+{imp['delta']:.3f})")
Warning
Surface mutations that improve solubility can sometimes reduce stability or disrupt functional interfaces. Always cross-check solubility-improving mutations against stability predictions and known functional sites.
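
One way to apply this cross-check is to filter the `improvements` list from the screen above before ordering any constructs. The sketch below is a plain-Python filter; the function name, the `ddg_by_mutation` mapping, and the 0.5 cutoff are hypothetical, and it assumes the convention that a positive DDG means destabilizing.

```python
def filter_candidates(improvements, protected_positions,
                      ddg_by_mutation=None, max_ddg=0.5):
    """Split candidates into kept and rejected lists.

    improvements: dicts with "position" and "mutation" keys, as in the screen.
    protected_positions: 1-based positions at known functional sites.
    ddg_by_mutation: optional {mutation: DDG} map (positive = destabilizing,
        an assumed convention). Mutations without a DDG value are kept.
    max_ddg: hypothetical cutoff above which a mutation is rejected.
    """
    ddg_by_mutation = ddg_by_mutation or {}
    kept, rejected = [], []
    for cand in improvements:
        ddg = ddg_by_mutation.get(cand["mutation"])
        if cand["position"] in protected_positions:
            rejected.append((cand["mutation"], "functional site"))
        elif ddg is not None and ddg > max_ddg:
            rejected.append((cand["mutation"], "destabilizing"))
        else:
            kept.append(cand)
    return kept, rejected


# Example with made-up values (not real predictions)
example = [
    {"position": 10, "mutation": "L10Q", "score": 0.48, "delta": 0.05},
    {"position": 22, "mutation": "V22T", "score": 0.47, "delta": 0.04},
    {"position": 30, "mutation": "I30N", "score": 0.46, "delta": 0.03},
]
kept, rejected = filter_candidates(example, {22}, {"L10Q": -0.1, "I30N": 1.2})
print([c["mutation"] for c in kept], rejected)
```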

Combining Solubility with Structure and Stability

The most robust protein engineering workflow combines three computational checks before any wet lab work:

  • 1. Solubility (SoluProt): Will the protein express as soluble? Screen the sequence first since it requires no structure.
  • 2. Structure (ESMFold): Does the protein fold correctly? Check that the predicted structure is well-folded (high pLDDT).
  • 3. Stability (ThermoMPNN): Are the mutations stabilizing? Verify that designed variants improve or maintain thermostability.
Three-check protein engineering pipeline
import scirouter

client = scirouter.SciRouter()

# Your candidate sequence
candidate_seq = (
    "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELD"
    "KAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAV"
    "RRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYND"
    "QTPNRAKRVITTFRTGTWDAYKNL"
)

# Check 1: Solubility
sol = client.design.solubility(sequence=candidate_seq)
print(f"Solubility score: {sol.score:.2f} {'PASS' if sol.score > 0.5 else 'FAIL'}")

# Check 2: Structure
fold_job = client.proteins.fold(sequence=candidate_seq, model="esmfold")
fold = client.proteins.fold_result(fold_job.job_id, poll=True)
print(f"Mean pLDDT: {fold.mean_plddt:.1f} {'PASS' if fold.mean_plddt > 70 else 'FAIL'}")

# Check 3: Stability of an example point mutation (negative DDG is treated as stabilizing)
stab = client.design.stability(
    pdb=fold.pdb,
    mutations=["A98V"],  # Example mutation to test
)
print(f"Mutation DDG: {stab.ddg:+.2f} {'PASS' if stab.ddg < 0 else 'REVIEW'}")

# Overall verdict
checks_passed = sum([
    sol.score > 0.5,
    fold.mean_plddt > 70,
    stab.ddg < 0,
])
print(f"\nOverall: {checks_passed}/3 checks passed")

Practical Tips for Improving Protein Solubility

When computational predictions indicate solubility problems, several experimental strategies can help rescue expression:

Construct Design

  • Solubility tags: MBP (maltose-binding protein) is among the most effective solubility-enhancing fusion tags, improving solubility for roughly 80% of partners. SUMO and Trx are alternatives.
  • Truncations: Remove disordered N- or C-terminal extensions that may promote aggregation without contributing to function.
  • Domain boundaries: If the full-length protein is insoluble, express individual domains separately. Use structure predictions to identify domain boundaries.

Expression Conditions

  • Lower temperature: 16–20 degrees Celsius instead of 37 degrees. Slower translation gives proteins more time to fold.
  • Reduce inducer: Lower IPTG concentration (0.1 mM instead of 1 mM) reduces expression rate and aggregation.
  • Co-expression: Co-express chaperones like GroEL/GroES or DnaK/DnaJ/GrpE to assist folding.
  • Specialized strains: SHuffle (NEB) for disulfide bonds, ArcticExpress for cold-adapted chaperones.

Sequence Engineering

  • Surface mutations: Replace exposed hydrophobic residues (Leu, Val, Ile, Phe) with polar ones (Gln, Thr, Asn, Ser)
  • Charge optimization: Add surface lysines or glutamates to increase net charge and electrostatic repulsion between molecules
  • Consensus mutations: Align homologous sequences and substitute rare residues with consensus amino acids, which are statistically more likely to fold correctly
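
The consensus approach is simple enough to sketch in a few lines once you have an alignment. The example below assumes the target and homologs are already aligned to the same length with `-` for gaps; the function name and the 0.6 conservation cutoff are illustrative assumptions.

```python
from collections import Counter


def consensus_suggestions(target, homologs, min_fraction=0.6):
    """Suggest target -> consensus substitutions from a pre-built alignment.

    Returns (mutation, fraction) tuples, where the mutation is numbered in
    the target's ungapped coordinates and fraction is the share of
    non-gap homolog residues that match the consensus at that column.
    """
    if any(len(h) != len(target) for h in homologs):
        raise ValueError("all sequences must be aligned to the same length")
    suggestions = []
    position = 0  # 1-based index into the ungapped target
    for col, target_aa in enumerate(target):
        if target_aa == "-":
            continue
        position += 1
        column = [h[col] for h in homologs if h[col] != "-"]
        if not column:
            continue
        consensus_aa, count = Counter(column).most_common(1)[0]
        fraction = count / len(column)
        if consensus_aa != target_aa and fraction >= min_fraction:
            suggestions.append((f"{target_aa}{position}{consensus_aa}", fraction))
    return suggestions


# Toy alignment for illustration only
print(consensus_suggestions("MKLV", ["MKIV", "MRIV", "MKIV", "MKIA"]))
```

Each suggested substitution can then be scored for solubility and stability like any other candidate mutation.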

Next Steps

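Solubility prediction is a fast, sequence-only screen that should be your first check in any protein engineering workflow. Combine it with structure prediction (ESMFold) and stability prediction (ThermoMPNN), as in the three-check pipeline above, before committing to wet lab work.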

Get a free SciRouter API key and check whether your proteins will express before you spend a week cloning them. 500 free credits per month, no credit card needed.

Frequently Asked Questions

What does SoluProt predict exactly?

SoluProt predicts the probability that a recombinant protein will be soluble when expressed in E. coli. The output is a score between 0 and 1, where higher values indicate greater predicted solubility. A score above 0.5 indicates the protein is more likely to be soluble than insoluble. The model was trained on experimental solubility data from the TargetTrack and PSI:Biology databases.

How accurate is SoluProt?

On the independent test set reported in the original SoluProt publication, the model reached an accuracy of about 58.5% and an AUC (area under the ROC curve) of about 0.62, matching or exceeding other sequence-based solubility predictors such as SOLpro and PROSO II on the same benchmark. That is better than chance but far from perfect, which is why the score works best for ranking candidates rather than as a guarantee. Accuracy is expected to decrease for membrane proteins, intrinsically disordered proteins, and very large multi-domain proteins.

Can I predict solubility for expression hosts other than E. coli?

SoluProt was trained primarily on E. coli expression data, so its predictions are most reliable for bacterial expression. For mammalian cell expression (CHO, HEK293), the predictions are less calibrated but can still be informative as a relative ranking tool. Proteins that score very low (below 0.3) in SoluProt may still have expression issues in other hosts, so treat a host switch as something to test rather than a guaranteed fix.

What causes recombinant proteins to aggregate?

Protein aggregation during recombinant expression is caused by several factors: exposed hydrophobic patches that promote intermolecular association, slow or incomplete folding that leaves aggregation-prone intermediates, high local protein concentration during overexpression, missing post-translational modifications (in bacterial hosts), missing binding partners or cofactors, and extreme isoelectric points that reduce solubility at cellular pH.

How can I improve a protein with a low solubility score?

Several strategies can improve solubility: add solubility-enhancing fusion tags (MBP, SUMO, Trx), replace surface-exposed hydrophobic residues with charged or polar amino acids, truncate disordered termini or flexible loops, engineer disulfide bonds for structural stability, codon-optimize for your expression host, lower expression temperature to 16–20 degrees Celsius, and co-express chaperones (GroEL/ES, DnaK). Computationally, you can screen mutations with SoluProt to find variants with improved solubility scores.

Should I trust solubility or stability predictions more?

Neither should be trusted in isolation. Stability (DDG from ThermoMPNN) and solubility (SoluProt score) measure different properties. A protein can be thermostable but insoluble (it folds tightly but aggregates), or soluble but thermolabile (it dissolves well but denatures easily). The best candidates score well on both metrics. Use both predictions together and always validate experimentally.
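
One simple way to use both predictions together is to keep only the candidates that no other candidate beats on both metrics at once (the Pareto front). This is a sketch with hypothetical field names; it assumes higher solubility is better and lower (more negative) DDG is better.

```python
def pareto_front(candidates):
    """Return candidates not dominated on (solubility, ddg).

    A candidate is dominated if another one is at least as good on both
    metrics and strictly better on at least one. Higher solubility and
    lower DDG are treated as better (an assumed sign convention).
    """
    front = []
    for c in candidates:
        dominated = any(
            o["solubility"] >= c["solubility"]
            and o["ddg"] <= c["ddg"]
            and (o["solubility"] > c["solubility"] or o["ddg"] < c["ddg"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["solubility"], reverse=True)


# Made-up example values, not real predictions
variants = [
    {"name": "A", "solubility": 0.80, "ddg": -1.0},
    {"name": "B", "solubility": 0.60, "ddg": -0.5},  # worse than A on both
    {"name": "C", "solubility": 0.90, "ddg": 0.5},
    {"name": "D", "solubility": 0.50, "ddg": -2.0},
]
print([v["name"] for v in pareto_front(variants)])
```

The variants that survive represent different trade-offs between the two properties, which is a reasonable shortlist to take into experimental validation.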
