M Protein Gene (emm) Typing

The M protein gene (emm) encodes the cell surface M virulence protein responsible for at least 100 Streptococcus pyogenes M serotypes. emm typing is based on sequence analysis of the portion of the emm gene that dictates the M serotype. The hypervariable sequence associated with M serospecificity is adjacent to an amplifying primer sequence, allowing for direct sequencing.

Assigning New Types and Subtypes to emm Sequence Data

CDC assigns new emm types or subtypes when a defined emm segment differs from known emm types based on the following:

  1. A new subtype is assigned for any change relative to the 180-base sequences previously defined as emm subtypes in the CDC emm subtype database. The 180 bp subtype-encoding sequence consists of 60 codons: the last 10 codons of the signal sequence plus the first 50 codons of the mature M protein.
  2. New emm types are assigned for more drastic changes within the first 30 codons encoding the mature M protein. A new emm type is dictated by either of the following: (a) < 92% identity to a reference emm type (e.g., types emm1.0, emm2.0, emm3.0); or (b) interruption of the reference open reading frame by > 7 codons. A penalty of 0.5% is subtracted from the overall percentage identity score for each out-of-frame codon.
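
The assignment rules above can be sketched in code. This is a minimal illustration, not CDC's implementation: the function names are hypothetical, identity is computed as a simple ungapped comparison of two pre-aligned, equal-length sequences, and it assumes the penalty-adjusted score is the one compared against the 92% threshold, which the text does not state explicitly.

```python
def percent_identity(query: str, reference: str) -> float:
    """Ungapped percent identity of two pre-aligned, equal-length sequences.

    Assumption: a position-by-position comparison stands in for whatever
    alignment method is actually used to score identity.
    """
    if not reference or len(query) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(q == r for q, r in zip(query.upper(), reference.upper()))
    return 100.0 * matches / len(reference)


def adjusted_identity(raw_identity: float, out_of_frame_codons: int) -> float:
    """Subtract 0.5 percentage points for each out-of-frame codon."""
    return raw_identity - 0.5 * out_of_frame_codons


def is_new_type(raw_identity: float, out_of_frame_codons: int) -> bool:
    """Apply the two new-type criteria to the type-determining region.

    Assumption: the 92% cutoff is applied to the penalty-adjusted score.
    """
    if out_of_frame_codons > 7:  # reading frame interrupted by > 7 codons
        return True
    return adjusted_identity(raw_identity, out_of_frame_codons) < 92.0


def is_new_subtype(segment: str, known_segments: set) -> bool:
    """Any change relative to every known 180-base entry is a new subtype."""
    return segment.upper() not in {s.upper() for s in known_segments}
```

For example, a region with 95% raw identity and 4 out-of-frame codons scores 93% after the penalty and stays within the reference type under these assumptions, whereas 8 out-of-frame codons would indicate a new type regardless of identity.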

When assigning emm types and subtypes, it is very important to use the sequence immediately adjacent to primer 1. These sequences are usually linked to M protein genes rather than the similar mrp or enn genes. For this reason, using alternative primers to obtain the emm sequence is discouraged (see emm typing protocol). CDC uses a de novo assembly approach for whole-genome sequence-based emm subtype determination. This helps prevent the similar emm-like mrp and enn sequences from confounding the 180-base M protein gene segment (see CDC Streptococcus Laboratory GAS bioinformatic pipeline for S. pyogenes).

Instructions for Assigning Types and Subtypes for a GAS strain

Each trimmed 180-base entry in this database corresponds to the first 50 residues of the mature M protein and the adjacent 10 C-terminal residues of the signal sequence.
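
The layout of a database entry follows from simple codon arithmetic: 10 signal-sequence codons plus 50 mature-protein codons give 60 codons, or 180 bases. A small sketch of that layout, with hypothetical constant and function names:

```python
CODON_LENGTH = 3      # bases per codon
SIGNAL_CODONS = 10    # C-terminal end of the signal sequence
MATURE_CODONS = 50    # start of the mature M protein
SEGMENT_BASES = (SIGNAL_CODONS + MATURE_CODONS) * CODON_LENGTH  # 180


def split_subtype_segment(segment: str):
    """Split a trimmed 180-base entry into its signal and mature parts."""
    if len(segment) != SEGMENT_BASES:
        raise ValueError(f"expected {SEGMENT_BASES} bases, got {len(segment)}")
    signal_end = SIGNAL_CODONS * CODON_LENGTH  # first 30 bases
    return segment[:signal_end], segment[signal_end:]
```

Calling `split_subtype_segment` on a 180-base entry returns a 30-base signal portion followed by a 150-base mature portion.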

  1. Use at least the first 240 bases of sequence (edited for accuracy) obtained with primer 1 or emmseq2 (see emm typing protocol page) to query the type-specific DNA sequence database. Most queries return an exact 180/180 match to a database entry, and that entry can be assigned to the sequence (e.g., a match to emm4.4 means type emm4, subtype emm4.4). The subtype has been correctly identified when the type-specific BLAST returns a perfect 180/180 match.
  2. If you do not obtain a subtype assignment in step 1, please submit the untrimmed sequence. After verification, CDC will assign the sequence a new type and/or subtype and add it to the emm sequence database. Strain, epidemiologic, and clinical information will be included if provided and the database will acknowledge you and your institution for the contribution.
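
The exact-match step can be illustrated without BLAST. The sketch below is an assumption-laden stand-in: it uses a plain substring search rather than the type-specific BLAST, the function names are hypothetical, and the database is represented as a dictionary mapping subtype names to 180-base strings.

```python
from typing import Dict, Optional

MIN_QUERY_BASES = 240   # minimum query length given in step 1
SEGMENT_BASES = 180     # length of each trimmed database entry


def find_exact_subtype(query: str, database: Dict[str, str]) -> Optional[str]:
    """Return the name of the entry found intact (180/180) in the query.

    Returns None when no entry matches exactly, which corresponds to the
    case in step 2 where the untrimmed sequence should be submitted.
    """
    if len(query) < MIN_QUERY_BASES:
        raise ValueError(f"query must be at least {MIN_QUERY_BASES} bases")
    query = query.upper()
    for name, segment in database.items():
        if len(segment) == SEGMENT_BASES and segment.upper() in query:
            return name
    return None


def emm_type_of(subtype: str) -> str:
    """Derive the emm type from a subtype name, e.g. 'emm4.4' -> 'emm4'."""
    return subtype.split(".", 1)[0]
```

A `None` result only means that no entry in the supplied dictionary matched exactly; it does not by itself establish that the sequence is a new type or subtype.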
Page last reviewed: July 23, 2021