
Assigning emm Types and Subtypes

New parameters for assigning types and subtypes to emm sequence data:

It has become increasingly clear that the previous type definition for emm genes of S. pyogenes (and emm genes from S. dysgalactiae subsp. equisimilis) is not optimal. That definition was based upon > 95% sequence identity over the first 160 bases of sequence obtained with primer 1 or emmseq2, allowing for one interruption of the reading frame of no more than 7 codons. Although this formula usually works, it relies upon an undefined starting point for all readable sequences, so a segment of DNA with defined boundaries must be used instead. Since the sequence encoding the N terminus is widely believed to determine the M serotype, it appears most logical to base the emm type upon a defined segment of this sequence. On average, about 72 of the 160 bases (24 codons) used for the previous type definition encoded a portion of the relatively conserved signal peptide, leaving roughly 88 bases of mature M protein sequence. Thus, a type definition based upon the 90 bases encoding the N terminal 30 residues of the processed M protein would be predicted to be most consistent with the previous typing scheme. Although type definition will rely upon these 90 bases, subtypes will continue to be assigned according to exact 150 base sequences encoding the N terminal 50 residues of the mature M protein.

New types will now be identified by the curator of this site on the basis of sharing less than 92% sequence identity over the first 90 bases encoding the deduced processed M protein of the emm type reference strain. The comparison uses the SSEARCH program in the Wisconsin Package version 10.3, with bases 1-90 of the emm reference strain sequences (identified as subtype 0, e.g., emm1.0) compared to the full length 150 base subtype-determining region of the query sequence. As before, a single interruption of the reference sequence reading frame (through frame shift, in-frame deletion, or insertion) by no more than 7 codons is tolerated and is not counted as mismatches. However, for each codon involved in such an interruption, a penalty of 0.5% is subtracted from the overall % identity score.
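
The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the SSEARCH procedure itself: the function names are hypothetical, the match counts are assumed to come from an alignment produced elsewhere, and treating a query as a new type only when it scores below 92% against every reference strain is an assumption about how the threshold is applied.

```python
NEW_TYPE_THRESHOLD = 92.0     # percent identity below which a new type is declared
PENALTY_PER_CODON = 0.5       # percent subtracted per codon in the interruption
MAX_INTERRUPTION_CODONS = 7   # longest reading-frame interruption tolerated


def adjusted_identity(matches, compared_bases, interruption_codons=0):
    """Percent identity over the compared bases, minus the interruption penalty.

    `compared_bases` excludes the interrupted codons, which are not counted
    as mismatches; they are penalized separately at 0.5% per codon.
    """
    if compared_bases <= 0:
        raise ValueError("compared_bases must be positive")
    if not 0 <= matches <= compared_bases:
        raise ValueError("matches must lie between 0 and compared_bases")
    if not 0 <= interruption_codons <= MAX_INTERRUPTION_CODONS:
        raise ValueError("an interruption of more than 7 codons is not tolerated")
    raw = 100.0 * matches / compared_bases
    return raw - PENALTY_PER_CODON * interruption_codons


def is_new_type(scores_vs_references):
    """True if the best score against all reference strains is below 92%.

    Assumption: the query is compared to every type reference strain and
    only the highest adjusted identity decides the outcome.
    """
    scores = list(scores_vs_references)
    if not scores:
        raise ValueError("at least one reference score is required")
    return max(scores) < NEW_TYPE_THRESHOLD
```

For example, 81 matching bases out of 90 with a 2 codon interruption gives 90% raw identity and 89% after the penalty, which falls below the 92% threshold.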

Instructions for assigning known types and subtypes:

  1. A. Use at least the first 220 bases of sequence (edited for accuracy) obtained with primer 1 or emmseq2 to query the type-specific DNA sequence database. For the majority of queries, one will obtain an exact 180/180 match to a 180 base entry, whose designation can be assigned to the sequence (e.g., emm4.4 is equivalent to type emm4, subtype emm4.4).
    B. This database of trimmed 180 base entries corresponds to the first 50 residues of the mature M protein and the adjacent 10 C terminal residues of the signal sequence. If a perfect 180/180 match is obtained to an entry from the type-specific BLAST option, the subtype has been correctly identified and no additional steps are required. A perfect match to bases 31-180 combined with 3 or fewer mismatches in bases 1-30 also identifies the indicated subtype.
  2. If you do not obtain a subtype assignment in steps 1A and 1B, please submit the sequence trace to me, and after verification I will assign it as a new type and/or subtype and add it to the CDC emm sequence database. Any new subtypes are assigned a new subtype designation. New types are assigned as described above through comparisons to the emm type reference strains. Within the downloadable sequence file I will add whatever strain, epidemiologic, and clinical information you care to share, and acknowledge you and your institution for the contribution of the sequence and information.
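
As a rough illustration of the matching rule in step 1, the check on a 180 base window can be written as follows. This is a minimal sketch under stated assumptions: the query is assumed to be already trimmed and aligned, without gaps, to the 180 base entry (the database search itself is not shown), the function names are hypothetical, and `demo_entry` is a placeholder string, not a real emm sequence.

```python
from typing import Dict, Optional

ENTRY_LENGTH = 180          # trimmed database entry length
SIGNAL_BASES = 30           # bases 1-30: last 10 residues of the signal sequence
MAX_SIGNAL_MISMATCHES = 3   # mismatches tolerated in bases 1-30


def signal_mismatches(query: str, entry: str) -> Optional[int]:
    """Count mismatches in bases 1-30, or return None if bases 31-180 differ."""
    q, e = query.upper(), entry.upper()
    if len(q) != ENTRY_LENGTH or len(e) != ENTRY_LENGTH:
        raise ValueError("both sequences must be exactly 180 bases")
    if q[SIGNAL_BASES:] != e[SIGNAL_BASES:]:
        return None  # the 150 base subtype-determining region must match exactly
    return sum(1 for a, b in zip(q[:SIGNAL_BASES], e[:SIGNAL_BASES]) if a != b)


def assign_subtype(query: str, database: Dict[str, str]) -> Optional[str]:
    """Return the entry name with the fewest signal-region mismatches (at most 3).

    Returns None when no entry qualifies, i.e. the sequence should be
    submitted to the curator as a possible new type or subtype.
    """
    best_name, best_mm = None, MAX_SIGNAL_MISMATCHES + 1
    for name, entry in database.items():
        mm = signal_mismatches(query, entry)
        if mm is not None and mm < best_mm:
            best_name, best_mm = name, mm
    return best_name


# Placeholder data for illustration only (not a real emm sequence or subtype name).
demo_entry = "ACGT" * 45
demo_db = {"demo1.0": demo_entry}
```

A query that matches an entry exactly returns that entry's name, as does one with up to 3 differences confined to bases 1-30; any difference in bases 31-180 yields `None`.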

Any of the following information (among other details) is also greatly appreciated if you care to share it:

  • Your name and institution.
  • Isolate designation.
  • Country where isolated.
  • Year isolated.
  • Group carbohydrate (A, C, G, etc.).
  • Specimen (skin lesion, blood, throat, etc.).
  • Clinical manifestation (if any).
  • Multilocus sequence type.
  • sof positive or negative.
  • Opacity factor positive or negative.
  • Bacitracin sensitivity.
  • T antigen type.
  • spe gene profile.
  • Other virulence determinants.
  • Antibiotic resistance phenotypes/genotypes.
  • GenBank designation (if you have it).

Bernard Beall, Ph.D.
CDC Respiratory Diseases Branch
1600 Clifton Rd., NE, MS -C02
Atlanta, GA 30333
