Flu Treemap

David Wells Home page

How is the flu sequence data distributed across subtypes, hosts, and countries in the NCBI database?

Host plot

The NCBI flu database is dominated by human, avian, and swine sequences, reflecting the concern about avian flu and swine flu. Most of these sequences come from the USA, by a wide margin. However, many avian sequences are from China, and a large proportion of equine sequences are from the UK.

In [6]:
# Data summary
dts <- summarise(group_by(data, host, country), n=n(), .groups="drop_last")

treemap(dts, index=c("host", "country"), vSize="n", type="index", bg.labels=c("transparent"),
        align.labels=list(c("center", "top"), c("center", "center")))
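
The host-by-country summary that feeds the treemap can also be checked numerically. The following is a minimal pandas sketch, assuming a table with `host` and `country` columns like the one used above; the `toy` records are invented for illustration and are not the real NCBI counts.

```python
import pandas as pd

# Toy records standing in for the real metadata (values are made up).
toy = pd.DataFrame({
    "host": ["Human", "Human", "Human", "Avian", "Avian", "Equine"],
    "country": ["USA", "USA", "China", "China", "China", "United Kingdom"],
})

# Count sequences per (host, country), mirroring summarise(group_by(...), n=n()).
counts = toy.groupby(["host", "country"]).size().reset_index(name="n")

# Share of each host's sequences contributed by each country.
counts["share"] = counts["n"] / counts.groupby("host")["n"].transform("sum")

# For each host, keep the row of the country with the most sequences.
top_country = counts.loc[counts.groupby("host")["n"].idxmax()].set_index("host")
print(top_country[["country", "n", "share"]])
```

Run on the real table, `top_country` would give the leading country per host and how dominant it is, which is what the treemap shows as the largest rectangle inside each host block.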

Subtype plot

The majority of influenza sequences on NCBI are H1N1, H3N2, and flu B, the three strains circulating in humans, all of which are monitored by the WHO and targeted by the flu vaccine.

In [5]:
# Data summary
dts <- summarise(group_by(data, ha, na), n=n(), .groups="drop_last")

treemap(dts, index=c("ha", "na"), vSize="n", type="index", palette="Set2", bg.labels=c("transparent"),
        align.labels=list(c("center", "top"), c("center", "center")), border.col=c("black","white"))

Host and subtype

Hardly any human sequences are anything other than H1N1, H3N2, or flu B. The H7N9 and H5N1 sequences from humans are cases of bird flu, which receive a lot of scientific attention because of their pandemic risk.

Most host groups displayed are dominated by a small number of HA-NA pairings, except avian, which shows a huge diversity of HAs, NAs, and combinations.

In [4]:
# Data summary
dts <- summarise(group_by(data, host, ha, na), n=n(), .groups="drop_last")

treemap(dts, index=c("host", "ha", "na"), vSize="n", type="index", palette="Set3",
        bg.labels=c("transparent"),
        border.col=c("black","white","black"), border.lwds=c(3:1),
        align.labels=list(c("left", "top"),c("center", "top"), c("center", "center"))
       )
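
The diversity of pairings per host can be put into numbers by counting the distinct HA-NA combinations for each host. This is a minimal pandas sketch, assuming `host`, `ha`, and `na` columns as in the prepared data; the `toy` rows are invented and only illustrate the calculation.

```python
import pandas as pd

# Toy records (made up) with the same columns as the prepared metadata.
toy = pd.DataFrame({
    "host": ["Human", "Human", "Human", "Avian", "Avian", "Avian", "Swine"],
    "ha":   ["H1",    "H3",    "H1",    "H5",    "H7",    "H9",    "H1"],
    "na":   ["N1",    "N2",    "N1",    "N1",    "N9",    "N2",    "N1"],
})

# Keep one row per distinct (host, ha, na) combination, then count per host.
pairings = toy.drop_duplicates(["host", "ha", "na"]).groupby("host").size()
print(pairings.sort_values(ascending=False))
```

A host with many distinct pairings, as avian is described here, would appear at the top of this ranking.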

Prep data

In [7]:
# Load the required packages
packages <- c("treemap", "dplyr")
sapply(packages, require, character.only=TRUE)

data <- read.csv("../data/metadata_flu_A_B.csv")
treemap
TRUE
dplyr
TRUE
In [ ]:
# Python

import pandas as pd

# Load data
colnames = ["accession","host","segment","subtype","country","date","length","strain","age","gender","completeness"]
metadata = pd.read_csv("rawdata/influenza_na.dat", sep="\t", low_memory=False, names=colnames)

metadata.date = pd.to_datetime(metadata.date, errors="coerce")

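# Extract only flu B (na=False skips rows with a missing strain name;
# .copy() avoids assigning columns on a view of metadata)
flub_metadata = metadata[metadata.strain.str.contains("Influenza B virus", na=False)].copy()
flub_metadata['subtype'] = "Flu B"
flub_metadata['ha'] = "Flu B"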

# Extract only flu A: keep rows whose subtype names both an H and an N
flua_metadata = metadata.dropna(subset=["subtype"])
flua_metadata = flua_metadata[flua_metadata.subtype.str.contains('H')]
flua_metadata = flua_metadata[flua_metadata.subtype.str.contains('N')].copy()

# Create columns of HA and NA subtypes
flua_metadata["ha"] = flua_metadata.subtype.apply(lambda x: x[x.find("H"):x.find("N")])
flua_metadata["na"] = flua_metadata.subtype.apply(lambda x: x[x.find("N"):])


#Limit to real subtypes
NAs = ["N1","N2","N3","N4","N5","N6","N7","N8","N9","N10","N11"]
HAs = ["H1","H2","H3","H4","H5","H6","H7","H8","H9","H10","H11","H12","H13","H14","H15","H16","H17","H18"]
flua_metadata = flua_metadata[(flua_metadata.ha.isin(HAs)) & (flua_metadata.na.isin(NAs))]

flua_metadata

# Concat flu A and flu B
metadata = pd.concat([flua_metadata, flub_metadata], sort=False)

metadata.to_csv("metadata_flu_A_B.csv", index=False)
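
To see what the HA/NA slicing in the cell above does to individual subtype strings, here is a small standalone sketch. The helper names `split_subtype` and `is_real_subtype` and the example strings are hypothetical, chosen only to show why the "real subtypes" filter is needed.

```python
# Same whitelists as in the preparation cell above.
NAs = ["N" + str(i) for i in range(1, 12)]   # N1 .. N11
HAs = ["H" + str(i) for i in range(1, 19)]   # H1 .. H18

def split_subtype(subtype):
    """Split e.g. 'H5N1' into ('H5', 'N1') with the same slicing as above."""
    ha = subtype[subtype.find("H"):subtype.find("N")]
    na = subtype[subtype.find("N"):]
    return ha, na

def is_real_subtype(subtype):
    """True when both parts are on the whitelists (hypothetical helper)."""
    ha, na = split_subtype(subtype)
    return ha in HAs and na in NAs

# Example strings are illustrative, not taken from the data file.
for s in ["H5N1", "H10N8", "H1N1pdm"]:
    print(s, split_subtype(s), is_real_subtype(s))
```

A string with a trailing annotation, such as the last example, yields an NA part that is not on the whitelist, so the corresponding row would be dropped by the `isin` filter.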