Methodology, Data and Definitions of the WHO Early AI-supported Response with Social Listening Platform

The WHO Early AI-supported Response with Social Listening Platform shows real-time information about how people are talking about COVID-19 online, so that we can better manage the infodemic and the pandemic as they evolve. This project is a pilot covering 30 countries, with plans to expand in the future.

Here you can read more about how the data is collected, how it is processed, what to consider when using it, and the definitions of the terms used.

The platform is powered by Citibeats, a text analytics platform specialized in social understanding. More information can be found at

Data collection

Data sources

Data is collected daily from online conversations in publicly available sources, including Twitter, Facebook public pages, online forums, news comments, and blogs. Collection covers 30 pilot countries in nine languages: English, French, Spanish, Portuguese, German, Italian, Thai, Bahasa Indonesia and Arabic.

Other languages and countries will be considered in the next phases of the initiative.

We are continuously adding data sources. If you have a suggestion of a data source to add to the initiative, please let us know here.

Normalization and sampling

Opinion data from publicly available sources requires normalization, sampling and cleaning to make it usable, and even then, we must be aware of its limitations.

Since countries differ in population size, in levels of internet access and in how much people share opinions online, we need to make them comparable. We normalize the data by ensuring that whenever countries are compared, it is always by the relative proportion of the captured conversation within each country.
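The normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: the function name `normalize_by_country`, the input shape and the sample counts are all hypothetical.

```python
from collections import defaultdict

def normalize_by_country(counts):
    """Convert raw opinion counts into each category's share of its country's conversation.

    counts: dict mapping (country, category) -> number of opinions (hypothetical shape).
    Returns a dict mapping (country, category) -> percentage of that country's total.
    """
    totals = defaultdict(int)
    for (country, _category), n in counts.items():
        totals[country] += n
    return {
        (country, category): 100.0 * n / totals[country]
        for (country, category), n in counts.items()
        if totals[country] > 0  # skip countries with no captured opinions
    }

# Made-up counts: Country A produces ten times the volume of Country B.
raw = {
    ("A", "Vaccines"): 6000, ("A", "Testing"): 4000,
    ("B", "Vaccines"): 300, ("B", "Testing"): 700,
}
shares = normalize_by_country(raw)
print(shares[("A", "Vaccines")], shares[("B", "Vaccines")])
```

Although Country A has far more opinions in absolute terms, the comparison is made on shares (60% versus 30% for the made-up ‘Vaccines’ counts), so volume differences between countries do not dominate.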

Sampling is primarily used to control the amount of data we process (rather than for comparability purposes, which are covered by normalization). Sampling is determined by the data ‘query’ we use to define which opinions are collected from the data sources. In this case, the query contains broad COVID-19 keywords, adapted for every language. This means that if someone shares an opinion that is implicitly related to COVID-19 but does not explicitly mention COVID-19 (or a closely related keyword), it will not be included in our sample. Any significant changes to sampling will be recorded in the change log, and users of the API will be automatically notified.
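To make the effect of a keyword query concrete, here is a minimal sketch. The keyword lists and the function name `matches_query` are hypothetical; the platform's actual query terms and matching logic are not described in this document, and simple substring matching is assumed here.

```python
# Hypothetical per-language keyword lists (assumption: not the platform's real query).
QUERY_KEYWORDS = {
    "en": ["covid", "coronavirus", "sars-cov-2"],
    "es": ["covid", "coronavirus", "cuarentena"],
}

def matches_query(text, language):
    """Return True if the text explicitly contains a keyword for its language."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in QUERY_KEYWORDS.get(language, []))

explicit = "Is the COVID-19 vaccine safe?"
implicit = "I have lost my sense of smell this week"

print(matches_query(explicit, "en"))  # explicit mention: collected
print(matches_query(implicit, "en"))  # implicit reference only: not collected
```

The second opinion may well be about COVID-19, but because it contains no query keyword it falls outside the sample, which is the limitation noted above.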


The data is representative of populations that use online sources to share opinions about COVID-19. It is difficult to give precise representativity information, since opinions do not contain demographic information (which can only be inferred) and utilization of data sources per demographic varies significantly by country. It should be noted that, as a global summary of our data, women are under-represented (this information is disaggregated and differences made visible in Gender Gap), as are elderly populations and low-income populations.

This social listening platform is intended as a sensing tool and early warning system, but this representativity limitation must be kept in mind.

Country level statistics about internet penetration, platform use and demographics can be found at:


All data is presented anonymously and in aggregate. Anonymous means that no names of the authors of opinions are shared. Aggregated means that we show summary statistics rather than individual opinions, so no raw text is shared publicly. Moreover, it should be noted that this opinion data comes from publicly available sources and not from sensitive private data.

Geographic attribution of data

Country attribution of the data depends on the data source.

In some cases (such as Twitter), this is self-reported in the profile information of users. In others, this is inferred from country level top-level domains (e.g. ‘’ for the UK), or from local references mentioned in the text. This should be noted as a limitation of the precision of the data.

Data processing


Once the data is collected, it is categorized, or classified, into one of the defined categories.

The categories have been defined as topics of interest by health information experts, as well as through a bottom-up analysis of the data. Categories may be adjusted, added or removed during the initiative; any such changes will be announced via the API portal and GitHub.

Data is categorized automatically, with human quality controls. This is achieved through semi-supervised machine learning. This means that from initial human-inputted examples defining a category, the system learns and infers which opinions belong to that category. Regular human review ensures quality control.

The categorization system learns and infers in each local context (in this case, in each country), to adapt to terminology and references made in each country, accounting for differences in language use and social context.

Gender gap

Female and Male data is estimated using aggregated and anonymized profiling (e.g. from names, bios). More can be read about Citibeats estimated gender disaggregation here.

Intent detection

It is possible to filter results by ‘intent’. Intent refers to opinions shared with a particular purpose - in this case, we are monitoring ‘questions’ and ‘complaints’. Intents are automatically detected by the Citibeats system, based on machine-learning models and fine-tuned to the context of COVID-19.

How should this data be used?

Intended use

The WHO Early AI-supported Response with Social Listening Platform has been designed with health information professionals in mind, who need regular (typically weekly) snapshots of the public conversation.

Change log

Please be aware that depending on how the conversation evolves, category definitions may be changed, or new categories added (thereby changing relative proportions of the conversation), for the analysis to stay relevant.

Please note that such changes break the consistency of the analysis over time. For example, if we started with 40 categories in month 1 and added 2 new categories in month 3, then, because we work with proportions of the conversation, the month 3 proportion for a category would not be entirely comparable with its month 1 proportion, even if that category has been present during all months. Any such changes will be documented in the API portal (with registered users notified) and on GitHub.
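The arithmetic behind this caveat can be illustrated with a small sketch. The category names and counts are made up, and `proportions` is a hypothetical helper; the point is only that a category's share can change when a new category is added, even if its own count does not.

```python
def proportions(counts):
    """Return each category's percentage share of the total (hypothetical helper)."""
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}

# Made-up counts for month 1 with three categories.
month_1 = {"Vaccines": 500, "Testing": 300, "Travel": 200}

# Month 3: identical counts for the existing categories, plus one new category.
month_3 = dict(month_1)
month_3["Variants"] = 250

print(proportions(month_1)["Vaccines"])  # share before the new category
print(proportions(month_3)["Vaccines"])  # share after, with an unchanged count
```

Here the ‘Vaccines’ share falls from 50% to 40% purely because the denominator grew, not because people talked about vaccines any less.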

Frequency of data updates

Data will be updated daily, at 6:00 am UTC.

Differences between opinions data and objective health data

Since the WHO Early AI-supported Response with Social Listening Platform is intended for use by health information professionals, it is important to reflect on the differences of this type of data compared to typical data types analyzed by the community.

Most importantly, opinions data are just that - subjective opinions. If the top category for a given country in the platform is ‘Category X’, it does not necessarily mean that ‘Category X’ is the topic that health professionals should consider the top priority. Category X may be the most mentioned by people, but not necessarily be the most important to them; furthermore, if Category X is the most important in the minds of the general public, it may not necessarily be the most important to the public health community. These are signals that information professionals should use within the context and their knowledge of the current situation.

Differences between public big data and surveys

Whereas in a survey the questions asked and answers collected are generally structured, people’s opinions drawn from public big data are unstructured. The benefit of analyzing social big data is that it is real-time and has broad geographic and topic coverage. The trade-off should be kept in mind: this approach is suited to being a ‘sensing’ or ‘early warning’ system, rather than a precise measurement tool.

Level of depth of information

The WHO Early AI-supported Response with Social Listening Platform is intended as a straightforward resource for health information professionals. For deeper analysis of public big data for your country, you may consider setting up your own social listening platform.


Category definitions

The cause

How did the virus emerge and how is it spreading?

The cause of the virus

Narratives about the origin of SARS-CoV-2.

Stigma about the spread

Stigma toward people who are thought to be spreading the virus: racist expressions, attribution to poor people or immigrants.

Stigma about or by infected people

Stigma expressed about or by people who are or have been infected.

The illness

What are the symptoms and how is it transmitted?

Confirmed symptoms

Confirmed symptoms as defined by WHO, excluding longer-term symptoms.

Other discussed symptoms

Other discussed symptoms that have not yet been confirmed by WHO.

Prolonged symptoms

Reports on long covid that may or may not be confirmed by WHO.

Modes of transmission

Modes of transmission confirmed and unconfirmed by WHO. This includes discussion of asymptomatic and pre-symptomatic transmission as well as possible ways the virus can be transmitted (for example, aerosols and fomites).

Transmission settings

Narratives about settings where transmission can be amplified: closed and semi-closed settings.


General conversations on re-infection, confusion over immunity after infection or the possibility of being infected more than once.

COVID-19 variants

Narratives and concerns about the development, spread and impact of new COVID-19 variants.

Demographic vulnerability & risks

Vulnerable and risk groups:

- elderly

- individuals with health conditions like lung or heart disease, diabetes or conditions that affect their immune system

- pregnant women

Impact on mental health

Anxiety, depression and other conditions arising from the pandemic situation

The treatment

How can it be treated or cured?

Current treatment*

Medical treatment as per WHO treatment recommendations

COVID-19 vaccine

Narratives about the vaccine itself: efficacy, side effects, safety, etc.

Health care workers (HCW) and vaccine

Narratives by and about health care workers and vaccine

General vaccine discussion

Narratives about vaccines in general, including discussion about others or communities that have different opinions about vaccines; can include any vaccine concerns, not just COVID-19

Science and R&D

Comments on new treatment and vaccines from research and development and evidence and scientific processes

Non proven treatments

Discussion about treatments that are not proven to be effective (examples: sunlight, nutrition, herbal remedies, etc)


Specific myths that WHO and partners have reacted to or taken steps to debunk

The interventions

What is being done by government and health authorities and societal institutions?


Any discussion about tests – everything from reliability, to access to tests, types of tests, requirement to have tests, etc.

Contact tracing

Any discussion about the process, requirements and steps involved in contact tracing, use of technology

Supportive care

Care given to patients in hospitals by medical personnel

Vaccine distribution and policies on access

Narratives about distribution, equity, access to COVID-19 vaccine

Personal measures

Individual protection measures recommended by governments/WHO such as wearing masks, handwashing, social distancing, isolation when ill...

Measures in public settings

Measures implemented by governments in public settings: schools, workplaces, public transport...

Travel measures

Measures implemented or suggested by governments/WHO/population/private companies on travel: negative PCR or negative rapid test to enter a country, mandatory quarantine

Immunity pass

Vaccine certificates, immunity / health passports, digital and hard copy, including implications for access to businesses, schools, and other services.

Reduction of movement

Measures implemented by governments related to movement reduction: lock-down at home, territory lock-down, etc.

Protection: medical equipment

Equipment for health workers: PPE advances and accessibility for public.

Health Technology

Health technology used to treat patients: medicines, medical devices, vaccines, procedures and systems

Digital health technology

Discussions about digital technology used to respond to the pandemic: electronic data exchange, electronic notices of passenger lists to health authorities, biometric data coming from wearables, proximity apps (App Covid). Includes people’s attitudes to data privacy and to the use of data for modelling and predictive analytics.

Pandemic Fatigue

Fatigue from interventions (lock-down, movement restrictions, masks...)


Narratives about faith and religion and COVID-19 (these narratives are recurring, usually around the time of religious holidays and outbreaks in faith based settings)


Narratives about industry, unions and COVID-19


Narratives about the environment and COVID-19 – some examples: shedding in the environment, waste water, air pollution as a secondary byproduct of lockdowns

Inequalities & Human Rights

Narratives about social inequalities and relation to COVID-19

Civil Unrest

Narratives about civil unrest and COVID-19


Narratives about youth, effects of pandemic on them, or actions youth is taking

Type of information

What types of information are most engaging?

Statistics & data

Conversations about facts, official statistics and data

Mis- and Disinformation

Conversations about mis- and disinformation

Sources & influencers

Conversations about where people look for information


An ‘opinion’ is considered to be a unique contribution. We are not including social interactions (e.g. retweets, likes, shares) in our analysis.

‘Top category’

‘Top category’ shows which category contains the most opinions, compared to other categories in that country. Values are proportions (%) of the conversation, where all the categories sum to 100% for each country.


Shows in which country the selected category is a ‘rising priority’. ‘Priority’ refers to the proportion of the conversation for that category, compared to the other categories in that country. ‘Rising’ means the change in priority, comparing the last 7 days with the 7 days prior to that.

It is important to note that ‘rising’ here is relative to the other categories in that country. For example, if Country A doubled the number of opinions in each and every category from Week 1 to Week 2, ‘rising’ would not show any increase.

This definition of ‘rising’ is used to enable comparability between countries. If you are interested in the absolute (rather than relative) rising, this is viewable on the Country Report page under ‘Trends’.
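The relative definition of ‘rising’ can be sketched as follows. This is an illustration only: the document does not state how the change is measured, so a difference in percentage points is assumed here, and the function names and weekly counts are hypothetical.

```python
def share(counts, category):
    """Percentage share of one category within a set of category counts."""
    total = sum(counts.values())
    return 100.0 * counts.get(category, 0) / total if total else 0.0

def rising(prev_7_days, last_7_days, category):
    """Change in a category's share, in percentage points (assumed measure)."""
    return share(last_7_days, category) - share(prev_7_days, category)

# Made-up counts for Country A.
week_1 = {"Vaccines": 100, "Testing": 300}
week_2_doubled = {c: 2 * n for c, n in week_1.items()}  # every category doubles
week_2_shifted = {"Vaccines": 200, "Testing": 300}      # only one category grows

print(rising(week_1, week_2_doubled, "Vaccines"))  # no relative change
print(rising(week_1, week_2_shifted, "Vaccines"))  # share increases
```

When every category doubles, the ‘Vaccines’ share stays at 25% and the relative measure shows no rise; when only ‘Vaccines’ grows, its share moves from 25% to 40%.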

‘Gender gap’

Shows which categories are talked about more by women than men (brown), and more by men than women (blue), as a proportion of the conversation of that gender.

Values are the difference between female and male proportions (%) of the conversation per category. So all female category %s sum to 100%, all male category %s sum to 100%, and we show the difference between these numbers.
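The calculation described above can be sketched as follows. The function name `gender_gap`, the input shape and the sample counts are hypothetical and are used only to show the arithmetic.

```python
def gender_gap(female_counts, male_counts):
    """Female share minus male share per category, in percentage points.

    Each input maps category -> estimated number of opinions (hypothetical shape).
    Positive values mean the category takes up more of the female conversation.
    """
    female_total = sum(female_counts.values())
    male_total = sum(male_counts.values())
    if female_total == 0 or male_total == 0:
        raise ValueError("both genders need at least one opinion")
    categories = set(female_counts) | set(male_counts)
    return {
        c: 100.0 * female_counts.get(c, 0) / female_total
           - 100.0 * male_counts.get(c, 0) / male_total
        for c in categories
    }

# Made-up counts: far fewer opinions attributed to women than to men.
female = {"Vaccines": 60, "Testing": 40}
male = {"Vaccines": 150, "Testing": 350}
gap = gender_gap(female, male)
print(gap["Vaccines"], gap["Testing"])
```

Because each gender's shares are computed against that gender's own total, the lower overall volume of opinions attributed to women does not by itself produce a gap. In this made-up case ‘Vaccines’ is 60% of the female conversation and 30% of the male conversation, a gap of +30 points.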

Female and Male data is estimated using aggregated and anonymized profiling (e.g. from names, bios). Learn more.

Citibeats recognizes that female and male are not the only genders.


Filters only the opinions which are questions, and highlights where the outlier countries are, according to the proportion of the conversation per category. Values are proportions (%) of the conversation, where all the categories sum to 100% for each country.

‘Questions’ are defined here as phrases expressed to elicit information, including expressing one's doubts about something or checking its validity or accuracy.

‘Complaints’ are defined here as statements that something is unsatisfactory or unacceptable, and which have some potential to be actionable.