How to create a representative sample of 1 billion Instagram profiles with only public data

davidjstier
9 min read · Nov 8, 2019


Overview & Objective:

This article describes the steps I took to create a statistically representative random sample of 13–17 year old Instagram users who live in the EU and the US. The goal is to calculate the minimum number of kids whose private information is shown.

There are approximately 7,000,000 Instagram users in EU countries who are between 13–17 years old. The specific characteristic that I’m measuring is the number of users who have changed their personal Instagram profile to a business account. Approximately 15% of all users have made this change to their profile.

The ‘obvious’ reason that users change their profile to a business account is that they are indeed a business. There are in fact a large number of sole proprietors and self-employed users who have made this change to their profile (if you’re an employee of a company, it’s extremely unlikely that you would change your profile to be that of your employer).

However, I discovered that a large number of kids have changed their profile to a business account, for two reasons. A business account gives them detailed statistics about which posts are read by whom, and Instagram makes it extremely easy to change your profile to a business profile: there is literally no verification required.

Instagram account information that can be used for sampling

Instagram profiles are identified by a unique alphanumeric label, the ‘profile name’ (examples of profile names include “Lucy09f”, “robmax” or “____ziag___”).

  • One can gather a list of profile names from Instagram’s publicly available website and I will discuss a number of methods for doing so below
  • To view any Instagram profile on the internet, the url is constructed with a simple rule: www.instagram.com/PROFILENAME (eg “www.instagram.com/robmax”)
  • You do not need to have an Instagram account to be able to access the data elements of any Instagram profile that is on the web and all Instagram profiles do have their own webpage, using the url convention above.
  • Instagram does not provide any structured information about a user’s age or country of residence.
  • Instagram profiles do have a numeric id # but this is not identifiable until you access the webpage for a specific profile
  • This means that one cannot create a string of random profile id #s using some range of known profile id numbers
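
The url rule above can be sketched in a few lines of Python. This is only an illustration: the function name `profile_url` is my own, and the `https://` scheme is an assumption, since the rule as stated gives only the host and the profile name.

```python
from urllib.parse import quote


def profile_url(profile_name: str) -> str:
    """Build the public web address for a profile name.

    Follows the rule www.instagram.com/PROFILENAME; the https:// prefix
    is an assumption added for this sketch.
    """
    name = profile_name.strip().lstrip("@")
    if not name:
        raise ValueError("profile name must not be empty")
    # quote() leaves letters, digits and underscores untouched and
    # percent-encodes anything that is not safe inside a url path.
    return "https://www.instagram.com/" + quote(name, safe="")


if __name__ == "__main__":
    for name in ["Lucy09f", "robmax", "____ziag___"]:
        print(profile_url(name))
```

Because a profile name is the only input, a list of names gathered by any of the methods below is enough to build the list of pages to visit.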

With the exception of accounts that are set to “Private”, one can ‘scrape’ the profile names of a particular user’s followers. This is a bit laborious and does require that you log in to Instagram.

For example, if user “robmax” has 412 followers, I can extract the user names of all those followers and from that extract, access their profile information on each user’s Instagram webpage.

The specific data elements that I extracted from a profile’s webpage for my research:

User’s biographical statement:

  • This is user generated content that may or may not be present
  • The character length is rather limited
  • Users can include emoji in their bio statement
  • I extracted the html source code for each user’s biographical statement
  • In the html source code, each emoji appears as a distinct Unicode code point such as “U+1F600”, which renders as a smiling face. For my research I ignored the emoji in the source code unless it was readable in the app or on the web
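
To make the emoji point concrete, here is a small Python sketch that lists the code points in a bio and strips symbol characters from it. The helper names are hypothetical, and filtering on the Unicode category “So” (Symbol, other) is my own simplification: it covers most emoji but not every emoji-related character, and it is not necessarily the rule I applied by hand.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the U+XXXX label of every character in the text."""
    return [f"U+{ord(ch):04X}" for ch in text]


def strip_symbols(text: str) -> str:
    """Drop characters whose Unicode category is 'So' (Symbol, other).

    Most emoji fall in this category; modifiers such as variation
    selectors do not, so this is an approximation.
    """
    kept = [ch for ch in text if unicodedata.category(ch) != "So"]
    return "".join(kept).strip()


if __name__ == "__main__":
    bio = "hi \U0001F600"  # hypothetical bio text
    print(code_points(bio))
    print(strip_symbols(bio))
```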

Profile type

The profile type is the highest level of categorizing users.

  • In html source code, the data element that identifies a profile’s type is “@type”
  • The “@type” value for every profile is contained within a specific section of the source code: line 193, which appears directly after “<script type="application/ld+json">”
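
A minimal Python sketch of reading the “@type” value, using only the standard library. It looks for the “application/ld+json” script tag rather than counting to line 193, which is a design choice of this sketch so that it does not depend on the page layout. The class and function names, and the sample page in the usage lines, are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def profile_type(html: str):
    """Return the "@type" value of the first JSON-LD block, or None."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict) and "@type" in data:
            return data["@type"]
    return None


if __name__ == "__main__":
    # Hypothetical page source, reduced to the relevant tag.
    sample = ('<html><head><script type="application/ld+json">'
              '{"@type": "Person", "name": "Example"}'
              '</script></head></html>')
    print(profile_type(sample))
```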

Business category

The second level of categorizing users is by their business category.

  • In html source code, the data element that identifies this is “business_category_name”
  • A personal profile can have only one subtype. Confusingly, this subtype for personal profiles is the “business_category_name”
  • A business profile will always have a value for “business_category_name” and this value is associated with the type of business for that profile.
  • For example, one type of a business profile is “AutomotiveBusiness”. This particular type can have a business_category_name of “Auto Dealers”
  • I am not aware of any situation where two profiles that have a different type (the first level of classification) have the same business category (the second level of classification)
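
As a sketch of how the second-level value could be pulled out of a page, the Python below searches the source for the “business_category_name” key. It assumes the key appears as an ordinary JSON key/value pair somewhere in the page source; that format, the function name and the sample strings are assumptions for illustration.

```python
import json
import re

# Matches "business_category_name": followed by null or a JSON string.
_FIELD = re.compile(r'"business_category_name"\s*:\s*(null|"(?:[^"\\]|\\.)*")')


def business_category(source: str):
    """Return the business category as a string, or None if absent or null."""
    match = _FIELD.search(source)
    if match is None:
        return None
    # json.loads turns the matched token into a Python str (or None for null)
    # and takes care of escaped characters inside the string.
    return json.loads(match.group(1))


if __name__ == "__main__":
    sample = '{"business_category_name":"Auto Dealers","is_private":false}'
    print(business_category(sample))
```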

I also extracted the following data elements in case additional segmentation would improve the prediction:

o “is_private”: This identifies if a personal account has been set to ‘Private’ mode which means that a stranger cannot view the list of followers

  • Only personal profiles can be set to ‘Private’
  • The biographical statement is still displayed on the webpage of users whose account is set to ‘Private’

o Account activity measurements:

  • # of followers (how many people follow this particular user)
  • # of posts (a value of zero is recorded if that user does not have any posts)
  • # of profiles that the user is following
  • I have this data for approximately 90% of all profiles

Additional available data:

· The number of users by age range, gender and country is available on several third-party websites

· This data has not been corroborated by Instagram but is highly cited and I’ve deemed it to be authoritative.

· Below is a sample of the information available for each country (France in this case):

  • Instagram does publicly state that it has over 1 billion active users worldwide. How Instagram determines if a user is ‘active’ is not known to me

Sampling methodology

I used three separate methods of gathering my sample group

  • ‘Explore locations’ webpages
  • ‘Directory’ page
  • Non-random identification of 13 users who appear to live in Austria or Germany

Method 1: Scrape ‘Explore locations’ webpages (sample size = 58,154 users)

Overview: The goal was to compile a sample group in which each user has an implied country of residence and whose members are distributed by country in the same proportion as Instagram’s worldwide user base. Instagram has a separate series of webpages that collect the posts users have ‘tagged’ with a unique location identifier. These locations are highly specific: there are literally millions of separate location pages, organized by country, then by city, and finally by the locations within that city. I used these location pages as a proxy for identifying residents of a particular country, on the assumption that posts tagged with a specific local restaurant were likely made by people who live nearby. (International tourist attractions such as the Eiffel Tower, by contrast, probably had few posts from people living in France.)

· For example, if you are at a particular restaurant and you post a photo on Instagram you can ‘tag’ that post with the geographic identifier for that restaurant.

· Instagram maintained publicly available webpages that show posts made by users which are ‘tagged’ with a specific location (as of a few weeks ago, the pages still exist but the links are inactive).

· No statistics are available on the percent of posts which were tagged with a geographic location

· After a news article questioned the data privacy aspect of these pages Instagram now requires that you log in to your own Instagram profile before you can view these on the web.

· Here’s a screen shot of the location page for Valencia, Spain, taken Nov 4:

  • Every location page has the same structure of 9 images at the top of the page which are the “Top Posts”, followed by a scrollable sequence of the “Most recent” images posted
  • To reiterate, an Instagram user must self-select to ‘tag’ their post with a location so the ‘most recent’ posts shown are some subset of posts made by users at that location
  • “Top Posts” do not have an ‘expiration date’ per se and could have been posted to Instagram two or more years ago.
  • The popularity of the location seems to determine the ‘age’ of any post. A remote beach in France may not have many postings so a “Top Post” from that page could have been posted several years ago
  • No other location information is available for any post or for any user unless the user themselves provided the information
  • Each location page loads with at least 12 “Most Recent” posts and I limited my sampling to the 9 “Top Posts” and 12 “Most Recent” posts on each location page.
  • Until late this summer, Instagram had provided a searchable directory of every single location page. Currently only the top level page of this directory is still available online.

· Importantly, the directory pages always load 48 results without scrolling. Virtually every directory page, regardless of level (Country, City, Location) had a “See More” option at the bottom of the list.

· For a walk-through of my randomization process for these location pages, please see the screen shots at the bottom of my other article.
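
The goal of matching the sample to each country’s share of users can be illustrated with a short Python sketch of proportional allocation. This is not a transcript of my randomization process; it only shows the arithmetic of splitting a fixed sample size across countries, using largest-remainder rounding so the quotas add up exactly. The function name and the user counts in the usage lines are hypothetical.

```python
def proportional_allocation(users_by_country: dict, sample_size: int) -> dict:
    """Split sample_size across countries in proportion to their user counts."""
    total = sum(users_by_country.values())
    if total <= 0:
        raise ValueError("total user count must be positive")
    quota, remainder = {}, {}
    for country, users in users_by_country.items():
        # Integer division gives the guaranteed share; the remainder
        # decides who receives the leftover seats.
        quota[country], remainder[country] = divmod(sample_size * users, total)
    leftover = sample_size - sum(quota.values())
    for country in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[country] += 1
    return quota


if __name__ == "__main__":
    # Hypothetical user counts, not real figures.
    counts = {"Country A": 50, "Country B": 30, "Country C": 20}
    print(proportional_allocation(counts, 10))
```

With 21 sampled posts per location page (9 “Top Posts” plus 12 “Most Recent”), a country’s quota divided by 21 gives a rough count of how many location pages would need to be visited there, before removing duplicate users.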

Method 2: Scrape 100k webpages from Instagram’s ‘directory’ (sample size = 92,980 users)

· Instagram maintains a ‘directory’ of 100,000 profiles (instagram.com/directory/profiles/) organized by numeric levels of 0–99, then 0–9, creating 1,000 unique directory pages.

· The directory is organized just like a phone book with the exception that each ‘page’ has at most 100 records (some have a few fewer than 100 but none have more than 100)

· Strangely, the profiles listed on each page do not have any identifiable connection to the ‘index’ value for that directory page. For example, the page with an index of 99–8 (literally the 8th page listed on a higher level directory whose index is ‘99’) does not list profiles whose id #s begin with 998.

· I don’t know how Instagram chooses to assign or exclude profiles from this directory.

· Regardless, the fact that Instagram makes this directory available right on their home page indicates that in some manner these should be considered a representative sample of IG users around the world.
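
The directory arithmetic can be checked with a few lines of Python. The sketch enumerates the two index levels (0–99, then 0–9) and compares the maximum possible number of records with the 92,980 users actually collected. The “99-8” label style simply mirrors the notation used above; the exact url of each directory page is not reproduced here, and the function name is my own.

```python
from itertools import product


def directory_indexes() -> list[str]:
    """Enumerate every two-level directory index, from '0-0' to '99-9'."""
    return [f"{top}-{sub}" for top, sub in product(range(100), range(10))]


if __name__ == "__main__":
    indexes = directory_indexes()
    max_records = len(indexes) * 100  # at most 100 records per page
    collected = 92_980                # sample size for this method
    print(len(indexes), "pages,", max_records, "records at most")
    print(f"collected share: {collected / max_records:.2%}")
```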

Method 3: Scrape 15k webpages of fewer than 20 Instagram users that I ‘found’ who appear to live in Austria or Germany.

· This method was the one I used to first gauge how many users in EU countries were kids under age 17 whose phone # or email was being shown in the app.

· The selection criteria for the ~20 users was that they appear to live in Germany or Austria (by presence of German language in profile biography) and that they have between 500–2,500 followers — which was a range of followers that I could manually extract from a drop-down menu on the person’s profile page.

Aggregate measurements to approximate ground truth

· The percent of users who have changed their account to a business profile is consistent across all countries analyzed:

· I’ve identified a number of kids who have changed their profile to a business account and when I look at how many of their followers also are kids and have changed to a business profile, it ranges from ~10% to 30%. I interpret this to mean that kids tell each other about this feature and that in doing so, many adopt this behavior (naturally, the sequence of who changed when is unknown). This to me indicates that this behavior is not a ‘rare’ event and appeals to a broad range of kids.

Above statistics are for any profile of type = “Person”. This includes users who have a personal profile and have set it to “Private”
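
To show how a sample proportion can be turned into a ‘minimum number’ for a whole population, here is a Python sketch using the Wilson score lower bound. This is one standard way to do it and an assumption on my part, not a description of the exact calculation behind the figures in this article. It treats the sample as a simple random sample, which the three methods above only approximate. The sample counts in the usage lines are hypothetical; only the 7,000,000 population figure comes from the text.

```python
from math import floor, sqrt


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def minimum_count(successes: int, n: int, population: int) -> int:
    """Conservative count of affected users in the whole population."""
    return floor(wilson_lower_bound(successes, n) * population)


if __name__ == "__main__":
    # Hypothetical sample: 150 business profiles among 1,000 sampled users.
    print(minimum_count(150, 1_000, 7_000_000))
```

The lower bound is always below the raw sample proportion, and the gap shrinks as the sample grows, which is why a larger sample supports a higher ‘minimum’.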

Thanks so much for reading.
