Healthcare Twitter Analysis

The use of social media data and data science to gain insights into health care and medicine

This website gives easy access to the data files created by George Fisher for this project. See https://github.com/grfiv/healthcare_twitter_analysis for details of the data and of the project as far as I took it.

If you click a file name it will download to your computer; if you right-click and select 'Copy link address' you can paste the link into a command or program (wget, for example).
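For readers who would rather script the download than use wget, the following is a minimal Python sketch using only the standard library. The function name download and the destination file name are illustrative choices, and the URL is whatever link address you copied from this page; none of these come from the project itself.

```python
import os
import shutil
import urllib.request


def download(url: str, dest_path: str, chunk_size: int = 1 << 20) -> int:
    """Stream `url` to `dest_path` in chunks; return the number of bytes saved.

    Streaming avoids holding a multi-gigabyte file in memory.
    """
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(dest_path)


# Example call (paste the link address copied from the file list):
# download("<copied link address>", "HTA_noduplicates.gz")
```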

The files listed below are hosted in an Amazon S3 bucket. They are displayed here in a web page that uses the AWS SDK for JavaScript in the Browser to read and list objects stored in the Amazon S3 bucket. You can view the source of this page to see the JavaScript that powers it (in Chrome, right-click and select 'View page source').

About the Data

All of the tweets for this project have been processed and consolidated into a single file, HTA_noduplicates.gz
1.85 GB compressed / 15.80 GB uncompressed

Each of the 4 million rows in this file is a tweet in JSON format.

  • Every record contains the following information:
    • All the Twitter data in exactly the JSON format of the original
    • Unix timestamp
    • Data from the original project files:
      • originating file name
      • score
      • author screen name
      • URLs
  • In addition, 60% of the records have geographic information
    • Latitude & Longitude
    • Country name & ISO2 country code
    • City
    • For country code "US"
      • ZIP code
      • Telephone area code
      • Square miles inside the ZIP code
      • 2010 Census population of the ZIP code
      • County & FIPS code
      • State name & USPS abbreviation
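Because the file is about 15.80 GB uncompressed, it is usually more practical to stream it than to load it whole. The sketch below assumes, based on the description above, that each line of the gzipped file is one complete JSON object. The helper names iter_tweets and count_where are illustrative, and the exact key names inside each record are not listed on this page, so the key used in the usage comment is hypothetical; check the project document for the real ones.

```python
import gzip
import json
from typing import Callable, Iterator


def iter_tweets(path: str) -> Iterator[dict]:
    """Yield one decoded record per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def count_where(path: str, predicate: Callable[[dict], bool]) -> int:
    """Count records for which `predicate` is true, without loading the file."""
    return sum(1 for tweet in iter_tweets(path) if predicate(tweet))


# Example call; "country_code" is a HYPOTHETICAL key name, not confirmed here:
# count_where("HTA_noduplicates.gz", lambda t: t.get("country_code") == "US")
```

Only one line is held in memory at a time, so the same loop works on the full file and on a small sample alike.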

The other files are described in the project document on the GitHub site. They represent the data at successive stages of transformation, from the original project files, which contained essentially nothing but the tweet ID, to the final product, which contains the full original Twitter JSON plus a good deal of useful geographic and census information.

About the Code

This page uses the AWS SDK for JavaScript in the Browser to dynamically query the contents of the S3 bucket.

To keep things simple, the code does not ask for credentials. Instead, it makes unauthenticated calls to the S3 API. This means that it will only work against buckets that are publicly readable.

The JavaScript SDK makes it very simple to list the objects in an S3 bucket. The code:

  • links to the JavaScript SDK
  • configures the SDK with a default AWS region of us-west-1
  • creates and initializes an S3 object
  • via the S3 object, makes an unauthenticated listObjects call to S3
  • iterates over the 'data' object returned and displays each file in a row of the DataTable
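The steps above can be approximated outside the browser as well. The following Python sketch is not the JavaScript that powers this page; it swaps in a plain anonymous HTTPS request to the S3 REST listing endpoint (ListObjectsV2) and parses the XML reply. The bucket name is not given on this page, so it is left as a parameter, and the function names are illustrative. Pagination is not handled, so only the first page of results (up to 1,000 keys) is returned.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List, Tuple

# XML namespace used by S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def parse_listing(xml_bytes: bytes) -> List[Tuple[str, int]]:
    """Extract (key, size in bytes) pairs from an S3 ListBucketResult document."""
    root = ET.fromstring(xml_bytes)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_public_bucket(bucket: str, region: str = "us-west-1") -> List[Tuple[str, int]]:
    """Anonymously list the first page of a publicly readable bucket.

    `bucket` is a placeholder supplied by the caller; no credentials are sent.
    """
    url = f"https://{bucket}.s3.{region}.amazonaws.com/?list-type=2"
    with urllib.request.urlopen(url) as response:
        return parse_listing(response.read())


# Example call (replace the placeholder with a real public bucket name):
# for key, size in list_public_bucket("<bucket-name>"):
#     print(key, size)
```

As with the page's own code, this works only when the bucket allows public listing; a private bucket answers with an HTTP error instead.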
