I have been carefully re-reading the detailed and fascinating Gnip blog post “Tweet Metadata Timeline” by Twitter Boulder’s Jim Moffitt (@snowman). This is a post that everyone who studies Twitter data should read. Among the most useful observations (that I would guess very few Twitter researchers realize) were these gems:
“Most account metadata is static, but some change slowly over time. People change jobs and move. Companies updates their information. When you are collecting historical Tweets, it is important to understand how some metadata is as it was when Tweeted, and other metadata is as it is when the query is submitted. The metadata that is potentially updated depends on the historical API. With the Search APIs, the user profile metadata reflects the current values at the time of query. If you request a 2006 Tweet posted by Twitter co-founder @biz, you’ll see that the encapsulated user bio mentions being an advisor at Pinterest, which did not exist in 2006. If you pull those same Tweets with Historical PowerTrack you’ll see @biz account metadata as it was in September 2011, when the [Historical PowerTrack] archive was first constructed. Note that all Tweets since September 2011 contain the user profile as it was at the time the Tweet was posted.”
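The difference between static and dynamic metadata has always been one reason to use the live Twitter display (to see updated RTs and favorites), but this observation means that any Twitter researcher reporting on profile data must be careful when describing or drawing inferences from a specific or aggregate metadata value. Depending on the API used and the date of the Tweet, a profile field may reflect the moment the Tweet was posted, the state of the account in September 2011, or the moment the query was submitted.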
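To keep these cases straight when documenting a dataset, a researcher might label each Tweet with the "vintage" of its embedded profile. The following is only a minimal sketch of the rules in Jim's passage, not anything provided by Gnip or Twitter: the function name, the source labels, and the exact cutoff day are assumptions (the post gives only the month, September 2011).

```python
from datetime import date

# Assumption: the post says only "September 2011"; the first of the month
# is a placeholder cutoff, not a documented date.
ARCHIVE_BUILT = date(2011, 9, 1)


def profile_vintage(source: str, tweet_date: date) -> str:
    """Describe which moment the embedded user profile reflects.

    `source` is a hypothetical label: "search" for the Search APIs,
    "historical_powertrack" for Historical PowerTrack.
    """
    if source == "search":
        # Search APIs return the profile as it is when the query is run.
        return "query time"
    if source == "historical_powertrack":
        if tweet_date < ARCHIVE_BUILT:
            # Older Tweets carry the profile as it was when the archive was built.
            return "archive construction (September 2011)"
        # Newer Tweets carry the profile as it was when the Tweet was posted.
        return "tweet time"
    raise ValueError(f"unknown source: {source!r}")


print(profile_vintage("search", date(2006, 3, 21)))
print(profile_vintage("historical_powertrack", date(2006, 3, 21)))
print(profile_vintage("historical_powertrack", date(2014, 6, 1)))
```

Recording a label like this next to each collected Tweet makes it harder to mistake a present-day bio for one written at the time of the Tweet.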
Readers of the full blog post will benefit from Jim’s “select timeline of Twitter” because it tells a data story that can only inspire better research questions and methodologies for historical Twitter work. Certain “first-class objects”, such as @replies and retweets, emerged at specific points in the timeline of Twitter. The payload of objects encoded in the JSON evolved as Twitter functionality evolved. When preparing a historical Twitter rule, researchers should review Jim’s timeline and Gnip’s Historical PowerTrack documentation to make sure they fully understand the constraints imposed by the evolution of Twitter over time. For example:
“The timeline information can also help better interpret the Tweet data received. Say you were researching the sharing of content about the 2008 and 2012 Summer Olympics. If you applied only the is:retweet Operator to match on Retweets, no data would match in 2008. However, for 2012 there would likely be millions of Retweets. From this you potentially could erroneously conclude that in 2008 Retweets were not a user convention, or that simply no one Retweeted about those Olympics. Since Retweets became a first-class object in 2009, you need to add a "RT @" rule clause to help identify them in 2008.”
There are countless subtleties and similar permutations suggested by the evolution of the Twitter timeline. They call for an iterative and experimental approach. As we have previously noted, you can get a lot of clues by experimenting with Twitter's advanced search.
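As one small experiment, the Olympics example quoted above can be sketched in code. This is a hedged illustration rather than Gnip's recommended method: the helper names are hypothetical, and treating all of 2009 as "before first-class Retweets" is a deliberately conservative assumption, since the quoted passage gives only the year. The second helper assumes Tweets in Twitter's native JSON format, where a first-class Retweet carries a `retweeted_status` object.

```python
def retweet_clause(start_year: int) -> str:
    """Return a rule clause that matches Retweets for a period starting in start_year."""
    # Assumption: the source gives only the year (2009), so any period that
    # starts in or before 2009 also gets the text-based "RT @" clause.
    if start_year <= 2009:
        return '(is:retweet OR "RT @")'
    return "is:retweet"


def build_rule(topic: str, start_year: int) -> str:
    """Combine a topic keyword with the appropriate Retweet clause."""
    return f"{topic} {retweet_clause(start_year)}"


def looks_like_retweet(tweet: dict) -> bool:
    """Classify an already collected Tweet (native JSON parsed into a dict)."""
    if "retweeted_status" in tweet:
        # First-class Retweet object.
        return True
    # Fall back to the older user convention of typing "RT @user".
    return "RT @" in tweet.get("text", "")


print(build_rule("olympics", 2008))  # olympics (is:retweet OR "RT @")
print(build_rule("olympics", 2012))  # olympics is:retweet
print(looks_like_retweet({"text": "RT @someone: what a race!"}))  # True
```

Comparing the counts matched with and without the "RT @" clause for the same period is a quick way to see how much a rule depends on when a feature became a first-class object.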