Keyword Search
A great place to start is a simple keyword search. This CSDL searches ALL content sources (e.g. Twitter, Facebook, Digg) for the word "social" or the hashtag "#social" in the content:
interaction.content CONTAINS "social"
To look at Twitter data only, use:
twitter.text CONTAINS "social"
To match any one of several keywords, use the CONTAINS_ANY operator, which takes a comma-separated list of string arguments:
twitter.text CONTAINS_ANY "social,media,monitoring"
If all keywords must be present, use the "AND" operator:
twitter.text CONTAINS "social" AND twitter.text CONTAINS "media"
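The difference between "any keyword" and "all keywords" can be prototyped locally before a filter is deployed. The Python sketch below is only a rough model: the function names are hypothetical, and the tokenisation (lower-casing and splitting on non-alphanumeric characters, single-word keywords only) is an assumption rather than DataSift's documented behaviour.

```python
import re


def words(text: str) -> set:
    """Lower-cased word set; '#social' and 'social' both yield 'social'."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def parse_keywords(keywords: str) -> set:
    """Split a comma-separated keyword string, ignoring surrounding spaces."""
    return {k.strip().lower() for k in keywords.split(",") if k.strip()}


def contains_any(text: str, keywords: str) -> bool:
    """Model of an 'any of these keywords' match."""
    return not parse_keywords(keywords).isdisjoint(words(text))


def contains_all(text: str, keywords: str) -> bool:
    """Model of several single-keyword matches joined with AND."""
    return parse_keywords(keywords) <= words(text)


tweet = "Loving the new #social dashboard"
print(contains_any(tweet, "social,media,monitoring"))  # True
print(contains_all(tweet, "social,media"))             # False
```

The first call succeeds because one keyword is enough, while the second fails because "media" never appears in the text.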
Twitter Users
It is quite common to be interested in a specific Twitter user or group of users. You can capture all tweets from these users based on their Twitter name. For a single user, use the "==" operator:
twitter.user.screen_name == "ladygaga"
Or for a set of users, use the "IN" operator:
twitter.user.screen_name IN "name1, name2, name3"
To track Twitter mentions of a user or group of users:
twitter.mentions IN "pepsi, ladygaga"
Another powerful filter is the ability to search a Twitter user's profile description for keywords. For example, to look for a specific word within their profile:
twitter.user.description CONTAINS "teacher"
Or to search for a specific string within a user profile description (rather than a word surrounded by spaces):
twitter.user.description SUBSTR "linkedin.com" // will match linkedin.com...anything...
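The distinction between a whole-word match and a substring match is easy to get wrong, so it can help to model it in a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and both the word-splitting rule and the case-insensitive comparison are guesses at the behaviour, not a description of how DataSift implements these operators.

```python
import re


def contains_word(text: str, keyword: str) -> bool:
    """Whole-word match: the keyword must equal one of the text's tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return keyword.lower() in tokens


def contains_substring(text: str, fragment: str) -> bool:
    """Substring match: the exact character sequence may appear anywhere."""
    return fragment.lower() in text.lower()


bio = "Maths teacher. Profile: uk.linkedin.com/in/example"

print(contains_word(bio, "teacher"))                      # True
print(contains_word("Schoolteachers unite", "teacher"))   # False
print(contains_substring(bio, "linkedin.com"))            # True
print(contains_word(bio, "linkedin.com"))                 # False
```

The last call fails because "linkedin.com" is split into separate tokens under this model, which is the kind of case where a substring match is the better fit.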
URLs and Domains
Another common use case is to track specific domains. DataSift provides full link resolution so that you can filter in real time. To filter for any interaction content that contains a link to google.com or bbc.co.uk:
links.domain IN "google.com, bbc.co.uk"
Tracking specific keywords within a URL is also very useful, for example when a URL has been created for a specific campaign or product launch. The substring operator (SUBSTR) matches an exact sequence of characters. Given the example URL http://domain.com/testing/kindle?campaignid=123, we could filter on any part of the URL using SUBSTR:
links.url SUBSTR "kindle"
Likewise, the URL parameters can be filtered on in exactly the same way:
links.url SUBSTR "campaignid=123"
We can of course combine links.domain and links.url. Here we look for all interactions that contain links to amazon.com or amazon.co.uk and have the string "kindle" as part of the URL:
links.domain IN "amazon.com, amazon.co.uk" AND links.url SUBSTR "kindle"
In certain situations, you may wish to track all links that point to a specific website section that sits deeper in a URL hierarchy. For example, there may be a language parameter that could vary. In this instance, a simple regex is useful.
Likewise, you may wish to track all sub-domains for a specific domain.
Another good example is YouTube links, as these can take a number of different formats. A simple regex caters for all options.
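Regular expressions for these three cases can be prototyped and tested locally before being placed in a filter. The patterns below are illustrative assumptions only: example.com, the "/products/" path layout and the list of YouTube URL shapes are hypothetical, and Python's re module may differ in detail from the regex flavour DataSift uses.

```python
import re

# A two-letter language segment that varies, e.g. /en/products/ or /de/products/
LANGUAGE_SECTION = re.compile(
    r"^https?://example\.com/[a-z]{2}/products/", re.IGNORECASE
)

# example.com itself plus any depth of sub-domain, but not notexample.com
SUBDOMAIN = re.compile(r"^(?:[a-z0-9-]+\.)*example\.com$", re.IGNORECASE)

# Common YouTube link shapes: watch?v=ID, embed/ID, v/ID and youtu.be/ID
YOUTUBE = re.compile(
    r"^https?://(?:www\.|m\.)?"
    r"(?:youtube\.com/(?:watch\?(?:.*&)?v=|embed/|v/)|youtu\.be/)"
    r"([\w-]{11})",
    re.IGNORECASE,
)


def youtube_id(url: str):
    """Return the 11-character video id, or None if the URL does not match."""
    match = YOUTUBE.match(url)
    return match.group(1) if match else None


print(bool(LANGUAGE_SECTION.match("http://example.com/de/products/kindle")))  # True
print(bool(SUBDOMAIN.match("news.example.com")))                              # True
print(bool(SUBDOMAIN.match("notexample.com")))                                # False
print(youtube_id("https://youtu.be/dQw4w9WgXcQ"))                             # dQw4w9WgXcQ
```

Anchoring the patterns with ^ and $ is a deliberate choice here, since an unanchored domain pattern would also match look-alike hosts such as example.com.evil.net.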
Specific Data Sources
One of the biggest benefits of the DataSift platform is that you have access to all of the data sources from a single location and interface. It is really simple to include and exclude specific data sources. This can be done either from within your Data Sources page after login (if you would like to include or exclude data sources for ALL of your streams), or by CSDL as follows.
To monitor all sources except for Facebook:
interaction.type != "facebook"
To monitor a set of specific sources only:
interaction.type IN "facebook,digg,myspace"
Tagging
The Tagging functionality allows you to stamp each interaction with additional metadata whenever a specific CSDL condition matches, all in real time. Here are some common examples.
Sentiment
tag "Positive" { salience.content.sentiment > 0 }
tag "Neutral" { salience.content.sentiment == 0 }
tag "Negative" { salience.content.sentiment < 0 }
return { interaction.content CONTAINS_ANY "keyword1,keyword2" }
Gender
tag "male" { demographic.gender CONTAINS_ANY "male, mostly_male" }
tag "female" { demographic.gender CONTAINS_ANY "female, mostly_female" }
return { interaction.content CONTAINS_ANY "keyword1,keyword2" }
Klout
tag "Klout <10" { klout.score < 10 }
tag "Klout 10+" { klout.score >= 10 AND klout.score < 20 }
tag "Klout 20+" { klout.score >= 20 AND klout.score < 30 }
tag "Klout 30+" { klout.score >= 30 AND klout.score < 40 }
tag "Klout 40+" { klout.score >= 40 AND klout.score < 50 }
tag "Klout 50+" { klout.score >= 50 AND klout.score < 60 }
tag "Klout 60+" { klout.score >= 60 AND klout.score < 70 }
tag "Klout 70+" { klout.score >= 70 }
return { interaction.content CONTAINS_ANY "keyword1,keyword2" }
Miscellaneous
Interaction contains a specific string within a URL:
tag "Campaign" { links.url SUBSTR "2012-campaign" }
The source is within a 100 km radius of London (and geo is enabled):
tag "London" { interaction.geo GEO_RADIUS "51.52269412781852,-0.13432091250001577:100" }
Look at the Twitter user's profile description for a specific keyword:
tag "Fashion" { twitter.user.description CONTAINS "fashion" }
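To sanity-check which places a radius tag such as "London" is likely to cover, the distance test can be approximated locally. The Python sketch below uses the haversine great-circle formula and treats the radius as kilometres; whether DataSift computes distance this way is an assumption, and the function names and test coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points (haversine)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(point, centre, radius_km):
    """True if point lies no further than radius_km from centre."""
    return distance_km(*point, *centre) <= radius_km


# Centre taken from the "London" tag; the other points are approximate.
LONDON = (51.52269412781852, -0.13432091250001577)
OXFORD = (51.752, -1.258)
MANCHESTER = (53.480, -2.243)

print(within_radius(OXFORD, LONDON, 100))      # True
print(within_radius(MANCHESTER, LONDON, 100))  # False
```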
Subject Experts, Spam and Interaction Quality
There are several filters (AKA "targets") that can be used to increase the likelihood that you receive high-quality results, either from subject experts or based on popular content, depending upon your needs. One of the simplest methods is to utilise the Klout integration and look for users who have a Klout score above a specified level:
klout.score > 30
When looking to avoid spam, it is useful to look at the number of users following the author. This can be done with:
twitter.user.followers_count > 500
When observing links included within content, it is simple to filter for links that have been retweeted more than a given number of times:
links.retweet_count > 200
The Salience Topics augmentation can also help extract interactions for a specific topic:
salience.content.topics == "Social Media"
Looking at the user's Twitter profile is also a useful method for selecting subject matter experts:
twitter.user.description CONTAINS "social media"
Trends
Velocity of diffusion: track the rate at which new links (less than an hour old) reach retweet-count milestones. It may be preferable to remove the count that looks for first occurrences (a count of 1), as it increases traffic significantly and the majority of links never reach a count of 10 or 100. With the first count removed, it is still possible to plot when the link was first seen using the link.created_at field embedded within the link data.
tag "1" { links.retweet_count == 1 }
tag "10" { links.retweet_count == 10 }
tag "100" { links.retweet_count == 100 }
tag "1000" { links.retweet_count == 1000 }
tag "10000" { links.retweet_count == 10000 }
return { links.age < 3600 AND links.retweet_count IN [1, 10, 100, 1000, 10000] }
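The milestone logic can be modelled in a few lines to make its behaviour explicit: an interaction is only emitted at the moment a young link's count lands exactly on a milestone. The Python sketch below is a hypothetical local model (the function and constant names are not part of any DataSift API), assuming the age is measured in seconds as the 3600 threshold suggests.

```python
from typing import Optional

MILESTONES = (1, 10, 100, 1000, 10000)
MAX_AGE_SECONDS = 3600  # links older than an hour are ignored


def milestone_tag(retweet_count: int, age_seconds: float) -> Optional[str]:
    """Return the milestone tag for a link, or None if it would be filtered out."""
    if age_seconds < MAX_AGE_SECONDS and retweet_count in MILESTONES:
        return str(retweet_count)
    return None


print(milestone_tag(100, 1200))  # 100
print(milestone_tag(99, 1200))   # None
print(milestone_tag(100, 3600))  # None
```

Dropping 1 from MILESTONES mirrors the suggestion of removing the first-occurrence count to reduce traffic.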
Location
Limiting a filter by location can be achieved using several targets in combination with the GEO capabilities. Note that these may not be 100% reliable, as the fields are user-defined or auto-generated and the user may not have GEO enabled. At the time of writing, about 1% of Twitter users have GEO enabled, although this varies with the demographic for the use case. Other useful targets include twitter.user.* (user profile information) and twitter.place.* (location data entered or generated at the time of a tweet).
Twitter Source Segmentation
When looking at Twitter data specifically, it is useful to be able to segment the data between mobile, desktop, bot etc. This can be done with a set of tags covering the top 20 sources.
Generic Examples
A generic example filtering for keywords, links, link titles and mentions:
tag "Positive" { salience.content.sentiment > 0 }
tag "Neutral" { salience.content.sentiment == 0 }
tag "Negative" { salience.content.sentiment < 0 }
tag "male" { demographic.gender CONTAINS_ANY "male, mostly_male" }
tag "female" { demographic.gender CONTAINS_ANY "female, mostly_female" }
return {
  // Keyword and hashtag in any source, e.g. Twitter, Facebook, Digg
  interaction.content CONTAINS_ANY "hp, Hewlett Packard, Hewlett-Packard"
  // Interactions that contain an hp.com link
  OR links.domain == "hp.com"
  // Interactions that contain a link to a page whose title
  // contains any of hp, Hewlett Packard or Hewlett-Packard
  OR links.title CONTAINS_ANY "hp, Hewlett Packard, Hewlett-Packard"
  // Any mentions of the HP Twitter account
  OR twitter.mentions == "HPUK"
}