Traditional ML API

Traditional ML-based API

1. soft_news_url_cat_us

Uses URL patterns in prominent outlets to classify the type of news. It is based on a slightly amended version of the regular expression used to classify news, and non-news in Exposure to ideologically diverse news and opinion on Facebook by Bakshy, Messing, and Adamic in Science in 2015. Our only amendment: sport rather than sports. The classifier success is liable to vary over time and across outlets.

  • Arguments:

    • df: pandas dataframe. No default.

    • url: column with the domain names/URLs. Default is url

  • What it does:

    • converts url to lower case

    • regex

    URL containing any of the following words is classified as soft news:
    sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|music|gossip|food|travel|horoscope|weather|gadget
    
    URL containing any of following words is classified as hard news:
    politi|usnews|world|national|state|elect|vote|govern|campaign|war|polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration
    
  • Output:

    • Given both the regex can return true, the potential set is: soft, hard, soft and hard, or empty string.

    • By default it creates two columns, hard_lab and soft_lab

  • Examples:

    >>> import pandas as pd
    >>> from notnews import soft_news_url_cat_us
    >>>
    >>> df = pd.DataFrame([{'url': 'http://nytimes.com/sports/'}])
    >>> df
                            url
    0  http://nytimes.com/sports/
    >>>
    >>> soft_news_url_cat_us(df)
                            url  soft_lab hard_lab
    0  http://nytimes.com/sports/         1     None
    

2. pred_soft_news_us

We use data from NY Times to train a model. The function uses the trained model to predict soft news.

  • Arguments:

    • df: pandas dataframe. No default.

    • text: column with the story text.

  • Functionality:

    • Normalizes the text and gets the bi-grams and tri-grams

    • Outputs calibrated probability of soft news using the trained model

  • Output

    • Appends a column with probability of soft news (prob_soft_news_us)

  • Examples:

    >>> import pandas as pd
    >>> from notnews import pred_soft_news_us
    >>>
    >>> df = pd.read_csv('tests/sample_us.csv')
    >>> df
                src                                                url                                               text
    0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...
    1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...
    2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...
    3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...
    4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...
    5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...
    6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...
    7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...
    8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...
    9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...
    >>>
    >>> pred_soft_news_us(df)
    Using model data from /opt/notebooks/not_news/notnews_pub/notnews/data/us_model/nyt_us_soft_news_classifier.joblib...
    Using vectorizer data from /opt/notebooks/not_news/notnews_pub/notnews/data/us_model/nyt_us_soft_news_vectorizer.joblib...
    Loading the model and vectorizer data file...
                src                                                url                                               text  prob_soft_news_us
    0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...           0.175099
    1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...           0.044617
    2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...           0.010398
    3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...           0.011246
    4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...           0.021861
    5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...           0.372437
    6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...           0.077207
    7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...           0.481287
    8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...           0.004383
    9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...           0.694037
    >>>
    

3. pred_what_news_us

We use a model trained on the annotated NY Times corpus to predict the type of news—Arts, Books, Business Finance, Classifieds, Dining, Editorial, Foreign News, Health, Leisure, Local, National, Obits, Other, Real Estate, Science, Sports, Style, and Travel.

  • Arguments:

    • df: pandas dataframe. No default.

    • text: column with the story text.

  • Functionality:

    • Normalizes the text and gets the bi-grams and tri-grams

    • Outputs calibrated probability of the type of news using the trained model

  • Output

    • Appends a column of predicted category (pred_what_news_us) and the columns for probability of each category. (prob_*)

  • Examples:

    >>> import pandas as pd
    >>> from notnews import pred_what_news_us
    >>>
    >>> df = pd.read_csv('tests/sample_us.csv')
    >>> df
                src                                                url                                               text
    0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...
    1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...
    2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...
    3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...
    4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...
    5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...
    6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...
    7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...
    8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...
    9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...
    >>>
    >>> pred_what_news_us(df)
    
    Using model data from /opt/notebooks/not_news/notnews_pub/notnews/data/us_model/nyt_us_classifier.joblib...
    Using vectorizer data from /opt/notebooks/not_news/notnews_pub/notnews/data/us_model/nyt_us_vectorizer.joblib...
    Loading the model and vectorizer data file...
                src                                                url                                               text  ... prob_sports  prob_style  prob_travel
    0             nyt  http://www.nytimes.com/2017/02/11/us/politics/...  Mr. Kushner on something of a crash course in ...  ...    0.000000    0.037708     0.000000
    1  huffingtonpost  http://grvrdr.huffingtonpost.com/302/redirect?...  Authorities are still searching for a man susp...  ...    0.000505    0.000243     0.000416
    2             nyt  http://www.nytimes.com/2016/09/19/us/politics/...  Photo  WASHINGTON — In releasing a far more so...  ...    0.000000    0.051815     0.000000
    3          google  http://www.foxnews.com/world/2016/07/17/turkey...  The Turkish government on Sunday ratcheted up ...  ...    0.001302    0.001378     0.000040
    4             nyt  http://www.nytimes.com/interactive/2016/08/29/...  NYTimes.com no longer supports Internet Explor...  ...    0.003500    0.010600     0.000973
    5           yahoo  https://www.yahoo.com/news/pittsburgh-symphony...  PITTSBURGH AP — Pittsburgh Symphony Orchestra ...  ...    0.161347    0.009316     0.000476
    6         foxnews  http://www.foxnews.com/politics/2016/08/13/cli...  Hillary Clintons campaign is questioning a rep...  ...    0.006366    0.003844     0.005973
    7         foxnews  http://www.foxnews.com/us/2017/04/15/april-gir...  April the giraffe has given birth at a New Yor...  ...    0.000808    0.047357     0.015018
    8         foxnews  http://www.foxnews.com/politics/2017/05/03/hil...  Want FOX News Halftime Report in your inbox ev...  ...    0.000626    0.000459     0.000000
    9             nyt  http://www.nytimes.com/2016/09/06/obituaries/p...  Shes an extremely liberated woman Ms. DeCrow s...  ...    0.000000    0.019162     0.000000
    
    [10 rows x 22 columns]
    >>>
    

4. soft_news_url_cat_uk

Uses URL patterns in prominent outlets to classify the type of news. It is based on a slightly amended version of the regular expression used to classify news, and non-news in Exposure to ideologically diverse news and opinion on Facebook by Bakshy, Messing, and Adamic. Science. 2015. Amendment: sport rather than sports. The classifier success is liable to vary over time and across outlets.

  • Arguments:

    • df: pandas dataframe. No default.

    • url: column with the domain names/URLs. Default is url

  • What it does:

    • converts url to lower case

    • regex

    URL containing any of the following words is classified as soft news:
    sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|music|gossip|food|travel|horoscope|weather|gadget
    
    URL containing any of following words is classified as hard news:
    politi|usnews|world|national|state|elect|vote|govern|campaign|war|polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration
    
  • Output:

    • Given both the regex can return true, the potential set is: soft, hard, soft and hard, or empty string.

    • By default it creates two columns, hard_lab and soft_lab

  • Examples:

    >>> import pandas as pd
    >>> from notnews import soft_news_url_cat_uk
    >>>
    >>> df = pd.DataFrame([{'url': 'https://www.theguardian.com/us/sport'}])
    >>> df
                                        url
    0  https://www.theguardian.com/us/sport
    >>>
    >>> soft_news_url_cat_uk(df)
                                        url  soft_lab hard_lab
    0  https://www.theguardian.com/us/sport         1     None
    >>>
    

5. pred_soft_news_uk

We use the model to predict soft news for UK news media.

  • Arguments:

    • df: pandas dataframe. No default.

    • text: column with the story text.

  • Functionality:

    • Normalizes the text and gets the bi-grams and tri-grams

    • Outputs calibrated probability of soft news using the trained model

  • Output

    • Appends a column with probability of soft news (prob_soft_news_uk)

  • Examples:

    >>> import pandas as pd
    >>> from notnews import pred_soft_news_uk
    >>>
    >>> df = pd.read_csv('tests/sample_uk.csv')
    >>> df
                        src_name                                                url                                               text
    0           your local guardian  http://www.yourlocalguardian.co.uk/news/local/...  friday octob comment say speed bump dug counci...
    1          liverpool daily post  http://icliverpool.icnetwork.co.uk/0100news/03...  man shot dead takeaway four mask gunmen victim...
    2           the daily telegraph  http://telegraph.feedsportal.com/c/32726/f/534...  euromillion jackpot reach imag euromillion tic...
    3                liverpool echo  http://icliverpool.icnetwork.co.uk/0100news/03...  father one three men kill last summer riot sai...
    4           the daily telegraph  http://telegraph.feedsportal.com/c/32726/f/579...  duchess cambridg rush duchess cambridg yet nam...
    5              buckingham today  http://www.buckinghamtoday.co.uk/latest-scotti...  man accus murder nineyearold girl innoc court ...
    6        northumberland gazette  http://www.northumberlandgazette.co.uk/latest-...  singersongwrit ami winehous appeal fine mariju...
    7                  daily record  http://www.dailyrecord.co.uk/entertainment/ent...  apr beverley lyon laura sutherland former crea...
    8  international business times  http://www.ibtimes.com/articles/331256/2012042...  deep valu found small medtech jason mill sourc...
    9                the daily mail  http://www.dailymail.co.uk/news/article-252383...  ca nt afford third child foot bill key down st...
    >>>
    >>> pred_soft_news_uk(df)
    Using model data from /opt/notebooks/not_news/notnews/notnews/data/uk_model/url_uk_classifier.joblib...
    Using vectorizer data from /opt/notebooks/not_news/notnews/notnews/data/uk_model/url_uk_vectorizer.joblib...
    Loading the model and vectorizer data file...
                        src_name                                                url                                               text  prob_soft_news_uk
    0           your local guardian  http://www.yourlocalguardian.co.uk/news/local/...  friday octob comment say speed bump dug counci...           0.152979
    1          liverpool daily post  http://icliverpool.icnetwork.co.uk/0100news/03...  man shot dead takeaway four mask gunmen victim...           0.038663
    2           the daily telegraph  http://telegraph.feedsportal.com/c/32726/f/534...  euromillion jackpot reach imag euromillion tic...           0.944237
    3                liverpool echo  http://icliverpool.icnetwork.co.uk/0100news/03...  father one three men kill last summer riot sai...           0.119689
    4           the daily telegraph  http://telegraph.feedsportal.com/c/32726/f/579...  duchess cambridg rush duchess cambridg yet nam...           0.903285
    5              buckingham today  http://www.buckinghamtoday.co.uk/latest-scotti...  man accus murder nineyearold girl innoc court ...           0.049645
    6        northumberland gazette  http://www.northumberlandgazette.co.uk/latest-...  singersongwrit ami winehous appeal fine mariju...           0.070025
    7                  daily record  http://www.dailyrecord.co.uk/entertainment/ent...  apr beverley lyon laura sutherland former crea...           0.926814
    8  international business times  http://www.ibtimes.com/articles/331256/2012042...  deep valu found small medtech jason mill sourc...           0.491505
    9                the daily mail  http://www.dailymail.co.uk/news/article-252383...  ca nt afford third child foot bill key down st...           0.004905
    >>>