Newsfeed - GDELT Data Access

Overview

Query and retrieve data from the GDELT Project (Global Database of Events, Language, and Tone) through CLI tools and Python APIs.

When to Use

Use this skill when you need to:

•Query global news articles by date range
•Access GDELT Events, Mentions, and Global Knowledge Graph (GKG) databases
•Perform timeline analysis of news events
•Download full-text articles from URLs
•Analyze global events, themes, and relationships

Prerequisites

Set up your environment as follows:

•Create and activate a Python environment (Python 3.10+ recommended)
•Install the newsfeed package:
bash
```
pip install newsfeed==0.1.7
```

Or install from source (dev branch):

bash

git clone https://github.com/Cyclododecene/newsfeed.git
cd newsfeed
git checkout dev
pip install -e .

Common Patterns

•Event Search: Query GDELT Events Database for events based on date range and version
•Mentions Search: Query GDELT Mentions Database for media mentions of events
•Global Knowledge Graph Search: Query GDELT GKG for themes, locations, and relationships
•Full-Text Download: Download complete article content from URLs
•Export Results: Save query results to JSON or CSV for further analysis

CLI Usage (Recommended)

The primary way to interact with GDELT databases is through the CLI.

Basic Database Query

bash

newsfeed --db <DATABASE> --version <VERSION> --start <START_DATE> --end <END_DATE> [--format <FORMAT>] [--output <FILENAME>]

Parameters

Parameter	Description	Required	Options	Example
`--db`	Database type	Yes	EVENT, GKG, MENTIONS	EVENT
`--version`	Database version	Yes	V1, V2	V2
`--start`	Start date	Yes	V1: YYYY-MM-DD, V2: YYYY-MM-DD-HH-MM-SS	2021-01-01 or 2021-01-01-00-00-00
`--end`	End date	Yes	V1: YYYY-MM-DD, V2: YYYY-MM-DD-HH-MM-SS	2021-01-02 or 2021-01-02-00-00-00
`--format`	Output format	No	csv, json (default: csv)	json
`--output`	Output filename	No	Any filename (auto-generated if not specified)	results.csv
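
When queries are launched from a script, the date layout for each version is easy to get wrong. The sketch below builds the argument list from `datetime` objects using only the flags in the table above. The helper name `build_query_args` is hypothetical and not part of the newsfeed package.

```python
from datetime import datetime

# Date layouts taken from the parameter table above.
DATE_FORMATS = {"V1": "%Y-%m-%d", "V2": "%Y-%m-%d-%H-%M-%S"}


def build_query_args(db, version, start, end, fmt="csv", output=None):
    """Return an argv list for the newsfeed CLI (hypothetical helper)."""
    if db not in {"EVENT", "GKG", "MENTIONS"}:
        raise ValueError(f"unknown database: {db}")
    if version not in DATE_FORMATS:
        raise ValueError(f"unknown version: {version}")
    if db == "MENTIONS" and version != "V2":
        raise ValueError("MENTIONS is only available in V2")
    layout = DATE_FORMATS[version]
    args = [
        "newsfeed",
        "--db", db,
        "--version", version,
        "--start", start.strftime(layout),
        "--end", end.strftime(layout),
        "--format", fmt,
    ]
    if output is not None:
        args += ["--output", output]
    return args


args = build_query_args("EVENT", "V2", datetime(2021, 1, 1), datetime(2021, 1, 2))
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` once the package is installed.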

Database Query Examples

•Query Events V2 Database:

bash

newsfeed --db EVENT --version V2 --start 2021-01-01-00-00-00 --end 2021-01-02-00-00-00

•Query GKG V1 Database:

bash

newsfeed --db GKG --version V1 --start 2021-01-01 --end 2021-01-02

•Query Mentions V2 with JSON output:

bash

newsfeed --db MENTIONS --version V2 --start 2021-01-01-00-00-00 --end 2021-01-02-00-00-00 --format json

•Specify output filename:

bash

newsfeed --db EVENT --version V2 --start 2021-01-01-00-00-00 --end 2021-01-02-00-00-00 --output my_events.csv

Full-Text Download

Download complete article text from URLs using standalone mode or query mode.

Standalone Mode

•Download from single URL:

bash

newsfeed --fulltext --url "https://example.com/article" --output article.json

•Download from URL list file (one URL per line):

bash

newsfeed --fulltext --input urls.txt --output fulltexts.csv

•Download from CSV file:

bash

newsfeed --fulltext --input results.csv --url-column SOURCEURL --output with_fulltext.csv

Query Mode with Full-Text Download

Query database and automatically download full text:

bash

newsfeed --db EVENT --version V2 --start 2021-01-01-00-00-00 --end 2021-01-02-00-00-00 --download-fulltext

This will:

•Query GDELT Events database
•Extract unique URLs from SOURCEURL column
•Download full text for each article
•Add full text to FULLTEXT column
•Export CSV/JSON with full text
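
The steps above can be sketched in plain Python to show why URLs are deduplicated first: many event rows share one article, so each distinct URL only needs one download. This is an illustration of the idea, not the package's actual implementation. `add_fulltext` and `fake_fetch` are hypothetical names, and the sample rows are made up.

```python
import csv
import io


def add_fulltext(rows, fetch, url_column="SOURCEURL", fulltext_column="FULLTEXT"):
    """Fetch each distinct URL once and attach the text to every matching row."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    unique_urls = list(dict.fromkeys(r[url_column] for r in rows if r.get(url_column)))
    texts = {url: fetch(url) for url in unique_urls}
    for row in rows:
        # Rows without a URL, or whose download failed, get an empty string
        row[fulltext_column] = texts.get(row.get(url_column)) or ""
    return rows


def fake_fetch(url):
    # Stand-in for a real downloader so the sketch runs offline
    return f"text of {url}"


sample = io.StringIO(
    "GLOBALEVENTID,SOURCEURL\n"
    "1,https://example.com/a\n"
    "2,https://example.com/a\n"
    "3,https://example.com/b\n"
)
rows = add_fulltext(list(csv.DictReader(sample)), fake_fetch)
for row in rows:
    print(row["GLOBALEVENTID"], row["FULLTEXT"])
```

Here three rows trigger only two fetches, because events 1 and 2 cite the same article.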

Full-Text Download Parameters

Parameter	Description	Mode	Default
`--fulltext`	Enable full-text download mode	Standalone	-
`--download-fulltext`	Download full text after query	Query	False
`--url`	Single URL to download	Standalone	-
`--input`	Input file with URLs (txt or csv)	Standalone	-
`--url-column`	URL column name in CSV	Both	SOURCEURL
`--fulltext-column`	Column name for full text in output	Query	FULLTEXT
`--format`	Output format (csv, json, txt)	Both	csv

Python API Usage

For advanced use cases, you can use the Python API directly.

Events Database

python

from newsfeed.news.db.events import EventV1, EventV2
import pandas as pd

# Version 1 (Daily updates, date format: YYYY-MM-DD)
event_v1 = EventV1(start_date="2021-01-01", end_date="2021-01-02")
results_v1 = event_v1.query()

# Version 2 (15-minute updates, date format: YYYY-MM-DD-HH-MM-SS)
event_v2 = EventV2(start_date="2021-01-01-00-00-00", end_date="2021-01-02-00-00-00", table="events")
results_v2 = event_v2.query()

Mentions Database

python

from newsfeed.news.db.events import EventV2

# Mentions only available in V2
mentions = EventV2(start_date="2021-01-01-00-00-00", end_date="2021-01-02-00-00-00", table="mentions")
results = mentions.query()

GKG Database

python

from newsfeed.news.db.gkg import GKGV1, GKGV2

# Version 1
gkg_v1 = GKGV1(start_date="2021-01-01", end_date="2021-01-02")
results_v1 = gkg_v1.query()

# Version 2
gkg_v2 = GKGV2(start_date="2021-01-01-00-00-00", end_date="2021-01-02-00-00-00")
results_v2 = gkg_v2.query()

Full-Text Download

python

from newsfeed.utils.fulltext import download

# Download full text from URL
article = download("https://example.com/article")
if article:
    print(f"Title: {article.title}")
    print(f"Text: {article.text}")
    print(f"Authors: {article.authors}")
    print(f"Publish Date: {article.publish_date}")

Database Details

Events Database

Contains global event data including event codes, actors, geographic locations, and sentiment analysis.

•V1: Date format YYYY-MM-DD, daily updates
•V2: Date format YYYY-MM-DD-HH-MM-SS, updates every 15 minutes

Key columns:

•GLOBALEVENTID: Global event ID
•SQLDATE: Date in SQL format
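•Actor1Code, Actor2Code: CAMEO codes for the two actors (country, organization, or group)
•EventCode: CAMEO event code
•GoldsteinScale: Goldstein scale score from -10 to +10, estimating the event type's theoretical impact on stability
•AvgTone: Average tone of the documents mentioning the event (negative values indicate negative tone)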
•SOURCEURL: Article URL
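
One practical detail about EventCode: CAMEO codes are hierarchical and can start with a zero (for example "010"), so they should be handled as strings rather than integers. The sketch below groups a few made-up rows by the two-digit root of the code. It assumes the exported CSV keeps the EventCode column name listed above.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only
sample = io.StringIO(
    "GLOBALEVENTID,EventCode\n"
    "1,010\n"
    "2,0211\n"
    "3,190\n"
    "4,193\n"
)

root_counts = Counter()
for row in csv.DictReader(sample):
    # DictReader yields strings, so the leading zero in "010" survives
    root_counts[row["EventCode"][:2]] += 1

print(root_counts.most_common())
```

With pandas, the same effect can be had by passing `dtype={"EventCode": str}` to `pd.read_csv`, which stops the column from being parsed as an integer.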

GKG Database

Contains global knowledge graph data including themes, locations, persons, organizations, and sentiment.

•V1: Date format YYYY-MM-DD, daily updates
•V2: Date format YYYY-MM-DD-HH-MM-SS, updates every 15 minutes

Key columns:

•DATE: Date
•V2SOURCECOMMONNAME: Source name
•V1THEMES, V2ENHANCEDTHEMES: Themes
•V1LOCATIONS, V2ENHANCEDLOCATIONS: Locations
•V1PERSONS, V2ENHANCEDPERSONS: Persons
•V1ORGANIZATIONS, V2ENHANCEDORGANIZATIONS: Organizations

Mentions Database

Contains media mentions of events, only available in V2.

•V2: Date format YYYY-MM-DD-HH-MM-SS, updates every 15 minutes

Key columns:

•GLOBALEVENTID: Global event ID
•MentionTimeDate: When the event was mentioned
•MentionSourceName: Source name
•MentionDocTone: Sentiment of mention
•Confidence: Confidence score
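
Because each mention row points back to an event through GLOBALEVENTID, a common first step is counting how many sufficiently confident mentions each event received. The sketch below does this on made-up rows. The cutoff of 50 is an arbitrary choice for illustration, and the column names are assumed to match the list above.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only
sample = io.StringIO(
    "GLOBALEVENTID,MentionSourceName,Confidence\n"
    "1,example.com,100\n"
    "1,example.org,20\n"
    "2,example.com,60\n"
    "1,example.net,80\n"
)

MIN_CONFIDENCE = 50  # arbitrary cutoff chosen for illustration

mentions_per_event = Counter()
for row in csv.DictReader(sample):
    # Keep only mentions at or above the confidence cutoff
    if int(row["Confidence"]) >= MIN_CONFIDENCE:
        mentions_per_event[row["GLOBALEVENTID"]] += 1

for event_id, count in mentions_per_event.most_common():
    print(event_id, count)
```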

Common Use Cases

1. Analyze Events by Country

python

import pandas as pd

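# Load the CSV produced by the CLI query
df = pd.read_csv('EVENT_V2_20210101000000_20210102000000.csv')

# Filter to events whose first actor is affiliated with China
# (Actor1CountryCode describes the actor, not where the event happened)
china_events = df[df['Actor1CountryCode'] == 'CHN']
print(f"Found {len(china_events)} events with a Chinese Actor1")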
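
Filtering by actor and filtering by location answer different questions. In the standard GDELT event schema, where an event took place is recorded in ActionGeo_CountryCode, which uses two-letter FIPS codes ("CH" for China) rather than the three-letter CAMEO codes used for actors. The sketch below contrasts the two on made-up rows, using the plain `csv` module so it runs without pandas. It assumes the exported CSV keeps the standard column name.

```python
import csv
import io

# Made-up rows for illustration only
sample = io.StringIO(
    "GLOBALEVENTID,Actor1CountryCode,ActionGeo_CountryCode\n"
    "1,CHN,CH\n"
    "2,CHN,US\n"
    "3,USA,CH\n"
)
rows = list(csv.DictReader(sample))

# Events whose first actor is Chinese (CAMEO three-letter code)
by_actor = [r["GLOBALEVENTID"] for r in rows if r["Actor1CountryCode"] == "CHN"]

# Events located in China (FIPS two-letter code)
by_location = [r["GLOBALEVENTID"] for r in rows if r["ActionGeo_CountryCode"] == "CH"]

print("by actor:", by_actor)
print("by location:", by_location)
```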

2. Extract Top Themes from GKG

python

import pandas as pd
from collections import Counter

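# Load the CSV produced by the CLI query
df = pd.read_csv('GKG_V2_20210101000000_20210102000000.csv')

# Extract themes. Each V2ENHANCEDTHEMES entry is "THEME,offset", entries separated by ';'
all_themes = []
for themes in df['V2ENHANCEDTHEMES'].dropna():
    for entry in themes.split(';'):
        if entry:  # skip the empty piece after a trailing ';'
            all_themes.append(entry.split(',')[0])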

# Count themes
theme_counts = Counter(all_themes)
print("Top 10 themes:")
for theme, count in theme_counts.most_common(10):
    print(f"  {theme}: {count}")

3. Analyze Sentiment Trends

python

import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV produced by the CLI query
df = pd.read_csv('EVENT_V2_20210101000000_20210102000000.csv')

# Convert date (SQLDATE is read as an integer such as 20210101)
df['date'] = pd.to_datetime(df['SQLDATE'].astype(str), format='%Y%m%d')

# Group by date and calculate average tone
daily_tone = df.groupby('date')['AvgTone'].mean()

# Plot
plt.figure(figsize=(12, 6))
daily_tone.plot()
plt.title('Average Sentiment Over Time')
plt.xlabel('Date')
plt.ylabel('Average Tone')
plt.show()

Tips and Best Practices

•Date Formats: Always use the correct date format for the version (V1: YYYY-MM-DD, V2: YYYY-MM-DD-HH-MM-SS)
•Query Range: Keep date ranges short; V2 data updates every 15 minutes, so a wide range covers many updates and takes long to download
•Output Format: Use JSON for programmatic processing, CSV for data analysis
•Full-Text Download: Download times vary based on URL count and network speed
•Error Handling: The CLI will report failed URLs during full-text download
•File Size: GDELT databases are large; be mindful of disk space
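
One way to keep each query small is to split a long period into fixed windows and run one query per window. The sketch below generates daily windows and prints the matching V2 commands. `split_range` is a hypothetical helper, and whether the CLI treats the end timestamp as inclusive is not stated here, so adjacent windows may need deduplication on GLOBALEVENTID after merging.

```python
from datetime import datetime, timedelta


def split_range(start, end, step=timedelta(days=1)):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)  # last window may be shorter
        yield cursor, window_end
        cursor = window_end


V2_LAYOUT = "%Y-%m-%d-%H-%M-%S"
for lo, hi in split_range(datetime(2021, 1, 1), datetime(2021, 1, 3, 12)):
    print(
        "newsfeed --db EVENT --version V2 "
        f"--start {lo.strftime(V2_LAYOUT)} --end {hi.strftime(V2_LAYOUT)}"
    )
```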

Troubleshooting

Download fails or takes too long

•Check internet connection
•Reduce date range
•Some URLs may be inaccessible or have anti-scraping measures

No results found

•Verify date format matches version
•Check if data exists for the date range
•Try a different date range
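
A quick way to rule out the first cause is to parse the date strings before running the query. The sketch below checks a string against the layout for its version and confirms that the start precedes the end. `check_date` and `check_range` are hypothetical helpers, not part of the newsfeed package.

```python
from datetime import datetime

# Layouts from the CLI parameter table
LAYOUTS = {"V1": "%Y-%m-%d", "V2": "%Y-%m-%d-%H-%M-%S"}


def check_date(value, version):
    """Return the parsed datetime, or raise ValueError with a readable message."""
    layout = LAYOUTS[version]
    try:
        return datetime.strptime(value, layout)
    except ValueError:
        raise ValueError(f"{value!r} does not match the {version} layout {layout}") from None


def check_range(start, end, version):
    """Validate both dates and make sure the range is not empty or reversed."""
    begin, finish = check_date(start, version), check_date(end, version)
    if begin >= finish:
        raise ValueError("start must be earlier than end")
    return begin, finish


print(check_range("2021-01-01-00-00-00", "2021-01-02-00-00-00", "V2"))
try:
    check_date("2021-01-01", "V2")  # V1 layout passed to a V2 query
except ValueError as err:
    print(err)
```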

Full-text download fails

•Some websites block automated downloads
•Try again later or use Internet Archive fallback (built-in)
•Check failed URL list in output
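
"Try again later" can be automated with a small retry wrapper that waits longer after each failure. The sketch below is generic: `fetch` is any callable that returns a falsy value or raises on failure, such as the `download` function shown in the Python API section. `download_with_retry` is a hypothetical helper, and the delays are illustrative defaults.

```python
import time


def download_with_retry(url, fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a truthy result or attempts run out."""
    for attempt in range(attempts):
        try:
            result = fetch(url)
        except Exception:
            result = None  # treat any download error as a failed attempt
        if result:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between tries
    return None


# Demo with a stand-in fetcher that fails twice, then succeeds
state = {"calls": 0}


def flaky_fetch(url):
    state["calls"] += 1
    return "article text" if state["calls"] >= 3 else None


# sleep is replaced with a no-op so the demo finishes immediately
print(download_with_retry("https://example.com/article", flaky_fetch, sleep=lambda s: None))
```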

Additional Resources

•GitHub Repository: https://github.com/Cyclododecene/newsfeed
•GDELT Project: https://www.gdeltproject.org/
•CLI Documentation: See CLI_USAGE.md in the repository
•API Documentation: See docstrings in source code

Help

For CLI help:

bash

newsfeed --help
python -m newsfeed --help