Newsfeed - GDELT Data Access
Overview
Query and retrieve data from the GDELT Project (Global Database of Events, Language, and Tone) through CLI tools and Python APIs.
When to Use
Use this skill when you need to:
- •Query global news articles by date range
- •Access GDELT Events, Mentions, and Global Knowledge Graph (GKG) databases
- •Perform timeline analysis of news events
- •Download full-text articles from URLs
- •Analyze global events, themes, and relationships
Prerequisites
To use this skill, set up your environment with the following commands:
- •Create and activate a new or existing Python environment (Python 3.10+ recommended)
- •Install the newsfeed package:
pip install newsfeed==0.1.7
Alternatively, install from source:
git clone https://github.com/Cyclododecene/newsfeed.git
cd newsfeed
git checkout dev
pip install -e .
Common Patterns
- •Event Search: Query GDELT Events Database for events based on date range and version
- •Mentions Search: Query GDELT Mentions Database for media mentions of events
- •Global Knowledge Graph Search: Query GDELT GKG for themes, locations, and relationships
- •Full-Text Download: Download complete article content from URLs
- •Export Results: Save query results to JSON or CSV for further analysis
CLI Usage (Recommended)
The primary way to interact with GDELT databases is the CLI.
Basic Database Query
newsfeed --db <DATABASE> --version <VERSION> --start <START_DATE> --end <END_DATE> [--format <FORMAT>] [--output <FILENAME>]
Parameters
| Parameter | Description | Required | Options | Example |
|---|---|---|---|---|
| --db | Database type | Yes | EVENT, GKG, MENTIONS | EVENT |
| --version | Database version | Yes | V1, V2 | V2 |
| --start | Start date | Yes | V1: YYYY-MM-DD, V2: YYYY-MM-DD-HH-MM-SS | 2021-01-01 or 2021-01-01-00-00-00 |
| --end | End date | Yes | V1: YYYY-MM-DD, V2: YYYY-MM-DD-HH-MM-SS | 2021-01-02 or 2021-01-02-00-00-00 |
| --format | Output format | No | csv, json (default: csv) | json |
| --output | Output filename | No | Any filename (auto-generated if not specified) | results.csv |
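The version-specific date formats are the easiest parameter to get wrong. The following is a minimal sketch of a pre-flight check, assuming you launch the CLI from Python. `build_query_command` is a hypothetical helper (it is not part of newsfeed); it validates both dates against the formats in the table above and returns an argument list that uses only the flags listed there.

```python
from datetime import datetime

# Date formats per version, taken from the parameter table.
DATE_FORMATS = {"V1": "%Y-%m-%d", "V2": "%Y-%m-%d-%H-%M-%S"}
DATABASES = {"EVENT", "GKG", "MENTIONS"}


def build_query_command(db, version, start, end, fmt=None, output=None):
    """Hypothetical helper: validate the arguments and return an argv list."""
    if db not in DATABASES:
        raise ValueError(f"unknown database: {db!r}")
    if version not in DATE_FORMATS:
        raise ValueError(f"unknown version: {version!r}")
    date_format = DATE_FORMATS[version]
    # strptime raises ValueError when a date does not match the format.
    start_dt = datetime.strptime(start, date_format)
    end_dt = datetime.strptime(end, date_format)
    if end_dt < start_dt:
        raise ValueError("end date is earlier than start date")
    command = ["newsfeed", "--db", db, "--version", version,
               "--start", start, "--end", end]
    if fmt is not None:
        command += ["--format", fmt]
    if output is not None:
        command += ["--output", output]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues and fails fast on a mistyped date before any download starts.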
Database Query Examples
- •Query Events V2 Database:
newsfeed --db EVENT --version V2 --start 2021-01-01-00-00-00 --end 2021-01-02-00-00-00
- •Query GKG V1 Database:
newsfeed --db GKG --version V1 --start 2021-01-01 --end 2021-01-02
- •Query Mentions V2 with JSON output:
newsfeed --db MENTIONS --version V2 --start 2021-01-01-00-00-00 --end 2021-01-02-00-00-00 --format json
- •Specify output filename:
newsfeed --db EVENT --version V2 --start 2021-01-01-00-00-00 --end 2021-01-02-00-00-00 --output my_events.csv
Full-Text Download
Download complete article text from URLs using standalone mode or query mode.
Standalone Mode
- •Download from single URL:
newsfeed --fulltext --url "https://example.com/article" --output article.json
- •Download from URL list file (one URL per line):
newsfeed --fulltext --input urls.txt --output fulltexts.csv
- •Download from CSV file:
newsfeed --fulltext --input results.csv --url-column SOURCEURL --output with_fulltext.csv
Query Mode with Full-Text Download
Query database and automatically download full text:
newsfeed --db EVENT --version V2 --start 2021-01-01-00-00-00 --end 2021-01-02-00-00-00 --download-fulltext
This will:
- •Query GDELT Events database
- •Extract unique URLs from SOURCEURL column
- •Download full text for each article
- •Add full text to FULLTEXT column
- •Export CSV/JSON with full text
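The steps above can be sketched in plain Python to show why URLs are deduplicated before downloading: many event rows can point to the same article. This is an illustration under stated assumptions, not the CLI's actual implementation. `attach_fulltext` is a hypothetical name, rows are assumed to be dictionaries (for example from `csv.DictReader`), and `fetch` is a caller-supplied function that maps a URL to its text or `None` on failure.

```python
def attach_fulltext(rows, fetch, url_column="SOURCEURL", fulltext_column="FULLTEXT"):
    """Hypothetical sketch: add a full-text column to a list of row dicts."""
    # Extract unique URLs, keeping first-seen order and skipping empty cells.
    unique_urls = list(dict.fromkeys(
        row[url_column] for row in rows if row.get(url_column)
    ))
    # Download each URL once, even if many rows share it.
    texts = {url: fetch(url) for url in unique_urls}
    # Copy every row and add the text (None when missing or failed).
    return [
        {**row, fulltext_column: texts.get(row.get(url_column))}
        for row in rows
    ]
```

Rows whose download failed keep a `None` value in the full-text column, so they remain easy to filter out or retry later.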
Full-Text Download Parameters
| Parameter | Description | Mode | Default |
|---|---|---|---|
| --fulltext | Enable full-text download mode | Standalone | - |
| --download-fulltext | Download full text after query | Query | False |
| --url | Single URL to download | Standalone | - |
| --input | Input file with URLs (txt or csv) | Standalone | - |
| --url-column | URL column name in CSV | Both | SOURCEURL |
| --fulltext-column | Column name for full text in output | Query | FULLTEXT |
| --format | Output format (csv, json, txt) | Both | csv |
Python API Usage
For advanced use cases, you can use the Python API directly.
Events Database
from newsfeed.news.db.events import EventV1, EventV2
import pandas as pd

# Version 1 (Daily updates, date format: YYYY-MM-DD)
event_v1 = EventV1(start_date="2021-01-01", end_date="2021-01-02")
results_v1 = event_v1.query()

# Version 2 (15-minute updates, date format: YYYY-MM-DD-HH-MM-SS)
event_v2 = EventV2(start_date="2021-01-01-00-00-00", end_date="2021-01-02-00-00-00", table="events")
results_v2 = event_v2.query()
Mentions Database
from newsfeed.news.db.events import EventV2

# Mentions only available in V2
mentions = EventV2(start_date="2021-01-01-00-00-00", end_date="2021-01-02-00-00-00", table="mentions")
results = mentions.query()
GKG Database
from newsfeed.news.db.gkg import GKGV1, GKGV2

# Version 1
gkg_v1 = GKGV1(start_date="2021-01-01", end_date="2021-01-02")
results_v1 = gkg_v1.query()

# Version 2
gkg_v2 = GKGV2(start_date="2021-01-01-00-00-00", end_date="2021-01-02-00-00-00")
results_v2 = gkg_v2.query()
Full-Text Download
from newsfeed.utils.fulltext import download
# Download full text from URL
article = download("https://example.com/article")
if article:
    print(f"Title: {article.title}")
    print(f"Text: {article.text}")
    print(f"Authors: {article.authors}")
    print(f"Publish Date: {article.publish_date}")
Database Details
Events Database
Contains global event data including event codes, actors, geographic locations, and sentiment analysis.
- •V1: Date format YYYY-MM-DD, daily updates
- •V2: Date format YYYY-MM-DD-HH-MM-SS, updates every 15 minutes
Key columns:
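- •GLOBALEVENTID: Global event ID
- •SQLDATE: Event date in YYYYMMDD format
- •Actor1Code, Actor2Code: Country/organization codes for the two actors
- •EventCode: CAMEO event code
- •GoldsteinScale: Theoretical impact of the event type on stability, from -10 to +10
- •AvgTone: Average tone (sentiment) of the documents mentioning the event
- •SOURCEURL: Article URL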
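CAMEO event codes are hierarchical: the first two digits identify the root category, and longer codes refine it. The sketch below groups codes by that root. It assumes the codes are kept as zero-padded strings (for example by loading the CSV with `dtype=str`), because a code such as 0211 loses its leading zero when parsed as a number. `event_root_code` and `count_root_codes` are hypothetical helper names.

```python
from collections import Counter


def event_root_code(event_code):
    """Return the two-digit CAMEO root of a zero-padded EventCode string."""
    code = str(event_code).strip()
    if not code.isdigit() or not 2 <= len(code) <= 4:
        raise ValueError(f"unexpected EventCode: {event_code!r}")
    # The root category is the first two digits of the code.
    return code[:2]


def count_root_codes(event_codes):
    """Count how many events fall under each CAMEO root category."""
    return Counter(event_root_code(code) for code in event_codes)
```

Counting by root category gives a coarse view of what kinds of events dominate a query before drilling into individual codes.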
GKG Database
Contains global knowledge graph data including themes, locations, persons, organizations, and sentiment.
- •V1: Date format YYYY-MM-DD, daily updates
- •V2: Date format YYYY-MM-DD-HH-MM-SS, updates every 15 minutes
Key columns:
- •DATE: Date
- •V2SOURCECOMMONNAME: Source name
- •V1THEMES, V2ENHANCEDTHEMES: Themes
- •V1LOCATIONS, V2ENHANCEDLOCATIONS: Locations
- •V1PERSONS, V2ENHANCEDPERSONS: Persons
- •V1ORGANIZATIONS, V2ENHANCEDORGANIZATIONS: Organizations
Mentions Database
Contains media mentions of events, only available in V2.
- •V2: Date format YYYY-MM-DD-HH-MM-SS, updates every 15 minutes
Key columns:
- •GLOBALEVENTID: Global event ID (links each mention to a row in the Events database)
- •MentionTimeDate: When the event was mentioned
- •MentionSourceName: Source name
- •MentionDocTone: Sentiment of mention
- •Confidence: Confidence score for the extraction of the event from the article
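Because every mention carries a GLOBALEVENTID, mention rows can be grouped per event to see how widely and how reliably an event was reported. The following is a minimal sketch, assuming the rows are dictionaries keyed by the column names listed above (for example from `csv.DictReader` over an exported CSV); `summarize_mentions` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_mentions(rows):
    """Return {event_id: (mention_count, mean_confidence)} for mention rows."""
    confidences = defaultdict(list)
    for row in rows:
        # CSV readers yield strings, so convert the score explicitly.
        confidences[row["GLOBALEVENTID"]].append(float(row["Confidence"]))
    return {
        event_id: (len(values), sum(values) / len(values))
        for event_id, values in confidences.items()
    }
```

Events with many mentions but a low mean confidence may deserve a manual check before they are used in analysis.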
Common Use Cases
1. Analyze Events by Country
import pandas as pd
# Load the CSV exported by the CLI
df = pd.read_csv('EVENT_V2_20210101000000_20210102000000.csv')
# Filter by the country of Actor1 (the actor's country, not the event location)
china_events = df[df['Actor1CountryCode'] == 'CHN']
print(f"Found {len(china_events)} events with a Chinese Actor1")
2. Extract Top Themes from GKG
import pandas as pd
from collections import Counter
# Query data
df = pd.read_csv('GKG_V2_20210101000000_20210102000000.csv')
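# Extract themes (entries are separated by ';' and look like THEME,offset)
all_themes = []
for themes in df['V2ENHANCEDTHEMES'].dropna():
    for entry in themes.split(';'):
        if entry:  # skip the empty string left by a trailing ';'
            all_themes.append(entry.split(',')[0])
# Count themes
theme_counts = Counter(all_themes)
print("Top 10 themes:")
for theme, count in theme_counts.most_common(10):
    print(f"  {theme}: {count}")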
3. Analyze Sentiment Trends
import pandas as pd
import matplotlib.pyplot as plt
# Query data
df = pd.read_csv('EVENT_V2_20210101000000_20210102000000.csv')
# Convert date
df['date'] = pd.to_datetime(df['SQLDATE'], format='%Y%m%d')
# Group by date and calculate average tone
daily_tone = df.groupby('date')['AvgTone'].mean()
# Plot
plt.figure(figsize=(12, 6))
daily_tone.plot()
plt.title('Average Sentiment Over Time')
plt.xlabel('Date')
plt.ylabel('Average Tone')
plt.show()
Tips and Best Practices
- •Date Formats: Always use the correct date format for the version (V1: YYYY-MM-DD, V2: YYYY-MM-DD-HH-MM-SS)
- •Query Range: Keep date ranges reasonable to avoid long download times
- •Output Format: Use JSON for programmatic processing, CSV for data analysis
- •Full-Text Download: Download times vary based on URL count and network speed
- •Error Handling: The CLI will report failed URLs during full-text download
- •File Size: GDELT databases are large; be mindful of disk space
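One way to keep each query small is to split a long period into shorter windows and run them one at a time. The sketch below generates V2-format windows; `split_range` is a hypothetical helper, and whether newsfeed treats the end date as inclusive is not stated here, so adjacent windows share a boundary and the combined results may need deduplication.

```python
from datetime import datetime, timedelta

V2_FORMAT = "%Y-%m-%d-%H-%M-%S"


def split_range(start, end, days=1):
    """Yield (start, end) V2 date strings that cover start..end in windows."""
    if days < 1:
        raise ValueError("days must be at least 1")
    cursor = datetime.strptime(start, V2_FORMAT)
    stop = datetime.strptime(end, V2_FORMAT)
    step = timedelta(days=days)
    while cursor < stop:
        # The last window is shortened so it never runs past the end date.
        window_end = min(cursor + step, stop)
        yield cursor.strftime(V2_FORMAT), window_end.strftime(V2_FORMAT)
        cursor = window_end
```

Each pair can be passed as the `--start` and `--end` values of a separate query, so a failed download only costs one window instead of the whole range.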
Troubleshooting
Download fails or takes too long
- •Check internet connection
- •Reduce date range
- •Some URLs may be inaccessible or have anti-scraping measures
No results found
- •Verify date format matches version
- •Check if data exists for the date range
- •Try a different date range
Full-text download fails
- •Some websites block automated downloads
- •Try again later or use Internet Archive fallback (built-in)
- •Check failed URL list in output
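"Try again later" can be automated with a small retry loop that waits longer after each failure. This is a sketch under stated assumptions: `fetch_with_retry` is a hypothetical helper, `fetch` is any function that returns the article or a falsy value on failure (the `download` function from the Python API section may fit, although its failure behavior is assumed here), and the delays are arbitrary defaults.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, doubling the wait between tries."""
    for attempt in range(attempts):
        try:
            result = fetch(url)
        except Exception:
            # Treat any error raised by the downloader as a failed attempt.
            result = None
        if result:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
    return None
```

Keeping the attempt count low is a deliberate choice: sites that block automated downloads rarely start allowing them after a few more tries.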
Additional Resources
- •GitHub Repository: https://github.com/Cyclododecene/newsfeed
- •GDELT Project: https://www.gdeltproject.org/
- •CLI Documentation: See CLI_USAGE.md in the repository
- •API Documentation: See docstrings in source code
Help
For CLI help:
newsfeed --help
python -m newsfeed --help