iMessage Extraction
Extract and process iMessage conversations from the macOS Messages database.
Database Location
~/Library/Messages/chat.db
Full Disk Access required: Grant Full Disk Access to Terminal in System Preferences > Security & Privacy > Privacy > Full Disk Access.
Setup steps:
- •Go to System Preferences > Security & Privacy > Privacy > Full Disk Access
- •Click the
+button and add/Applications/Utilities/Terminal.app(or your terminal app) - •Restart Terminal for changes to take effect
- •Test by running:
sqlite3 ~/Library/Messages/chat.db "SELECT COUNT(*) FROM chat;"
Quick Start
1. List All Conversations
import sqlite3
import os
conn = sqlite3.connect(os.path.expanduser('~/Library/Messages/chat.db'))
cursor = conn.cursor()
cursor.execute("""
SELECT c.ROWID, c.chat_identifier, c.display_name,
COUNT(cmj.message_id) as message_count
FROM chat c
LEFT JOIN chat_message_join cmj ON c.ROWID = cmj.chat_id
GROUP BY c.ROWID
ORDER BY message_count DESC
""")
for row in cursor.fetchall():
print(f"Chat {row[0]}: {row[1]} ({row[2] or 'No name'}) - {row[3]} messages")
2. Extract Messages from a Conversation
chat_id = 123 # Replace with actual chat ID
cursor.execute("""
SELECT m.ROWID, m.date, m.is_from_me, m.text, m.attributedBody,
h.id as sender
FROM message m
JOIN chat_message_join cmj ON m.ROWID = cmj.message_id
LEFT JOIN handle h ON m.handle_id = h.ROWID
WHERE cmj.chat_id = ?
ORDER BY m.date ASC
""", (chat_id,))
Apple Timestamp Conversion
iMessage uses Apple's Core Data timestamp: nanoseconds since 2001-01-01.
from datetime import datetime
APPLE_EPOCH = 978307200 # Unix timestamp of 2001-01-01
def apple_to_datetime(apple_ns):
"""Convert Apple nanosecond timestamp to datetime."""
if not apple_ns:
return None
unix_ts = (apple_ns / 1_000_000_000) + APPLE_EPOCH
return datetime.fromtimestamp(unix_ts)
def apple_to_unix(apple_ns):
"""Convert Apple nanosecond timestamp to Unix timestamp."""
return int(apple_ns / 1_000_000_000 + APPLE_EPOCH) if apple_ns else None
def datetime_to_apple(dt):
"""Convert datetime to Apple nanosecond timestamp."""
unix_ts = dt.timestamp()
return int((unix_ts - APPLE_EPOCH) * 1_000_000_000)
SQL conversion:
datetime(m.date/1000000000 + 978307200, 'unixepoch', 'localtime') as readable_date
Decoding attributedBody
Many messages store content in attributedBody (NSKeyedArchiver format) rather than text. See references/decoding.md for the full decoder.
Quick decode with PyObjC (if installed):
from Foundation import NSData, NSKeyedUnarchiver
def decode_attributed_body(data):
if not data:
return None
try:
ns_data = NSData.dataWithBytes_length_(data, len(data))
unarchiver = NSKeyedUnarchiver.alloc().initForReadingWithData_(ns_data)
unarchiver.setRequiresSecureCoding_(False)
obj = unarchiver.decodeObjectForKey_("root")
return str(obj.string()) if obj and hasattr(obj, 'string') else None
except:
return None
Install PyObjC: pip install pyobjc-framework-Cocoa
Fallback without PyObjC: See references/decoding.md for manual binary parsing.
Cleaning Message Artifacts
Decoded messages often contain artifacts from the typedstream encoding:
import re
def clean_message(text):
"""Clean decoding artifacts from message text."""
if not text:
return text
# Replace object replacement character (marks images)
text = text.replace('\ufffc', '[image]')
# Remove non-printable characters
text = ''.join(c for c in text if c.isprintable() or c in '\n\r\t')
# Remove leading format codes
text = re.sub(r'^[\+\*\!\.\,\;\#\%\`\~\^\&\@\$\s]+', '', text)
# Remove trailing NSDictionary artifacts
text = re.sub(r'\s*NSDictionary\s*$', '', text)
text = re.sub(r'\s*NSMutable[A-Za-z]+\s*$', '', text)
# Remove trailing iMessage metadata (e.g., &__kIMBaseWritingDirection...)
text = re.sub(r'\s*&?__kIM[^\s]*.*$', '', text)
# Remove trailing iMessage metadata
text = re.sub(r'[ij]I[^\s]*$', '', text)
# Remove leading digit artifact
text = re.sub(r'^[0-9](?=[a-zA-Z])', '', text)
# Remove leading single-letter artifacts from iMessage encoding
# Patterns: "dOMG" -> "OMG", "Ohttps://" -> "https://", "Ci believe" -> "i believe"
# BUT NOT: "OMG" -> "MG" (uppercase + uppercase = real word)
if len(text) > 2 and text[0].isalpha():
second_char = text[1]
rest = text[1:]
# Case 1: lowercase followed by uppercase (e.g., "dOMG")
if text[0].islower() and second_char.isupper():
text = rest
# Case 2: any letter followed by "http" (e.g., "Ohttps://")
elif rest[:4].lower() == 'http':
text = rest
# Case 3: any letter followed by "i " or "i'" (e.g., "Ci believe")
elif second_char == 'i' and len(text) > 2 and text[2] in " '":
text = rest
return text.strip()
Detecting Reactions and Quotes
iMessage reactions (tapbacks) and quoted messages require special handling.
Reaction format: ReactionWord + space + curly_quote + original_message + curly_quote
- •Example:
Loved "When I first heard this song..." - •Uses Unicode curly quotes:
"(U+201C) and"(U+201D), not straight quotes
Important: Use regex matching, not startswith(). The curly quotes vary and cause issues with string prefix matching.
# Quote characters (ASCII and Unicode curly quotes)
QUOTE_CHARS = '"\'""\u201c\u201d\u2018\u2019'
def is_reaction_message(text):
"""Check if message is a reaction (Loved, Laughed at, etc.).
Uses regex instead of startswith() because reaction messages use
Unicode curly quotes (U+201C, U+201D) which vary and cause matching issues.
"""
if not text:
return False
return bool(re.match(r'^(Reacted|Loved|Laughed|Emphasized|Disliked|Questioned|Liked)\s', text))
def is_quoted_message(text):
"""Check if message ends with a quote (usually quoting someone)."""
if not text:
return False
stripped = text.rstrip()
return stripped[-1] in QUOTE_CHARS if stripped else False
Use these to filter analysis:
# Skip reactions and quotes when analyzing conversation flow
meaningful_messages = [
m for m in messages
if not is_reaction_message(m['text']) and not is_quoted_message(m['text'])
]
Creating a Clean Database
For easier querying, create a new database with decoded messages. Run scripts/create_clean_db.py or use this schema:
CREATE TABLE messages (
id INTEGER PRIMARY KEY,
guid TEXT UNIQUE,
text TEXT,
decoded_text TEXT,
date TEXT,
date_unix INTEGER,
is_from_me INTEGER,
handle_id INTEGER,
has_attachments INTEGER,
service TEXT
);
CREATE TABLE handles (
id INTEGER PRIMARY KEY,
contact_id TEXT,
service TEXT
);
CREATE VIEW messages_readable AS
SELECT
m.id, m.guid, m.date,
CASE WHEN m.is_from_me = 1 THEN 'You' ELSE h.contact_id END as sender,
m.decoded_text as message,
m.has_attachments
FROM messages m
LEFT JOIN handles h ON m.handle_id = h.id
ORDER BY m.date_unix;
Common Queries
Search messages:
SELECT * FROM messages_readable WHERE message LIKE '%keyword%';
Messages by date range:
SELECT * FROM messages_readable WHERE date BETWEEN '2023-01-01' AND '2023-12-31';
Count by sender:
SELECT sender, COUNT(*) FROM messages_readable GROUP BY sender;
Messages with attachments:
SELECT m.*, a.filename, a.mime_type FROM message m JOIN message_attachment_join maj ON m.ROWID = maj.message_id JOIN attachment a ON maj.attachment_id = a.ROWID;
Workflow Summary
- •Find chat ID: Query
chattable to find conversation - •Extract messages: Join
message+chat_message_join+handle - •Decode text: Check
textfirst, then decodeattributedBody - •Clean artifacts: Apply cleaning patterns
- •Convert timestamps: Apply Apple epoch conversion
- •Export: Save to CSV, new SQLite, or other format
Reference Files
- •
references/decoding.md- Full attributedBody decoder with fallback methods - •
references/schema.md- Complete chat.db schema reference - •
scripts/create_clean_db.py- Create queryable clean database