Archive Information Extraction Skill
Assess archival documents (emails, correspondence, receipts) to identify information worth preserving in the knowledge base. This skill provides judgment criteria only - actual storage is handled by Skill(skill="remember").
Purpose
Filter noise from archival documents and identify genuinely significant information that should be preserved. Most archival documents have NO long-term value - be highly selective.
Assessment Criteria
Decision Framework
For each document, use LLM judgment to answer:
- •Is this important enough to mention at a monthly research meeting?
- •Would I want to remember this in 5 years?
- •Is this a concrete outcome, or just noise?
- •Did Nic take significant action, or is this passive/routine?
Trust your judgment - you understand context better than regex patterns.
What to Extract
1. Projects & Publications
- •Papers: Actual submissions, acceptances, revisions, publications (not CFPs)
- •Grants: Applications submitted, outcomes received, progress reports
- •Books/Chapters: Contracts signed, submissions, publications
- •Research Projects: Project initiation, significant milestones, new collaborations
- •Awards: Significant awards or recognitions
- •Milestones: Major career milestones, important advocacy/policy work
- •Media: Media appearances or public engagement
Think: Is this a concrete action or outcome? Or just noise?
2. Professional Activities
- •Events Organized: Conferences, workshops, panels Nic ran/chaired/coordinated
- •Speaking: Confirmed talks, keynotes where Nic actually spoke
- •Service: OSB decisions, editorial board work, advisory roles
- •Reviews: Actual reviews submitted (not review requests)
Think: Did Nic DO something significant? Or just receive an invitation?
3. Applications & Career
- •Grant Applications: Submitted applications (not drafts or ideas)
- •Job Applications: Roles applied for
- •Awards/Honors: Nominations received, awards won
- •Promotion/Tenure: Applications, outcomes
Think: Is this a concrete milestone or application? Or just planning?
4. Important Contacts & Relationships
- •Collaborations: Research collaboration proposals or agreements
- •Partnerships: Grant writing partnerships, joint projects
- •Networks: Connections made at conferences, international research networks
- •Students: PhD supervision milestones, significant correspondence
- •Institutions: New institutional relationships, key ongoing relationships
Think: Is this the start/milestone of an important relationship? Or routine correspondence?
5. Financial Records
- •Receipts, invoices, contracts by project and category
- •Log with: message ID, date, entity, project, description, amount, currency, payment status
- •Essential for grant acquittal and tax purposes
Think: Will we need this for financial reporting?
What to Skip
Most documents are noise. Skip:
- •Mass communications (newsletters, announcements, marketing, digests)
- •Administrative routine (meeting scheduling, calendar invites, reminders)
- •Declined invitations (conferences not attended, webinars skipped)
- •Generic outreach (CFPs, generic collaboration requests, surveys)
- •Automated systems (notifications, confirmations - unless significant)
- •Spam/phishing
- •Personal chitchat without professional substance
- •Room bookings and logistics
- •Normal student supervision (unless exceptional)
- •Regular staff interactions
- •Forum posts and mailing list discussions
- •Course coordination and teaching logistics
- •Standard reference requests
Test: Would I forget this within a week with no consequence?
Uncertain Cases
When uncertain:
- •Lean toward extraction in initial pass
- •Tag output with
#review-classificationfor manual review - •Better to extract and discard later than miss important information
Judgment Examples
EXTRACT: "Your paper 'Platform Governance' has been accepted by Nature" → Clear publication outcome
SKIP: "CFP: Submit to Journal of Platform Studies by Dec 1" → Generic invitation, no submission
EXTRACT: "Congratulations, your FT210100263 grant has been awarded $500K" → Grant outcome with specifics
SKIP: "Reminder: Your FT210100263 annual report is due next month" → Administrative reminder
EXTRACT: "Following our chat at IGF, I'd love to collaborate on disinformation research..." → New substantive collaboration starting
SKIP: "You're invited to join our webinar on content moderation" → Mass invitation, not personal
EXTRACT: Email from examiner with detailed feedback on PhD thesis → Supervision milestone
SKIP: "HDR student seminars happening this week" → Generic announcement
EXTRACT: "OSB Case 2025R final decision: Upheld with modifications..." → Actual OSB work product
SKIP: "OSB weekly meeting reminder" → Routine scheduling
Key Information to Identify
When you decide to extract, identify these elements:
People
- •Full name and affiliations
- •Contact information (if provided)
- •Role/position
- •Relationship to Nic (collaborator, contact, student, etc.)
Organizations
- •Universities and research institutions
- •Research groups or labs
- •Funding bodies
- •Conferences or professional organizations
Events
- •Conference/workshop name
- •Speaking engagements
- •Meetings with significance
- •Research visits
Projects/Collaborations
- •Research topics discussed
- •Grant proposals
- •Joint publications
- •Ongoing or proposed collaborations
Dates & Timing
- •When correspondence occurred
- •When events happened or will happen
- •Project timelines if mentioned
Topics/Tags
- •Research areas (platform-governance, content-moderation, AI-ethics, etc.)
- •Geographic locations
- •Types of activity (collaboration, speaking, reviewing)
Identifiers (CRITICAL)
Always extract canonical identifiers to allow future reference:
- •Email message ID (preferred)
- •Source UUID or unique identifier
- •URL or DOI
- •If no canonical ID exists, create citation: "email from [sender] to [recipient] dated [YYYY-MM-DD] subject: [subject]"
Store identifier inline with extracted information.
Storage
Use Skill(skill="remember") to store extracted information in the knowledge base. The remember skill handles:
- •Entity creation/updates
- •Format validation
- •Deduplication checking
- •Proper categorization
Store information using the memory server with properly formatted markdown. Tag extracted information appropriately for searchability and future retrieval.
Common approaches for email extraction:
- •Email metadata (sender, date, subject) → Use memory tags like
#email,#correspondence - •Project information → Use memory tags like
#project - •Coordinator/team roles → Store with descriptive relationships
- •Timeline/duration → Use memory tags like
#timeline,#deadline - •Status information → Use memory tags like
#status - •Relationships between people → Document relationships clearly with role descriptions
Storage principles:
- •Use clear, structured memory entries
- •Include all relevant metadata and timestamps
- •Tag appropriately for semantic search
- •Avoid inventing custom categories or types
- •Delegate all storage to the remember skill
Quality Standards
Confidence Levels
- •Only extract when information is clear and valuable
- •Mark uncertain extractions with
#review-classification - •Skip ambiguous or unclear information
Atomic Information
- •Each piece of information should be specific and verifiable
- •Include dates, sources, and context
- •Tag appropriately for searchability
Fail Fast
- •If document is malformed, skip it and log error
- •If extraction is ambiguous, skip it or tag for review
- •Continue processing other documents