Weka Data Processing
Objectives
Process Weka ARFF files for use in Python machine learning projects, particularly converting to CSV format with proper column headers.
Instructions
1. Understanding ARFF Format
ARFF (Attribute-Relation File Format) is Weka's native data format:
code
@relation diabetes
@attribute preg numeric
@attribute plas numeric
@attribute pres numeric
@attribute skin numeric
@attribute insu numeric
@attribute mass numeric
@attribute pedi numeric
@attribute age numeric
@attribute class {tested_negative,tested_positive}
@data
6,148,72,35,0,33.6,0.627,50,tested_positive
1,85,66,29,0,26.6,0.351,31,tested_negative
...
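The header can also be inspected programmatically. The following is a minimal sketch, assuming scipy is installed; the abbreviated three-attribute sample and the names `SAMPLE_ARFF`, `column_names`, and `column_types` are illustrative and not part of the real dataset file.

```python
import io
from scipy.io import arff

# Tiny in-memory ARFF sample so the sketch runs without a file on disk.
SAMPLE_ARFF = """@relation diabetes
@attribute preg numeric
@attribute age numeric
@attribute class {tested_negative,tested_positive}
@data
6,50,tested_positive
1,31,tested_negative
"""

# loadarff accepts a file path or an open file-like object.
data, meta = arff.loadarff(io.StringIO(SAMPLE_ARFF))

column_names = meta.names()   # attribute names, in declaration order
column_types = meta.types()   # 'numeric' or 'nominal' for each attribute

for name, kind in zip(column_names, column_types):
    print(f'{name}: {kind}')
```

The names come from the @attribute lines and the records from the @data section, so the two parts of the file map directly onto the column headers and rows of the eventual CSV.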
2. Converting ARFF to CSV
Method 1: Using Python (Recommended)
python
import pandas as pd
from scipy.io import arff
# Load ARFF file
data, meta = arff.loadarff('diabetes.arff')
df = pd.DataFrame(data)
# Decode byte strings to regular strings (nominal columns load as bytes)
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].str.decode('utf-8')
# Save to CSV
df.to_csv('diabetes.csv', index=False)
Method 2: Manual Extraction
Read ARFF file and extract:
- •Column names from @attribute lines
- •Data from @data section
- •Write to CSV with headers
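The manual steps above can be sketched with the standard library alone. The function name `arff_to_csv` is hypothetical, and the sketch makes simplifying assumptions: dense (not sparse) data, attribute names without spaces or quotes, and no quoted values that contain commas.

```python
import csv
import io


def arff_to_csv(arff_lines, csv_file):
    """Write ARFF attribute names and data rows to csv_file.

    Returns the column names and the number of data rows written.
    """
    names = []
    rows = []
    in_data = False
    for raw in arff_lines:
        line = raw.strip()
        if not line or line.startswith('%'):   # skip blanks and ARFF comments
            continue
        if in_data:
            rows.append(line.split(','))
        elif line.lower().startswith('@attribute'):
            names.append(line.split()[1])       # second token is the name
        elif line.lower().startswith('@data'):
            in_data = True
    writer = csv.writer(csv_file)
    writer.writerow(names)                      # header row first
    writer.writerows(rows)
    return names, len(rows)


# Usage with an in-memory sample (illustrative, abbreviated attributes).
sample = """@relation diabetes
@attribute preg numeric
@attribute age numeric
@attribute class {tested_negative,tested_positive}
@data
6,50,tested_positive
1,31,tested_negative
"""
out = io.StringIO()
names, n_rows = arff_to_csv(sample.splitlines(), out)
print(out.getvalue())
```

For real files, open the output with `newline=''` so the csv module controls line endings. Files that use quoting or the sparse format are better handled by Method 1.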
3. Common Weka Datasets
Diabetes Dataset:
- •Location: weka/data/diabetes.arff
- •Features: 8 numeric attributes
- •Target: binary class (tested_negative, tested_positive)
- •Instances: 768
Column Names:
- •preg (pregnancies)
- •plas (plasma glucose)
- •pres (blood pressure)
- •skin (skin thickness)
- •insu (insulin)
- •mass (BMI)
- •pedi (diabetes pedigree function)
- •age
- •class (target)
4. Handling Class Labels
Weka stores class labels as strings, while many Python ML libraries expect numeric targets:
python
# Convert class labels to numeric
df['class'] = df['class'].map({
'tested_negative': 0,
'tested_positive': 1
})
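One caveat with this mapping: `Series.map` returns NaN for any value that is not a key in the dictionary, which is also what happens if the labels are still byte strings. A hedged sketch of a guard against that follows; the stand-in DataFrame and the names `label_map`, `encoded`, and `unmapped` are illustrative.

```python
import pandas as pd

# Small stand-in frame; in practice df comes from the loaded ARFF data.
df = pd.DataFrame(
    {'class': ['tested_negative', 'tested_positive', 'tested_positive']}
)

label_map = {'tested_negative': 0, 'tested_positive': 1}
encoded = df['class'].map(label_map)

# Series.map yields NaN for any value missing from the mapping,
# so collect the offending labels before overwriting the column.
unmapped = df.loc[encoded.isna(), 'class'].unique()
if len(unmapped) > 0:
    raise ValueError(f'Unmapped class labels: {list(unmapped)}')

df['class'] = encoded.astype(int)
```

Failing loudly here keeps a silent all-NaN target column from reaching model training.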
5. Required Packages
bash
uv add scipy pandas
Validation
After conversion, verify:
- • CSV file has correct number of columns
- • Column names match ARFF attributes
- • Number of rows matches ARFF data section
- • Class labels properly converted
- • No missing values introduced
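Several of these checks can be automated. The following is a rough sketch rather than a complete validator: `validate_csv` is a hypothetical helper, the expected column names and row count are supplied by the caller, and treating `?` as missing assumes the ARFF missing-value marker survived conversion as text.

```python
import csv
import io


def validate_csv(csv_file, expected_columns, expected_rows):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    reader = csv.reader(csv_file)
    header = next(reader, [])
    if header != list(expected_columns):
        problems.append(f'header mismatch: {header}')
    n_rows = 0
    for row in reader:
        n_rows += 1
        if len(row) != len(header):
            problems.append(
                f'row {n_rows}: expected {len(header)} fields, got {len(row)}'
            )
        elif any(field.strip() in ('', '?') for field in row):
            problems.append(f'row {n_rows}: empty or "?" field')
    if n_rows != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {n_rows}')
    return problems


# Usage with an in-memory CSV (illustrative values).
converted = io.StringIO('preg,age,class\n6,50,1\n1,31,0\n')
print(validate_csv(converted, ['preg', 'age', 'class'], expected_rows=2))
```

For the diabetes dataset described above, the expected values would be the nine column names and 768 rows.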
Common Issues
Issue: Byte strings in DataFrame
- •Solution: Use .str.decode('utf-8') for object columns
Issue: Missing column names
- •Solution: Extract from @attribute lines in ARFF
Issue: Class labels as bytes
- •Solution: Decode and optionally map to numeric
Example Workflow
- •Locate Weka data file (usually in weka/data/)
- •Load ARFF using scipy.io.arff
- •Convert to pandas DataFrame
- •Decode string columns
- •Save to CSV with headers
- •Verify column names and data integrity
Anti-Patterns
- •❌ Using CSV without column headers
- •❌ Not decoding byte strings
- •❌ Losing class label information
- •❌ Not verifying data integrity after conversion