In this notebook we will work through the train/test split and the process of cross validation.
The following short video describes the motivation behind the train test split and cross validation.
from IPython.display import YouTubeVideo
YouTubeVideo("UJ3WGuJ1zao")
As with other notebooks we will use the same set of standard imports.
import numpy as np
import pandas as pd
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
from sklearn.linear_model import LinearRegression
For this notebook, we will use the seaborn mpg
dataset which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.
from seaborn import load_dataset
data = load_dataset("mpg")
data
The first thing we will want to do with this data is construct a train/test split. Constructing a train/test split before EDA and data cleaning can often be helpful. This allows us to see if our data cleaning and any conclusions we draw from visualizations generalize to new data. This can be done by re-running the data cleaning and EDA process on the test dataset.
YouTubeVideo("_Bzfy7BTjz0")
We can sample the entire dataset to get a permutation and then select a range of rows.
shuffled_data = data.sample(frac=1., random_state=42)
shuffled_data
Selecting a range of rows for training and testing:
split_point = int(shuffled_data.shape[0]*0.90)
split_point
tr = shuffled_data.iloc[:split_point]
te = shuffled_data.iloc[split_point:]
tr
te
Checking that they add up.
len(tr) + len(te) == len(data)
We can use the train_test_split
function from sklearn.model_selection
to do this easily.
from sklearn.model_selection import train_test_split
tr, te = train_test_split(data, test_size=0.1, random_state=83)
tr
te
Out of curiosity, what does the mpg field look like? I am going to look at both the train and test distributions, but in practice we should avoid looking at the test data.
ff.create_distplot([tr['mpg'], te['mpg']], ['train mpg', 'test mpg'])
Let's go through the process of building a model. Let's start by looking at just engine characteristics like "cylinders" and the "displacement". We will first use just our own feature function (as we did in previous lectures). Then we will introduce how to use sklearn Pipelines
to combine feature functions and models. As we will see, by combining the feature function and model, we can simplify subsequent training and testing since we are guaranteed that our feature functions are the same on both the training and test datasets.
YouTubeVideo("2i7yj4JhIkw")
My first feature function will just extract the two features that I want to use in my model.
def phi(df):
    return df[["cylinders", "displacement"]]
Then I fit an sklearn LinearRegression model to my training data.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(phi(tr), tr['mpg'])
To evaluate the error we will use the Root Mean Squared Error (RMSE), which is like the mean squared error but in the correct units (mpg) instead of squared units (mpg^2).
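In symbols, for $n$ predictions $\hat{y}_i$ and true values $y_i$:

$$\text{RMSE}(Y, \hat{Y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$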
def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat)**2))
The training error is:
Y_hat = model.predict(phi(tr))
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
The test error is:
Y_hat = model.predict(phi(te))
Y = te['mpg']
print("Test Error (RMSE):", rmse(Y, Y_hat))
Oh no! We just used the test data to evaluate our model! We shouldn't have done that.
(Don't worry, we are trained professionals and this is only for demonstration purposes. But seriously, don't try this at home.)
Notice: The test error is slightly higher than the training error. This is typically (but not always) the case. Sometimes we get lucky and the test data is "easier to predict" or happens to closely follow the training data.
Notice that we need to apply the same transformations to the test data when we go to evaluate our model. Keeping track of phi can get a bit tedious. We will use the sklearn Pipelines package to combine feature engineering stages and the model into a single "model" that applies the same transformations when you call fit
and predict
. You may want to skim the documentation on pipelines.
YouTubeVideo("9vz0n1cyMbc")
A Pipeline is composed of stages (in a list) where each stage has a name which can be used to access that stage (more on this later) and an operation that must either be a transformation or a model.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])
The first stage of this pipeline just selects the columns:
model['SelectColumns']
The ColumnTransformer is designed to capture basic feature engineering operations on pandas DataFrames. It takes a list of column transformation triples. Each transformation gets a name (e.g., "keep"), an operation, which could be an estimator like OneHotEncoder or one of the strings "drop" or "passthrough", and finally the list of column names on which to apply the transformation.
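For illustration only (this is not part of the pipeline we build in this notebook), here is a small hypothetical ColumnTransformer that one-hot encodes the categorical origin column, passes displacement through unchanged, and drops every other column (the default behavior):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example: each triple is (name, operation, columns).
example_transformer = ColumnTransformer([
    ("one_hot", OneHotEncoder(), ["origin"]),     # encode a categorical column
    ("keep", "passthrough", ["displacement"]),    # keep a column as-is
])                                                # all other columns are dropped

# Fit on the training data and inspect the resulting feature matrix shape.
example_transformer.fit_transform(tr).shape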
We can now fit the entire pipeline:
model.fit(tr, tr['mpg'])
Then we can make predictions on the entire DataFrame and compute the error of our model on the training data:
Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
Again, we shouldn't do this but I want to illustrate how this model can be easily applied to the test data without having to worry about applying the feature function first.
Y_hat = model.predict(te)
Y = te['mpg']
print("Test Error (RMSE):", rmse(Y, Y_hat))
In this notebook (and in life) we will want to keep track of all our models. Here I will store the models in a dictionary with a (not great) name so I can remember which model is which and can easily compare my models in a plot.
models = {"c+d": model}
We might also want to look at the displacement per cylinder. This is an additional feature transformation that we can add to the first stage of our pipeline. To define this transformation we first need to create a function transformer:
from sklearn.preprocessing import FunctionTransformer
def compute_volume(X):
    # X arrives as a numpy array whose columns are [cylinders, displacement],
    # so this computes displacement per cylinder and keeps it as a 2D column.
    return np.expand_dims(X[:, 1] / X[:, 0], axis=1)

volume_transformer = FunctionTransformer(compute_volume, validate=True)
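To sanity-check the transformer, we can apply it to a tiny hypothetical array whose columns are [cylinders, displacement]:

# Two made-up rows: a 4-cylinder, 120 cu in engine and an 8-cylinder, 400 cu in engine.
volume_transformer.fit_transform(np.array([[4.0, 120.0], [8.0, 400.0]]))
# expected: array([[30.], [50.]]) -- displacement per cylinder, as a 2D column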
We can then add this as an additional column transformation:
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", ["cylinders", "displacement"]),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])
model.fit(tr, tr['mpg'])
Again, we evaluate the model on our training dataset and see a reduction in error:
Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
models["c+d+d/c"] = model
We can now add additional features about the car.
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])
I have put the following code in a try/except statement because I know it will raise an error. What do you think will go wrong?
try:
    model.fit(tr, tr['mpg'])
except ValueError as err:
    print(err)
There appear to be NaN (missing) values in the data (take a look at the horsepower column). We need to deal with these missing values. In previous lectures I mentioned imputation; a standard imputation technique is to replace the missing value with the mean for that column. Scikit-learn has a built-in imputation transformer that we can add to our pipeline after we select the desired columns. The imputation will actually be applied to all the columns. If we wanted to apply it to a specific column, then we would need to put it inside the ColumnTransformer.
Notice: The imputation function actually needs to be fit to data so it is also part of the model.
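As an aside, a minimal sketch of column-specific imputation (not the pipeline we build below) could place the SimpleImputer inside the ColumnTransformer, imputing only horsepower and passing the other quantitative features through:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical sketch: impute only "horsepower" with its mean,
# and pass the remaining quantitative features through unchanged.
selective_imputer = ColumnTransformer([
    ("impute_hp", SimpleImputer(strategy="mean"), ["horsepower"]),
    ("keep", "passthrough",
     ["cylinders", "displacement", "weight", "acceleration"]),
])
selective_imputer.fit_transform(tr)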
from sklearn.impute import SimpleImputer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])
We can now train our model.
model.fit(tr, tr['mpg'])
Saving the model for later comparison:
models['c+d+d/c+h+w+a'] = model
Evaluating the training error:
Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
We reduced the training error but what about the test error? We really shouldn't look at the test error so instead we will use cross validation to compare the accuracy:
YouTubeVideo("m8580Et4pjY")
Here we define a five-fold cross validation with:
five_fold = KFold(n_splits=5)
Then we loop over the 5 splits and get the indices (tr_ind) in the training data to use for training and the indices (va_ind) in the training data to use for validation:
for tr_ind, va_ind in five_fold.split(tr):
from sklearn.model_selection import KFold
from sklearn.base import clone
def cross_validate_rmse(model):
    # Work on a fresh clone so cross validation does not disturb the fitted model.
    model = clone(model)
    five_fold = KFold(n_splits=5)
    rmse_values = []
    for tr_ind, va_ind in five_fold.split(tr):
        # Fit on the training folds and score on the held-out validation fold.
        model.fit(tr.iloc[tr_ind, :], tr['mpg'].iloc[tr_ind])
        rmse_values.append(rmse(tr['mpg'].iloc[va_ind],
                                model.predict(tr.iloc[va_ind, :])))
    return np.mean(rmse_values)
Validating the model:
cross_validate_rmse(model)
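For comparison, scikit-learn's built-in cross_val_score computes essentially the same quantity; a minimal sketch, assuming a scikit-learn version recent enough to provide the "neg_root_mean_squared_error" scorer:

from sklearn.model_selection import cross_val_score

# Scorers follow a "higher is better" convention, so the RMSE comes back negated;
# with the default unshuffled 5-fold splitter this mirrors our function above.
scores = cross_val_score(model, tr, tr['mpg'], cv=5,
                         scoring="neg_root_mean_squared_error")
-scores.mean()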
The following helper function generates a plot comparing all the models in the models dictionary.
def compare_models(models):
    # Compute the training error for each model
    training_rmse = [rmse(tr['mpg'], model.predict(tr)) for model in models.values()]
    # Compute the cross validation error for each model
    validation_rmse = [cross_validate_rmse(model) for model in models.values()]
    # Compute the test error for each model (don't do this!)
    test_rmse = [rmse(te['mpg'], model.predict(te)) for model in models.values()]
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x=names, y=training_rmse, name="Training RMSE"),
        go.Bar(x=names, y=validation_rmse, name="CV RMSE"),
        go.Bar(x=names, y=test_rmse, name="Test RMSE", opacity=.3)])
    return fig
fig = compare_models(models)
fig.update_yaxes(range=[2,5.1], title="RMSE")
Notice I made the Test RMSE nearly invisible because you shouldn't look at it until we are done. But again, for demonstration purposes I plotted it so we can see how it compares to the training and cross validation errors.
Can you improve the model further? Let's try adding the model year:
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])
model.fit(tr, tr['mpg'])
models['c+d+d/c+h+w+a+y'] = model
Comparing the models
fig = compare_models(models)
fig.update_yaxes(range=[2,5.1], title="RMSE")
The model year improved accuracy quite a bit! This improvement also appears to generalize as it also reduced the cross validation error.
Can we use the car's name to predict MPG? The name contains general features like the manufacturer that might help, but it also contains the specific vehicle model, which is probably too specific, and not all the vehicle names in the test data will appear in the training data.
Let's try applying the CountVectorizer (which implements bag-of-words features). At this point we are also likely to have too many dimensions in our model, and we are not applying any regularization technique to compensate (because we haven't covered regularization in lecture yet).
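To see what the bag-of-words features look like, here is a minimal sketch on two made-up name strings (the actual pipeline below applies the same idea to the name column):

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example strings illustrating bag-of-words:
# each distinct word becomes a column of counts.
vec = CountVectorizer()
counts = vec.fit_transform(["ford pinto", "ford mustang ii"])
print(vec.vocabulary_)    # word -> column index mapping learned from the strings
print(counts.toarray())   # one row of word counts per string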
from sklearn.feature_extraction.text import CountVectorizer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])
Notice: We are using an additional column transformation.
model.fit(tr, tr['mpg'])
models['c+d+d/c+h+w+a+y+n'] = model
fig = compare_models(models)
fig.update_yaxes(range=[0,5.1], title="RMSE")
Overfitting! We substantially reduced the training error but actually made the generalization error worse! Let's go back to the previous model, retrain it on all of the data, and check its error on the test set:
best_model = clone(models['c+d+d/c+h+w+a+y'])
best_model.fit(data, data['mpg'])
rmse(best_model.predict(te), te['mpg'])