DiaObject Anomaly Detection in DP0.2¶

Contact author(s): Ryan Lau
Last verified to run: 2025-03-26
LSST Science Pipelines version: Weekly 2025_09
Container Size: medium
Targeted learning level: Intermediate
Description: Apply anomaly detection techniques to DIA Objects from the DP0.2 catalogs and inspect the anomalies.
Skills: Use Isolation Forest algorithm on the DP0.2 DiaObject Table. Plot results and inspect anomalies. Display calexp, difference template, and difference images using Butler.
LSST Data Products: TAP tables dp02_dc2_catalogs.DiaObject, DiaSource, TruthSummary
Packages: lsst, scikit-learn, pandas, matplotlib, numpy, astropy, PIL
Credit: Inspired by a notebook and discussion with Konstantin Malanchev. We thank Adam Miller and Brian Nord for feedback on the notebook.
Get Support: Find DP0-related documentation and resources at dp0-2.lsst.io. Questions are welcome as new topics in the Support - Data Preview 0 Category of the Rubin Community Forum. Rubin staff will respond to all questions posted there. For more information, please consider reaching out to the Anomaloy Detection interest group of the LSST Informatics & Statistics Science Collaboration.
1. Introduction¶
This notebook introduces an anomaly detection technique on difference-image analysis (DIA) objects from DP0.2. It demonstrates how to perform the following:
- apply the IsolationForest routine from
scikit-learn
; - identify anomalous DiaObjects from the DP0.2 catalogs;
- how to inspect the properties of the anomalous targets.
For further introduction and motivation on investigating anomalies, please view the following link: The Unimagined
1.1. Package imports¶
Import general python packages, the IsolationForest algorithm from scikit-learn
, and the Rubin TAP service utilities.
The matplotlib
, numpy
, and scikit-learn
libraries are widely used Python libraries for plotting, scientific computing, and conducting Machine-Learning data analysis. We will use these packages below, including the lsst.rsp
package to access the TAP service and query the DP0 catalogs.
We also use the lsst.rsp
package to access the TAP service and query the DP0 catalogs.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from lsst.rsp import get_tap_service
In Sec. 2.4, we display image cutouts of an anomalous DiaObjects by the utilizing Butler
to load the data. The following python packages are imported to display the image cutouts.
from lsst.daf.butler import Butler
import lsst.geom as geom
from astropy.visualization import LinearStretch, ImageNormalize
import warnings
warnings.simplefilter("ignore", category=UserWarning)
from astropy.stats import sigma_clipped_stats
Defining plotting parameters for nice displays.
plt.style.use('tableau-colorblind10')
plot_filter_labels = ['u', 'g', 'r', 'i', 'z', 'y']
plot_filter_colors = {'u': '#56b4e9', 'g': '#008060', 'r': '#ff4000',
'i': '#850000', 'z': '#6600cc', 'y': '#000000'}
plot_filter_symbols = {'u': 'o', 'g': '^', 'r': 'v',
'i': 's', 'z': '*', 'y': 'p'}
2. Search for Anomalous DiaObjects using IsolationForest¶
Here, we will apply the Isolation Forest Algorithm from scikit-learn
on the DP0.2 DiaObject catalog to identify anomalous sources.
In particular, we will look for DiaObjects that are anomalously variable, and then inspect their light curves. We will also obtain the display the science, reference, and difference image cutouts of the source with the highest anomaly score.
2.1. Create the Rubin TAP Service client and Obtain DiaObject Sample¶
First, get an instance of the TAP service, and assert that it exists.
service = get_tap_service("tap")
assert service is not None
To reduce spurious DiaObjects, we set a minimum number of DiaSource detections.
ndiasources_min = 10
For simplicity, we focus the search for anomalous behavior from DiaObjects in the r-band.
We set an additional threshold for number of detections in the r-band filter.
ndata_thresh = 8
We now retrieve a random sample of 1000000 DiaObjects and the following statistics of their r-band light curve:
rPSFluxNdata: The number of data points used to compute rPSFluxChi2.
rPSFluxSigma: The standard deviation of the fluxes.
rPSFluxLinearSlope: Slope of a linear model fit to diaSource PSF flux vs time.
rPSFluxMean: The average flux.
Note that the statistics are all based on the difference-image point source (PS) flux values.
An r-band total mean flux (rTOTFluxMean
) threshold of $<1\times10^5$ nJy is set to filter out saturated sources. Limits are also placed on the maximum r-band difference-image flux (rPSFluxMax
$<1\times10^5$ nJy) and the minimum r-band difference-image flux (rPSFluxMin
$>-1\times10^5$ nJy) to avoid artifacts from saturated sources.
The following cell should take less than 10 seconds to run.
results = service.search("SELECT TOP 1000000 "
"ra, decl, diaObjectId, nDiaSources, rPSFluxNdata, "
"rPSFluxSigma, rPSFluxLinearSlope, rPSFluxMean "
"FROM dp02_dc2_catalogs.DiaObject "
"WHERE nDiaSources > "+str(ndiasources_min)+" "
"AND rTOTFluxMean < 1e5 " + " "
"AND rPSFluxMax < 1e5 " + " "
"AND rPSFluxMin > -1e5 " + " "
"AND rPSFluxNdata > "+str(ndata_thresh)+" ")
DiaObjs = results.to_table()
df = DiaObjs.to_pandas()
del results
2.2. Run IsolationForest Algorithm¶
We run the Isolation Forest Algorithm on the DiaObject sample and look for outliers based on the rPSFluxMean
, rPSFluxLinearSlope
, and rPSFluxSigma
values. The algorithm outputs anomaly scores based on the input sample where negative scores represent outliers.
The indicies of the top 20 outliers are saved in the array idx
.
sample = df[[
'rPSFluxMean',
'rPSFluxLinearSlope',
'rPSFluxSigma',
]].values
rng = np.random.RandomState(42)
ifo = IsolationForest(max_samples=1000,
random_state=rng, n_jobs=1)
ifo.fit(sample)
idx = np.argsort(ifo.score_samples(sample))[:20]
Uncomment the following to show the scores of the top 20 outliers
# np.sort(ifo.score_samples(x))[:20]
We also show the histogram of the Isolation Forest scores from our input sample.
plt.title('Histogram of Isolation Forest Scores')
plt.hist(ifo.score_samples(sample), bins=50)
plt.yscale('log')
plt.show()
Figure 1: Histogram of Isolation Forest scores from the input sample. Scores with the most negative values indicate outliers.
Uncomment to Display the DIA Objects IDs of the anomolies
# [df['diaObjectId'][i] for i in idx]
2.3. Vizualising the Identified Anomalies¶
Plot the distribution of DiaObjects with the properties used to identify outliers: rPSFluxMean
, rPSFluxLinearSlope
, and rPSFluxSigma
.
fig, ax = plt.subplots(1, 2, figsize=(12, 5), sharey=False, sharex=False)
params = [
('rPSFluxMean', 'rPSFluxLinearSlope'),
('rPSFluxMean', 'rPSFluxSigma'),
]
for j, (px, py) in enumerate(params):
ax[j].plot(df[px], df[py], 'o', ms=1, color='grey', alpha=0.1)
ax[j].set_xlabel(px)
ax[j].set_ylabel(py)
for i in idx:
ax[j].plot(df[px][i], df[py][i], '*', ms=12, color='red', mec='black')
plt.tight_layout()
plt.show()
Figure 2: Distribution of DiaObjects showing (Left)
rPSFluxLinearSlope
vsrPSFluxMean
and (Right)rPSFluxSigma
vsrPSFluxMean
. The anomalies identified from the IsolationForest algorithm (red star) indeed appear to be outliers in the distribution of other DiaObjects (grey points)
Next, we plot the r-band light curves of the 20 DiaObject outliers.
fig, ax = plt.subplots(5, 4, figsize=(10, 10), sharey=False, sharex=False)
n = 0
filters = ['r']
for i in range(5):
for j in range(4):
results = service.search("SELECT ra, decl, diaObjectId, diaSourceId, "
"filterName, midPointTai, psFlux "
"FROM dp02_dc2_catalogs.DiaSource "
"WHERE diaObjectId = "+str(df['diaObjectId'][idx[n]]))
results = results.to_table()
for f, filt in enumerate(filters):
fx = np.where(results['filterName'] == filt)[0]
ax[i, j].plot(results['midPointTai'][fx], results['psFlux'][fx],
plot_filter_symbols[filt], ms=10, mew=0, alpha=0.5,
color=plot_filter_colors[filt])
ax[i, j].set_title(df['diaObjectId'][idx[n]])
if i == 4:
ax[i, j].xaxis.set_label_text('MJD (days)')
if j == 0:
ax[i, j].yaxis.set_label_text('psFlux(nJy)')
n += 1
del results
plt.tight_layout()
plt.show()
Figure 3: r-band light curves of the 20 DiaObject outliers. Note that the outlier light curves indeed show significant changes in r-band flux.
2.4. Display the Cutout Images of Anomalous Sources¶
We now utilize the butler to investigate the cutout images of the anomalous sources.
butler = Butler('dp02', collections='2.2i/runs/DP0.2')
Defining function to obtain cutouts
def cutout_im(butler, ra, dec, datasettype, visit, detector, cutoutsidelength=51, **kwargs):
"""
Produce a cutout from a calexp at the given ra, dec position.
Adapted from cutout_coadd which was adapted from a DC2 tutorial
notebook by Michael Wood-Vasey.
"""
dataid = {'visit': visit, 'detector': detector}
radec = geom.SpherePoint(ra, dec, geom.degrees)
cutoutsize = geom.ExtentI(cutoutsidelength, cutoutsidelength)
wcs = butler.get('%s.wcs' % datasettype, **dataid)
xy = geom.PointI(wcs.skyToPixel(radec))
bbox = geom.BoxI(xy - cutoutsize // 2, cutoutsize)
parameters = {'bbox': bbox}
cutout_image = butler.get(datasettype, parameters=parameters, **dataid)
return cutout_image
diff = 'goodSeeingDiff_differenceExp'
difftemp = 'goodSeeingDiff_templateExp'
calexp = 'calexp'
Next, we obtain the diaSources and their properties for a given diaOjbect in our outlier list defined by the index anom_ind
. We obtain the diaSource properties for the diaObject with the highest anomaly score (i.e., anom_ind
= 0).
anom_ind = 0
DiaObjID = df['diaObjectId'][idx[anom_ind]]
results = service.search("SELECT ra, decl, diaObjectId, diaSourceId, psFlux, "
"psFluxErr, filterName, midPointTai, SNR, ccdVisitId, "
"apFlux_flag, centroid_flag, forced_PsfFlux_flag, "
"pixelFlags, isDipole, "
"scisql_nanojanskyToAbMag(psFlux) AS psAbMag "
"FROM dp02_dc2_catalogs.DiaSource "
"WHERE diaObjectId = "+str(DiaObjID))
diasrc = results.to_table()
diasrc.sort('midPointTai')
del results
We will obtain two sets of image cutouts where the measured r-band flux (psFlux
) is brightest and faintest. To do this, we execute the following cell to identify the indicies in the table of diaSource properties where the r-band flux is brightest and faintest.
ind_mm = {}
filt_cond = diasrc['filterName'] == 'r'
ind_mm['ind_max'] = np.where(diasrc[filt_cond]['psFlux'] ==
np.max(diasrc[filt_cond]['psFlux']))[0][0]
ind_mm['ind_min'] = np.where(diasrc[filt_cond]['psFlux'] ==
np.min(diasrc[filt_cond]['psFlux']))[0][0]
The following light curve of the DiaObject is overlaid with dashed vertical lines that indicate where the r-band flux is brightest and faintest.
plt.figure(figsize=(6, 4))
filters = ['r']
mjds = diasrc[anom_ind]['midPointTai']
results = service.search("SELECT ra, decl, diaObjectId, diaSourceId, "
"filterName, midPointTai, psFlux "
"FROM dp02_dc2_catalogs.DiaSource "
"WHERE diaObjectId = "+str(df['diaObjectId'][idx[anom_ind]]))
results = results.to_table()
for f, filt in enumerate(filters):
fx = np.where(results['filterName'] == filt)[0]
plt.plot(results['midPointTai'][fx], results['psFlux'][fx],
plot_filter_symbols[filt], ms=10, mew=0, alpha=0.5, color=plot_filter_colors[filt])
del(results)
for im_ind in ind_mm:
plt.axvline(x=diasrc[filt_cond][ind_mm[im_ind]]['midPointTai'], ls='--', color='black')
plt.title(df['diaObjectId'][idx[anom_ind]])
plt.xlabel('MJD (days)')
plt.ylabel('psFlux (nJy)')
plt.tight_layout()
plt.show()
Figure 4: r-band light curve of the DiaObject outlier under inspection. Dashed lines indicate the date of maximum and minimum flux.
We then obtain and display two sets of image cutout triplets that includes the calexp, reference template, and the difference images.
The first set shows cutouts from the r-band flux maximum and the second set shows the cutouts from the r-band flux minimum. DiaSource info from each cutout set is provided above the images.
The following cell takes around 40 seconds to run.
for im_ind in ind_mm:
ind = np.where(diasrc['filterName'] == 'r')[0][ind_mm[im_ind]]
cutoutsize = 101
ra = diasrc[ind]['ra']
dec = diasrc[ind]['decl']
ccdvisitid = diasrc[ind]['ccdVisitId']
mag = diasrc[ind]['psAbMag']
flux = diasrc[ind]['psFlux']
print('Displaying calexp, difference template, and difference image for diaSource',
diasrc[ind]['diaSourceId'], 'from DiaObjectID', DiaObjID)
visit = str(ccdvisitid)[:-3]
detector = str(ccdvisitid)[-3:]
visit = int(visit)
detector = int(detector)
mjd = diasrc[ind]['midPointTai']
filt = diasrc[ind]['filterName']
print('Visit =', visit, ', Detector = ', detector)
cutout_diff = cutout_im(butler, ra, dec, diff, visit, detector, cutoutsidelength=cutoutsize)
cutout_ref = cutout_im(butler, ra, dec, difftemp, visit, detector, cutoutsidelength=cutoutsize)
cutout_calexp = cutout_im(butler, ra, dec, calexp, visit, detector, cutoutsidelength=cutoutsize)
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(16, 6))
triplet = [cutout_calexp, cutout_ref, cutout_diff]
titles = ['calexp', 'reference template', 'difference image']
for i, ax in enumerate(axs.flatten()):
plt.sca(ax)
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_xticks([])
ax.set_yticks([])
plt.title('%s' % titles[i], fontsize=18)
im_arr = triplet[i].image.array
_, im_arr_med, im_arr_std = sigma_clipped_stats(im_arr)
minstd = -2
maxstd = 8
circle = plt.Circle((cutoutsize/2, cutoutsize/2), 0.5, color='r')
ax.add_patch(circle)
norm = ImageNormalize(im_arr-im_arr_med, vmin=minstd * im_arr_std,
vmax=maxstd * im_arr_std, stretch=LinearStretch())
plt.imshow(im_arr, origin='lower', norm=norm, cmap='gray')
plt.colorbar(fraction=0.046, pad=0.04)
plt.suptitle('MJD %.2f, Band = %s, psFlux =%.2f nJy, Cutout Size = %.2f arcsec'
'\n Flags: apFlux_flag = %s, centroid_flag = %s,'
'\n forced_PsfFlux_flag = %s, pixelFlags = %s, isDipole = %s '
% (mjd, filt, flux, 0.199918 * cutoutsize, diasrc[ind]['apFlux_flag'],
diasrc[ind]['centroid_flag'], diasrc[ind]['forced_PsfFlux_flag'],
diasrc[ind]['pixelFlags'], diasrc[ind]['isDipole']),
y=1.01, fontsize=22)
plt.show()
Displaying calexp, difference template, and difference image for diaSource 490552385341489231 from DiaObjectID 1737386701560479833 Visit = 913725 , Detector = 21
Displaying calexp, difference template, and difference image for diaSource 506429844884553804 from DiaObjectID 1737386701560479833 Visit = 943299 , Detector = 94
Figure 5: Cutout of the calexp, reference template, and difference image from the visits where the r-band flux was at (Top) maximum and (Bottom) minimum.
The set of cutout images show that the transient indeed appears to be "real" and that the associated DiaObject does not arise from an artifact.
Sec. 2.5 Identifying the Anomalous DiaObject in the TruthSummary Table¶
Using the TruthSummary
table we can investigate whether or not the anomalous DiaObject from the previous sub-section is an artifact or a transient.
We perform a coordinate search of the anomalous DiaObject in the TruthSummary table to identify the source, and we then present the resulting properties of the corresponding source from the TruthSummary table.
ra = df['ra'][idx[anom_ind]]
dec = df['decl'][idx[anom_ind]]
results = service.search("SELECT ts.ra, ts.dec, ts.host_galaxy, "
"ts.is_pointsource, ts.is_variable, ts.truth_type "
"FROM dp02_dc2_catalogs.TruthSummary AS ts "
"WHERE CONTAINS(POINT('ICRS', ts.ra, ts.dec), "
"CIRCLE('ICRS'," + str(ra) + ", "
+ str(dec) + ", 0.00014)) = 1 ", maxrec=100000)
SrcTruth = results.to_table()
SrcTruth
ra | dec | host_galaxy | is_pointsource | is_variable | truth_type |
---|---|---|---|---|---|
deg | deg | ||||
float64 | float64 | int64 | int32 | int32 | int64 |
63.6675783 | -34.3384596 | 60 | 1 | 1 | 3 |
The truth_type
= 3, which indicates the source is indeed a supernova (See Table B.2. from the
DESC DC2 Release Note).
3. Exercises for the learner¶
Incorporate different DiaObject properties and filter bands in the Isolation Forest fit (Sec. 2.1) and investigate how that affects the resulting anomalies.
Try testing other astro-specific implementations of the Isolation Forest fitting (e.g., https://coniferest.readthedocs.io/en/latest/index.html).